Page MenuHomeFreeBSD

Add VV_CROSSLOCK vnode flag to avoid cross-mount lookup LOR
ClosedPublic

Authored by jah on Apr 25 2022, 3:49 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Nov 16, 12:45 PM
Unknown Object (File)
Sat, Nov 16, 10:45 AM
Unknown Object (File)
Thu, Nov 14, 6:44 PM
Unknown Object (File)
Thu, Nov 14, 3:02 PM
Unknown Object (File)
Thu, Nov 14, 1:54 PM
Unknown Object (File)
Mon, Nov 4, 3:03 AM
Unknown Object (File)
Mon, Oct 28, 6:08 AM
Unknown Object (File)
Fri, Oct 25, 3:54 PM
Subscribers

Details

Summary

When a lookup operation crosses into a new mountpoint, the mountpoint
must first be busied before the root vnode can be locked. When a
filesystem is unmounted, the vnode covered by the mountpoint must
first be locked, and then the busy count for the mountpoint drained.
Ordinarily, these two operations work fine if executed concurrently,
but with a stacked filesystem the root vnode may in fact use the
same lock as the covered vnode. By design, this will always be
the case for unionfs (with either the upper or lower root vnode
depending on mount options), and can also be the case for nullfs
if the target and mount point are the same (which admittedly is
very unlikely in practice).

In this case, we have LOR. The lookup path holds the mountpoint
busy while waiting on what is effectively the covered vnode lock,
while a concurrent unmount holds the covered vnode lock and waits
for the mountpoint's busy count to drain.

Attempt to resolve this LOR by allowing the stacked filesystem
to specify a new flag, VV_CROSSLOCK, on a covered vnode as necessary.
Upon observing this flag, the vfs_lookup() will leave the covered
vnode lock held while crossing into the mountpoint. Employ this flag
for unionfs with the caveat that it can't be used for '-o below' mounts
until other unionfs locking issues are resolved.

Reported by: pho
Tested by: pho

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jah requested review of this revision.Apr 25 2022, 3:49 AM

I don't really like this change, it feels like a hack to me. At the same time, I don't have another idea that seems clearly better, so I thought I would post this to at least start a discussion.

I did consider some other options:

--Changing unionfs_root() to use a try-lock/re-lock approach similar to vn_lock_pair() but involving the busy count and the root vnode instead of two vnodes. I'm not convinced this would actually work, and in any case the idea seemed very ugly and likely to be detrimental to performance given how often VFS_ROOT() is used.

--Changing the order of dounmount() so that it drains the busy count before locking the covered vnode. I actually started implementing this one; it felt less hacky than VV_CROSSLOCK but also much more dangerous given the delicate ordering of dounmount(). I also very quickly realized that for this approach to have any chance of working, we would need to change the vfs_busy() call in vfs_lookup() to pass MBF_NOWAIT, otherwise we'll simply get a slightly different kind of LOR deadlock. I'm not sure MBF_NOWAIT would be acceptable there either.

sys/kern/vfs_lookup.c
118

Should we perhaps make this a KASSERT?

1249

Would you please update the comment above vfs_busy()'s definition explaining this new special case?

sys/kern/vfs_lookup.c
118

sounds good to me.

1249

i'll definitely do that before committing, if in fact everyone is ok with this approach.

sys/kern/vfs_lookup.c
1249

I don't think there is any need to pessimize the common case here.

I would argue the thing to do here is to:

  1. factor mount point traversal out into a dedicated routine
  2. refactor the code, most notably to stop constantly testing for NOCROSSMOUNT
  3. for the problem at hand roll with NOWAIT vfs_busy, which will almost always succeed anyway
  4. ... should it fail, you can fallback to a __noinline routine which will to the dance
sys/kern/vfs_lookup.c
1249

instead of checking dp->v_type == VDIR && (mp = dp->v_mountedhere) you can (vn_irflag_read(vp) & VIRF_MOUNTPOINT) != 0)

sys/kern/vfs_lookup.c
1249

This point, as well as your points 1) and 2) above are definitely good suggestions, but I think they should be done as part of a separate change.

As for points 3) and 4), in this case NOWAIT vfs_busy() wouldn't help. The deadlock happens after vfs_busy() has already successfully completed (almost certainly without blocking), while the following call to VFS_ROOT() blocks while trying to lock the root vnode. At the same time, dounmount() on another thread blocks waiting for the mp busy count to drain while holding the covered vnode lock (which in this case happens to be the same as the root vnode lock being waited on by the first thread).

On the other hand, I did mention this alternative approach in my earlier comment, which would reqiure NOWAIT vfs_busy():

--Changing the order of dounmount() so that it drains the busy count before locking the covered vnode. I actually started implementing this one; it felt less hacky than VV_CROSSLOCK but also much more dangerous given the delicate ordering of dounmount(). I also very quickly realized that for this approach to have any chance of working, we would need to change the vfs_busy() call in vfs_lookup() to pass MBF_NOWAIT, otherwise we'll simply get a slightly different kind of LOR deadlock. I'm not sure MBF_NOWAIT would be acceptable there either.

  • Replace runtime check with KASSERT, add comment on CROSSLOCK behavior
sys/kern/vfs_lookup.c
123

If you are dropping support for LK_INTERLOCK for crossmp vnode, then should it be asserted that LK_INTERLOCK is never specified for it?

Also, I think that crossmp_vop_lock1() changes should be a separate commit.

sys/kern/vfs_lookup.c
123

I need to re-test, because it's been so long since I wrote this patch that I no longer remember the exact set of issues I was having with this function. But I do recall that dropping the interlock here did not make sense to me, especially given that the corresponding vop_unlock doesn't re-acquire it, so it seemed as though that would cause problems for any caller holding the interlock. Is there some non-obvious (to me) reason why this would have been done?

sys/kern/vfs_lookup.c
123

EDIT: nevermind, I'd forgotten that dropping the interlock is the documented behavior of VOP_LOCK().
It may not matter in practice, but it wasn't my intent to effectively drop support for LK_INTERLOCK, so let me revise this function.

  • Fix typo, restore support for LK_INTERLOCK on vp_crossmp
sys/kern/vfs_lookup.c
123

I've restored support for LK_INTERLOCK, this does not cause any issues in my testing.
The reason I removed the WITNESS_CHECKORDER directive is that, with VV_CROSSLOCK the covered vnode lock may be upgraded to exclusive (depending on the requirements of the mounted FS) after locking vp_crossmp. This results in a crossmp->covered ordering instead of the usual covered->crossmp ordering and a corresponding witness LOR warning. But since crossmp_vop_lock1() doesn't actually take a lock and requires all consumers to pass LK_SHARED | LK_NOWAIT, the witness directive did not seem useful here.
Is there some reason we should keep it around? If not, should the WITNESS_[UN]LOCK directives also be removed?

sys/kern/vfs_lookup.c
123

The directive is useful because it records a 'logical' lock owner, visible for instance with the db 'show alllocks' command. I believe actual lockmgr() call was removed for the sake of optimization. So I am not sure this is very useful now.

I have not observed any problems with D35054.108486.patch

sys/kern/vfs_lookup.c
123

Would you prefer that I also remove the WITNESS_LOCK and WITNESS_UNLOCK calls here and in crossmp_vop_unlock (as part of a separate commit of course)?

sys/kern/vfs_lookup.c
123

I suspect that yes, the calls have to be removed.

Remove remaining WITNESS directives and no-longer used variables

kib added inline comments.
sys/kern/vfs_lookup.c
1237

Outer () are not needed

This revision is now accepted and ready to land.Aug 6 2022, 11:47 AM

D35054.108900.patch looks good to me.

This revision now requires review to proceed.Aug 13 2022, 5:08 AM

Fix wording in comment, incoporate some suggested performance optimizations

Apologies for letting this review stall, $WORK got in the way in a big way for a while.
I've rebased, incorporated some of @mjg's optimization suggestions, and retested.

This revision is now accepted and ready to land.Oct 21 2022, 11:05 PM