
vfs_lookup(): remove VV_CROSSLOCK logic
Needs Review · Public

Authored by jah on Jun 19 2023, 1:23 AM.
Details

Reviewers
kib
markj
mjg
olce
Summary

Avoid the LOR that originally motivated VV_CROSSLOCK by instead
changing the unmount() path to not lock the covered vnode until
after draining the mountpoint's busy count. Doing this alone would
cause a different LOR between the unmount path and the blocking
vfs_busy() call in vfs_lookup(). Avoid this by first attempting
a nonblocking vfs_busy(). This should succeed in the overwhelming
majority of cases, but if it fails (which can only happen due to a
concurrent unmount), fall back to dropping the vnode lock before
attempting a blocking vfs_busy().
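
The crossing sequence described above can be modeled in userspace roughly as follows. This is only a sketch: struct mount, struct vnode, vfs_busy_model() and cross_mount_model() are simplified stand-ins invented for illustration, not the real kernel API, and the "blocking" busy here stands in for sleeping by letting the pending unmount complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the fields
 * relevant to the crossing logic are modeled. */
struct vnode;

struct mount {
    int           busy;        /* models the mount's busy count */
    bool          unmounting;  /* models MNTK_UNMOUNT */
    struct vnode *covered;     /* models mnt_vnodecovered */
};

struct vnode {
    bool          locked;        /* models the vnode lock */
    struct mount *v_mountedhere;
};

/* Models vfs_busy(): the nonblocking flavor (MBF_NOWAIT) fails instead
 * of sleeping when an unmount is in progress.  In this single-threaded
 * model, the blocking flavor "sleeps" by letting the pending unmount
 * complete, then reports failure so the caller restarts the loop and
 * rereads v_mountedhere. */
static bool
vfs_busy_model(struct mount *mp, bool nowait)
{
    if (mp->unmounting) {
        if (nowait)
            return (false);
        /* "Sleep" until the unmount drains: here it just completes. */
        mp->unmounting = false;
        if (mp->covered != NULL) {
            mp->covered->v_mountedhere = NULL;
            mp->covered = NULL;
        }
        return (false);
    }
    mp->busy++;
    return (true);
}

/* The pattern from the summary: try a nonblocking busy while holding
 * the vnode lock; if that fails, drop the vnode lock (avoiding the LOR
 * against unmount), do a blocking busy, relock, and reread
 * v_mountedhere before trusting mp. */
static struct mount *
cross_mount_model(struct vnode *dp)
{
    struct mount *mp;

    for (;;) {
        assert(dp->locked);
        mp = dp->v_mountedhere;
        if (mp == NULL)
            return (NULL);           /* nothing mounted here (anymore) */
        if (vfs_busy_model(mp, true))
            return (mp);             /* fast path: no unmount racing us */
        dp->locked = false;          /* VOP_UNLOCK(dp) */
        (void)vfs_busy_model(mp, false); /* may block in the kernel */
        dp->locked = true;           /* vn_lock(dp, ... | LK_RETRY) */
        /* Loop: recheck v_mountedhere under the relocked vnode. */
    }
}
```

In the fast path the busy count is taken without ever dropping the vnode lock; in the slow path the restart rereads v_mountedhere, so a mount torn down while the lock was dropped is never traversed.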

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

jah requested review of this revision. Jun 19 2023, 1:23 AM

Right now this change is just a proposal. I've successfully run the unionfs and nullfs stress2 tests against it, and have also been running it on my -current machine for the last couple of months.
Of course it's entirely possible I've missed something that would make this change unworkable. But if you guys think this approach has merit, then I'll finish this patch by doing the following:

--Actually remove the VV_CROSSLOCK flag and the logic in unionfs/nullfs that manages it.
--Change the following places that assume the current coveredvp/vfs_busy() lock order to use logic similar to vfs_lookup (which suggests making it an inline helper function):

sys_fchdir (sys/kern/vfs_syscalls.c)
nfsrvd_readdirplus (sys/fs/nfsserver/nfs_nfsdport.c)
zfsctl_mounted_here (sys/contrib/openzfs/module/os/freebsd/zfs/zfs_ctldir.c)
zfs_snapshot_vptocnp (sys/contrib/openzfs/module/os/freebsd/zfs/zfs_ctldir.c)

Hi Jason,

Glad to see this!

I think this approach is good. Indeed, vfs_busy() with MBF_NOWAIT will almost never fail and the complexity introduced by the fallback case can easily be masked by creating a separate utility function that walks through mounts (creating this function has been on my todo list for a while as well).

I have a couple of concerns though, please see inline comments. I'm planning to test your patch modified according to my comments tomorrow. To this end, I'll have to complete the patch anyway to avoid LORs. So I'll perhaps create another review to show the result in order not to interfere with your diff (I don't think Phabricator supports alternate diffs for a single review?) and from which you'll be able to pick up what you want.

Thanks.

sys/kern/vfs_lookup.c
1312–1315

I would remove resetting error to 0. It's not functionally necessary, since we are going to restart the loop, which will unconditionally set error again.

sys/kern/vfs_mount.c
2268–2269

This code needs to be kept here so that the VFS cache will fall back to regular lookups as soon as it is clear that the unmount process has started; more precisely, this has to be atomic with respect to setting MNTK_UNMOUNT, so it must happen under mp's (inter)lock.

Not doing so would allow a race where the cache still resolves through the mountpoint even after other processes/threads get EBUSY when trying to unmount.

2289–2308

Most of this code is now unnecessary. The whole block can be removed, and locking can happen later, when coveredvp is effectively marked as not being a mount point anymore.

sys/kern/vfs_mount.c
2268–2269

good catch!

2289–2308

I was tempted to just delete this block as well when I initially wrote the patch, but I think we actually still need it. Though it's not explicitly stated in the man page, I believe coveredvp being locked is part of the contract with filesystems for VFS_UNMOUNT(). I do think the re-check of coveredvp->v_mountedhere can be removed though, because by this point we've already successfully set MNTK_UNMOUNT, so no one else should be able to successfully unmount.

  • Return the write sequence on coveredvp to the correct place, replace
sys/kern/vfs_mount.c
2289–2308

I've checked all in-tree filesystems and the exclusive lock on mnt_vnodecovered is not assumed by any vfs_mount implementation. Only 3 filesystems actually deal with mnt_vnodecovered there: tarfs, tmpfs and unionfs. The first two acquire a shared lock on their own just to be able to call VOP_GETATTR(); the last one does so for this reason as well, but also at another place apparently without reason, and then exclusively to set and later unset VV_CROSSLOCK, which we are removing. Anyway, these acquisitions are always explicit. So I think we should remove locking here, and move it solely around the block disassociating coveredvp from the mount point.

Additionally, an active reference (v_usecount) has been held on the covered vnode since the mount and is released in vfs_mount_destroy(). So could you remove the vholdl()/vdrop() on coveredvp?

Thanks.

Update (2023/06/22): I've just noticed I left this comment as unsubmitted yesterday. I should be able to come back with some code later this day.

sys/kern/vfs_mount.c
2289–2308

Heh, I added that (recursive) lock of mnt_vnodecovered in unionfs_unmount() out of caution since I wasn't sure if mnt_vnodecovered locking was intended to be part of the contract with VFS_UNMOUNT or not. I also remember being somewhat surprised to find that mnt_vnodecovered is locked across VFS_UNMOUNT but not across VFS_MOUNT.

I wonder if we actually need to lock coveredvp at all in this function? We do clear VIRF_MOUNTPOINT later, but that only requires the interlock.

In any case, I would rather make those changes to coveredvp locking separately. If there aren't any in-tree consumers relying on the locking then I suspect there aren't any out-of-tree ones either. But it's also not related to the point of this PR, which is to kill VV_CROSSLOCK, and ideally I'd like to have a better idea of why the locking was done this way originally before I go about changing it.

I agree that the hold on coveredvp is superfluous, so I'll get rid of it here.

  • Remove extraneous vhold()

Sorry, I was a bit ill today, so couldn't work much. Still, going to submit a few differentials so that you can see where I stand. My code is currently incomplete since the mount point traversal factorization is missing.

Thanks.

sys/kern/vfs_mount.c
2289–2308

> I wonder if we actually need to lock coveredvp at all in this function? We do clear VIRF_MOUNTPOINT later, but that only requires the interlock.

Yes, you're probably right since I think it's the case for setting v_mountedhere (I'll have to double check).

> In any case, I would rather make those changes to coveredvp locking separately. If there aren't any in-tree consumers relying on the locking then I suspect there aren't any out-of-tree ones either. But it's also not related to the point of this PR, which is to kill VV_CROSSLOCK, and ideally I'd like to have a better idea of why the locking was done this way originally before I go about changing it.

Sure. I saw that in passing, and since it simplifies the code a bit, I thought I would mention it. I'll propose a later patch for that.

Please see https://reviews.freebsd.org/D40719 (essentially your code with my suggestions, so currently with locking of coveredvp removed around VFS_UNMOUNT()) and https://reviews.freebsd.org/D40720, which is about removing VV_CROSSLOCK.

I also have separate small patches prompted by reviewing all this code, but we can see that later.

Hope this helps.

And sorry, I was a bit ill today, so couldn't work much. That's why there is a missing part: factorization of mount point traversal and calling it from the places you mentioned. So the patches cannot be run as is. I'll finish them ASAP.

Regards.

I've spotted two significant problems in the current proposal. Please see inline comments.

I've re-analyzed the relevant generic code, and modified and completed it so that the initial scheme can work. In the end, the most time-consuming part was how to factor in and cut the mount crossing code in several parts so that it is re-usable by all consumers, and dealing with the references and locking complexities in the different code paths.

Incidentally, I've also checked all calls to vfs_busy() and I indeed confirm your initial list of functions to change (IIRC, all other calls also involving a vnode want to acquire the vnode's v_mount).

The final result survives all of stress2's *mount*.sh tests without problems (there is just a spurious LOR), as well as the filesystem tests.

Please see the full stack of reviews, starting with https://reviews.freebsd.org/D40847 which is exactly the current diff you proposed here but only for vfs_mount.c. I've made my own changes to vfs_lookup.c after having devised the crossing functions.

I would like to draw the attention of everybody involved to the fact that this development is not only about removing VV_CROSSLOCK but also about removing an actual timing-dependent deadlock between lookup and unmount, as should become apparent after reading the commit messages/comments.

sys/kern/vfs_lookup.c
1309–1310

After the unlock, there is no guarantee that mp is still referenced, and thus that the right mount point is busied. Because of the type stability of mount points, at least this won't crash, but a completely unrelated mount point may then be busied, i.e., one that was never mounted on the vnode.

1311

vn_lock() with LK_RETRY can return a doomed vnode in case of a forced unmount, which may then proceed to the rest of the lookup with interesting side-effects.

sys/kern/vfs_lookup.c
1309–1310

Shouldn't the 'else if (dp->v_mountedhere != mp)' check below handle that case?

1311

Yes, we probably need to bring back the VN_IS_DOOMED() check from the old VV_CROSSLOCK implementation.

sys/kern/vfs_lookup.c
1309–1310

I don't think that's enough: the struct mount may have been completely reused for another mount in the case of a forced unmount. I see two solutions, either using mnt_gen again or keeping a hold on the mountpoint. I chose the second one in https://reviews.freebsd.org/D40851.

sys/kern/vfs_lookup.c
1309–1310

For that to happen, wouldn't dp need to be completely recycled for a different file, with v_mountedhere happening to point to an also-recycled version of the same struct mount? How would that happen, given that we hold a reference on dp during this entire sequence (the lock is dropped using VOP_UNLOCK() rather than vput() intentionally to avoid losing the reference)?

sys/kern/vfs_lookup.c
1309–1310

Indeed I was wrong saying that the wrong mount point can be traversed, but I think not for the reason you've just mentioned. Let me explain things in detail, so everyone can double check.

The scenario you've just described can indeed happen. Holding an active reference on dp alone doesn't prevent (non-forced) unmounts and mounts from occurring concurrently. In particular, an unmount can proceed up to "freeing" the mp structure (vfs_mount_destroy()), after having updated v_mountedhere to NULL on dp, which only requires acquiring a lock (the rest of the process is independent of dp). So it is possible that, after the VOP_UNLOCK() call but before the vfs_busy(), the mount point mounted on dp is unmounted, and another mount is remounted on it with mp being recycled to represent it.

In spite of this, no wrong mount point can be traversed because, once vfs_busy() returns successfully, no further unmount of mp can happen and its structure can't be recycled. So, subsequently checking that mp is still equal to v_mountedhere under the vnode lock guarantees that the good mount point (that which is currently mounted on the vnode) is traversed, even if between VOP_UNLOCK() and vfs_busy() the mount point represented by mp actually changed.

The test dp->v_mountedhere != mp catches the case I've just described, as well as a sequence of mounts/unmounts resulting in dp not being covered, or being covered by a mount represented by a different structure (i.e., at another address in memory). The only thing it doesn't catch is the failure of vfs_busy() (in scenarios similar to the above, dp->v_mountedhere != mp can be false after mp has been recycled, in which case the call to VFS_ROOT() would proceed but without an active reference on mp). But this is the point of the if (error != 0) test, which restarts the loop.

This code however relies on mount point structures never being deallocated, an invariant I personally dislike. I know it has its advantages, but we might want (or perhaps even need, with the multiplication of mounts with ZFS and file mounts) to get rid of it in the future. We'll see.
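
The recheck being discussed can be illustrated with a tiny userspace model. All names below are illustrative stand-ins (remount_during_unlock() in particular is invented for the demonstration, not a kernel function): the mount covering dp is replaced while the vnode lock is dropped, and the dp->v_mountedhere != mp comparison under the relocked vnode catches the swap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins: a vnode that records which mount covers it. */
struct mount {
    int id;      /* only to distinguish instances in the model */
};

struct vnode {
    bool          locked;        /* models the vnode lock */
    struct mount *v_mountedhere;
};

/*
 * Models the race window between VOP_UNLOCK() and relocking dp: the
 * mount originally covering dp is unmounted and a different mount
 * (possibly a recycled struct mount) is mounted in its place.  This
 * can only happen while dp is unlocked, which the assert encodes.
 */
static void
remount_during_unlock(struct vnode *dp, struct mount *newmp)
{
    assert(!dp->locked);
    dp->v_mountedhere = newmp;
}
```

The usage below walks the exact sequence from the discussion: sample mp under the lock, drop the lock (the held reference keeps dp itself from being recycled), let the race happen, relock, and observe that the recheck fails, forcing a restart with the mount currently covering dp.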

sys/kern/vfs_lookup.c
1309–1310

Right, I think we're just describing two different scenarios here.

You're describing the scenario in which the filesystem mounted at dp is unmounted (not necessarily forcibly) and mp recycled to a different mount. I agree with your analysis above.

I was describing the scenario in which dp itself is doomed by a forcible unmount of *its* filesystem. If both dp and mp were allowed to be recycled while dp's lock is dropped, this would be problematic for lookup because it could allow the lookup to walk into a completely unrelated directory hierarchy. But that can't happen: mp may be recycled, which as you point out above isn't by itself fatal to the integrity of the lookup, but the reference prevents dp from being recycled.

sys/kern/vfs_lookup.c
1309–1310

Ah, yes, I read a bit too quickly what you wrote and took it as the scenario I presented because to me that was the possible path to misbehavior. I knew from the start that the other one was impossible, as you've just described. Sorry for the confusion caused.