Paths

Table of Contentst

vfs_lookup(): remove VV_CROSSLOCK logic
Needs ReviewPublic
Actions

Authored by jah on Jun 19 2023, 1:23 AM.

Details

Reviewers

kib
markj
mjg
olce

Summary

Avoid the LOR that originally motivated VV_CROSSLOCK by instead
changing the unmount() path to not lock the covered vnode until
after draining the mountpoint's busy count. Doing this alone would
cause a different LOR between the unmount path and the blocking
vfs_busy() call in vfs_lookup(). Avoid this by first attempting
a nonblocking vfs_busy(). This should succeed in the overwhelming
majority of cases, but if it fails (which can only happen due to
concurrent unmount) fall back to dropping the vnode lock before
attempting a blocking vfs_busy().

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 52211
Build 49102: arc lint + arc unit

Event Timeline

jah created this revision.Jun 19 2023, 1:23 AM

Herald added a subscriber: imp. · View Herald TranscriptJun 19 2023, 1:23 AM

jah requested review of this revision.Jun 19 2023, 1:23 AM

Harbormaster completed remote builds in B52130: Diff 123396.Jun 19 2023, 1:23 AM

Right now this change is just a proposal. I've successfully run the unionfs and nullfs stress2 tests against it, and have also been running it on my -current machine for the last couple of months.
Of course it's entirely possible I've missed something that would make this change unworkable. But if you guys think this approach has merit, then I'll finish this patch by doing the following:

--Actually remove the VV_CROSSLOCK flag and the logic in unionfs/nullfs that manages it.
--Change the following places that assume the current coveredvp/vfs_busy() lock order to use logic similar to vfs_lookup (which suggests making it an inline helper function):

sys_fchdir (sys/kern/vfs_syscalls.c)
nfsrvd_readdirplus (sys/fs/nfsserver/nfs_nfsdport.c)
zfsctl_mounted_here (sys/contrib/openzfs/module/os/freebsd/zfs/zfs_ctldir.c)
zfs_snapshot_vptocnp (sys/contrib/openzfs/module/os/freebsd/zfs/zfs_ctldir.c)

jah added a reviewer: olce.Jun 19 2023, 1:30 AM

Hi Jason,

Glad to see this!

I think this approach is good. Indeed, vfs_busy() with MBF_NOWAIT will almost never fail and the complexity introduced by the fallback case can easily be masked by creating a separate utility function that walks through mounts (creating this function has been on my todo list for a while as well).

I have a couple of concerns though, please see inline comments. I'm planning to test your patch modified according to my comments tomorrow. To this end, I'll have to complete the patch anyway to avoid LORs. So I'll perhaps create another review to show the result in order not to interfere with your diff (I don't think Phabricator supports alternate diffs for a single review?) and from which you'll be able to pick up what you want.

Thanks.

sys/kern/vfs_lookup.c
1312–1315	I would remove resetting `error` to 0. Not functionally necessary, since we are going to restart the loop, which unconditionally will set `error` again.
sys/kern/vfs_mount.c
2268–2269	This code needs to be kept here so that the VFS cache will fallback to regular lookups as soon as it is clear that the unmount process has started, and more precisely this has to be atomic with respect to setting `MNTK_UNMOUNT`, so under `mp`'s (inter)lock. Not doing so would allow a race where the cache still resolves through the mountpoint even after other processes/threads get EBUSY when trying to unmount.
2289–2308	Most of this code is now unnecessary. The whole block can be removed, and locking can happen later, when `coveredvp` is effectively marked as not being a mount point anymore.

jah added inline comments.Jun 20 2023, 9:23 PM

sys/kern/vfs_mount.c
2268–2269	good catch!
2289–2308	I was tempted to just delete this block as well when I initially wrote the patch, but I think we actually still need it. Though it's not explicitly stated in the man page, I believe coveredvp being locked is part of the contract with filesystems for VFS_UNMOUNT(). I do think the re-check of coveredvp->v_mountedhere can be removed though, because by this point we've already successfully MNTK_UNMOUNT, so no else should be able to successfully unmount.

Return the write sequence on coveredvp to the correct place, replace

Harbormaster completed remote builds in B52211: Diff 123568.Jun 20 2023, 9:57 PM

olce added inline comments.Jun 22 2023, 8:19 AM

sys/kern/vfs_mount.c
2289–2308	I've checked all in-tree filesystems and the exclusive lock on `mnt_vnodecovered` is not assumed by any `vfs_mount` implementation. Only 3 filesystems actually deal with `mnt_vnodecovered` there: `tarfs`, `tmpfs` and `unionfs`. The first two acquire a shared lock on their own just to be able to call `VOP_GETATTR()`, and the last one for this reason as well but also at another place, apparently without reason, and then exclusively to set and then unset `VV_CROSSLOCK` which we are removing. Anway, these acquisitions are always explicit. So I think we should remove locking here, and move it solely around the block diassociating `coveredvp` with the mount point. Additionally, an active reference (`v_usecount`) is grabbed on the covered vnode since the mount and is released in `vfs_mount_destroy()`. So could you remove the `vholdl()`/`vdrop()` on `coveredvp`? Thanks. Update (2023/06/22): I've just noticed I left this comment as unsubmitted yesterday. I should be able to come back with some code later this day.

jah added inline comments.Jun 22 2023, 3:21 PM

sys/kern/vfs_mount.c
2289–2308	Heh, I added that (recursive) lock of mnt_vnodecovered in unionfs_unmount() out of caution since I wasn't sure if mnt_vnodecovered locking was intended to be part of the contract with VFS_UNMOUNT or not. I also remember being somewhat surprised to find that mnt_vnodecovered is locked across VFS_UNMOUNT but not across VFS_MOUNT. I wonder if we actually need to lock coveredvp at all in this function? We do clear VIRF_MOUNTPOINT later, but that only requires the interlock. In any case, I would rather make those changes to coveredvp locking separately. If there aren't any in-tree consumers relying on the locking then I suspect there aren't any out-of-tree ones either. But it's also not related to the point of this PR, which is to kill VV_CROSSLOCK, and ideally I'd like to have a better idea of why the locking was done this way originally before I go about changing it. I agree that the hold on coveredvp is superfluous, so I'll get rid of it here.

Remove extraneous vhold()

Harbormaster completed remote builds in B52251: Diff 123655.Jun 22 2023, 5:50 PM

Sorry, I was a bit ill today, so couldn't work much. Still, going to submit a few differentials so that you can see where I stand. My code is currently incomplete since the mount point traversal factorization is missing.

Thanks.

sys/kern/vfs_mount.c
2289–2308	I wonder if we actually need to lock coveredvp at all in this function? We do clear VIRF_MOUNTPOINT later, but that only requires the interlock. Yes, you're probably right since I think it's the case for setting `v_mountedhere` (I'll have to double check). In any case, I would rather make those changes to coveredvp locking separately. If there aren't any in-tree consumers relying on the locking then I suspect there aren't any out-of-tree ones either. But it's also not related to the point of this PR, which is to kill VV_CROSSLOCK, and ideally I'd like to have a better idea of why the locking was done this way originally before I go about changing it. Sure. I saw that in passing, and since it simplifies the code a bit, I thought I would mention it. I'll propose a later patch for that.

Please see https://reviews.freebsd.org/D40719 (essentially your code with my suggestions, so currently with locking of coveredvp removed around VFS_UNMOUNT()) and https://reviews.freebsd.org/D40720, which is about removing VV_CROSSLOCK.

I also have separate small patches prompted by reviewing all this code, but we can see that later.

Hope this helps.

And sorry, I was a bit ill today, so couldn't work much. That's why there is a missing part: factorization of mount point traversal and calling it from the places you mentioned. So the patches cannot be run as is. I'll finish them ASAP.

Regards.

I've spotted two significant problems in the current proposal. Please see inline comments.

I've re-analyzed the relevant generic code, and modified and completed it so that the initial scheme can work. In the end, the most time-consuming part was how to factor in and cut the mount crossing code in several parts so that it is re-usable by all consumers, and dealing with the references and locking complexities in the different code paths.

Incidentally, I've also checked all calls to vfs_busy() and I indeed confirm your initial list of functions to change (IIRC, all other calls also involving a vnode want to acquire the vnode's v_mount).

The final result survives stress2's all *mount*.sh tests without problems (there is just a spurious LOR) as well as filesystem tests.

Please see the full stack of reviews, starting with https://reviews.freebsd.org/D40847 which is exactly the current diff you proposed here but only for vfs_mount.c. I've made my own changes to vfs_lookup.c after having devised the crossing functions.

I would like to draw the attention of everybody involved that this development is not only about removing VV_CROSSLOCK but also about removing an actual timing-dependent deadlock between lookup and unmount, as should become apparent after reading the commit messages/comments.

sys/kern/vfs_lookup.c
1309–1310	After the unlock, there is no guarantee that `mp` is still referenced, and thus that the right mount point is busied. Because of mount points type stability, at least this won't crash, but a completely unrelated mount point may then be busied, i.e., one that was never mounted on the vnode.
1311	`vn_lock()` with `LK_RETRY` can return a doomed vnode in case of a forced unmount, which may then proceed to the rest of the lookup with interesting side-effects.

olce mentioned this in D40847: dounmount(): Acquire covered vnode lock after draining mount point references.Jul 3 2023, 1:02 PM

jah added inline comments.Jul 3 2023, 3:09 PM

sys/kern/vfs_lookup.c
1309–1310	Shouldn't the 'else if (dp->v_mountedhere != mp)' check below handle that case?
1311	Yes, we probably need to bring back the VN_IS_DOOMED() check from the old VV_CROSSLOCK implementation.

olce added inline comments.Jul 3 2023, 3:24 PM

sys/kern/vfs_lookup.c
1309–1310	I don't think that's enough, the `struct mount` may have been reused completely for another mount in case of a forced unmount. I see two solutions, either using again `mnt_gen` or keeping a hold on the mountpoint. I chose the second one in https://reviews.freebsd.org/D40851.

olce added inline comments.Jul 3 2023, 3:27 PM

sys/kern/vfs_lookup.c
1309–1310	https://reviews.freebsd.org/D40850, sorry.

jah added inline comments.Jul 3 2023, 3:41 PM

sys/kern/vfs_lookup.c
1309–1310	For that to happen, wouldn't dp need to be completely recycled for a different file, with v_mountedhere happening to point to an also-recycled version of the same struct mount? How would that happen, given that we hold a reference on dp during this entire sequence (the lock is dropped using VOP_UNLOCK() rather than vput() intentionally to avoid losing the reference)

olce added inline comments.Jul 3 2023, 5:25 PM

sys/kern/vfs_lookup.c
1309–1310	Indeed I was wrong saying that the wrong mount point can be traversed, but I think not for the reason you've just mentioned. Let me explain things in detail, so everyone can double check. The scenario you've just described can indeed happen. Having an active reference on `dp` doesn't alone prevent (non-forced) unmounts and mounts to occur concurrently. In particular, unmounts can proceed up to "freeing" the `mp` structure (`vfs_mount_destroy()`), after having updated `v_mountedhere` to `NULL` on `dp` which only needs acquiring a lock (the rest of the process is independent of `dp`). So it is possible that, after the `VOP_UNLOCK()` call but before the `vfs_busy()`, the mount point mounted on `dp` is unmounted, and another mount remounted on it with `mp` being recycled to represent it. In spite of this, no wrong mount point can be traversed because, once `vfs_busy()` returns successfully, no further unmount of `mp` can happen and its structure can't be recycled. So, subsequently checking that `mp` is still equal to `v_mountedhere` under the vnode lock guarantees that the good mount point (that which is currently mounted on the vnode) is traversed, even if between `VOP_UNLOCK()` and `vfs_busy()` the mount point represented by `mp` actually changed. The test `dp->v_mountedhere != mp` catches the case I've just described, as well as a sequence of mounts/unmounts resulting in `dp` not being covered or being covered by a mount represented by a different structure (i.e., at another address in memory). The only thing it doesn't catch is the failure of `vfs_busy()` (with similar scenarios as above, `dp->v_mountedhere != mp` can be false after `mp` having being recycled, in which case the call to VFS_ROOT() would proceed but without an active reference on `mp`). But this is the point of the `if (error != 0)` test, which restarts the loop. This code however relies on mount point structures never being deallocated, an invariant I personally dislike. I know it has its advantages, but we might want (or perhaps even need, with the multiplication of mounts with ZFS and file mounts) to get rid of it in the future. We'll see.

jah added inline comments.Jul 3 2023, 6:03 PM

sys/kern/vfs_lookup.c
1309–1310	Right, I think we're just describing two different scenarios here. You're describing the scenario in which the filesystem mounted at dp is unmounted (not necessarily forcibly) and mp recycled to a different mount. I agree with your analysis above. I was describing the scenario in which dp itself is doomed by a forcible unmount of its filesystem. If both dp and mp were allowed to be recycled while dp's lock is dropped, this would be problematic for lookup because it could allow the lookup to walk into a completely unrelated directory hierarchy. But that can't happen: mp may be recycled, which as you point out above isn't by itself fatal to the integrity of the lookup, but the reference prevents dp from being recycled.

olce added inline comments.Jul 3 2023, 7:12 PM

sys/kern/vfs_lookup.c
1309–1310	Ah, yes, I read a bit too quickly what you wrote and took it as the scenario I presented because to me that was the possible path to misbehavior. I knew from the start that the other one was impossible, as you've just described. Sorry for the confusion caused.