VFS lookup: New vn_cross_single_mount() and vn_cross_mounts()
Needs Review · Public

Authored by olce on Jul 3 2023, 12:53 PM.

Details

Reviewers
jah
kib
markj
mjg
Summary

These new functions are helpers to cross mount points covering a given vnode.
They make sure that there is no lock-like ordering between obtaining a busy
reference on the mount point and the original vnode lock.

Doing so solves two problems at once. The first is a possible deadlock with
stacked filesystems (nullfs, unionfs) for which locking the covered vnode is
equivalent to locking the mounted filesystem's root vnode, which was previously
solved via VV_CROSSLOCK. If the covered vnode is still locked while calling
vfs_busy(), the "(covered) vnode lock" -> "mount point reference" lock order is
established, while finishing crossing the mount point establishes the reverse
"mount point reference" -> "(root) vnode lock" order, since the covered vnode
lock was released in-between. (VV_CROSSLOCK's solution was to refrain from
releasing the covered vnode lock until the end of the whole process, at the
price of complexity in both the generic and the stacked filesystems' code,
since some locks had to be acquired in advance and recursively.)

The second is a possible deadlock at unmount (with or without stacked
filesystems), where dounmount() establishes the order "mount point reference" ->
"covered vnode lock", which is incompatible with the lookup-side order
described in the previous paragraph.
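
In rough outline, the new ordering looks like this (a minimal sketch of the
protocol described above, using only existing KPIs; it is not the helpers'
actual code):

/*
 * Sketch of the lock-order-free crossing protocol (illustrative only,
 * not the actual vn_cross_single_mount() body).
 */
struct mount *mp;
struct vnode *rvp;
int error;

mp = vp->v_mountedhere;
vfs_ref(mp);                    /* pin mp without creating a lock order */
VOP_UNLOCK(vp);                 /* drop the covered vnode lock first... */
error = vfs_busy(mp, 0);        /* ...so busying may sleep safely */
vfs_rel(mp);
if (error == 0) {
        /* Only "mount point reference" -> "(root) vnode lock" remains. */
        error = VFS_ROOT(mp, LK_EXCLUSIVE, &rvp);
        vfs_unbusy(mp);
}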

This commit is part of a series whose goal is to remove the VV_CROSSLOCK vnode
flag and logic and fix a timing-dependent deadlock between lookup and unmount.

No functional change.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

olce requested review of this revision. Jul 3 2023, 12:53 PM
sys/kern/vfs_lookup.c
205

Why did you choose to take this approach instead of first trying vfs_busy(..., MBF_NOWAIT) as in the initial patch?

This would seem to add a fair amount of lock twiddling and refcounting to the common vfs_lookup() path, which is undesirable in terms of performance.

sys/kern/vfs_lookup.c
205

It might seem that foregoing vfs_busy(..., MBF_NOWAIT) causes more locking, but in fact it does not (and it is even the opposite in the contended case).

In the end, the covered vnode's lock has to be released anyway (for a traversal that goes well, i.e., no concurrent mounts/unmounts, but in some other cases as well), so trying to optimize acquisition of the mount point through vfs_busy(..., MBF_NOWAIT), for the sole purpose of not releasing a vnode lock that is due to be unlocked just after, didn't seem to make sense to me.

Actually, the current code in https://reviews.freebsd.org/D40600 causes even more lock acquisitions (and double calls to vfs_busy()) in the case with contention on the mount point, e.g., if vfs_busy(..., MBF_NOWAIT) fails, since the vnode is first unlocked to call vfs_busy(..., 0), then relocked just to check whether dp->v_mountedhere == mp, and then unlocked again to continue with the lookup. Not that this case matters that much, but anyway...

The recheck of mp is never necessary in the approach presented here, thanks to the initial vfs_ref() (stemming also from my reluctance to rely on struct mount objects never being freed). More generally, I find this approach simpler to reason about (in part also because it has been split into a few functions).

vfs_ref(), as well as vfs_busy(), execute with no locks at all in the common case. Their cost is not zero but I'm willing to bet it is insignificant compared to vnode lock acquisitions.

This gives me an idea. Re-introducing vfs_busy(..., MBF_NOWAIT) could still have an advantage, namely foregoing the need for vfs_ref() at all if the call succeeds. It would be, in a sense, a fusion of both approaches. I'll try that and report tomorrow.
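
A rough sketch of that fusion (illustrative only; 'unlocked' is the out-parameter
discussed later in this review):

/*
 * Sketch of the "fusion": try to busy without dropping the vnode
 * lock, fall back to ref + unlock + sleepable vfs_busy().
 */
if (vfs_busy(mp, MBF_NOWAIT) == 0) {
        /* Fast path: no vfs_ref() needed, vp stays locked. */
        *unlocked = false;
} else {
        vfs_ref(mp);                    /* keep mp alive across the unlock */
        VOP_UNLOCK(vp);
        *unlocked = true;
        error = vfs_busy(mp, 0);        /* may sleep; no vnode lock held */
        vfs_rel(mp);
        if (error != 0)
                return (error);
}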

sys/kern/vfs_lookup.c
195

Why do you need to check VN_IS_DOOMED() here? A non-doomed vnode should be either a precondition of the lookup's initial call to this function, or of the prior iteration's call to VFS_ROOT() with the mount busied.

198

Why check VIRF_MOUNTPOINT here? We shouldn't enter this path at all when not dealing with a mount point, which is the common case.

205

I think it's definitely worthwhile to use vfs_busy(..., MBF_NOWAIT) to remove vfs_ref()/vfs_rel() from the common path. As you mention, we don't really care about performance for the non-common contended path. But even for the contended case, I'll note that dropping the vnode lock early: 1) requires acquiring the vnode interlock, which partially negates any advantage from not needing to re-acquire the vnode lock, and 2) requires all the 'unlocked' logic, which adds complexity to vn_cross_mounts().

Please see inline comments.

So far, I've tested the whole stack of commits with stress2, using all.sh on all tests matching *mount*.sh, *nullfs*.sh, *unionfs*.sh and fs.sh (modified to avoid the full FS parts), in a VirtualBox machine with 4 CPUs and 4 GB. All tests pass.

In terms of performance, I see no significant difference in elapsed time: most of the time the exact same duration is reported, and the fluctuations do not seem significant. This certainly has to be taken with a grain of salt, given the test conditions (virtualized environment), the fact that I didn't run the tests more than a dozen times (which is not statistically sufficient) and the fact that stress2 tests are randomized. Also, these tests most likely measure the performance of the contended case only.

Ideas, and actual hardware for proper benchmarking, would be welcome. I plan to dig into other stress2 cases and have a look at fsx as well to see if it can fit the bill, all of this on top of a very fast FS such as tmpfs and with the VFS cache disabled.

I would also like to recall the larger context:

  • We are talking about optimizing the fast path of vfs_lookup(), which is (currently) itself the slow path for name resolution compared to the VFS cache (although some may have different plans for the future).
  • This fixes a timing-dependent deadlock between lookup and unmount, and does so by changing the locking protocol around mount point traversal. As such, it is probably better (or even mandatory?) that it goes into -CURRENT only, so assuming it can be pushed into 14, the time frame is short.
sys/kern/vfs_lookup.c
195

Saying this, you're assuming that vn_busy_mountedhere() is ultimately called by vn_cross_mounts(), but that's not always the case, as you can see in subsequent differentials. In standalone uses, this check is necessary.

See next comment as well.

198

That's the contract of this function, which has been carefully chosen to simplify callers, some of which need the check. And, conceptually, it doesn't matter whether the check is done by the caller or the callee; it has to be done anyway.

I'm highly skeptical that the two tests above can have any noticeable performance impact, even in the full lookup case (calls initiated by vn_cross_mounts()) where they are indeed not necessary. With the proper annotations (I can add __predict_true() around that last check), both tests are predicted correctly, implying we are talking about a few cycles at most (on x86 at least). If really necessary, I can always split the function even further.

What may matter is avoiding function calls in the common path (I've read some past comments from @mjg hinting at that in different contexts). This is not difficult to arrange by providing a fast-path inline function that tests whether we are indeed at a mounted-over vnode, although it will increase complexity a bit. I guess the question is rather whether it is worth it, which I again have doubts about given the context (recalled in the added global comment on this differential). At the same time, clearing these doubts may very well require implementing and benchmarking it anyway.

Thus, perhaps the most pressing issue is now to find good benchmarks (with at least one for the non-contended case and another for the contended one), in the hope that we can see clear differences. In my tests so far (see the global comment), I do not see any significant difference for the first proposed round of changes alone, but these are no more than mere hints.

205

Contended case.

  1. acquire the vnode interlock, which partially negates any advantage from not needing to re-acquire the vnode lock

Yes, but the key point here may well be "partially". Contending on a non-sleepable lock normally has every chance of being more scalable than contending on a sleepable one (the only exception being if everybody almost always locks the vnode shared, while the interlock is a simple mutex). Plus, there is no need to wonder about which lkflags to use with the vnode lock, another advantage for standalone calls, where we simply don't (want to) bother with that.

  1. requires all the 'unlocked' logic which adds complexity to vn_cross_mounts().

The unlocked logic is here to optimize locking by avoiding unnecessary unlock/relock cycles on vnode locks as much as possible. I consider it a first-order optimization for the contended case, and probably for the non-contended one as well. Splitting the cross-mount functionality into a few functions may indeed have made it more desirable in terms of performance. However, this has to be balanced against the reduction in code duplication, and the alternative is to use a constellation of functions meant to directly optimize all paths (with peephole optimizations for locks), which I suspect will lead to code of comparable, if not greater, complexity, albeit of a different form. I certainly could try it but I'm wondering if it's worth the trouble.

205

I think it's definitely worthwhile to use vfs_busy(..., MBF_NOWAIT) to remove vfs_ref()/vfs_rel() from the common path.

I doubt vfs_ref() and vfs_rel() cost that much, but this is easy to do, so I'll do it. I suspect the biggest gain may rather come from the fact that unlocked will stay true, avoiding a superfluous unlock/relock cycle. Doing so also has some benefit in terms of semantics (see the added global comment).

Please see also the test report (global comment on this differential). I'll report new test results with this change, with a bit of lag (I won't be able to work on it tomorrow).

The entire vp_crossmp, vfs_busy and crosslock flag usage constitutes complexity which can be avoided afaics. While your patch removes the flag, it keeps the rest. I don't have all the kinks sorted out, so I'm just going to give a brief overview.

The key observation is that, at the end of the day, in the absolute worst case, one can land in a state where:

  1. both ni_dvp and the root vnode at the top of the mount stack are unlocked but have guaranteed liveness via v_usecount
  2. they can be locked without establishing an ordering, using vn_lock_pair
  3. the entire state can be re-validated, either by redoing the walk in some "check mode" (for lack of a better word) or with sequence counters

So all the locks can be taken and the state can be guaranteed as if no relocking happened. Perhaps the critical bit to stress is that one can get the root vnode of the next filesystem *without* busying it.

Conceptually it would look like this:

enum vget_state rvs;
struct vnode *rvp;
int error = VFS_ROOT_REF(vp->v_mount, &rvp, &rvs);

The mount point is guaranteed not to disappear nor be reused because vp is locked. Assuming it worked out:

error = vget_finish_noabort(rvp, lockflags | LK_NOWAIT, rvs);

If this worked out and rvp is not mounted on, this is it. Note there is no need to use vp_crossmp anywhere in here nor the flag.

If these nowaits fail, you vn_lock_pair and validate nothing changed. If something is going on with the mount point, you unlock stuff, deref the found vnode and wait for the mount to settle with vfs_busy. But at that time you don't hold any vnode locks, so there is no deadlock potential.

I may end up hacking this up, but it will require addition of new routines to filesystems. I do think it is worth it.

Reduced complexity aside, the upshot is that the strange bug of vp_crossmp escaping namei will be fixed by whacking its usage.

In D40850#930395, @mjg wrote:

The entire vp_crossmp, vfs_busy and crosslock flag usage constitutes complexity which can be avoided afaics. While your patch removes the flag, it keeps the rest.

Yes, for the current patch series, but not in my plans. As a matter of fact, my immediately next planned move is to remove vp_crossmp, which I also had in mind from the start and which has been prepared for by the presented modifications. Indeed, unless I'm missing something, I don't think any LORs with the covered vnode's parent are now possible with the new protocol (locking first the mount point and then the covered vnode/root vnode), even in the presence of stacked file systems. So I may be just one commit away from removing it. Going to prepare and test it.

I don't have all the kinks sorted out, so I'm just going to give a brief overview.
(...)
So all the locks can be taken and the state can be guaranteed as if no relocking happened.

This I agree but I'm not sure it is going to be simpler.

Perhaps the critical bit to stress is that one can get the root vnode of the next filesystem *without* busying it.

Conceptually it would look like this:

enum vget_state rvs;
struct vnode *rvp;
int error = VFS_ROOT_REF(vp->v_mount, &rvp, &rvs);

The mount point is guaranteed not to disappear nor be reused because vp is locked.

That's true indeed, because of the usual VFS_UNMOUNT() -> vflush() chain of calls.

assuming it worked out:

error = vget_finish_noabort(rvp, lockflags | LK_NOWAIT, rvs);

If this worked out and rvp is not mounted on, this is it. Note there is no need to use vp_crossmp anywhere in here nor the flag.
If these nowaits fail, you vn_lock_pair and validate nothing changed. If something is going on with the mount point, you unlock stuff, deref the found vnode and wait for the mount to settle with vfs_busy. But at that time you don't hold any vnode locks, so there is no deadlock potential.

It seems to me this could work but keeping it compatible with removing VV_CROSSLOCK and associated complexity is going to be complex on its own. Part of the new approach I'm proposing is to ensure that the covered vnode and the root vnode are never locked together because they can be backed by the same lock, leading to recursive locking and other complexities (and LORs) in unionfs in particular. As long as you have to keep the covered vnode locked to traverse the mount, you have to watch out how/if you really have to lock the root vnode through vget_finish_noabort(). And you cannot blindly use vn_lock_pair(), since the backing locks could be the same. All in all, I'm not sure you can dispense with part of VV_CROSSLOCK's complexity if going this route.

I may end up hacking this up, but it will require addition of new routines to filesystems. I do think it is worth it.

Reduced complexity aside, the upshot is that the strange bug of vp_crossmp escaping namei will be fixed by whacking its usage.

As said above, I don't think such additional changes are needed to get rid of vp_crossmp, the already proposed ones are probably enough. But maybe I'm missing something. Going to try that very soon.

In vfs_busy_mountedhere(), try to busy the mount point via `vfs_busy(mp,
MBF_NOWAIT)` without unlocking the vnode. This impacts callers since the
contract has changed: *unlocked can be set to false even when 0 is returned.
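
Under the revised contract, success no longer implies that the vnode was
unlocked; a caller now looks roughly like this (a sketch; the exact signature
of vfs_busy_mountedhere() is assumed for illustration, not taken from the
patch):

/* Caller sketch under the revised contract; the signature is assumed. */
bool unlocked;
struct mount *mp;

error = vfs_busy_mountedhere(vp, &mp, &unlocked);
if (error == 0 && mp != NULL && !unlocked) {
        /* Fast path taken: mp busied via MBF_NOWAIT, vp still locked. */
        VOP_UNLOCK(vp);
}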

Also, revised several prediction predicates (the goal is to favor the fast path,
which is that vnodes are not mounted over, and, if they are, that the mount point
can be busied right away), although I'm still not sure how much of a difference
they make (if any).

olce marked 4 inline comments as done. Jul 6 2023, 8:40 PM

The tests I launched today are worthless since they were done on debug kernels by mistake. I've relaunched them, correctly this time I hope, so I'm expecting results for tomorrow (the initial revisions versus the updated ones).

I'm going to respond tomorrow; meanwhile, two notes:

  1. some cleanups landed and you will need to rebase.
In D40850#930879, @olce.freebsd_certner.fr wrote:

The tests I launched today are worthless since they were done on debug kernels by mistake. I've relaunched them correctly hopefully, so I'm expecting results for tomorrow (the initial revisions versus the updated ones).

if you are trying to bench it, don't. locked lookup, with spaghetti code and a massive branchfest, is not in a shape that justifies it

In D40850#930987, @mjg wrote:
  1. some cleanups landed and you will need to rebase.

Although not particularly difficult, I would have preferred to spend the time on something more productive. I don't think there was such an urgency to commit this change.

To rebase quickly, I've simply ignored the change for now, i.e., replaced vfs_lookup_cross_mount() with vn_cross_mounts(). The latter and the other helper functions have been designed to cater for all the necessary changes in the tree, not just the one for vfs_lookup(), so I see no immediate value in trying to mold the existing work into the new utility function; I think the opposite should be done.

What I'm considering integrating from your change in a second step:

  1. Maybe replacing the vn_ prefix with vfs_lookup_, modeled after your choice of name. I think however that there should be a clear distinction between crossing a single mount and multiple mounts, so I insist on retaining the 's' in vn_cross_mounts(). Consequently, the vn_ prefix may be more appropriate (it makes it clearer which mounts exactly).
  2. Use vn_lock() without LK_RETRY, but only after someone confirms that this cannot fail with EWOULDBLOCK/EINTR/ERESTART, or that the whole lookup can cope with that (which I don't think it can).
  3. Having an inline "frontend" that performs the fast checks and calls the crossing part only if there is something to cross. There is no need to separate both parts textually, with the checks in each caller and the rest in the callee; on the contrary, that duplicates the "frontend" code in all callers (textually, again; the produced code will stay identical). However, without benchmarks, I'm still not sure that this can have a noticeable effect (in the VFS cache perhaps, but here?). Based on your experience, are you sure it can? (A sketch follows this list.)
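
For concreteness, such a frontend could look like this (a hypothetical sketch;
vn_cross_mounts_fast() and the exact parameters are illustrative, not names
from the patch):

/*
 * Hypothetical inline frontend: the common "not mounted over" test
 * stays inline in every caller, the crossing itself is out of line.
 */
static __inline int
vn_cross_mounts_fast(struct vnode **vpp, int lkflags)
{
        if (__predict_true((vn_irflag_read(*vpp) & VIRF_MOUNTPOINT) == 0))
                return (0);
        return (vn_cross_mounts(vpp, lkflags)); /* out-of-line slow part */
}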

if you are trying to bench it, don't. locked lookup, with spaghetti code and a massive branchfest, is not in a shape that justifies it

Then what is the point in trying to remove an "additional" vfs_busy()?

I fear we (and especially I) are going to waste time with parallel changes. The series here is complete, with all impacts done, and supposedly ready to go in after having been tested with stress2 for days (except the new vp_crossmp part, which I've added to the series and I'm stress testing right now).

Why can't we just first finish reviewing @jah's and my work and possibly commit it? We can then examine from here whether an alternative mechanism would make a difference and test it to our heart's content.

Same stress2 tests as before (unionfs, nullfs, *mount*) all pass.

In D40850#931082, @olce.freebsd_certner.fr wrote:
In D40850#930987, @mjg wrote:
  1. some cleanups landed and you will need to rebase.

Although not particularly difficult, I would have preferred to spend the time on something more productive. I don't think there was such an urgency to commit this change.

To rebase quickly, I've simply ignored the change for now, i.e., replaced vfs_lookup_cross_mount() with vn_cross_mounts(). The latter and the other helper functions have been designed to cater for all the necessary changes in the tree, not just the one for vfs_lookup(), so I see no immediate value in trying to mold the existing work into the new utility function; I think the opposite should be done.

What I'm considering integrating from your change in a second step:

  1. Maybe replacing the vn_ prefix with vfs_lookup_, modeled after your choice of name. I think however that there should be a clear distinction between crossing a single mount and multiple mounts, so I insist on retaining the 's' in vn_cross_mounts(). Consequently, the vn_ prefix may be more appropriate (it makes it clearer which mounts exactly).

cosmetics, does not matter

  1. Use vn_lock() without LK_RETRY, but only after someone confirms that this cannot fail with EWOULDBLOCK/EINTR/ERESTART, or that the whole lookup can cope with that (which I don't think it can).
  2. Having an inline "frontend" that performs the fast checks and calls the crossing part only if there is something to cross. There is no need to separate both parts textually, with the checks in each caller and the rest in the callee; on the contrary, that duplicates the "frontend" code in all callers (textually, again; the produced code will stay identical). However, without benchmarks, I'm still not sure that this can have a noticeable effect (in the VFS cache perhaps, but here?). Based on your experience, are you sure it can?

your proposal adds a function call for every path component, and most path components are not mount points. this is wasteful, but it will be hard to measure given the general waste in the area.

I don't have numbers handy, but on my way to making lookups lockless I also made a number of improvements to the locked variant (e.g., per-cpu vfs_ref and similar, which avoid atomics), in total making it run at about twice the speed of the original. From my profiling of lockless lookup there are general changes to that area which can give at least +10%, some of which affect the locked lookup as well. I did not bother profiling the locked variant, but I would have a hard time believing there is not at least +20% to gain once all the branchfest gets eliminated (I mean the entry point in namei, the loop for the symlink business and so on). Also see below.

if you are trying to bench it, don't. locked lookup, with spaghetti code and a massive branchfest, is not in a shape that justifies it

Then what is the point in trying to remove an "additional" vfs_busy()?

the point of not doing busy is making the code simpler and avoiding deadlock potential

I fear we (and especially I) are going to waste time with parallel changes. The series here is complete, with all impacts done, and supposedly ready to go in after having been tested with stress2 for days (except the new vp_crossmp part, which I've added to the series and I'm stress testing right now).

Why can't we just first finish reviewing @jah and myself's work and possibly commit it? We can then examine from here whether an alternative mechanism would make a difference and test it to our heart's content.

What I don't like in this patchset in principle is that it tries to damage control the complexity, when for the most part it can be eliminated to begin with as outlined above.

So I had a closer look and I'm confident not doing busy in the fast path will work just fine. Busy will be needed in the fallback, but even then it won't lock any vnodes.

The crucial thing was looking at what's going on with the cases which return vp_crossmp. It is a bug that this happens to begin with, of course. The only consumers which are affected pass WANTPARENT, which does not pose any extra difficulties, or LOCKPARENT, which does pose the locking problem. However, the good news is that the latter is a corner case which almost never happens (things like rm on a mount point), thus it can be slower than expected.

With this in mind, one component will be VFS_ROOT_REF or similar, which safely returns a refed vnode but does not lock it. For that to work it can utilize the vfs_op_thread_enter mechanism to temporarily stabilize the state. This operation can fail, so it does not block anything. If it fails you go to the slowpath, which instead performs vfs_busy and can then reliably ref the root vnode, if this is still a valid mount point. Then again, it only locks it *after* unbusy.
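
For the generic case, a routine modeled on vfs_cache_root() could look like
this (a sketch under that assumption; VFS_ROOT_REF and the helper below do not
exist in the tree):

/* Sketch of a generic root-ref modeled on vfs_cache_root(). */
static int
vfs_root_ref_generic(struct mount *mp, struct vnode **rvpp,
    enum vget_state *vsp)
{
        struct mount_pcpu *mpcpu;
        struct vnode *rvp;

        if (!vfs_op_thread_enter(mp, mpcpu))
                return (EBUSY);         /* caller falls back to vfs_busy() */
        rvp = atomic_load_ptr(&mp->mnt_rootvnode);
        if (rvp == NULL || VN_IS_DOOMED(rvp)) {
                vfs_op_thread_exit(mp, mpcpu);
                return (EBUSY);
        }
        *vsp = vget_prep(rvp);          /* ref, but do not lock */
        vfs_op_thread_exit(mp, mpcpu);
        *rvpp = rvp;
        return (0);
}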

If there is any crossing to be done, the most common case is a single mount point which is not in the process of being unmounted, etc. For this case the fast path is close to:

error = VFS_ROOT_REF(mp, &rvp, &vs);
VOP_UNLOCK(vp);
if (__predict_false(error != 0))
        return slowpath(...);
error = vget_finish(rvp, cnflags, vs);
if (__predict_false(error != 0))
        return slowpath(...);
/* and here the vnode is safely locked */

unionfs is trivially patchable to provide vfs_root_ref, nullfs is a little bit of effort, and the rest can be served with a general routine (see vfs_cache_root for an idea)

The slowpath would start with no locks held, and while not pretty, it could rely on the type-stability of struct mount. Alternatively, vfs_ref can be sprinkled there.

It would look like this:

vfs_busy(mp, ....);
if (vp->v_mountedhere != mp)
        /* unbusy and try again with the new mount point, if any */
VFS_ROOT(mp, 0, &rvp);  /* guarantees a refed vnode, but does NOT lock it; error is guaranteed 0 */
vfs_unbusy(mp);
/* ... lock here and, if things don't work out, redo the work */

That would be an outline of my proposal. I note it accomplishes the same goal afaics while having less work to do when crossing mount points in their most common case. I also think it is simpler to reason about.

I could not grasp why your patch does not run into LOCKPARENT trouble and now I see it is in fact buggy. Note that *stock* mount point traversal is careful to lock the root vnode only after unlocking dp and dvp (modulo the crosslock case).

To my understanding, the historical reason is this: say the new mount point has its root vnode locked and the lock owner is doing an unbounded sleep off-CPU. For example, maybe the device/NFS share/whatever backing the mount point is dead. If you try to cross that, you block, and if you still have dvp locked, that's one more vnode up the chain which cannot be locked. And this can propagate all the way up to the / root vnode, effectively killing most lookups because of one dead mount point.

For this reason dvp *must not* be held while you play around here, but LOCKPARENT may require it to get locked later, which is where vn_lock_pair comes in.
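
A sketch of how the LOCKPARENT case could then look (assuming the
four-argument vn_lock_pair() as it existed at the time; dvp and rvp are
illustrative names for the unlocked parent and the root vnode obtained from
the crossing):

if ((cnp->cn_flags & LOCKPARENT) != 0) {
        VOP_UNLOCK(rvp);
        /* Locks both vnodes while avoiding any deadlock between them. */
        vn_lock_pair(dvp, false, rvp, false);
        if (VN_IS_DOOMED(dvp) || VN_IS_DOOMED(rvp)) {
                /* State changed while everything was unlocked: redo. */
        }
}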

Simplify vn_cross_mounts() by not restarting when encountering a doomed vnode.

In D40850#931774, @mjg wrote:

cosmetics, does not matter

Naming matters, even if it's not what matters most.

  1. Having an inline "frontend" that performs the fast checks and calls the crossing part only if there is something to cross. There is no need to separate both parts textually, with the checks in each caller and the rest in the callee; on the contrary, that duplicates the "frontend" code in all callers (textually, again; the produced code will stay identical). However, without benchmarks, I'm still not sure that this can have a noticeable effect (in the VFS cache perhaps, but here?). Based on your experience, are you sure it can?

your proposal adds a function call for every path component, and most path components are not mount points. this is wasteful, but it will be hard to measure given the general waste in the area.

I think you didn't get what I said, so I've left point 3 quoted in the hope that you will read it again. In short, my proposal (in fact, ours, since jah@ contributed code for the first part) does add a function call in every path in its current form, which is, logically speaking, wasteful. But I don't think it really matters in practice and, in any case, whether I'm right or wrong on this doesn't matter that much, since it doesn't take very long to apply the scheme I've just described. So this simply dismisses this argument. See the added reviews for how testing whether a vnode is mounted on is now done inline. This could be pushed further to include the vfs_busy(mp, MBF_NOWAIT) call in the inline part.

I don't have numbers handy, but on my way to making lookups lockless I also made a number of improvements to the locked variant (e.g., per-cpu vfs_ref and similar, which avoid atomics), in total making it run at about twice the speed of the original. From my profiling of lockless lookup there are general changes to that area which can give at least +10%, some of which affect the locked lookup as well. I did not bother profiling the locked variant, but I would have a hard time believing there is not at least +20% to gain once all the branchfest gets eliminated (I mean the entry point in namei, the loop for the symlink business and so on). Also see below.

Thanks for providing this info. I don't dispute that; on the contrary, I agree. But you're not responding to my question: I was asking very specifically about the cost of the systematic function call. This was out of curiosity since, as explained in the previous paragraph, the trick to forego it is easy to implement.

the point of not doing busy is making the code simpler and avoiding deadlock potential

Not doing busy doesn't make much difference in code simplicity, contrary to what you claim, and this is only for the common (non-contended) case. I'm pretty sure that, in your proposal, the code for the rest has to be much worse.

I find that the only real complexity in this proposal is in vn_cross_mounts(), and only because I insisted on having detailed semantics as close as possible to the original code's (in terms of what the observable results are when lookups are done concurrently). This part is, strictly speaking, not necessary. I've just removed it so that you can clearly see what this entails.

What is the deadlock potential you're talking about? Is it simply the one already solved by this proposal?

What I don't like in this patchset in principle is that it tries to damage control the complexity, when for the most part it can be eliminated to begin with as outlined above.

This proposal doesn't damage control the complexity, it actively reduces it, and on two fronts:

  1. By splitting code in independently usable pieces, so that they can be reused in most situations (and all that currently exist) where crossing mounts is needed, instead of people rolling their own solutions.
  2. By removing lock dependencies that are not necessary.

The only thing I'm not so satisfied with is the unlocked boolean; I'm not sure I can entirely dispense with it (nor that doing so is even worth it).

I find the "unnecessary complexity" statement wrong as well, see below my comment on your implementation sketch for VFS_ROOT_REF().

So I had a closer look and I'm confident not doing busy in the fast path will work just fine.

I've had no doubts that removing vfs_busy() in the fast path can work, and that you can make it work. The question is the price to pay, and how it compares with this proposal, which I'm going to come to.

Busy will be needed in the fallback, but even then it won't lock any vnodes.

Yes, vfs_busy() can sleep without holding any vnode locks. This is exactly what already happens in this proposal (which removes holding the covered vnode lock when calling vfs_busy()), except for the parent directory vnode (the parent of the covered vnode), which is not really necessary either (I'm going to go into details on that in response to your separate comment about vp_crossmp, since it is relevant to that part and was done in advance for that purpose). In the new proposal, I've just removed even that.

The crucial thing was looking at what's going on with the cases which return vp_crossmp. It is a bug that this happens to begin with, of course. The only consumers which are affected pass WANTPARENT, which does not pose any extra difficulties, or LOCKPARENT, which does pose the locking problem. However, the good news is that the latter is a corner case which almost never happens (things like rm on a mount point), thus it can be slower than expected.

Indeed, it doesn't matter if this case is slower. What's more, an rm on a mount point has to fail.

With this in mind, one component will be VFS_ROOT_REF or similar, which safely returns a refed vnode but does not lock it. For that to work it can utilize the vfs_op_thread_enter mechanism to temporarily stabilize the state. This operation can fail, so it does not block anything. If it fails you go to the slowpath, which instead performs vfs_busy and can then reliably ref the root vnode, if this is still a valid mount point. Then again, it only locks it *after* unbusy.

And this is where the "problem" is in your proposal. Not a technical problem, because indeed it can work, but rather the reason why you can't hope that it will be faster than what is already proposed here. Calling vfs_op_thread_enter() is exactly what vfs_busy() does in the non-contended case. So you're trying to save a vfs_busy() that you're basically going to have to reintroduce in another form into VFS_ROOT_REF(), see? Add to that the need to code VFS_ROOT_REF() upfront, which I agree is an improvement (and incidentally can fix another LOR in unionfs) but which, with the current proposal, can simply be added independently in a later step (this time without vfs_op_thread_enter()). And, most importantly, you're going to pessimize the non-contended case, whereas this proposal does not.

If there is any crossing to be done, the most common case is a single mount point which is not in the process of being unmounted, etc. For this case the fast path is close to:

error = VFS_ROOT_REF(mp, &rvp, &vs);
VOP_UNLOCK(vp);
if (__predict_false(error != 0))
        return slowpath(...);
error = vget_finish(rvp, cnflags, vs);
if (__predict_false(error != 0))
        return slowpath(...);
/* and here the vnode is safely locked */

Yes. I had understood that already. This is a useful description nonetheless.

unionfs is trivially patchable to provide vfs_root_ref, nullfs is a little bit of effort, and the rest can be served with a general routine (see vfs_cache_root for an idea)

Yes (I know the lookup code extremely well, and the vfs_cache_root() mechanism very well also; not even talking about unionfs).

The slowpath would start with no locks held, and while not pretty, it could rely on the type-stability of struct mount. Alternatively, vfs_ref can be sprinkled there.

This is approximately what vn_cross_single_mount() already does, in the vfs_ref() variant (it could be converted to the other variant; I would just prefer that it is not).

It would look like this:
(snip)
That would be an outline of my proposal.

Thanks. I had understood.

I note it accomplishes the same goal afaics while having less work to do when crossing mount points in their most common case. I also think it is simpler to reason about.

As pointed out above, at least as currently stated, your proposal simply cannot do less work than this one, and worse, it will pessimize the contended case.

With (what I find to be) the principal source of complexity removed (as pointed out above), would you still say this proposal is harder to reason about?

More generally, I completely agree that namei(), vfs_lookup() and co. are spaghetti code in need of a huge cleanup, and I hope to be able to tackle it at some point, either on my own or by actively supporting others' changes. However, I doubt it's reasonable to launch such a project now, with 14-CURRENT freezing in less than a month.

In D40850#931777, @mjg wrote:

I could not grasp why your patch does not run into LOCKPARENT trouble and now I see it is in fact buggy. Note that *stock* mount point traversal is careful to lock the root vnode only after unlocking dp and dvp (modulo the crosslock case).

Yes, this change was deliberate on my part.

To my understanding, the historical reason is this: say the new mount point has its root vnode locked and the lock owner is doing an unbounded sleep off-CPU. For example, maybe the device/NFS share/whatever backing the mount point is dead. If you try to cross that, you block, and if you still have dvp locked, that's one more vnode up the chain which cannot be locked. And this can propagate all the way up to the / root vnode, effectively killing most lookups because of one dead mount point.

This I didn't know; thanks a lot for providing this information. I had not climbed to the source, i.e., Tor Egge's comment in an old commit message, thinking that the comment before vfs_busy() was all there was to it. So I took the chance of keeping the lock on ni_dvp just to see what would happen (since I had fixed the problematic LOR). And nothing bad happened, since I didn't test NFS.

For this reason dvp *must not* be held while you play around here, but LOCKPARENT may require it to get locked later, which is where vn_lock_pair comes in.

Fortunately, unless I'm missing something again, it is easy to circumvent this using vn_lock_pair(). I've proposed a new revision doing this. It can be improved a bit but is already enough as a PoC and for testing.

@mjg Did you read my comments? Are you working on an alternative? Do you expect me to code your proposal?

I would like to have this in 14 and MFC it to 13.

@olce @mjg This change seems to have stalled, what do you want to do about it?

FWIW, I do think it's worth trying @mjg's proposal. I see two main advantages:

  1. Less complexity overall, as pointed out earlier
  2. Elimination of vfs_busy() would seem to eliminate the need to make any changes to the unmount() path, so https://reviews.freebsd.org/D40847 could then be avoided altogether, no?

There are undoubtedly nuances I'm missing here, as I haven't looked at this code in a while and all the discussion above is a bit confusing. In the end I'll support whatever you guys decide to go with, as long as it eliminates the need for VV_CROSSLOCK.

In D40850#1000012, @jah wrote:

@olce @mjg This change seems to have stalled, what do you want to do about it?

See below for the main reason why I temporarily stopped pushing for it. The other reason is lack of time, but this may also change relatively soon.

FWIW, I do think it's worth trying @mjg's proposal. I see two main advantages:

  1. Less complexity overall, as pointed out earlier

I don't think it is going to be less complex, as I explained in several comments above, especially for the correctness of the non-contended case. vn_cross_single_mount() and vn_cross_mounts() are not the prettiest but aren't horrible either and, more importantly, they are isolated and re-usable. I'd like to see a complete alternative, with all corner cases handled, to reconsider that position.

  1. Elimination of vfs_busy() would seem to eliminate the need to make any changes to the unmount() path, so https://reviews.freebsd.org/D40847 could then be avoided altogether, no?

I don't think so; the non-contended case needs this change.

There are undoubtedly nuances I'm missing here, as I haven't looked at this code in a while and all the discussion above is a bit confusing. In the end I'll support whatever you guys decide to go with, as long as it eliminates the need for VV_CROSSLOCK.

The main reason I stopped pushing for a quick solution here, be it mjg@'s or mine, is that both violate a lookup invariant. If a lookup is in the process of crossing a mount point and that mount point is unmounted concurrently, it is possible for the lookup to fail, whereas it should always succeed if the looked-up path exists both in the mounted filesystem and underneath it. I agree it's a corner case, but at the same time our proposals as they stand would introduce a regression on this front. The way out here is to revamp the whole lookup procedure to be able to completely restart it in such (rare) cases.

Finally, I think mjg@'s proposal is not where we (or, at least, I) want to go long term. I don't think there is any real reason to lock the vnode to be able to traverse mount points. We should really decouple modifying (or generally, operating on) the vnode and crossing/changing mountpoints on it, and only my proposal is a step towards that. Imagine that some process has a hold on such a directory vnode, either because it looked it up before it was mounted over, or through a nullfs mount. Why should it be penalized if it creates files in it while other processes are crossing the mounts? There is absolutely no connection between both operations. Moreover, in a future where we push the VFS cache to the filesystems, mount point crossing won't be handled by the cache, so optimizing mount-point crossing is an important step in that direction.

In the short term (the next few weeks), it's unlikely I'll have time to work on that again, but I hope that can happen in one or two months. I'll keep you posted.

Thanks.

First and foremost my apologies this fell through the cracks.

I'm mostly not involved in any of this anymore, so I'm definitely not going to stand in the way. I do work for a company which has a networking FreeBSD-based product (tracking main at that!), so I need VFS to keep working, even if it's not necessarily the way I envisioned it. I presume this will still be met ;)

That is to say, feel free to make whatever changes you see fit, provided whatever review (if deemed warranted) is fine with them.

As my final remark here, I'll only paste benchmark results from before my involvement with the VFS layer and after. Not only is there a lot of complexity which does not have to be there, but the code is avoidably slow; there are several percent of single-threaded performance to gain by removing problems. A while back I added a bunch of stuff to kern/vfs_cache.c concerning this; see "Appendix A" in the top comment in that file.

That said, a stat benchmark (stat("/tmp/willitscale.XXXXXX", ...)) on tmpfs, single-threaded, ops/s:

10.3          674517
14.0+nofp    2603155   +285%        with vfs.cache.param.fast_lookup=0
14.0         4284354   +535%/+64%   with vfs.cache.param.fast_lookup=1

That is to say, even without lockless trickery, things got drastically faster than they used to be *and* there is still room to grow while removing complexity.

I said my piece; I'm buggering off the area.