Details

Reviewers

kib
markj
pho

Commits

rGfcb164742b6f: unionfs: rework unionfs_getwritemount()

Summary

VOP_GETWRITEMOUNT() is called on the vn_start_write() path without any
vnode locks guaranteed to be held. It's therefore unsafe to blindly access per-mount
and per-vnode data. Instead, follow the approach taken by nullfs and
use the vnode interlock coupled with the hold count to ensure the
mount and the vnode won't be recycled while they are being accessed.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 44494
Build 41382: arc lint + arc unit

Event Timeline

jah created this revision.Feb 15 2022, 3:44 AM

Herald added a subscriber: imp. · View Herald TranscriptFeb 15 2022, 3:44 AM

jah requested review of this revision.Feb 15 2022, 3:44 AM

Harbormaster completed remote builds in B44454: Diff 102775.Feb 15 2022, 3:45 AM

jah added inline comments.Feb 15 2022, 3:47 AM

sys/fs/unionfs/union_vnops.c
1782	I removed the check for VREG from the updated code. Wouldn't we also want to check the upper vnode when beginning a directory write operation that will ultimately create a unionfs shadow dir?

jah edited the summary of this revision. (Show Details)Feb 15 2022, 3:55 AM

jah added inline comments.Feb 15 2022, 4:01 AM

sys/fs/unionfs/union_vnops.c
1777–1778	Since this is a frequently executed path, would it make sense to instead use vfs_smr_enter()/vfs_smr_exit() here, and an SMR pool for unionfs nodes? Presumably that would be coupled with atomic loads of vnode and mount fields.

mjg added a subscriber: mjg.Feb 15 2022, 11:13 AM

mjg added inline comments.

sys/fs/unionfs/union_vnops.c
1777–1778	imo this is yet another argument for reworking how forced unmount works and most notably leaving both the mount point and v_data intact as long as the vnode is alive. in the meantime I would suggest benchmarking e.g., buildkernel over unionfs. chances are decent there is so much slowness in there already that the interlock here does not make any difference.

jah added inline comments.Feb 15 2022, 7:43 PM

sys/fs/unionfs/union_vnops.c
1777–1778	imo this is yet another argument for reworking how forced unmount works and most notably leaving both the mount point and v_data intact as long as the vnode is alive. That's also a thought I've had quite a few times since I've started working on unionfs. It seems like each filesystem has to jump through quite a few hoops to accommodate forced unmount, which is both not the normal use case and somewhat hard to test. What I'll probably do for the time being is commit this change as-is (modulo review feedback). I'll need to get a bigger machine before I can do meaningful scalability testing.

mjg added inline comments.Feb 15 2022, 8:21 PM

sys/fs/unionfs/union_vnops.c
1777–1778	So I ran a quick test of -j 104 buildkernel (of stable/12 on main), along with mount -t unionfs -o noatime -o above /mnt/tmp /mnt/tmp2 tmpfs: 2566.10s user 322.43s system 8669% cpu 33.319 total unionfs: 1785.18s user 14403.61s system 8877% cpu 3:02.37 total Looking at the profile reveals the issue is exclusive vnode locking for path lookup.

jah added inline comments.Feb 15 2022, 8:39 PM

sys/fs/unionfs/union_vnops.c
1777–1778	Thanks for testing that out, and also thanks for reminding me: I've been meaning to see if I can get away with setting the shared-lookup flag on the unionfs mount, but there are at least 3 bugs (including this getwritemount) issue I need to fix before I can try that.

jah added inline comments.Feb 15 2022, 8:41 PM

sys/fs/unionfs/union_vnops.c
1777–1778	That of course would also be contingent on both underlying FSes supporting shared lookups, but I think most of them do these days.

markj added inline comments.Feb 15 2022, 8:59 PM

sys/fs/unionfs/union_vnops.c
1804	Are we effectively assuming that `uvp` isn't NULL here? I would guess that a preceding VOP_LOOKUP should provide that guarantee. If `uvp` can legitimately be NULL, meaning that a shadow directory for the parent doesn't exist, then I can't see why it's ok for this VOP to return 0.

jah added inline comments.Feb 16 2022, 2:35 AM

sys/fs/unionfs/union_vnops.c
1804	Yes, I believe the prior lookup should prevent this. But it's a good question: should we instead return EACCES below this for the case in which uvp is NULL? This was based on the similar logic in nullfs, so if the answer is "yes" it seems like we should change nullfs too. The nullfs implementation also doesn't pass up the error returned from VOP_GETWRITEMOUNT() on the lower vnode, which seemed strange to me so I did decide to propagate the error here.

kib added inline comments.Feb 16 2022, 12:34 PM

sys/fs/unionfs/union_vnops.c
1804	What is the similar logic in nullfs? For nullfs it is simpler, if we have nullfs vnode reclaimed, we cannot provide an answer for VOP_GETWRITEMOUNT(), otherwise we can. For unionfs it is more complicated, but perhaps the fall-back should be um_uppervp->v_mount or such.

markj added inline comments.Feb 16 2022, 4:51 PM

sys/fs/unionfs/union_vnops.c
1804	I note that no implementations of vop_getwritemount return anything other than 0, including dead_getwritemount(), but passing up the error from VOP_GETWRITEMOUNT does seem like the right thing to do both here and in nullfs. With respect to `unp->un_uppervp` being non-NULL, where `unp` corresponds to the parent vnode, if we think that a prior lookup guarantees the presence of an upper vnode, can't we add an assertion to that effect?

jah added inline comments.Feb 16 2022, 5:06 PM

sys/fs/unionfs/union_vnops.c
1804	I'd somewhat prefer to return EACCES if the parent's uppervp is NULL, as that seems less coupled to current unionfs and VFS lookup behavior. w.r.t. falling back to um_uppervp->v_mount, I'd rather not do that for the same reason: if at some point there is a situation where the parent uppervnode is NULL, we can't realistically complete whatever write operation is being attempted, so we should reject it.

kib added inline comments.Feb 16 2022, 6:30 PM

sys/fs/unionfs/union_vnops.c
1804	VOP_GETWRITEMOUNT() is not the write, it is about getting a point to the mount point that would be the target of a write, if write happens. Basically it is a mount point that gets vn_start_write().

jah added inline comments.Feb 16 2022, 7:39 PM

sys/fs/unionfs/union_vnops.c
1804	I understand that, it just seems not quite right to me to return success here in a case when a potential write using the specified vnode can't succeed, or to directly return um_uppervp->v_mount without querying the underlying FS.