VOP_GETWRITEMOUNT() is called on the vn_start_write() path without any
vnode locks guaranteed to be held. It's therefore unsafe to blindly access per-mount
and per-vnode data. Instead, follow the approach taken by nullfs and
use the vnode interlock coupled with the hold count to ensure the
mount and the vnode won't be recycled while they are being accessed.
Details
- Reviewers
kib markj pho - Commits
- rGfcb164742b6f: unionfs: rework unionfs_getwritemount()
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 44492 Build 41380: arc lint + arc unit
Event Timeline
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1782 | I removed the check for VREG from the updated code. Wouldn't we also want to check the upper vnode when beginning a directory write operation that will ultimately create a unionfs shadow dir? |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1780–1781 | Since this is a frequently executed path, would it make sense to instead use vfs_smr_enter()/vfs_smr_exit() here, and an SMR pool for unionfs nodes? Presumably that would be coupled with atomic loads of vnode and mount fields. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1780–1781 | imo this is yet another argument for reworking how forced unmount works and most notably leaving both the mount point and v_data intact as long as the vnode is alive. in the meantime I would suggest benchmarking e.g., buildkernel over unionfs. chances are decent there is so much slowness in there already that the interlock here does not make any difference. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1780–1781 |
That's also a thought I've had quite a few times since I've started working on unionfs. It seems like each filesystem has to jump through quite a few hoops to accommodate forced unmount, which is both not the normal use case and somewhat hard to test. What I'll probably do for the time being is commit this change as-is (modulo review feedback). I'll need to get a bigger machine before I can do meaningful scalability testing. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1780–1781 | So I ran a quick test of -j 104 buildkernel (of stable/12 on main), along with mount -t unionfs -o noatime -o above /mnt/tmp /mnt/tmp2 tmpfs: 2566.10s user 322.43s system 8669% cpu 33.319 total Looking at the profile reveals the issue is exclusive vnode locking for path lookup. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1780–1781 | Thanks for testing that out, and also thanks for reminding me: I've been meaning to see if I can get away with setting the shared-lookup flag on the unionfs mount, but there are at least 3 bugs (including this getwritemount) issue I need to fix before I can try that. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1780–1781 | That of course would also be contingent on both underlying FSes supporting shared lookups, but I think most of them do these days. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1811 | Are we effectively assuming that uvp isn't NULL here? I would guess that a preceding VOP_LOOKUP should provide that guarantee. If uvp can legitimately be NULL, meaning that a shadow directory for the parent doesn't exist, then I can't see why it's ok for this VOP to return 0. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1811 | Yes, I believe the prior lookup should prevent this. But it's a good question: should we instead return EACCES below this for the case in which uvp is NULL? This was based on the similar logic in nullfs, so if the answer is "yes" it seems like we should change nullfs too. The nullfs implementation also doesn't pass up the error returned from VOP_GETWRITEMOUNT() on the lower vnode, which seemed strange to me so I did decide to propagate the error here. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1811 | What is the similar logic in nullfs? For nullfs it is simpler, if we have nullfs vnode reclaimed, we cannot provide an answer for VOP_GETWRITEMOUNT(), otherwise we can. For unionfs it is more complicated, but perhaps the fall-back should be um_uppervp->v_mount or such. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1811 | I note that no implementations of vop_getwritemount return anything other than 0, including dead_getwritemount(), but passing up the error from VOP_GETWRITEMOUNT does seem like the right thing to do both here and in nullfs. With respect to unp->un_uppervp being non-NULL, where unp corresponds to the parent vnode, if we think that a prior lookup guarantees the presence of an upper vnode, can't we add an assertion to that effect? |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1811 | I'd somewhat prefer to return EACCES if the parent's uppervp is NULL, as that seems less coupled to current unionfs and VFS lookup behavior. w.r.t. falling back to um_uppervp->v_mount, I'd rather not do that for the same reason: if at some point there is a situation where the parent uppervnode is NULL, we can't realistically complete whatever write operation is being attempted, so we should reject it. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1811 | VOP_GETWRITEMOUNT() is not the write, it is about getting a point to the mount point that would be the target of a write, if write happens. Basically it is a mount point that gets vn_start_write(). |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1811 | I understand that, it just seems not quite right to me to return success here in a case when a potential write using the specified vnode can't succeed, or to directly return um_uppervp->v_mount without querying the underlying FS. |
sys/fs/unionfs/union_vnops.c | ||
---|---|---|
1780 | I do not think it is in the specification of this VOP to check for MNT_RDONLY, in fact. And I do not see much use for this check. |