Fix interface between nfsclient and vnode pager.
Closed, Public

Authored by kib on Oct 3 2019, 8:18 AM.

Details

Summary

This patch has three changes squashed together for the review.

  1. In vm_fault(), perform the check for the object size under the vnode lock, if the object is of vnode type. The object lock is not enough to guarantee that the internal fs data is consistent, so when we relock the vnode and try to page in the faulted page, it is possible to get surprising results. I do not believe this is an issue for local filesystems, due to the way writes are usually structured.
  2. Ensure that nfsclient always calls vnode_pager_setsize() with the vnode exclusively locked. This ensures that a page fault can always find the backing page if the object size check succeeded.

The main offender breaking the interface in nfsclient is nfs_loadattrcache(), which is called whenever the server responds with updated attributes, which can happen on non-modifying operations as well. Also, iod threads only hold buffers locked (and even that is LK_KERNPROC), but they still may call nfs_loadattrcache() on an RPC response.

Instead of immediately calling vnode_pager_setsize() when the server response indicates a changed file size but the vnode is not exclusively locked, set a new node flag, NVNSETSZSKIP. When the vnode is exclusively locked, or when we can temporarily upgrade the lock to exclusive, call vnode_pager_setsize(); this is handled by providing an nfsclient VOP_LOCK() implementation, as sketched below.
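A minimal sketch of the deferral on the attribute-load side, assuming the real NVNSETSZSKIP flag and nfsnode fields but otherwise approximate (this is not the committed diff):

```
/*
 * Sketch: nfs_loadattrcache() path.  If the server reported a new size
 * but this thread does not hold the vnode exclusively, record the fact
 * and defer the pager update instead of calling vnode_pager_setsize().
 */
NFSLOCKNODE(np);                        /* NFS node mutex */
np->n_size = nsize;                     /* new size from the RPC reply */
if (VOP_ISLOCKED(vp) == LK_EXCLUSIVE) {
        NFSUNLOCKNODE(np);
        vnode_pager_setsize(vp, nsize);
} else {
        np->n_flag |= NVNSETSZSKIP;     /* picked up later by VOP_LOCK() */
        NFSUNLOCKNODE(np);
}
```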

  3. Uncomment the assert in vnode_pager_setsize() and require that all calls are made with the vnode exclusively locked.

The previous version of the patched kernel was used on an NFS-booted machine and survived buildworld/buildkernel and a nightly periodic run.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

This patch fixes a reproducible NFS deadlock (deadlres_td_sleep_q panic) I was getting on my diskless Rock64 with recent kernels.

I have a tangential question regarding the vnode size: in vn_lseek()'s L_XTND case, we use VOP_GETATTR to get the file size. Why is it not sufficient to use the size from the pager field, under the shared vnode lock?

This would only work if v_object != NULL, otherwise you still have to fall back to VOP_GETATTR().
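For concreteness, the variant the question imagines might look like this, with the v_object != NULL fallback described above (a sketch fragment only, not the actual vn_lseek() code; variable names are assumptions):

```
/* SEEK_END: prefer the vnode pager's size when a VM object exists. */
vn_lock(vp, LK_SHARED | LK_RETRY);
obj = vp->v_object;
if (obj != NULL) {
        VM_OBJECT_RLOCK(obj);
        size = obj->un_pager.vnp.vnp_size;
        VM_OBJECT_RUNLOCK(obj);
} else {
        /* No VM object; fall back to asking the filesystem. */
        error = VOP_GETATTR(vp, &vattr, td->td_ucred);
        size = vattr.va_size;
}
VOP_UNLOCK(vp, 0);
if (error == 0)
        offset += size;
```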

Other than the one inline comment, it looks good to me. However, I don't know enough about the locking/vm stuff
to say whether nfs_lock() is correct.
Delaying the vnode_pager_setsize() call until a lock operation on the vnode sounds reasonable to me.

I have not found a problem with this in my testing so far.

I'm probably not the correct guy to do the actual review, given my lack of knowledge w.r.t. the lock/vm code,
but if no one else wants to click "reviewed", I will do so.

There are still a couple of places in ncl_write() that call vnode_pager_setsize() while holding the NFS node lock.
(Those also hold the LK_EXCLUSIVE vnode lock.) They should always be growing the size of the file, but if
there is a problem with doing those calls with the NFS node lock held, the lock can be easily changed to an "sx" lock.

Thanks for doing this. I would have never figured it out, rick

sys/fs/nfsclient/nfs_clvnops.c
319 ↗(On Diff #62859)

lktype != LK_UPGRADE is checked twice. Did you mean one of them to be
lktype != LK_DOWNGRADE?

sys/fs/nfsclient/nfs_clvnops.c
319 ↗(On Diff #62859)

It should have been LK_TRYUPGRADE. I am checking for the ops that result in an exclusively locked vnode on success.

Hmm, I might handle LK_DOWNGRADE as well, in fact. But this is all theoretical right now, because I think upgrade/downgrade are not issued for NFS vnodes.

Second LK_UPGRADE should have been LK_TRYUPGRADE.
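So the intended condition in the nfsclient VOP_LOCK() implementation is presumably along these lines (a sketch; VOP_LOCK1_APV/default_vnodeops usage and the surrounding plumbing are assumptions, and ncl_pager_setsize() is the helper discussed later in this review):

```
/*
 * Sketch of nfs_lock().  The three lock ops below leave the vnode
 * exclusively locked on success, which is when the deferred
 * vnode_pager_setsize() may safely run.
 */
vp = ap->a_vp;
lktype = ap->a_flags & LK_TYPE_MASK;
error = VOP_LOCK1_APV(&default_vnodeops, ap);   /* underlying lock op */
if (error != 0 || (lktype != LK_EXCLUSIVE && lktype != LK_UPGRADE &&
    lktype != LK_TRYUPGRADE))
        return (error);
np = VTONFS(vp);
if (np != NULL && (np->n_flag & NVNSETSZSKIP) != 0) {
        NFSLOCKNODE(np);
        ncl_pager_setsize(vp, NULL);    /* unlocks the node itself */
}
return (0);
```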

Added a reply to the inline comment.

sys/fs/nfsclient/nfs_clvnops.c
319 ↗(On Diff #62859)

There is one case of LK_UPGRADE and no LK_DOWNGRADEs or LK_TRYUPGRADEs (except the one your nfs_lock code adds) in the NFS client.

I now realize that the LK_DOWNGRADE case would imply that there is already
an LK_EXCLUSIVE lock held, so the handling of that would be at the beginning,
before doing the lock op, I think?
But, as you say, this is theoretical at this time, since there aren't any LK_DOWNGRADEs.

sys/fs/nfsclient/nfs_clvnops.c
319 ↗(On Diff #62859)

LK_UPGRADE/DOWNGRADE come mostly from namei(9) when handling the last path element and the locking request from the caller.

LK_DOWNGRADE indeed means that the current lock is exclusive, but NVNSETSZSKIP is not synchronized by the vnode lock, so it could have been set precisely because another thread did not lock the vnode. And since we hold the lock exclusively, it might be good to call vnode_pager_setsize().

But in fact, the only critical moment where we must not miss the call is when vget() is called from vm_fault(); see the onfault detection below. It is that de-synchronization which caused the reported SIGSEGV, I believe.

I think the fault handler change is ok. I read through the NFS changes but am not very confident in that area.

I was somewhat surprised to see that vm_page_alloc() and its callees do not verify that the requested pindex is < object->size. Indeed, vm_reserv_alloc_page() even handles the case where pindex >= object->size.

sys/vm/vm_fault.c
842 ↗(On Diff #62958)

I would note here that the vnode case is handled later, after the vnode is locked and just before pagein.

kib marked an inline comment as done. Oct 9 2019, 1:20 AM

I think the fault handler change is ok. I read through the NFS changes but am not very confident in that area.

Are you fine with both vm changes being committed with 'Reviewed by: you'?

I was somewhat surprised to see that vm_page_alloc() and its callees do not verify that the requested pindex is < object->size. Indeed, vm_reserv_alloc_page() even handles the case where pindex >= object->size.

This is because we actually have pages beyond EOF on the object's queue. The main offender is UFS, where negative logical block numbers are used for indirect blocks and extended-data blocks. The negative block indexes are then converted into very large pindexes, which cannot be equal to the pindex of any data page, due to sign extension.
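For illustration, a standalone userland program showing the sign-extension effect (hypothetical block/page numbers; this is not UFS code):

```
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* A negative logical block number, as UFS uses for indirect blocks. */
	int64_t lbn = -1;
	int pages_per_block = 4;	/* e.g. 16K blocks, 4K pages */

	/* Converting to an unsigned page index sign-extends the value. */
	uint64_t pindex = (uint64_t)(lbn * pages_per_block);

	/* Prints 18446744073709551612: no data page can have this index. */
	printf("%ju\n", (uintmax_t)pindex);
	return (0);
}
```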

kib added a subscriber: pho.

Add comment in vm_fault.c.

Peter, could you please test this? A whole test suite run is required; just the NFS part is not enough.

In D21883#479546, @kib wrote:

I think the fault handler change is ok. I read through the NFS changes but am not very confident in that area.

Are you fine with both vm changes being committed with 'Reviewed by: you'?

Yes.

I was somewhat surprised to see that vm_page_alloc() and its callees do not verify that the requested pindex is < object->size. Indeed, vm_reserv_alloc_page() even handles the case where pindex >= object->size.

This is because we actually have pages beyond EOF on the object's queue. The main offender is UFS, where negative logical block numbers are used for indirect blocks and extended-data blocks. The negative block indexes are then converted into very large pindexes, which cannot be equal to the pindex of any data page, due to sign extension.

I see, thanks. I knew about the negative block numbers but did not remember that they are used in VMIO.

sys/vm/vm_fault.c
1039–1043 ↗(On Diff #63070)

Can you please elaborate on how this delayed size check helps? The object's size field is only changed with the object lock held, yes? And I believe that the object lock is held continuously from the earlier size check until this new one. In other words, how does the size field change between the earlier check and this new one?

After thinking some more: there are even more obvious cases where vm_page_alloc() is performed beyond the object size. Most implementations of VOP_WRITE() call vnode_pager_setsize() only after the buffer for the write is created. Sometimes vnode_pager_setsize() is called only after a successful bwrite() of that buffer.

sys/vm/vm_fault.c
1039–1043 ↗(On Diff #63070)

Yes, of course there are no consistency issues for vm_fault() itself. The problem is the inconsistency of the user-visible file state vs. the mapping state. Since, for instance, iods finish I/O without taking the NFS vnode lock, it is possible for userspace to see a larger (extended) file while the vnode_pager_setsize() call from the iod is blocked on the object lock; thus vm_fault() effectively sees a shorter file. The result is the spurious SIGSEGV that was reported.
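To spell the race out, a schematic of the interleaving just described (comments only, not code from the patch):

```
/*
 * iod thread (vnode NOT locked)      faulting thread
 * ------------------------------     --------------------------------
 * WRITE RPC reply extends the file
 * userspace can already observe
 *   the new, larger size
 * vnode_pager_setsize() blocks on
 *   the object lock
 *                                    vm_fault() checks object->size:
 *                                    still the old, shorter size, so a
 *                                    valid mapped page looks out of
 *                                    range -> spurious SIGSEGV
 */
```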

Instead of unconditionally moving the object size check for vnodes, ensure that the vnode is locked at the existing check place. Do the locking only for filesystems that require it, gated by a flag on the object.

Apparently, skipping the check for resident pages exposes pages outside the EOF, which must not be touched.
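A minimal sketch of that shape, assuming a flag name like OBJ_SIZEVNLOCK and a locking helper; the names and the error code here are assumptions, not necessarily what the diff uses:

```
/*
 * Sketch: before the object size check in vm_fault(), lock the vnode
 * for objects whose filesystem asked for it via a per-object flag.
 */
if (fs.object->type == OBJT_VNODE &&
    (fs.object->flags & OBJ_SIZEVNLOCK) != 0) {
        error = vm_fault_lock_vnode(&fs);       /* may sleep and restart */
        if (error != 0)
                return (error);
}
if (fs.pindex >= fs.object->size) {
        /* Out-of-range fault: report the error to the caller. */
        unlock_and_deallocate(&fs);
        return (KERN_OUT_OF_BOUNDS);
}
```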

The new revision makes changes in the vm part that I don't understand, so I am afraid I can't
review them.

sys/fs/nfsclient/nfs_clport.c
570 ↗(On Diff #63169)

It might be nice to have a comment here noting that the function
unlocks the NFS node, but only when nsizep == NULL.
(I find I have to look at the function closely to notice this.)
But it is just a suggestion.

kib marked an inline comment as done.

Add a comment trying to explain the convoluted ncl_pager_setsize() interface.

Thanks for the comment.

I've run this version of the patch through a test cycle and have not seen problems.
(Of course, the previous patch worked for my testing, as well.)

Rebase.
Disable the assert for ZFS; I do not intend to fix it now.

This version passes stress2, according to Peter's report.

Alan, Mark, could you please look at the sys/vm part of the patch? It was changed, and I think it was simplified compared to the original version. Now the change is that the vm_fault() path can take the vnode lock earlier, compared with the previous, more delicate (and apparently buggy) move of the size check.

In D21883#481629, @kib wrote:

Alan, Mark, could you please look at the sys/vm part of the patch? It was changed, and I think it was simplified compared to the original version. Now the change is that the vm_fault() path can take the vnode lock earlier, compared with the previous, more delicate (and apparently buggy) move of the size check.

The VM bits look good to me.

This revision is now accepted and ready to land. Oct 16 2019, 7:33 PM

I intend to commit this patch set tomorrow or on Monday.

I think I ran into this same thing on a write (with the nfs code being r358252). Do we need a similar fix in ncl_write() to apply TDP2_SBPAGES? This thread has an exclusive lock on the vp.

```
#0  sched_switch (td=0xfffffea88567bb00, newtd=0xfffffe850e552b00, flags=<optimized out>) at /b/mnt/src/sys/kern/sched_ule.c:2423
#1  0xffffffff8070761b in mi_switch (flags=<optimized out>, newtd=0x0) at /b/mnt/src/sys/kern/kern_synch.c:605
#2  0xffffffff8075cd2c in sleepq_wait (wchan=<unavailable>, pri=<unavailable>) at /b/mnt/src/sys/kern/subr_sleepqueue.c:691
#3  0xffffffff80a7066a in _vm_page_busy_sleep (obj=<optimized out>, m=0xfffffe81ab46a250, pindex=<optimized out>, wmesg=0xffffffff81534b90 "vmopar", allocflags=<optimized out>, locked=<optimized out>) at /b/mnt/src/sys/vm/vm_page.c:1150
#4  0xffffffff80a710ce in vm_page_sleep_if_busy (m=<optimized out>, wmesg=<unavailable>) at /b/mnt/src/sys/vm/vm_page.c:1453
#5  0xffffffff80a6bcef in vm_object_page_remove (object=0xfffff8795dbbe990, start=<optimized out>, end=<optimized out>, options=0) at /b/mnt/src/sys/vm/vm_object.c:2140
#6  0xffffffff80a87163 in vnode_pager_setsize (vp=<optimized out>, nsize=0) at /b/mnt/src/sys/vm/vnode_pager.c:469
#7  0xffffffff80611583 in ncl_pager_setsize (vp=0xfffff8794c197b40, nsizep=<optimized out>) at /b/mnt/src/sys/fs/nfsclient/nfs_clport.c:609
#8  0xffffffff8061143d in nfscl_loadattrcache (vpp=<optimized out>, nap=<optimized out>, nvaper=0x1, stuff=<optimized out>, writeattr=<optimized out>, dontshrink=1) at /b/mnt/src/sys/fs/nfsclient/nfs_clport.c:567
#9  0xffffffff80608102 in ncl_writerpc (vp=0xfffff8794c197b40, uiop=0xfffffea889635ad8, cred=<optimized out>, iomode=0xfffffea889635b08, must_commit=0xfffffea889635b0c, called_from_strategy=1) at /b/mnt/src/sys/fs/nfsclient/nfs_clvnops.c:1498
#10 0xffffffff80615043 in ncl_doio (vp=0xfffff8794c197b40, bp=0xfffffe86f65db148, cr=<optimized out>, td=<optimized out>, called_from_strategy=-135768792) at /b/mnt/src/sys/fs/nfsclient/nfs_clbio.c:1755
#11 0xffffffff80606ec9 in nfs_strategy (ap=<optimized out>) at /b/mnt/src/sys/fs/nfsclient/nfs_clvnops.c:2735
#12 0xffffffff813665c4 in VOP_STRATEGY_APV (vop=<optimized out>, a=0xfffffea889635bc8) at vnode_if.c:2985
#13 0xffffffff807b56ac in VOP_STRATEGY (vp=<optimized out>, bp=<optimized out>) at ./vnode_if.h:1301
#14 bufstrategy (bo=<optimized out>, bp=<optimized out>) at /b/mnt/src/sys/kern/vfs_bio.c:6155
#15 0xffffffff80609e49 in bstrategy (bp=0xfffffe86f65db148) at /b/mnt/src/sys/sys/buf.h:598
#16 ncl_writebp (bp=0xfffffe86f65db148, force=<optimized out>, td=<optimized out>) at /b/mnt/src/sys/fs/nfsclient/nfs_clvnops.c:3327
#17 0xffffffff8061679b in ncl_write (ap=<optimized out>) at /b/mnt/src/sys/fs/nfsclient/nfs_clbio.c:1264
```

I think I ran into this same thing on a write (with the nfs code being r358252). Do we need a similar fix in ncl_write() to apply TDP2_SBPAGES? This thread has an exclusive lock on the vp.

D28306