zfs: Fix a deadlock between page busy and the teardown lock
Closed, Public

Authored by markj on Nov 10 2021, 5:45 PM.

Details

Reviewers

alc
kib
avg
sef

Commits

rGb3427b18b1c6: zfs: Fix a deadlock between page busy and the teardown lock
rGcdf74673bc79: zfs: Fix a deadlock between page busy and the teardown lock
rG705a6ee2b611: zfs: Fix a deadlock between page busy and the teardown lock
rGd28af1abf031: vm: Add a mode to vm_object_page_remove() which skips invalid pages

Summary

ZFS has a per-mountpoint read-mostly "teardown lock" which is acquired
in read mode by most VOPs. It is used to suspend filesystem operations
in preparation for some dataset-level operation like a rollback
(reverting a filesystem to an earlier snapshot). In particular, the ZFS
VOP_GETPAGES implementation acquires this lock.

When rolling back a dataset, ZFS invalidates all file data resident in
the page cache, as a rollback can cause a file's contents to change and
we don't want to let stale data linger. To do this, it calls
vn_pages_remove() on each vnode associated with the mountpoint. This
introduces a lock order reversal: to handle a page fault we busy vnode
pages before calling VOP_GETPAGES, and during rollback we busy vnode
pages in vm_object_page_remove() with the teardown lock held.
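
The reversal described above can be modelled in user space. What follows is a minimal sketch, not kernel code: the enum values and toy_record_order() are hypothetical names invented for illustration, and the bookkeeping only loosely resembles what a lock-order checker does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two locks involved. */
enum toy_lock { TOY_PAGE_BUSY, TOY_TEARDOWN, TOY_NLOCKS };

/* seen[a][b] is true once some path has acquired b while holding a. */
static bool seen[TOY_NLOCKS][TOY_NLOCKS];

/*
 * Record that "wanted" was acquired while "held" was held.
 * Returns true if the opposite order was recorded earlier,
 * i.e. the two paths together form a lock order reversal.
 */
static bool
toy_record_order(enum toy_lock held, enum toy_lock wanted)
{
	assert(held < TOY_NLOCKS && wanted < TOY_NLOCKS);
	seen[held][wanted] = true;
	return (seen[wanted][held]);
}
```

Recording the page fault path first (TOY_PAGE_BUSY held, TOY_TEARDOWN wanted) reports no reversal. Recording the rollback path afterwards (TOY_TEARDOWN held, TOY_PAGE_BUSY wanted) does, which corresponds to the deadlock in this report.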

Resolve the deadlock by exploiting the fact that rollback only needs to
purge valid pages: invalid pages hold no stale data by definition, so
they need not be purged, and a thread holding the busy lock on invalid
ZFS vnode pages can then safely block on the teardown lock. This
assumes that we will not pass valid pages to
VOP_GETPAGES via vm_pager_get_pages(). Since ZFS is solely responsible
for marking vnode pages valid, I believe it is a safe assumption.

Add a new mode to vm_object_page_remove() to skip over invalid pages.
Use it when rolling back a dataset.
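
As an illustration of the idea, here is a user-space sketch of a page-removal loop with a "skip invalid pages" mode. It is a toy model under stated assumptions, not the committed change: struct toy_page, toy_page_remove() and the TOY_VALIDONLY flag are hypothetical names, and "would block" merely counts the pages on which a real implementation would sleep waiting for the busy lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	TOY_VALIDONLY	0x1	/* hypothetical flag: skip invalid pages */

struct toy_page {
	bool	valid;		/* page holds valid file data */
	bool	busy;		/* busied by another thread, e.g. a page fault */
	bool	removed;	/* set once the page has been purged */
};

/*
 * Purge pages from a toy "object".  Returns the number of pages on which
 * the caller would have had to wait for the busy lock.
 */
static size_t
toy_page_remove(struct toy_page *pages, size_t npages, int options)
{
	size_t i, would_block = 0;

	assert(pages != NULL || npages == 0);
	for (i = 0; i < npages; i++) {
		struct toy_page *p = &pages[i];

		/* Check validity before touching the busy lock. */
		if ((options & TOY_VALIDONLY) != 0 && !p->valid)
			continue;
		if (p->busy) {
			/* A real implementation would sleep here. */
			would_block++;
			continue;
		}
		p->removed = true;
	}
	return (would_block);
}
```

With TOY_VALIDONLY set, an invalid page busied by a faulting thread is never waited on, so that thread is free to block on the teardown lock. Without the flag, the same page forces the caller to wait while it holds the teardown lock.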

PR: 258208

Test Plan

Peter has a stress2 scenario which triggers the deadlock fairly regularly.
We haven't observed any problems with this patch applied.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 42706
Build 39594: arc lint + arc unit

Event Timeline

markj created this revision. · Nov 10 2021, 5:45 PM

Herald added subscribers: delphij, imp. · Nov 10 2021, 5:45 PM

markj requested review of this revision. · Nov 10 2021, 5:45 PM

Harbormaster completed remote builds in B42706: Diff 98322. · Nov 10 2021, 5:45 PM

markj edited the test plan for this revision. · Nov 10 2021, 5:46 PM

kib added inline comments. · Nov 10 2021, 7:04 PM

sys/vm/vm_object.c
2124	Would this still cause the issue with partially valid pages at EOF? If the page is not fully valid, vm_fault() still calls into the pager, and the pager calls into the VOP. The rest of the page is validated by vm_pager_get_pages() after pgo_getpages() has validated the page up to EOF. It might be simplest to avoid this issue altogether by unconditionally calling vm_page_zero_invalid() in the ZFS VOP_GETPAGES().

markj added inline comments. · Nov 10 2021, 8:02 PM

sys/vm/vm_object.c
2124	I believe ZFS vnode pages are never partially valid. This is asserted in several places which maintain coherence between the page cache and the DMU. See page_busy() or dmu_read_pages(). Truncation (currently) does not mark pages partially invalid, see the last comment in vnode_pager_subpage_purge(). Maybe it is a somewhat fragile assumption, but it exists already.
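
To make the three validity states in this exchange concrete, here is a small user-space sketch. It assumes a per-page bitmask with one bit per 512-byte block of a 4 KB page; the names toy_classify() and the TOY_* constants are hypothetical and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define	TOY_BLOCKS_PER_PAGE	8	/* assumes 4 KB pages, 512-byte blocks */
#define	TOY_BITS_ALL	((uint8_t)((1u << TOY_BLOCKS_PER_PAGE) - 1))

enum toy_validity { TOY_NONE_VALID, TOY_PARTIALLY_VALID, TOY_ALL_VALID };

/* Classify a page by its valid-block bitmask. */
static enum toy_validity
toy_classify(uint8_t valid_bits)
{
	if (valid_bits == 0)
		return (TOY_NONE_VALID);
	if (valid_bits == TOY_BITS_ALL)
		return (TOY_ALL_VALID);
	return (TOY_PARTIALLY_VALID);
}
```

The question above concerns the middle state: a page that is neither skipped as wholly invalid nor exempt from a pager call as wholly valid. The reply is that ZFS vnode pages are not expected to be in that state.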

kib accepted this revision. · Nov 10 2021, 8:05 PM

This revision is now accepted and ready to land. · Nov 10 2021, 8:05 PM

The code looks fine (small change, after all), other than my request for comments. :)

sys/kern/vfs_vnops.c
2444–2455	Can we get some comments? I realize the code has historically been low on them, but we can try to fix that for new code. :)

Add a comment for vn_pages_remove_valid().

This revision now requires review to proceed. · Nov 11 2021, 8:33 PM

Harbormaster completed remote builds in B42736: Diff 98379. · Nov 11 2021, 8:33 PM

Thank you very much! This is a very neat fix for what appeared as a quite substantial problem (and maybe is, in terms of interactions between layers / subsystems).

This revision is now accepted and ready to land. · Nov 12 2021, 10:04 AM

Thanks

Closed by commit rGd28af1abf031: vm: Add a mode to vm_object_page_remove() which skips invalid pages (authored by markj). · Nov 15 2021, 6:03 PM

This revision was automatically updated to reflect the committed changes.

markj added a commit: rGd28af1abf031: vm: Add a mode to vm_object_page_remove() which skips invalid pages.

markj added a commit: rG705a6ee2b611: zfs: Fix a deadlock between page busy and the teardown lock. · Nov 20 2021, 4:37 PM

Brian Behlendorf <behlendorf1@llnl.gov> added a commit: rGcdf74673bc79: zfs: Fix a deadlock between page busy and the teardown lock. · Dec 15 2021, 1:24 AM

Tony Hutter <hutter2@llnl.gov> added a commit: rGb3427b18b1c6: zfs: Fix a deadlock between page busy and the teardown lock. · Mar 11 2022, 6:33 AM