Page MenuHomeFreeBSD

zfs: Fix a deadlock between page busy and the teardown lock
ClosedPublic

Authored by markj on Nov 10 2021, 5:45 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Nov 7, 10:26 PM
Unknown Object (File)
Sun, Nov 3, 5:08 AM
Unknown Object (File)
Thu, Oct 31, 3:09 PM
Unknown Object (File)
Tue, Oct 29, 2:40 AM
Unknown Object (File)
Tue, Oct 29, 2:40 AM
Unknown Object (File)
Tue, Oct 29, 2:39 AM
Unknown Object (File)
Tue, Oct 29, 2:39 AM
Unknown Object (File)
Tue, Oct 29, 2:24 AM
Subscribers

Details

Summary

ZFS has a per-mountpoint read-mostly "teardown lock" which is acquired
in read mode by most VOPs. It is used to suspend filesystem operations
in preparation for some dataset-level operation like a rollback
(reverting a filesystem to an earlier snapshot). In particular, the ZFS
VOP_GETPAGES implementation acquires this lock.

When rolling back a dataset, ZFS invalidates all file data resident in
the page cache, as a rollback can cause a file's contents to change and
we don't want to let stale data linger. To do this, it calls
vn_page_remove() on each vnode associated with the mountpoint. This
introduces a lock order reversal: to handle a page fault we busy vnode
pages before calling VOP_GETPAGES, and during rollback we busy vnode
pages in vm_object_page_remove() with the teardown lock held.

Resolve the deadlock by exploiting the fact that rollback only needs to
purge valid pages: invalid pages need not be purged by definition, and
then a busy lock holder of invalid ZFS vnode pages can safely block on
the teardown lock. This assumes that we will not pass valid pages to
VOP_GETPAGES via vm_pager_get_pages(). Since ZFS is solely responsible
for marking vnode pages valid, I believe it is a safe assumption.

Add a new mode to vm_object_page_remove() to skip over invalid pages.
Use it when rolling back a dataset.

PR: 258208

Test Plan

Peter has a stress2 scenario which triggers the deadlock fairly regularly.
We haven't observed any problems with this patch applied.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 42706
Build 39594: arc lint + arc unit

Event Timeline

markj requested review of this revision.Nov 10 2021, 5:45 PM
sys/vm/vm_object.c
2124

Would this still cause the issue with partially valid pages at EOF? If the page is not fully valid, vm_fault() still calls into pager, and pager calls into VOP. The validation of the rest of the page is performed by vm_pager_get_pages() after pgo_getpages() validated page up to EOF.

It might be simplest to avoid this issue at all by unconditionally doing vm_page_zero_invalid() in zfs VOP_GETPAGES().

sys/vm/vm_object.c
2124

I believe ZFS vnode pages are never partially valid. This is asserted in several places which maintain coherence between the page cache and the DMU. See page_busy() or dmu_read_pages(). Truncation (currently) does not mark pages partially invalid, see the last comment in vnode_pager_subpage_purge(). Maybe it is a somewhat fragile assumption, but it exists already.

This revision is now accepted and ready to land.Nov 10 2021, 8:05 PM
sef added a subscriber: sef.

The code looks fine (small change, after all), other than my request for comments. :)

sys/kern/vfs_vnops.c
2444–2455

Can we get some comments? I realize the code has historically been low on them, but we can try to fix that for new code. :)

markj marked an inline comment as done.

Add a comment for vn_pages_remove_valid().

This revision now requires review to proceed.Nov 11 2021, 8:33 PM

Thank you very much! This is a very neat fix for what appeared as a quite substantial problem (and maybe is, in terms of interactions between layers / subsystems).

This revision is now accepted and ready to land.Nov 12 2021, 10:04 AM