vm_fault: reset iterator after vm_fault_populate()
ClosedPublic
Actions

Authored by dougm on Tue, Apr 22, 8:15 AM.

Details

Reviewers

markj
kib
alc

Commits

rG0f9b73ffa50d: vm_fault: reset iterator after vm_fault_populate()

Summary

An object lock can be released and reacquired without resetting the iterator in the vm_fault code. Between an iter_lookup call in vm_fault_object and page_alloc_after call in vm_fault_allocate, there is a call to vm_fault_populate in vm_fault_allocate that leads to functions that release and reacquire the object write lock. So, if vm_fault_populate is called and allocation is likely to follow, reset the iterator.

No tests have failed on account of this, as far as I know, but it is a bug that ought to be addressed promptly.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

dougm requested review of this revision.Tue, Apr 22, 8:15 AM

dougm created this revision.

I'm fine with reverting the change until the bug can be found.

sys/vm/vm_fault.c
1613	Suppose vm_fault_soft_fast() failed above. Aren't we then proceeding with an uninitialized iterator?

dougm marked an inline comment as done.Tue, Apr 22, 4:35 PM

dougm added inline comments.

sys/vm/vm_fault.c
1613	We only use the iterator if (fs.entry->eflags & MAP_ENTRY_SPLIT_BOUNDARY_MASK) != 0. We only invoke vm_fault_soft_fast() if (fs.entry->eflags & MAP_ENTRY_SPLIT_BOUNDARY_MASK) == 0.

The syzkaller test still triggers a panic:

20250422 20:52:41 all (1/1): syzkaller75.sh
panic: vm_pager_assert_in: page 0xfffffe0024120cc0 is mapped
cpuid = 6
time = 1745348147
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe010873a730
vpanic() at vpanic+0x136/frame 0xfffffe010873a860
panic() at panic+0x43/frame 0xfffffe010873a8c0
vm_pager_assert_in() at vm_pager_assert_in+0x23a/frame 0xfffffe010873a900
vm_pager_get_pages() at vm_pager_get_pages+0x7e/frame 0xfffffe010873a960
vm_fault_getpages() at vm_fault_getpages+0x22b/frame 0xfffffe010873a9c0
vm_fault_object() at vm_fault_object+0x29a/frame 0xfffffe010873aa00
vm_fault() at vm_fault+0x2aa/frame 0xfffffe010873ab40
vm_map_wire_locked() at vm_map_wire_locked+0x385/frame 0xfffffe010873abf0
vm_mmap_object() at vm_mmap_object+0x2fd/frame 0xfffffe010873ac50
vn_mmap() at vn_mmap+0x152/frame 0xfffffe010873ace0
kern_mmap() at kern_mmap+0x621/frame 0xfffffe010873adc0
sys_mmap() at sys_mmap+0x42/frame 0xfffffe010873ae00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe010873af30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe010873af30
--- syscall (0, FreeBSD ELF64, syscall), rip = 0x823e0b7fa, rsp = 0x8248e4f68, rbp = 0x8248e4f90 ---

https://people.freebsd.org/~pho/stress/log/log0582.txt

In D49957#1139502, @pho wrote:
The syzkaller test still triggers a panic:
20250422 20:52:41 all (1/1): syzkaller75.sh

I don't have access to syzkaller75.sh. Can you provide it?

In D49957#1139554, @dougm wrote:
In D49957#1139502, @pho wrote:
The syzkaller test still triggers a panic:
20250422 20:52:41 all (1/1): syzkaller75.sh
I don't have access to syzkaller75.sh. Can you provide it?

I have uploaded the WiP test scenario.{F115374599}

In D49957#1139615, @pho wrote:

I have uploaded the WiP test scenario.{F115374599}

Unfortunately, you are the only person who can view this attachment. The same thing happens to me when I upload files here, and I don't know how to fix it.

What was the most recent version of main to pass this test?

Doug,
I have mailed you the test scenario. I will start a test on the latest version of HEAD.

Upon further study, vm_fault_populate() can definitely release and reacquire the object lock. So reset the iterator in that case.

dougm retitled this revision from vm_fault: abandon page_alloc_after from vm_fault to vm_fault: reset iterator after vm_fault_populate().Wed, Apr 23, 7:54 AM

20250423 09:20:24 all (1/1): syzkaller75.sh
panic: vm_pager_assert_in: page 0xfffffe0000b5fcc0 is mapped
cpuid = 11
time = 1745393009
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe016df1d7b0
vpanic() at vpanic+0x136/frame 0xfffffe016df1d8e0
panic() at panic+0x43/frame 0xfffffe016df1d940
vm_pager_assert_in() at vm_pager_assert_in+0x1fa/frame 0xfffffe016df1d980
vm_pager_get_pages() at vm_pager_get_pages+0x3d/frame 0xfffffe016df1d9d0
vm_fault() at vm_fault+0x745/frame 0xfffffe016df1db40
vm_map_wire_locked() at vm_map_wire_locked+0x385/frame 0xfffffe016df1dbf0
vm_mmap_object() at vm_mmap_object+0x2fd/frame 0xfffffe016df1dc50
vn_mmap() at vn_mmap+0x152/frame 0xfffffe016df1dce0
kern_mmap() at kern_mmap+0x621/frame 0xfffffe016df1ddc0
sys_mmap() at sys_mmap+0x42/frame 0xfffffe016df1de00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe016df1df30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe016df1df30
--- syscall (0, FreeBSD ELF64, syscall), rip = 0x823efb7fa, rsp = 0x821f52f68, rbp = 0x821f52f90 ---
KDB: enter: panic
[ thread pid 37673 tid 108519 ]
Stopped at      kdb_enter+0x33: movq    $0,0x104d7a2(%rip)
db> x/s version
version:  FreeBSD 15.0-CURRENT #0 main-n276680-d14036ea424d-dirty: Wed Apr 23 09:10:41 CEST 2025
  pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO
db>
Uptime: 8m32s

dougm edited the summary of this revision. (Show Details)Wed, Apr 23, 8:45 PM

In D49957#1139632, @dougm wrote:

What was the most recent version of main to pass this test?

I'm curious too. The syzbot reproducer doesn't work for me for some reason. I spent some time reading commits 0b694568b0b5d3cf76f56b119b3a378a47b44a1f and 9cc73397534ee5373593f5946abe0b00c1d2b657 and can't spot a bug.

sys/vm/vm_fault.c
1261	I think it would be cleaner to include a pointer to the iterator in struct faultstate.

dougm added a reviewer: alc.Thu, Apr 24, 8:25 PM

kib accepted this revision.Thu, Apr 24, 8:41 PM

This revision is now accepted and ready to land.Thu, Apr 24, 8:41 PM

Closed by commit rG0f9b73ffa50d: vm_fault: reset iterator after vm_fault_populate() (authored by dougm). · Explain WhyThu, Apr 24, 10:23 PM

This revision was automatically updated to reflect the committed changes.

dougm added a commit: rG0f9b73ffa50d: vm_fault: reset iterator after vm_fault_populate().

Herald added a subscriber: imp. · View Herald TranscriptThu, Apr 24, 10:23 PM

alc added inline comments.Fri, Apr 25, 6:44 AM

sys/vm/vm_fault.c
1261	A pointer or an instance of the struct? I would have expected the iterator to be a field within the faultstate.

markj added inline comments.Fri, Apr 25, 7:58 PM

sys/vm/vm_fault.c
1261	Sorry, yes, embedding the iterator makes more sense.

Now that iterators are much smaller, consisting of only four 64-bit fields, I think we should seriously consider embedding an iterator in the vm_object for use by operations that require a write lock on the object. If that iterator is used by the functions that insert and remove pages from the trie, then it will always be consistent, and most of the pctrie_iter_reset()s can be eliminated. With the removal of the memq from the object, this would only grow the size of the vm_object by 2 64-bit fields. (If there was some way to avoid having both the root of the pctrie and the iterator's pointer to that root within the vm_object, then the size of the vm_object would only grow by 1 64-bit field.)

The operations, like vm_map_pmap_enter(), that utilize a read lock will still need to define and use their own private iterator.

That all said, I think that we should complete the removal of the memq, and then consider embedding an iterator in the object.

In D49957#1140976, @alc wrote:

Now that iterators are much smaller, consisting of only four 64-bit fields, I think we should seriously consider embedding an iterator in the vm_object for use by operations that require a write lock on the object. If that iterator is used by the functions that insert and remove pages from the trie, then it will always be consistent, and most of the pctrie_iter_reset()s can be eliminated. With the removal of the memq from the object, this would only grow the size of the vm_object by 2 64-bit fields. (If there was some way to avoid having both the root of the pctrie and the iterator's pointer to that root within the vm_object, then the size of the vm_object would only grow by 1 64-bit field.)

The operations, like vm_map_pmap_enter(), that utilize a read lock will still need to define and use their own private iterator.

That all said, I think that we should complete the removal of the memq, and then consider embedding an iterator in the object.

Consider D49942 as an example. Most of the changes there would not be necessary if we were using an iterator that was embedded within the vm_object. The necessary changes would only be do switch from using vm_page_lookup to vm_radix_iter_lookup on the embedded iterator.

In D49957#1140988, @alc wrote:

In D49957#1140976, @alc wrote:

Now that iterators are much smaller, consisting of only four 64-bit fields, I think we should seriously consider embedding an iterator in the vm_object for use by operations that require a write lock on the object. If that iterator is used by the functions that insert and remove pages from the trie, then it will always be consistent, and most of the pctrie_iter_reset()s can be eliminated. With the removal of the memq from the object, this would only grow the size of the vm_object by 2 64-bit fields. (If there was some way to avoid having both the root of the pctrie and the iterator's pointer to that root within the vm_object, then the size of the vm_object would only grow by 1 64-bit field.)

Having different functions taking turns using the same iterator could be problematic. If two of them are iterating over consecutive entries, then sharing the same index or limit fields of the iterator would screw up one of them.

The good news is that the only field that matters is the node field, the one that is set to NULL in pctrie_iter_reset(). I suggest that what the vm_object should really hold is the node pointer from the iterator, so we'd add a struct pctrie_node pointer to the iterator, and we'd define a new type of iterator - call it a "sharable_pctrie_iter" for now - that had, instead of a "struct pctrie_node *node;" field, had a "struct pctrie node **pnode;" field. There would be different functions - for every 'iter' function now, we'd have to add a 'shared_iter' function - but the only difference would be in whether 'node' or '*pnode' was read from and assigned to. The value of 'pnode' would be initialized to &object->page_node.

In D49957#1140996, @dougm wrote:

The value of 'pnode' would be initialized to &object->page_node.

Better still, we could just add struct pctrie_node *curr as the second field to the pctrie struct, along with root. Iterators would read from it and set it. Functions that changed the pctrie without using iterators would NULL it, to avoid breaking the iter-users. We could just drop 'node' from the iterator struct and use ptree->curr instead.

In D49957#1140988, @alc wrote:

In D49957#1140976, @alc wrote:

Now that iterators are much smaller, consisting of only four 64-bit fields, I think we should seriously consider embedding an iterator in the vm_object for use by operations that require a write lock on the object. If that iterator is used by the functions that insert and remove pages from the trie, then it will always be consistent, and most of the pctrie_iter_reset()s can be eliminated. With the removal of the memq from the object, this would only grow the size of the vm_object by 2 64-bit fields. (If there was some way to avoid having both the root of the pctrie and the iterator's pointer to that root within the vm_object, then the size of the vm_object would only grow by 1 64-bit field.)

The operations, like vm_map_pmap_enter(), that utilize a read lock will still need to define and use their own private iterator.

That all said, I think that we should complete the removal of the memq, and then consider embedding an iterator in the object.

Consider D49942 as an example. Most of the changes there would not be necessary if we were using an iterator that was embedded within the vm_object. The necessary changes would only be do switch from using vm_page_lookup to vm_radix_iter_lookup on the embedded iterator.

This would mean, practically, that each vnode is increased in size by 32 byte. On modest machines with 1M cached vnodes it means 32M of the kernel memory, both KVA and phys.
We try hard to not bloat struct vnode even by a byte.

Revision Contents
Changeset List

Path

Size

sys/

vm/

vm_fault.c

1 line

Diff 154251

View Options

vm_fault: reset iterator after vm_fault_populate()ClosedPublicActions