Use a pctrie_lookup to avoid walking over low out-of-range buf list entries, and an early break to avoid the high out-of-range entries. Avoid writing almost identical loops for the dirty and clean lists.
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
sys/kern/vfs_subr.c | ||
---|---|---|
2655–2656 | Note that B_DELWRI tests are different between loops. All buffers in the clean queue must have B_DELWRI clean, the flag set means that we lost the iterator. For the dirty queue, it is reverse. |
Make the code less tricky, with fewer variables, and make the resulting amd64 code a few bytes smaller than before.
sys/kern/vfs_subr.c | ||
---|---|---|
2656 | This is even more tricky than before. Both parts of the comparison operator are converted to same type, in this case int. Then clean becomes 0 or 1, while B_DELWRI bit is not at position 0, so the left hand cannot have the value 1 ever. So this expression is apparently true for any delayed-write buffer, which is not what needed. |
sys/kern/vfs_subr.c | ||
---|---|---|
2656 | !x is 0 when x is nonzero and 1 when x is zero. But I can rewrite as (nbp->b_flags & B_DELWRI) == (clean ? B_DELWRI : 0) if that would be clearer. |
sys/kern/vfs_subr.c | ||
---|---|---|
2656 | Huh, I missed the '!' before the logical-and expression. It is correct but still tricky (and not in style). So yes please change as you proposed. |
Don't use pctrie lookup to find the first block to scan; just use linear search. Using pctrie lookup leads to two crashes (truncate3.sh, mmap28.sh) on stress tests, and this version passes those tests. I guess there are items in the list not in the trie?
All buffers must be in trie. Might be, it is due to UFS metadata buffers having negative indexes?
Not sure if the panic is related to this patch?
20241011 13:49:02 all (664/970): swapoff4.sh Oct 11 13:50:22 mercat1 kernel: pid 90548 (swap), jid 0, uid 0, was killed: failed to reclaim memory Oct 11 13:50:23 mercat1 kernel: pid 90555 (swap), jid 0, uid 0, was killed: failed to reclaim memory Oct 11 13:50:25 mercat1 kernel: pid 90550 (swap), jid 0, uid 0, was killed: failed to reclaim memory panic: Assertion (object->flags & OBJ_SWAP) != 0 failed at ../../../vm/swap_pager.c:564 cpuid = 5 time = 1728647425 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01d8174990 vpanic() at vpanic+0x13f/frame 0xfffffe01d8174ac0 panic() at panic+0x43/frame 0xfffffe01d8174b20 swapoff_one() at swapoff_one+0x8a9/frame 0xfffffe01d8174d00 kern_swapoff() at kern_swapoff+0x1ab/frame 0xfffffe01d8174e00 amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe01d8174f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe01d8174f30 --- syscall (582, FreeBSD ELF64, swapoff), rip = 0xdc6777123ba, rsp = 0xdc6748c87e8, rbp = 0xdc6748c8920 ---
Nope. It is due to the swap pager iterators code, which adds (IMO) correct assertions that the object has the OBJ_SWAP flag set. But the flag might become cleared after the object relock, if the object is dying.
May be something like this
diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c index f4db46a32dee..50c320a95002 100644 --- a/sys/vm/swap_pager.c +++ b/sys/vm/swap_pager.c @@ -1922,6 +1922,7 @@ swap_pager_swapoff_object(struct swdevt *sp, vm_object_t object) * Make sure that pending writes finish before * returning. */ +dead: vm_object_pip_wait(object, "swpoff"); swp_pager_meta_free_all(object); break; @@ -1988,11 +1989,15 @@ swap_pager_swapoff_object(struct swdevt *sp, vm_object_t object) * iterator and search again for the first swblk at or * after blks.index. */ + if ((object->flags & OBJ_DEAD) != 0) + goto dead; pctrie_iter_reset(&pages); sb = swblk_iter_init(&blks, object, blks.index); continue; } if (sb_empty) { + if ((object->flags & OBJ_DEAD) != 0) + goto dead; swblk_iter_remove(&blks); uma_zfree(swblk_zone, sb); }
I ran tests with D46963.144614.patch added for 14 hours. I did not observe any issues.
Restore the pctrie lookup. Add a check for negative block number. Passes mmap28.sh and truncate3.sh tests.
Copy what bnoreuselist() does to handle the problem with LOOKUP_GE results, which is more robust than what I did before.