
vm_object: drop memq field
Needs Review · Public

Authored by dougm on Dec 9 2024, 8:06 PM.

Details

Summary

The memq field in vm_object is used to maintain a list of resident pages, sorted by pindex. Remove that field, and stop maintaining the list.
Where a predecessor page is passed to a function solely to help with insertion into the memq list, drop that parameter. Rename the functions that lose those arguments from *_after() to *_iter().
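As a hedged illustration of the renaming pattern (the signatures below are schematic, not necessarily the exact ones in the diff):

```
/*
 * Before: the caller supplies the predecessor page "mpred" so that
 * the new page can be linked into the object's memq list after it.
 */
int vm_page_insert_after(vm_page_t m, vm_object_t object,
    vm_pindex_t pindex, vm_page_t mpred);

/*
 * After: with memq gone, the predecessor is no longer needed; the
 * function takes the radix-tree iterator instead.
 */
int vm_page_iter_insert(vm_page_t m, vm_object_t object,
    vm_pindex_t pindex, struct pctrie_iter *pages);
```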

Diff Detail

Lint: Skipped
Unit Tests: Skipped

Event Timeline

dougm requested review of this revision. Dec 9 2024, 8:06 PM

A code reformat to (slightly) reduce the diff size.

Define and use VM_RADIX_FOREACH macros. Initialize a couple of iterators before acquiring read locks.

sys/amd64/sgx/sgx.c
396

Rather than lifting out the vm_page_remove() calls, perhaps pass the iterator to sgx_page_remove() as well, and have it decide whether to call vm_page_remove() or vm_page_iter_remove() based on whether the iterator pointer is NULL.
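A minimal sketch of that suggestion (the surrounding EPC teardown is elided, and vm_page_iter_remove()'s exact signature is an assumption):

```
static void
sgx_page_remove(struct sgx_softc *sc, vm_page_t p,
    struct pctrie_iter *pages)
{
	/*
	 * Let one helper serve both kinds of call site: remove via
	 * the iterator when one is supplied, otherwise fall back to
	 * the plain removal path.
	 */
	if (pages != NULL)
		vm_page_iter_remove(pages, p);
	else
		vm_page_remove(p);
	/* ... existing EPC-page teardown continues here ... */
}
```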

sys/vm/vm_glue.c
625

Do you have some other uses for NOCREAT_LT planned? This function is called rarely (kernel stacks are cached by UMA and reused many times without going through this function), so the extra complexity feels a bit unnecessary.

sys/vm/vm_page.h
224

Not quite related to this change, but once it lands, I'd like to add one 8-byte field to plinks. I believe that'd be enough to let us start embedding UMA slab headers (i.e. uma_slab_t) within the vm_page structure. This is an alternative approach to D22768.

sys/vm/vm_radix.h
272

Bikeshedding, but VM_RADIX_FOREACH_FROM reads a bit more naturally to me, and we already have TAILQ_FOREACH_FROM etc.

sys/vm/vm_reserv.c
528

This is more expensive, but at some point I presume it should be possible to pass an iterator down from the page allocator and use that instead of doing a full lookup here.

dougm marked 2 inline comments as done. Jan 6 2025, 7:26 PM
dougm added inline comments.
sys/vm/vm_glue.c
625

I have no other uses planned. I was just looking for a way to avoid a redundant lookup.

sys/vm/vm_radix.h
272

I don't object to using FROM, but the reason I didn't is that FOREACH_FROM takes two arguments, and this macro takes three. FOREACH_FROM assumes that the lookup has already happened and FOREACH_START has to do the lookup. Those differences are the reason that I chose to use a distinct name. But it doesn't matter much to me.
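For concreteness, the two macro shapes under discussion might look roughly like this (a sketch; the real macros in the diff may differ):

```
/* _START style: the macro performs the initial lookup itself. */
#define	VM_RADIX_FOREACH_START(m, pages, start)			\
	for ((m) = vm_radix_iter_lookup_ge((pages), (start));	\
	    (m) != NULL; (m) = vm_radix_iter_step((pages)))

/* TAILQ_FOREACH_FROM style: resume from an element already in hand. */
#define	VM_RADIX_FOREACH_FROM(m, pages)				\
	for (; (m) != NULL; (m) = vm_radix_iter_step((pages)))
```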

sys/vm/vm_reserv.c
528

I agree that it would be possible.

dougm marked an inline comment as done.

Doug, how would you like to proceed with the patch? Since quite a few pieces of it are independent, I imagine they can be reviewed and peeled off one by one, especially in areas that aren't performance-critical.

Are there still some concerns about the overhead of switching to iterators in some of the already-committed changes? I've lost track a bit, I'm sorry.

sys/kern/kern_kcov.c
411

Can't we use vm_radix_iter_next() here?

sys/kern/subr_pctrie.c
550

This can presumably be committed on its own. It looks reasonable to me.

sys/x86/iommu/amd_idpgtbl.c
123

The assertion can come first, as in dmar_domain_free_pgtbl().

sys/x86/iommu/intel_idpgtbl.c
266–268

Elsewhere you are careful to initialize the iterator outside the scope of the object lock, but not here.

268

Does it make sense to have a macro for this usage too? Perhaps we could use VM_RADIX_FOREACH and assert that pindices are consecutive?

dougm marked 3 inline comments as done.

> Doug, how would you like to proceed with the patch? Since quite a few pieces of it are independent, I imagine they can be reviewed and peeled off one by one, especially in areas that aren't performance-critical.

Some of them have been reviewed separately, and approved by @kib, but blocked by comments from @alc. His comment on D47718 is why I stopped creeping along a bit at a time and threw it all out for consideration at once. I expect that his approval depends on some kind of performance testing. He's been drawn away from BSD considerations by other life events lately.

> Are there still some concerns about the overhead of switching to iterators in some of the already-committed changes? I've lost track a bit, I'm sorry.

@alc has concerns. He has reported that performance worsened a bit with some of the page-management changes already made, and with some other changes that have been under review for a while.

sys/x86/iommu/intel_idpgtbl.c
268

I don't know. I'll add a VM_RADIX_FORALL macro for this purpose, and you can decide how you feel about it.
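A sketch of how such a walk could assert density (the macro shape, the KASSERT, and the free_pgtbl_page() helper are assumptions, not the committed code):

```
struct pctrie_iter pages;
vm_page_t m;
vm_pindex_t expect = 0;

vm_page_iter_init(&pages, obj);
VM_RADIX_FORALL(m, &pages) {
	/* The object is expected to be fully populated: no holes. */
	KASSERT(m->pindex == expect,
	    ("missing page at pindex %ju", (uintmax_t)expect));
	expect++;
	free_pgtbl_page(m);	/* hypothetical per-page work */
}
```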

Use VM_RADIX_FORALL_FROM in one more place.

Use VM_RADIX_FORALL in kcov_free.

Abandon VM_ALLOC_NOCREAT_LT.

>> Are there still some concerns about the overhead of switching to iterators in some of the already-committed changes? I've lost track a bit, I'm sorry.

> @alc has concerns. He has reported that performance worsened a bit with some of the page-management changes already made, and with some other changes that have been under review for a while.

Do we have a benchmark to experiment with, either a workload to test the kernel or a standalone userspace program embedding subr_pctrie.o? I wrote a patch which tries to add a fast path to pctrie_iter_step/next, wherein pctrie.h inlines iteration when the top-most node in the iterator contains a run of successive leaf values, and falls back to pctrie_iter_lookup_ge() to look up new nodes or handle holes. So far I've only tested it to check correctness, but I wonder if it would help performance in the case where a VM object is densely populated.
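Schematically, the fast path described above might look like this (the helper and the iterator fields are hypothetical placeholders, not the real subr_pctrie.c internals; only pctrie_iter_lookup_ge() is named in the discussion above):

```
static __inline uint64_t *
pctrie_iter_next_fast(struct pctrie_iter *it)
{
	uint64_t *val;

	/*
	 * Fast path: if the cached bottom-level node holds the next
	 * consecutive leaf, return it without walking the trie.
	 */
	val = iter_node_next_leaf(it);		/* hypothetical helper */
	if (__predict_true(val != NULL))
		return (val);
	/* Slow path: a new node is needed, or there is a hole. */
	return (pctrie_iter_lookup_ge(it, it->index + 1));
}
```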

One other performance-related detail I'm wondering about is that we check the iterator limit twice when stepping: once before looking up a leaf, and once after finding one. I believe the latter check is sufficient for correctness. If so, are we making the right tradeoff in checking twice instead of once?
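In pseudocode, the two checks in question (simplified; field and helper names are placeholders):

```
/* Stepping logic, schematically: */
if (it->index + 1 > it->limit)		/* check #1: before the lookup */
	return (NULL);
val = next_leaf(it);			/* hypothetical lookup */
if (val == NULL || leaf_index(val) > it->limit)
	return (NULL);			/* check #2: after finding a leaf */
return (val);
```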

sys/amd64/amd64/pmap.c
7710

I don't have a concrete suggestion to offer at the moment, but here we're doing some unnecessary work: vm_map_pmap_enter(), the sole caller of this function, uses an iterator to find a contiguous run of pindices that correspond to a run of valid pages; then, it passes the bounds of this run to pmap_enter_object(), which uses a separate iterator to re-lookup each page in the run. That seems redundant, and we should perhaps rework the pmap_enter_object() interface to avoid it.

One possibility would be modifying vm_map_pmap_enter() to store a fixed-size array of vm_page pointers on the stack, which gets passed to pmap_enter_object() once full or the end of a run is detected or a superpage is detected.
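A rough sketch of the array idea, as it might appear inside vm_map_pmap_enter() (pmap_enter_pages() is a hypothetical batched interface, not existing code, and the local variables are assumed from context):

```
#define	PMAP_ENTER_BATCH	16

vm_page_t batch[PMAP_ENTER_BATCH];
int n = 0;

VM_RADIX_FOREACH_START(m, &pages, pindex) {
	if (m->pindex != pindex + n || !vm_page_all_valid(m))
		break;		/* end of the contiguous valid run */
	batch[n++] = m;
	if (n == PMAP_ENTER_BATCH) {
		pmap_enter_pages(pmap, addr, batch, n, prot);
		addr += ptoa(n);
		pindex += n;
		n = 0;
	}
}
if (n > 0)
	pmap_enter_pages(pmap, addr, batch, n, prot);
```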

In vm_map.c, have the page index jump ahead by the superpage size when a superpage is encountered. Initialize the cursor before acquiring a read lock.

sys/amd64/amd64/pmap.c
7710

How about if vm_map_pmap_enter stops trying to find a maximal page range between calls to pmap_enter_object, and instead just calls it every time vm_page_all_valid is true, passing along the start and psind values? The pmap lock would be acquired once, before the loop in vm_map_pmap_enter, and released once, and pmap_enter_object wouldn't have a loop at all. Since I don't know what the rwlock is about, I don't know how the potential extra locking/unlocking of the rwlock would affect things.
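Roughly, the restructuring described above (schematic only; the pmap_enter_object() signature shown is the proposed, hypothetical one, not the current interface):

```
PMAP_LOCK(pmap);		/* taken once, outside the loop */
VM_RADIX_FOREACH_START(m, &pages, first_pindex) {
	if (!vm_page_all_valid(m))
		continue;
	/* One call per (super)page; no loop inside the pmap layer. */
	pmap_enter_object(pmap, start + ptoa(m->pindex - first_pindex),
	    m, m->psind, prot);
}
PMAP_UNLOCK(pmap);
```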

Undo the parts of the last change that moved the iterator initialization.

>>> Are there still some concerns about the overhead of switching to iterators in some of the already-committed changes? I've lost track a bit, I'm sorry.

>> @alc has concerns. He has reported that performance worsened a bit with some of the page-management changes already made, and with some other changes that have been under review for a while.

> Do we have a benchmark to experiment with, either a workload to test the kernel or a standalone userspace program embedding subr_pctrie.o? I wrote a patch which tries to add a fast path to pctrie_iter_step/next, wherein pctrie.h inlines iteration when the top-most node in the iterator contains a run of successive leaf values, and falls back to pctrie_iter_lookup_ge() to look up new nodes or handle holes. So far I've only tested it to check correctness, but I wonder if it would help performance in the case where a VM object is densely populated.

I've just been running buildworld tests. I wrote a patch (a part of D47680) that tries to have the next lookup execute straightline code in the common case.

> One other performance-related detail I'm wondering about is that we check the iterator limit twice when stepping: once before looking up a leaf, and once after finding one. I believe the latter check is sufficient for correctness. If so, are we making the right tradeoff in checking twice instead of once?

@alc has suggested that we could modify pctrie nodes to maintain a parent pointer and avoid the iterator array.

> One other performance-related detail I'm wondering about is that we check the iterator limit twice when stepping: once before looking up a leaf, and once after finding one. I believe the latter check is sufficient for correctness. If so, are we making the right tradeoff in checking twice instead of once?

I haven't decoded this yet. Could you provide a bit more detail?

sys/amd64/amd64/pmap.c
7710

Cancel the previous idea.

Can we use p->plinks.s to build a list of the valid pages, and pass that list to pmap_enter_object to avoid the second iterator? Using an array raises the question of how big to make it, whereas using plinks.s avoids that - assuming that someone else isn't using plinks for something while vm_map_pmap_enter is happening. I don't know whether I can safely assume that.

sys/amd64/amd64/pmap.c
7710

At the moment, we cannot use plinks.s: it's overlaid with plinks.q, the fields used to link pages together in the global page queues. The pages mapped here will in general be managed (which among other things means that they are reclaimable and thus are kept in these page queues), so plinks is already in use.

That said, in this diff you remove two pointer-sized fields from struct vm_page, and I would later like to add one back for use in UMA (the kernel memory allocator). We could make use of such a field here: that would not overlap with UMA's use of it.
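For reference, the overlap in question (abridged from sys/vm/vm_page.h; the exact layout varies by version):

```
union {
	TAILQ_ENTRY(vm_page) q;		/* page queue or free list (Q) */
	struct {
		SLIST_ENTRY(vm_page) ss; /* private slists */
	} s;
	struct {
		u_long p;
		u_long v;
	} memguard;
	struct {
		void *slab;
		void *zone;
	} uma;
} plinks;
```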

dougm edited the summary of this revision.
sys/vm/vm_object.h
101

We'll need a __FreeBSD_version bump with this change, to force out-of-tree kernel modules to be recompiled. It's not too unusual for them to access vm_object fields directly. None of the usual suspects access memq though.

sys/vm/vm_page.h
224

Why would that require the current review to land first?

238–239

Is listq still needed?

sys/vm/vm_page.h
224

In order to avoid making struct vm_page larger. My intent would be to remove all uses of listq. Though, this is still a bit tricky in vm_phys.c because of the way we lazily remove pages from the page queues.

sys/vm/vm_page.h
224

But you said that the bytes are added to plinks, which should overlay other members of the union? But never mind.

sys/vm/vm_page.h
238–239

listq is still needed in uma_core.c and vm_phys.c, where pages are not part of objects.

The first version of this patch shrank vm_page by using a union to store listq and (object, pindex) in the same space, and that used to work. It no longer does, and I plan to make it work again after this gets committed.
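A hedged sketch of that space-saving union (hypothetical layout; the original patch's exact arrangement isn't shown here). The idea is that listq is only used while a page is not owned by an object, so the two can share storage:

```
union {
	TAILQ_ENTRY(vm_page) listq;	/* when not owned by an object */
	struct {
		vm_object_t object;	/* owning object */
		vm_pindex_t pindex;	/* index within that object */
	};
};
```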

Always make the iter pointer the last argument to any function. Suggested offline by @alc.

In page.c, reorder more param lists to make the iterator appear last.