Implement everything currently in vm_radix.c with calls to functions in subr_pctrie.c, asccessed via the interface provided by the DEFINE_PCTRIE_SMR macro.
Details
A GENERIC and a GENERIC-NODEBUG kernel both boot successfully.
Peter, can you test this?
Diff Detail
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
As fully as people will accept. I expect objection to this change because it is not big enough, and objection to a bigger change that affects the insert/remove functions, because it is too big.
I am fine with the sharing. I would request only two things noted in the comments: subr_pctrie.c needs to get a note that the specific functions are re-used by vm_radix.c, and explanation why it is required to have pindex as the first vm_page member.
This change is ok by me so long as the intent is to fully merge the two implementations.
Smaller diffs are indeed easier to review; I'd suggest posting a series of reviews that incrementally switch over to the subr_pctrie.c implementation. Once all the diffs are accepted, the commits can be pushed together.
Suppose that we have 16 physically contiguous pages on amd64/arm64. I can foresee using the vm_page's order field to encode an order, and thereby eliminate a level in the radix trie. Specifically, I believe that we could encode the order for allocated vm_pages as VM_NFREEORDER + the order, and tweak the assertions in vm_phys.c to test for >= VM_NFREEORDER, rather than equality.
Or better yet, being able to insert a reservation as a different type of leaf, and using a bit vector to indicate whether a vm_page from the reservation is a member of the radix trie. However, it is less clear to me how this could be implemented.
Is there a way to do such things if we use the subr_pctrie.c code for vm_pages?
I can foresee using the vm_page's order field to encode an order, and thereby eliminate a level in the radix trie. Specifically, I believe that we could encode the order for allocated vm_pages as VM_NFREEORDER + the order
Then you would want to change the implementation of vm_radix_lookup to call pctrie_lookup_le, and use the order field and pindex of the found rnode to determine whether your lookup succeeded or not. You might also want to change vm_radix_insert, so that when you inserted that 16th page, vm_radix_insert could remove all the little pages that would be covered by the new big page, before inserting the new big page. And you might want to change vm_radix_remove too, but all those changes would involve calling pctrie_insert and pctrie_remove as necessary. And maybe the pctrie people could implement some kind of 'reclaim everything in this range' call, so that vm_radix could do it faster, somehow. I'm sure the fine people who work on the pctrie project would be happy to help, once you define what you want from them.
Or better yet, being able to insert a reservation as a different type of leaf, and using a bit vector to indicate whether a vm_page from the reservation is a member of the radix trie
PCtrie doesn't know anything about types of leaves. It just points to them. If vm_radix code wants to have two flavors of leaves, that's not really pctrie's business.
No, lookup should not call lookup_le. When lookup arrives at a non-NULL leaf, it examines the vm_page's order field to determine the leaf's size, and thus whether the leaf contains the specified pindex. (I suspect this test would be similar to keybarr, but using the vm_page's pindex and order.) If it does, then lookup calculates the address of the corresponding vm_page structure and returns it.
The elimination of levels should be transparent to the callers.
Insert and remove would have to perform a sort of promotion and demotion. Promotion would be triggered by filling a popmap. Then, you would test to see if the leaf pointers point to adjacent vm_page structures. If they do, then you could either check that the vm_pages have the same segind or check their physical addresses to prove physical contiguity. The point of checking the leaf pointers first is that it will allow us to detect almost all cases where the vm_pages are not physically contiguous without introducing a lot of data cache misses.
For anonymous virtual memory that is backed by a reservation, I don't expect this to introduce a lot of repeating promotion/demotion cycles. Essentially, the freeing of a radix trie node that now happens at vm object termination in vm_radix_reclaim_allnodes would happen earlier during promotion.
For vm_page_alloc_contig, we ought to find a way to do a range insert on the trie.
Or better yet, being able to insert a reservation as a different type of leaf, and using a bit vector to indicate whether a vm_page from the reservation is a member of the radix trie
PCtrie doesn't know anything about types of leaves. It just points to them. If vm_radix code wants to have two flavors of leaves, that's not really pctrie's business.
For vm_radix.c that is a choice, not a requirement.
Nothing you propose changes the layout of nodes, or the basic search routines, so I continue to believe that there's no need for all this code duplication.
For 'lookup', I can satisfy your request by adding a method to pctrie call 'find_leaf' that is just like lookup without the index equality test. Then, when 'find_leaf' returns a non-NULL, vm_radix_lookup can calculate from pindex and order an address to return. 'lookup can just be implemented in terms of 'find_leaf'.
For 'insert', I can satisfy your request by reporting to you when the insertion produced a node with all non-null leaf children, and give you a means to iterate over them. Then if you decide they are mergeable based on your vm_page-based knowledge, I can give you some version of 'replace' that lets you replace a subtree with a super-leaf.
For 'remove', we'd need some kind of 'remove_leaf' function that didn't have an index test, and when the leaf returned was a superpage, vm_radix_remove would need to insert 15 new leaves. I gather that they would be distinct, so I don't see a way to shortcut that with an 'insert this leaf from index to index+count-1' function. But if that would help, it could be created.
These things don't require a tight coupling of knowledge about 'how pages work' and 'how pctries work', but only a bigger pctrie interface.
Don't modify vm_page. Implement everything in terms of subr_pctrie. Depends on another change (D41396) still under review.
I would appreciate feedback here. I'm afraid that the only (unwritten, verbal) feedback I have so far is that this change is only acceptable if I strip all the code out of subr_pctrie.c and inline it in pctrie.h so that all that code can be generated over again for every user of pctrie.h. Is that a consensus position (of everyone but me)?
Could you elaborate on the motivation of that answer, please? For me it sounds fine to share pctrie code as functions.
Looks good to me, but wait a week to see if anyone else will comment before you commit it.
Rearrange the members of vm_page. Move pindex to the beginning to avoid adding and subtracting in the vm_radix functions. Move order and 3 others up with pindex to avoid creating a gap, and because future changes are likely to access 'order' in vm_radix calls, and having 'order' next to 'pindex' will be cache friendly. Make order signed because, first, it is only ever assigned to, compare with, or printed as a signed value and, second, because negative values will be part of a planned future change.
In the absence of feeback, this will be committed on 9/6.
sys/vm/vm_page.h | ||
---|---|---|
228 | Doesn't this create a hole in the structure? The following anonymous union must be 8-byte aligned on LP64 platforms, so this rearrangement of the 8-bit fields will increase the size of struct vm_page. Why isn't it sufficient to move the pindex field? |
Move order and several others up with pindex to avoid creating a gap, and because future changes are likely to access 'order' in vm_radix calls, and having 'order' next to 'pindex' will be cache friendly. Didn't move enough up with pindex last time.
Rearrange ordering in vm_page one more time - to move pindex to the start and have order right after it. Without introducing gaps.
Please make sure to bump __FreeBSD_version in sys/param.h in this commit. The layout change would affect some out-of-tree kernel modules, requiring them to be recompiled.
Would this be specifically to avoid introducing an extra function call, or is there another reason? Looking at the patch, I'm inclined to suggest making modified vm_radix functions inline functions in vm_radix.h, but maybe some future plans mean that that's not a good idea.
sys/vm/vm_page.h | ||
---|---|---|
233 | This layout will still leave a hole in the case where PAGE_SIZE is 16384 (so sizeof(vm_page_bits_t) is 4), as can be the case on arm64. When PAGE_SIZE is 4096, there will be a hole at the end of the vm_page structure. Given the desire to move order close to pindex, I'd suggest moving the fields from a to oflags instead. |
Bump __FreeBSD_version.
Reshuffle vm_page members as suggested.
Inline all the vm_radix functions that are wrappers for pctrie functions, leaving only a bit of storage management code in vm_radix.c.
Planned changes will make the vm_radix search functions more than just pctrie wrappers; when those changes happen, some of those functions might be moved back into vm_radix.c, but that can be argued about later.
He who complains about this change says that he wants the pctrie code to be 'more like the rbtree code'. I can't guess how much this change will satisfy him, or whether he will ever find the time to express his thoughts here.
sys/vm/vm_radix.h | ||
---|---|---|
37 | Why exactly do we need the vm_page.h include? As far as I can see, we just need the typedef for vm_page_t, which comes from vm.h. (Actually that typedef is also present in sys/types.h for some reason.) | |
69 | I believe these __unused annotations are unneeded in the presence of __inline. Is that false? | |
83 | Extra blank line. |
sys/vm/vm_radix.h | ||
---|---|---|
37 | We need some definition of where pindex lies within the vm_page struct. However, all who include this header seem to have included vm_page.h previously (except vm_radix.c), so I can remove it here if I add it to vm_radix.c (and I have). | |
69 | You're right. I was copying the usage from sys/pctrie.h, but it doesn't seem necessary. So I dropped it. |
I ran tests for 24 hours with D41344.id126971.diff. This on amd64 with a DEBUG kernel.
No problems seen.
Sharing the source code, similar to the RB tree macros, is good; sharing the relatively small amount of machine code is unimportant, and not really a performance optimization. Sharing the subr_pctrie machine code requires either "fixups" in the generated code, for example,
static __inline struct type * \ name##_PCTRIE_VAL2PTR(uint64_t *val) \ { \ \ if (val == NULL) \ return (NULL); \ return (struct type *) \ ((uintptr_t)val - __offsetof(struct type, field)); \ } \
and
static __inline __unused struct type * \ name##_PCTRIE_LOOKUP(struct pctrie *ptree, uint64_t key) \ { \ \ return name##_PCTRIE_VAL2PTR(pctrie_lookup(ptree, key)); \ } \
or rearrangement of the fields to enable the compiler to optimize out the "fixups". The RB tree-style of implementation requires nothing like this.
That said, the radix trie is different from the RB tree in that the interior nodes of the radix trie are identical regardless of the leaf type. So, shared machine code for some helper functions on the interior nodes, e.g., a "find leaf" function, could make sense. But, I argue for any code that needs to access fields of the leaf being generated by the macros so that no "fixups" or reordering of fields in the leaves are required.
For the most part, I am fine with committing this version, but I would prefer to see the reordering of the fields left out for now. First and foremost, it's not clear that the "optimization" that motivates moving the order field is, in fact, an optimization. It will need to tested before that can really be known. Second, I'm hoping that the refactoring of the radix trie code to support that optimization will move the code in the direction that I describe in the previous paragraph, where neither "fixups" nor field reordering are needed.
As @alc requested, drop the changes to the vm_page struct definition and the increment of the version number that accompanied it.