Page MenuHomeFreeBSD

vm_radix: offer insert with le lookup and fast replacement
Needs ReviewPublic

Authored by dougm on Jun 6 2024, 6:24 AM.
Tags
None
Referenced Files
F107902767: D45510.diff
Sun, Jan 19, 5:12 AM
Unknown Object (File)
Thu, Jan 9, 10:59 PM
Unknown Object (File)
Dec 13 2024, 12:11 PM
Unknown Object (File)
Oct 18 2024, 9:51 AM
Unknown Object (File)
Oct 7 2024, 5:18 AM
Unknown Object (File)
Sep 29 2024, 9:17 PM
Unknown Object (File)
Sep 21 2024, 4:48 AM
Unknown Object (File)
Sep 21 2024, 4:24 AM
Subscribers

Details

Reviewers
rlibby
alc
Summary

This is a first draft at half of what @alc suggested in a comment about D45486. It offers a way to insert a dummy page into the radix trie and retrieve mpred and record the address of the pointer to the dummy (vm_radix_insert_lookup_lt_parent). And it offers a function to quickly replace the dummy with a real page (vm_radix_fast_replace).

It is in no way tested (except for compilation). I just want to get it out there to see if it is of interest.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

dougm requested review of this revision.Jun 6 2024, 6:24 AM
dougm created this revision.

So this supports a split insert/lookup, correct? Like the idea is that it lets us do something with the previous entry before we complete the insert?

Wouldn't a dummy insert be a problem for unlocked lookups?

Could this help address batch operations too?

Here's a brain dump of what I had been thinking about a possible iterator, which is not very refined.

This is obviously a rough sketch, these may actually have more descriptive names, be macro generated with types, have different signatures, etc. And not all may be needed.

struct pctrie_iter {
	struct pctrie_node *cur;/* current inner */
	struct pctrie_node *prev;/* last inner, or ancestor, for peek_prev */
	struct pctrie_node *parent;/* current parent, for remove and next */
	uint64_t key;		/* current position */
	/* May need some state bits? */
};

/*
 * Initialize the iterator to an arbitrary position which may exist or not.
 * Very like the work of insert/lookup.
 */
void pctrie_iter_init(struct pctrie *ptree, struct pctrie_iter *pti,
    uint64_t key);

/* Like init, but actually advance to an entry, if any present. */
void pctrie_iter_init_next(struct pctrie *ptree, struct pctrie_iter *pti,
    uint64_t key);

/*
 * Like init, but assume's the iter is valid.  A short seek may be optimized.
 */
void pctrie_iter_seek(struct pctrie *ptree, struct pctrie_iter *pti,
    uint64_t key);

/* The entry at the current position */
uint64_t *pctrie_iter_peek(...);
/* The previous entry, may involve completing a lookup */
uint64_t *pctrie_iter_peek_prev(...);

/*
 * Advance iterator to next extant entry.  Return false if no more entries.  In
 * the iter, key is its key, cur is its node, prev is the previous entry's
 * node.  parent is cur's parent.
 *
 * The ge variant may not advance if there is an entry at the current key.
 *
 * In a dense trie, this would frequently only advance to cur's next child.
 * Tracking one parent would help avoid many re-lookups, and also support
 * remove().  But we would also regularly need to do re-lookup from the top
 * upon exhausting parent's children (though prev could still be valid).
 */
bool pctrie_iter_next(...);
bool pctrie_iter_next_ge(...);

/*
 * Rather than "peek", "peek_prev", and "next" they might be "current", "prev",
 * and "advance"?  Want to avoid confusing "next" meaning "advance" with "next"
 * meaning "peek_next", which I'm not sure would be needed.
 */

/* Insert, remove, or replace entry at current key position. */
int pctrie_iter_insert(...);
int pctrie_iter_remove(...);
int pctrie_iter_replace(...);

Then to do something like insert a run of pages you might do something like this. I'll ignore val2ptr/ptr2val to be brief.

pctrie_iter_init(rtree, &iter, pindex);
mpred = pctrie_iter_peek_prev(&iter);
for (i = 0 ; i < n; i++) {
	mpred = new_m = page_alloc(mpred);
	pctrie_iter_insert(&iter, new_m);
	pctrie_iter_seek(rtree, &iter, pindex + i);
}

A tree walk might look like this

pctrie_iter_init_next(ptree, &iter, 0);
for (;;) {
	x = pctrie_iter_peek(&iter);
	if (x == NULL) /* empty or exhausted tree.  a bit ugly... */
		break;
	foo(x);
	pctrie_iter_next(ptree, &iter);
}

This can obviously be improved, but maybe it shows a direction an iterator could take.

I think @markj may be interested too.

So this supports a split insert/lookup, correct? Like the idea is that it lets us do something with the previous entry before we complete the insert?

Yes.

Wouldn't a dummy insert be a problem for unlocked lookups?

Yes, absolutely. You'd have to call vm_radix_insert_lookup_lt_parent with a lock held, and call vm_radix_fast_replace before releasing the lock.

Could this help address batch operations too?

I didn't intend to address that problem here.

struct pctrie_iter {
struct pctrie_node *cur;/* current inner */
struct pctrie_node *prev;/* last inner, or ancestor, for peek_prev */
struct pctrie_node *parent;/* current parent, for remove and next */
uint64_t key; /* current position */
/* May need some state bits? */
};

It's not clear to me how the iterator could take two steps up the tree. You can use parent to get one step up, but how do you find the parent's parent? Only if every node stores it parent, I think, which would be a big change.

I would expect an iterator to consist of an array of maxdepth pointers and a parent index and a leaf index. Initialization means setting the first array element to root and the counter to 1. Searching updates the path, and insert/remove operations can happen in constant time at the bottom of the path, given the leaf index, and iteration happens by incrementing the leaf index to the next non-null child or, if there is none, backing up the path as far as necessary to take a step further right.

Update for recent overlapping commits.

struct pctrie_iter {
struct pctrie_node *cur;/* current inner */
struct pctrie_node *prev;/* last inner, or ancestor, for peek_prev */
struct pctrie_node *parent;/* current parent, for remove and next */
uint64_t key; /* current position */
/* May need some state bits? */
};

It's not clear to me how the iterator could take two steps up the tree. You can use parent to get one step up, but how do you find the parent's parent? Only if every node stores it parent, I think, which would be a big change.

Right, it wouldn't. You would just detect that you were at an edge and do a new lookup. The idea would be that it's amortized enough that occasional re-lookup doesn't hurt so much, and meanwhile short iteration remains cheaper. Would most iteration be more than 2 levels at once anyway?

I would expect an iterator to consist of an array of maxdepth pointers and a parent index and a leaf index. Initialization means setting the first array element to root and the counter to 1. Searching updates the path, and insert/remove operations can happen in constant time at the bottom of the path, given the leaf index, and iteration happens by incrementing the leaf index to the next non-null child or, if there is none, backing up the path as far as necessary to take a step further right.

Yes I think that would work, and probably has simpler logic. IIUC the maxdepth is about 16 or 21 levels (depending on arch, and I may be off by 1), so 128 or 84 bytes for the array. I guess that's not so bad.

I don't think it would necessarily change the API much.

I would expect an iterator to consist of an array of maxdepth pointers and a parent index and a leaf index. Initialization means setting the first array element to root and the counter to 1. Searching updates the path, and insert/remove operations can happen in constant time at the bottom of the path, given the leaf index, and iteration happens by incrementing the leaf index to the next non-null child or, if there is none, backing up the path as far as necessary to take a step further right.

Yes I think that would work, and probably has simpler logic. IIUC the maxdepth is about 16 or 21 levels (depending on arch, and I may be off by 1), so 128 or 84 bytes for the array. I guess that's not so bad.

Thinking more, I agree. Two pointers really only saves stack space, which is not so precious. A full depth stack of pointers is simpler and faster, and anyway reasonably bounded.

I would expect an iterator to consist of an array of maxdepth pointers and a parent index and a leaf index. Initialization means setting the first array element to root and the counter to 1. Searching updates the path, and insert/remove operations can happen in constant time at the bottom of the path, given the leaf index, and iteration happens by incrementing the leaf index to the next non-null child or, if there is none, backing up the path as far as necessary to take a step further right.

Yes I think that would work, and probably has simpler logic. IIUC the maxdepth is about 16 or 21 levels (depending on arch, and I may be off by 1), so 128 or 84 bytes for the array. I guess that's not so bad.

Thinking more, I agree. Two pointers really only saves stack space, which is not so precious. A full depth stack of pointers is simpler and faster, and anyway reasonably bounded.

And nesting iterators will probably be rare.

Here's a brain dump of what I had been thinking about a possible iterator, which is not very refined.

This is obviously a rough sketch, these may actually have more descriptive names, be macro generated with types, have different signatures, etc. And not all may be needed.

/*
 * Advance iterator to next extant entry.  Return false if no more entries.  In
 * the iter, key is its key, cur is its node, prev is the previous entry's
 * node.  parent is cur's parent.
 *
 * The ge variant may not advance if there is an entry at the current key.
 *
 * In a dense trie, this would frequently only advance to cur's next child.
 * Tracking one parent would help avoid many re-lookups, and also support
 * remove().  But we would also regularly need to do re-lookup from the top
 * upon exhausting parent's children (though prev could still be valid).
 */
bool pctrie_iter_next(...);
bool pctrie_iter_next_ge(...);

We might want some way to express that iteration should terminate at some index. There are lots of places in the VM code where we want something like that (or at least, that was my impression a while ago). The consumer can always check that itself, of course, but it might be possible to accomplish that more efficiently in the iterator layer?

A tree walk might look like this

pctrie_iter_init_next(ptree, &iter, 0);
for (;;) {
	x = pctrie_iter_peek(&iter);
	if (x == NULL) /* empty or exhausted tree.  a bit ugly... */
		break;
	foo(x);
	pctrie_iter_next(ptree, &iter);
}

When I made an attempt at this, I tried to come up with a set of VM_OBJECT_*FOREACH macros which covered the existing uses of the memq, but it was difficult to express everything I wanted. I think it's easy to write a macro for the example above though?