Page MenuHomeFreeBSD

Fix getblk() with GB_NOCREAT returning false-negatives.
ClosedPublic

Authored by bdrewery on Jan 27 2021, 6:06 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Jan 14, 1:49 AM
Unknown Object (File)
Dec 14 2024, 7:55 AM
Unknown Object (File)
Nov 18 2024, 12:58 AM
Unknown Object (File)
Nov 6 2024, 10:20 PM
Unknown Object (File)
Sep 29 2024, 8:03 AM
Unknown Object (File)
Sep 29 2024, 4:56 AM
Unknown Object (File)
Sep 28 2024, 7:49 PM
Unknown Object (File)
Sep 28 2024, 9:21 AM

Details

Summary

It is possible for a buf to be reassigned between the dirty and clean
lists while gbincore_unlocked() looks in each list. Avoid creating
a buffer in that case and fallback to a locked lookup.

This fixes a regression from r363482.

Reported by: Suraj.Raju at dell.com
Submitted by: Suraj.Raju at dell.com, cem, [based on both]
MFC after: 2 weeks
Sponsored by: Dell EMC

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

This revision is now accepted and ready to land.Jan 27 2021, 6:22 PM

This approach seems safe but obviously makes the lookup twice as slow when the buffer doesn't exist. I'm not sure we could do better easily...

Do we also need to consider reverting r363483? Or are we sure that incore() is okay or doesn't matter for some reason?
https://cgit.freebsd.org/src/commit/?id=81dc6c2c61a70c21b61ca296fb64dfd7b215a77e


...musing about whether we could do better, feel free to ignore:

I think we could cook up an atomic scheme to count reassigns started and finished, but I doubt it is worth the complexity at this point and it might just punish reassignbuf() instead. Something like

finished = atomic_load_acq(bo_reassign_finished)
/* ... Do lookup in both tries ... */
if (found)
    return bp
atomic_thread_fence_acq();
started = atomic_load(bo_reassign_started)
if (started == finished)
	return ENOENT /* No reassign was in progress */
else
	return EAGAIN /* Need the bo lock */

And corresponding code in reassignbuf():

atomic_add(&bo_reassign_started, 1);
atomic_thread_fence_rel();
/* ... Do reassign ... */
atomic_add_rel(&bo_reassign_finished, 1);

I think incore was already always racy? The race scenarios may have changed slightly.

I think I mentioned this talking to Bryan but I’m not sure we actually need separate clean and dirty buf tries— we might be better off with a single trie, and two linked lists for enumeration. That’s a big refactor and there may be reasons it isn’t feasible.

In D28375#634847, @cem wrote:

I think I mentioned this talking to Bryan but I’m not sure we actually need separate clean and dirty buf tries— we might be better off with a single trie, and two linked lists for enumeration. That’s a big refactor and there may be reasons it isn’t feasible.

Yeah, we need the clean/dirty tailq linkage, but if we could get away with just one pctrie that would be great. That would also get rid of the free/alloc entirely during reassignbuf(). From grepping, only vfs_subr.c actually accesses bv_root, so that's encouraging, and struct bufv doesn't look too bad either. It's looking feasible to me...

In D28375#634847, @cem wrote:

I think I mentioned this talking to Bryan but I’m not sure we actually need separate clean and dirty buf tries— we might be better off with a single trie, and two linked lists for enumeration. That’s a big refactor and there may be reasons it isn’t feasible.

Yeah, we need the clean/dirty tailq linkage, but if we could get away with just one pctrie that would be great. That would also get rid of the free/alloc entirely during reassignbuf(). From grepping, only vfs_subr.c actually accesses bv_root, so that's encouraging, and struct bufv doesn't look too bad either. It's looking feasible to me...

I think we use the tries to enable fast insertion into the sorted lists?

One possibility to unify the two tries might be to augment each node with a bitmap, with one bit per slot, and maintain the invariant that a bit is set iff the subtree rooted at that slot contains at least one dirty buf, or the slot points directly to a dirty buf. Then we can still quickly look up the dirty buf with smallest LBN that's larger than a given LBN. That's still not ideal for implementing lookups of clean bufs since all bufs in a subtree may be dirty. We could use a second bitmap to handle that case or maybe there's some more clever way.

Do we also need to consider reverting r363483? Or are we sure that incore() is okay or doesn't matter for some reason?
https://cgit.freebsd.org/src/commit/?id=81dc6c2c61a70c21b61ca296fb64dfd7b215a77e

Hmm. It's not really obvious to me that all callers handle false negatives correctly. Some callers simply call getblk() in that case, but others like ufs_bmaparray() are more complicated.

In D28375#634847, @cem wrote:

I think I mentioned this talking to Bryan but I’m not sure we actually need separate clean and dirty buf tries— we might be better off with a single trie, and two linked lists for enumeration. That’s a big refactor and there may be reasons it isn’t feasible.

Yeah, we need the clean/dirty tailq linkage, but if we could get away with just one pctrie that would be great. That would also get rid of the free/alloc entirely during reassignbuf(). From grepping, only vfs_subr.c actually accesses bv_root, so that's encouraging, and struct bufv doesn't look too bad either. It's looking feasible to me...

I think we use the tries to enable fast insertion into the sorted lists?

Ah, yeah, I see. (And we actually use the order?)

One possibility to unify the two tries might be to augment each node with a bitmap, with one bit per slot, and maintain the invariant that a bit is set iff the subtree rooted at that slot contains at least one dirty buf, or the slot points directly to a dirty buf. Then we can still quickly look up the dirty buf with smallest LBN that's larger than a given LBN. That's still not ideal for implementing lookups of clean bufs since all bufs in a subtree may be dirty. We could use a second bitmap to handle that case or maybe there's some more clever way.

Right, if you did it with two bits or two bitmaps, you can track each of dirty and clean per subtree that way, but either way you still have to do the log(n) scan.

I guess an alternate solution could be to convert those tailqs to rb trees + separate tail tracking. That wouldn't take any more memory or hurt the big-O time complexity, but I imagine in practice the rb insert would be more expensive than the pctrie lookup + tailq insert for a large bufobj.

In D28375#634847, @cem wrote:

I think I mentioned this talking to Bryan but I’m not sure we actually need separate clean and dirty buf tries— we might be better off with a single trie, and two linked lists for enumeration. That’s a big refactor and there may be reasons it isn’t feasible.

Yeah, we need the clean/dirty tailq linkage, but if we could get away with just one pctrie that would be great. That would also get rid of the free/alloc entirely during reassignbuf(). From grepping, only vfs_subr.c actually accesses bv_root, so that's encouraging, and struct bufv doesn't look too bad either. It's looking feasible to me...

I think we use the tries to enable fast insertion into the sorted lists?

Ah, yeah, I see. (And we actually use the order?)

There are some operations on ranges of bufs which rely on this. v_inval_buf_range_locked() for instance. They can drop the bufobj lock between iterations so it doesn't seem very easy to maintain a cursor for the pctrie.

One possibility to unify the two tries might be to augment each node with a bitmap, with one bit per slot, and maintain the invariant that a bit is set iff the subtree rooted at that slot contains at least one dirty buf, or the slot points directly to a dirty buf. Then we can still quickly look up the dirty buf with smallest LBN that's larger than a given LBN. That's still not ideal for implementing lookups of clean bufs since all bufs in a subtree may be dirty. We could use a second bitmap to handle that case or maybe there's some more clever way.

Right, if you did it with two bits or two bitmaps, you can track each of dirty and clean per subtree that way, but either way you still have to do the log(n) scan.

Which scan do you mean exactly?

SMR might also be a complication for an augmented trie, you'd want to order updates so that unlocked lookups don't get false negatives.

I guess an alternate solution could be to convert those tailqs to rb trees + separate tail tracking. That wouldn't take any more memory or hurt the big-O time complexity, but I imagine in practice the rb insert would be more expensive than the pctrie lookup + tailq insert for a large bufobj.

The RB_ENTRY linkage structure is also three pointers rather than two, so there's some extra memory overhead if one were to go with the tree.h implementation. That said, reassignbuf() is the reason we pre-allocate buf trie nodes (see the uma_prealloc() call in vntblinit()) and make the zone NOFREE, so taking an extra pointer per buf is not the end of the world IMO.

For insertion, the extra overhead might be tolerable but indeed I'd worry about large devvps.

...

One possibility to unify the two tries might be to augment each node with a bitmap, with one bit per slot, and maintain the invariant that a bit is set iff the subtree rooted at that slot contains at least one dirty buf, or the slot points directly to a dirty buf. Then we can still quickly look up the dirty buf with smallest LBN that's larger than a given LBN. That's still not ideal for implementing lookups of clean bufs since all bufs in a subtree may be dirty. We could use a second bitmap to handle that case or maybe there's some more clever way.

Right, if you did it with two bits or two bitmaps, you can track each of dirty and clean per subtree that way, but either way you still have to do the log(n) scan.

Which scan do you mean exactly?

If I understand the pctrie bitmap proposal correctly:

Any walk of just the dirty or clean list becomes O(log(n + m) + n) where n is the number of entries in the list of interest and n + m is the sum of the number of entries from both lists, or in other words, finding the next element becomes O(log(n + m)) instead of O(1). The situation where you'd notice this would be when m >> n. In code, this is git grep "TAILQ_FOREACH.*bo_\(dirty\|clean\)".

...

One possibility to unify the two tries might be to augment each node with a bitmap, with one bit per slot, and maintain the invariant that a bit is set iff the subtree rooted at that slot contains at least one dirty buf, or the slot points directly to a dirty buf. Then we can still quickly look up the dirty buf with smallest LBN that's larger than a given LBN. That's still not ideal for implementing lookups of clean bufs since all bufs in a subtree may be dirty. We could use a second bitmap to handle that case or maybe there's some more clever way.

Right, if you did it with two bits or two bitmaps, you can track each of dirty and clean per subtree that way, but either way you still have to do the log(n) scan.

Which scan do you mean exactly?

If I understand the pctrie bitmap proposal correctly:

Any walk of just the dirty or clean list becomes O(log(n + m) + n) where n is the number of entries in the list of interest and n + m is the sum of the number of entries from both lists, or in other words, finding the next element becomes O(log(n + m)) instead of O(1). The situation where you'd notice this would be when m >> n. In code, this is git grep "TAILQ_FOREACH.*bo_\(dirty\|clean\)".

Oh, I was envisioning keeping the separate lists and just merging the pctries. The bits let you quickly find the next dirty or clean buf after a given LBN, so buf_vlist_add() would stay more or less the same, and you'd keep the lists for iterating over successive bufs.

In D28375#634847, @cem wrote:

I think I mentioned this talking to Bryan but I’m not sure we actually need separate clean and dirty buf tries— we might be better off with a single trie, and two linked lists for enumeration. That’s a big refactor and there may be reasons it isn’t feasible.

Yeah, we need the clean/dirty tailq linkage, but if we could get away with just one pctrie that would be great. That would also get rid of the free/alloc entirely during reassignbuf(). From grepping, only vfs_subr.c actually accesses bv_root, so that's encouraging, and struct bufv doesn't look too bad either. It's looking feasible to me...

I think we use the tries to enable fast insertion into the sorted lists?

Ah, yeah, I see. (And we actually use the order?)

There are some operations on ranges of bufs which rely on this. v_inval_buf_range_locked() for instance. They can drop the bufobj lock between iterations so it doesn't seem very easy to maintain a cursor for the pctrie.

Right, okay. Although, amusingly, v_inval_buf_range_locked in particular must not be very sensitive to this because for some reason it doesn't do a lookup and then a TAILQ_FOREACH_FROM_SAFE, and we're surviving. We should fix that.

...

I guess an alternate solution could be to convert those tailqs to rb trees + separate tail tracking. That wouldn't take any more memory or hurt the big-O time complexity, but I imagine in practice the rb insert would be more expensive than the pctrie lookup + tailq insert for a large bufobj.

The RB_ENTRY linkage structure is also three pointers rather than two, so there's some extra memory overhead if one were to go with the tree.h implementation. That said, reassignbuf() is the reason we pre-allocate buf trie nodes (see the uma_prealloc() call in vntblinit()) and make the zone NOFREE, so taking an extra pointer per buf is not the end of the world IMO.

Yeah, I forgot the parent pointer.

...

One possibility to unify the two tries might be to augment each node with a bitmap, with one bit per slot, and maintain the invariant that a bit is set iff the subtree rooted at that slot contains at least one dirty buf, or the slot points directly to a dirty buf. Then we can still quickly look up the dirty buf with smallest LBN that's larger than a given LBN. That's still not ideal for implementing lookups of clean bufs since all bufs in a subtree may be dirty. We could use a second bitmap to handle that case or maybe there's some more clever way.

Right, if you did it with two bits or two bitmaps, you can track each of dirty and clean per subtree that way, but either way you still have to do the log(n) scan.

Which scan do you mean exactly?

If I understand the pctrie bitmap proposal correctly:

Any walk of just the dirty or clean list becomes O(log(n + m) + n) where n is the number of entries in the list of interest and n + m is the sum of the number of entries from both lists, or in other words, finding the next element becomes O(log(n + m)) instead of O(1). The situation where you'd notice this would be when m >> n. In code, this is git grep "TAILQ_FOREACH.*bo_\(dirty\|clean\)".

Oh, I was envisioning keeping the separate lists and just merging the pctries. The bits let you quickly find the next dirty or clean buf after a given LBN, so buf_vlist_add() would stay more or less the same, and you'd keep the lists for iterating over successive bufs.

Ah, I'm with you now. Yes, that way finding the next clean/dirty buf would be no more expensive than a pctrie lookup.

SMR might also be a complication for an augmented trie, you'd want to order updates so that unlocked lookups don't get false negatives.

Not sure I follow this though. Why would unlocked lookups even look at the bitmap? If you wanted to do a lookup in clean or dirty, you can just do a regular lookup, then check if the buf is clean or dirty, can't you? We don't currently have any unlocked lookup of just clean or just dirty anyway, do we?

Hmm. It's not really obvious to me that all callers handle false negatives correctly. Some callers simply call getblk() in that case, but others like ufs_bmaparray() are more complicated.

incore has always been able to produce false negatives, FWIW. It's fundamentally racy. I think it's mostly used as a heuristic for avoiding unnecessary prefetch.

SMR might also be a complication for an augmented trie, you'd want to order updates so that unlocked lookups don't get false negatives.

Not sure I follow this though. Why would unlocked lookups even look at the bitmap? If you wanted to do a lookup in clean or dirty, you can just do a regular lookup, then check if the buf is clean or dirty, can't you? We don't currently have any unlocked lookup of just clean or just dirty anyway, do we?

You're right, this is a non-issue.

In D28375#634909, @cem wrote:

Hmm. It's not really obvious to me that all callers handle false negatives correctly. Some callers simply call getblk() in that case, but others like ufs_bmaparray() are more complicated.

incore has always been able to produce false negatives, FWIW. It's fundamentally racy. I think it's mostly used as a heuristic for avoiding unnecessary prefetch.

That's true, but it's possible for callers to depend on the instantaneous result being correct because some external synchronization guarantees that the result is stable. For instance, some lock might serialize calls to getblk() for a particular vnode, so holding that lock ensures that incore() returns a stable result. I'm worried about false negatives caused by reassignbuf() in cases like this.

To be clear, everyone is ok with this being committed for now and improvements being worked on and tested for later?

To be clear, everyone is ok with this being committed for now and improvements being worked on and tested for later?

Yes, we should definitely commit this fix for 13.0.

In D28375#634909, @cem wrote:

incore has always been able to produce false negatives, FWIW. It's fundamentally racy. I think it's mostly used as a heuristic for avoiding unnecessary prefetch.

That's true, but it's possible for callers to depend on the instantaneous result being correct because some external synchronization guarantees that the result is stable. For instance, some lock might serialize calls to getblk() for a particular vnode, so holding that lock ensures that incore() returns a stable result. I'm worried about false negatives caused by reassignbuf() in cases like this.

Sure.

To be clear, everyone is ok with this being committed for now and improvements being worked on and tested for later?

Yes.

To be clear, everyone is ok with this being committed for now and improvements being worked on and tested for later?

Yes.