
Reimplement rangelocks
ClosedPublic

Authored by kib on Sep 8 2023, 6:51 PM.

Details

Summary

Using the algorithms from https://doi.org/10.1145/3342195.3387533
As a performance optimization, until a conflict is detected, a trivial sleepable sx-like lock is used (not a full-fledged sx, to avoid blowing up the vnode size).

I have not yet measured the scalability systematically, but for cheat mode the line is almost the same as with the vnode interlock, and real rangelock mode seems to scale instead of producing a flat line.
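
For illustration, here is a minimal userspace C11 sketch of the cheat idea: the head word either holds a pointer to the lock queue or, when a low tag bit is set, a tiny reader-count/write-bit word acting as the trivial sx-like lock. The names and bit layout are assumptions modeled on the discussion below, not the committed kernel code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RL_CHEAT_CHEATING	0x1u	/* head holds flags, not a pointer */
#define RL_CHEAT_WLOCKED	0x2u	/* cheat mode: write-locked */
#define RL_CHEAT_RDSHIFT	3	/* reader count lives above the flag bits */

struct rangelock_sketch {
	_Atomic uintptr_t head;		/* pointer to the queue, or cheat word */
};

/*
 * Try to take a cheat-mode read lock; on conflict the caller falls back
 * to the real range-lock queue.
 */
static bool
cheat_read_trylock(struct rangelock_sketch *rl)
{
	uintptr_t x, nx;

	x = atomic_load_explicit(&rl->head, memory_order_relaxed);
	for (;;) {
		if ((x & RL_CHEAT_CHEATING) == 0 || (x & RL_CHEAT_WLOCKED) != 0)
			return (false);	/* conflict: use the queue instead */
		nx = x + ((uintptr_t)1 << RL_CHEAT_RDSHIFT);	/* one more reader */
		if (atomic_compare_exchange_weak_explicit(&rl->head, &x, nx,
		    memory_order_acquire, memory_order_relaxed))
			return (true);
	}
}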

Tested by: pho

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib marked 2 inline comments as done.Sep 15 2023, 4:14 PM

Another batch of Mark's notes.

Did you test this on arm64 or any other weakly ordered arch?

Am I right that if a thread successively locks small ranges, starting from the end of the file and proceeding to the beginning, we will not clean up marked entries on the rangelock queue?

I wonder if some periodic mechanism, e.g., VFS_SYNC, should try to clean freed entries from vnode rangelocks.

sys/kern/kern_rangelock.c
133

If there is no other synchronization for the corresponding memory accesses, then I expect the barriers really are needed.

384

trylock has the meaning "don't attempt to acquire sleepqueue spinlocks," but why? It should be harmless to do so. Consider that if both a reader and a writer simultaneously trylock a range, with the current implementation it is possible for both of them to fail.

389

I don't quite understand this path: we already marked the newly inserted entry, how is it guaranteed that we will not return RL_LOCK_SUCCESS later?

391

These three lines could be replaced by cur = rl_e_unmark(rl_q_load(prev));. There is another instance below.
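For context, a minimal sketch of what such load/unmark helpers look like with low-bit pointer tagging; the actual rl_q_load()/rl_e_unmark() may differ in detail (this is an assumption, not the real code):

#include <stdatomic.h>
#include <stdint.h>

#define RL_MARKED	((uintptr_t)1)	/* entry is logically removed */

static inline uintptr_t
rl_q_load_sketch(_Atomic uintptr_t *nextp)
{
	/* Acquire so the entry's fields are visible before we follow it. */
	return (atomic_load_explicit(nextp, memory_order_acquire));
}

static inline uintptr_t
rl_e_unmark_sketch(uintptr_t v)
{
	return (v & ~RL_MARKED);	/* strip the removal tag bit */
}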

417
422

This condition is always false on the first iteration, right?

434

Why do you check whether cur is marked here? This is already checked once in the cur != NULL case above. Does it actually help to repeat the check before locking and checking again for a third time?

kib marked 7 inline comments as done.Oct 3 2023, 7:11 PM

Did you test this on arm64 or any other weakly ordered arch?

No

Am I right that if a thread successively locks small ranges, starting from the end of the file and proceeding to the beginning, we will not clean up marked entries on the rangelock queue?

I wonder if some periodic mechanism, e.g., VFS_SYNC, should try to clean freed entries from vnode rangelocks.

If all of them are writes, then no, indeed, since rl_w_validate() goes from the head to the inserted lock.

I think a less drastic measure can be taken, e.g. rl_w_validate() could continue the loop after cur == e becomes true, doing only cleanup, and only sometimes. But I have not tried this; let's finish with the current review first.

sys/kern/kern_rangelock.c
133

Rangelocks do not sync access to memory. They sync file data manipulations, and there is enough synchronization to ensure the visibility of file data cached in memory.

384

I think the weaker semantics are fine there, but I fixed the issue you pointed out.

389

Yes, I need to unlock e only when committed to returning an error.

391

rl_e_unmark() asserts that the entry is marked. I added an _unchecked() version, ok.

422

You mean that the head (not the first element in the list) is never marked. Yes, this is true.

kib marked 5 inline comments as done.

Another batch of changes.

olce added inline comments.
sys/kern/kern_rangelock.c
71–73

I think I see why you put this one after RL_CHEAT_CHEATING, although this doesn't match the numerical order.

I would add a comment about RL_CHEAT_CHEATING, something like "Tag indicating that 'head' is not a pointer but contains flags."

And probably one also for RL_CHEAT_MASK, something like: "Mask of bit flags in 'head' when 'RL_CHEAT_CHEATING' is set."
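Concretely, with those comments the definitions might read roughly as follows (the values and their placement are assumptions; only the comments are the point):

/* Tag indicating that 'head' is not a pointer but contains flags. */
#define RL_CHEAT_CHEATING	0x1
/* Mask of bit flags in 'head' when 'RL_CHEAT_CHEATING' is set. */
#define RL_CHEAT_MASK		0x7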

I think I'm ok with the diff now. My main complaint is that there is no bound on the number of garbage queue entries, see below.

In D41787#959376, @kib wrote:

Am I right that if a thread successively locks small ranges, starting from the end of the file and proceeding to the beginning, we will not clean up marked entries on the rangelock queue?

I wonder if some periodic mechanism, e.g., VFS_SYNC, should try to clean freed entries from vnode rangelocks.

If all of them are writes, then no, indeed, since rl_w_validate() goes from the head to the inserted lock.

I think a less drastic measure can be taken, e.g. rl_w_validate() could continue the loop after cur == e becomes true, doing only cleanup, and only sometimes. But I have not tried this; let's finish with the current review first.

Hmm, why can't rangelock_unlock() clean up its own entry? Because it doesn't know which entry belongs to it (rl_q_owner is INVARIANTS-only)? Or is it because the design is trying to improve scalability by reducing the number of threads that might be performing queue operations? Or something else?

sys/kern/kern_rangelock.c
133

What mechanism provides this synchronization for, e.g., shm objects?

345

Assert that all entries are marked?

400

Can the broadcast be conditional on lock->sleepers? If lock->sleepers is false, then we scan the sleepqueue hash chain needlessly, I think.

568

This test can move into the if (cur != NULL) block.

sys/kern/kern_rangelock.c
345

Never mind, I forgot (again) that rl_e_unmark() asserts that the entry is marked.

sys/kern/kern_rangelock.c
566

You might as well load cur earlier, in the check for an empty queue.

I may be missing something obvious, but I don't quite understand the need for rl_q_free. Isn't the purpose of UMA's SMR zones exactly to handle deferred freeing? In other words, why isn't it possible to replace pieces of code inserting unlocked entries into rl_q_free with straight calls to uma_zfree_smr()? Is it because rl_insert(), and thus these pieces, are called in an SMR section?

It is too bad that the list has to be traversed from the start in rl_w_validate(). Using a doubly-linked list would allow starting from e, but has several drawbacks as well (concurrency would be harder, more memory consumption). Does anyone have an idea of how many rangelocks there typically are on a single vnode on average and at worst?

sys/kern/kern_rangelock.c
105

Why use PRI_USER instead of 0 here? In other words, why boost awakened user threads?

133

I don't think there is any need for acquire/release semantics for cheat locks, since AFAIU their whole state is contained in the head field (and also in the sleepqueue when draining, but that part is protected by the sleepqueue chain lock), so there are no synchronization needs beyond those already provided by atomic reads, fcmpset operations, and mutex lock/unlock.

198–210

(Micro-)optimization: Acquiring the sleepqueue (chain) lock could be performed only when there are waiters to wake up. This amounts to pushing sleepq_lock() down into the if (x == 0) then-block, adding a pairing sleepq_release() call there, and removing all the other sleepq_release() calls.
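
A sketch of the suggested shape, with a pthread mutex standing in for the sleepqueue chain lock and a plain counter standing in for the cheat word; names and structure are illustrative assumptions, not the kernel code:

#include <stdatomic.h>
#include <pthread.h>

static _Atomic unsigned int cheat_readers;
static pthread_mutex_t sleepq_chain_stub = PTHREAD_MUTEX_INITIALIZER;

static void
cheat_read_unlock_draining_sketch(void)
{
	unsigned int x;

	/* Drop our reader reference without holding the chain lock. */
	x = atomic_fetch_sub_explicit(&cheat_readers, 1,
	    memory_order_release) - 1;
	if (x == 0) {
		/*
		 * Last reader: only now take the chain lock and wake up the
		 * draining thread (the wakeup itself is elided here).
		 */
		pthread_mutex_lock(&sleepq_chain_stub);
		pthread_mutex_unlock(&sleepq_chain_stub);
	}
}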

293–295

I don't understand how the use of the RL_CHEAT_* flags mandates any non-natural alignment. Could you please explain why?

471–472

How can rl_q_next on a newly created entry that has just been put on the queue be marked? I don't think this is possible.

473–474

Same as above at start of rl_r_validate(), but for a different reason (head of list can't be marked).

563

I doubt there is a benefit in doing the raw comparison before the CAS.

kib marked 7 inline comments as done.Oct 4 2023, 5:08 PM

I think I'm ok with the diff now. My main complaint is that there is no bound on the number of garbage queue entries, see below.

Hmm, why can't rangelock_unlock() clean up its own entry? Because it doesn't know which entry belongs to it (rl_q_owner is INVARIANTS-only)? Or is it because the design is trying to improve scalability by reducing the number of threads that might be performing queue operations? Or something else?

To clean an entry, we need a stable pointer to the previous entry in the lock list. We do know which entry we clear, but to find the previous one, in order to CAS-remove the unlocked entry, we need to walk the list from the start.
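
To illustrate why the walk from the start is needed, here is a hedged sketch of CAS-unlinking marked entries from a lock-free singly linked list; the marking convention (low bit of an entry's next pointer marks it as logically removed) is an assumption modeled on the discussion, not the actual implementation:

#include <stdatomic.h>
#include <stdint.h>

struct rl_entry_sketch {
	_Atomic uintptr_t next;
	/* range start/end, owner, ... */
};

#define MARKED	((uintptr_t)1)

static void
sweep_marked_sketch(_Atomic uintptr_t *head)
{
	_Atomic uintptr_t *prevp = head;
	uintptr_t cur, next, expected;

	while ((cur = atomic_load_explicit(prevp, memory_order_acquire)) != 0) {
		struct rl_entry_sketch *e =
		    (struct rl_entry_sketch *)(cur & ~MARKED);

		next = atomic_load_explicit(&e->next, memory_order_acquire);
		if ((next & MARKED) != 0) {
			/* e is logically removed: try to unlink it from prev. */
			expected = cur;
			if (!atomic_compare_exchange_strong_explicit(prevp,
			    &expected, next & ~MARKED,
			    memory_order_acq_rel, memory_order_acquire))
				continue;	/* lost a race; re-read prev's next */
			/* e would now be handed to deferred (SMR) freeing. */
		} else {
			prevp = &e->next;	/* advance: e becomes the new prev */
		}
	}
}

The unlink is a CAS on prevp, i.e. on the previous entry's next pointer, which is why the unlocking thread alone (holding only its own entry) cannot do it without rediscovering prev from the head.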

In D41787#959693, @olce.freebsd_certner.fr wrote:

I may be missing something obvious, but I don't quite understand the need for rl_q_free. Isn't the purpose of UMA's SMR zones exactly to handle deferred freeing? In other words, why isn't it possible to replace pieces of code inserting unlocked entries into rl_q_free with straight calls to uma_zfree_smr()? Is it because rl_insert(), and thus these pieces, are called in an SMR section?

Because the lock list is traversed unlocked. You cannot call uma_zfree_smr() from an SMR section anyway.

It is too bad that the list has to be traversed from the start in rl_w_validate(). Using a doubly-linked list would allow starting from e, but has several drawbacks as well (concurrency would be harder, more memory consumption). Does anyone have an idea of how many rangelocks there typically are on a single vnode on average and at worst?

How would you implement a lock-free doubly-linked list? There are some designs, but they are so hard that it is not practical.

There is exactly one rangelock on a vnode for its whole lifetime. The number of locks taken in parallel is completely determined by userspace. I think it is practically huge for consumers like postgres/any other database/varnish, but typically at most one for 'trivial' consumers like the usual unix text processing tools.

sys/kern/kern_rangelock.c
133

Page busy, same as for pgread() on tmpfs/ufs.

198–210

If we are draining, there are waiters.

293–295

On 32-bit arches, pointers are aligned only to the word (4 bytes). Cheat locks require the 3 lower bits to be useful.

In fact RL_CHEAT_CHEATING was strategically placed in the lowest bit, which cannot be non-zero for a real pointer, but still.

563

It avoids the atomic for the case of a non-empty list. It also pre-populates the cache line.
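
A minimal sketch of the "compare, then compare-and-swap" idiom being discussed: a plain load decides the common case and the atomic RMW runs only when it can succeed, so a non-empty list costs no atomic operation and the load warms the cache line. Names are assumptions, not the actual code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static bool
set_head_if_empty_sketch(_Atomic uintptr_t *head, uintptr_t newe)
{
	uintptr_t cur = atomic_load_explicit(head, memory_order_acquire);

	if (cur != 0)
		return (false);		/* non-empty: skip the CAS entirely */
	return (atomic_compare_exchange_strong_explicit(head, &cur, newe,
	    memory_order_acq_rel, memory_order_acquire));
}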

kib marked 4 inline comments as done.Oct 4 2023, 6:18 PM
kib added inline comments.
sys/kern/kern_rangelock.c
133

Also note that for a non-cheating lock, the acquire and release atomics for lock and unlock currently occur on different memory locations. For acq, it is the next pointer of the previous lock entry, and for rel it is the next pointer of the current lock entry. So this effectively depends on our acq/rel implementations being seq cst (which they are at least on x86 and arm; on arm due to store-release total ordering).
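
A minimal C11 illustration of the point: a release store to one object paired with an acquire load of a different object creates no happens-before edge under the language memory model, which is why the code leans on the stronger per-architecture ordering. The names are placeholders, not the kernel code:

#include <stdatomic.h>

static _Atomic int prev_next;	/* stands in for the previous entry's next */
static _Atomic int e_next;	/* stands in for this entry's next */
static int protected_data;

static void
unlock_side(void)
{
	protected_data = 42;
	/* Release on e_next ... */
	atomic_store_explicit(&e_next, 1, memory_order_release);
}

static void
lock_side(void)
{
	/* ... but acquire on a different object, prev_next: the C11 model
	 * by itself implies no ordering for protected_data here. */
	if (atomic_load_explicit(&prev_next, memory_order_acquire) == 1)
		(void)protected_data;
}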

kib marked 7 inline comments as done.Oct 4 2023, 7:10 PM
kib added inline comments.
sys/kern/kern_rangelock.c
71–73

Isn't it all explained in the header comment?

105

Because the thread already paid a penalty by not getting access to the resource immediately. So when it is given CPU access after sleep, it is fair to increase its priority, both to compensate for the competition and to facilitate use of the resource so it is freed quicker. The latter is achieved by not competing in scheduling with threads that have not yet experienced a block.

kib marked 2 inline comments as done.

Do not forget kick_proc0().
Only wakeup if there are waiters.
Minor review tweaks.

sys/kern/kern_rangelock.c
481

What happens if two threads attempt to unmark the same node? Presumably one of them will trip the assertion in rl_e_unmark().

531

Same here, I think two threads might both try to unmark the same node.

kib marked 2 inline comments as done.Oct 4 2023, 7:26 PM
kib added inline comments.
sys/kern/kern_rangelock.c
481

Isn't next a local var? If another thread unmarked the memory (or really, removed the entry from the list), the CAS below fails and we retry the iteration.

markj added inline comments.
sys/kern/kern_rangelock.c
481

Oops, yes.

This revision is now accepted and ready to land.Oct 4 2023, 7:30 PM
In D41787#959772, @kib wrote:

Because the lock list is traversed unlocked. You cannot call uma_zfree_smr() from an SMR section anyway.

Then that sounds like a useful new feature to add to uma_zfree_smr(), if technically possible. Something for later.

How would you implement a lock-free doubly-linked list? There are some designs, but they are so hard that it is not practical.

Yes, that would be my fear. But browsing the list from the start can easily be wasteful with lots of ranges. I'll think about it.

The number of locks taken in parallel is completely determined by userspace. I think it is practically huge for consumers like postgres/any other database/varnish, but typically at most one for 'trivial' consumers like the usual unix text processing tools.

Yes, this is what my question was really about, thanks. If there are lots of ranges, the previous point could become a problem.

sys/kern/kern_rangelock.c
47–50

I think the text is slightly misleading (the last part describes the "cheating" mode, not the other one). Here are suggestions to amend it. (You'll have to refill the paragraph.)

51–53

Suggestions to improve clarity.

71–73

Indeed. I just missed it initially. Error between chair and keyboard.

105

This is already handled by ULE, although probably not to the same extent, and generally should probably be left for the scheduler to figure out on its own. But I understand this is still existing practice, and changing that is out-of-scope for this review.

198–210

Sorry, the first sentence was wrong in expressing my original idea. There are waiters indeed, but we are not going to signal them unless we are the last waiter. So I think the rest of the point is valid: there is no need to take the lock if we are not the last waiter (i.e., we are not going to signal), and the fcmpset doesn't need a lock.

293–295

Cheat locks require the 3 lower bits to be useful.

They need several more, depending on the number of readers (logarithmically).

In fact RL_CHEAT_CHEATING was strategically placed in the lowest bit, which cannot be non-zero for a real pointer, but still.

Exactly. This requires only 2-byte alignment. And since RL_CHEAT_CHEATING gates the use of all other flags, 4- or 8-byte alignment doesn't matter.

Could you please simply use UMA_ALIGNOF() instead (or whatever is the proper idiom for natural alignment)?

563

I thought the "compare and compare and swap" idiom would have been long gone by 2023, with CPUs performing the relevant optimizations, but maybe not after all.

kib marked 9 inline comments as done.Oct 4 2023, 11:15 PM
kib added inline comments.
sys/kern/kern_rangelock.c
105

Specifically, this is not handled by the scheduler, and I do not see how it could be handled, unless we pre-assign a priority to each wait channel.

kib marked an inline comment as done.

When unlocking read-locked cheat and draining, only lock sleepqueue on the last wakeup.
Editorial changes in comments.

This revision now requires review to proceed.Oct 4 2023, 11:18 PM

Thanks!

sys/kern/kern_rangelock.c
105

It is "handled", indirectly, in the sense that a (user) thread waiting for a resource will have its sleep time increase, which will then influence the final priority ULE assigns to it at wakeup (normally, increase it). This takes care of the first point ("compensate for the competition") in a different way, imperfectly for sure, but still in a much better manner than explicitly setting an absolute priority, which is also a globally unfair action. That an action is needed for the second point rather reflects that we are missing a mechanism to ensure fairness for received rangelock requests across draining. I don't think it necessarily needs to be added now (or perhaps at all), time and tests will tell.

This revision is now accepted and ready to land.Oct 5 2023, 7:21 AM
sys/kern/kern_rangelock.c
105

Note that this priority parameter is effectively ignored by ULE by default. See the handling of static_boost in sched_sleep().

kib marked an inline comment as done.Oct 5 2023, 11:56 PM
kib added inline comments.
sys/kern/kern_rangelock.c
105

I do not see it ignored. static_boost is PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE, and if the current thread priority is larger than that, then the boost is applied.

So this is strange in fact, because interactive threads are put after non-interactive ones (as classified by ULE) after off-CPU events, and all PVM/PVFS... priorities are higher compared with timeshare.

sys/kern/kern_rangelock.c
105

But note that sched_sleep() does sched_prio(td, static_boost) in this case, not sched_prio(td, prio). So this merely ensures that an awoken timesharing thread will get some preference over other timesharing threads.

sys/kern/kern_rangelock.c
105

Ah indeed. I had already analyzed it a while ago but didn't connect the dots. Thanks.

A sleeping thread's priority during sleep is normally used when a sleepqueue chooses a thread to wake up (unless SLEEPQ_UNFAIR is passed) via sleepq_signal(). Here, sleepq_broadcast() is used and will wake threads in the order in which they were put to sleep, regardless of their priorities. Then it is up to the scheduler to choose CPUs and possibly preempt existing threads there (and possibly preempt threads it has just started as it wakes up more sleeping ones, if they are numerous). Low-priority threads may be started before high-priority ones, but probably not by a large margin. This is probably fair enough (efficiency could be improved).

This gives me the opportunity to correct part of what I said above:

That an action is needed for the second point rather reflects that we are missing a mechanism to ensure fairness for received rangelock requests across draining.

In fact, ensuring fairness is not only a concern across draining, but in general when serving range requests. And there doesn't seem to be an obvious problem here, given what's above. The only "problem" is if some resource (the range lock here) is held for too long by a low-priority thread, but there, given we are talking about long-term sleep ("sleepable" in FreeBSD parlance), I don't think borrowing priorities is a good answer. The only answers I can see for now that make sense would be to penalize such a thread at the *next* resource acquisition, and/or, more drastically, to forcibly break the lock and send a signal to the offending thread (with default action, e.g., to kill the process).

kib marked an inline comment as done.

Rebase after swapout removal.

This revision now requires review to proceed.Aug 1 2024, 12:47 AM