runq/sched: Switch to 256 distinct levels
Needs Review · Public

Authored by olce on May 27 2024, 10:28 PM.

Details

Reviewers
jhb
markj
mav
jeff
Group Reviewers
cam
Summary

Please see the overview of the project at D45393.

  1. Internal scheduling priorities: Always use symbolic ones without any hardcoded shift, so as to stay agnostic to a change in the semantics of priority differences. Differences of less than 4 (RQ_PPQ) are insignificant and are simply removed. No functional change (intended).
  2. zfs: spa: ZIO_TASKQ_ISSUE: Remove hardcoded shift, create a separate macro for its priority.
  3. Finally switch to 256 distinct levels by setting RQ_PPQ to 1 (builds upon the previous changes; see the sketch after this list).
  4. Bump __FreeBSD_version.
  5. Remove userland references to RQ_PPQ.
  6. Restrict <sys/runq.h> to kernel usage.
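
For reference, item 3 boils down to the following in <sys/runq.h> terms (an illustrative sketch only, with a hypothetical helper, not the exact diff):

```
/*
 * Illustrative sketch only, not the exact diff.  Historically, runq.h
 * grouped 4 consecutive priorities into each of 64 queues; after the
 * change, every one of the 256 priority levels gets its own queue.
 */
#define RQ_NQS  256                     /* number of run queues (was 64) */
#define RQ_PPQ  1                       /* priorities per queue (was 4) */

/* Hypothetical helper: priority-to-queue mapping.  With RQ_PPQ == 1 it
 * degenerates into the identity, so no priority information is lost. */
static inline int
runq_prio_to_idx(unsigned char pri)
{
        return (pri / RQ_PPQ);
}
```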

Diff Detail

Repository: rG FreeBSD src repository
Lint: Skipped
Unit Tests: Skipped
Build Status: Buildable 59347 / Build 56234: arc lint + arc unit

Event Timeline

olce requested review of this revision. May 27 2024, 10:28 PM

Differences of less than 4 (RQ_PPQ) are insignificant and are simply removed. No functional change (intended).

This is surely the easiest (least invasive) answer, but does that make it the right one? Simply doing nothing here would preserve the original code author's ideas. There could be some rationale that you are dropping. It would be nice to look at what it was, unless you've done that already.

PS: Changes to openzfs will need to be upstreamed.

In D45390#1035772, @mav wrote:

Differences of less than 4 (RQ_PPQ) are insignificant and are simply removed. No functional change (intended).

This is surely the easiest (least invasive) answer, but does that make it the right one? Simply doing nothing here would preserve the original code author's ideas. There could be some rationale that you are dropping. It would be nice to look at what it was, unless you've done that already.

There is one thing the original code author intended that we will have to validate. The relative impact of nice levels depends on their distance in the runq. ridx/idx (I'm not sure why they were renamed in one diff) march forward at a fixed rate in real time. Changing the number of queues and priorities changes that rate. I believe it will have the effect of increasing the impact of nice. We may need to change the way nice values are scaled to priorities to compensate.

In D45390#1035772, @mav wrote:

This is surely the easiest (least invasive) answer, but does that make it the right one? Simply doing nothing here would preserve the original code author's ideas. There could be some rationale that you are dropping. It would be nice to look at what it was, unless you've done that already.

As discussed in person, yes, I looked up most of these lines (in principle, all of them) and could not find any justification for them in commit messages or reviews. Some of them are actually as old as the 4.4BSD import, and it seems some others come from copy/paste between drivers. Moreover, priority specifications at sleep have not been obeyed by ULE for a while ('static_boost'). We could maybe contact some of the authors, but all in all, I don't think it's worth the trouble.

PS: Changes to openzfs will need to be upstreamed.

Sure.

There is one thing the original code author intended that we will have to validate. The relative impact of nice levels depends on their distance in the runq. ridx/idx (I'm not sure why they were renamed in one diff) march forward at a fixed rate in real time. Changing the number of queues and priorities changes that rate. I believe it will have the effect of increasing the impact of nice. We may need to change the way nice values are scaled to priorities to compensate.

The rename is to clear up a possible confusion. After this change, these fields of struct tdq no longer store the absolute index of a queue, but rather the offset of the "start" queue within the range assigned to the timesharing selection policy. Moreover, I found that a single letter of difference sometimes impairs quick reading, so I chose to be more explicit with _deq.

I don't think the relative impact of nice levels is changed. tdq_ts_deq_off (the old tdq_ridx) is not incremented by sched_clock() unless the queue it points to is empty, so the march forward, AFAIU, doesn't happen at a fixed rate, but rather depends on the time spent executing all queued threads (so at most the quantum times that number of threads, but usually less). Moreover, since the nice value is used in the computation of the *priority* to assign to a thread, that priority doesn't depend on the number of queues (as it should). Obviously, the final index of the chosen queue now changes (that's the point of this diff), but the only difference I can see is that the scheduler will now always run threads with a lower numerical priority sooner than those with a higher one, even if the priority difference is less than 4. Relative execution between these "clusters" of threads is unchanged. Do you agree?
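
To make that point concrete, here is a deliberately simplified model of the dependency chain (hypothetical names and bounds, not the actual sched_ule.c code): nice and recent %CPU determine the priority, and the queue count only enters when that priority is later mapped to a slot.

```
/*
 * Hypothetical, simplified model (not sched_ule.c): nice and recent %CPU
 * determine the timesharing *priority*; the number of run queues only
 * matters afterwards, when that priority is mapped to a queue slot.
 */
#define BATCH_PRI_MIN   100     /* illustrative bounds of the batch range */
#define BATCH_PRI_MAX   200

static int
batch_priority(int nice, int pct_cpu)   /* pct_cpu in [0, 100] */
{
        int pri = BATCH_PRI_MIN + pct_cpu + nice;

        if (pri < BATCH_PRI_MIN)
                pri = BATCH_PRI_MIN;
        else if (pri > BATCH_PRI_MAX)
                pri = BATCH_PRI_MAX;
        return (pri);   /* a queue index is derived from this, not the reverse */
}
```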

The runq index only moves at a rate slower than once per-tick when the system is very overloaded. This is how ULE implements priority decay. The difference in priority determines how many slices the lower (better) priority task will get for each slice the worse priority task will get.

This won't just affect nice but it will be most evident with it. Run two cpu hogs that never terminate and consume 100% cpu on the same core with cpuset. Let one be nice +20. Report the % cpu consumed by each before and after this patch set. I believe the nice +20 process should get 1/4 the cpu it was allotted before.
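
Restating the mechanism as a toy model (hypothetical names, not the actual kern/sched_ule.c code): threads are (re)inserted at an offset behind a moving head, and the head's advance is what implements the decay.

```
/*
 * Toy calendar queue: a thread's priority offset decides how far behind
 * the moving head it is (re)inserted, so a thread sitting k slots behind
 * waits through k head advances before being dequeued again.  Purely
 * illustrative, not the actual scheduler code.
 */
#define NQUEUES 64

static unsigned head;                           /* advances roughly once per tick */

static unsigned
calendar_slot(unsigned prio_offset)             /* 0 = best batch priority */
{
        return ((head + prio_offset) % NQUEUES);
}

static void
advance_head(void)
{
        head = (head + 1) % NQUEUES;            /* priority decay */
}
```

In this toy model, a thread reinserted k slots behind the head runs roughly once every k head advances, so two competing threads get slices roughly in the inverse ratio of their offsets, which is the "distance in queue space" effect described above.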

The runq index only moves at a rate slower than once per-tick when the system is very overloaded. This is how ULE implements priority decay. The difference in priority determines how many slices the lower (better) priority task will get for each slice the worse priority task will get.

You're absolutely right. I already knew that, but was so focused on the offset part that I forgot to consider how the head (the base index) moves (and indeed, unless the system has a high load, the head moves at tick frequency).

Two thoughts on that:

  1. It's easy to counterbalance the effect of the higher number of queues by just incrementing the head by 4 (the old RQ_PPQ) at each tick. I'll actually do that in the commit that switches to 256 queues, and then change it to 1 in a separate commit (unless you prefer to wait for that part), just so it is easier to bisect later. If you're OK with the second commit, this won't change the revision here (since both commits would be grouped into it anyway).
  2. I've always found niceness in FreeBSD to be way too weak, so I wouldn't be against giving it a stronger (and possibly configurable) effect. That said, the side effect of the change here is still far from being enough from my POV. When using nice -n 20, I would really want the process to get at most 10% of the CPU time other processes with a nice value of 0 would get; I'd even argue for going as far as 1%. My expectation would be that nice -n 20 should behave almost the same as using an idle priority. Another possible problem with our implementation of nice levels is that they are not logarithmic, which seems inconsistent with POSIX specifying that the nice() interface takes increments. That indeed hints at making increments have the same relative effect regardless of the value they are applied to, which by composition leads to a logarithmic scale (see the sketch below); it's what Linux does today, IIRC. So I think there is work to do in this area as well. It may make sense to avoid changing the nice behavior in this round of changes and save it for the time when we make the other ones. Grouping those could be... nicer on users.
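
As an illustration of the logarithmic-scale idea (hypothetical numbers, merely in the spirit of what Linux's nice-to-weight table does, and not existing FreeBSD code):

```
/*
 * Hypothetical logarithmic nice scale: each nice step scales a thread's
 * weight by a constant factor (1.25 here), so an increment has the same
 * relative effect wherever it is applied.  Not existing FreeBSD code.
 */
#include <math.h>

static double
nice_weight(int nice)                   /* nice in [-20, 20] */
{
        return (pow(1.25, -nice));      /* nice 0 -> 1.0 */
}

/*
 * Two competing threads would then get CPU time in proportion to their
 * weights: share_a = w_a / (w_a + w_b).  With this factor, nice 20 vs.
 * nice 0 gives roughly 1 / (1 + 1.25^20), i.e. about 1.1% of the CPU,
 * in line with the 1% figure mentioned above.
 */
```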

This won't just affect nice but it will be most evident with it. Run two cpu hogs that never terminate and consume 100% cpu on the same core with cpuset. Let one be nice +20. Report the % cpu consumed by each before and after this patch set. I believe the nice +20 process should get 1/4 the cpu it was allotted before.

I'll do the test just to be sure but I'm already convinced this is exactly what I'm going to observe.

Moving by 4 each tick will require you to evaluate 4 run queues each time you want to advance. You could instead maintain the same number of queues for timeshare and map the priority to index with /4 to maintain backwards compat and we could expand the number of queues and priorities later if necessary. I'm not convinced that more timeshare queues is actually a benefit. It may create more granularity than is useful. I've actually experimented with making NEEDRESCHED and preemption in the time share range less sensitive because it can cause extra context switches for no real gain. Small differences in priority are not that meaningful for timeshare. Should I preempt thread X because thread Y is ready to run and it was consuming 2% less cpu most recently? 5%? 10%?

The behavior of nice using the time-decay calendar queue algorithm is limited in the ratios it can express. The effect gives the less nice process proportionally more slices according to how far apart they are in queue space. How far apart they are depends both on cpu% and nice. However, if they get too close in priority space, the less nice thread will not really get extra full slices. ULE also supports dynamic slice sizes, which are currently only used to control worst-case latency, but they would also be an option for more granular control of nice impact.

I'm ok with changing the slope of response to nice as long as it is done on purpose and with discussion and measurement. I tried to approximately match what I got on 4BSD IIRC.

Moving by 4 each tick will require you to evaluate 4 run queues each time you want to advance.

True, but I don't see how that could be a problem or have any real influence on performance, if that is your concern. We'll just test 4 bits in the status words instead of 1, all of which are already in the L1 cache.
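
A hypothetical sketch of what that bit test amounts to (made-up names and layout, not the actual runq status-word code):

```
/*
 * Hypothetical sketch: before advancing the calendar head by 4, check
 * whether any of the 4 queues about to be skipped still holds threads,
 * using a bitmap with one bit per queue.  Not the actual runq code.
 */
#include <stdbool.h>
#include <stdint.h>

#define NQUEUES         256
#define BITS_PER_WORD   64

static uint64_t queue_status[NQUEUES / BITS_PER_WORD];  /* bit set => queue non-empty */

static bool
any_queue_nonempty(unsigned first, unsigned count)
{
        for (unsigned i = 0; i < count; i++) {
                unsigned q = (first + i) % NQUEUES;

                if (queue_status[q / BITS_PER_WORD] &
                    (UINT64_C(1) << (q % BITS_PER_WORD)))
                        return (true);
        }
        return (false);
}
```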

You could instead maintain the same number of queues for timeshare and map the priority to index with /4 to maintain backwards compat and we could expand the number of queues and priorities later if necessary.

I find this hacky, so I'd rather wait until we're sure there actually is a problem with the new approach before contemplating that. Please also see below.

I'm not convinced that more timeshare queues is actually a benefit. It may create more granularity than is useful. I've actually experimented with making NEEDRESCHED and preemption in the time share range less sensitive because it can cause extra context switches for no real gain. Small differences in priority are not that meaningful for timeshare. Should I preempt thread X because thread Y is ready to run and it was consuming 2% less cpu most recently? 5%? 10%?

You're mainly raising a concern here about the frequency of preemptions. An approach that tries to address preemption frequency by conflating the priority levels the scheduler assigns to threads has the following important drawbacks:

  1. It shadows the level assignment that the algorithm performs, which makes it much harder to evaluate and, in particular, to check that it serves its intended purpose.
  2. It doesn't allow finely controlling when preemptions occur, whereas inefficiencies due to preemption are mostly caused by a thread not running long enough to absorb the overhead of context switching. In this light, whether some thread was recently consuming more or less %CPU, and the consequence this has on the assigned timesharing priorities, doesn't appear to be a relevant metric for deciding whether preemption should occur.

Consequently, I'd rather favor an approach where preemption (still for timesharing) only occurs after some minimum time has elapsed. It seems to me the purpose of sched_slice_min is exactly to prevent that performance problem. So, unless it has another use that I don't see, we should use it. This hints at a modification to sched_clock()'s descheduling part at the very least (but also ast_scheduler() and maybe other places).
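
In other words, something along these lines (a hypothetical sketch of the proposal, not existing code; the real change would live around sched_clock()/ast_scheduler()):

```
/*
 * Hypothetical sketch of the proposal above, not existing sched_ule.c
 * code: within the timesharing range, only request a reschedule once the
 * running thread has accumulated a minimum number of ticks, so every
 * thread runs long enough to amortize the cost of a context switch.
 */
#include <stdbool.h>

static bool
should_resched_timeshare(int ticks_run, int min_slice_ticks,
    int cur_pri, int new_pri)
{
        if (ticks_run < min_slice_ticks)
                return (false);         /* too early: keep the current thread */
        return (new_pri < cur_pri);     /* lower numerical value == better priority */
}
```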

It is also worth noting that, currently, timeshare threads that become ready to run never forcibly preempt other currently running ones (as can be seen in sched_shouldpreempt(), given that preempt_thresh is normally PRI_MIN_KERN); re-scheduling happens only on return to userland, which IIUC occurs only at the next tick/interrupt for computationally intensive processes. I can certainly guess that doing it there allows re-scheduling without having to enter the kernel just for that purpose (since we are already in it). But if the goal here is also to throttle timesharing preemption, then that feels quite ad hoc and unstable duration-wise, and it doesn't obey sched_slice_min. Do you know/remember other reasons for that design?
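
For readers following along, here is a simplified illustration of the behavior just described (not a copy of sched_shouldpreempt(); the threshold value is a stand-in):

```
/*
 * Simplified illustration, not the actual sched_shouldpreempt(): with
 * preempt_thresh left at its usual kernel-priority setting, a newly
 * runnable thread causes an immediate preemption only if its priority is
 * in the kernel/interrupt range; timesharing threads never do, and get
 * picked up on the next return to user mode instead.
 */
#include <stdbool.h>

#define PRI_MIN_KERN_ILLUSTRATIVE 80    /* stand-in for PRI_MIN_KERN */

static int preempt_thresh = PRI_MIN_KERN_ILLUSTRATIVE;

static bool
would_preempt_now(int new_pri)          /* lower numerical value == better priority */
{
        return (new_pri <= preempt_thresh);
}
```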

Please have a look at D46566 for the recovery of nice and anti-starvation behavior.

This won't just affect nice but it will be most evident with it. Run two cpu hogs that never terminate and consume 100% cpu on the same core with cpuset. Let one be nice +20. Report the % cpu consumed by each before and after this patch set. I believe the nice +20 process should get 1/4 the cpu it was allotted before.

This is in fact far from being the case. After more analysis, it appears that, on the contrary, raising the rate at which the insertion offset (calendar queue head) changes actually decreases the effect of nice values.

The behavior of nice using the time-decay calendar queue algorithm is limited in the ratios it can express. The effect gives the less nice process proportionally more slices according to how far apart they are in queue space.

The current effect of nice values turns out to be more complex than that, and the relation is not proportional at all. I spent a fair amount of time trying to express their effect through an exact mathematical formula but couldn't, although I got relatively close according to the large number of experiments I ran on top of some ad-hoc debugging infrastructure showing the values of all the involved parameters over time. The nice effect doesn't depend only on how far apart the threads are in queue space (which is affected by nice priority scaling), but also on the rate of change of the insertion offset (calendar queue head) itself, on the remainder of the priority difference modulo that rate, and on the dynamic changes in the %CPU contribution to each thread's priority in reaction to its actual scheduling (since it is based on real %CPU, and not on runtime vs. sleep time as the interactivity score is), which feeds back into the priority difference. Looking closely at some cases, it appears that the resulting %CPU ratio between two competing threads does not reach a convergence point, but oscillates perpetually.

In any case, as already said, nice value handling is today mostly ineffective (a nice -20 task on the same CPU as a nice 20 task only gets 2/3 of the CPU!), and I tend to think it cannot be fixed satisfactorily (i.e., with enough flexibility) within the current framework (we could try to play with each thread's slice value, as you suggested, but that has limits too). I also think that anti-starvation through the calendar queue mechanism is unfair in the case of threads creating lots of other threads. I'll send you some direct messages with proposals, based on more fundamental changes, on how to fix both of these problems as well as the unfairness towards timesharing threads that don't use up their time slice.

Before that, though, I'd really like to get the current series in. Once you've agreed with D46566, do you see anything else problematic? If not, I'll see how to get the series tested at a wider scale.