
sched_ule: Re-implement stealing on top of runq common-code
Needs Review (Public)

Authored by olce on May 27 2024, 10:27 PM.
Tags
None
Subscribers

Details

Reviewers
jhb
markj
mav
jeff
Summary

Please see overview of project at D45393.

Stop using internal knowledge of runqueues, and remove duplicated code
(which sometimes existed in unnecessarily different variants).

runq_steal_from(): Remove the first-thread special case

This special case, introduced as early as the "ULE 3.0" commit
(ae7a6b38d53f, r171482, from July 2007), has no apparent justification.
All the reasons we can guess at are dubious at best. In the absence of
objections, let's just remove this twist, which has caused bugs in the past.
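
For context, the special case in question looks roughly like the sketch below. This is an illustrative paraphrase, not the exact sched_ule.c source: the run-queue traversal is condensed into a hypothetical iteration macro, and only the shape of the logic is kept. The first runnable thread encountered is remembered but passed over in favor of any later eligible thread, and is returned only as a last resort.

/*
 * Illustrative paraphrase of the first-thread special case in
 * runq_steal_from() (not the exact sched_ule.c code; the bitmap/queue
 * traversal is condensed into a hypothetical RUNQ_FOREACH_FROM macro).
 */
static struct thread *
runq_steal_from_sketch(struct runq *rq, int cpu, u_char start)
{
	struct thread *td, *first;

	first = NULL;
	RUNQ_FOREACH_FROM(rq, td, start) {
		if (first == NULL) {
			/* Remember the first thread, but do not steal it yet. */
			first = td;
			continue;
		}
		if (THREAD_CAN_MIGRATE(td) && THREAD_CAN_SCHED(td, cpu))
			return (td);	/* Prefer any later eligible thread. */
	}
	/* Fall back to the first thread only if nothing else qualified. */
	if (first != NULL && THREAD_CAN_MIGRATE(first) &&
	    THREAD_CAN_SCHED(first, cpu))
		return (first);
	return (NULL);
}

With the change proposed here, that fallback disappears: the first thread that can migrate and can run on the stealing CPU is returned like any other.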

Diff Detail

Repository: rG FreeBSD src repository
Lint: Skipped
Unit Tests: Skipped
Build Status: Buildable 59345 / Build 56232: arc lint + arc unit

Event Timeline

olce requested review of this revision. May 27 2024, 10:27 PM

I suspect that the first thread was skipped to avoid stealing a thread that was just scheduled to a CPU but was not yet able to run.

> This special case, introduced as early as the "ULE 3.0" commit
> (ae7a6b38d53f, r171482, from July 2007), has no apparent justification.
> All the reasons we can guess at are dubious at best. In the absence of
> objections, let's just remove this twist, which has caused bugs in the past.

The point was to move threads that are least likely to benefit from affinity because they are unlikely to run soon enough to take advantage of it. We leave a thread that may execute soon. I would want to see this patchset benchmarked together with a wide array of tests to make sure there's no regression in the totality of it. I don't feel particularly strongly about this case but taken together there is some chance of unintended consequences.

> In D45388#1035721, @mav wrote:
> I suspect that the first thread was skipped to avoid stealing a thread that was just scheduled to a CPU but was not yet able to run.

I don't think that's a possibility with the current code, right?

> The point was to move threads that are least likely to benefit from affinity because they are unlikely to run soon enough to take advantage of it. We leave a thread that may execute soon. I would want to see this patchset benchmarked together with a wide array of tests to make sure there's no regression in the totality of it. I don't feel particularly strongly about this case but taken together there is some chance of unintended consequences.

I agree this may improve affinity in some cases, but at the same time we don't really know when the next thread on the queue is to run. Not stealing in this case also amounts to slightly violating the expected execution ordering and fairness.

As for benchmarking: of course, this patchset needs wider testing. I'll need help with that, since I don't have that many different machines to test on, and it also has to be exercised with a variety of workloads. I plan to contact olivier@ to see whether he can test the patch set (perhaps slightly amended) at Netflix.


It does steal in that case, though; it just prefers the second thread over the first. We don't permanently skip the first thread.

If we have a pervasive clock with CPU-frequency granularity, we can do much better with affinity than I could when I wrote the scheduler using a 1 kHz or 128 Hz clock. Those periods are too long for any real measure of the impact of affinity. This would be a worthwhile set of improvements. Precise time counting could tell us whether it has been longer than X us since the thread was scheduled, which is likely a better filter for whether affinity is still valid.

> I agree this may improve affinity in some cases, but at the same time we don't really know when the next thread on the queue is to run. Not stealing in this case also amounts to slightly violating the expected execution ordering and fairness.

> It does steal in that case, though; it just prefers the second thread over the first. We don't permanently skip the first thread.

Yes, I'm aware. My sentence above should have read "(snip) Not stealing *it* in this case (snip)"; sorry for the confusion. In practice, however, barring explicit cpuset assignments, the first thread is always skipped, since steal_thresh is 2.
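
To make the steal_thresh point concrete, here is a hedged sketch of the gate involved (the helper name is made up; tdq_transferable and steal_thresh are the real field and tunable). The idle-stealing path only raids a remote run queue once it holds at least steal_thresh transferable threads, and steal_thresh defaults to 2, so by the time runq_steal_from() scans the queue there is normally a second candidate for the special case to prefer, and the first thread is in practice never the one taken.

/*
 * Hedged sketch (hypothetical helper name) of the gate discussed above:
 * with the default steal_thresh of 2, a remote run queue is only
 * considered for stealing when it has at least two transferable
 * threads, so skipping the first thread still leaves something to take.
 */
static bool
tdq_worth_stealing_from(struct tdq *steal)
{
	return (steal->tdq_transferable >= steal_thresh);
}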

> If we have a pervasive clock with CPU-frequency granularity, we can do much better with affinity than I could when I wrote the scheduler using a 1 kHz or 128 Hz clock. Those periods are too long for any real measure of the impact of affinity. This would be a worthwhile set of improvements. Precise time counting could tell us whether it has been longer than X us since the thread was scheduled, which is likely a better filter for whether affinity is still valid.

Yes, but I doubt that alone is enough (it may cause odd decisions in some cases). A priori, I think refinements such as the following are likely to be necessary to avoid odd behaviors:

  1. Whether the new thread comes from the same process as the previous one, or shares the same vmspace (see the sketch after this list).
  2. Whether one thread has a tendency to thrash the cache a lot, which could perhaps be measured with hardware performance counters (to eventually compute the percentage of cache thrashed per unit of time, and use it in conjunction with the runtime), or by gathering performance statistics for threads that are scheduled soon after they start running again (though I fear this second mechanism may not be practical).
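
A minimal sketch of the first refinement, assuming the previously running thread is still at hand when the candidate is examined (the helper and its call site are hypothetical):

/*
 * Hypothetical helper for refinement 1: treat affinity as more valuable
 * when the candidate shares a process or an address space (vmspace)
 * with the thread that last ran on the CPU.
 */
static bool
sched_shares_address_space(const struct thread *td, const struct thread *prev)
{
	return (td->td_proc == prev->td_proc ||
	    td->td_proc->p_vmspace == prev->td_proc->p_vmspace);
}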

And selecting a thread for affinity in violation of the "normal ordering" (which is not really well defined currently) should probably lead to penalizing it (e.g., by adding to its elapsed runtime after it runs). But this is probably going too far for the current discussion.
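
Tying this back to the precise-time idea above, the filter could take a shape like the following sketch. It is only an assumption: ts_lastrun and affinity_sbt do not exist today (the stamp would have to be taken with sbinuptime() when the thread stops running), and the cutoff value is an arbitrary placeholder for the "X us" mentioned earlier; ULE's existing affinity check is tick-based and much coarser.

/* Hypothetical cutoff standing in for the "X us" above. */
static sbintime_t affinity_sbt = 50 * SBT_1US;

/*
 * Hedged sketch of a precise-time affinity filter: ts_lastrun is a
 * hypothetical sbintime_t stamp recorded when the thread last stopped
 * running on its CPU.
 */
static bool
sched_affinity_still_valid(const struct td_sched *ts)
{
	return (sbinuptime() - ts->ts_lastrun < affinity_sbt);
}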