Thread switching used to be atomic with respect to the current CPU's
tdq lock. Since commit 686bcb5c14ab, that is no longer the case. Now
sched_switch() does this:
1. lock tdq (might already be locked)
2. maybe put the current thread in the tdq, choose a new thread to run
   2a. update tdq_lowpri
3. unlock tdq
4. switch CPU context, update curthread
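In rough C, that ordering looks like the sketch below. This is a
simplified illustration of the steps above, not the actual sched_ule.c
code: sched_switch_sketch(), choose_next_thread() and switch_context()
are placeholder names, and the real function handles many more cases.

/*
 * Simplified sketch of the sched_switch() ordering described above.
 * choose_next_thread() and switch_context() are placeholders, not real
 * kernel interfaces.
 */
static void
sched_switch_sketch(struct tdq *tdq, struct thread *td, int flags)
{
        struct thread *newtd;

        TDQ_LOCK(tdq);                          /* 1. lock tdq (may already be held) */
        tdq_runq_add(tdq, td, flags);           /* 2. maybe requeue the current thread */
        newtd = choose_next_thread(tdq);        /* 2. choose a new thread to run */
        tdq->tdq_lowpri = newtd->td_priority;   /* 2a. update tdq_lowpri (simplified) */
        TDQ_UNLOCK(tdq);                        /* 3. unlock tdq */
        /*
         * Window: the tdq lock is dropped, but pc_curthread still
         * points at td until the context switch below completes.
         */
        switch_context(td, newtd);              /* 4. switch context, update curthread */
}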
Some code paths in ULE load pc_curthread from a remote CPU with that
CPU's tdq lock held, usually to inspect its priority. But as of the
aforementioned commit, this is racy.
The problem I noticed is in tdq_notify(), which optionally sends an IPI
to a remote CPU when a new thread is added to its runqueue. If the new
thread's priority is higher (i.e., numerically lower) than the currently
running thread's priority, then we deliver an IPI. But inspecting
pc_curthread->td_priority doesn't work, since pc_curthread might be
between steps 3 and 4 above. If pc_curthread's priority is higher than
that of the newly added thread, but pc_curthread is switching to a
lower-priority thread, then tdq_notify() might fail to deliver an IPI,
leaving a high-priority thread stuck on the runqueue for longer than it
should. This can cause multi-millisecond stalls in
interactive/ithread/realtime threads.
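To make the race concrete, the pre-fix check has roughly the shape of
the sketch below. This is a simplified, hypothetical rendering:
tdq_notify_racy() is not the real function name, and the actual
tdq_notify() also considers the remote CPU's idle state and coalesces
IPIs.

/*
 * Racy shape of the pre-fix priority test (simplified).  The remote
 * pc_curthread may be a thread that is already between steps 3 and 4
 * above, i.e. about to be switched out.
 */
static void
tdq_notify_racy(struct tdq *tdq, struct thread *newtd)
{
        struct thread *ctd;
        int cpu;

        cpu = TDQ_ID(tdq);
        ctd = pcpu_find(cpu)->pc_curthread;     /* racy remote load */
        if (newtd->td_priority < ctd->td_priority)  /* lower value = higher priority */
                ipi_cpu(cpu, IPI_PREEMPT);
        /*
         * If ctd has a high priority but is about to switch to a
         * low-priority thread, the IPI is skipped and newtd sits on
         * the runqueue until the next scheduling event.
         */
}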
Fix this problem by modifying tdq_add() and tdq_move() to return the
value of tdq_lowpri before the addition of the new thread. This ensures
that tdq_notify() has the correct priority value to compare against.
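Schematically, the fixed flow looks something like the sketch below;
the names tdq_add_sketch()/tdq_notify_sketch() and their signatures are
illustrative only, and the preemption test is reduced to a bare
comparison.

/* Return tdq_lowpri as it was before td was added (simplified). */
static int
tdq_add_sketch(struct tdq *tdq, struct thread *td, int flags)
{
        int lowpri;

        TDQ_LOCK_ASSERT(tdq, MA_OWNED);
        lowpri = tdq->tdq_lowpri;               /* value before the addition */
        if (td->td_priority < lowpri)
                tdq->tdq_lowpri = td->td_priority;
        tdq_runq_add(tdq, td, flags);
        return (lowpri);
}

/* Compare against the saved pre-addition lowpri, not pc_curthread. */
static void
tdq_notify_sketch(struct tdq *tdq, int lowpri)
{
        if (tdq->tdq_lowpri < lowpri)
                ipi_cpu(TDQ_ID(tdq), IPI_PREEMPT);
}

The key point is that lowpri is sampled while the tdq lock is held, so
it is consistent with whatever step 2a last wrote, regardless of whether
the remote CPU has finished step 4.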
The other two uses of pc_curthread are susceptible to the same race. To
fix the one in sched_rem()->tdq_setlowpri(), we need an exact value for
curthread. Thus, introduce a new tdq_curthread field in the tdq, updated
any time a new thread is selected to run on the CPU. Because this field
is synchronized by the thread lock, the priority of the thread it points
to reflects the correct lowpri value for the tdq.
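For illustration only, the consumer in tdq_setlowpri() then becomes
something like the sketch below; tdq_setlowpri_sketch() is a
placeholder name, and the details of where tdq_curthread is assigned
are omitted (it is set in the switch path while the chosen thread's
lock, i.e. the tdq lock, is still held).

/*
 * Sketch: with a tdq_curthread field kept up to date under the tdq
 * lock, recomputing tdq_lowpri no longer needs the racy pc_curthread.
 */
static void
tdq_setlowpri_sketch(struct tdq *tdq, struct thread *ctd)
{
        struct thread *td;

        TDQ_LOCK_ASSERT(tdq, MA_OWNED);
        if (ctd == NULL)
                ctd = tdq->tdq_curthread;       /* instead of pcpu_find()->pc_curthread */
        td = tdq_choose(tdq);                   /* highest-priority queued thread */
        if (td == NULL || td->td_priority > ctd->td_priority)
                tdq->tdq_lowpri = ctd->td_priority;
        else
                tdq->tdq_lowpri = td->td_priority;
}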
PR: 264867