x86: Add a required store-load barrier in cpu_idle()

ULE's tdq_notify() tries to avoid delivering IPIs to the idle thread.
In particular, it tries to detect whether the idle thread is running.
There are two mechanisms for this:
- tdq_cpu_idle, a machine-independent (MI) flag which is set prior to
  calling cpu_idle(). If tdq_cpu_idle == 0, then no IPI is needed;
- idle_state, an x86-specific state flag which is updated after
  cpu_idleclock() is called.
The implementation of the second mechanism is racy; the race can cause a
CPU to go to sleep with pending work. Specifically, cpu_idle_*() set
idle_state = STATE_SLEEPING, then check for pending work by loading the
tdq_load field of the CPU's runqueue. Because x86 allows a store to be
reordered after a later load, the idle thread can observe tdq_load == 0
while tdq_notify() observes idle_state == STATE_RUNNING. In that case
no IPI is sent and the CPU sleeps despite having runnable work.
Some counters indicate that the idle_state check in tdq_notify()
frequently elides an IPI, so the check is worth keeping. Fix the
problem by inserting a fence after the store to idle_state, immediately
before idling the CPU.
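
The ordering requirement can be sketched in portable C11. This is an
illustrative model only, not the kernel code: idle_try_sleep(),
notify_needs_ipi(), work_done() and idle_wake() are hypothetical names,
the two variables stand in for the per-CPU idle_state and the runqueue's
tdq_load, and the fence on the notifier side is an assumption of the
sketch (the pattern is only sound when both sides order their store
before their load).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the per-CPU state named in the text. */
enum { STATE_RUNNING, STATE_SLEEPING };
static atomic_int idle_state = STATE_RUNNING;
static atomic_int tdq_load = 0;

/*
 * Idle side: publish STATE_SLEEPING, then check for pending work.
 * The fence keeps the store from being reordered after the load.
 * Returns true if it is safe to sleep.
 */
bool
idle_try_sleep(void)
{
	atomic_store_explicit(&idle_state, STATE_SLEEPING,
	    memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* store-load barrier */
	if (atomic_load_explicit(&tdq_load, memory_order_relaxed) != 0) {
		/* Work arrived: stay awake. */
		atomic_store_explicit(&idle_state, STATE_RUNNING,
		    memory_order_relaxed);
		return (false);
	}
	return (true);
}

/*
 * Notifier side: publish the new load, then check the idle state.
 * Returns true if the target must be woken with an IPI.
 */
bool
notify_needs_ipi(void)
{
	atomic_fetch_add_explicit(&tdq_load, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* assumed in this sketch */
	return (atomic_load_explicit(&idle_state, memory_order_relaxed) ==
	    STATE_SLEEPING);
}

/* Model one unit of queued work being completed. */
void
work_done(void)
{
	atomic_fetch_sub_explicit(&tdq_load, 1, memory_order_relaxed);
}

/* Model the CPU leaving its sleep state. */
void
idle_wake(void)
{
	assert(atomic_load(&idle_state) == STATE_SLEEPING);
	atomic_store_explicit(&idle_state, STATE_RUNNING,
	    memory_order_relaxed);
}
```

With both fences in place, at least one side is guaranteed to see the
other's store: either the idle side sees tdq_load != 0 and stays awake,
or the notifier sees STATE_SLEEPING and sends the IPI. Without the
fence on the idle side, both can read stale values and the wakeup is
lost.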
PR: 264867
Reviewed by: mav, kib, jhb
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 03f868b163ad46d6f7cb03dc46fb83ca01fb8f69)