ipmi: use a queue for kcs driver requests when possible
Closed · Public

Authored by chs on Sep 13 2022, 11:23 PM.

Details

Reviewers

jhb
imp

Commits

rGf0f3e3e961d3: ipmi: use a queue for kcs driver requests when possible

Summary

The ipmi watchdog pretimeout action can trigger unintentionally in
certain rare, complicated situations. What we have seen at Netflix
is that the BMC can sometimes be sent a continuous stream of
writes to port 0x80, and due to what is a bug or misconfiguration
in the BMC software, this results in the BMC running out of memory,
becoming very slow to respond to KCS requests, and eventually being
rebooted by its own internal watchdog. While that is going on in
the BMC, back in the host OS, a number of requests are pending in
the ipmi request queue, and the kcs_loop thread is working on
processing these requests. All of the KCS accesses to process
those requests are timing out and eventually failing because the
BMC is responding very slowly or not at all, and the kcs_loop thread
is holding the IPMI_IO_LOCK the whole time that is going on.
Meanwhile the watchdogd process in the host is trying to pat the
BMC watchdog, and this process is sleeping waiting to get the
IPMI_IO_LOCK. It's not entirely clear why the watchdogd process
is sleeping for this lock, because the intention is that a thread
holding the IPMI_IO_LOCK should not sleep and thus any thread
that wants the lock should just spin to wait for it. My best guess
is that the kcs_loop thread is spinning waiting for the BMC to
respond for so long that it is eventually preempted, and during
the brief interval when the kcs_loop thread is not running,
the watchdogd thread notices that the lock holder is not running
and sleeps. When the kcs_loop thread eventually finishes processing
one request, it drops the IPMI_IO_LOCK and then immediately takes the
lock again so it can process the next request in the queue.
Because the watchdogd thread is sleeping at this point, the kcs_loop
always wins the race to acquire the IPMI_IO_LOCK, thus starving
the watchdogd thread. The callout for the watchdog pretimeout
would be reset by the watchdogd thread after its request to the BMC
watchdog completes, but since that request is never processed, the
pretimeout callout eventually fires, even though there is nothing
actually wrong with the host.

To prevent this saga from unfolding:

- when kcs_driver_request() is called in a context where it can sleep, queue the request and let the worker thread process it rather than trying to process it in the original thread.
- add a new high-priority queue for driver requests, so that the watchdog patting requests will be processed as quickly as possible even if lots of application requests have already been queued.
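
The two changes above can be sketched as a small userspace model. This is not the code from the commit: struct req, struct softc_model, fifo_push(), fifo_pop(), next_request() and driver_request() are hypothetical names invented for illustration, and locking, sleeping and the actual KCS I/O are left out. It only shows the intended ordering: driver requests go to their own queue, the worker drains that queue first, and a caller that cannot sleep is told to process the request itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's request structure. */
struct req {
	int id;
	struct req *next;
};

/* Minimal singly linked FIFO. */
struct fifo {
	struct req *head;
	struct req *tail;
};

/* Hypothetical model of the per-device state: two request queues. */
struct softc_model {
	struct fifo highpri;	/* driver requests, e.g. watchdog pats */
	struct fifo normal;	/* application requests */
};

/* Append a request to the tail of a queue. */
static void
fifo_push(struct fifo *q, struct req *r)
{
	assert(r != NULL);
	r->next = NULL;
	if (q->tail != NULL)
		q->tail->next = r;
	else
		q->head = r;
	q->tail = r;
}

/* Remove and return the request at the head, or NULL if empty. */
static struct req *
fifo_pop(struct fifo *q)
{
	struct req *r = q->head;

	if (r != NULL) {
		q->head = r->next;
		if (q->head == NULL)
			q->tail = NULL;
		r->next = NULL;
	}
	return (r);
}

/* Worker side: always drain the high-priority queue first. */
static struct req *
next_request(struct softc_model *sc)
{
	struct req *r = fifo_pop(&sc->highpri);

	return (r != NULL ? r : fifo_pop(&sc->normal));
}

/*
 * Driver side: if the caller may sleep, hand the request to the worker
 * via the high-priority queue and return true.  Otherwise return false,
 * meaning the caller has to process the request in its own context.
 */
static bool
driver_request(struct softc_model *sc, struct req *r, bool can_sleep)
{
	if (!can_sleep)
		return (false);
	fifo_push(&sc->highpri, r);
	return (true);
}
```

In this model a watchdog pat submitted after several application requests is still the next request the worker picks up, which is the property the second change relies on.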

With these two changes, the watchdog pretimeout action does not trigger
even if the BMC is completely out to lunch for long periods of time
(as long as the watchdogd check command does not also get stuck).

Diff Detail

Repository

rG FreeBSD src repository

Event Timeline

chs created this revision. Sep 13 2022, 11:23 PM

Herald added a subscriber: imp. Sep 13 2022, 11:23 PM

chs requested review of this revision. Sep 13 2022, 11:23 PM

Harbormaster completed remote builds in B47370: Diff 110532. Sep 13 2022, 11:23 PM

imp added inline comments. Sep 19 2022, 5:24 PM

sys/dev/ipmi/ipmi_kcs.c
529	what happens to requests that are queued when we panic?

chs added inline comments. Oct 11 2022, 10:47 PM

sys/dev/ipmi/ipmi_kcs.c
529	any request that is still in the queue at the time of the panic will not be processed, because the worker thread will never run again. if a request is in the middle of being processed by the worker thread when the panic occurs, then the host and the BMC will probably get out of sync in the KCS protocol and not be able to communicate. in both cases the behavior after this patch is effectively the same as the current code.