Page MenuHomeFreeBSD

newbus: Create a knob to disable devices that fail to attach.
ClosedPublic

Authored by imp on Nov 28 2022, 8:00 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Nov 9, 7:22 AM
Unknown Object (File)
Sep 16 2024, 10:03 PM
Unknown Object (File)
Sep 12 2024, 11:44 AM
Unknown Object (File)
Sep 5 2024, 12:12 AM
Unknown Object (File)
Aug 31 2024, 6:51 AM
Unknown Object (File)
Aug 9 2024, 5:18 AM
Unknown Object (File)
Jul 29 2024, 3:23 PM
Unknown Object (File)
Jul 26 2024, 2:10 AM
Subscribers

Details

Summary

Normally, when a device fails to attach, we tear down the newbus state
for that device so that future driver loads can try again (maybe with a
different driver, or maybe with a re-loaded and fixed kld).

Sometimes, however, it is desirable to have the device fail
permanantly. We do this by calling device_disable() on a failed
attached, as well as keeping the device in DS_ATTACHING forever. This
prevents retries on that device. This is enabled via
hw.bus.disable_failed_devices=1 in either a hint via the loader, or at
runtime with a sysctl setting. Setting from 1 -> 0 at runtime will not
affect previously disabled devices, however: they remain disabled.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

imp requested review of this revision.Nov 28 2022, 8:00 AM
imp created this revision.

Thanks for this. I tested this and confirmed that it works as described.

This revision is now accepted and ready to land.Nov 28 2022, 9:52 PM

Looks good.

sys/kern/subr_bus.c
72

static's are always zero. You don't need to set it equal to false :-) And false is always zero, and if you don't trust that compile time assert it.

sys/kern/subr_bus.c
2546

Note that this isn't quite true (you can devctl enable foo0 post-boot to undo device_disable without needing to reboot). I would be tempted to trim much of this paragraph as I'm not sure it helps anyone who might want to understand if they want to turn this knob on.

I think the use case Drew has is preserving unit numbers if some devices fail but others succeed for a given driver, correct? That might be a better reason to give here.

2550
2566–2567

Might as well fix the style bug since you are nearby

sys/kern/subr_bus.c
2546

My use case is to prevent repeated probes of the same device and was prompted by a bad Mellanox NIC. Though I have to say: preserving unit numbers is important as well.

Mlx5 has a rather large timeout for FW to initialize, and this was done out of the probe. On the bad NIC, the FW initialization was timing out, causing boot to be delayed. However, as subsequent drivers were initialized, the device was rematched to the mlx5 driver over and over and over again, with the probe routine being called multiple times. Due the repeated (very long) timeouts, boot was blocked for long enough to trigger a watchdog reset of the host, and the host appeared "bricked" until I was able to get a console and disable the device.

Maybe also check || cold ?

What would that accomplish. I'm not sure I understand how it would be useful. Seems like an orthogonal concept to me.

sys/kern/subr_bus.c
2567

Yea, I know this is gratuitous, but I thought it too trivial to split out.

sys/kern/subr_bus.c
2546

Were you kldloading additional drivers? We should only reprobe a device due to a kldload. During boot we should be probing devices only once with the set of drivers in the kernel + modules loaded by the loader. If you have a MINIMAL-style kernel with most drivers loaded by devmatch post-boot then I could see descending into this type of downward spiral. One of the reasons why I'm asking is we need a good explanation for why the default is the way it is. Right now my guess is that the main reason to keep the default is POLA/inertia. In a world where more drivers are loaded by devmatch it probably makes sense for this knob to default to on rather than off.

2567

Oh, I'm fine fixing it, I would just fully fix it. :)

sys/kern/subr_bus.c
2546

There's some drivers that aren't in our RED1 kernel that are automatically loaded by devmatch. when those are loaded, these reprobe and take forever...

I'm happy flipping the bit on the default, and I'll rework the comments and we can chat about it when I resubmit.

2567

I'm just going to fix this in -current and then rebase on top of that fix then :)

better comments to address review comments, hopefully.

This revision now requires review to proceed.Dec 1 2022, 4:34 AM
This revision is now accepted and ready to land.Dec 4 2022, 11:19 PM