rdmsr_safe/wrmsr_safe: handle pcb_onfault nesting
ClosedPublic
Actions

Authored by avg on Jan 28 2024, 9:37 PM.

Details

Reviewers

kib
jhb
markj

Commits

rG5d24ae53b622: rdmsr_safe/wrmsr_safe: handle pcb_onfault nesting
rG007b84e6c159: rdmsr_safe/wrmsr_safe: handle pcb_onfault nesting
rG486b265a8fb6: rdmsr_safe/wrmsr_safe: handle pcb_onfault nesting

Summary

rdmsr_safe and wrmsr_safe can be called while pcb_onfault is already
set, so the functions are modified to preserve the handler rather than
resetting it before returning.

One case where that happens is when AMD microcode update routine
is executed on a stack where copyin / copyout was already active.

Here is a sample panic message from a crash caused by resetting the
handler:

<118>Updating CPU Microcode...

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x11ed0de6000
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80c2df03
stack pointer           = 0x28:0xfffffe01ce4a4c70
frame pointer           = 0x28:0xfffffe01ce4a4c70
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 117 (logger)
trap number             = 12
panic: page fault
cpuid = 3
time = 1681462027
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff80615deb = db_trace_self_wrapper+0x2b/frame 0xfffffe01ce4a4830
kdb_backtrace() at 0xffffffff80943c77 = kdb_backtrace+0x37/frame 0xfffffe01ce4a48e0
vpanic() at 0xffffffff808f5fe5 = vpanic+0x185/frame 0xfffffe01ce4a4940
panic() at 0xffffffff808f5da3 = panic+0x43/frame 0xfffffe01ce4a49a0
trap_fatal() at 0xffffffff80c31849 = trap_fatal+0x379/frame 0xfffffe01ce4a4a00
trap_pfault() at 0xffffffff80c318b5 = trap_pfault+0x65/frame 0xfffffe01ce4a4a60
trap() at 0xffffffff80c30f5f = trap+0x29f/frame 0xfffffe01ce4a4b80
trap_check() at 0xffffffff80c31c29 = trap_check+0x29/frame 0xfffffe01ce4a4ba0
calltrap() at 0xffffffff80c07fd8 = calltrap+0x8/frame 0xfffffe01ce4a4ba0
--- trap 0xc, rip = 0xffffffff80c2df03, rsp = 0xfffffe01ce4a4c70, rbp = 0xfffffe01ce4a4c70 ---
copyout_nosmap_std() at 0xffffffff80c2df03 = copyout_nosmap_std+0x63/frame 0xfffffe01ce4a4c70
uiomove_faultflag() at 0xffffffff8095f0d5 = uiomove_faultflag+0xe5/frame 0xfffffe01ce4a4cb0
uiomove() at 0xffffffff8095efeb = uiomove+0xb/frame 0xfffffe01ce4a4cc0
pipe_read() at 0xffffffff80968860 = pipe_read+0x230/frame 0xfffffe01ce4a4d30
dofileread() at 0xffffffff809653cb = dofileread+0x8b/frame 0xfffffe01ce4a4d80
sys_read() at 0xffffffff80964fa0 = sys_read+0xc0/frame 0xfffffe01ce4a4df0
amd64_syscall() at 0xffffffff80c3221a = amd64_syscall+0x18a/frame 0xfffffe01ce4a4f30
fast_syscall_common() at 0xffffffff80c088eb = fast_syscall_common+0xf8/frame 0xfffffe01ce4a4f30
--- syscall (3, FreeBSD ELF64, read), rip = 0x11ece41cfaa, rsp = 0x11ecbec4908, rbp = 0x11ecbec4920 ---
Uptime: 41s

And another one:

Fatal trap 12: page fault while in kernel mode
cpuid = 4; apic id = 04
fault virtual address   = 0x800a22000
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80b2c7ca
stack pointer           = 0x28:0xfffffe01c55b5480
frame pointer           = 0x28:0xfffffe01c55b5480
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 68418 (pfctl)
trap number             = 12
panic: page fault
cpuid = 4
time = 1625184463
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff805c1e8b = db_trace_self_wrapper+0x2b/frame 0xfffffe01c55b5040
kdb_backtrace() at 0xffffffff808874b7 = kdb_backtrace+0x37/frame 0xfffffe01c55b50f0
vpanic() at 0xffffffff808449d8 = vpanic+0x188/frame 0xfffffe01c55b5150
panic() at 0xffffffff808445f3 = panic+0x43/frame 0xfffffe01c55b51b0
trap_fatal() at 0xffffffff80b300a5 = trap_fatal+0x375/frame 0xfffffe01c55b5210
trap_pfault() at 0xffffffff80b30180 = trap_pfault+0x80/frame 0xfffffe01c55b5280
trap() at 0xffffffff80b2f729 = trap+0x289/frame 0xfffffe01c55b5390
trap_check() at 0xffffffff80b304d9 = trap_check+0x29/frame 0xfffffe01c55b53b0
calltrap() at 0xffffffff80b0bb28 = calltrap+0x8/frame 0xfffffe01c55b53b0
--- trap 0xc, rip = 0xffffffff80b2c7ca, rsp = 0xfffffe01c55b5480, rbp = 0xfffffe01c55b5480 ---
copyout_nosmap_std() at 0xffffffff80b2c7ca = copyout_nosmap_std+0x15a/frame 0xfffffe01c55b5480
pfioctl() at 0xffffffff85539358 = pfioctl+0x4d28/frame 0xfffffe01c55b5940
devfs_ioctl() at 0xffffffff807176cf = devfs_ioctl+0xcf/frame 0xfffffe01c55b59a0
VOP_IOCTL_APV() at 0xffffffff80bb26e2 = VOP_IOCTL_APV+0x92/frame 0xfffffe01c55b59c0
VOP_IOCTL() at 0xffffffff80928014 = VOP_IOCTL+0x34/frame 0xfffffe01c55b5a10
vn_ioctl() at 0xffffffff80923330 = vn_ioctl+0xc0/frame 0xfffffe01c55b5b00
devfs_ioctl_f() at 0xffffffff80717bbe = devfs_ioctl_f+0x1e/frame 0xfffffe01c55b5b20
fo_ioctl() at 0xffffffff808abc6b = fo_ioctl+0xb/frame 0xfffffe01c55b5b30
kern_ioctl() at 0xffffffff808abc01 = kern_ioctl+0x1d1/frame 0xfffffe01c55b5b80
sys_ioctl() at 0xffffffff808ab982 = sys_ioctl+0x132/frame 0xfffffe01c55b5c50
syscallenter() at 0xffffffff80b30cc9 = syscallenter+0x159/frame 0xfffffe01c55b5ca0
amd64_syscall() at 0xffffffff80b309a5 = amd64_syscall+0x15/frame 0xfffffe01c55b5d30
fast_syscall_common() at 0xffffffff80b0c44e = fast_syscall_common+0xf8/frame 0xfffffe01c55b5d30

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 55670
Build 52559: arc lint + arc unit

Event Timeline

avg created this revision.Jan 28 2024, 9:37 PM

Herald added a subscriber: imp. · View Herald TranscriptJan 28 2024, 9:37 PM

avg requested review of this revision.Jan 28 2024, 9:37 PM

Harbormaster completed remote builds in B55670: Diff 133470.Jan 28 2024, 9:37 PM

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276426

This revision is now accepted and ready to land.Jan 29 2024, 4:52 AM

Let me clarify something. I do think that the change is right and should be committed. But it is not enough to make amd ucode update safe (and intel variant too it seems).

Problem is that the rendezvous IPI might interrupt a locked section, in which case witness reports non-sleepable lock ownership and we panic anyway. See trap_pfault() !TRAP_NOFAULTING path.
I highly suspect that we need to park other CPUs before issuing IPI. Also I remember there were some requirements on Broadwell about other SMT threads state when doing update, so it might be useful for other reason as well.

@kib , thank you for the feedback.

I had some work in progress on "graceful" CPU stopping / commandeering.
The idea was to start a high priority thread bound to each CPU except for a "master".
Once such a thread is running it enters a critical section to ensure that it is not preempted / descheduled.
Once all subordinate CPUs are taken over, the master can tell them (all or a subset) to execute a function.
If you think that adding such mechanism would be useful, I'll try to revive the patch.

In D43639#995220, @avg wrote:

@kib , thank you for the feedback.

I had some work in progress on "graceful" CPU stopping / commandeering.
The idea was to start a high priority thread bound to each CPU except for a "master".
Once such a thread is running it enters a critical section to ensure that it is not preempted / descheduled.
Once all subordinate CPUs are taken over, the master can tell them (all or a subset) to execute a function.
If you think that adding such mechanism would be useful, I'll try to revive the patch.

It sounds useful, but some details still need elaboration. For instance, a thread in critical section cannot take (restartable) faults, see td_critnest check in trap_pfault().

markj added inline comments.Jan 29 2024, 4:13 PM