Page MenuHomeFreeBSD

vmm: avoid spurios rendezvous
ClosedPublic

Authored by corvink on Nov 15 2022, 11:02 AM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Jan 23, 6:25 PM
Unknown Object (File)
Thu, Jan 23, 12:05 PM
Unknown Object (File)
Thu, Jan 23, 6:52 AM
Unknown Object (File)
Sat, Jan 18, 4:48 AM
Unknown Object (File)
Fri, Jan 17, 8:51 PM
Unknown Object (File)
Tue, Jan 14, 6:48 AM
Unknown Object (File)
Mon, Jan 13, 9:42 PM
Unknown Object (File)
Wed, Jan 8, 9:53 AM

Details

Summary

A vcpu only checks if a rendezvous is ongoing or not to decide if it should handle a rendezvous. This could lead to spurios rendezvous where a vcpu tries a handle a rendezvous it isn't part of. This situation is properly handled by vm_handle_rendezvous but it could potentially degrade the performance. Avoid that by an early check if the vcpu is part of the rendezvous or not.

At the moment, rendezvous are only used to spin up application processors and to send ioapic interrupts. Spinning up application processor is done in the guest boot phase by sending INIT SIPI sequences to single vcpus. This is known to cause spurious rendezvous but only occurs in the boot phase. Sending ioapic interrupts is rare because modern guest will use msi and the rendezvous is always send to all vcpus.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

gusev.vitaliy_gmail.com added inline comments.
sys/amd64/include/vmm.h
163

I think you don't need mutex here (&vm->rendezvous_mtx), because taking the mutex every time during check can be too expensive for VM with many VCPUs. So this code can introduce bottleneck.

Instead I would suggest to move that engine (rendezvous_req_cpus ) to atomic operations (CPU_SET_ATOMIC, etc) and barriers, it should be fast on reading (check) and safe.

panic: acquiring blockable sleep lock with spinlock or critical section help (sleep mutex) ... I hope the crashdump worked and I'll follow-up ..

bz requested changes to this revision.Nov 15 2022, 12:41 PM
panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) vm rendezvous lock @ ./machine/vmm.h:325
cpuid = 3
time = 1668515758
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0114f3c580
vpanic() at vpanic+0x151/frame 0xfffffe0114f3c5d0
panic() at panic+0x43/frame 0xfffffe0114f3c630
witness_checkorder() at witness_checkorder+0xd3e/frame 0xfffffe0114f3c7f0
__mtx_lock_flags() at __mtx_lock_flags+0x94/frame 0xfffffe0114f3c840
vmx_run() at vmx_run+0x91f/frame 0xfffffe0114f3c9a0
vm_run() at vm_run+0x223/frame 0xfffffe0114f3caa0
vmmdev_ioctl() at vmmdev_ioctl+0x507/frame 0xfffffe0114f3cb40
devfs_ioctl() at devfs_ioctl+0xcd/frame 0xfffffe0114f3cb90
vn_ioctl() at vn_ioctl+0x131/frame 0xfffffe0114f3cca0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe0114f3ccc0
kern_ioctl() at kern_ioctl+0x202/frame 0xfffffe0114f3cd30
sys_ioctl() at sys_ioctl+0x12a/frame 0xfffffe0114f3ce00
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe0114f3cf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0114f3cf30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x260d25b0833a, rsp = 0x260db3befe58, rbp = 0x260db3beff10 ---
KDB: enter: panic

Sadly we don't dump ddb output as part of the msgbuf; so show alllocks isn't there; sorry no serial console on that one to capture..

This revision now requires changes to proceed.Nov 15 2022, 12:41 PM
  • implement vcpu_rendezvous_pending without a mtx
sys/amd64/vmm/vmm.c
1447

This is a "point" of rendezvous. All vCPU threads are sitting here until the last is done.

vcpu_rendezvous_pending() will never have side effect on an old rendezvous_req_cpus value and therefore added OR/AND logic looks excessive.

2650

In general it should work. But in terms of atomicity, you can see a little window between setting rendezvous_req_cpus and then setting vm->rendezvous_func = func. Setting order is wrong. Reader can check rendezvous_req_cpus and see the bit set and then see rendezvous_func is NULL.

When all code is protected by mutex or spinlock, all changes in that section are seen as atomic (consistent) . But if we change data under protected mutex, but read (access) data without mutex protection, it can be inconsistent.

As solution you can move ATOMIC operation after 'vm->rendezvous_func = func' assignment. That would protect ordering for data writing.

Another alternative for atomic is use memory barriers:
a) between changing rendezvous_func and cpuset (setting rendezvous_func must be first)
b) after CPU_ISSET() in vcpu_rendezvous_pending

Also CPU_OR_ATOMIC can be replaced by CPU_SET_ATOMIC - adding AND/OR is excessive because it is rendezvous.

  • use memory barriers instead of atomic operations
corvink added inline comments.
sys/amd64/vmm/vmm.c
1447

There one minor mistake in your comment which causes the bug of bz. It's only guaranteed that a subset of all vcpu thread (those specified by rendezvous_req_cpus) are sitting here.

Yeah, I thought about it and you're right. Adding atomic isn't required here.

2650

You're right but I don't understand why it should be required to use a memory barrier after CPU_ISSET() in vcpu_rendezvous_pendig.

Btw: CPU_SET_ATOMIC wouldn't work properly for vms with many vcpu. It only sets a single long value. So, you can only set 64 vcpus. Look at it's implementation:

#define	__BIT_SET_ATOMIC(_s, n, p)					\
	atomic_set_long(&(p)->__bits[__bitset_word(_s, n)],		\
	    __bitset_mask((_s), n))

So the problem is:

  1. vcpu1 initiates a rendezvous with some set of vcpus
  2. vcpu2 does not belong to the set, but observes vcpu_rendezvous_pending() (an unlocked read of vm->rendezvous_func) is true
  3. vcpu2 calls vm_exit_rendezvous() but now vm->rendezvous_func is NULL because the rendezvous is already over, and this triggers an assertion failure

?

Why not simply remove the incorrect in assertion in vm_exit_rendezvous()? vm_handle_rendezvous() will do nothing if no rendezvous is pending. That is, it is harmless for vcpu2 to try to handle VM_EXITCODE_RENDEZVOUS. I think this situation is rare enough that we do not care about the cost of calling vm_exit_rendezvous() unnecessarily.

So the problem is:

  1. vcpu1 initiates a rendezvous with some set of vcpus
  2. vcpu2 does not belong to the set, but observes vcpu_rendezvous_pending() (an unlocked read of vm->rendezvous_func) is true
  3. vcpu2 calls vm_exit_rendezvous() but now vm->rendezvous_func is NULL because the rendezvous is already over, and this triggers an assertion failure

?

Why not simply remove the incorrect in assertion in vm_exit_rendezvous()? vm_handle_rendezvous() will do nothing if no rendezvous is pending. That is, it is harmless for vcpu2 to try to handle VM_EXITCODE_RENDEZVOUS. I think this situation is rare enough that we do not care about the cost of calling vm_exit_rendezvous() unnecessarily.

Yes, that's the issue. It's caused by vm_handle_ipi. If the OS sends INIT to a single cpu (what most OS do), vm_handle_ipi will initiate a rendezvous for a single vcpu. Don't know if the costs for unneccessary vm_exit_rendezvous calls are noticable for vms with high vcpu counts.

So the problem is:

  1. vcpu1 initiates a rendezvous with some set of vcpus
  2. vcpu2 does not belong to the set, but observes vcpu_rendezvous_pending() (an unlocked read of vm->rendezvous_func) is true
  3. vcpu2 calls vm_exit_rendezvous() but now vm->rendezvous_func is NULL because the rendezvous is already over, and this triggers an assertion failure

?

Why not simply remove the incorrect in assertion in vm_exit_rendezvous()? vm_handle_rendezvous() will do nothing if no rendezvous is pending. That is, it is harmless for vcpu2 to try to handle VM_EXITCODE_RENDEZVOUS. I think this situation is rare enough that we do not care about the cost of calling vm_exit_rendezvous() unnecessarily.

Yes, that's the issue. It's caused by vm_handle_ipi. If the OS sends INIT to a single cpu (what most OS do), vm_handle_ipi will initiate a rendezvous for a single vcpu. Don't know if the costs for unneccessary vm_exit_rendezvous calls are noticable for vms with high vcpu counts.

Then I think the right solution is to just remove the assertion. A vCPU will only call vm_exit_rendezvous unnecessarily if it happens to be exiting anyway, so I'm quite skeptical that there is any noticeable cost. If so, then it can be addressed as a separate follow-up commit.

So the problem is:

  1. vcpu1 initiates a rendezvous with some set of vcpus
  2. vcpu2 does not belong to the set, but observes vcpu_rendezvous_pending() (an unlocked read of vm->rendezvous_func) is true
  3. vcpu2 calls vm_exit_rendezvous() but now vm->rendezvous_func is NULL because the rendezvous is already over, and this triggers an assertion failure

?

Why not simply remove the incorrect in assertion in vm_exit_rendezvous()? vm_handle_rendezvous() will do nothing if no rendezvous is pending. That is, it is harmless for vcpu2 to try to handle VM_EXITCODE_RENDEZVOUS. I think this situation is rare enough that we do not care about the cost of calling vm_exit_rendezvous() unnecessarily.

Yes, that's the issue. It's caused by vm_handle_ipi. If the OS sends INIT to a single cpu (what most OS do), vm_handle_ipi will initiate a rendezvous for a single vcpu. Don't know if the costs for unneccessary vm_exit_rendezvous calls are noticable for vms with high vcpu counts.

Then I think the right solution is to just remove the assertion. A vCPU will only call vm_exit_rendezvous unnecessarily if it happens to be exiting anyway, so I'm quite skeptical that there is any noticeable cost. If so, then it can be addressed as a separate follow-up commit.

I agree with Mark. I think the right solution is to relax the assertion and explicitly handle spurious calls to vm_exit_rendezvous().

sys/amd64/include/vmm.h
336–343

OTOH, this probably needs to be something like CPU_EMPTY() instead of just checking the first word, or even the check you have here of checking the cpuid might be sensible. That would be a separate followup though. Until we have more vCPUs than bits in a pointer though (which we can't until my other series lands) this isn't yet a problem. I probably should fix this in my other series actually.

sys/amd64/include/vmm.h
336–343

No, this shouldn't be CPU_EMPTY. This function should check if the vcpu identified by vcpuid has a pending rendevzous or not. So, this should only test if the bit at position vcpuid is set or not. That's exactly what CPU_ISSET does and it will work even with thousands of vcpus.

I've created a new revision for relaxing the rendezvous assertion: https://reviews.freebsd.org/D37417

corvink retitled this revision from vmm: fix race condition in rendezvous detection to vmm: avoid spurios rendezvous.Nov 17 2022, 7:08 AM
corvink edited the summary of this revision. (Show Details)

This patch avoids spurios rendezvous. I've updated the description. Please comment if you like it or not. If you don't like it, I'll drop it.

sys/amd64/include/vmm.h
336–343

I had thought the current code was setting rptr to the cpuset, and CPU_EMPTY() would be a more correct version of checking for an empty mask in that case (to preserve existing semantics).

I do think if we want a comment anywhere we want it here as this is the one read outside of the rendezvous_mtx. The comment could just say that the check is racy, but that's ok as if we see a stale value we will get an IPI to force an exit, and if we see a stale value with it set, the handler will ignore the spurious rendezvous.

sys/amd64/vmm/vmm.c
1443

I don't know that we need quite this long a comment. We didn't have one before for clearing rendezvous_func above.

1861

Oh, this is why the previous code worked (I had thought it was always pointing to the cpuset)

2660

We generally use atomic_* now instead of these. Also, what you want here instead is probably atomic_thread_fence_rel or you want to use atomic_store_ptr_rel for setting vm->rendezvous_func.

OTOH, I don't know that you need this barrier at all since you've removed the assertion. The fact that the rendezvous handler grabs the rendezvous_mtx before checking any of these members means that we don't actually depend on any real ordering here. It probably is better to set rendezvous_req_cpus earlier as the existing code does now so that any vCPUs doing a VM enter concurrent with this have chance to see it and get ready for rendezvous without needing an IPI to force an exit.

sys/amd64/include/vmm.h
336–343

I can add a comment.

sys/amd64/vmm/vmm.c
1443

The first intention of this patch was to avoid spurious rendezvous completely (and those fixing the issue of bz). This comment explains some details how it's done. As we are going to remove the rendezvous assertion, it's not required any more to avoid spurious rendezvous completely. So, this comment isn't neccessary.

2660

Thanks for the explanation how to use barriers.

You're right as we are going to remove the rendezvous assertion.

  • rebase
  • do not use any memory barriers
sys/amd64/include/vmm.h
336–343

What if use additional check on 'rendezvous_func' ? It should be faster than use only CPU_ISSET() on every vmexit that doesn't break vmx/svm loop. Of course that optimisation makes sense if there are a lot of such vmexits.

vcpu_rendezvous_pending()
{ 
 	return (*((uintptr_t *)(info->rptr)) != 0)  && CPU_ISSET(vcpuid, (cpuset_t *)info->rptrset);
}

rptr -> rendezvous_func
rptrset -> rendezvous_req_cpus

  • fix compile
  • update comment of vcpu_rendezvous_pending
sys/amd64/include/vmm.h
336–343

Not sure if it's worth. CPU_ISSET is a single lookup (_cpuset[bitoffset] & bitmask). So, it should be fast.
Well, I'm not sure if it's worth to commit this patch at all. Spurious rendezvous are rare and at the moment only occur when sending INIT/SIPI. OTOH, this patch is small, keeps the rendezvous simple and is a slight improvement.

@jhb or @markj may have more experience on that.

sys/amd64/include/vmm.h
336–343

I don't think we need to overcomplicate the check with checking both things. Rendezvous should be rare. Modern guests should be using MSI interrupts and not I/O APIC interrupts in which case rendezvous will only occur during boot to start APs.

341

Perhaps more accurate to say the the races are harmless. They still exist, they are just all handled

343

We could also change the type of info->rptr to be cpuset_t * which would remove the need for casts. It's only used for this.

  • update comment
  • turn rptr into cpuset_t *

Not sure if the current log message is the one you plan to use? It would be good to be clear that post-boot the only rendezvous used are for I/O APIC interrupts which should be rare in most guests.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 1 2023, 12:28 PM
This revision was automatically updated to reflect the committed changes.