
bhyve: help rendezvous request when waiting for IDLE vcpu state
Needs Review · Public

Authored by kib on Jan 7 2023, 5:32 AM.
Tags
None

Details

Reviewers
None
Group Reviewers
bhyve
Summary

PR: 268794

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision. Jan 7 2023, 5:32 AM

I think the problem is that the vCPU needing help is not the vCPU that is contending on the lock.

Say vCPU 5 triggers a rendezvous request, and the thread normally running vCPU 2 calls vm_map_pptdev_mmio() from userland.

vCPU 5 is in the frozen state and has completed its own part of the rendezvous.

vCPU 2 is in the idle state, so vcpu_lock_all() successfully obtains the lock on it and does not perform the rendezvous. vcpu_lock_all() then iterates up to vCPU 5, which is not in the idle state, and hangs there. It can call the new rendezvous help function, but vCPU 5's part of the rendezvous is already complete.
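For context, the path that hangs is roughly the following (paraphrased from sys/amd64/vmm/vmm_dev.c and vmm.c; names and details differ between branches, so treat this as a sketch rather than the literal code):

    /* Rough paraphrase of the existing locking path, not the literal source. */
    static int
    vcpu_lock_all(struct vmmdev_softc *sc)
    {
            uint16_t i, maxcpus;
            int error = 0;

            maxcpus = vm_get_maxcpus(sc->vm);
            for (i = 0; i < maxcpus; i++) {
                    /* Freeze vcpu i, which means waiting for it to reach IDLE. */
                    error = vcpu_lock_one(sc, i);
                    if (error != 0)
                            break;
            }
            /* (error unwinding elided) */
            return (error);
    }

    /* vcpu_lock_one() ends up in vcpu_set_state_locked(), which waits roughly like: */
            while (vcpu->state != VCPU_IDLE) {
                    vcpu->reqidle = 1;
                    /* Kick the vcpu out of the guest so it can go idle. */
                    vcpu_notify_event_locked(vcpu, false);
                    msleep_spin(&vcpu->state, &vcpu->mtx, "vmstat", hz);
            }

If the vcpu being waited on never goes idle -- here vCPU 5, which is waiting in vm_handle_rendezvous() for vCPU 2 -- the loop above never terminates.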

One fix would be for vcpu_lock_all() to know that it is running on the thread that backs vCPU 2 and to help the rendezvous as vCPU 2 whenever it stalls on a lock on any other vCPU. But vm_map_pptdev_mmio() doesn't pass its vCPU down to the kernel, so there's a lot of plumbing needed to make that work.

I thought about returning EAGAIN to userland from vcpu_lock_all() if a rendezvous races with vcpu_lock_all(), but it isn't clear to me what userland is supposed to do about it. Call vm_run() so that the rendezvous happens, then invoke the desired function again? That seems inefficient.

I think the problem is that the vCPU needing help is not the vCPU that is contending on the lock.

Say vCPU 5 triggers a rendezvous request, and the thread normally running vCPU 2 calls vm_map_pptdev_mmio() from userland.

vCPU 5 is in the frozen state and has completed its own part of the rendezvous.

vCPU 2 is in the idle state, so vcpu_lock_all() successfully obtains the lock on it and does not perform the rendezvous. vcpu_lock_all() then iterates up to vCPU 5, which is not in the idle state, and hangs there. It can call the new rendezvous help function, but vCPU 5's part of the rendezvous is already complete.

Right, but vCPU 5 would unfreeze after the rendezvous is completed, which would then allow progress on vCPU 2.

One fix would be for vcpu_lock_all() to know that it is running on the thread that backs vCPU 2 and to help the rendezvous as vCPU 2 whenever it stalls on a lock on any other vCPU. But vm_map_pptdev_mmio() doesn't pass its vCPU down to the kernel, so there's a lot of plumbing needed to make that work.

I thought about returning EAGAIN to userland from vcpu_lock_all() if a rendezvous races with vcpu_lock_all(), but it isn't clear to me what userland is supposed to do about it. Call vm_run() so that the rendezvous happens, then invoke the desired function again? That seems inefficient.

For this approach, you might return some special error code (e.g. ERESTART) and loop on it in the top-level ioctl handler, inside the kernel. Or not loop in the kernel at all, letting the normal syscall restart mechanism do the job.
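A minimal sketch of that idea, for illustration only (vm_rendezvous_pending() is a made-up helper here, and this is not the posted patch):

    /* Hypothetical: inside vcpu_lock_all(), give up instead of blocking. */
    if (vm_rendezvous_pending(sc->vm))      /* illustrative helper, not a real API */
            return (ERESTART);

    /* In vmmdev_ioctl(), either retry inside the kernel ... */
    do {
            error = vcpu_lock_all(sc);
    } while (error == ERESTART);

    /* ... or just pass ERESTART up and let the normal syscall-restart
     * machinery re-issue the ioctl. */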

In D37972#862712, @kib wrote:

Right, but vCPU 5 would unfreeze after the rendezvous is completed, which would then allow progress on vCPU 2.

But the rendezvous won't complete, because no one will perform vCPU 2's part of it. The thread that normally runs vCPU 2 is stuck in vcpu_lock_all() on vcpuid = 5. And vCPU 5 is stuck waiting for vCPU 2's rendezvous to finish.

I'll apply the patch and try it anyway in case I am mistaken.

Unfortunately I see the same behaviour with the patch. (Though I had to make some small tweaks to backport the patch to 13/Stable.)

sys/amd64/vmm/vmm.c:1339 (On Diff #114815)

Ok, what if we perform vm_handle_rendezvous_help_one() here on /all/ locked CPUs (cpus with id < vcpu)?

We are doing this from the wrong thread: but we know that those CPUs are frozen because we already locked them, and the rendezvous mutex is a memory fence, so those kernel threads will see any state changes before the vCPUs wake up.
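Concretely, the suggestion would look something like this (purely illustrative; vm_handle_rendezvous_help_one() is the helper discussed in this review, and the loop below is not from the posted diff):

    /* Hypothetical illustration of the suggestion, not the posted diff:
     * before blocking on 'vcpu', help every vcpu we have already frozen. */
    int i;

    for (i = 0; i < vcpu->vcpuid; i++) {
            struct vcpu *other = vm_vcpu(vm, i);    /* already frozen by lock_all */

            /*
             * Those vcpus are frozen because we already locked them, so
             * running their part of the rendezvous from this thread cannot
             * race with the vcpus themselves.
             */
            vm_handle_rendezvous_help_one(other);
    }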

sys/amd64/vmm/vmm.c:1339 (On Diff #114815)

I am not sure what you mean there. Are you suggesting calling vm_handle_rendezvous_help_one(vcpuxx) for vcpuxx with ids < vcpu->id? If yes, I do not think it would work, because the rendezvous handler needs to access CPU-local data (as in, the local CPU registers emulating the lapic state for the guest under VMX).

But could we just unfreeze all vcpus there if we note a rendezvous request?

Do you have the implementation of bool vm_handle_rendezvous_help_one()? It didn't make it into this patch.

I figured out what bool vm_handle_rendezvous_help_one(struct vcpu *vcpu) must be -- but nope, with enough vCPUs I still get a deadlock. This time I have:

  • vcpu == 0 has invoked its rendezvous func but is spinning waiting for the rest of the rendezvous to complete;
  • the thread for vcpu 10 is in vcpu_set_state_locked() waiting for the lock on vcpu 0. The rendezvous for vcpu 0 is already complete, so it just spins -- as before;
  • all other vcpus have completed their rendezvous and are waiting for vcpu 10 to do its rendezvous.
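For reference, a plausible reconstruction mirrors the per-vcpu body of vm_handle_rendezvous() (this is a guess, since the helper never appeared in the posted diff; field names follow sys/amd64/vmm/vmm.c and may differ from the actual patch):

    /* Hypothetical reconstruction; not the patch author's code. */
    static bool
    vm_handle_rendezvous_help_one(struct vcpu *vcpu)
    {
            struct vm *vm = vcpu->vm;
            bool helped = false;

            mtx_lock(&vm->rendezvous_mtx);
            if (vm->rendezvous_func != NULL &&
                CPU_ISSET(vcpu->vcpuid, &vm->rendezvous_req_cpus) &&
                !CPU_ISSET(vcpu->vcpuid, &vm->rendezvous_done_cpus)) {
                    /* Run this vcpu's part of the rendezvous on its behalf. */
                    (*vm->rendezvous_func)(vcpu, vm->rendezvous_arg);
                    CPU_SET(vcpu->vcpuid, &vm->rendezvous_done_cpus);
                    helped = true;
            }
            if (vm->rendezvous_func != NULL &&
                CPU_CMP(&vm->rendezvous_req_cpus, &vm->rendezvous_done_cpus) == 0) {
                    /* Everyone is done; complete the rendezvous and wake waiters. */
                    vm->rendezvous_func = NULL;
                    wakeup(&vm->rendezvous_func);
            }
            mtx_unlock(&vm->rendezvous_mtx);
            return (helped);
    }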

I figured out what bool vm_handle_rendezvous_help_one(struct vcpu *vcpu) must be -- but nope, with enough vCPUs I still get a deadlock. This time I have:

  • vcpu == 0 has invoked its rendezvous func but is spinning waiting for the rest of the rendezvous to complete;
  • the thread for vcpu 10 is in vcpu_set_state_locked() waiting for the lock on vcpu 0. The rendezvous for vcpu 0 is already complete, so it just spins -- as before;
  • all other vcpus have completed their rendezvous and are waiting for vcpu 10 to do its rendezvous.

Hm, so this is exactly what the first version of the patch tried to fix. OK, let's try to combine both approaches.

I found I still have the same problem where, for m < n,

  • vCPU m requests a rendezvous and executes its own rendezvous func
  • the thread for vCPU n tries to obtain lock_all
  • the thread for vCPU n stalls waiting for vCPU m to reach the idle state and, since m < n, it never gets to vCPU n
    • it can perform the rendezvous for vCPU m, but that's already complete
  • vCPU m is waiting for vCPU n to do its rendezvous.

I have been able to fix it with a slight adaptation of your patch, in which lock_all greedily obtains as many locks as possible, performs the rendezvous on all the vcpus it has a lock for, then releases them all and starts again. Probably this will require much more careful thought to get into main, but it works for me at least. Also it is against 13/stable, not main. {F54442637}
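In outline, the adaptation looks like this (a sketch of the behaviour described above, not the attached patch; the helper names are illustrative):

    /* Sketch of the greedy retry loop described above; not the attached patch. */
    static int
    vcpu_lock_all(struct vmmdev_softc *sc)
    {
            cpuset_t locked;
            uint16_t i, maxcpus;
            bool missed;

            maxcpus = vm_get_maxcpus(sc->vm);
            for (;;) {
                    CPU_ZERO(&locked);
                    missed = false;

                    /* Greedily freeze every vcpu that is currently IDLE. */
                    for (i = 0; i < maxcpus; i++) {
                            if (vcpu_try_lock_one(sc, i) == 0)  /* illustrative helper */
                                    CPU_SET(i, &locked);
                            else
                                    missed = true;
                    }
                    if (!missed)
                            return (0);     /* got all of them; done */

                    /* Run the pending rendezvous on behalf of every vcpu we froze. */
                    for (i = 0; i < maxcpus; i++)
                            if (CPU_ISSET(i, &locked))
                                    vm_handle_rendezvous_help_one(vm_vcpu(sc->vm, i));

                    /* Release everything and start again from scratch. */
                    for (i = 0; i < maxcpus; i++)
                            if (CPU_ISSET(i, &locked))
                                    vcpu_unlock_one(sc, i);
            }
    }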

Isn't it enough simply to handle the rendezvous request for all vcpus we freeze? And the other vcpus should be fine, because non-frozen vcpus would handle the request on vmexit.

My previous patches were completely wrong since they called the requested function on non-idle vcpus.
Could you try this one?

In D37972#862875, @kib wrote:

Isn't it enough simply to handle the rendezvous request for all vcpus we freeze?

No, because

  • vcpu 1 can be waiting for rendezvous completion;
  • vcpu 2 can be in vcpu_lock_all trying to lock vcpu 1 (and won't therefore perform the rendezvous for vcpu 2)

And the other vcpus should be fine, because non-frozen vcpus would handle the request on vmexit.

  • vcpu 2 is not running, it is in vcpu_lock_all() waiting for vcpu 1. Whether vcpu2 is frozen/idle/running is not important until vcpu1 goes to idle, because we need to get vcpu1's lock first before we do vcpu2.

But I will try your new patch.

In D37972#862875, @kib wrote:

Isn't it enough simply to handle the rendezvous request for all vcpus we freeze?

No, because

  • vcpu 1 can be waiting for rendezvous completion;
  • vcpu 2 can be in vcpu_lock_all trying to lock vcpu 1 (and won't therefore perform the rendezvous for vcpu 2)

This is perhaps the point where I am getting confused. How could vcpu2 be in vcpu_lock_all()? Rather, vcpu_lock_all() is executed by some thread which is performing an ioctl op on behalf of the userspace counterpart. In other words, a look at the code suggests that vcpu_lock_all()/vcpu_lock_one() are never called from a context which represents a running vcpu thread.

If you see something that makes you think it is possible, could you please get the backtraces of all kernel threads for the bhyve VM? procstat -kk should be enough for a start.
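For example (the pgrep pattern is illustrative; any way of getting the bhyve process id works):

    procstat -kk $(pgrep bhyve)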

And the other vcpus should be fine, because non-frozen vcpus would handle the request on vmexit.

  • vcpu 2 is not running, it is in vcpu_lock_all() waiting for vcpu 1. Whether vcpu2 is frozen/idle/running is not important until vcpu1 goes to idle, because we need to get vcpu1's lock first before we do vcpu2.

It should be important, because vcpu2 is not frozen. The existing notification mechanism should then get vcpu 2 to vmexit, where the rendezvous request is completed.

But I will try your new patch.

Thanks.

This is perhaps the point where I am getting confused. How could vcpu2 be in vcpu_lock_all()? Rather, vcpu_lock_all() is executed by some thread which is performing an ioctl op on behalf of the userspace counterpart. In other words, a look at the code suggests that vcpu_lock_all()/vcpu_lock_one() are never called from a context which represents a running vcpu thread.

Yes, my language is sloppy, sorry for the confusion. bhyve has a thread which it uses to run vcpu 2, but sometimes it also uses that thread to perform an ioctl on the whole machine. And maybe that is the problem.

What happens is:

  • Windows guest writes to a passed-through PCI register in some vCPU,
  • writes to passed-through cfg registers are not handled in the kernel but fall out into userspace in usr.sbin/bhyve/bhyverun.c.
    • so the thread running that vCPU is no longer executing in kernel space, and the vCPU is idle.
  • bhyverun.c dispatches to vmexit_inst_emul() ... ultimately this lands us in pci_passthru.c in passthru_addr()
  • passthru_addr() in pci_passthru.c calls passthru_mmio_addr() or passthru_msix_addr()
  • these functions call vm_unmap_pptdev_mmio() or vm_map_pptdev_mmio(), either of which re-enters kernel space
  • Both of these kernel calls begin with vcpu_lock_all() (a rough sketch of the path follows)
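Roughly, that last step is just an ioctl on the VM device (paraphrased from lib/libvmmapi/vmmapi.c and sys/amd64/vmm/vmm_dev.c; details vary by branch):

    /* Userland side (libvmmapi), roughly: fill in the request and ioctl the VM fd. */
    int
    vm_map_pptdev_mmio(struct vmctx *ctx, int bus, int slot, int func,
        vm_paddr_t gpa, size_t len, vm_paddr_t hpa)
    {
            struct vm_pptdev_mmio pptmmio = {
                    .bus = bus, .slot = slot, .func = func,
                    .gpa = gpa, .len = len, .hpa = hpa,
            };

            return (ioctl(ctx->fd, VM_MAP_PPTDEV_MMIO, &pptmmio));
    }

    /* Kernel side, in vmmdev_ioctl(): this is one of the ioctls that must
     * freeze the whole VM before doing its work. */
    case VM_MAP_PPTDEV_MMIO:
    case VM_UNMAP_PPTDEV_MMIO:
            /* ... other whole-VM ioctls ... */
            error = vcpu_lock_all(sc);      /* <-- this is where the thread hangs */
            break;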

If you see something that makes you think it is possible, could you please get the backtraces of all kernel threads for the bhyve VM? procstat -kk should be enough for a start.

This is the backtrace from when I first discovered this bug last week on a four vCPU machine: {F54468511}

This is perhaps the point where I am getting confused. How could vcpu2 be in vcpu_lock_all()? Rather, vcpu_lock_all() is executed by some thread which is performing an ioctl op on behalf of the userspace counterpart. In other words, a look at the code suggests that vcpu_lock_all()/vcpu_lock_one() are never called from a context which represents a running vcpu thread.

Yes, my language is sloppy, sorry for the confusion. bhyve has a thread which it uses to run vcpu 2, but sometimes it also uses that thread to perform an ioctl on the whole machine. And maybe that is the problem.

What happens is:

  • Windows guest writes to a passed-through PCI register in some vCPU,
  • writes to passed-through cfg registers are not handled in the kernel but fall out into userspace in usr.sbin/bhyve/bhyverun.c.
    • so the thread running that vCPU is no longer executing in kernel space, and the vCPU is idle.

Right, and at this moment the thread is no longer the vcpu thread; as you noted, the vcpu state becomes IDLE.
My change should just run the rendezvous handler on this vcpu.

  • bhyverun.c dispatches to vmexit_inst_emul() ... ultimately this lands us in pci_passthru.c in passthru_addr()
  • passthru_addr() in pci_passthru.c calls passthru_mmio_addr() or passthru_msix_addr()
  • these functions call vm_unmap_pptdev_mmio() or vm_map_pptdev_mmio(), either of which re-enters kernel space
  • Both of these kernel calls begin with vcpu_lock_all()

If you see something that makes you think it is possible, could you please get the backtraces of all kernel threads for the bhyve VM? procstat -kk should be enough for a start.

This is the backtrace from when I first discovered this bug last week on a four vCPU machine: {F54468511}

I am not sure what this means.

In D37972#863506, @kib wrote:

Right, and at this moment the thread is no longer the vcpu thread; as you noted, the vcpu state becomes IDLE.
My change should just run the rendezvous handler on this vcpu.

The bug is that there is no guarantee the loop in vcpu_lock_all() will iterate up to this vcpu; we can get stuck before reaching it.

I applied your patch to -CURRENT, but unfortunately I get a panic in vm_iommu_modify(): "vm mem_segs not locked @ /usr/src/sys/amd64/vmm/vmm.c:1189"

That doesn't seem related to this change.

I applied your patch to -CURRENT, but unfortunately I get a panic in vm_iommu_modify(): "vm mem_segs not locked @ /usr/src/sys/amd64/vmm/vmm.c:1189"

I've got a couple of patches in my tree to address this; I'll put them up in a few.

I applied your patch to -CURRENT, but unfortunately I get a panic in vm_iommu_modify(): "vm mem_segs not locked "

That doesn't seem related to this change.

Can you try the following to fix the "vm mem_segs not locked @ /usr/src/sys/amd64/vmm/vmm.c:1189" panic:

https://reviews.freebsd.org/D37962
https://reviews.freebsd.org/D38071

In D37972#864943, @rew wrote:

I applied your patch to -CURRENT, but unfortunately I get a panic in vm_iommu_modify(): "vm mem_segs not locked "

That doesn't seem related to this change.

Can you try the following to fix the "vm mem_segs not locked @ /usr/src/sys/amd64/vmm/vmm.c:1189" panic:

https://reviews.freebsd.org/D37962
https://reviews.freebsd.org/D38071

I'll try them this week.

btw, I notice that in a 12-core VM, sampling with procstat -k shows a Win11 guest in vm_handle_rendezvous() > 95% of the time. I don't know what percentage is expected, but it seems like a lot of vmexits.

btw2: I saw that the SmartOS people found this bug a while ago. https://smartos.org/bugview/OS-7622

In D37972#864943, @rew wrote:

I applied your patch to -CURRENT, but unfortunately I get a panic in vm_iommu_modify(): "vm mem_segs not locked "

That doesn't seem related to this change.

Can you try the following to fix the "vm mem_segs not locked @ /usr/src/sys/amd64/vmm/vmm.c:1189" panic:

https://reviews.freebsd.org/D37962
https://reviews.freebsd.org/D38071

I saw these got merged into main and I confirm that the panic goes away.

However, with the latest patch here applied, I still get the hang in vm_handle_rendezvous(). See the procstat output, as requested, on a twelve-core VM: {F55142419}

Hey guys,

Do you have any news on this?
As far as I know, this issue prevents Windows 11 guests from using PCIe passthru.

https://forums.freebsd.org/threads/bhyve-vm-stuck-when-passthru-enabled.92854