Page MenuHomeFreeBSD

vmm: save/restore pir_desc is required for posted-interrupts and VID
ClosedPublic

Authored by gusev.vitaliy_gmail.com on Jun 10 2022, 6:26 PM.
Tags
Referenced Files
F102463843: D35447.diff
Tue, Nov 12, 3:45 PM
Unknown Object (File)
Thu, Oct 24, 3:00 AM
Unknown Object (File)
Oct 11 2024, 4:19 PM
Unknown Object (File)
Oct 4 2024, 11:03 PM
Unknown Object (File)
Oct 1 2024, 2:39 PM
Unknown Object (File)
Sep 27 2024, 1:41 AM
Unknown Object (File)
Sep 16 2024, 8:57 PM
Unknown Object (File)
Sep 16 2024, 2:25 PM

Details

Summary

Losing pir_desc[] after restore causes VM stuck.

VID - Virtual Interrupt Delivery.

Sponsored by vStack.

Test Plan

Run VM on CPU that supports PostIntr and VID.

Do suspend and then resume on heavy loaded VM with Networking or Disk, i.e. with a lot of interrupts inside VM.

Verify that VM working fine, i.e. without hanging.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

In illumos bhyve, we chose to "flatten" queued interrupts in the PIR descriptor into the normal vlapic state, so there was only one source of truth for the IRR data:

https://github.com/illumos/illumos-gate/blob/e760f15095bdc9fa107e7c20ed2a5e4fb5865c1d/usr/src/uts/intel/io/vmm/intel/vmx.c#L3532-L3597

The sync operation is called when we need the IRR data to be up to date, like when checking for pending interrupts (from the HLT handler, or during VM entry) and when the vlapic data is about to be exported for save/restore. That way, neither source nor destination need to be explicitly aware of APICv.
Perhaps it's an approach worth considering in FreeBSD.

In illumos bhyve, we chose to "flatten" queued interrupts in the PIR descriptor into the normal vlapic state, so there was only one source of truth for the IRR data:

https://github.com/illumos/illumos-gate/blob/e760f15095bdc9fa107e7c20ed2a5e4fb5865c1d/usr/src/uts/intel/io/vmm/intel/vmx.c#L3532-L3597

The sync operation is called when we need the IRR data to be up to date, like when checking for pending interrupts (from the HLT handler, or during VM entry) and when the vlapic data is about to be exported for save/restore. That way, neither source nor destination need to be explicitly aware of APICv.
Perhaps it's an approach worth considering in FreeBSD.

Thanks, for pointing this commit in illumos. I think it should be synced to FreeBSD. But this commit should also suffer from not saved pir_desc after vmx_apicv_set_ready() is performed. Am I right?

But this commit should also suffer from not saved pir_desc after vmx_apicv_set_ready() is performed. Am I right?

The point of approach is to merge the contents of pir_desc into the vlapic (really the APIC page) itself. The asserted bits in pir_desc are simply requests to queue a pending interrupt in the virtual APIC. That is effectively the same as an asserted bit in the IRR fields of the APIC page. By flattening those bits into the APIC page, it eliminates the need to save PIR data independently.

I think it should be synced to FreeBSD.

We've made several changes and improvements to our APIC emulation. If someone is interested in porting patches over, they should probably take a holistic look at what all is different in order to do so effectively.

But this commit should also suffer from not saved pir_desc after vmx_apicv_set_ready() is performed. Am I right?

The point of approach is to merge the contents of pir_desc into the vlapic (really the APIC page) itself. The asserted bits in pir_desc are simply requests to queue a pending interrupt in the virtual APIC. That is effectively the same as an asserted bit in the IRR fields of the APIC page. By flattening those bits into the APIC page, it eliminates the need to save PIR data independently.
s

Sorry, If vlapic_set_intr_ready() is called, it will change pir_desc->pir in some condition:

vmx_apicv_set_ready():
       ...
        /*                                                                                                   
         * If the interrupt request does not require manipulation of the TMRs                                
         * for delivery, set it in PIR descriptor.  It cannot be inserted into                               
         * the APIC page while the vCPU might be running.                                                    
         */
        atomic_set_int(&pir_desc->pir[idx], mask);

I assume that it can lead that pir_desc->pir is only container that keeps a interrupt request. Therefore if suspend VM and then restore, that interrupt will be lost.

I think it should be synced to FreeBSD.

We've made several changes and improvements to our APIC emulation. If someone is interested in porting patches over, they should probably take a holistic look at what all is different in order to do so effectively.

But you don't have suspend/resume code yet?

Sorry, If vlapic_set_intr_ready() is called, it will change pir_desc->pir in some condition:
I assume that it can lead that pir_desc->pir is only container that keeps a interrupt request. Therefore if suspend VM and then restore, that interrupt will be lost.

Operations like save/restore require that all vCPUs and associated emulation be held from running. In that moment pir_desc should be unchanging, and thus safe to flatten in the APIC page without losing any interrupts.

But you don't have suspend/resume code yet?

Correct. We chose not to take the suspend/resume code as it stood. Our code for import/export of device and vcpu state is under review and testing.

Sorry, If vlapic_set_intr_ready() is called, it will change pir_desc->pir in some condition:
I assume that it can lead that pir_desc->pir is only container that keeps a interrupt request. Therefore if suspend VM and then restore, that interrupt will be lost.

Operations like save/restore require that all vCPUs and associated emulation be held from running. In that moment pir_desc should be unchanging, and thus safe to flatten in the APIC page without losing any interrupts.

I guess you meant that vmx_apicv_sync() should be called instead saving pir_desc directly ? I understand what you mean, but it will also require saving of vlapic_vtx. Current implementation in FreeBSD also requies saving vlapic_vtx, but just one field pending_prio. I am going to create review for that, once current review is finished.

But you don't have suspend/resume code yet?

Correct. We chose not to take the suspend/resume code as it stood.

I am going to ginger up the suspend/resume code and I need help in reviewing and committing several patches. What if commit that patch ? It works fine for more than year.

I guess you meant that vmx_apicv_sync() should be called instead saving pir_desc directly ? I understand what you mean, but it will also require saving of vlapic_vtx. Current implementation in FreeBSD also requies saving vlapic_vtx, but just one field pending_prio. I am going to create review for that, once current review is finished.

If proper care is taken (and if you look at the patches related to vmx_apicv_sync, you'll see), then none of the state in vlapic_vtx is required to be saved/restored. Any pending interrupt requests in pir_desc are flattened into the APIC page. The pending_prio field is irrelevant, because it's the APIC page data itself which is used to determine if an interrupt is pending (we also sync the pir_desc data into the APIC page as part of vlapic_pending_intr).

I am going to ginger up the suspend/resume code and I need help in reviewing and committing several patches. What if commit that patch ? It works fine for more than year.

We decided to take a different approach (as detailed in those CRs I linked) for saving/restoring instance state. As such, we will not be taking the FreeBSD code for that functionality.

Ok. Long term Patrick is more correct and we should not be storing this state but instead only trying to store "architectural" state like the APIC page, not the PIR, but I understand why for now we might need to do this.

In D35447#809008, @jhb wrote:

Ok. Long term Patrick is more correct and we should not be storing this state but instead only trying to store "architectural" state like the APIC page, not the PIR, but I understand why for now we might need to do this.

@jhb I can pick up Patrick's changes in the background, whereas 'pir_desc save/restore fix' could be applied for now.

@corvink Could you look at this ? While Patrick's changes requires (in comment) a lot of modifying generic code, this simple fix could solve issue related for suspend/resume.

Patch looks good but it doesn't apply any more. Could you send an updated version please?

Patch looks good but it doesn't apply any more. Could you send an updated version please?

Done.

This revision is now accepted and ready to land.May 8 2023, 12:14 PM

Perhaps adjust the commit message to say something like "Failing to preserve pir_desc can result in pending interrupts being lost on resume leading to a hung VM" or some such to clarify that the hang is due to missing interrupts.

This revision was automatically updated to reflect the committed changes.