Make pmap_invalidate_ept() wait synchronously for guest exits
Closed · Public

Authored by markj on Oct 22 2020, 7:40 PM.

Details

Reviewers

alc
kib
jeff
grehan
jhb

Group Reviewers

bhyve

Commits

rS367593: vmm: Make pmap_invalidate_ept() wait synchronously for guest exits

Summary

Currently EPT TLB invalidation is done by incrementing a generation
counter and issuing an IPI to all CPUs currently running vCPU threads.
The VMM inner loop caches the most recently observed generation on each
host CPU and invalidates TLB entries before executing the VM if the
cached generation number is not the most recent value.
pmap_invalidate_ept() issues IPIs to force each vCPU to stop executing
guest instructions and reload the generation number. However, it does
not actually wait for vCPUs to exit, potentially creating a window where
guests are referencing stale TLB entries. This change attempts to fix
that using safe memory reclamation (SMR).
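
The existing scheme described above can be sketched in user-space C. This is only a single-threaded model under stated assumptions: `model_pmap`, `model_cpu`, `model_invalidate()` and `model_enter_guest()` are hypothetical names, C11 atomics stand in for the kernel's atomic primitives, and neither the IPIs nor the real EPT flush are modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pmap; eptgen plays the role of pm_eptgen. */
struct model_pmap {
	atomic_ulong eptgen;
};

/* Hypothetical per-host-CPU state kept by the VMM inner loop. */
struct model_cpu {
	unsigned long cached_gen;	/* last generation observed here */
	unsigned long flushes;		/* count of simulated TLB flushes */
};

/* Invalidator: publish a new generation and return without waiting. */
static void
model_invalidate(struct model_pmap *pm)
{
	atomic_fetch_add(&pm->eptgen, 1);
}

/* vCPU: called before each guest entry; returns true if it flushed. */
static bool
model_enter_guest(struct model_pmap *pm, struct model_cpu *cpu)
{
	/* Take one snapshot so the compare and the update agree. */
	unsigned long gen = atomic_load(&pm->eptgen);

	assert(gen >= cpu->cached_gen);	/* generations only move forward */
	if (cpu->cached_gen == gen)
		return (false);
	cpu->flushes++;			/* the real code would flush here */
	cpu->cached_gen = gen;
	return (true);
}
```

In this model `model_invalidate()` returns immediately, which mirrors the window described above: a vCPU that passed the check earlier keeps its old cached generation until its next call to `model_enter_guest()`.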

The idea is to bracket guest execution with an SMR read section which is
entered before loading the generation counter. Then,
pmap_invalidate_ept() increments the current write sequence before
loading pm_active and sending IPIs, and polls readers to ensure that all
vCPUs potentially operating with stale TLB entries have exited before
pmap_invalidate_ept() returns.
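
The bracketing idea can be illustrated with a simplified user-space model. It is a sketch, not the kernel's SMR implementation: all `model_*` names are hypothetical, the slot layout and the use of 0 as an "idle" marker are assumptions made for the example, sequence wraparound is ignored, and the memory-ordering subtleties of a real concurrent implementation are not addressed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NVCPU	2	/* hypothetical number of vCPU slots */
#define MODEL_SEQ_IDLE	0	/* slot value: not in a read section */

struct model_smr {
	_Atomic uint32_t wr_seq;		/* current write sequence */
	_Atomic uint32_t rd_seq[MODEL_NVCPU];	/* sequence seen per vCPU */
};

static void
model_smr_init(struct model_smr *s)
{
	atomic_init(&s->wr_seq, 1);
	for (int i = 0; i < MODEL_NVCPU; i++)
		atomic_init(&s->rd_seq[i], MODEL_SEQ_IDLE);
}

/* vCPU: enter the read section before loading the generation counter. */
static void
model_smr_enter(struct model_smr *s, int vcpu)
{
	assert(vcpu >= 0 && vcpu < MODEL_NVCPU);
	atomic_store(&s->rd_seq[vcpu], atomic_load(&s->wr_seq));
}

/* vCPU: leave the read section after the guest has exited. */
static void
model_smr_exit(struct model_smr *s, int vcpu)
{
	assert(vcpu >= 0 && vcpu < MODEL_NVCPU);
	atomic_store(&s->rd_seq[vcpu], MODEL_SEQ_IDLE);
}

/* Invalidator: advance the write sequence; the result is the goal. */
static uint32_t
model_smr_advance(struct model_smr *s)
{
	return (atomic_fetch_add(&s->wr_seq, 1) + 1);
}

/* True once no vCPU is still in a section entered before the goal. */
static bool
model_smr_poll(struct model_smr *s, uint32_t goal)
{
	for (int i = 0; i < MODEL_NVCPU; i++) {
		uint32_t seq = atomic_load(&s->rd_seq[i]);

		if (seq != MODEL_SEQ_IDLE && seq < goal)
			return (false);
	}
	return (true);
}
```

An invalidator in this model would call `model_smr_advance()`, send its notifications, and then repeat `model_smr_poll()` with the returned goal until it reports true. A vCPU that entered its section after the advance does not hold up the poll, since it would already have observed the new generation.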

I left the erratum383 workaround as-is; it also increments pm_eptgen but
uses an SMP rendezvous to ensure that guest CPUs exit before returning.
The workaround is relevant only for fairly old hardware and I have no
means to test changes to it.

I also annotated some loads of pm_eptgen using atomic_load_long().
Otherwise I'm not sure that the code is safe in the face of compiler
reordering.

Another, simpler, solution would be to replace the ipi_selected() call
in pmap_invalidate_ept() with an SMP rendezvous. However,
smp_rendezvous_cpus() doesn't scale particularly well today: it is
serialized using a spin mutex (and so interrupts are disabled while
waiting for acknowledgements), and all CPUs must increment a global
variable in the IPI handler. This could be fixed. However, the use of
SMR also allows the initiator to return earlier since it only needs to
wait until vCPUs have exited. Finally, having a general mechanism to
wait for vCPUs to exit (with or without an IPI) may be useful in other
contexts.

Diff Detail

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 34327
Build 31455: arc lint + arc unit

Event Timeline

markj created this revision. Oct 22 2020, 7:40 PM

Herald added a reviewer: bhyve. Oct 22 2020, 7:40 PM

Herald added subscribers: bcran, rgrimes.

markj requested review of this revision. Oct 22 2020, 7:40 PM

Harbormaster completed remote builds in B34327: Diff 78616. Oct 22 2020, 7:40 PM

markj added reviewers: alc, kib, jeff, grehan. Oct 22 2020, 7:42 PM

markj added a reviewer: jhb.

markj added inline comments. Oct 22 2020, 8:02 PM

sys/amd64/amd64/pmap.c
2754	I'm not sure why _acq is used here. The increment is atomic so other instructions cannot be reordered with it. As a side note, I think some of our atomic_*_acq_*() atomics are incorrect on at least arm64. For example, atomic_add_acq_long() without LSE results in an exclusive load-acquire and exclusive store pair, and nothing disallows reordering of subsequent loads with the store, contrary to the expectations of some consumers. The atomic-ordered-before property described in the ARM ARM only appears to prevent reordering of a subsequent load-acquire with a preceding exclusive store.
sys/amd64/vmm/amd/svm.c
2087	Why is this using _ACQ? I can't see a reason. VMX uses plain CPU_SET_ATOMIC, though up until recently it was implemented in assembly.

kib accepted this revision. Oct 24 2020, 9:32 AM

This revision is now accepted and ready to land. Oct 24 2020, 9:32 AM

grehan added a comment.

Would it be possible to cache the smr_seq_t instead of pmap->pm_eptgen? One less atomic to increment.

sys/amd64/amd64/pmap.c
2754	I think it was just paranoia.
sys/amd64/vmm/amd/svm.c
2087	Same paranoia.

In D26910#601253, @grehan wrote:

Would it be possible to cache the smr_seq_t instead of pmap->pm_eptgen? One less atomic to increment.

I thought about it a bit but wanted to keep the initial change simple since there are other places where we increment the eptgen. One problem with using the sequence number alone is that it's 32 bits wide. smr_advance() will block if it gets close to wraparound, but that only checks active readers.
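
The wraparound concern can be illustrated with a small sketch. Comparisons of 32-bit sequence numbers are commonly done with modular arithmetic, which gives the right answer only while the two values are less than 2^31 apart. The helper `model_seq_lt()` is a hypothetical name used for illustration and is not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if sequence a is older than sequence b.  Only meaningful while
 * the two values are less than 2^31 apart; relies on the usual
 * two's-complement conversion from uint32_t to int32_t.
 */
static bool
model_seq_lt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) < 0);
}
```

A value cached by a vCPU that is not an active reader could in principle fall more than 2^31 increments behind. At that distance the comparison inverts: `model_seq_lt(0, 0x80000001u)` is false even though 0 is the older value.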

alc added inline comments. Oct 26 2020, 6:48 PM

sys/amd64/amd64/pmap.c
2754	In regards to the side note, can you point to an example or two through email? I suspect that the real issue is that these examples are using the wrong fencing. I've seen people express a desire for a store acquire-like operation, but that is an utterly bogus notion under the release consistency model.