Currently, EPT TLB invalidation is done by incrementing a generation
counter and issuing an IPI to all CPUs that are running vCPU threads.
The VMM inner loop caches the most recently observed generation on each
host CPU and invalidates TLB entries before executing the VM if the
cached generation number is not the most recent value.
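
As a rough illustration of the existing scheme, here is a minimal
user-space model of the generation check. All names (gen_eptgen,
gen_cached, gen_pre_enter() and so on) are hypothetical stand-ins rather
than the kernel's, and the TLB flush is reduced to a counter:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define GEN_NCPU 4

/* Hypothetical stand-ins; none of these names come from the kernel. */
static atomic_ulong gen_eptgen;			/* models pm_eptgen */
static unsigned long gen_cached[GEN_NCPU];	/* generation last seen per host CPU */
static unsigned long gen_flushes[GEN_NCPU];	/* counts simulated TLB flushes */

/* Requester side: publish a new generation. */
static void
gen_invalidate(void)
{
	atomic_fetch_add(&gen_eptgen, 1);
	/* The real code sends an IPI here so that running vCPUs exit and re-check. */
}

/*
 * Called on host CPU `cpu` just before entering the guest.  Returns true
 * if the cached generation was stale and a flush was performed.
 */
static bool
gen_pre_enter(int cpu)
{
	unsigned long gen;

	assert(cpu >= 0 && cpu < GEN_NCPU);
	gen = atomic_load(&gen_eptgen);
	if (gen_cached[cpu] == gen)
		return (false);
	gen_flushes[cpu]++;		/* stands in for the actual EPT TLB flush */
	gen_cached[cpu] = gen;
	return (true);
}
```

In this model, a CPU flushes at most once per generation change, and
only when it next passes through gen_pre_enter(). Nothing here makes the
requester wait for a CPU that is already running the guest.
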
pmap_invalidate_ept() issues IPIs to force each vCPU to stop executing
guest instructions and reload the generation number. However, it does
not actually wait for vCPUs to exit, potentially creating a window where
guests are referencing stale TLB entries. This change attempts to fix
that using SMR.

The idea is to bracket guest execution with an SMR read section which is
entered before loading the generation counter. Then,
pmap_invalidate_ept() increments the current write sequence before
loading pm_active and sending IPIs, and polls readers to ensure that all
vCPUs potentially operating with stale TLB entries have exited before
pmap_invalidate_ept() returns.
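
The protocol above can be sketched with a toy, user-space model of SMR.
This is not the kernel implementation: the toy_* names are hypothetical,
the helpers only loosely mirror smr_enter(), smr_exit(), smr_advance()
and smr_poll(), and the sketch assumes that the generation is bumped
before the write sequence is advanced:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NCPU 4

/* Hypothetical stand-ins; none of these names come from the patch. */
static atomic_ulong toy_eptgen;			/* models pm_eptgen */
static atomic_ulong toy_wr_seq = 1;		/* global write sequence */
static atomic_ulong toy_rd_seq[TOY_NCPU];	/* 0 means "not in a read section" */

/* vCPU side: enter the read section first, then load the generation. */
static unsigned long
toy_guest_enter(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NCPU);
	atomic_store(&toy_rd_seq[cpu], atomic_load(&toy_wr_seq));
	return (atomic_load(&toy_eptgen));
}

/* vCPU side: leave the read section after exiting the guest. */
static void
toy_guest_exit(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NCPU);
	atomic_store(&toy_rd_seq[cpu], 0);
}

/*
 * Invalidation side, step 1: bump the generation (assumed to come first),
 * then advance the write sequence.  Returns the goal to poll for.
 */
static unsigned long
toy_invalidate_begin(void)
{
	atomic_fetch_add(&toy_eptgen, 1);
	/* The real code would load pm_active and send IPIs after advancing. */
	return (atomic_fetch_add(&toy_wr_seq, 1) + 1);
}

/*
 * Invalidation side, step 2: true once no CPU is still inside a read
 * section that began before the goal was published.
 */
static bool
toy_invalidate_poll(unsigned long goal)
{
	for (int cpu = 0; cpu < TOY_NCPU; cpu++) {
		unsigned long seq = atomic_load(&toy_rd_seq[cpu]);

		if (seq != 0 && seq < goal)
			return (false);
	}
	return (true);
}
```

In this model, a vCPU that entered before toy_invalidate_begin() keeps
toy_invalidate_poll() returning false until it calls toy_guest_exit(). A
vCPU that enters afterwards observes the new generation and does not
hold up the poll, which is why the initiator only waits for vCPUs that
may be using stale entries.
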
I left the erratum383 workaround as-is; it also increments pm_eptgen but
uses an SMP rendezvous to ensure that guest CPUs exit before returning.
The workaround is relevant only for fairly old hardware and I have no
means to test changes to it.

I also annotated some loads of pm_eptgen using atomic_load_long().
Otherwise I'm not sure that the code is safe in the face of compiler
reordering.

Another, simpler, solution would be to replace the ipi_selected() call
in pmap_invalidate_ept() with an SMP rendezvous. However,
smp_rendezvous_cpus() doesn't scale particularly well today: it is
serialized using a spin mutex (and so interrupts are disabled while
waiting for acknowledgements), and all CPUs must increment a global
variable in the IPI handler. This could be fixed. However, the use of
SMR also allows the initiator to return earlier since it only needs to
wait until vCPUs have exited. Finally, having a general mechanism to
wait for vCPUs to exit (with or without an IPI) may be useful in other
contexts.