Page MenuHomeFreeBSD

amd64: use INVLPGB for kernel pmap invalidations
ClosedPublic

Authored by kib on May 13 2024, 11:34 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 4, 10:22 AM
Unknown Object (File)
Sat, Jan 4, 12:31 AM
Unknown Object (File)
Dec 7 2024, 9:49 PM
Unknown Object (File)
Dec 7 2024, 9:46 PM
Unknown Object (File)
Dec 7 2024, 9:45 PM
Unknown Object (File)
Dec 7 2024, 9:45 PM
Unknown Object (File)
Dec 7 2024, 9:25 PM
Unknown Object (File)
Dec 5 2024, 9:19 PM
Subscribers

Details

Summary

Newest AMD CPUs have the new broadcast invpg instruction INVLPGB. It performs invalidation locally and broadcasts the request for same invalidation to all logical CPUs. I tried to apply it to our pmap, but it is not convenient due to our use of the sliding PCID algorithm.

But I do think that one use of the instruction could be very profitable, esp. on large machines. We can do kernel pmap invalidations without IPIs. These invalidations always use PCID 0, and are always totally broadcast. This seems to be an ideal application.

I suspect that intense kmalloc/malloc(9) loads like ZFS might get significant speedup due to elimination of IPIs.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

kib requested review of this revision.May 13 2024, 11:34 PM

Not tested at all, machine is not yet available. I want to discuss this before debugging.

sys/amd64/amd64/mp_machdep.c
735 ↗(On Diff #138510)

I don't understand the subtraction here.

740 ↗(On Diff #138510)

Should the second parameter logically be PMAP_PCID_KERN?

sys/amd64/include/cpufunc.h
540

The parameter names should reflect how they are actually used.

kib marked 2 inline comments as done.May 14 2024, 11:27 PM
kib added inline comments.
sys/amd64/amd64/mp_machdep.c
740 ↗(On Diff #138510)

We request the global invalidation, and do not request PCID-specific. So it should not be PMAP_PCID_KERN even formally, I believe.

sys/amd64/include/cpufunc.h
540

The doc calls both rax and edx descriptors, unfortunately. And ecx is not a counter as such.

If you have suggestions I will apply it, for me the direct register names are easiest to match with the documentation.

kib marked an inline comment as done.

Fix cnt calculation for 4k step in invlrng.
Flip default for invlpgb_works to 1.

Correct cnt, one more time.

On amd64, translation invalidation routines don't specify whether intermediate TLB entries need to be invalidated, but on arm64 they do. Now that we have INVLPGB_FIN, I wonder if (in a separate patch certainly) it would be worth augmenting amd64's pmap_invalidate_page() etc. to specify whether intermediate entries need to be invalidated.

sys/amd64/amd64/mp_machdep.c
736 ↗(On Diff #138560)

How can cnt == 0 be true? va is not required to be page-aligned?

739 ↗(On Diff #138560)

This is the maximum count of additional pages; since we subtract 1 below, from my reading of AMD's spec we can technically use cnt = invlpgb_maxcnt + 1 here.

sys/amd64/amd64/pmap.c
555 ↗(On Diff #138560)
sys/amd64/include/cpufunc.h
540

I see, maybe va, desc, and count. But yes, the reader will have to look at invlpgb documentation to fully understand what is going on.

kib marked 3 inline comments as done.May 17 2024, 4:27 PM
kib added inline comments.
sys/amd64/amd64/mp_machdep.c
736 ↗(On Diff #138560)

That was left from the previous invalid formula.

739 ↗(On Diff #138560)

I initialized maxcnt by the cpuid val + 1.

sys/amd64/amd64/pmap.c
555 ↗(On Diff #138560)

Note the above two sysctl descr line styles.

kib marked 3 inline comments as done.

+1 for maxcnt
remove cnt == 0 check

sys/amd64/amd64/mp_machdep.c
745 ↗(On Diff #138660)

Although it makes no functional difference, I would contend that NPTEPG is the correct constant here, as it represents the number of PTEs in a page table page referenced by a 2MB PDE.

751 ↗(On Diff #138660)

I think multiplication by the same constant that we divided by above makes this easier to understand. I suspect that the compiler will implement this as a shift anyway.

kib marked 2 inline comments as done.

Use NPTEPG

markj added inline comments.
sys/amd64/amd64/mp_machdep.c
739 ↗(On Diff #138560)

IMO this is a bit confusing. It would be cleaner to keep the global invlpgb_maxcnt equal to the "official" value, and handle the bias locally.

This revision is now accepted and ready to land.May 20 2024, 2:44 PM
kib marked an inline comment as done.

Use raw maxcnt value

This revision now requires review to proceed.May 20 2024, 4:54 PM
This revision is now accepted and ready to land.May 20 2024, 4:55 PM

Successfully booted an updated version from kib's branch

Jun 25 14:57:48 dut kernel: AMD Extended Feature Extensions ID EBX=0x79bef25f<CLZERO,IRPerf,XSaveErPtr,INVLPGB,RDPRU,BE,WBNOINVD,IBPB,INT_WBINVD,IBRS,STIBP,STIBP_ALWAYSON,PREFER_IBRS,SAMEMODE_IBRS,NOLMSLE,INVLPGBNEST,PPIN,SSBD,CPPC,PSFD,BTC_NO,IBPB_RET>

root@dut:~ # sysctl vm.pmap.invlpgb_works
vm.pmap.invlpgb_works: 1

Fixes from Ed' testing on real hw:

  1. Postpone calculating invlpgb_works until AMD ext features word is initialized
  2. Always truncate VA passed to INVLPGB, the instruction does not accept unaligned addresses
  3. Do not leak pin count
  4. Add comment explaining why TLBSYNC is done on same CPU
This revision now requires review to proceed.Jun 25 2024, 5:08 PM
sys/amd64/amd64/mp_machdep.c
688 ↗(On Diff #140230)

Would it make sense to assert that this is true? I see there's such an assertion in smp_targeted_tlb_shootdown_native, but maybe one here would be good for documentary purposes?

sys/amd64/amd64/mp_machdep.c
688 ↗(On Diff #140230)

Yes, indeed. The best place to do that is in fact sched_unpin(), and it is somewhat surprising that there is no such assert already. [The code calls sched_unpin() right after the tlbsync()]

D45736

sys/amd64/amd64/mp_machdep.c
727 ↗(On Diff #140230)

Suppose addr1 = 0x800, addr2 = 0x1800. Then I'd expect this routine to invalidate translations for 0x0000 and 0x1000, but we have total = 1.

kib marked 2 inline comments as done.Jun 30 2024, 6:15 PM
kib added inline comments.
sys/amd64/amd64/mp_machdep.c
727 ↗(On Diff #140230)

I suspect that ranged invalidation always get the aligned addresses, but fixed. Thanks.

kib marked an inline comment as done.

Fix loop count calculation for ranged invalidation when start/end are not page aligned.

This revision was not accepted when it landed; it landed in state Needs Review.Aug 21 2024, 4:36 PM
This revision was automatically updated to reflect the committed changes.