
amd64: Reduce the amount of cpuset copying done for TLB shootdowns
Closed · Public

Authored by markj on Nov 1 2021, 11:01 PM.

Details

Summary

We use pmap_invalidate_cpu_mask() to get the set of active CPUs. This
(32-byte) set is copied by value through multiple stack frames until we
reach smp_targeted_tlb_shootdown(), where it is copied yet again.

Avoid this copying by having pmap_invalidate_*() make a local copy of
the active CPU set and pass it by reference. Also use the
non-destructive CPU_FOREACH_ISSET() iterator to avoid unneeded copying
within smp_targeted_tlb_shootdown().

No functional change intended.
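A minimal sketch of the calling-convention change described in the summary; the _old/_new names and the trimmed argument lists are illustrative, not the in-tree prototypes.

#include <sys/param.h>
#include <sys/cpuset.h>
#include <vm/vm.h>
#include <vm/pmap.h>

/* Before: the whole 32-byte cpuset travels by value through each frame. */
void smp_masked_invlpg_old(cpuset_t mask, vm_offset_t va, pmap_t pmap);

/* After: one local copy in the pmap layer, then only a pointer is passed. */
void smp_masked_invlpg_new(cpuset_t *mask, vm_offset_t va, pmap_t pmap);

static void
pmap_invalidate_page_sketch(pmap_t pmap, vm_offset_t va)
{
        cpuset_t mask;

        mask = pmap->pm_active;                 /* the single copy */
        smp_masked_invlpg_new(&mask, va, pmap); /* callees only dereference */
}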

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Nov 1 2021, 11:01 PM
sys/amd64/amd64/mp_machdep.c
636 (On Diff #97839)

Can we remove mask as well? We need to filter out curcpu in the CPU_FOREACH_ISSET loops below, and we would need something similar to ipi_selected_but_self(), which is again nothing but a CPU_FOREACH_ISSET loop.

I think there is a demand for a CPU_FOREACH_ISSET_BUT_SELF() loop.
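For reference, a sketch of the pattern being discussed: ipi_selected_but_self_sketch() is hypothetical, ipi_send_cpu() is the existing amd64 helper, and the curcpu test is exactly the branch a CPU_FOREACH_ISSET_BUT_SELF() macro would hide.

#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/pcpu.h>
#include <sys/smp.h>
#include <machine/smp.h>

static void
ipi_selected_but_self_sketch(const cpuset_t *cpus, u_int ipi)
{
        int cpu;

        CPU_FOREACH_ISSET(cpu, cpus) {
                if (cpu == PCPU_GET(cpuid))     /* the per-iteration branch */
                        continue;
                ipi_send_cpu(cpu, ipi);
        }
}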

sys/amd64/amd64/mp_machdep.c
636 (On Diff #97839)

You mean, remove other_cpus? It could be done, yes; I considered it when I wrote the patch. I suspect it is slightly preferable to make a single copy of the cpuset rather than add an additional branch to each of the three loops, though. I'm happy to implement it, but I can't really see why it's the right tradeoff.
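The single-copy alternative markj describes, schematically (the function name and the elided send/wait details are illustrative; headers as in the sketch above):

static void
shootdown_other_cpus_sketch(const cpuset_t *mask)
{
        cpuset_t other_cpus;
        int cpu;

        other_cpus = *mask;                     /* the one 32-byte copy */
        CPU_CLR(PCPU_GET(cpuid), &other_cpus);  /* drop ourselves once */

        /* The loops below can walk other_cpus with no curcpu branch. */
        CPU_FOREACH_ISSET(cpu, &other_cpus) {
                /* ... IPI the CPU / wait for its acknowledgement ... */
        }
}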

sys/amd64/amd64/mp_machdep.c
636 (On Diff #97839)

Yes, other_cpus.

Not using it frees one cache line, and that is arguably worth more than the cost of three additional jumps. There are already enough conditions in BIT_FOREACH_ADVANCE() that this new one is noise.

If you are so concerned about these additional jumps: two of them, in the CPU_FOREACH_ISSET() loops in smp_targeted_tlb_shootdown(), are not strictly needed. The first loop would be fine as is, and the second loop can avoid it if we preset the curcpu slot with the generation number right after curcpu_cb().
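A sketch of that last suggestion; the scoreboard[] array, the generation type, and the callback signature are placeholders rather than the real per-CPU fields (assumes <machine/atomic.h> and <machine/cpu.h> in addition to the headers above):

static uint32_t scoreboard[MAXCPU];

static void
shootdown_wait_sketch(const cpuset_t *mask, uint32_t generation,
    void (*curcpu_cb)(void))
{
        int cpu;

        curcpu_cb();
        /* Preset our own slot so the wait loop needs no curcpu test. */
        scoreboard[PCPU_GET(cpuid)] = generation;

        CPU_FOREACH_ISSET(cpu, mask) {
                while (atomic_load_32(&scoreboard[cpu]) != generation)
                        cpu_spinwait();
        }
}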

Avoid making a copy in smp_targeted_tlb_shootdown(), add ipi_selected_but_self()

sys/amd64/amd64/mp_machdep.c
636 (On Diff #97839)

Ok. There is also the "am I the only one?" check at the beginning of this function, which becomes more expensive, but I think this is nicer.

I left ipi_selected_but_self() MD for now. Other routines might benefit from it, at least smp_rendezvous_cpus(). But I would defer this to a separate patch.

Oops, I think my last suggestion would not work, sorry. Please see the comment in pmap_invalidate_preipi_pcid(). Mask must be read before that fence.

sys/amd64/amd64/pmap.c
3017–3018

Can we re-initialize kernel_pmap->pm_active with all_cpus and avoid this condition?
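The special case in question has roughly this shape, and collapses as shown once kernel_pmap->pm_active is seeded with all_cpus (approximate, written here in the pointer-returning form of the helper, not the exact diff hunks):

/* With the kernel pmap special-cased. */
static __inline cpuset_t *
pmap_invalidate_cpu_mask_before(pmap_t pmap)
{
        return (pmap == kernel_pmap ? &all_cpus : &pmap->pm_active);
}

/* With pm_active authoritative for every pmap, kernel included. */
static __inline cpuset_t *
pmap_invalidate_cpu_mask_after(pmap_t pmap)
{
        return (&pmap->pm_active);
}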

markj marked an inline comment as done. Nov 3 2021, 2:32 PM
In D32792#740403, @kib wrote:

Oops, I think my last suggestion would not work, sorry. Please see the comment in pmap_invalidate_preipi_pcid(). Mask must be read before that fence.

The comment says that the mask must be loaded after the fence. That is the case with and without the patch, and note that pmap_invalidate_* makes a local copy of the mask. Am I missing something there?
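For clarity, the ordering markj is describing, in sketch form (the real pmap_invalidate_* variants take more arguments and also update pm_gen; only the fence/load order matters here):

static void
invalidate_order_sketch(pmap_t pmap)
{
        cpuset_t mask;

        pmap_invalidate_preipi(pmap);   /* the fence is issued in here */
        mask = pmap->pm_active;         /* pm_active is loaded after it */
        /* ... send the shootdown IPIs to the CPUs in mask ... */
}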

sys/amd64/amd64/pmap.c
3017–3018

Instead of using CPU_FILL()? Yes, I think it is doable. smp_targeted_tlb_shootdown() ignores the mask if !smp_started, and I cannot see anything else that would be affected. I think I would do this in a separate patch; I already have some other follow-ups to this one.


The comment says that the mask must be loaded after the fence. That is the case with and without the patch, and note that pmap_invalidate_* makes a local copy of the mask. Am I missing something there?

I mixed the direction, indeed.

sys/amd64/amd64/pmap.c
3017–3018

Yes, I think it removes the unneeded specialness of the kernel pmap.

This revision is now accepted and ready to land. Nov 4 2021, 12:18 AM
In D32792#740961, @kib wrote:

I mixed the direction, indeed.

I wonder if pmap_invalidate_preipi() should return the cpuset directly, so that one can write:

cpuset_t mask;
pmap_invalidate_preipi(pmap, &mask);
smp_masked_invlpg(&mask, pmap, cb);

Then there is less code duplication and coupling between functions.


If kernel_pmap's pm_active becomes the true active cpuset instead of all 1's, then I think the mask argument of smp_masked_* is not really needed anymore (the names just become somewhat strange). We pass the pmap anyway.
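Schematically, what this proposal amounts to (hypothetical signature, not the committed one):

static void
smp_masked_invlpg_sketch(vm_offset_t va, pmap_t pmap /* , curcpu_cb */)
{
        cpuset_t mask;

        mask = pmap->pm_active; /* replaces the caller-supplied mask */
        /* ... invalidate va on the CPUs in mask ... */
}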

In D32792#741135, @kib wrote:

If kernel_pmap's pm_active becomes the true active cpuset instead of all 1's, then I think the mask argument of smp_masked_* is not really needed anymore (the names just become somewhat strange). We pass the pmap anyway.

I tried to eliminate mask when I first implemented this, but smp_cache_flush() does not use pm_active. And I think the pmap layer should be responsible for loading pm_active at the right time.


For smp_cache_flush(), we can pass kernel_pmap, which would not be a significant abuse. IMO removing one trivially deducible argument is good.

That said, the smp_masked_* functions are really a continuation of the pmap implementation; for proof, look at the IPI handlers, which are absolutely tied to the pmap variants.
Maybe we can eliminate the smp_masked_*() functions altogether; they do nothing but redirect to smp_targeted_tlb_shootdown() and increment counters, which could be inlined into pmap_invalidate_*() without loss of readability.

  • During SI_SUB_CPU+1, make kernel_pmap's pm_active bitset equal to all_cpus (sketched below).
  • Drop the conditional in pmap_invalidate_cpu_mask(), and export it.
  • Drop the mask parameter to smp_targeted_tlb_shootdown().
  • Drop ipi_selected_but_self(); it isn't needed since smp_targeted_tlb_shootdown() has to make a local copy anyway.

I left smp_masked_* for now.
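A minimal sketch of the first bullet above; the SYSINIT hook and its name are illustrative, and only the SI_SUB_CPU + 1 timing comes from the change description:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/cpuset.h>
#include <sys/smp.h>
#include <vm/vm.h>
#include <vm/pmap.h>

static void
kernel_pmap_active_init(void *arg __unused)
{
        /* From here on, kernel_pmap's pm_active mirrors all_cpus. */
        CPU_COPY(&all_cpus, &kernel_pmap->pm_active);
}
SYSINIT(kpmap_active, SI_SUB_CPU + 1, SI_ORDER_ANY, kernel_pmap_active_init,
    NULL);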

This revision now requires review to proceed. Nov 5 2021, 7:16 PM

I will look at this change over the weekend.

In D32792#741160, @kib wrote:


For smp_cache_flush(), we can pass kernel_pmap, which would not be a significant abuse. IMO removing one trivially deducible argument is good.

That said, the smp_masked_* functions are really a continuation of the pmap implementation; for proof, look at the IPI handlers, which are absolutely tied to the pmap variants.

I implemented this suggestion. I'm still a bit skeptical: on i386 there are uses of smp_masked_*() outside of the pmap layer, and the preipi barrier is somewhat subtle. The IPI handlers could arguably live in pmap.c, given that they dig into pmap entrails.

Maybe we can eliminate the smp_masked_*() functions altogether; they do nothing but redirect to smp_targeted_tlb_shootdown() and increment counters, which could be inlined into pmap_invalidate_*() without loss of readability.

I left them alone for now. IMO it would be a nice readability improvement to convert the IPI counters to counter(9) and get rid of the ugly COUNT_IPI ifdefs as well. It is not quite trivial since the rendezvous counter is incremented in asm.
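A sketch of the counter(9) idea; the counter name and the wrapper are hypothetical, and the allocation must run after counter(9)/UMA are initialized:

#include <sys/param.h>
#include <sys/counter.h>
#include <sys/malloc.h>

static counter_u64_t ipi_invltlb_count;

static void
ipi_counters_alloc(void)
{
        ipi_invltlb_count = counter_u64_alloc(M_WAITOK);
}

static __inline void
ipi_invltlb_hit(void)
{
        /* A cheap per-CPU add; no #ifdef COUNT_IPI guard needed. */
        counter_u64_add(ipi_invltlb_count, 1);
}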

This revision is now accepted and ready to land. Nov 6 2021, 2:33 AM
sys/amd64/amd64/pmap.c
3017–3018

Could we instead define this as an inline in amd64/include/pmap.h?

markj marked an inline comment as done.

Move pmap_invalidate_cpu_mask() to pmap.h.

This revision now requires review to proceed. Nov 7 2021, 8:29 PM
This revision is now accepted and ready to land. Nov 8 2021, 5:36 PM