
arm64: Include NUMA locality info in the CPU topology
ClosedPublic

Authored by markj on Feb 10 2021, 9:11 PM.

Details

Summary

ULE uses the CPU topology to try and make locality-preserving scheduling
decisions. On NUMA systems we should provide info about the NUMA
topology so that the kernel avoids migrating threads between domains as
much as possible. Add some checking to make sure that each domain's CPU
set is contiguous before trying to use the generic SMP topology-building
routines.
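
The contiguity check mentioned above can be illustrated with a minimal sketch. It is not the code from this revision: it assumes a single 64-bit word stands in for the kernel's CPU set type, and the function name `mask_is_contiguous` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the set bits of 'mask' form one unbroken run
 * (for example CPUs 20-39), false if the mask is empty or has holes.
 * Assumption: a 64-bit word stands in for the kernel's CPU set type.
 */
bool
mask_is_contiguous(uint64_t mask)
{
	if (mask == 0)
		return (false);
	/* Shift the run down so that its lowest set bit sits at bit 0. */
	while ((mask & 1) == 0)
		mask >>= 1;
	/* A contiguous run now has the form 2^n - 1, so mask + 1 shares no bits. */
	return ((mask & (mask + 1)) == 0);
}
```

For the domains in the test plan, a mask such as `0xfffff00000` (CPUs 20-39) passes, while an interleaved mask such as `0x5` does not.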

On arm64 we also have the affinity values from mpidr registers to
provide more fine-grained locality information, but the values seem to
be largely arbitrary on systems that I've looked at. They are used by
the gicv3 driver and seem to have some arbitrary constraints. So don't
use them for now.

Tested on an Ampere Altra with a single package partitioned into four
NUMA domains.

Test Plan

On the above-mentioned system I have:

kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="80" mask="ffffffffffffffff,ffff,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79</cpu>
  <children>
   <group level="2" cache-level="3">
    <cpu count="20" mask="fffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19</cpu>
   </group>
   <group level="2" cache-level="3">
    <cpu count="20" mask="fffff00000,0,0,0">20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39</cpu>
   </group>
   <group level="2" cache-level="3">
    <cpu count="20" mask="fffff0000000000,0,0,0">40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59</cpu>
   </group>
   <group level="2" cache-level="3">
    <cpu count="20" mask="f000000000000000,ffff,0,0">60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79</cpu>
   </group>
  </children>
 </group>
</groups>
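
To make the `mask` attributes in the output above easier to read, here is a small sketch that rebuilds them from a CPU range. It assumes what the output itself suggests: four 64-bit words printed in hex, lowest-numbered CPUs first, separated by commas. The names `format_cpu_mask` and `mask_matches` are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define	NWORDS	4	/* assumption: 4 x 64-bit words, as in the output above */

/*
 * Format CPUs first..last (inclusive) in the style of the mask attribute:
 * 64-bit words in hex, lowest-numbered CPUs first, separated by commas.
 */
void
format_cpu_mask(int first, int last, char *buf, size_t len)
{
	uint64_t words[NWORDS] = { 0 };
	int cpu;

	assert(first >= 0 && first <= last && last < NWORDS * 64);
	for (cpu = first; cpu <= last; cpu++)
		words[cpu / 64] |= UINT64_C(1) << (cpu % 64);
	snprintf(buf, len, "%" PRIx64 ",%" PRIx64 ",%" PRIx64 ",%" PRIx64,
	    words[0], words[1], words[2], words[3]);
}

/* Compare the formatted mask for a CPU range against an expected string. */
bool
mask_matches(int first, int last, const char *expected)
{
	char buf[4 * 17];	/* 4 words of up to 16 digits, 3 commas, NUL */

	format_cpu_mask(first, last, buf, sizeof(buf));
	return (strcmp(buf, expected) == 0);
}
```

The fourth domain (CPUs 60-79) is the interesting case: it straddles the first two words, which is why its mask reads `f000000000000000,ffff,0,0`.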

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

markj requested review of this revision. Feb 10 2021, 9:11 PM
markj added a reviewer: arm64.

Should we include big.LITTLE cluster info in the topo info returned from cpu_topo (in a future patch)? Each cluster will most likely share an L2 cache, although I'm not sure if this is required.

Certainly, this commit's just trying to do the bare minimum for NUMA ARM servers. If other info about the topology is available we should incorporate it.

I agree with Andrew. We should use mpidr to build a core topology. Nowadays it's easy: mpidr is stored in pcpu for all enumerated cores.
But there is another problem: cpuid should be treated as an arbitrarily chosen value with no connection to the core topology. Nobody can guarantee that cores are numbered sequentially within a NUMA domain. Also, the assumption that NUMA domains are always symmetric (have the same number of cores) looks too optimistic. Ampere, as well as the LX2160, uses multiple clusters of dual cores with a per-cluster L2 cache, so I think that L2 cache locality should also be included in the initial implementation.
By that I mean that I offer help with implementation and testing on FDT-based systems (unfortunately, ACPI is outside my scope and my setup).

In D28579#641105, @mmel wrote:

I agree with Andrew. We should use mpidr to build a core topology. Nowadays it's easy: mpidr is stored in pcpu for all enumerated cores.

I am skeptical that MPIDR can generically be used to create a CPU topology. The definitions of the affinity levels are too abstract, and different vendors are not consistent in how they set them. In some cases it does not provide a full picture. For instance, on an Altra with the processor, caches and memory controller partitioned into four "fake" NUMA domains (CPUs 0-19, 20-39, 40-59, 60-79), we have: https://reviews.freebsd.org/P482

In particular, I cannot see how one would determine from this that there should be 4 top-level CPU groups.

But there is another problem: cpuid should be treated as an arbitrarily chosen value with no connection to the core topology. Nobody can guarantee that cores are numbered sequentially within a NUMA domain. Also, the assumption that NUMA domains are always symmetric (have the same number of cores) looks too optimistic. Ampere, as well as the LX2160, uses multiple clusters of dual cores with a per-cluster L2 cache, so I think that L2 cache locality should also be included in the initial implementation.

I rewrote the patch to drop these assumptions and will upload it shortly. I wanted to use the generic smp_topo_* routines, which force these assumptions, but it is not a lot more work to just build the topology here.

How can we determine L2 cache locality? Is this info sometimes provided in the FDT?

By that I mean that I offer help with implementation and testing on FDT-based systems (unfortunately, ACPI is outside my scope and my setup).

Certainly that would be welcome. I don't intend for this to be a complete solution, just a slight improvement for NUMA server systems.

  • Permit non-contiguous domain cpuset masks.
  • Permit asymmetric configurations where some domains have more CPUs than others.
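
A per-domain grouping that tolerates both cases listed above can be sketched as follows. This is only an illustration and not the code in the updated diff: the per-CPU domain table, the `MAXDOM` limit and the name `build_domain_masks` are assumptions, and a 64-bit word again stands in for the kernel's CPU set type.

```c
#include <assert.h>
#include <stdint.h>

#define	MAXDOM	8	/* hypothetical domain limit for this sketch */

/*
 * Build one CPU mask per NUMA domain from a per-CPU domain table.
 * Nothing here requires a domain's CPUs to be adjacent or the domains
 * to be the same size.  Returns the number of non-empty domains.
 */
int
build_domain_masks(const int *cpu_domain, int ncpus, uint64_t masks[MAXDOM])
{
	int cpu, dom, ndomains = 0;

	assert(ncpus >= 0 && ncpus <= 64);
	for (dom = 0; dom < MAXDOM; dom++)
		masks[dom] = 0;
	for (cpu = 0; cpu < ncpus; cpu++) {
		dom = cpu_domain[cpu];
		assert(dom >= 0 && dom < MAXDOM);
		if (masks[dom] == 0)
			ndomains++;	/* first CPU seen in this domain */
		masks[dom] |= UINT64_C(1) << cpu;
	}
	return (ndomains);
}
```

With a table such as `{0, 1, 0, 1, 1, 1}`, domain 0 ends up with the non-adjacent CPUs 0 and 2 and domain 1 with four CPUs, so the result is both non-contiguous and asymmetric.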

I've spent some time digging through the ARM documentation, but unfortunately we don't seem to be able to determine the exact cache topology. But I think we can estimate it with a reasonable degree of accuracy. For the rest, I assume that bit [24] is set (otherwise the affinity fields are shifted).

We can assume that the L1 cache is located on the core (identified by the aff1 field) and the L2 cache is located on the cluster (identified by the aff2 field). The L3 cache itself cannot be determined from the MPIDR, and it is questionable whether knowledge of the exact location and distribution of L3 can be used for anything at all. On most SoCs, L3 is not present or is shared by the entire SoC. In these cases, the L3 topology has no effect on OS optimization.
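
The field layout behind this reasoning can be written down as a short sketch. The bit positions are those of MPIDR_EL1 (Aff0 in bits [7:0], Aff1 in [15:8], Aff2 in [23:16], MT in bit 24, Aff3 in [39:32]). The mapping of fields to "core" and "cluster" follows the assumption stated in the comment above and is not guaranteed across vendors. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Affinity fields of MPIDR_EL1, plus the MT (multithreading) bit. */
struct mpidr_aff {
	uint8_t	aff0, aff1, aff2, aff3;
	bool	mt;
};

struct mpidr_aff
mpidr_decode(uint64_t mpidr)
{
	struct mpidr_aff a;

	a.aff0 = mpidr & 0xff;
	a.aff1 = (mpidr >> 8) & 0xff;
	a.aff2 = (mpidr >> 16) & 0xff;
	a.aff3 = (mpidr >> 32) & 0xff;
	a.mt = ((mpidr >> 24) & 1) != 0;
	return (a);
}

/*
 * Assumption from the discussion: with MT set, aff1 names the core and
 * aff2 the cluster; with MT clear, the fields are shifted down by one.
 */
unsigned
mpidr_core_id(uint64_t mpidr)
{
	struct mpidr_aff a = mpidr_decode(mpidr);

	return (a.mt ? a.aff1 : a.aff0);
}

unsigned
mpidr_cluster_id(uint64_t mpidr)
{
	struct mpidr_aff a = mpidr_decode(mpidr);

	return (a.mt ? a.aff2 : a.aff1);
}
```

Two cores would then be treated as sharing an L2 cache when their cluster IDs, and all higher affinity fields, match. Whether that holds on a given SoC is exactly what the rest of this thread questions.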

On big iron, such as the Altra, CMN is usually used, so the L3 cache is segmented and distributed. Knowledge of the exact topology is again irrelevant. Misusing L3 to pass NUMA partitions to the scheduler makes sense, but I'm very curious how much it helps, or whether it’s only papering over the missing L2 topology.

From your dump I assume that the Altra has 40 clusters of dual cores, where the L2 cache is shared by the cores in a cluster. Am I right?

In D28579#641968, @mmel wrote:

I've spent some time digging through the ARM documentation, but unfortunately we don't seem to be able to determine the exact cache topology. But I think we can estimate it with a reasonable degree of accuracy. For the rest, I assume that bit [24] is set (otherwise the affinity fields are shifted).

We can assume that the L1 cache is located on the core (identified by the aff1 field) and the L2 cache is located on the cluster (identified by the aff2 field). The L3 cache itself cannot be determined from the MPIDR, and it is questionable whether knowledge of the exact location and distribution of L3 can be used for anything at all. On most SoCs, L3 is not present or is shared by the entire SoC. In these cases, the L3 topology has no effect on OS optimization.

Sure, the scheduler will collapse redundant layers in the topology in any case.

On big iron, such as the Altra, CMN is usually used, so the L3 cache is segmented and distributed. Knowledge of the exact topology is again irrelevant. Misusing L3 to pass NUMA partitions to the scheduler makes sense, but I'm very curious how much it helps, or whether it’s only papering over the missing L2 topology.

From your dump I assume that the Altra has 40 clusters of dual cores, where the L2 cache is shared by the cores in a cluster. Am I right?

The L2 cache is not shared, unlike the eMAG. Compare with the MPIDR affinity values there: https://dmesgd.nycbug.org/index.cgi?do=view&id=5074

Some experimentation showed that the amount of L3 cache space available to a given core depends on whether the system is partitioned. In "quadrant" mode (4 domains), there are write bandwidth cliffs between 1 and 2MB and 9 and 10MB. In "monolithic mode" (1 domain) the second cliff moves to about 34MB. So I believe that this diff gives the correct behaviour even if it does not illustrate the "true" cache topology: when migrating a thread the scheduler will first search for idle cores in the same domain.
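
The "same domain first" behaviour described above can be sketched as a two-step search. This is a deliberately simplified model and not ULE's actual algorithm, which also weighs load and walks the full group tree. The function name and the 64-bit masks are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick an idle CPU for a thread that last ran on CPU 'cur'.
 * Returns the lowest-numbered idle CPU in cur's domain if there is one,
 * otherwise the lowest-numbered idle CPU anywhere, or -1 if none is idle.
 */
int
pick_idle_cpu(uint64_t idle, const uint64_t *domain_masks, int ndomains,
    int cur)
{
	uint64_t candidates = 0;
	int cpu, dom;

	assert(cur >= 0 && cur < 64);
	/* Step 1: restrict the search to the domain containing 'cur'. */
	for (dom = 0; dom < ndomains; dom++) {
		if ((domain_masks[dom] & (UINT64_C(1) << cur)) != 0) {
			candidates = idle & domain_masks[dom];
			break;
		}
	}
	/* Step 2: nothing idle nearby, so widen to the whole system. */
	if (candidates == 0)
		candidates = idle;
	for (cpu = 0; cpu < 64; cpu++) {
		if ((candidates & (UINT64_C(1) << cpu)) != 0)
			return (cpu);
	}
	return (-1);
}
```

With two domains `{0x0f, 0xf0}` and CPUs 1 and 5 idle, a thread that last ran on CPU 6 is sent to CPU 5 rather than CPU 1, so it stays inside its L3 partition.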

To be clear, the Altra has a 1MB per-core L2 data cache and 32MB L3 (SLC). I'm pointing out that the L3 cache capacity is partitioned according to the NUMA configuration of the system.

Ahh, right, I forgot that Neoverse has (from this point of view) cache levels shifted – see slide 5 of https://www.slideshare.net/linaroorg/getting-the-most-out-of-dynamiq-enabling-support-of-dynamiq-sfo17104
So for the purpose of OS optimization we can treat the real L1 + L2 caches as L1 in the pre-Neoverse sense, the real L3 in the DynamIQ Shared Unit (DSU) block as L2 in the pre-Neoverse sense, and the CMN as L3.
I think we can still handle all the cases using a two-level hierarchy, where NUMA domains are exported as CG_SHARE_L3 groups and clusters as CG_SHARE_L2 groups. It should work on a server system, on a big.LITTLE system (RK3399) and also on a medium SoC (LX2160A, which has 8 dual-core clusters). Do you think so?
I'm not sure if you want to implement this "extension", and I don't want to block you. The code in this review looks fine to me and doesn't block anything, so push it as needed.
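
The two-level hierarchy proposed here can be pictured with a small sketch that answers one question: what is the closest cache level two CPUs share? The enum values are local stand-ins for the kernel's CG_SHARE_* constants. The `cpu_loc` structure, and the idea that each CPU carries a (domain, cluster) pair with cluster IDs unique within a domain, are assumptions made for illustration only.

```c
#include <assert.h>

/* Local stand-ins for the CG_SHARE_* levels named in the discussion. */
enum share_level {
	SHARE_NONE = 0,
	SHARE_L2 = 2,
	SHARE_L3 = 3
};

/* Hypothetical per-CPU location: NUMA domain and cluster within it. */
struct cpu_loc {
	int	domain;
	int	cluster;
};

/*
 * Two-level lookup: CPUs in the same cluster share L2, CPUs in the same
 * domain but different clusters share only L3, and CPUs in different
 * domains share nothing the scheduler should rely on.
 */
enum share_level
closest_shared_cache(const struct cpu_loc *a, const struct cpu_loc *b)
{
	assert(a && b);
	if (a->domain != b->domain)
		return (SHARE_NONE);
	if (a->cluster == b->cluster)
		return (SHARE_L2);
	return (SHARE_L3);
}
```

On an LX2160A-style layout (one domain, dual-core clusters), the two cores of a cluster map to `SHARE_L2` and cores from different clusters to `SHARE_L3`. On a partitioned server with one cluster per core, only the domain level carries information.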

This revision is now accepted and ready to land. Feb 16 2021, 1:47 PM

In D28579#642467, @mmel wrote:

Ahh, right, I forgot that Neoverse has (from this point of view) cache levels shifted – see slide 5 of https://www.slideshare.net/linaroorg/getting-the-most-out-of-dynamiq-enabling-support-of-dynamiq-sfo17104
So for the purpose of OS optimization we can treat the real L1 + L2 caches as L1 in the pre-Neoverse sense, the real L3 in the DynamIQ Shared Unit (DSU) block as L2 in the pre-Neoverse sense, and the CMN as L3.
I think we can still handle all the cases using a two-level hierarchy, where NUMA domains are exported as CG_SHARE_L3 groups and clusters as CG_SHARE_L2 groups. It should work on a server system, on a big.LITTLE system (RK3399) and also on a medium SoC (LX2160A, which has 8 dual-core clusters). Do you think so?

Yes, this sounds right. We may introduce a third CG_SHARE_L1 level for threaded cores; I believe there is a bit in MPIDR to indicate this. I'm not sure if it's common at all in the ARM64 world.

I'm not sure if you want to implement this "extension", and I don't want to block you. The code in this review looks fine to me and doesn't block anything, so push it as needed.

Thanks. At the moment I don't have any hardware to test such an extension, so I doubt I will work on it anytime soon.