Page MenuHomeFreeBSD

netmap: Make memory pools NUMA-aware
ClosedPublic

Authored by markj on Sep 13 2024, 8:00 PM.
Tags
None
Referenced Files
F107605938: D46666.diff
Thu, Jan 16, 1:49 PM
Unknown Object (File)
Thu, Jan 9, 1:36 AM
Unknown Object (File)
Mon, Jan 6, 12:24 PM
Unknown Object (File)
Fri, Jan 3, 3:33 PM
Unknown Object (File)
Fri, Dec 27, 1:02 PM
Unknown Object (File)
Wed, Dec 25, 12:47 PM
Unknown Object (File)
Nov 30 2024, 11:10 PM
Unknown Object (File)
Nov 30 2024, 7:46 AM
Subscribers

Details

Summary

Each netmap adapter associated with a physical adapter is attached to a
netmap memory pool. contigmalloc() is used to allocate physically
contiguous memory for the pool, but ideally we would ensure that all
such memory is allocated from the NUMA domain local to the adapter.

Augment netmap's memory pools with a NUMA domain ID, similar to how
IOMMU groups are handled in the Linux port. That is, when attaching to
a physical adapter, ensure that the associated memory pools are local to
the adapter's associated memory domain, creating new pools as needed.

Some types of ifnets do not have any defined NUMA affinity; in this case
the domain ID in question is IF_NODOM.

Sponsored by: Klara, Inc.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 59806
Build 56691: arc lint + arc unit

Event Timeline

@vmaffione I wonder if you have any thoughts on this patch, or the general direction?

Sorry for the delay. Thanks for the patch.
I wonder if nm_grp can be reused.

sys/dev/netmap/netmap_mem2.c
178

The nm_grp is already used in Linux for a similar purpose.
Would it be possible to reuse this field rather than adding another one, where nm_grp is only used in Linux and nm_numa_domain is only used in FreeBSD? Maybe with some macros to alias the field?

578

Following the reasoning above, we could define IF_NODOM as -1 in Linux

sys/dev/netmap/netmap_mem2.c
178

I read the Linux implementation of nm_iommu_group_id() and came to the conclusion that nm_grp was only used for the IOMMU domain. Does that naturally map 1-1 onto NUMA domains? I didn't think so in general, but maybe I'm missing something.

I then added the separate nm_numa_domain field since reusing nm_grp for a different purpose on FreeBSD felt wrong, but sure, we can reuse the field.

Remove the separate NUMA domain field and overload nm_grp instead.

sys/dev/netmap/netmap_mem2.c
578

In the revised patch I just use -1 as a sentinel value everywhere, since that's a bit clearer IMO.

Hi Mark,
You were actually right, and with a fresh mind I realized I overlooked some important aspects (sorry for that).
Yes, IOMMU domains and NUMA domains are not the same thing at all, so it's probably not wise to overload the field. But the main point here is that the difference between the two has implications for the purpose of this patch.

When IOMMU is in place, you must have different netmap allocators for each IOMMU "domain", simply because if device em0 and device em1 are not in the same IOMMU domain, they cannot access the same memory, (at least in general, and in any case not necessarily at the same DMA address). This is important for applications, because you cannot zerocopy move (e.g. by swapping instances of struct netmap_slot) between rings belonging to different devices if the two devices do not use the same netmap allocator. Zerocopy move is an essential feature of netmap, e.g. see the bridge example or lb, or thing of a router application.

Now, I guess this patch is an optimization to make sure that we don't end up in a situation where a device attached to NUMA domain 1 needs to DMA to/from banks of memory attached to NUMA domain 0, because the overhead is higher w.r.t. a situation where DMA memory access is local to the domain.
This optimization would for sure be useful for applications like pkt-gen, where the application uses only a single device.

However, by creating a separate netmap allocators for each NUMA domain, you loose the possibility to zerocopy move between devices that are in different NUMA domains. If you think about the bridge example, you may want to bridge em0 and em1, where em0 is attached to NUMA domain 0, and em1 is attached to NUMA domain 1. Without your patch, say that the memory of the global netmap allocator ends up in NUMA domain 0. Then if you receive a packet from em0, DMA access will be NUMA local; then bridge will zerocopy swap the slot and the subsequent transmission on em1 would be NUMA non-local. With your patch, in the same situation, you would have NUMA local access for receive on em0, then packet copy from NUMA 0 memory to NUMA 1 memory (so NUMA non-local), and then again NUMA local access for transmit on em1. I can't be sure without measuring, but I suspect this may worsen the performance.

So maybe we can follow a different approach? We could have an option such as NETMAP_REQ_OPT_NUMA to be checked at NETMAP_REQ_REGISTER time. If the option is in on, then you would create a new NUMA-aware allocator for the device you are registering (if the allocator does not already exist). In addition to that, to avoid the need for application changes, we could also have a sysctl control (off by default) which enables the same behaviour of the option, for all the devices.

What do you think?

Hi Mark,
You were actually right, and with a fresh mind I realized I overlooked some important aspects (sorry for that).
Yes, IOMMU domains and NUMA domains are not the same thing at all, so it's probably not wise to overload the field. But the main point here is that the difference between the two has implications for the purpose of this patch.

When IOMMU is in place, you must have different netmap allocators for each IOMMU "domain", simply because if device em0 and device em1 are not in the same IOMMU domain, they cannot access the same memory, (at least in general, and in any case not necessarily at the same DMA address). This is important for applications, because you cannot zerocopy move (e.g. by swapping instances of struct netmap_slot) between rings belonging to different devices if the two devices do not use the same netmap allocator. Zerocopy move is an essential feature of netmap, e.g. see the bridge example or lb, or thing of a router application.

Now, I guess this patch is an optimization to make sure that we don't end up in a situation where a device attached to NUMA domain 1 needs to DMA to/from banks of memory attached to NUMA domain 0, because the overhead is higher w.r.t. a situation where DMA memory access is local to the domain.
This optimization would for sure be useful for applications like pkt-gen, where the application uses only a single device.

However, by creating a separate netmap allocators for each NUMA domain, you loose the possibility to zerocopy move between devices that are in different NUMA domains. If you think about the bridge example, you may want to bridge em0 and em1, where em0 is attached to NUMA domain 0, and em1 is attached to NUMA domain 1. Without your patch, say that the memory of the global netmap allocator ends up in NUMA domain 0. Then if you receive a packet from em0, DMA access will be NUMA local; then bridge will zerocopy swap the slot and the subsequent transmission on em1 would be NUMA non-local. With your patch, in the same situation, you would have NUMA local access for receive on em0, then packet copy from NUMA 0 memory to NUMA 1 memory (so NUMA non-local), and then again NUMA local access for transmit on em1. I can't be sure without measuring, but I suspect this may worsen the performance.

What you say is true, but if the system performance depends on forwarding data across NUMA domains, it is already quite suboptimal. That is, when designing a NUMA system that will be forwarding packets between interfaces, the interfaces already ought to be attached to the same NUMA domain if one is concerned about performance. It will still be incurring a lot of cross-socket memory bus traffic.

Do you have some experience which suggests otherwise?

So maybe we can follow a different approach? We could have an option such as NETMAP_REQ_OPT_NUMA to be checked at NETMAP_REQ_REGISTER time. If the option is in on, then you would create a new NUMA-aware allocator for the device you are registering (if the allocator does not already exist). In addition to that, to avoid the need for application changes, we could also have a sysctl control (off by default) which enables the same behaviour of the option, for all the devices.

What do you think?

I think that's reasonable. To be explicit, it should end up being something like this:

  1. Restore the previous version of the patch, since we don't want to conflate the NUMA and IOMMU domains.
  2. Add a sysctl which controls whether nm_numa_domain() returns the interface's local domain, or -1.
  3. Add a NETMAP_REQ_OPT_NUMA option which can be used to select the desired pool's memory domain, overriding the global behaviour from nm_numa_domain(). The option can select a specific NUMA domain, or use -1 to request memory from the domain local to the adapter.
  4. Add some libnetmap helper that can be used to specify a NUMA domain prior to port registration.
  5. Update man pages (I should have done this anyway).

Does that sound right? My slight preference is to omit steps 3 and 4, and just provide a sysctl (I don't care about the default very much, but IMO having domain-local memory by default makes more sense) and documentation.

  • Restore the previous patch, i.e., don't use nm_grp to represent the NUMA domain, as that's logically distinct from the IOMMU domain.
  • Add a sysctl to restore the old behaviour, and document it.
  • Avoid using IF_NODOM in generic code. Instead, translate it to -1 in the FreeBSD layer to avoid having to define constants for other operating systems.

What do you think?

I think that's reasonable. To be explicit, it should end up being something like this:

  1. Restore the previous version of the patch, since we don't want to conflate the NUMA and IOMMU domains.
  2. Add a sysctl which controls whether nm_numa_domain() returns the interface's local domain, or -1.
  3. Add a NETMAP_REQ_OPT_NUMA option which can be used to select the desired pool's memory domain, overriding the global behaviour from nm_numa_domain(). The option can select a specific NUMA domain, or use -1 to request memory from the domain local to the adapter.
  4. Add some libnetmap helper that can be used to specify a NUMA domain prior to port registration.
  5. Update man pages (I should have done this anyway).

Does that sound right? My slight preference is to omit steps 3 and 4, and just provide a sysctl (I don't care about the default very much, but IMO having domain-local memory by default makes more sense) and documentation.

I updated the diff accordingly, please let me know what you think, and I'll go ahead with adding a NUMA option to the registration code, probably in a separate patch.

No, I don't have experience on that. I was just saying that only by measuring (maybe with variable packet length) one can assess the real impact. It may well be, as you say, that once you need to copy data across NUMA domain, the zerocopy feature is not too relevant.

Sounds good. For the default, I would stand by the principle of least astonishment... if you default to domain-local memory, applications that used to see (and maybe rely on) zerocopy will change behaviour (or even break if they are erroneously not checking that zerocopy is actually possible). Then maybe in a future patch, giving some time, we can change the default.

This revision is now accepted and ready to land.Oct 9 2024, 7:17 AM

No, I don't have experience on that. I was just saying that only by measuring (maybe with variable packet length) one can assess the real impact. It may well be, as you say, that once you need to copy data across NUMA domain, the zerocopy feature is not too relevant.

Sounds good. For the default, I would stand by the principle of least astonishment... if you default to domain-local memory, applications that used to see (and maybe rely on) zerocopy will change behaviour (or even break if they are erroneously not checking that zerocopy is actually possible). Then maybe in a future patch, giving some time, we can change the default.

Hmm, is it possible that enabling use of domain-local memory will actually break functionality? Or is zero-copy forwarding just an optimization?

If the application wants to zerocopy between two netmap ports, it must check that the two ports are associated to the same memory allocator. In libnetmap, for example, you would check that nmport_d::mem for the two ports point to the same address.
If the application is well written, it will check and fall back to copy if zerocopy is not possible. That is, changing the default will only potentially disable an optimization.

However, existing application may miss this check, assuming the zerocopy is possibile. This is not desirable, but may happen in the real world. For those cases, the NUMA-local default will break the application.

  • Make the sysctl a read-only tunable - it must be set at boot time, since native netmap drivers select a memory pool at driver attach time.
  • Invert the default behaviour.
  • Update documentation a bit.
This revision now requires review to proceed.Oct 10 2024, 2:36 PM

If the application wants to zerocopy between two netmap ports, it must check that the two ports are associated to the same memory allocator. In libnetmap, for example, you would check that nmport_d::mem for the two ports point to the same address.
If the application is well written, it will check and fall back to copy if zerocopy is not possible. That is, changing the default will only potentially disable an optimization.

However, existing application may miss this check, assuming the zerocopy is possibile. This is not desirable, but may happen in the real world. For those cases, the NUMA-local default will break the application.

I see, thank you. I updated the patch to keep the existing behaviour by default.

This revision is now accepted and ready to land.Oct 11 2024, 1:06 PM
This revision was automatically updated to reflect the committed changes.