sendfile: allocate pages from the local NUMA domain when SF_NOCACHE is set
Abandoned · Public

Authored by gallatin on Apr 25 2019, 11:14 PM.

Details

Summary

When we do not anticipate reuse of the pages backing a sendfile() request, it makes sense to allocate them from the NUMA domain local to the inpcb of the socket making the request. This reduces cross-domain traffic on multi-socket systems to at most one cross-domain transfer: the DMA write from the storage controller to the backing page. Having the pages local is especially useful for software kernel TLS, where the pages may be accessed by the CPU.

This change adds domain awareness to sendfile() when SF_NOCACHE is set.
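
For illustration, a minimal userspace sketch of the kind of request this change targets; the file path and the already-connected socket are placeholders, and error handling is trimmed:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    /*
     * Serve a file over an already-connected socket 'sock', hinting with
     * SF_NOCACHE that the backing pages are unlikely to be reused.  With
     * this change, those pages can be allocated from the NUMA domain
     * local to the connection's inpcb.
     */
    static void
    serve_once(int sock)
    {
            off_t sbytes;
            int fd;

            fd = open("unpopular.bin", O_RDONLY);   /* placeholder path */
            if (fd == -1)
                    err(1, "open");
            if (sendfile(fd, sock, 0, 0, NULL, &sbytes, SF_NOCACHE) == -1)
                    err(1, "sendfile");
            close(fd);
    }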

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

sys/vm/vm_page.c
4021 ↗(On Diff #56678)

A problem with this (and the motivation for the _domainset() KPIs) is that allocations from the local domain may fail even if there is plenty of free memory. vm_page_grab_pages() by default will sleep if the allocation fails.

4043 ↗(On Diff #56678)

Style: missing newline.

4048 ↗(On Diff #56678)

Style: indentation should be by four spaces.

IMO it is too special-purpose, and potentially affects other loads.

Vm objects can have domain policies assigned, and an object policy trumps a thread policy, if present. I would prefer that we added infrastructure to assign a policy to the file's vm_object, e.g. using some new ioctl on the file descriptor, and that userspace did the job of explicitly stating the desired policy.

In D20062#431501, @kib wrote:

IMO it is too special-purpose, and potentially affects other loads.

Vm objects can have domain policies assigned, and an object policy trumps a thread policy, if present. I would prefer that we added infrastructure to assign a policy to the file's vm_object, e.g. using some new ioctl on the file descriptor, and that userspace did the job of explicitly stating the desired policy.

I would prefer not to issue an ioctl for every file; that would increase the system call overhead substantially for workloads with small files. Would you be opposed to a sysctl to control the behavior? How about a new sendfile flag?

sys/vm/vm_page.c
4021 ↗(On Diff #56678)

I think what you're saying is that rather than plumb a path through to vm_page_alloc_domain_after(), what I really should be doing is to set obj->domain.dr_policy to DOMAINSET_PREF() (or FIXED()), and set obj->domain.dr_iter to the domain I want?

Note that in my use case, I'm totally fine with the way it currently behaves.

I've updated the patch to remove my hand-plumbed path down to vm_page_alloc_domain_after(), and changed to setting the domain policy on the object.
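
Concretely, "setting the domain policy on the object" amounts to something like the following sketch (not the actual diff; 'domain' stands for the domain derived from the connection's inpcb):

    /*
     * Sketch: prefer the connection-local domain for pages backing the
     * file's vm_object.  The DOMAINSET_PREF() policy falls back to other
     * domains when the preferred one is exhausted.
     */
    VM_OBJECT_WLOCK(obj);
    obj->domain.dr_policy = DOMAINSET_PREF(domain);
    VM_OBJECT_WUNLOCK(obj);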

In D20062#431501, @kib wrote:

IMO it is too special-purpose, and potentially affects other loads.

Vm objects can have domain policies assigned, and an object policy trumps a thread policy, if present. I would prefer that we added infrastructure to assign a policy to the file's vm_object, e.g. using some new ioctl on the file descriptor, and that userspace did the job of explicitly stating the desired policy.

Thinking more about this, I'm having a hard time understanding why you think that this is too special-purpose. In this case, the application has told the kernel that it wants the backing pages recycled quickly by using the SF_NOCACHE flag. If that is the case, then we don't expect anybody else to be using those pages, so what is the harm in allocating them in a way that is correct for this application's needs?

sys/kern/kern_sendfile.c
752

Is there a way to determine that the object does not already have a policy assigned? I mean, if the object got its policy assigned some other way (perhaps we will grow the ability for users to apply a policy to files or mount points), then it should not be overwritten here.

BTW, suppose we have two domains, two network cards with affinity to the corresponding domains, and two incoming connections arriving over distinct input paths, with the same file served to both. I do not think it is reasonable to ping-pong the object's domain setting.

Also, please add a comment explaining the reasoning. I initially intended to ask for a sysctl there to enable this behavior, but later decided that there is no point, assuming the policy does not override a higher-priority setting.

sys/kern/kern_sendfile.c
752

To check for an object policy, we can simply test whether obj->domain.dr_policy is non-NULL, like vm_domainset_iter_page_init() does. I agree that we should not override an existing policy.

I do not believe it is necessary to initialize dr_iter for the DOMAINSET_PREF policy; it's only used for round-robin. Similarly it's sufficient to test the dr_policy pointer.

754

We should re-check after acquiring the object lock.

Address comments by markj and kib

  • add a comment explaining what is happening
  • only replace the domain policy when it is NULL
  • re-check the policy after acquiring the object lock

I'm going to have to re-test with this.
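
A rough sketch of the guarded logic described above (my reconstruction, not the diff itself):

    /* Leave any existing object policy alone. */
    if (obj->domain.dr_policy == NULL) {
            VM_OBJECT_WLOCK(obj);
            /* Re-check now that the object lock is held. */
            if (obj->domain.dr_policy == NULL)
                    obj->domain.dr_policy = DOMAINSET_PREF(domain);
            VM_OBJECT_WUNLOCK(obj);
    }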

sys/kern/kern_sendfile.c
754–763

It would appear that subsequent sendfile(SF_NOCACHE) calls on the same file will use the policy from the first call even if the current socket's pcb resides within a different domain.

sys/kern/kern_sendfile.c
754–763

That's what I was worried about (and why I need to re-test). I am not familiar with the vm_object lifecycle, and was concerned that objects may linger and create this sort of behavior.

sys/kern/kern_sendfile.c
754–763

Indeed, this is a problem. It causes at least a 50% increase in cross-domain NUMA traffic in my test setup (13% -> 20%).

Is there a solution I'm not thinking of? Kib has a point about different active senders of the same file "ping-ponging" if I revert part of this, and change the check from policy == NULL back to policy != DOMAINSET_PREF(domain).

For our (Netflix) workload, reverting to forcing a change whenever the policy prefers a domain other than the current one is the obvious choice. I think it may also be the obvious choice in general, as I'm not sure how the "ping-ponging" situation is any worse than round-robin allocations, which is roughly what it degrades to.

As alc pointed out (and as I've confirmed), the vm_object will linger with the domain policy set, causing an increase in cross-domain traffic for our workload. I've restored the behavior from the older patch where we override the current domain policy. This reduces cross-domain traffic to previous levels.

As kib suggested, this may be suboptimal for some workloads, so I've added a sysctl to disable this behavior.
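
The shape of that final version is roughly the following sketch; the sysctl name is made up for illustration and may not match the diff:

    static int sfnocache_local_alloc = 1;
    SYSCTL_INT(_kern_ipc, OID_AUTO, sfnocache_local_alloc, CTLFLAG_RWTUN,
        &sfnocache_local_alloc, 0,
        "Prefer the connection-local NUMA domain for SF_NOCACHE pages");

    /* ... in the sendfile page allocation path ... */

    /*
     * SF_NOCACHE pages are expected to be recycled quickly, so bias
     * their allocation toward the connection's local domain even if an
     * earlier sender left a different preference on the object.
     */
    if (sfnocache_local_alloc &&
        obj->domain.dr_policy != DOMAINSET_PREF(domain)) {
            VM_OBJECT_WLOCK(obj);
            obj->domain.dr_policy = DOMAINSET_PREF(domain);
            VM_OBJECT_WUNLOCK(obj);
    }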

I still do not quite understand the benefits of this change.

Don't you (Netflix) have configurations where the machine has at least two network cards, each plugged into lanes from a separate NUMA domain? Could it be that you have two TCP connections, each coming from its own card, streaming the same file?

In D20062#432560, @kib wrote:

I still do not quite understand the benefits of this change.

There should be some benefit from just this and D20060 (LACP egress port selection by NUMA domain) on machines with 4 (or more) domains. The combination will at least reduce a non-encrypted sendfile() to a 75% chance of a single domain crossing on the DMA write by the storage controllers. Without these two changes, the connection could be local to domain 0, the page could be on domain 1, and the egress NIC could be on domain 2, for a total of two domain crossings for the bulk data (75% chance of a crossing on the DMA write from storage, 75% chance of a crossing on the DMA read to the NIC). With this and D20060, there will be a 75% chance of a single domain crossing for the DMA write, but then the page will be local to the egress NIC. Sadly, the only 4-domain machine I have is AMD, and I have no way to measure Infinity Fabric utilization.
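
To spell out the arithmetic (assuming pages, connections, and NICs land on each of the 4 domains independently and uniformly): each independent placement has a 3/4 = 75% chance of crossing domains, so without the two changes the bulk data sees roughly 0.75 + 0.75 = 1.5 expected crossings per request (storage DMA write plus NIC DMA read), while with them the NIC read is always local and the expectation drops to about 0.75.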

For a dual-socket system, without quite a few more patches, the benefits are not going to be very big. That's the problem with NUMA: there are not a lot of measurable incremental improvements unless you have all the pieces in place. However, the following future changes from me, Jeff, and John Baldwin make this very important. Specifically:

  1. My patch to alter the behavior of SO_REUSEPORT_LB listen sockets to filter incoming connections by domain makes it possible for sendfile() to always be called by a web server running on the connection's local domain.
  2. A patch (done horribly by me, refined by Jeff) to back the VM page array and vm_domain structures (and pa_locks) with domain-specific memory, which makes the actions needed to manage a page local.
  3. The major benefit comes when software kernel TLS is in use, as the page allocated by sendfile() is then read by the CPU as the source for crypto by a TLS worker thread running on the connection's local domain. John is working on upstreaming kernel TLS for us.
  4. My patch in D20060 to select LACP egress ports by domain.

Don't you (Netflix) have configurations where the machine has at least two network cards, each plugged into lanes from a separate NUMA domain? Could it be that you have two TCP connections, each coming from its own card, streaming the same file?

Yes, we are testing configurations with one NIC per NUMA domain. Such configurations are the motivation for this patchset.

At least in our workload, the web server marks sendfile() requests with SF_NOCACHE when it is serving an "unpopular" file, e.g. a file that the data science folks do not anticipate will be streamed again in the near future. Hence it is unlikely that the same file will be bouncing between domains. I was slightly worried about this bouncing effect, and that's why I re-measured things.

The results that I have for UPI cross-domain traffic, as reported by Intel's pcm.x as "QPI data traffic/Memory controller traffic", are below. The system where I'm taking these numbers is a dual-socket Xeon Silver 4216, with 4 NVMe drives and 1 100GbE NIC on each domain, serving 100% TLS-encrypted traffic at 180-190Gb/s. This machine is running my entire patch stack, as well as some things from Jeff (UMA cross-domain free support).

  • Original patch (hand-plumbed path down to vm_page_alloc_domain_after(), not touching the object domain preference): 13%
  • Patch as of early yesterday (set the domain preference on the object only when it is not currently set): 21%+, with the cross-domain traffic still seeming to climb
  • This patch: 13%

I think that there is a better approach than what you are attempting here. Once you have the patch that you describe as "My patch to alter the behavior of SO_REUSEPORT_LB listen sockets ...", the thread executing sendfile(2) will be running on the same domain as the socket. Recall Kostik's first comment, "Vm objects can have domain policies assigned, and object policy trumps a thread policy, if present." In other words, if you don't define an object policy, the page allocations will be governed by the thread's policy. A thread policy that says allocate from the local domain should achieve your desired outcome.

In D20062#432787, @alc wrote:

I think that there is a better approach than what you are attempting here. Once you have the patch that you describe as "My patch to alter the behavior of SO_REUSEPORT_LB listen sockets ...", the thread executing sendfile(2) will be running on the same domain as the socket. Recall Kostik's first comment, "Vm objects can have domain policies assigned, and object policy trumps a thread policy, if present." In other words, if you don't define an object policy, the page allocations will be governed by the thread's policy. A thread policy that says allocate from the local domain should achieve your desired outcome.

Brilliant. I feel so silly.

I had not put the thread affinity together, as the SO_REUSEPORT_LB change I made is quite recent, and I've had the sendfile() patch for nearly a year. Thank you so much for the insight, and sorry for the waste of time.
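
For reference, a minimal userspace sketch of the thread-policy approach alc describes, assuming the server has already arranged for the worker thread to run on the connection's local domain; it uses cpuset_setdomain(2) with a first-touch policy, and error handling is trimmed:

    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <sys/domainset.h>
    #include <err.h>

    /*
     * Give the calling thread a first-touch domain policy, so that in the
     * absence of an object policy, page allocations made on its behalf
     * (including by sendfile(2)) come from the domain it is running on.
     */
    static void
    prefer_local_domain(void)
    {
            domainset_t mask;

            DOMAINSET_FILL(&mask);          /* allow all domains */
            if (cpuset_setdomain(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                sizeof(mask), &mask, DOMAINSET_POLICY_FIRSTTOUCH) != 0)
                    err(1, "cpuset_setdomain");
    }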