Page MenuHomeFreeBSD

vm: implement vm_page_reclaim_contig_domain_ext() to reclaim multiple contiguous regions at once
ClosedPublic

Authored by gallatin on Apr 20 2023, 8:40 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 17, 1:24 AM
Unknown Object (File)
Sun, Jan 12, 2:57 AM
Unknown Object (File)
Sun, Jan 12, 1:41 AM
Unknown Object (File)
Sun, Jan 12, 12:08 AM
Unknown Object (File)
Sat, Jan 11, 2:51 AM
Unknown Object (File)
Sat, Dec 28, 11:23 PM
Unknown Object (File)
Dec 17 2024, 6:53 AM
Unknown Object (File)
Dec 14 2024, 5:47 PM
Subscribers

Details

Summary

Ktls uses 16K physically contiguous crypto destination buffers as an optimization. These buffers are managed by UMA, using a cache zone where the import function allocates these buffers using vm_page_alloc_noobj_contig_domain(). After a server has been up for a while serving a "Netfllix" workload, physical memory becomes severely fragmented, resulting in these contig allocations failing when UMA tries to expand the zone.

When attempting to fix this by calling vm_page_reclaim_contig_domain() from the ktls alloc thread, I observed that the ktls alloc thread wound up consuming an entire core for hours. This is because vm_page_reclaim_contig_domain() scans all of physical memory looking for runs of relocatable pages, but then reclaims just one of those runs. The problem with this is that on a large memory machine (order of a 100GB or more), each call can take several seconds to scan all of physical memory.

The algorithm for vm_page_reclaim_contig_domain() is the way it is for good reasons (see https://reviews.freebsd.org/D28924#649775). So rather than modifying the core algorithm, I extended vm_page_reclaim_contig_domain() to take a "desired_runs" argument to allow the caller to request that it reclaim more than just a single run. There is no functional change intended for all existing callers. The first user for this interface is the ktls code (https://reviews.freebsd.org/D39421). By reclaiming multiple runs, ktls goes from consuming hours of CPU to refill its buffer zone to just tens of seconds.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

This makes sense to me overall.

sys/vm/vm_page.c
3041–3042
3059–3062
  • Updated diff to fix blank line issues pointed out by @jhb
  • remove ktls diff that just snuck in

Overall, I think that this is okay.

sys/vm/vm_page.c
3028

I would suggest "desired_runs" instead. Then, ...

3048–3049

That said, I would consider making this array larger than desired_runs, because some candidate runs will not wind up getting reclaimed due to concurrent activity.

3055

Given the way that this code uses the word "run", when I see the variable name "runsize", I would expect it to be used to represent the number of contiguous pages that you want a run to consist of. Here, however, this variable is used to describe the number of runs, not the length of a run. "nruns" would seem more appropriate.

3056
3159

The indentation here is not style(9) compliant.

Addressed Alan's feedback:

  • changed new arg name to "desired_runs"
  • increased array size
  • renamed runsize -> nruns
  • fixed style(9) indent issue (grr.. emacs..)

Missed MAX suggestion.. since it doesn't show up in email..

  • update to accept MAX() suggestion from Alan that I missed previously

I am in the middle of making up a final exam for my class. I will take a final look at this on Wednesday.

sys/vm/vm_page.c
3090

We can leak the array here.

  • update to avoid potential array leak, as pointed out by @markj
sys/vm/vm_page.c
3034–3035

Just a nit: nruns should be between min_reclaim and options.

3057

This could move outside (and after) the if/else. In fact, if desired_runs == 2 and npages == 3, I'd still like min_reclaim to be MIN_RECLAIM.

3091

You've already initialized ret to false.

3116

Edit:

3160

Isn't there a missing blank line after the closing brace?

sys/vm/vm_page.c
3059

Admittedly, I'm being pedantic here. :-) If NRUNS is to get us 1 desired run, then we add 1 for the second desired run, and so on.

dougm added inline comments.
sys/vm/vm_page.c
3121

If count * npages < min_reclaim on the first scan, we're doomed to make a second scan, and up to count * npages reclamations won't be available on the second scan, because they were reclaimed on the first scan, increasing the chances that the second scan will also fail. So I suggest that doing any reclamation when another scan is inevitable is a bad idea.

Address @alc 's feedback:

  • sort nruns declaration alphabetically between min_reclaim and option
  • move min_reclaim assignment outside if(){}
  • adjust nruns by -1
  • skip extra assignment of false to ret
  • add missing blank line between vm_page_reclaim_contig_domain() and vm_page_reclaim_contig()
gallatin added inline comments.
sys/vm/vm_page.c
3121

My intent was not to change the algorithm, but just to expand it to allow reclaiming more runs in a single (set of) pass(es).

I don't pretend to fully understand the nuances here.. Note that the comment above would seem to contradict what you are saying ("Reset "reclaimed" each time because each reclamation is idempotent, and runs will (likely) recur from one scan to the next as restrictions are relaxed")

If this is indeed a concern, can it be addressed separately from this change by someone who understands this better?

This revision is now accepted and ready to land.May 5 2023, 11:18 PM