Page MenuHomeFreeBSD

vm: Handle VM_ALLOC_ZERO in the page allocator
AbandonedPublic

Authored by markj on Feb 19 2021, 8:18 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Nov 3, 7:44 AM
Unknown Object (File)
Mon, Oct 21, 10:15 PM
Unknown Object (File)
Mon, Oct 21, 7:34 PM
Unknown Object (File)
Thu, Oct 17, 2:03 PM
Unknown Object (File)
Oct 7 2024, 6:15 PM
Unknown Object (File)
Oct 1 2024, 11:32 PM
Unknown Object (File)
Sep 29 2024, 6:25 AM
Unknown Object (File)
Sep 28 2024, 8:25 AM
Subscribers

Details

Reviewers
alc
kib
jeff
Group Reviewers
manpages
Summary

Currently VM_ALLOC_ZERO is advisory and callers must remember to zero
the pages themselves if necessary, which is most of the time. This is
error-prone, especially considering that vm_page_alloc() and
vm_page_grab() interpret VM_ALLOC_ZERO differently, and it complicates
callers. The main benefit appears to be an optimization for the fault
handler such that it can release the object lock before zeroing the
page. It also provides a minor improvement for platforms without a
direct map, since if they map the returned page anyway they can zero the
returned memory without having to use pmap_zero_page().

I propose zeroing pages in the allocator and handling
performance-sensitive cases more surgically. A follow-up commit will
modify consumers.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 37209
Build 34098: arc lint + arc unit

Event Timeline

sys/vm/vm_page.c
2393–2394

Wouldn't it be useful to set PG_ZERO if VM_ALLOC_ZERO was set? At least the work we did would not be lost.

markj marked an inline comment as done.

Set PG_ZERO if radix insertion fails and we zeroed the page.

If the page was already zeroed and VM_ALLOC_ZERO was not specified,
then we will not restore PG_ZERO. It seems like too much work to
handle this rare case, but if others disagree I will handle it.

This revision is now accepted and ready to land.Feb 20 2021, 12:49 PM
sys/vm/vm_page.c
1960

This is going to change the meaning of this counter from zero-fill on page faults (which is where the "d" for "demand" as in "demand paging" comes from) to zero-fill on any allocation.

1985

Why the new "blank line" here? These are all optional flags, and there is no similar blank line added in similar comments below.

2372

Just an observation: Per table 5 (https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf), we should really stop zeroing one page at a time. On machines with a direct map, pmap_zero_page_area() would already "do the right thing". We would just need to modify the implementation on the other architectures.

2496

"pages" -> "page", "their" -> "its".

I really have mixed feelings about this change. If anything, we should try to discourage page zeroing while the object lock is held. On the other hand, for VM_ALLOC_NOOBJ allocations, I think that this change makes perfect sense, which brings me to the following proposal: Remove VM_ALLOC_NOOBJ and VM_ALLOC_ZERO from vm_page_alloc{,_contig}(), and provide separate allocation functions to replace VM_ALLOC_NOOBJ, akin to vm_page_alloc_freelist(). Virtually all VM_ALLOC_NOOBJ call sites are being changed, so we as well change the function being called. This will slightly simplify vm_page_alloc(), e.g., it can assume that the object is always non-NULL and that it will only allocate from the default pool. Also, when I say, "Remove ... VM_ALLOC_ZERO from vm_page_alloc()", I would probably let PG_ZERO pass through vm_page_alloc() unchanged. Right now, we clear that flag unless VM_ALLOC_ZERO was specified. As for the functions that replace VM_ALLOC_NOOBJ, I would drop not only the object parameter, but also the pindex. The few callers that (ab)use the pindex can set it themselves.

In D28805#645337, @alc wrote:

I really have mixed feelings about this change. If anything, we should try to discourage page zeroing while the object lock is held. On the other hand, for VM_ALLOC_NOOBJ allocations, I think that this change makes perfect sense, which brings me to the following proposal: Remove VM_ALLOC_NOOBJ and VM_ALLOC_ZERO from vm_page_alloc{,_contig}(), and provide separate allocation functions to replace VM_ALLOC_NOOBJ, akin to vm_page_alloc_freelist().
Virtually all VM_ALLOC_NOOBJ call sites are being changed, so we as well change the function being called. This will slightly simplify vm_page_alloc(), e.g., it can assume that the object is always non-NULL and that it will only allocate from the default pool. Also, when I say, "Remove ... VM_ALLOC_ZERO from vm_page_alloc()", I would probably let PG_ZERO pass through vm_page_alloc() unchanged. Right now, we clear that flag unless VM_ALLOC_ZERO was specified. As for the functions that replace VM_ALLOC_NOOBJ, I would drop not only the object parameter, but also the pindex. The few callers that (ab)use the pindex can set it themselves.

So to be clear, your proposal is to add a vm_page_alloc_anon() (or _noobj()?) which accepts a VM_ALLOC_ZERO flag and implements it by zeroing the page, whereas vm_page_alloc(_contig)() should stop taking VM_ALLOC_ZERO and instead return PG_ZERO unchanged? I like the idea of splitting the allocator functions and preserving PG_ZERO. It feels a bit odd to have inconsistent handling with respect to VM_ALLOC_ZERO though. There are situations where allocations are rare enough that zeroing under the object lock is not a problem (or it is required), and splitting the allocator entry points would make it easier to spot calls where it is likely to be a problem. The pti_obj object used to manage userspace-visible "holes" in the kernel address space is an example of this.

sys/vm/vm_page.c
1960

That is true, I'm shoehorning them in here. My weak argument is that the vmstat -s descriptions are vague enough to be correct under this new usage. However with your proposal implemented I would move these increments back to their original location.

2372

I think we'd want to additionally widen the types of the size and off to size_t.

In D28805#645337, @alc wrote:

I really have mixed feelings about this change. If anything, we should try to discourage page zeroing while the object lock is held. On the other hand, for VM_ALLOC_NOOBJ allocations, I think that this change makes perfect sense, which brings me to the following proposal: Remove VM_ALLOC_NOOBJ and VM_ALLOC_ZERO from vm_page_alloc{,_contig}(), and provide separate allocation functions to replace VM_ALLOC_NOOBJ, akin to vm_page_alloc_freelist().
Virtually all VM_ALLOC_NOOBJ call sites are being changed, so we as well change the function being called. This will slightly simplify vm_page_alloc(), e.g., it can assume that the object is always non-NULL and that it will only allocate from the default pool. Also, when I say, "Remove ... VM_ALLOC_ZERO from vm_page_alloc()", I would probably let PG_ZERO pass through vm_page_alloc() unchanged. Right now, we clear that flag unless VM_ALLOC_ZERO was specified. As for the functions that replace VM_ALLOC_NOOBJ, I would drop not only the object parameter, but also the pindex. The few callers that (ab)use the pindex can set it themselves.

So to be clear, your proposal is to add a vm_page_alloc_anon() (or _noobj()?) which accepts a VM_ALLOC_ZERO flag and implements it by zeroing the page, whereas vm_page_alloc(_contig)() should stop taking VM_ALLOC_ZERO and instead return PG_ZERO unchanged?

Yes, and vm_page_alloc_noobj() makes sense to me. (I think that vm_page_alloc_anon() could be too easily confused with allocation of pages to OBJ_ANON vm objects. In other words, I would avoid using any derivatives of the word anonymous.)

I like the idea of splitting the allocator functions and preserving PG_ZERO. It feels a bit odd to have inconsistent handling with respect to VM_ALLOC_ZERO though. There are situations where allocations are rare enough that zeroing under the object lock is not a problem (or it is required), and splitting the allocator entry points would make it easier to spot calls where it is likely to be a problem. The pti_obj object used to manage userspace-visible "holes" in the kernel address space is an example of this.

I'm not going to argue strenuously for leaving VM_ALLOC_ZERO support out of vm_page_alloc{,_contig}(). I agree that splitting the allocator entry points will make it easier to spot calls that shouldn't use VM_ALLOC_ZERO.

I started working on this again and will post new reviews based on the feedback here. Roughly, my plan is:

  • Introduce vm_page_alloc_noobj() and vm_page_alloc_noobj_domain(). These functions implement VM_ALLOC_ZERO by unconditionally zeroing the page.
  • Implement vm_page_alloc_freelist_domain() and vm_page_alloc_noobj_domain() using a common function, since they are almost identical.
  • Update vm_page_alloc() callers to use vm_page_alloc_noobj() when appropriate.
  • Modify vm_page_alloc() and vm_page_alloc_contig() to unconditionally preserve PG_ZERO. Update callers to stop passing VM_ALLOC_ZERO.
  • Bump __FreeBSD_version, patch ports to use the new KPIs, wait a bit of time for dust to settle.
  • Modify vm_page_alloc() (and maybe vm_page_alloc_contig()) to require object != NULL, remove VM_ALLOC_NOOBJ. Only this change should not be MFCed, I believe.

One thing I'm not sure about is whether to introduce vm_page_alloc_noobj_contig(), but I am leaning towards adding it, if only for consistency. OTOH the size of the page allocator interface is getting a little unwieldy, we have vm_page_alloc(), vm_page_alloc_after(), vm_page_alloc_freelist(), vm_page_alloc_contig() and now vm_page_alloc_noobj(), vm_page_alloc_noobj_contig(), and then per-domain flavours of all of these.