vm: Handle VM_ALLOC_ZERO in the page allocator
AbandonedPublic
Actions

Authored by markj on Feb 19 2021, 8:18 PM.

Details

Reviewers

alc
kib
jeff

Group Reviewers

manpages

Summary

Currently VM_ALLOC_ZERO is advisory and callers must remember to zero
the pages themselves if necessary, which is most of the time. This is
error-prone, especially considering that vm_page_alloc() and
vm_page_grab() interpret VM_ALLOC_ZERO differently, and it complicates
callers. The main benefit appears to be an optimization for the fault
handler such that it can release the object lock before zeroing the
page. It also provides a minor improvement for platforms without a
direct map, since if they map the returned page anyway they can zero the
returned memory without having to use pmap_zero_page().

I propose zeroing pages in the allocator and handling
performance-sensitive cases more surgically. A follow-up commit will
modify consumers.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 37209
Build 34098: arc lint + arc unit

Event Timeline

markj created this revision.Feb 19 2021, 8:18 PM

Herald added a reviewer: manpages. · View Herald TranscriptFeb 19 2021, 8:18 PM

Herald added a subscriber: imp. · View Herald Transcript

markj requested review of this revision.Feb 19 2021, 8:18 PM

markj added a child revision: D28806: x86/iommu: Update following VM_ALLOC_ZERO semantic change.

Harbormaster completed remote builds in B37202: Diff 84293.Feb 19 2021, 8:19 PM

markj added reviewers: alc, kib, jeff.Feb 19 2021, 8:26 PM

kib added inline comments.Feb 19 2021, 9:13 PM

sys/vm/vm_page.c
2393–2394	Wouldn't it be useful to set PG_ZERO if VM_ALLOC_ZERO was set? At least the work we did would not be lost.

Set PG_ZERO if radix insertion fails and we zeroed the page.

If the page was already zeroed and VM_ALLOC_ZERO was not specified,
then we will not restore PG_ZERO. It seems like too much work to
handle this rare case, but if others disagree I will handle it.

Harbormaster completed remote builds in B37209: Diff 84305.Feb 19 2021, 9:56 PM

kib accepted this revision.Feb 20 2021, 12:49 PM

This revision is now accepted and ready to land.Feb 20 2021, 12:49 PM

alc added inline comments.Feb 20 2021, 7:11 PM

sys/vm/vm_page.c
1960	This is going to change the meaning of this counter from zero-fill on page faults (which is where the "d" for "demand" as in "demand paging" comes from) to zero-fill on any allocation.
1985	Why the new "blank line" here? These are all optional flags, and there is no similar blank line added in similar comments below.
2372	Just an observation: Per table 5 (https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf), we should really stop zeroing one page at a time. On machines with a direct map, pmap_zero_page_area() would already "do the right thing". We would just need to modify the implementation on the other architectures.
2496	"pages" -> "page", "their" -> "its".

I really have mixed feelings about this change. If anything, we should try to discourage page zeroing while the object lock is held. On the other hand, for VM_ALLOC_NOOBJ allocations, I think that this change makes perfect sense, which brings me to the following proposal: Remove VM_ALLOC_NOOBJ and VM_ALLOC_ZERO from vm_page_alloc{,_contig}(), and provide separate allocation functions to replace VM_ALLOC_NOOBJ, akin to vm_page_alloc_freelist(). Virtually all VM_ALLOC_NOOBJ call sites are being changed, so we as well change the function being called. This will slightly simplify vm_page_alloc(), e.g., it can assume that the object is always non-NULL and that it will only allocate from the default pool. Also, when I say, "Remove ... VM_ALLOC_ZERO from vm_page_alloc()", I would probably let PG_ZERO pass through vm_page_alloc() unchanged. Right now, we clear that flag unless VM_ALLOC_ZERO was specified. As for the functions that replace VM_ALLOC_NOOBJ, I would drop not only the object parameter, but also the pindex. The few callers that (ab)use the pindex can set it themselves.

In D28805#645337, @alc wrote:

I really have mixed feelings about this change. If anything, we should try to discourage page zeroing while the object lock is held. On the other hand, for VM_ALLOC_NOOBJ allocations, I think that this change makes perfect sense, which brings me to the following proposal: Remove VM_ALLOC_NOOBJ and VM_ALLOC_ZERO from vm_page_alloc{,_contig}(), and provide separate allocation functions to replace VM_ALLOC_NOOBJ, akin to vm_page_alloc_freelist().
Virtually all VM_ALLOC_NOOBJ call sites are being changed, so we as well change the function being called. This will slightly simplify vm_page_alloc(), e.g., it can assume that the object is always non-NULL and that it will only allocate from the default pool. Also, when I say, "Remove ... VM_ALLOC_ZERO from vm_page_alloc()", I would probably let PG_ZERO pass through vm_page_alloc() unchanged. Right now, we clear that flag unless VM_ALLOC_ZERO was specified. As for the functions that replace VM_ALLOC_NOOBJ, I would drop not only the object parameter, but also the pindex. The few callers that (ab)use the pindex can set it themselves.

So to be clear, your proposal is to add a vm_page_alloc_anon() (or _noobj()?) which accepts a VM_ALLOC_ZERO flag and implements it by zeroing the page, whereas vm_page_alloc(_contig)() should stop taking VM_ALLOC_ZERO and instead return PG_ZERO unchanged? I like the idea of splitting the allocator functions and preserving PG_ZERO. It feels a bit odd to have inconsistent handling with respect to VM_ALLOC_ZERO though. There are situations where allocations are rare enough that zeroing under the object lock is not a problem (or it is required), and splitting the allocator entry points would make it easier to spot calls where it is likely to be a problem. The pti_obj object used to manage userspace-visible "holes" in the kernel address space is an example of this.

markj added inline comments.Feb 23 2021, 1:57 AM

sys/vm/vm_page.c
1960	That is true, I'm shoehorning them in here. My weak argument is that the vmstat -s descriptions are vague enough to be correct under this new usage. However with your proposal implemented I would move these increments back to their original location.
2372	I think we'd want to additionally widen the types of the `size` and `off` to size_t.

In D28805#646315, @markj wrote:

In D28805#645337, @alc wrote:

I really have mixed feelings about this change. If anything, we should try to discourage page zeroing while the object lock is held. On the other hand, for VM_ALLOC_NOOBJ allocations, I think that this change makes perfect sense, which brings me to the following proposal: Remove VM_ALLOC_NOOBJ and VM_ALLOC_ZERO from vm_page_alloc{,_contig}(), and provide separate allocation functions to replace VM_ALLOC_NOOBJ, akin to vm_page_alloc_freelist().
Virtually all VM_ALLOC_NOOBJ call sites are being changed, so we as well change the function being called. This will slightly simplify vm_page_alloc(), e.g., it can assume that the object is always non-NULL and that it will only allocate from the default pool. Also, when I say, "Remove ... VM_ALLOC_ZERO from vm_page_alloc()", I would probably let PG_ZERO pass through vm_page_alloc() unchanged. Right now, we clear that flag unless VM_ALLOC_ZERO was specified. As for the functions that replace VM_ALLOC_NOOBJ, I would drop not only the object parameter, but also the pindex. The few callers that (ab)use the pindex can set it themselves.

So to be clear, your proposal is to add a vm_page_alloc_anon() (or _noobj()?) which accepts a VM_ALLOC_ZERO flag and implements it by zeroing the page, whereas vm_page_alloc(_contig)() should stop taking VM_ALLOC_ZERO and instead return PG_ZERO unchanged?

Yes, and vm_page_alloc_noobj() makes sense to me. (I think that vm_page_alloc_anon() could be too easily confused with allocation of pages to OBJ_ANON vm objects. In other words, I would avoid using any derivatives of the word anonymous.)

I like the idea of splitting the allocator functions and preserving PG_ZERO. It feels a bit odd to have inconsistent handling with respect to VM_ALLOC_ZERO though. There are situations where allocations are rare enough that zeroing under the object lock is not a problem (or it is required), and splitting the allocator entry points would make it easier to spot calls where it is likely to be a problem. The pti_obj object used to manage userspace-visible "holes" in the kernel address space is an example of this.

I'm not going to argue strenuously for leaving VM_ALLOC_ZERO support out of vm_page_alloc{,_contig}(). I agree that splitting the allocator entry points will make it easier to spot calls that shouldn't use VM_ALLOC_ZERO.

markj mentioned this in D28717: vm_page_grab*: Consolidate common logic into vm_page_grab_release().Mar 27 2021, 2:26 PM

I started working on this again and will post new reviews based on the feedback here. Roughly, my plan is:

Introduce vm_page_alloc_noobj() and vm_page_alloc_noobj_domain(). These functions implement VM_ALLOC_ZERO by unconditionally zeroing the page.
Implement vm_page_alloc_freelist_domain() and vm_page_alloc_noobj_domain() using a common function, since they are almost identical.
Update vm_page_alloc() callers to use vm_page_alloc_noobj() when appropriate.
Modify vm_page_alloc() and vm_page_alloc_contig() to unconditionally preserve PG_ZERO. Update callers to stop passing VM_ALLOC_ZERO.
Bump __FreeBSD_version, patch ports to use the new KPIs, wait a bit of time for dust to settle.
Modify vm_page_alloc() (and maybe vm_page_alloc_contig()) to require object != NULL, remove VM_ALLOC_NOOBJ. Only this change should not be MFCed, I believe.

One thing I'm not sure about is whether to introduce vm_page_alloc_noobj_contig(), but I am leaning towards adding it, if only for consistency. OTOH the size of the page allocator interface is getting a little unwieldy, we have vm_page_alloc(), vm_page_alloc_after(), vm_page_alloc_freelist(), vm_page_alloc_contig() and now vm_page_alloc_noobj(), vm_page_alloc_noobj_contig(), and then per-domain flavours of all of these.

markj mentioned this in D32036: vm_fault: Stop specifying VM_ALLOC_ZERO.Sep 24 2021, 4:50 PM