
vm_phys_early_alloc(): Panic on no memory in specific domain, add diagnostics
Accepted · Public

Authored by olce on Jan 23 2025, 5:16 PM.

Details

Reviewers
markj
dougm
alc
Summary

Panic after each search for the biggest range if no range was found at all.

This includes panicking if a specific domain was requested that does not
have enough memory. This is a change in behavior: before, we would just
silently fall back to the first chunk in phys_avail[] (which, e.g., on
i386/amd64 machines, is usually very small). In other words, the passed
domain is now considered mandatory. This is not a problem, as current
callers (the pmap_page_array_startup() functions) only specify an explicit
domain (i.e., not -1) when they already know that some pages exist in
that domain.

If there is not enough space for the allocation, print relevant information
when panicking.

Check (under INVARIANTS) that 'alloc_size' is not 0, as this case is not
currently supported and cannot be triggered by current callers.

Add a few comments.
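
For illustration only, a minimal sketch of the checks described above (this is not the actual diff): it assumes a simplified phys_avail[]-based search for the biggest free range, and range_domain() is a hypothetical helper standing in for however the real code maps a phys_avail[] range to its NUMA domain.

```c
/*
 * Sketch only: assumes the usual kernel headers (sys/param.h,
 * sys/systm.h, vm/vm.h, vm/vm_param.h, vm/vm_phys.h) and a
 * hypothetical range_domain() helper.
 */
vm_paddr_t
vm_phys_early_alloc(int domain, size_t alloc_size)
{
	vm_paddr_t biggestsize, pa, size;
	int biggestone, i;

	/* A zero-sized allocation is not currently supported. */
	KASSERT(alloc_size != 0, ("%s: alloc_size is 0", __func__));

	/* Find the biggest free range in the requested domain (-1: any). */
	biggestsize = 0;
	biggestone = -1;
	for (i = 0; phys_avail[i + 1] != 0; i += 2) {
		if (domain != -1 && range_domain(i) != domain)
			continue;
		size = phys_avail[i + 1] - phys_avail[i];
		if (size > biggestsize) {
			biggestsize = size;
			biggestone = i;
		}
	}

	/* New behavior: the requested domain is mandatory, no silent fallback. */
	if (biggestone == -1)
		panic("%s: no free ranges in domain %d", __func__, domain);
	if (biggestsize < alloc_size)
		panic("%s: domain %d: biggest range (%ju bytes) cannot hold "
		    "the requested %zu bytes", __func__, domain,
		    (uintmax_t)biggestsize, alloc_size);

	/* Carve the allocation out of the end of the chosen range. */
	pa = phys_avail[biggestone + 1] - alloc_size;
	phys_avail[biggestone + 1] = pa;
	return (pa);
}
```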

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

olce requested review of this revision. Jan 23 2025, 5:16 PM

> Panic after each search for the biggest range if no range was found at all.

So if a system has two CPU sockets and only one has memory installed, we will panic? That is not right; we have to support such configurations.

> So if a system has two CPU sockets and only one has memory installed, we will panic? That is not right; we have to support such configurations.

With this change, we will panic if domain is not -1 (-1 meaning any domain) and there is no memory declared at all in this specific domain. Previously, we would just silently fall back to the first affinity range, regardless of its size.

Indeed, if some domain exists that has no memory (I presume this is possible on amd64 as domains are reported by ACPI?) and someone tries to allocate from it specifically, we will just panic.

But I don't see any reason for code to do that. Looking at the callers of vm_phys_early_alloc*(), none of them pass an explicit domain except for the pmap_page_array_startup() functions (on amd64, and on powerpc in mmu_radix.c, where the corresponding code is under #if notyet (but see below), and in mmu_oea64.c). These only specify a domain in which we know there are pages, since the goal is to allocate backing pages in the same domain as the referenced pages.

There is only one exception: a call at the beginning of mmu_radix_page_array_startup(). I think it is OK to change the domain there (0 => -1), as this code doesn't seem to care about where the pages are allocated. I don't know whether it's possible for PowerPC machines to have no pages in domain 0 but only in other domains. CCing @jhibbits to confirm the intent of this call.

The code modified here is completely rewritten in D48635. I hesitated to squash this revision into D48635, but I'm glad I didn't, since keeping it separate makes this subtle semantic change clearer. If my proposal in the previous paragraph is not enough and some code must be modified here, I'll also have to modify D48635. It would be possible there to add an allocation flag expressing a mere preference for some domain (i.e., if the allocation cannot be fulfilled from the passed domain, then allocating from any domain is acceptable instead of failing), but for now it's not apparent to me that we really need something like that.

I don't think it's possible to have domain 0 unpopulated on powerpc. I do have a patch that implements the NUMAing of mmu_radix_page_array_startup(), but it didn't improve performance by any measurable degree when I tested it, so I never followed through with pushing it.

> I don't think it's possible to have domain 0 unpopulated on powerpc. I do have a patch that implements the NUMAing of mmu_radix_page_array_startup(), but it didn't improve performance by any measurable degree when I tested it, so I never followed through with pushing it.

I see. I've taken care of updating the code under #ifdef notyet as a future reference, just in case.

Can I change the argument to -1 here? This will in practice change nothing as long as domain 0 is populated and not too small.

> I don't think it's possible to have domain 0 unpopulated on powerpc. I do have a patch that implements the NUMAing of mmu_radix_page_array_startup(), but it didn't improve performance by any measurable degree when I tested it, so I never followed through with pushing it.

> I see. I've taken care of updating the code under #ifdef notyet as a future reference, just in case.

> Can I change the argument to -1 here? This will in practice change nothing as long as domain 0 is populated and not too small.

I think that should be fine.
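
For clarity, here is a paraphrased sketch of the call-site change being agreed on above, at the start of mmu_radix_page_array_startup(); 'start' and 'size' are placeholders rather than the literal variables from the source:

```c
/* Before: domain 0 is mandatory, so this would panic if domain 0 were empty. */
start = vm_phys_early_alloc(0, size);

/* After: -1 lets the allocation be satisfied from any domain. */
start = vm_phys_early_alloc(-1, size);
```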

olce retitled this revision from "vm_phys_early_alloc(): Add diagnostics and comments" to "vm_phys_early_alloc(): Panic on specific domain without memory, add diagnostics". Fri, Feb 7, 5:34 PM
olce edited the summary of this revision.

Please see the new parent D48888 about removing the only case that could cause a problem (and probably would not anyway, according to @jhibbits) with respect to the new panic behavior.

olce retitled this revision from "vm_phys_early_alloc(): Panic on specific domain without memory, add diagnostics" to "vm_phys_early_alloc(): Panic on no memory in specific domain, add diagnostics". Fri, Feb 7, 6:21 PM
olce edited the summary of this revision.

> Indeed, if some domain exists that has no memory (I presume this is possible on amd64 as domains are reported by ACPI?)

Yes, it's possible. It comes up every so often, e.g., someone boots up a system that has a socket with no DIMMs populated for whatever reason, or a CPU/BIOS option allows one to partition a socket into multiple "fake" domains with segregated memory controller access, such that some domains have local RAM and others do not. The kernel should handle this gracefully.

I think you're probably right that it's OK to require that the vm_page array be backed by domain-local memory (at least so long as we can use 2MB mappings), but in the future we might need to support falling back to whatever's available.

Did you test the patch on a NUMA system? It should be possible to test this particular case by setting the hw.physmem tunable to restrict the amount of RAM visible to the kernel, such that all of it comes from a single domain. I can help test if not.
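
For reference, the test setup suggested above amounts to a loader tunable along these lines (the 8G limit is an arbitrary example; pick a value smaller than the RAM attached to the first domain):

```
# /boot/loader.conf
hw.physmem="8G"
```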

This revision is now accepted and ready to land. Tue, Feb 18, 8:21 PM

> Did you test the patch on a NUMA system? It should be possible to test this particular case by setting the hw.physmem tunable to restrict the amount of RAM visible to the kernel, such that all of it comes from a single domain. I can help test if not.

Unfortunately no, as I don't have such hardware. If you have access to some, could you please try it? (I'm interested in knowing what kind of hardware that is; I suspect these are data center servers? I don't know whether it would be possible to rent such machines from cloud providers.)

> Unfortunately no, as I don't have such hardware. If you have access to some, could you please try it? (I'm interested in knowing what kind of hardware that is; I suspect these are data center servers? I don't know whether it would be possible to rent such machines from cloud providers.)

I've committed the dependent reviews. Don't hesitate to tell me if you prefer a ready-to-apply patch for this one (grabbing it from Phabricator should work out of the box, but I didn't try), or if you'd rather I commit the change here in advance, whichever is most convenient for your tests.

> Did you test the patch on a NUMA system? It should be possible to test this particular case by setting the hw.physmem tunable to restrict the amount of RAM visible to the kernel, such that all of it comes from a single domain. I can help test if not.

> Unfortunately no, as I don't have such hardware. If you have access to some, could you please try it? (I'm interested in knowing what kind of hardware that is; I suspect these are data center servers? I don't know whether it would be possible to rent such machines from cloud providers.)

The machine I use for smoke-testing NUMA patches is sitting under my desk. :)

Could you please send me a link to a git branch with all of your changes?

> The machine I use for smoke-testing NUMA patches is sitting under my desk. :)

Convenient. :-) Is this hardware easy to purchase?

> Could you please send me a link to a git branch with all of your changes?

Sure: https://github.com/OlCe2/freebsd-src/tree/oc-vm_startup_save_memory. The branch contains the whole series, but if you prefer to test specifically this change (and its dependencies), then you'll want to check out 5cc32f5b7cc9dcf6172240c86de684509cbec32b.

Thanks!

> The machine I use for smoke-testing NUMA patches is sitting under my desk. :)

> Convenient. :-) Is this hardware easy to purchase?

It's a system I built myself years ago... just a Supermicro board with two sockets and a couple of old Xeons.

> Could you please send me a link to a git branch with all of your changes?

> Sure: https://github.com/OlCe2/freebsd-src/tree/oc-vm_startup_save_memory. The branch contains the whole series, but if you prefer to test specifically this change (and its dependencies), then you'll want to check out 5cc32f5b7cc9dcf6172240c86de684509cbec32b.

Thanks, I will test later this week.