
pmap_growkernel(): do not panic immediately, return error
Needs Review · Public

Authored by kib on Thu, Dec 5, 11:04 PM.

Details

Reviewers
alc
markj
Summary

(amd64 only for now)

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision. Thu, Dec 5, 11:04 PM

What's the motivation for this change?

I'm worried that with this change, instead of cleanly panicking, the kernel will end up in a catatonic state, requiring some intervention. In particular, we do not set the vm_reclaimfn callback for kernel KVA arenas - after a failure, I think we'll end up sleeping forever.

What's the motivation for this change?

A regular malloc(M_NOWAIT) call can end up in this function. Before this patch, pmap_growkernel() would panic when we ran out of pages; with the patch, the ENOMEM analog is returned all the way up the allocation stack. Potentially the machine can recover later, once the memory hogs are gone. Example trace:

panic() at panic+0x43/frame 0xfffffe040404bfb0                                                                                                      
pmap_growkernel() at pmap_growkernel+0x2a3/frame 0xfffffe040404c000                                                                                 
vm_map_insert1() at vm_map_insert1+0x254/frame 0xfffffe040404c090                                                                                   
vm_map_find_locked() at vm_map_find_locked+0x4fc/frame 0xfffffe040404c160                                                                           
vm_map_find() at vm_map_find+0xaf/frame 0xfffffe040404c1e0                                                                                          
kva_import() at kva_import+0x36/frame 0xfffffe040404c220                                                                                            
vmem_try_fetch() at vmem_try_fetch+0xce/frame 0xfffffe040404c270                                                                                    
vmem_xalloc() at vmem_xalloc+0x578/frame 0xfffffe040404c300                                                                                         
kva_import_domain() at kva_import_domain+0x25/frame 0xfffffe040404c330                                                                              
vmem_try_fetch() at vmem_try_fetch+0xce/frame 0xfffffe040404c380                                                                                    
vmem_xalloc() at vmem_xalloc+0x578/frame 0xfffffe040404c410                                                                                         
vmem_alloc() at vmem_alloc+0x37/frame 0xfffffe040404c460                                                                                            
kmem_malloc_domainset() at kmem_malloc_domainset+0x99/frame 0xfffffe040404c4d0                                                                      
keg_alloc_slab() at keg_alloc_slab+0xb9/frame 0xfffffe040404c520                                                                                    
zone_import() at zone_import+0xef/frame 0xfffffe040404c5b0                                                                                          
cache_alloc() at cache_alloc+0x316/frame 0xfffffe040404c620                                                                                         
cache_alloc_retry() at cache_alloc_retry+0x25/frame 0xfffffe040404c660                                                                              
in_pcballoc() at in_pcballoc+0x1f/frame 0xfffffe040404c690
syncache_socket() at syncache_socket+0x44/frame 0xfffffe040404c710
syncache_expand() at syncache_expand+0x81f/frame 0xfffffe040404c830
tcp_input_with_port() at tcp_input_with_port+0x8e0/frame 0xfffffe040404c980
tcp6_input_with_port() at tcp6_input_with_port+0x6a/frame 0xfffffe040404c9b0
tcp6_input() at tcp6_input+0xb/frame 0xfffffe040404c9c0
ip6_input() at ip6_input+0x82f/frame 0xfffffe040404ca90
netisr_dispatch_src() at netisr_dispatch_src+0x6d/frame 0xfffffe040404cae0
ether_demux() at ether_demux+0x129/frame 0xfffffe040404cb10
ether_nh_input() at ether_nh_input+0x2f4/frame 0xfffffe040404cb60
netisr_dispatch_src() at netisr_dispatch_src+0x6d/frame 0xfffffe040404cbb0
ether_input() at ether_input+0x36/frame 0xfffffe040404cbf0
tcp_lro_flush() at tcp_lro_flush+0x31f/frame 0xfffffe040404cc20
tcp_lro_flush_all() at tcp_lro_flush_all+0x1d3/frame 0xfffffe040404cc60
mlx5e_rx_cq_comp() at mlx5e_rx_cq_comp+0x10d2/frame 0xfffffe040404cd80
mlx5_cq_completion() at mlx5_cq_completion+0x78/frame 0xfffffe040404cde0
mlx5_eq_int() at mlx5_eq_int+0x2ad/frame 0xfffffe040404ce30
mlx5_msix_handler() at mlx5_msix_handler+0x15/frame 0xfffffe040404ce40
lkpi_irq_handler() at lkpi_irq_handler+0x29/frame 0xfffffe040404ce60
ithread_loop() at ithread_loop+0x249/frame 0xfffffe040404cef0
fork_exit() at fork_exit+0x7b/frame 0xfffffe040404cf30
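
For context, the shape of the change is roughly as follows; this is a simplified sketch of the idea, not the actual diff, and the exact KERN_* value used may differ:

/*
 * Simplified sketch only, not the actual diff: pmap_growkernel()
 * reports allocation failure to its caller instead of panicking, and
 * the kernel-map insertion path propagates the failure, which then
 * unwinds through vm_map_find(), kva_import() and the vmem/UMA layers
 * back to the original M_NOWAIT allocation as the ENOMEM analog.
 */
int
pmap_growkernel(vm_offset_t addr)
{
        vm_page_t nkpg;

        /* ... walk and extend the kernel page tables ... */
        nkpg = vm_page_alloc_noobj(VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED |
            VM_ALLOC_ZERO);
        if (nkpg == NULL)
                return (KERN_RESOURCE_SHORTAGE);        /* was: panic() */
        /* ... install the new page-table page ... */
        return (KERN_SUCCESS);
}

/* Caller side, e.g. in the kernel-map insertion path: */
        rv = pmap_growkernel(end);
        if (rv != KERN_SUCCESS)
                return (rv);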

We could add a knob to control the behavior (panic vs. no panic).

Add a knob to select the panicking behavior.
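
For illustration, such a knob could look roughly like this; the sysctl name, node, and default are my guesses, not necessarily what the updated diff uses:

/*
 * Hypothetical knob (name and default are guesses, not necessarily
 * what the diff adds): when set, pmap_growkernel() keeps the historical
 * panic-on-failure behavior; when clear, it returns an error instead.
 */
static int pmap_growkernel_panic = 0;
SYSCTL_INT(_vm_pmap, OID_AUTO, growkernel_panic, CTLFLAG_RWTUN,
    &pmap_growkernel_panic, 0,
    "panic on pmap_growkernel allocation failure");

/* In the allocation-failure path of pmap_growkernel(): */
        if (nkpg == NULL) {
                if (pmap_growkernel_panic)
                        panic("pmap_growkernel: no memory to grow kernel");
                return (KERN_RESOURCE_SHORTAGE);
        }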

What's the motivation for this change?

A regular malloc(M_NOWAIT) call can end up in this function. Before this patch, pmap_growkernel() would panic when we ran out of pages; with the patch, the ENOMEM analog is returned all the way up the allocation stack. Potentially the machine can recover later, once the memory hogs are gone.

pmap_growkernel() allocates with VM_ALLOC_INTERRUPT, which is able to allocate any available free page; other allocations must leave at least vmd->vmd_interrupt_free_min pages free. Today, that means we should keep at least 2 free pages per NUMA domain. Should that threshold be increased instead?
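
In other words, the per-domain free-page check behaves roughly like this (a simplified paraphrase of the policy just described, with a hypothetical helper name, not the literal vm_page.c code):

/*
 * Simplified paraphrase of the policy above; the helper name is
 * hypothetical and this is not the literal vm_page.c code.
 * VM_ALLOC_INTERRUPT may consume the last free pages, VM_ALLOC_SYSTEM
 * must leave vmd_interrupt_free_min pages free, and normal requests
 * must leave vmd_free_reserved pages free.
 */
static bool
vm_domain_has_pages(struct vm_domain *vmd, int req, u_int npages)
{
        u_int limit;

        if ((req & VM_ALLOC_INTERRUPT) != 0)
                limit = 0;
        else if ((req & VM_ALLOC_SYSTEM) != 0)
                limit = vmd->vmd_interrupt_free_min;
        else
                limit = vmd->vmd_free_reserved;
        return (vmd->vmd_free_count >= npages + limit);
}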

If a thread calling malloc(M_WAITOK) hits this error, won't we end up sleeping forever in VMEM_CONDVAR_WAIT for the kernel KVA arena? kmem_free() doesn't free KVA back to kernel_arena, only to the per-domain arenas.

Also, in your stack, why are we using kmem_malloc() to allocate a slab for inpcbs? That should be going through uma_small_alloc(), which doesn't need KVA. Was the kernel compiled with KASAN enabled?

Of course there are more deadlocks/livelocks behind this change, but we won't see them until this fix is done.

Also, it is under a knob. If you insist, I can flip the default. But IMO we do want to find and fix the next level of problems.

In D47935#1093599, @kib wrote:

Of course there are more deadlocks/livelocks behind this change, but we won't see them until this fix is done.

Also, it is under a knob. If you insist, I can flip the default. But IMO we do want to find and fix the next level of problems.

I'm suspicious that the reported panic is related to KASAN, which has special handling in pmap_growkernel(). If so, I'd prefer to fix this by raising the interrupt_min threshold when KASAN/KMSAN is enabled.

@glebius can you confirm whether this is the case? If not, could you please provide sysctl vm.uma output from the system in question? I would like to understand why this fragment appears in the backtrace:

kmem_malloc_domainset() at kmem_malloc_domainset+0x99/frame 0xfffffe040404c4d0                                                                      
keg_alloc_slab() at keg_alloc_slab+0xb9/frame 0xfffffe040404c520                                                                                    
zone_import() at zone_import+0xef/frame 0xfffffe040404c5b0                                                                                          
cache_alloc() at cache_alloc+0x316/frame 0xfffffe040404c620                                                                                         
cache_alloc_retry() at cache_alloc_retry+0x25/frame 0xfffffe040404c660                                                                              
in_pcballoc() at in_pcballoc+0x1f/frame 0xfffffe040404c690

Also, in your stack, why are we using kmem_malloc() to allocate a slab for inpcbs? That should be going through uma_small_alloc(), which doesn't need KVA. Was the kernel compiled with KASAN enabled?

No, KASAN is not in the kernel. The inpcb keg has uk_ppera = 4, hence its uk_allocf is page_alloc(), which is basically kmem_malloc_domainset(). Our inpcb is 0x880 bytes, so I guess keg_layout() calculated that 4-page slabs are most efficient.

vm.uma.tcp_inpcb.stats.xdomain: 0
vm.uma.tcp_inpcb.stats.fails: 0
vm.uma.tcp_inpcb.stats.frees: 602031303
vm.uma.tcp_inpcb.stats.allocs: 602042494
vm.uma.tcp_inpcb.stats.current: 11191
vm.uma.tcp_inpcb.domain.0.timin: 3
vm.uma.tcp_inpcb.domain.0.limin: 12
vm.uma.tcp_inpcb.domain.0.wss: 762
vm.uma.tcp_inpcb.domain.0.bimin: 1524
vm.uma.tcp_inpcb.domain.0.imin: 1524
vm.uma.tcp_inpcb.domain.0.imax: 3556
vm.uma.tcp_inpcb.domain.0.nitems: 2286
vm.uma.tcp_inpcb.limit.bucket_max: 18446744073709551615
vm.uma.tcp_inpcb.limit.sleeps: 0
vm.uma.tcp_inpcb.limit.sleepers: 0
vm.uma.tcp_inpcb.limit.max_items: 0
vm.uma.tcp_inpcb.limit.items: 0
vm.uma.tcp_inpcb.keg.domain.0.free_slabs: 0
vm.uma.tcp_inpcb.keg.domain.0.free_items: 23424
vm.uma.tcp_inpcb.keg.domain.0.pages: 30444
vm.uma.tcp_inpcb.keg.efficiency: 92
vm.uma.tcp_inpcb.keg.reserve: 0
vm.uma.tcp_inpcb.keg.align: 63
vm.uma.tcp_inpcb.keg.ipers: 7
vm.uma.tcp_inpcb.keg.ppera: 4
vm.uma.tcp_inpcb.keg.rsize: 2176
vm.uma.tcp_inpcb.keg.name: tcp_inpcb
vm.uma.tcp_inpcb.bucket_size_max: 254
vm.uma.tcp_inpcb.bucket_size: 131
vm.uma.tcp_inpcb.flags: 0x850000<VTOSLAB,SMR,FIRSTTOUCH>
vm.uma.tcp_inpcb.size: 2176
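
For what it's worth, those keg numbers are self-consistent; a quick back-of-the-envelope check (my own arithmetic, not keg_layout() output):

/*
 * Back-of-the-envelope check of the keg numbers above (my arithmetic,
 * not keg_layout() itself): rsize = 2176 with 4-page slabs gives
 * ipers = 16384 / 2176 = 7 items per slab, and an efficiency of
 * 100 * 7 * 2176 / 16384 = 92%, matching keg.ipers, keg.ppera and
 * keg.efficiency in the sysctl output.
 */
#include <stdio.h>

int
main(void)
{
        const unsigned rsize = 2176;    /* vm.uma.tcp_inpcb.keg.rsize */
        const unsigned ppera = 4;       /* pages per slab */
        const unsigned slab_bytes = ppera * 4096;
        const unsigned ipers = slab_bytes / rsize;                      /* 7 */
        const unsigned efficiency = 100 * ipers * rsize / slab_bytes;   /* 92 */

        printf("ipers=%u efficiency=%u%%\n", ipers, efficiency);
        return (0);
}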

Also, in your stack, why are we using kmem_malloc() to allocate a slab for inpcbs? That should be going through uma_small_alloc(), which doesn't need KVA. Was the kernel compiled with KASAN enabled?

No, KASAN is not in the kernel. The inpcb keg has uk_ppera = 4, hence its uk_allocf is page_alloc(), which is basically kmem_malloc_domainset(). Our inpcb is 0x880 bytes, so I guess keg_layout() calculated that 4-page slabs are most efficient.

I see, thanks. One other question: does the system in question have more than one NUMA domain? If so, we will use a large import quantum, KVA_NUMA_IMPORT_QUANTUM, and we need to be able to allocate more than two PTPs in order to grow the map by that much. That is, when vm_ndomains > 1, we should set vmd_interrupt_min to a larger value.
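
For context, on amd64 each leaf kernel page-table page maps NBPDR (2 MB) of KVA, so the number of PTPs needed per import scales with the quantum; a rough estimate only, without restating the exact KVA_NUMA_IMPORT_QUANTUM value:

/*
 * Rough estimate only: each leaf kernel page-table page on amd64 maps
 * NBPDR (2 MB) of KVA, so growing the kernel map by one import quantum
 * needs about quantum / NBPDR new PTPs, plus the occasional PDP page.
 * The exact KVA_NUMA_IMPORT_QUANTUM value is not restated here.
 */
static inline u_int
ptps_per_import(vm_size_t quantum)
{
        return (howmany(quantum, NBPDR));
}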

I see, thanks. One other question: does the system in question have more than one NUMA domain? If so, we will use a large import quantum, KVA_NUMA_IMPORT_QUANTUM, and we need to be able to allocate more than two PTPs in order to grow the map by that much. That is, when vm_ndomains > 1, we should set vmd_interrupt_min to a larger value.

The panicked system has vm_ndomains = 1.