(amd64 only for now)
What's the motivation for this change?
I'm worried that with this change, instead of cleanly panicking, the kernel will end up in a catatonic state, requiring some intervention. In particular, we do not set the vm_reclaimfn callback for kernel KVA arenas - after a failure, I think we'll end up sleeping forever.
A regular malloc(M_NOWAIT) call can end up in this function. Before this patch, pmap_growkernel() would panic when we ran out of pages; with the patch, the ENOMEM analog is returned all the way up the allocation stack. The machine can potentially revive later, once the memory hogs are gone. Example trace:
panic() at panic+0x43/frame 0xfffffe040404bfb0
pmap_growkernel() at pmap_growkernel+0x2a3/frame 0xfffffe040404c000
vm_map_insert1() at vm_map_insert1+0x254/frame 0xfffffe040404c090
vm_map_find_locked() at vm_map_find_locked+0x4fc/frame 0xfffffe040404c160
vm_map_find() at vm_map_find+0xaf/frame 0xfffffe040404c1e0
kva_import() at kva_import+0x36/frame 0xfffffe040404c220
vmem_try_fetch() at vmem_try_fetch+0xce/frame 0xfffffe040404c270
vmem_xalloc() at vmem_xalloc+0x578/frame 0xfffffe040404c300
kva_import_domain() at kva_import_domain+0x25/frame 0xfffffe040404c330
vmem_try_fetch() at vmem_try_fetch+0xce/frame 0xfffffe040404c380
vmem_xalloc() at vmem_xalloc+0x578/frame 0xfffffe040404c410
vmem_alloc() at vmem_alloc+0x37/frame 0xfffffe040404c460
kmem_malloc_domainset() at kmem_malloc_domainset+0x99/frame 0xfffffe040404c4d0
keg_alloc_slab() at keg_alloc_slab+0xb9/frame 0xfffffe040404c520
zone_import() at zone_import+0xef/frame 0xfffffe040404c5b0
cache_alloc() at cache_alloc+0x316/frame 0xfffffe040404c620
cache_alloc_retry() at cache_alloc_retry+0x25/frame 0xfffffe040404c660
in_pcballoc() at in_pcballoc+0x1f/frame 0xfffffe040404c690
syncache_socket() at syncache_socket+0x44/frame 0xfffffe040404c710
syncache_expand() at syncache_expand+0x81f/frame 0xfffffe040404c830
tcp_input_with_port() at tcp_input_with_port+0x8e0/frame 0xfffffe040404c980
tcp6_input_with_port() at tcp6_input_with_port+0x6a/frame 0xfffffe040404c9b0
tcp6_input() at tcp6_input+0xb/frame 0xfffffe040404c9c0
ip6_input() at ip6_input+0x82f/frame 0xfffffe040404ca90
netisr_dispatch_src() at netisr_dispatch_src+0x6d/frame 0xfffffe040404cae0
ether_demux() at ether_demux+0x129/frame 0xfffffe040404cb10
ether_nh_input() at ether_nh_input+0x2f4/frame 0xfffffe040404cb60
netisr_dispatch_src() at netisr_dispatch_src+0x6d/frame 0xfffffe040404cbb0
ether_input() at ether_input+0x36/frame 0xfffffe040404cbf0
tcp_lro_flush() at tcp_lro_flush+0x31f/frame 0xfffffe040404cc20
tcp_lro_flush_all() at tcp_lro_flush_all+0x1d3/frame 0xfffffe040404cc60
mlx5e_rx_cq_comp() at mlx5e_rx_cq_comp+0x10d2/frame 0xfffffe040404cd80
mlx5_cq_completion() at mlx5_cq_completion+0x78/frame 0xfffffe040404cde0
mlx5_eq_int() at mlx5_eq_int+0x2ad/frame 0xfffffe040404ce30
mlx5_msix_handler() at mlx5_msix_handler+0x15/frame 0xfffffe040404ce40
lkpi_irq_handler() at lkpi_irq_handler+0x29/frame 0xfffffe040404ce60
ithread_loop() at ithread_loop+0x249/frame 0xfffffe040404cef0
fork_exit() at fork_exit+0x7b/frame 0xfffffe040404cf30
pmap_growkernel() allocates with VM_ALLOC_INTERRUPT, which may consume any available free page; for other request classes, vmd->vmd_interrupt_free_min pages must remain free. Today, that means we keep at least 2 free pages per NUMA domain. Should that threshold be increased instead?
If a thread calling malloc(M_WAITOK) hits this error, won't we end up sleeping forever in VMEM_CONDVAR_WAIT for the kernel KVA arena? kmem_free() doesn't free KVA back to kernel_arena, only to the per-domain arenas.
Also, in your stack, why are we using kmem_malloc() to allocate a slab for inpcbs? That should be going through uma_small_alloc(), which doesn't need KVA. Was the kernel compiled with KASAN enabled?
Of course there are more deadlocks/livelocks lurking behind this change, but we won't find them until this fix goes in.
Also, it is behind a knob. If you insist, I can flip the default. But IMO we do want to find and fix the next level of problems.
I'm suspicious that the reported panic is related to KASAN, which has special handling in pmap_growkernel(). If so, I'd prefer to fix this by raising the interrupt_min threshold when KASAN/KMSAN is enabled.
@glebius can you confirm whether this is the case? If not, could you please provide sysctl vm.uma output from the system in question? I would like to understand why this fragment appears in the backtrace:
kmem_malloc_domainset() at kmem_malloc_domainset+0x99/frame 0xfffffe040404c4d0
keg_alloc_slab() at keg_alloc_slab+0xb9/frame 0xfffffe040404c520
zone_import() at zone_import+0xef/frame 0xfffffe040404c5b0
cache_alloc() at cache_alloc+0x316/frame 0xfffffe040404c620
cache_alloc_retry() at cache_alloc_retry+0x25/frame 0xfffffe040404c660
in_pcballoc() at in_pcballoc+0x1f/frame 0xfffffe040404c690
No, KASAN is not in the kernel. The inpcb keg has uk_ppera = 4, hence its uk_allocf is page_alloc(), which is basically kmem_malloc_domainset(). Our inpcb is 0x880 bytes, so I guess keg_layout() calculated 4-page slabs as most efficient.
vm.uma.tcp_inpcb.stats.xdomain: 0
vm.uma.tcp_inpcb.stats.fails: 0
vm.uma.tcp_inpcb.stats.frees: 602031303
vm.uma.tcp_inpcb.stats.allocs: 602042494
vm.uma.tcp_inpcb.stats.current: 11191
vm.uma.tcp_inpcb.domain.0.timin: 3
vm.uma.tcp_inpcb.domain.0.limin: 12
vm.uma.tcp_inpcb.domain.0.wss: 762
vm.uma.tcp_inpcb.domain.0.bimin: 1524
vm.uma.tcp_inpcb.domain.0.imin: 1524
vm.uma.tcp_inpcb.domain.0.imax: 3556
vm.uma.tcp_inpcb.domain.0.nitems: 2286
vm.uma.tcp_inpcb.limit.bucket_max: 18446744073709551615
vm.uma.tcp_inpcb.limit.sleeps: 0
vm.uma.tcp_inpcb.limit.sleepers: 0
vm.uma.tcp_inpcb.limit.max_items: 0
vm.uma.tcp_inpcb.limit.items: 0
vm.uma.tcp_inpcb.keg.domain.0.free_slabs: 0
vm.uma.tcp_inpcb.keg.domain.0.free_items: 23424
vm.uma.tcp_inpcb.keg.domain.0.pages: 30444
vm.uma.tcp_inpcb.keg.efficiency: 92
vm.uma.tcp_inpcb.keg.reserve: 0
vm.uma.tcp_inpcb.keg.align: 63
vm.uma.tcp_inpcb.keg.ipers: 7
vm.uma.tcp_inpcb.keg.ppera: 4
vm.uma.tcp_inpcb.keg.rsize: 2176
vm.uma.tcp_inpcb.keg.name: tcp_inpcb
vm.uma.tcp_inpcb.bucket_size_max: 254
vm.uma.tcp_inpcb.bucket_size: 131
vm.uma.tcp_inpcb.flags: 0x850000<VTOSLAB,SMR,FIRSTTOUCH>
vm.uma.tcp_inpcb.size: 2176
I see, thanks. One other question: does the system in question have more than one NUMA domain? If so, we will use a large import quantum, KVA_NUMA_IMPORT_QUANTUM, and we need to be able to allocate more than two PTPs in order to grow the map by that much. That is, when vm_ndomains > 1, we should set vmd_interrupt_min to a larger value.