I'm posting this patch for discussion. I'm not looking to commit it anytime soon, because I think that further changes are likely required.
Diff Detail
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
Here are the zones impacted by this patch:
--> VMSPACE: uk_ppera=2, uk_ipers=3, uk_size=2496, uk_rsize=2496
--> filedesc0: uk_ppera=2, uk_ipers=7, uk_size=1104, uk_rsize=1104
--> THREAD: uk_ppera=4, uk_ipers=11, uk_size=1360, uk_rsize=1376
--> Mountpoints: uk_ppera=5, uk_ipers=7, uk_size=2744, uk_rsize=2744
--> socket: uk_ppera=2, uk_ipers=9, uk_size=856, uk_rsize=856
--> tcpcb: uk_ppera=2, uk_ipers=9, uk_size=872, uk_rsize=872
--> sctp_ep: uk_ppera=3, uk_ipers=8, uk_size=1456, uk_rsize=1456
--> sctp_asoc: uk_ppera=3, uk_ipers=5, uk_size=2408, uk_rsize=2408
--> sctp_raddr: uk_ppera=2, uk_ipers=11, uk_size=736, uk_rsize=736
I implemented this yesterday and got similar results. In my case, the FPU_save_area zone also ended up with a larger slab size.
In the thread zone, I noticed that we'd get good packing with ppera = 1 if the item alignment were 16 bytes (leading to rsize = 1360), but as a result of r313391 we have rsize = 1376.
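For reference, the packing arithmetic behind that observation (a throwaway userland check; PAGE_SIZE is assumed to be 4096 and the rsize values come from the table above):

#include <stdio.h>

int
main(void)
{
	const unsigned page = 4096;
	const unsigned rsize16 = 1360;	/* rsize if items were 16-byte aligned */
	const unsigned rsize32 = 1376;	/* actual rsize after r313391 */

	/* 3 items per page, 16 bytes of waste -> ppera = 1 would suffice. */
	printf("16-byte aligned: %u items/page, %u bytes wasted\n",
	    page / rsize16, page % rsize16);
	/* 2 items per page, 1344 bytes of waste -> UMA grows the slab. */
	printf("current rsize:   %u items/page, %u bytes wasted\n",
	    page / rsize32, page % rsize32);
	return (0);
}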
vm/uma_core.c:1267
This flag needs to be set in the zone too, for at least uma_dbg_alloc().
The entry size depends on the hardware. It is sized according to the presence of XMM/YMM/ZMM registers (in fact, the CPU reports the needed size for XSAVE, if supported).
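For reference, the size discovery is just a CPUID query; the sketch below only illustrates leaf 0xD, it is not the in-tree fpuinit() sizing code, and the 512-byte legacy fallback is an assumption:

#include <sys/param.h>
#include <machine/cpufunc.h>
#include <machine/md_var.h>
#include <machine/specialreg.h>

/* Query the save-area size the CPU asks for; illustration only. */
static u_int
xsave_area_size(void)
{
	u_int regs[4];

	if ((cpu_feature2 & CPUID2_XSAVE) == 0)
		return (512);		/* legacy FXSAVE area (assumed) */
	cpuid_count(0xd, 0x0, regs);
	return (regs[1]);		/* EBX: bytes needed for the features enabled in XCR0 */
}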
vm/uma_core.c:1267
My impression is that the flags set here are inherited through the following lines in zone_ctor():

zone->uz_flags |= (keg->uk_flags & (UMA_ZONE_INHERIT | UMA_ZFLAG_INHERIT));

Am I overlooking something?
vm/uma_core.c:1267
Oh, you're right. I had hit a panic as a result of not setting VTOSLAB at all, but setting it in the keg alone did not fix the problem; I must have made a mistake when testing.
vm/vm_page.c:499–506
I'm not thrilled by this snippet because it extends the vm_page_array by a huge amount:

SEGMENT 4:
start:     0x100000000
end:       0x7eaf9d000
domain:    0
free list: 0xffffffff81e19b70

SEGMENT 5:
start:     0x81ffa8000
end:       0x81ffe8000
domain:    0
free list: 0xffffffff81e19b70

SEGMENT 5 was created by this snippet. This snippet is required, at least in principle, because vmspace objects can be allocated using startup_alloc(). I think that the solution is to allocate the boot pages from a segment that is lower in the physical address space.
Three broad comments:
We don't necessarily know which allocations will be handed off for DMA. This may be safe as busdma becomes more sophisticated, but in the past network drivers definitely assumed mbufs were contiguous, for example. There may still be advantages to allocating aligned kva, or possibly just contiguous physical addresses and using the direct map. You could in principle still use a constant offset from the aligned boundary to find the slab. You would also produce shorter scatter/gather descriptors for those zones that are involved in DMA. Eliminating vtoslab() may not be that important with the per-cpu caches in play, but we do need to consider the implications of contiguous virtual addresses not being physically contiguous.
With power-of-two allocators, allocating non-power-of-two sizes will cause fragmentation to grow with each allocation. You will grab and split a 4-page contiguous chunk to allocate 3 pages. This can cause pathological behavior if there are a significant number of these allocations. Also, since the waste is always at the tail, two neighboring allocations won't even free single pages that can coalesce into pairs of pages. If these zones are not heavily used then it is not a concern, but as structure sizes change, the blend of zones using these larger allocations might change. You could change the algorithm to use the next larger power-of-two size after we cross the fragmentation threshold, which may actually trade more internal fragmentation for less external fragmentation. On 64-bit machines this is almost certainly not a problem in practice if you're using contiguous kva but not contiguous physical addresses. On large-memory 32-bit machines, address space fragmentation has always been a problem.
There may still be cases where it is advantageous to select offpage allocation. Imagine mbuf clusters, which are exactly 2k; two fit perfectly in a page. If you don't permit offpage slab headers, you will then allocate 5 pages to store 9 clusters and a slab header. It is probably more reasonable to go offpage here. This also sidesteps the virt/phys issues. I think the algorithm would evaluate whether offpage or onpage headers bring us below the minimum fragmentation at each slab size, with a preference for onpage if both fit.
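To spell out the cluster arithmetic (a throwaway userland check; the 128-byte slab header size is a made-up placeholder, not the real value from uma_int.h):

#include <stdio.h>

int
main(void)
{
	const unsigned page = 4096, cluster = 2048, hdr = 128;	/* hdr assumed */
	unsigned pages, space;

	for (pages = 1; pages <= 5; pages++) {
		space = pages * page;
		printf("%u page(s): %2u clusters with an on-page header, "
		    "%2u with an off-page header\n",
		    pages, (space - hdr) / cluster, space / cluster);
	}
	/* 1 page: 1 vs 2 clusters; 5 pages: 9 vs 10 -- matching the numbers above. */
	return (0);
}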
If we go down this path, I think that it would make sense to have a new flag to uma_zcreate() that guaranteed physical contiguity for the memory underlying each object. This might mean that we sacrifice fragmentation control for these objects.
To be clear, we could create a variant of uma_small_alloc() that called vm_page_alloc_contig(). This function could provide the alignment that you desire, enabling the use of a constant offset to locate the slab structure. However, it could fail due to physical memory fragmentation. We could fall back to kmem_arena-based allocation for zones that don't have the proposed flag set, but we would have to guarantee aligned kva, and that requires the use of the potentially expensive vmem_xalloc(). (Recall that in vmem, just being in the free list for size X does not mean that the range starts at an X-aligned boundary.)
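A minimal sketch of such a variant, assuming amd64 with a direct map; the function name, the alignment policy, and the flag handling are illustrative only, not part of the patch:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <vm/vm.h>
#include <vm/vm_param.h>
#include <vm/vm_extern.h>
#include <vm/vm_page.h>
#include <vm/uma.h>
#include <vm/uma_int.h>
#include <machine/vmparam.h>

/*
 * Sketch of a contiguity-guaranteeing backend allocator.  Returning NULL
 * when physical memory is too fragmented lets the caller fall back to the
 * existing kmem_arena path for zones without the proposed flag.
 */
static void *
uma_contig_small_alloc(uma_zone_t zone, vm_size_t bytes, uint8_t *pflag,
    int wait)
{
	vm_page_t m;
	vm_paddr_t align;
	u_long npages;

	npages = atop(round_page(bytes));
	/* Power-of-2 alignment so a mask plus constant offset finds the slab. */
	align = 1UL << flsll(ptoa(npages) - 1);
	*pflag = UMA_SLAB_PRIV;
	m = vm_page_alloc_contig(NULL, 0,
	    malloc2vm_flags(wait) | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED,
	    npages, 0, ~(vm_paddr_t)0, align, 0, VM_MEMATTR_DEFAULT);
	if (m == NULL)
		return (NULL);		/* physical memory too fragmented */
	return ((void *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)));
}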
> With power-of-two allocators, allocating non-power-of-two sizes will cause fragmentation to grow with each allocation. You will grab and split a 4-page contiguous chunk to allocate 3 pages. This can cause pathological behavior if there are a significant number of these allocations. Also, since the waste is always at the tail, two neighboring allocations won't even free single pages that can coalesce into pairs of pages. If these zones are not heavily used then it is not a concern, but as structure sizes change, the blend of zones using these larger allocations might change. You could change the algorithm to use the next larger power-of-two size after we cross the fragmentation threshold, which may actually trade more internal fragmentation for less external fragmentation. On 64-bit machines this is almost certainly not a problem in practice if you're using contiguous kva but not contiguous physical addresses. On large-memory 32-bit machines, address space fragmentation has always been a problem.
Yes, I think it's worth considering promotion to the next larger power of 2, especially if we're talking about an uma_small_alloc() variant that uses vm_page_alloc_contig().
Just so we're all on the same page here, I think that it's important to distinguish between power-of-2 allocators that are binary buddy allocators, like vm_phys.c, and allocators like vmem that are not. For example, if we're allocating from kmem_arena, a 3-page allocation can start on an even or odd page number. Moreover, vmem will coalesce adjacent 1- and 2-page ranges, regardless of whether the 2-page range comes first or last. No less important, when vmem splits a range to satisfy an allocation, it does not have to recursively split the unused portion into power-of-2-sized ranges. So, a sequence of 3-page allocations could be back-to-back.
> There may still be cases where it is advantageous to select offpage allocation. Imagine mbuf clusters, which are exactly 2k; two fit perfectly in a page. If you don't permit offpage slab headers, you will then allocate 5 pages to store 9 clusters and a slab header. It is probably more reasonable to go offpage here. This also sidesteps the virt/phys issues. I think the algorithm would evaluate whether offpage or onpage headers bring us below the minimum fragmentation at each slab size, with a preference for onpage if both fit.
Agreed, and I don't think that this patch changes our behavior in this respect. If the objects are power-of-2-sized, we will still use a single-page slab and an offpage slab structure.
I looked deeper into this. At some point in the consolidation of the busdma load functions we gained support for non-contiguous mbufs. This is true for any virtual address loaded via busdma. However, some network drivers may not actually support scatter/gather lists with more than one entry, and some drivers still don't use busdma. Given this, I think a flag makes sense; we can set it on power-of-two zones, likely with no actual impact but just for completeness, and also on mbufs for maximum compatibility. I don't think anything else requires it.
I don't think we really need to optimize out vtoslab() except for something incredibly high-traffic like mbufs, where it might be a benefit. I would want to see profiles of that, though.
> To be clear, we could create a variant of uma_small_alloc() that called vm_page_alloc_contig(). This function could provide the alignment that you desire, enabling the use of a constant offset to locate the slab structure. However, it could fail due to physical memory fragmentation. We could fall back to kmem_arena-based allocation for zones that don't have the proposed flag set, but we would have to guarantee aligned kva, and that requires the use of the potentially expensive vmem_xalloc(). (Recall that in vmem, just being in the free list for size X does not mean that the range starts at an X-aligned boundary.)
First, I wonder now why we even use the kmem/kernel arenas when we allocate contiguous memory. It may be more efficient to simply use the direct map. Is this just incomplete integration of the direct map?
Given that aligned virtual memory is only required for a faster vtoslab(), I think this is not really necessary.
>> With power-of-two allocators, allocating non-power-of-two sizes will cause fragmentation to grow with each allocation. You will grab and split a 4-page contiguous chunk to allocate 3 pages. This can cause pathological behavior if there are a significant number of these allocations. Also, since the waste is always at the tail, two neighboring allocations won't even free single pages that can coalesce into pairs of pages. If these zones are not heavily used then it is not a concern, but as structure sizes change, the blend of zones using these larger allocations might change. You could change the algorithm to use the next larger power-of-two size after we cross the fragmentation threshold, which may actually trade more internal fragmentation for less external fragmentation. On 64-bit machines this is almost certainly not a problem in practice if you're using contiguous kva but not contiguous physical addresses. On large-memory 32-bit machines, address space fragmentation has always been a problem.

> Yes, I think it's worth considering promotion to the next larger power of 2, especially if we're talking about an uma_small_alloc() variant that uses vm_page_alloc_contig().

> Just so we're all on the same page here, I think that it's important to distinguish between power-of-2 allocators that are binary buddy allocators, like vm_phys.c, and allocators like vmem that are not. For example, if we're allocating from kmem_arena, a 3-page allocation can start on an even or odd page number. Moreover, vmem will coalesce adjacent 1- and 2-page ranges, regardless of whether the 2-page range comes first or last. No less important, when vmem splits a range to satisfy an allocation, it does not have to recursively split the unused portion into power-of-2-sized ranges. So, a sequence of 3-page allocations could be back-to-back.
Yes, I had forgotten about this distinction. I did run into this problem once with 9k jumbo frames. They turned into allocations of three pages, which quickly exhausted the contiguous memory pool and left singletons floating around.
>> There may still be cases where it is advantageous to select offpage allocation. Imagine mbuf clusters, which are exactly 2k; two fit perfectly in a page. If you don't permit offpage slab headers, you will then allocate 5 pages to store 9 clusters and a slab header. It is probably more reasonable to go offpage here. This also sidesteps the virt/phys issues. I think the algorithm would evaluate whether offpage or onpage headers bring us below the minimum fragmentation at each slab size, with a preference for onpage if both fit.

> Agreed, and I don't think that this patch changes our behavior in this respect. If the objects are power-of-2-sized, we will still use a single-page slab and an offpage slab structure.
I see that it evaluates whether the header fits in the slab only once at the end, after we reach the desired fragmentation. I think this would be better done inside the loop, accounting for the slab header when computing fragmentation; after all, it's not unused space. I think we can also consolidate keg_small_init() and keg_large_init() at this point into a single function. Otherwise very large allocations will violate the fragmentation invariant.
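A rough sketch of that loop; the function name and the 10% waste threshold are placeholders, not what the patch currently does:

#include <sys/param.h>
#include <sys/systm.h>

/*
 * Unified layout selection: try increasing slab sizes until either an
 * on-page or an off-page header layout wastes little enough space.  The
 * header is charged as used space, not as waste.
 */
static void
keg_layout_sketch(u_int rsize, u_int hdrsize, u_int maxpages)
{
	u_int pages, space, ipers;

	for (pages = 1; pages <= maxpages; pages++) {
		space = pages * PAGE_SIZE;
		/* On-page header: prefer it whenever it meets the threshold. */
		ipers = (space - hdrsize) / rsize;
		if (space - hdrsize - ipers * rsize <= space / 10) {
			printf("%u page(s), %u items, on-page header\n",
			    pages, ipers);
			return;
		}
		/* Off-page header: the whole slab is available for items. */
		ipers = space / rsize;
		if (space - ipers * rsize <= space / 10) {
			printf("%u page(s), %u items, off-page header\n",
			    pages, ipers);
			return;
		}
	}
	printf("no layout within threshold; using %u page(s)\n", maxpages);
}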
I could take this patch and make a merged and simplified version of small/large init if you like. Fifteen years later, this code looks painfully cumbersome to me.