
kern/geom: Make the struct bio allocation more robust to a heavy load.
Needs ReviewPublic

Authored by seigo.tanimura_gmail.com on May 16 2024, 7:57 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

A heavy write load on an nvme(4) device may exhaust the bios because many
bio requests are in progress in parallel. Tmpfs(5) used by poudriere-bulk(8)
is one such case, in which many "swap_pager: cannot allocate bio" log lines
appear. Other filesystems can also trigger this issue, often without any
log output, though the failures are still counted as allocation failures of
the g_bio uma(9) zone.

This commit addresses the issue by reserving some bios for writes and the
like, which are allocated in the non-blocking manner. The reserve is
essentially required to keep g_new_bio() non-blocking, which avoids the
deadlock where a swap write is needed to allocate the kernel memory for the
new bios.

The default number of bios reserved for the non-blocking allocation is 65536.
This should be sufficient for a single nvme(4) device at its maximum number
of parallel requests.

  • New Loader Tunable
    • kern.geom.reserved_new_bios: The number of bios reserved for the non-blocking allocation. Zero means no bios are reserved. Due to a limitation of the uma(9) zone, this setting cannot be changed on a running host.

Signed-off-by: Seigo Tanimura <seigo.tanimura@gmail.com>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 57762
Build 54650: arc lint + arc unit

Event Timeline

sys/geom/geom_io.c
147–149

Wouldn't it be safer to allocate for the normal area and then if that fails, use the reserve? Then you wouldn't need so many reserve bios.

And what about my system that has 48 NVMe drives? There's no autoscaling to what's in the system, and that's made worse by always allocating out of the reserve.

171

too many parens: they aren't needed for the two ternary operands, for example.

201

Same comment here.

sys/geom/geom_io.c
147–149

I thought M_USE_RESERVE would try the normal area first, which has turned out to be wrong.

Have to watch out for the KTR stuff as well. Maybe that should be moved back to g_{new,alloc}_bio().

sys/geom/geom_io.c
147–149

ah, yes. If a reserve is set, and there's fewer than that many available for a non-use-reserve allocation, more slabs will be allocated. I think I was reading the code backwards and you're correct (I think I made this mistake a few years ago as well). I wound up pre-allocating then w/o reserve and just failing if we needed to allocate more.

Still worry about 64k hard-coded regardless of machine size.

sys/geom/geom_io.c
147–149

Checked the behaviour regarding M_USE_RESERVE again. The main logic is in keg_fetch_free_slab().

sys/vm/uma_core.c, a03c23931e:

/*
 * Fetch an existing slab from a free or partial list.  Returns with the
 * keg domain lock held if a slab was found or unlocked if not.
 */
static uma_slab_t
keg_fetch_free_slab(uma_keg_t keg, int domain, bool rr, int flags)
{
        uma_slab_t slab;
        uint32_t reserve;

        /* HASH has a single free list. */
        if ((keg->uk_flags & UMA_ZFLAG_HASH) != 0)
                domain = 0;

        KEG_LOCK(keg, domain);
        reserve = (flags & M_USE_RESERVE) != 0 ? 0 : keg->uk_reserve;
        if (keg->uk_domain[domain].ud_free_items <= reserve ||
            (slab = keg_first_slab(keg, domain, rr)) == NULL) {
                KEG_UNLOCK(keg, domain);
                return (NULL);
        }
        return (slab);
}

where keg->uk_reserve is configured by uma_zone_reserve(). So the idea above is to leave at least keg->uk_reserve free items in keg unless M_USE_RESERVE is set.

This also shows that it is all right to pass M_USE_RESERVE for a zone with no reservation; keg->uk_reserve is initialized to zero.
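As a rough illustration of the condition in keg_fetch_free_slab() quoted above, the following userland sketch models only the free-item check. The struct and function names are hypothetical and are not part of uma(9); the slab lookup and locking are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userland model only; hypothetical names, not the uma(9) KPI. */
struct keg_model {
	uint32_t free_items;	/* stands in for ud_free_items */
	uint32_t reserve;	/* stands in for uk_reserve */
};

/*
 * Mirrors the check in keg_fetch_free_slab(): a caller without
 * M_USE_RESERVE must leave more than `reserve` items free, while a
 * caller with M_USE_RESERVE may take the last free item.
 */
static bool
keg_model_can_fetch(const struct keg_model *keg, bool use_reserve)
{
	uint32_t reserve;

	assert(keg != NULL);
	reserve = use_reserve ? 0 : keg->reserve;
	return (keg->free_items > reserve);
}
```

With 10 free items and a reserve of 16, only the M_USE_RESERVE caller gets a slab from the free list; the other caller falls through to allocating a new slab. With a reserve of zero, this particular check behaves the same for both callers.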

Still worry about 64k hard-coded regardless of machine size.

That is configurable from the loader via kern.geom.reserved_new_bios. (Line 749)

The problem is that uma_zone_reserve() cannot be called if the target keg has any allocated items, which is likely to happen on a running host. Again, sys/vm/uma_core.c, a03c23931e:

/* See uma.h */
void
uma_zone_reserve(uma_zone_t zone, int items)
{
        uma_keg_t keg;

        KEG_GET(zone, keg);
        KEG_ASSERT_COLD(keg);
        keg->uk_reserve = items;
}
sys/geom/geom_io.c
147–149

You can scale the default based on memory size or other things we scale to the machine size.
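One possible shape of such scaling is sketched below as plain userland C. Everything in it is an assumption for illustration: the function name, the ratio of one reserved bio per MiB of physical memory, and the floor of 1024 are not taken from the diff. Only the ceiling of 65536 matches the current fixed default.

```c
#include <assert.h>
#include <stdint.h>

#define	RESERVE_MIN	1024u		/* hypothetical floor */
#define	RESERVE_MAX	65536u		/* the current fixed default */
#define	BYTES_PER_BIO	(1024u * 1024u)	/* hypothetical: one bio per MiB */

static_assert(RESERVE_MIN <= RESERVE_MAX, "floor must not exceed ceiling");

/* Derive a default reserve from the physical memory size, then clamp it. */
static uint32_t
scaled_bio_reserve(uint64_t physmem_bytes)
{
	uint64_t n;

	n = physmem_bytes / BYTES_PER_BIO;
	if (n < RESERVE_MIN)
		n = RESERVE_MIN;
	if (n > RESERVE_MAX)
		n = RESERVE_MAX;
	return ((uint32_t)n);
}
```

Under these assumed constants, an 8 GiB host would get 8192 reserved bios, and the loader tunable would still override the computed default.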

Key findings out of quick codewalk over g_new_bio() callers
sys/dev/nvme/nvme_ns.c, the nvme(4) namespace driver, calls g_new_bio() in the heaviest manner.

  • A lot of child bios are allocated in a burst.
  • The child bios are always allocated in the non-blocking way by g_new_bio(), regardless of the bio command.
  • The child bios are allocated and freed within the nvme(4) namespace driver.

These findings imply that the issue should be solved within the nvme(4) namespace driver.

My idea now is the per-nvme-ns uma(9) zone for the child bios. This is hopefully feasible thanks to the lifecycle of the child bios. The benefits include:

  • Optimal bio reservation per nvme(4) device.
  • Attach-time control of bio reservation.
  • Separation of allocation burst from global bio zone.
  • Tuning depending on nvme(4) nature.
  • Hotplug and removal.
sys/geom/geom_io.c
147–149

That may work as a tentative measure, but I am not sure if it is precise enough for production.

An alternative solution in my mind will be described as a separate comment.

What does nvmecontrol identify ndaX say here? What's the optimal I/O boundary? And maybe we should just have a knob to disable trying to use it.

nvmecontrol identify nda4 | grep Opt
Optimal I/O Boundary:        256 blocks
In D45215#1031630, @imp wrote:

What does nvmecontrol identify ndaX say here? What's the optimal I/O boundary? And maybe we should just have a knob to disable trying to use it.

nvmecontrol identify nda4 | grep Opt
Optimal I/O Boundary:        256 blocks

Or maybe to enable it. It originally was a workaround for old Intel drives that had a *HUGE* performance improvement doing this. These days, it's unclear if we should be enforcing it, especially on partitions that are not well aligned to this value.

In D45215#1031630, @imp wrote:

What does nvmecontrol identify ndaX say here? What's the optimal I/O boundary? And maybe we should just have a knob to disable trying to use it.

root@pkgfactory2:~ # dmesg | grep '^nda0'
nda0 at nvme0 bus 0 scbus31 target 0 lun 1
nda0: <VMware Virtual NVMe Disk 1.3 VMware NVME_0000>
nda0: Serial Number VMware NVME_0000
nda0: nvme version 1.3
nda0: 24576MB (50331648 512 byte sectors)
root@pkgfactory2:~ # nvmecontrol identify nda0
Size:                        50331648 blocks
Capacity:                    50331648 blocks
Utilization:                 50331648 blocks
Thin Provisioning:           Not Supported
Number of LBA Formats:       1
Current LBA Format:          LBA Format #00
Metadata Capabilities
  Extended:                  Not Supported
  Separate:                  Not Supported
Data Protection Caps:        Not Supported
Data Protection Settings:    Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities:    Not Supported
Format Progress Indicator:   Not Supported
Deallocate Logical Block:    Read Not Reported
Optimal I/O Boundary:        0 blocks
NVM Capacity:                0 bytes
Globally Unique Identifier:  6b98b012561de87e000c2969a5e05af1
IEEE EUI64:                  0000000000000000
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Best
root@pkgfactory2:~ #

The host in question is a VM on VMware Workstation 17. All of my NVMe storage devices are emulated by the hypervisor.

I have implemented stats for the bio allocation and found that, in my case, it is not nvme(4) but some other subsystems that allocate bios heavily. I will come back with the details soon.

Patch for Bio Allocation Stats
{F84091654}

This is not meant for merging as is, but could be after some polishing. At least, the bio allocation callers should be defined in a style like MALLOC_DEFINE() rather than being hardcoded.


Observed Bio Allocation Stats
Test case: poudriere-bulk(8) on 2325 ports including the dependencies. (11 failed due to problems in the ports tree; retesting now with local fixes)

  • ZFS enabled for poudriere(8) and the ccache(1) cache directory.
  • TMPFS enabled for wrkdir, data and localbase.
  • Swap partitions are created out of ZFS.

Bio Allocation Stats per Caller

  • Overall (Significant callers only) {F84092125}
    • There were no allocations from nvme(4).
    • The most significant callers are the zfs(4) vdev, geom(4) part and disk.
    • The VM swap pager comes in the late stage where the work directories do not fit within the memory.
    • The geom(4) vfs count is not zero, but is negligible in this case.
  • Failures (All callers) {F84092154}
    • There were literally no bio allocation failures.
  • Detail: zfs(4) vdev {F84092256}
    • All allocations are via g_alloc_bio().
    • The general entry point for the geom(4) activities by the poudriere(8) file access.
  • Detail: VM swap pager {F84093121}
    • g_new_bio() for writing, g_alloc_bio() for reading.
    • Usually less than zfs(4) vdev except for a couple of spikes, probably for the wrkdir and localbase accesses.
  • Detail: geom(4) part and disk {F84093234} {F84093246}
    • All allocations are via g_new_bio().
    • Called just before going into the storage drivers.

Analysis

The three significant callers suggest the following scenario:

  • The file accesses by the poudriere(8) threads trigger zfs(4) vdev. It is all right to block as this is the entry into geom(4).
  • zfs(4) vdev then calls deeper into geom(4) part and disk. They cannot block because they may be called by the geom/g_down kernel thread, which cannot sleep. sys/geom/geom_io.c, b12c6876b4:
void
g_io_schedule_down(struct thread *tp __unused)
{
        struct bio *bp;
        int error;

        for(;;) {
        /* snip */
                THREAD_NO_SLEEPING();
                CTR4(KTR_GEOM, "g_down starting bp %p provider %s off %ld "
                    "len %ld", bp, bp->bio_to->name, bp->bio_offset,
                    bp->bio_length);
                bp->bio_to->geom->start(bp);
                THREAD_SLEEPING_OK();
        }
}

If that is the case, the heavy load on the non-blocking bio allocation comes from the geom(4) design, and it can happen on any host under a sufficiently heavy I/O load.

I guess someone has found the issue before and added a workaround. Again, sys/geom/geom_io.c, b12c6876b4:

/*
 * Pace is a hint that we've had some trouble recently allocating
 * bios, so we should back off trying to send I/O down the stack
 * a bit to let the problem resolve. When pacing, we also turn
 * off direct dispatch to also reduce memory pressure from I/Os
 * there, at the expxense of some added latency while the memory
 * pressures exist. See g_io_schedule_down() for more details
 * and limitations.
 */
static volatile u_int __read_mostly pace;

One pitfall of this workaround, as I suspect, is that pacing turns some blocking allocations into non-blocking ones at the design level. This adds more pressure to the VM swap pager, another geom(4) user, contrary to the goal of pacing.

The bio reservation should ideally serve as the slack between the early warning of a resource shortage and the hard failure. I have seen that in the OpenZFS ARC (PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594), where a resource allocation under a certain watermark succeeds without blocking and triggers the resource reclaim thread. Maybe that is a general design pattern of Solaris, not just OpenZFS.
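The watermark idea can be sketched as a small userland model. This is only an illustration under assumptions: the struct, its fields and the function name are hypothetical, and the real reclaim thread is reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userland model only; hypothetical names. */
struct pool_model {
	uint32_t free_items;		/* items currently available */
	uint32_t low_watermark;		/* early-warning threshold */
	bool reclaim_requested;		/* stands in for waking a reclaim thread */
};

/*
 * Take one item without blocking. Going below the watermark still
 * succeeds but requests a reclaim; only an empty pool is a hard failure.
 */
static bool
pool_model_alloc(struct pool_model *p)
{
	assert(p != NULL);
	if (p->free_items == 0)
		return (false);
	p->free_items--;
	if (p->free_items < p->low_watermark)
		p->reclaim_requested = true;
	return (true);
}
```

The items between the watermark and zero are the slack: allocations in that range keep succeeding while the reclaim has a chance to run.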

My fix implements only the slack part as of now. It would be even better for uma(9) to monitor and act on that, though that work would be much more than just the bio allocation issue.

One problem overlooked regarding M_USE_RESERVE:

keg_fetch_slab() looks for the free slab across the domains when M_USE_RESERVE is given. This is true even for the zone without any reservation.

This means that uma_zalloc(zone, M_NOWAIT | M_USE_RESERVE) is not completely equivalent to uma_zalloc(zone, M_NOWAIT) for the zone with no reservation.

The problem was exposed by the baseline test where kern.geom.reserved_new_bios was set to zero. Under this setup, multiple "swap_pager: indefinite wait buffer" log lines appeared and many poudriere-bulk(8) processes were killed for failing to read from the swap. It is suspected that the swap I/O stalled because all of the free bios in the zone were allocated.

Updated Test Results with Fix

Bio Allocation Stats per Caller

  • Overall (Significant callers only) {F84267545}
  • Failures (All callers) {F84267712}
  • Details: zfs(4) vdev, VM swap pager, geom(4) part and disk. {F84267802} {F84267816} {F84267825} {F84267837}

Baseline Test Results

  • kern.geom.reserved_new_bios set to zero.
  • The "swap_pager: cannot allocate bio" logs reproduced when many builders extracted the dependency packages in parallel.
  • It has turned out that all of the port build errors were due to problems in the ports tree. I believe the fix still makes sense because the logs above have disappeared with the fix.

Bio Allocation Stats per Caller

  • Overall (Significant callers only) {F84269289}
  • Failures (Significant callers only) {F84269331}
    • The failures tended to happen in burst.
    • Most of the failures happened in the VM swap pager.
  • Detail: zfs(4) vdev {F84269601} {F84269757}
    • No failures.
  • Detail: VM swap pager {F84269821} {F84269836}
    • Almost all failures happened here.
    • The free slabs are not fetched across the domains; allowing that (by mistake) ended up with the "swap_pager: indefinite wait buffer" logs and many processes killed for the indefinite swap read time.
  • Detail: geom(4) part and disk {F84270556} {F84270602} {F84270619} {F84270646}
    • ~1-10% of the VM swap pager, but not zero.

Fix Requirements and Design
The test results up to this moment and the findings put some requirements for the robust fix, including:

  • The allocation of the swap write bios must be distinct from the generic bio allocation, so that the activities on the generic bios do not interfere with the swap write, and vice versa.
    • This also applies to nvme(4) and xen(4) blkback because splitting is not a usual bio usage.
  • When a bio being cloned is allocated distinctly from the generic ones, the cloned bio should also be allocated distinctly.
  • The free bios for the write operations should be favoured over the non-write operations because the former may make some pages reclaimable.
    • Implemented already in the fix.

Caller-supplied bio zones should meet the requirements above. For example, the VM swap pager creates its own bio zone with a reserve, and allocates the write bios out of it. The originating zone is recorded in the bio so that it can be reused for cloning as well as for returning the bio to its zone. This allows the isolation of the bio activities and also the fine tuning of the bio zones per usage.
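The zone inheritance can be pictured with the following userland sketch. It is not the proposed kernel code: the struct layouts, the bio_zone field and the function names are hypothetical, and malloc(3) stands in for uma(9).

```c
#include <assert.h>
#include <stdlib.h>

/* Userland model only; hypothetical names. */
struct zone_model {
	const char *name;
	int in_use;			/* bios currently allocated from this zone */
};

struct bio_model {
	struct zone_model *bio_zone;	/* zone this bio was allocated from */
	struct bio_model *bio_parent;	/* original bio when cloned */
};

/* Allocate a bio from the caller-supplied zone and record the zone in it. */
static struct bio_model *
bio_model_new(struct zone_model *z)
{
	struct bio_model *bp;

	assert(z != NULL);
	bp = calloc(1, sizeof(*bp));
	if (bp == NULL)
		return (NULL);
	bp->bio_zone = z;
	z->in_use++;
	return (bp);
}

/* A clone is allocated from the zone of the original bio. */
static struct bio_model *
bio_model_clone(struct bio_model *parent)
{
	struct bio_model *bp;

	assert(parent != NULL);
	bp = bio_model_new(parent->bio_zone);
	if (bp != NULL)
		bp->bio_parent = parent;
	return (bp);
}

/* The recorded zone tells where the bio must be returned. */
static void
bio_model_destroy(struct bio_model *bp)
{
	assert(bp != NULL && bp->bio_zone->in_use > 0);
	bp->bio_zone->in_use--;
	free(bp);
}
```

In this model, a swap write bio and all of its clones are accounted to the swap zone only, so a burst on the generic zone does not consume them.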

I will see how I can implement that for the VM swap pager.

seigo.tanimura_gmail.com retitled this revision from "kern/geom: Reserve some bios for the non-blocking allocation." to "kern/geom: Make the struct bio allocation more robust to a heavy load." on May 27 2024, 9:59 AM.

Diff Updates

  • The bio allocation accepts the caller-supplied uma(9) zone for both non-blocking (g_new_bio_uz()) and blocking (g_alloc_bio_uz()) allocations.
  • The bio cloners (g_clone_bio() and g_duplicate_bio()) inherit the uma(9) zone of the original bio.
  • One prerequisite and three usage diffs submitted separately.
  • Optimization.
  • The new KPIs and loader tunables are covered in the man pages.

Poudriere-bulk(8) Test Results

  • The OS version has been updated to 14.1-RC1.
  • kern.geom.reserved_new_bios and vm.swap_reserved_new_bios are set to 65536, the default values.
  • The tendency of the results is the same as in https://reviews.freebsd.org/D45215#1033442.

Bio Allocation Stats per Caller

  • Overall (Significant callers only) {F84704797}
  • Failures (All callers) {F84704820}
    • No failures.
  • Details: zfs(4) vdev, VM swap pager, geom(4) part and disk. {F84704846} {F84704862} {F84704871} {F84704880}

`vm_lowmem` Kernel Events
This is actually taken from the stats collected for https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594.
{F84705149}

  • The zfs.arc.vm_lowmem.kmem trace shows the VM_LOW_KMEM kernel events.
  • The highest event rate: ~150 [events] / 10 [minutes] = 0.25 [events / sec]

Baseline Test Results

Bio Allocation Stats per Caller

  • Overall (Significant callers only) {F84706318}
  • Failures (Significant callers only) {F84706456}
    • The failures tended to happen in burst.
    • Most of the failures happened in the VM swap pager; some in the geom(4) disk and part.
  • Detail: zfs(4) vdev {F84706556} {F84706573}
    • No failures.
  • Detail: VM swap pager {F84706621} {F84706634}
    • Almost all failures happened here.
    • The failures here delay the swap write operations.
  • Detail: geom(4) part and disk {F84706729} {F84706743} {F84706756} {F84706770}
    • ~1-10% of the VM swap pager, but not zero.
    • The penalty of these failures includes the pace down in geom(4), which actually causes another problem. The details are discussed in the analysis section.

Analysis and Discussion

Effect of Pacing Down in geom(4)
When g_io_deliver() detects a bio allocation error from a geom(4) consumer, it triggers the pace down by setting pace = 1. This changes how geom(4) processes bios, including but not limited to:

  • Passing all bios to the consumers via the queue (g_bio_run_down) and the g_down kernel thread, instead of the direct function calls.
  • Enforcing a pause of at least 1ms upon each bio scheduling by g_down.

Both changes throttle the flow of bios.

The problem with this approach is that it does not take into account what purpose the bios are serving. While the pace down is likely to work if the bios come from normal file I/O operations by user processes, it makes the problem worse when the bios are for swap writes. The pace down slows the swap writes, and hence the shortage of free pages gets worse.
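To get a feel for the cost, the arithmetic can be written down as follows. This is a back-of-the-envelope sketch, not a measurement: it takes the pause of at least 1 ms per scheduled bio at face value and assumes that pacing stays in effect for the whole burst, which the real code may not do. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define	PACE_PAUSE_MS	1u	/* lower bound of the pause per scheduled bio */

/* Minimum extra delay, in milliseconds, for a burst of bios under pacing. */
static uint64_t
paced_min_delay_ms(uint64_t nbios)
{
	assert(nbios <= UINT64_MAX / PACE_PAUSE_MS);
	return (nbios * PACE_PAUSE_MS);
}

/* Upper bound on the bios g_down can schedule per second under pacing. */
static uint64_t
paced_max_bios_per_sec(void)
{
	return (1000u / PACE_PAUSE_MS);
}
```

Under these assumptions, g_down schedules no more than 1000 bios per second, so a burst of 1000 swap write bios is delayed by at least one second while the free pages keep shrinking.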

Related Work
D24400, committed as c6213beff4, addresses a similar issue, except that the issue is in the geom(4) eli class. D24400 creates a uma(9) zone for the encryption and decryption work in the eli class. It then allocates from the reserve if the bio being handled has BIO_SWAP.

There are at least two points that have to be compared to my change:

  1. Where to place the uma(9) zones

    D24400 deals with the memory used by the eli class only, so it is natural that the zone is also held within the eli class. Contrary to that, my case has to cover the geom(4) classes used by multiple bio flows. In my setup, they are the disk and part classes.

    It highly depends on the actual geom(4) configuration which class has the hot bio flow. If my setup had, say, a mirror or RAID instance used for poudriere-bulk(8), that would be as hot as the underlying part and disk instances. This means that the per-class uma(9) zone for the bio would require the tuning in multiple classes. Also, the low-layer classes close to the physical devices (again, the disk and part) often have to deal with multiplexed bio flows. That makes the tuning even more difficult.

    My change intends to place the uma(9) zones for the bio at the origin of the bio flow instead. Along with the inheritance of the zone upon cloning, this approach moves the tuning points to the entries into geom(4), namely the callers of g_new_bio() and g_alloc_bio(). This makes the relation between the tuning points and the bio flow sources clear and hopefully the tuning easy. Also, the low-layer classes do not need any special tuning.

    The exception is the classes that split a received bio into multiple pieces. nvme(4) is a good example. The zone inheritance in such a case would push the requirement for the extra bios created by splitting up to the geom(4) entry points, so the tuning of them would have to take that into account and would be difficult. The diffs of D45380 and D45381 are intended to encapsulate the nature of the bio split within each element.
  2. The number of the reserved items

    D24400 defines the default number of the reserved items as 16. (static u_int g_eli_minbufs = 16;) I am curious how that has been determined.

    65536 in my change was actually taken when the design of the non-default bio uma(9) zone placement was TBD. Under the placement described in 1., it is now possible to measure the number of the bios allocated out of each zone and make a better estimation. I have actually done that:

    Bio Counts in g_bio and swap_bio with My Change {F84723913} {F84723935}

    Bio Counts in g_bio on Baseline (NB swap_bio is not created in this case) {F84724014}

    I admit 65536 is definitely overkill. Maybe 1024 would work, though these counts are merely instantaneous values. It would be nice if the uma(9) zone kept a maximum counter since its creation.
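Such a maximum counter could look roughly like the userland sketch below. It is an assumption about the shape only: the struct and function names are hypothetical, uma(9) does not necessarily offer this, and a kernel version would need atomic or per-CPU counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userland model only; hypothetical names. */
struct zone_stats_model {
	uint64_t in_use;	/* items currently allocated */
	uint64_t max_in_use;	/* highest in_use seen since creation */
};

/* Account for one allocation and update the high-water mark. */
static void
zone_stats_alloc(struct zone_stats_model *s)
{
	assert(s != NULL);
	s->in_use++;
	if (s->in_use > s->max_in_use)
		s->max_in_use = s->in_use;
}

/* Account for one free; the high-water mark is never lowered. */
static void
zone_stats_free(struct zone_stats_model *s)
{
	assert(s != NULL && s->in_use > 0);
	s->in_use--;
}
```

Reading max_in_use at the end of a poudriere-bulk(8) run would give the peak demand on each zone, which is the figure needed to size the reserve, rather than the instantaneous samples.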

During the retest of the partition alignment fix, some extreme drops of the free pages have been found.

{F85018239} {F85053883}

The first chart above shows the range of the free page count and its key values. The second one depicts the low-memory events by the VM and the bio allocation failures.

At the end of the traces, the observed minimum free page count fell to 4, the lower bound for VM_ALLOC_SYSTEM. (My test host has 2 VM domains.) Although the pagedaemon for uma(9) started working shortly after the peak of the bio allocation failures, I suppose it was too late. Also, the domain pagedaemons did not react significantly.

The traces were discontinued because the kernel killed the processes that failed to reclaim pages after a stall, including the stats collector (fluent-bit with my local input plugin for sysctl(3)). The host somehow recovered, with build failures on about 10 ports.

I am now rerunning poudriere-bulk(8) with the following patch, which prints some context and the stack backtrace when _vm_domain_allocate() allocates pages beyond vmd_free_reserved. About 5 hours after the start, quite a few backtraces include zfs(4). Will update later.
{F85018926}