
kern/vm: Use the dedicated bio zone for the swap bio.
Needs Review · Public

Authored by seigo.tanimura_gmail.com on May 27 2024, 10:08 AM.

Details

Reviewers
olce
kib
mav
Summary
  • New Loader Tunable
  • vm.swap_reserved_new_bios: The number of bios reserved and preallocated for swap operations. Zero means no bios are reserved. Refer to kern.geom.reserved_new_bios in geom(4) for the configuration limitations. (A sample loader.conf(5) setting is shown below.)
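A minimal loader.conf(5) example; the value 256 is only a hypothetical illustration, not a recommended default:

    # /boot/loader.conf
    # Hypothetical value; size the reserve for the workload and see
    # kern.geom.reserved_new_bios in geom(4) for the limitations.
    vm.swap_reserved_new_bios="256"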

While I am here, apply some minor improvements to swapgeom_strategy():

  • g_alloc_bio() may block and hence never fails.

Signed-off-by: Seigo Tanimura <seigo.tanimura@gmail.com>

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

Hi Seigo,

Prior to any analysis, I assume you're doing this to fix some memory allocation deadlocks in the swap path. Could you describe a concrete scenario where you experienced a problem that this patch solves? Were you swapping to some regular partition, an encrypted one, or some vnode or zvol?

Thanks.

Ok, I had missed reviews D45380 and D45381, which mostly answer my question.

Prior to any analysis, I assume you're doing this to fix some memory allocation deadlocks in the swap path. Could you describe a concrete scenario where you experienced a problem that this patch solves? Were you swapping to some regular partition, an encrypted one, or some vnode or zvol?

There are two swaps on the GPT partitions of two NVMe drives.

root@pkgfactory2:~ # swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/nda1p1      67108820  7964000 59144820    12%
/dev/nda2p1      67108820  7987416 59121404    12%
Total           134217640 15951416 118266224    12%
root@pkgfactory2:~ # gpart show /dev/nda1
=>       40  134217648  nda1  GPT  (64G)
         40          8        - free -  (4.0K)
         48  134217640     1  freebsd-swap  (64G)

root@pkgfactory2:~ # gpart show /dev/nda2
=>       40  134217648  nda2  GPT  (64G)
         40          8        - free -  (4.0K)
         48  134217640     1  freebsd-swap  (64G)

root@pkgfactory2:~ #

Except for the multiple swap devices, this is, I believe, a common swap setup. Only two geom(4) classes are involved: part and disk.

Ok, I had missed reviews D45380 and D45381, which mostly answer my question.

They are actually intended as a sample implementation where a single I/O request from the upper layer is split into multiple bios. Such usage is likely to require a dedicated uma(9) zone to cover the allocation burst.
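As a rough sketch of that idea (not the actual code in these reviews; the names swap_bio_zone, swap_reserved_new_bios and swap_new_bio() are illustrative only), a dedicated uma(9) zone with a preallocated reserve could be set up like this:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/malloc.h>
#include <vm/uma.h>

static uma_zone_t swap_bio_zone;
static u_int swap_reserved_new_bios;    /* set from a loader tunable */

static void
swap_bio_zone_init(void)
{
    swap_bio_zone = uma_zcreate("swapbio", sizeof(struct bio),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
    if (swap_reserved_new_bios != 0) {
        /* Keep items aside for M_USE_RESERVE callers and back them now. */
        uma_zone_reserve(swap_bio_zone, swap_reserved_new_bios);
        uma_prealloc(swap_bio_zone, swap_reserved_new_bios);
    }
}

static struct bio *
swap_new_bio(void)
{
    /* Non-sleeping allocation that may dip into the reserve. */
    return (uma_zalloc(swap_bio_zone, M_NOWAIT | M_ZERO | M_USE_RESERVE));
}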

D45380 is also my answer to the question by @imp regarding the host with many NVMe drives. (https://reviews.freebsd.org/D45215#1031433)

Prior to any analysis, I assume you're doing this to fix some memory allocation deadlocks in the swap path. Could you describe a concrete scenario where you experienced a problem that this patch solves? Were you swapping to some regular partition, an encrypted one, or some vnode or zvol?

There are two swaps on the GPT partitions of two NVMe drives.

root@pkgfactory2:~ # swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/nda1p1      67108820  7964000 59144820    12%
/dev/nda2p1      67108820  7987416 59121404    12%
Total           134217640 15951416 118266224    12%
root@pkgfactory2:~ # gpart show /dev/nda1
=>       40  134217648  nda1  GPT  (64G)
         40          8        - free -  (4.0K)
         48  134217640     1  freebsd-swap  (64G)

root@pkgfactory2:~ # gpart show /dev/nda2
=>       40  134217648  nda2  GPT  (64G)
         40          8        - free -  (4.0K)
         48  134217640     1  freebsd-swap  (64G)

Both of these drives are mispartitioned. For best performance, the first partition should be aligned to at least 1MB, if not more. Do we really need to add code to handle performance problems from mis-aligned partitions that are pilot error?

root@pkgfactory2:~ #

Except for the multiple swap devices, this is, I believe, a common swap setup. Only two geom(4) classes are involved: part and disk.

I'm not sure that it's common to misconfigure systems, but the layout otherwise is the same.

Ok, I had missed reviews D45380 and D45381, which mostly answer my question.

They are actually intended as a sample implementation where a single I/O request from the upper layer is split into multiple bios. Such usage is likely to require a dedicated uma(9) zone to cover the allocation burst.

D45380 is also my answer to the question by @imp regarding the host with many NVMe drives. (https://reviews.freebsd.org/D45215#1031433)

The other option is to enhance geom so that these parameters can live there and the nvme driver doesn't need to change anything. It's a bit of a kludge to duplicate exactly the same code, with another pre-allocated pool, for something that could just be done in geom in the first place. We have various size constraints in geom so that individual drivers don't have to handle them. In fact, I'd love to rip out the code that does this bio splitting in nvme entirely. It really doesn't belong in the driver.

The other question I have: why is the swap pager going so nuts? Normally, back pressure keeps the source of I/Os from overwhelming the lower layers of the system (which is why allocations are failing and you've moved to a preallocation). Why isn't that the case here? We're flooding it with more traffic than it can process. It would be, imho, much better for it to schedule less I/O at a time than to have these mechanisms to cope with flooding. Are there other drivers that have other issues? Or is nvme somehow inherently special (or does it advertise too much I/O space up the stack)?

I tried to look over the whole set of reviews adding these reservations for the bio zones. I do not think this is a solution for any problem.

When the system starts swapping under load, we should only care about the system surviving; I see no point in trying to optimize its speed (of course, pointless slowness is not useful either). From this PoV, having some reserve of write bios for the swap path makes sense. But if we dedicate some resources (memory) to the reserve, it must be used consistently for the whole swap write path, starting from the buffer and ending in the bios actually touching the disks. The allocations of clones should use the reserve.

Instead, it seems to me that the patches gratuitously move some disk drivers to dedicated zones and add reserves for them. This is not a solution; it just happens that your specific configuration is covered by the fact that you 1) use those drivers for swap and 2) cannot generate more I/O (and not only swap) targeting those drivers.
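To make the last point concrete: intermediate GEOM classes clone the bio on the way down, and those clones come from the generic bio zone with a non-sleeping allocation, not from any swap reserve. A hedged sketch of such a start routine (example_start() and its arguments are hypothetical; the allocation behaviour follows the stock g_clone_bio()):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <geom/geom.h>

static void
example_start(struct bio *bp, struct g_consumer *cp, off_t part_offset)
{
    struct bio *cbp;

    cbp = g_clone_bio(bp);        /* M_NOWAIT; no swap-specific reserve */
    if (cbp == NULL) {
        /*
         * Under memory pressure the clone fails here and the swap
         * write is errored out, no matter how many bios the
         * top-level zone had reserved.
         */
        g_io_deliver(bp, ENOMEM);
        return;
    }
    cbp->bio_offset += part_offset;    /* translate into the partition */
    g_io_request(cbp, cp);             /* pass the clone down the stack */
}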

sys/vm/swap_pager.c:190

This is an absolutely insane number.

In D45379#1035237, @imp wrote:
root@pkgfactory2:~ # swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/nda1p1      67108820  7964000 59144820    12%
/dev/nda2p1      67108820  7987416 59121404    12%
Total           134217640 15951416 118266224    12%
root@pkgfactory2:~ # gpart show /dev/nda1
=>       40  134217648  nda1  GPT  (64G)
         40          8        - free -  (4.0K)
         48  134217640     1  freebsd-swap  (64G)

root@pkgfactory2:~ # gpart show /dev/nda2
=>       40  134217648  nda2  GPT  (64G)
         40          8        - free -  (4.0K)
         48  134217640     1  freebsd-swap  (64G)

Both of these drives are mispartitioned. For best performance, the first partition should be aligned to at least 1MB, if not more. Do we really need to add code to handle performance problems from mis-aligned partitions that are pilot error?

Getting off-topic, but I have to ask you back: are you talking about the AFT (4KB-sector) problem? If so, I believe it is sufficient to align to an 8 x 512B-sector (4KB) boundary. I understand that the recommendation for the 1MB boundary alignment comes from Windows and Linux. (https://superuser.com/questions/1483928/why-do-windows-and-linux-leave-1mib-unused-before-first-partition) Mac OS X has been found to align only to an 8 x 512B-sector boundary. (https://forums.macrumors.com/threads/aligning-disk-partitions-to-prevent-ssd-wear.952904/)

Also, the host in question is a VM on VMware Workstation. These drives are emulated by the hypervisor.

Having said that, I will try reallocating those partitions with the 1MB boundary alignment later and see if there are any changes.
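For reference, the realignment could look roughly like the following (device names taken from the output above; this is destructive and shown only as an untested illustration):

# swapoff /dev/nda1p1
# gpart delete -i 1 nda1
# gpart add -t freebsd-swap -a 1m -i 1 nda1
# swapon /dev/nda1p1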

In D45379#1035237, @imp wrote:
root@pkgfactory2:~ # swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/nda1p1      67108820  7964000 59144820    12%
/dev/nda2p1      67108820  7987416 59121404    12%
Total           134217640 15951416 118266224    12%
root@pkgfactory2:~ # gpart show /dev/nda1
=>       40  134217648  nda1  GPT  (64G)
         40          8        - free -  (4.0K)
         48  134217640     1  freebsd-swap  (64G)

root@pkgfactory2:~ # gpart show /dev/nda2
=>       40  134217648  nda2  GPT  (64G)
         40          8        - free -  (4.0K)
         48  134217640     1  freebsd-swap  (64G)

Both of these drives are mispartitioned. For best performance, the first partition should be aligned to at least 1MB, if not more. Do we really need to add code to handle performance problems from mis-aligned partitions that are pilot error?

Getting off-topic, but I have to ask you back: are you talking about the AFT (4KB-sector) problem? If so, I believe it is sufficient to align to an 8 x 512B-sector (4KB) boundary. I understand that the recommendation for the 1MB boundary alignment comes from Windows and Linux. (https://superuser.com/questions/1483928/why-do-windows-and-linux-leave-1mib-unused-before-first-partition) Mac OS X has been found to align only to an 8 x 512B-sector boundary. (https://forums.macrumors.com/threads/aligning-disk-partitions-to-prevent-ssd-wear.952904/)

Yes, after a fashion... Most of the blocks in modern SSDs are closer to 128k to 512k. By aligning to 1MB (or a larger power of 2), you'll automatically get the proper alignment. Some drives advertise an 'optimal alignment', which is where the bio allocations from nvme_ns are coming from; that's on the order of 64k-1MB. (I have drives from one family that do both 128k and 256k depending on the size of the drive.) By aligning partitions to 1MB or larger, these optimizations generally are nops because few I/Os will span that gap.

Also, the host in question is a VM on VMWare Workstation. These drives are emulated by the hypervisor.

Ah yes, that's likely another reason. It's way more efficient to have good boundaries here than to allow arbitrary writes.

Having said that, I will try reallocating those partitions with the 1MB boundary alignment later and see if there are any changes.

I think it will make things a lot better.

Warner

In D45379#1035253, @imp wrote:

The other question I have: why is the swap pager going so nuts? Normally, back pressure keeps the source of I/Os from overwhelming the lower layers of the system (which is why allocations are failing and you've moved to a preallocation). Why isn't that the case here?

In D45379#1035263, @kib wrote:

When the system starts swapping under load, we should only care about the system surviving; I see no point in trying to optimize its speed (of course, pointless slowness is not useful either). From this PoV, having some reserve of write bios for the swap path makes sense. But if we dedicate some resources (memory) to the reserve, it must be used consistently for the whole swap write path, starting from the buffer and ending in the bios actually touching the disks. The allocations of clones should use the reserve.

Instead, it seems to me that the patches gratuitously move some disk drivers to dedicated zones and add reserves for them. This is not a solution; it just happens that your specific configuration is covered by the fact that you 1) use those drivers for swap and 2) cannot generate more I/O (and not only swap) targeting those drivers.

The swap I/O behaviour observed so far actually comes from the tmpfs(5) filesystems used by poudriere-bulk(8) with -J 16. It is configured to use tmpfs(5) for wrkdir, data and localbase.

Also, the bios ran out during the baseline tests when almost all builders were extracting, in parallel, the packages required for the build and library dependencies. A typical situation for reproduction is right after completing multimedia/ffmpeg{,4}, on which many other ports depend. Although not reproduced in the latest test, the completion of math/octave triggers the builds of math/octave-forge-*, which have caused many bio allocation failures during the other baseline tests.

The bios for the swap writes, including those triggered by tmpfs(5) in order to grab new pages, are allocated by g_new_bio() in swapgeom_strategy(). This is different from the other filesystems, in which the bios are generally allocated by g_alloc_bio() in g_vfs_strategy(), or in vdev_geom_io_start() in the case of zfs(4). The latter can be regulated by blocking, as @imp has said, but the former cannot. The result is either a failure (baseline) or an allocation from the reserve (my change).
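For clarity, here is a minimal sketch of the two allocation styles contrasted above, assuming the stock GEOM API (swap_strategy_alloc() is a made-up helper name, not the actual swap_pager.c code):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/buf.h>

static struct bio *
swap_strategy_alloc(struct buf *bp)
{
    struct bio *bio;

    if ((bp->b_flags & B_ASYNC) != 0) {
        /*
         * An async swap write must not sleep here, so the
         * non-blocking allocator is used; it may return NULL when
         * the bio zone is exhausted.
         */
        bio = g_new_bio();
    } else {
        /*
         * g_alloc_bio() sleeps until memory is available and
         * therefore never fails, which is how the ordinary file
         * systems going through g_vfs_strategy() are throttled.
         */
        bio = g_alloc_bio();
    }
    return (bio);    /* NULL is possible only on the async path */
}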

The possible options for the mitigation include: (any new ideas welcome)

  • Limit the number of poudriere-bulk(8) builders (16 on my test host) and the tmpfs(5) size for each builder (64GB).

    As a poudriere(8) user, I cannot accept this. The demand on these parameters varies dynamically during the build: at the beginning, many small or medium-sized ports are built in parallel, while in the late stage a small number of builders work on the large ports. (The biggest one, to my knowledge, is www/chromium, requiring a work directory of at least 33GB.) It is not possible to cover both cases with a single set of parameters.
  • Guarantee blocking-free swap writes by the bio reservation.

    This is what D45215 does. It is essentially required for the swap write to avoid an inverse dependency on further memory allocation. This feature also applies to the swap writes issued by tmpfs(5), as they cannot be distinguished from ordinary swap writes.

    To be fair, D45215 has been confirmed to fix the issue on its own. D45379 has been added merely to separate the bio demand of the swap writes in case it overwhelms that of the other filesystems. The results at the end of https://reviews.freebsd.org/D45215#1035337 show that D45379 is probably not necessary.
  • Limit the number of pages for tmpfs(5).

    This looks like a straightforward fix with respect to the issue. Alternatively, the excess pages of tmpfs(5) may be written to the swap in advance, so that they can be reclaimed quickly when needed.

    A major problem is that the VM does not account for the number of pages per pager type. Also, I am not sure whether only tmpfs(5) should be regulated in this way; if not, the fix has to cover the other filesystems as well, which makes it complicated because most of the filesystems use the ordinary vnode pager.

In D45379#1035368, @imp wrote:

Having said that, I will try reallocating those partitions with the 1MB boundary alignment later and see if there are any changes.

I think it will make things a lot better.

Baseline Test Results with Partition Alignment Fix
Bio Allocation Stats per Caller

  • Overall (Significant callers only) {F84850688}
  • Failures (Significant callers only) {F84850721} Compared to the baseline of https://reviews.freebsd.org/D45215#1035337, the bio allocation failures were reduced to ~1/4. However, an extra change inevitably caused by the partition alignment fix may have also affected the results.

    The files in the ZFS pool used by poudriere(8) were restored to fix the alignment of the ZFS partitions. This made a significant change to the ZFS usage; the allocated size and fragmentation ratio dropped from 475GB and 27% to 260GB and 0%, respectively. It may be the case that this size reduction optimized the ZFS usage and relaxed the pressure on the kernel memory a lot. The ZFS pool on my test host mainly serves the source and ports git repositories, the distfiles, build logs (uncompressed), ccache and packages.

    I suppose the effects of the partition alignment fix have to be evaluated over the long term.
  • Detail: zfs(4) vdev {F84852452} {F84852466}
  • Detail: VM swap pager {F84852529} {F84852554}
  • Detail: geom(4) part and disk {F84852609} {F84852669} {F84852696} {F84852702} These failures seem unaffected by the partition alignment fix.