This was tested w/ qemu on arm64 where the median latency was previously
over 64us, but is now half that (for the times that we have data),
but most of the time we don't have data, so then it returns in
1-2us...
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped - Build Status
Buildable 50135 Build 47027: arc lint + arc unit
Event Timeline
This change was inspired by noticing that CPU usage was significantly increased when running w/ the virtio_random module loaded. W/o it loaded, the harvest thread was consuming ~.02% cpu, but when it was loaded, it was consuming ~1% CPU (and there's a bug report of it consuming 100% cpu: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=269823 which this would partially fix).
This change drops the CPU usage w/ virtio loaded down to ~.1%.
I ran some tests using dtrace, using the script: https://www.funkthat.com/gitea/jmg/scripts/src/branch/main/dtrace/virtio_random.d
This showed major latency, one run, notice how one call took over 4ms, and during that entire time, we were busy waiting the cpu, so burning energy, and cpu cycles to do nothing useful):
value count 2048 0 4096 4 8192 6 16384 50 32768 99 65536 96 131072 53 262144 8 524288 6 1048576 4 2097152 1 4194304 1
With this change:
value count 128 0 256 2 512 14 1024 232 2048 124 4096 7 8192 3 16384 6 32768 58 65536 61 131072 11 262144 6 524288 0
as you can see, the longest latency is less than 500µs, an 8x performance improvement.
Nice observation and fix.
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
63 | Paired with the full-size nominal alignment this is supposed to guarantee we don't straddle a page boundary, I guess? I'm not sure big alignments like (4 * HARVESTSIZE) are actually honored, though. The softc backing memory is allocated by the bus which is unaware of vtrnd_softc. It would probably be safer to have the data heap-allocated and pointed to from the softc, rather than embedded. |
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
63 |
so, HARVESTSIZE is 2, so this is an alignment of 8, which I'm pretty sure is done/required/allowed. I just ran an experiment to verify alignment 8 is preserved. I added the following line: compiled the code to make sure it's correct. (4 pointers, bool, two structs each 16 bytes each), then I added a bool variable immediately before, and added 8 to the offset, and it compiled fine. Then I changed the "expected" offset to only +4 if I removed the aligned value, so it is working as intended. If you want to be extra safe, I can add the following: which enforces that the alignment is kept, though I guess if we want to be REALLY careful, we should also make sure that the size is 8 in case it changes. The absolute smallest alignment is pointer (this seems like it should be fixed to at least be 8 if not 16, see UMA_ALIGN_PTR). at least by my reading of the code, softc allocations are aligned to power of 2's starting at 16 bytes, as the structure is 80 bytes, the softc will be aligned to 128, which is the next power of 2. For another day would be to have a virtio function that takes a pointer+length, and does the necessary work to create the sglist_seg and sglist so we don't have to worry about it.. that is given (size / PAGE_SIZE + 2 segs, and a pointer, that it'll do the required splitting so this won't ever be a problem. |
rebuilt with this patch, here's the disas:
`Dump of assembler code for function vtrnd_read:
0xffffffff82b05340 <+0>: 55 push %rbp 0xffffffff82b05341 <+1>: 48 89 e5 mov %rsp,%rbp 0xffffffff82b05344 <+4>: 41 57 push %r15 0xffffffff82b05346 <+6>: 41 56 push %r14 0xffffffff82b05348 <+8>: 41 54 push %r12 0xffffffff82b0534a <+10>: 53 push %rbx 0xffffffff82b0534b <+11>: 48 83 ec 10 sub $0x10,%rsp 0xffffffff82b0534f <+15>: 48 8b 1d 3a 1f 00 00 mov 0x1f3a(%rip),%rbx # 0xffffffff82b07290 <g_vtrnd_softc> 0xffffffff82b05356 <+22>: 45 31 f6 xor %r14d,%r14d 0xffffffff82b05359 <+25>: 48 85 db test %rbx,%rbx 0xffffffff82b0535c <+28>: 74 4e je 0xffffffff82b053ac <vtrnd_read+108> 0xffffffff82b0535e <+30>: 80 7b 20 00 cmpb $0x0,0x20(%rbx) 0xffffffff82b05362 <+34>: 75 48 jne 0xffffffff82b053ac <vtrnd_read+108> 0xffffffff82b05364 <+36>: 41 89 f4 mov %esi,%r12d 0xffffffff82b05367 <+39>: 49 89 ff mov %rdi,%r15 0xffffffff82b0536a <+42>: 48 8b 7b 10 mov 0x10(%rbx),%rdi 0xffffffff82b0536e <+46>: 48 8d 75 dc lea -0x24(%rbp),%rsi 0xffffffff82b05372 <+50>: e8 29 be f0 fd call 0xffffffff80a111a0 <virtqueue_dequeue> 0xffffffff82b05377 <+55>: 48 85 c0 test %rax,%rax 0xffffffff82b0537a <+58>: 74 30 je 0xffffffff82b053ac <vtrnd_read+108> 0xffffffff82b0537c <+60>: 44 8b 75 dc mov -0x24(%rbp),%r14d 0xffffffff82b05380 <+64>: 45 39 f4 cmp %r14d,%r12d 0xffffffff82b05383 <+67>: 45 0f 42 f4 cmovb %r12d,%r14d 0xffffffff82b05387 <+71>: 4c 8d 63 48 lea 0x48(%rbx),%r12 0xffffffff82b0538b <+75>: 4c 89 ff mov %r15,%rdi 0xffffffff82b0538e <+78>: 4c 89 e6 mov %r12,%rsi 0xffffffff82b05391 <+81>: 4c 89 f2 mov %r14,%rdx 0xffffffff82b05394 <+84>: e8 b7 66 5d fe call 0xffffffff810dba50 <memcpy_erms> 0xffffffff82b05399 <+89>: 4c 89 e7 mov %r12,%rdi 0xffffffff82b0539c <+92>: 4c 89 f6 mov %r14,%rsi 0xffffffff82b0539f <+95>: e8 4c 93 1f fe call 0xffffffff80cfe6f0 <explicit_bzero> 0xffffffff82b053a4 <+100>: 48 89 df mov %rbx,%rdi 0xffffffff82b053a7 <+103>: e8 e4 fe ff ff call 0xffffffff82b05290 <vtrnd_enqueue> 0xffffffff82b053ac <+108>: 44 89 f0 mov %r14d,%eax 0xffffffff82b053af <+111>: 48 83 c4 10 add $0x10,%rsp 0xffffffff82b053b3 <+115>: 5b pop %rbx 0xffffffff82b053b4 <+116>: 41 5c pop %r12 0xffffffff82b053b6 <+118>: 41 5e pop %r14 0xffffffff82b053b8 <+120>: 41 5f pop %r15 0xffffffff82b053ba <+122>: 5d pop %rbp 0xffffffff82b053bb <+123>: c3 ret
End of assembler dump.
`
based on the repeating pattern:
`
CPU ID FUNCTION:NAME
3 73327 vtrnd_read:0 3 73328 vtrnd_read:1 3 73329 vtrnd_read:4 3 73330 vtrnd_read:6 3 73331 vtrnd_read:8 3 73332 vtrnd_read:10 3 73333 vtrnd_read:11 3 73334 vtrnd_read:15 3 73335 vtrnd_read:22 3 73336 vtrnd_read:25 3 73337 vtrnd_read:28 3 73338 vtrnd_read:30 3 73339 vtrnd_read:34 3 73340 vtrnd_read:36 3 73341 vtrnd_read:39 3 73342 vtrnd_read:42 3 73343 vtrnd_read:46 3 73344 vtrnd_read:50 3 73345 vtrnd_read:55 3 73346 vtrnd_read:58 3 73360 vtrnd_read:108 3 73361 vtrnd_read:111 3 73362 vtrnd_read:115 3 73363 vtrnd_read:116 3 73364 vtrnd_read:118 3 73365 vtrnd_read:120 3 73366 vtrnd_read:122 3 73367 vtrnd_read:123 3 73327 vtrnd_read:0
`
we go from instruction 58 to 108; I'm too tired to understand what that means right now
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
63 | I forgot HARVESTSIZE was just 2 and this is only 8 bytes, sorry. Yeah, the softc will probably be 8-byte aligned on any 64-bit architecture with or without the explicit annotation. I just don't think the annotation is causing that. I'm less sure we always 8-byte align on 32-bit arch, but it's definitely possible. I was more concerned when I was mistakenly picturing a larger vtrnd_value.
This relies on the softc itself being suitably aligned. The member offset could be correct but if the softc is allocated misaligned, the member won't be aligned either.
sglist_build? https://github.com/freebsd/freebsd-src/blob/1c09320d5833fef8a4b6cc0091883fd47ea1eb1b/sys/kern/subr_sglist.c#L697 |
[...]
0xffffffff82b053ac <+108>: 44 89 f0 mov %r14d,%eax 0xffffffff82b053af <+111>: 48 83 c4 10 add $0x10,%rsp 0xffffffff82b053b3 <+115>: 5b pop %rbx 0xffffffff82b053b4 <+116>: 41 5c pop %r12 0xffffffff82b053b6 <+118>: 41 5e pop %r14 0xffffffff82b053b8 <+120>: 41 5f pop %r15 0xffffffff82b053ba <+122>: 5d pop %rbp 0xffffffff82b053bb <+123>: c3 retEnd of assembler dump.
we go from instruction 58 to 108; I'm too tired to understand what that means right now
So, this is simply saying that it attempted to _dequeue (read) the data, it failed, and so returning 0... I assume that you're also not seeing 100% cpu usage on the harvest thread anymore?
This is the expected behavior for your case. In mine, where the HV does work, w/o https://reviews.freebsd.org/D38897, I'll see the above 3 out of 4 times... because it loops trying to read the data 4 times in quick succession but there's only data from the last set of 4 polls.
So, I think what might be wrong is that you don't have another virtio driver to "unstick" the virtio queue. FreeBSD only has drivers for balloon, block, console, network, entropy and scsi. I looked at your config, and the only other device you define that is supported by FreeBSD is balloon. In my config, I'm using both network and block so it's possible that it's working here only because there is additional disk/network traffic causing the rng events to get dequeued...
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
63 | you must have skipped over this statement:
I looked at subr_bus.c and kern_malloc.c and traced through the code to come to this conclusion. You'll need to provide me w/ stronger evidence that my reading of the code is wrong. Also, malloc(9) says: The malloc(), realloc(), and reallocf() functions return a kernel virtual address that is suitably aligned for storage of any type of object, or NULL if the request could not be satisfied (implying that M_NOWAIT was set). I assume that this includes the _domainset variants, AND by aligned for any type, I assume means that if you have a 256byte type, that it'll be aligned ti 256 bytes, which meshes w/ my reading of the code that it aligns to a power of 2... so as I said, the softc is 128 bytes, so will be aligned to 128 bytes. And specifically this code: https://cgit.freebsd.org/src/tree/sys/kern/kern_malloc.c#n1250 |
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
63 |
nice, it exists, would be nice if there was a version that would use a static allocation, but I guess now that it's in the softc, it could be used as the address of the softc won't change, but I'm convinced per above that it won't cross a page boundary so doesn't matter/need to be. I've spent way too much time today verifying that the allocation will be properly aligned. |
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
63 | No, I didn’t miss that part. Your response reminds me why I don’t participate in FreeBSD much anymore, though. |
you can see (part? of) my config at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=269874
so there is a network driver, but there might not have been much going on
regardless, the fact that virtio_random requires some other driver to unstick its poll queue sounds like a bug in our virtio implementation.
I didn't see any in your config listed at: https://gist.github.com/igalic/dba54db678ad2f12b08d35504e7909de#file-qemu-conf-L87 which is why I assumed there wasn't one, but maybe a -nic on the command line automatically adds a network adapter. (I'm a qemu newbie, so I didn't even know a config file could be created till you posted yours.)
For my system, I'm using the command:
qemu-system-aarch64 -cpu host -accel hvf -smp 2 -m 4096 -drive file=pflash0.slow.img,format=raw,if=pflash,readonly=on -drive file=pflash1.slow.img,format=raw,if=pflash -M virt -device virtio-gpu-pci -display default,show-cursor=on -device qemu-xhci -device usb-kbd -device usb-tablet -monitor unix:qemu-monitor-socket,server,nowait -device intel-hda -device hda-duplex -device virtio-rng-pci -drive file=vm.raw,format=raw,if=virtio,cache=writethrough -nographic -serial mon:stdio
So we need to enlist someone who knows virtio and/or qemu to debug this further. I don't see any thing wrong w/ the virtio_random driver, but I also am not at all familar w/ virtio.
Here's how LXD starts qemu:
/snap/lxd/24483/bin/qemu-system-x86_64 -S -name test-fbsd14 -uuid 3b4bf603-cf1d-4043-b7ee-ed1ce83aa0c1 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/test-fbsd14/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/test-fbsd14/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/test-fbsd14/qemu.pid -D /var/snap/lxd/common/lxd/logs/test-fbsd14/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd
The qemu.conf is, indeed https://gist.github.com/igalic/dba54db678ad2f12b08d35504e7909de
So we need to enlist someone who knows virtio and/or qemu to debug this further. I don't see any thing wrong w/ the virtio_random driver, but I also am not at all familar w/ virtio.
i didn't see anything with it even before this 😅
I still suspect that there's something generally wrong with our virtio Subsystem
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
63 | In general, the fact that malloc(9) returns a suitably aligned buffer here is an implementation detail, not a guarantee of the interface. It can become false in certain conditions ("options REDZONE" for instance) and might become false in the future (if we add additional malloc UMA zones to try and reduce internal fragmentation, for instance). Hence malloc_aligned(9). The manual page is basically just saying that you'll get alignment up to the largest primitive C type. Drivers don't control their softc allocation, so yes, Conrad is right, vtrnd_value should in principle be dynamically allocated using malloc_aligned(). Yes this is all somewhat moot so long as HARVESTSIZE*sizeof(uint32_t) is 8, but presumably drivers shouldn't be relying on that. |
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
63 | I'd honestly just drop the sizeof assert. It's totally bogus. |
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
63 | better: KASSERT(btoc(sc->vtrnd_value) == btoc((vm_offset_t)sc->vtrnd_value + sizeof(sc->vtrnd_value) - 1)); (forgot the -1 to test the last byte, not the first byte that follows). |
This patch is really required to run anything serving web traffic on Hetzner's virtual machines. Can we please check it in?
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
184 | Presumably you could just init the sglist once here and not bother with redoing it for each vtrnd_enqueue call since it's always enqeueing the same address? |
On what side? Queues are entirely independent in the spec and in FreeBSD. If that's what's going on it must be some dodgy implementation on the host side.
sys/dev/virtio/random/virtio_random.c | ||
---|---|---|
291 | The prototype is static but this is not |