dbuf_hold_impl() cleanup to improve cached read performance
Currently, every dbuf_hold_impl() call incurs a kmem_alloc() and
kmem_free() pair, which can be costly for cached read performance.
This change reverts the dbuf_hold_impl() "fix stack" commit,
fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61, to eliminate the extra
kmem_alloc() and kmem_free() operations and improve cached read
performance. With the change, each dbuf_hold_impl() frame uses 40 bytes
more, a total of 800 bytes for 20 recursive levels. Linux kernel stacks
are 8K and 16K for 32-bit and 64-bit kernels, respectively, so the
stack overrun risk is limited.
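For context, a minimal userspace sketch of the two patterns, assuming
malloc()/free() as stand-ins for kmem_alloc()/kmem_free(); the names,
structure layout, and level arithmetic are illustrative only, not the
actual ZFS sources:

    #include <stdlib.h>

    /* Placeholder for the ~40 bytes of hold parameters. */
    struct dh_arg {
    	void *dn;			/* dnode being traversed */
    	int level;			/* indirection level */
    	unsigned long long blkid;	/* block id at this level */
    };

    /* Old pattern: one heap allocation per dbuf_hold_impl() call. */
    static int
    hold_heap_args(int level, unsigned long long blkid)
    {
    	struct dh_arg *dh = malloc(sizeof (*dh));

    	if (dh == NULL)
    		return (-1);
    	dh->dn = NULL;
    	dh->level = level;
    	dh->blkid = blkid;
    	/* ... iterative hold logic driven through *dh ... */
    	free(dh);
    	return (0);
    }

    /*
     * New pattern: the parameters live in each stack frame and the
     * function recurses toward the root indirect block. Depth is
     * bounded by the on-disk indirection (~20 levels worst case), so
     * ~40 extra bytes per frame cost roughly 800 bytes of stack.
     */
    static int
    hold_stack_args(int level, int nlevels, unsigned long long blkid)
    {
    	if (level < nlevels - 1 &&
    	    hold_stack_args(level + 1, nlevels, blkid >> 10) != 0)
    		return (-1);
    	/* ... hold logic using the locals directly ... */
    	return (0);
    }

    int
    main(void)
    {
    	(void) hold_heap_args(0, 12345ULL);
    	(void) hold_stack_args(0, 6, 12345ULL);
    	return (0);
    }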
Sample stack output comparison with a 50 PB file and recordsize=512:
Current code:
- 2240 64 arc_alloc_buf+0x4a/0xd0 [zfs]
- 2176 264 dbuf_read_impl.constprop.16+0x2e3/0x7f0 [zfs]
- 1912 120 dbuf_read+0xe5/0x520 [zfs]
- 1792 56 dbuf_hold_impl_arg+0x572/0x630 [zfs]
- 1736 64 dbuf_hold_impl_arg+0x508/0x630 [zfs]
- 1672 64 dbuf_hold_impl_arg+0x508/0x630 [zfs]
- 1608 40 dbuf_hold_impl+0x23/0x40 [zfs]
- 1568 40 dbuf_hold_level+0x32/0x60 [zfs]
- 1528 16 dbuf_hold+0x16/0x20 [zfs]
dbuf_hold_impl() cleanup:
- 2320 64 arc_alloc_buf+0x4a/0xd0 [zfs]
- 2256 264 dbuf_read_impl.constprop.17+0x2e3/0x7f0 [zfs]
- 1992 120 dbuf_read+0xe5/0x520 [zfs]
- 1872 96 dbuf_hold_impl+0x50f/0x5e0 [zfs]
- 1776 104 dbuf_hold_impl+0x4df/0x5e0 [zfs]
- 1672 104 dbuf_hold_impl+0x4df/0x5e0 [zfs]
- 1568 40 dbuf_hold_level+0x32/0x60 [zfs]
- 1528 16 dbuf_hold+0x16/0x20 [zfs]
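The per-level cost can be read off the two traces: each recursive
dbuf_hold_impl() frame in the new code is 104 bytes versus 64 bytes
per dbuf_hold_impl_arg() frame in the old code, i.e. 104 - 64 = 40
bytes more per level, consistent with the 800-byte estimate above for
20 levels (20 * 40 = 800).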
Performance observations on an 8K recordsize filesystem:
- 8/128/1024K at 1-128 sequential cached read, ~3% improvement
Testing was done on Ubuntu 18.04 with the 4.15 kernel, 8 vCPUs, and SSD
storage on VMware ESX.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes #9351