Rework of fletcher_4 module
- Benchmark memory block is increased to 128kiB to reflect real block sizes more
accurately. Measurements include all three stages needed for checksum generation,
i.e. init()/compute()/fini(). The inner loop is repeated multiple times to offset
overhead of time function.
- Fastest implementation selects native and byteswap methods independently in
benchmark. To support this new function pointers init_byteswap()/fini_byteswap()
are introduced.
- Implementation mutex lock is replaced by atomic variable.
- To save time, benchmark is not executed in userspace. Instead, highest supported
implementation is used for fastest. Default userspace selector is still 'cycle'.
- fletcher_4_native/byteswap() methods use incremental methods to finish
calculation if data size is not multiple of vector stride (currently 64B).
- Added fletcher_4_native_varsize() special purpose method for use when buffer size
is not known in advance. The method does not enforce 4B alignment on buffer size, and
will ignore last (size % 4) bytes of the data buffer.
- Benchmark kstat is changed to match the one of vdev_raidz. It now shows
throughput for all supported implementations (in B/s), native and byteswap,
as well as the code [fastest] is running.
Example of fletcher_4_bench running on Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz:
implementation native byteswap
scalar 4768120823 3426105750
sse2 7947841777 4318964249
ssse3 7951922722 6112191941
avx2 13269714358 11043200912
fastest avx2 avx2
Example of fletcher_4_bench running on Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz:
implementation native byteswap
scalar 1291115967 1031555336
sse2 2539571138 1280970926
ssse3 2537778746 1080016762
avx2 4950749767 1078493449
avx512f 9581379998 4010029046
fastest avx512f avx512f
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4952