Fletcher4 implementation using avx512f instruction set

Description

The algorithm runs 8 parallel sums, consuming 8 uint32_t elements per
loop iteration. The size alignment of the main fletcher4 methods is
adjusted accordingly. The new implementation is called 'avx512f'.
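
To illustrate the lane layout described above, here is a minimal portable C
sketch. It is not the code from this commit: it emulates the eight 64-bit
lanes with plain arrays instead of AVX-512 registers, and the names
(cksum4_t, fletcher4_scalar, fletcher4_lanes8) are hypothetical. The
recombination coefficients were derived for this sketch from the scalar
recurrence, on the assumption that lane i sees words i, i+8, i+16, and so on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	LANES	8	/* one 64-bit lane per uint32_t consumed per iteration */

/* Hypothetical result type: the four running Fletcher4 sums. */
typedef struct { uint64_t a, b, c, d; } cksum4_t;

/* Reference scalar Fletcher4 over n 32-bit words. */
static cksum4_t
fletcher4_scalar(const uint32_t *w, size_t n)
{
	cksum4_t s = { 0, 0, 0, 0 };

	for (size_t j = 0; j < n; j++) {
		s.a += w[j];
		s.b += s.a;
		s.c += s.b;
		s.d += s.c;
	}
	return (s);
}

/* Eight independent lanes; n must be a multiple of LANES. */
static cksum4_t
fletcher4_lanes8(const uint32_t *w, size_t n)
{
	/* Binomial corrections C(i,2) and C(i,3) for lane i. */
	static const uint64_t c2[LANES] = { 0, 0, 1, 3, 6, 10, 15, 21 };
	static const uint64_t c3[LANES] = { 0, 0, 0, 1, 4, 10, 20, 35 };
	uint64_t a[LANES] = { 0 }, b[LANES] = { 0 };
	uint64_t c[LANES] = { 0 }, d[LANES] = { 0 };
	cksum4_t s = { 0, 0, 0, 0 };

	assert(n % LANES == 0);

	/* Main loop: each iteration consumes 8 words, one per lane. */
	for (size_t k = 0; k < n; k += LANES) {
		for (size_t i = 0; i < LANES; i++) {
			a[i] += w[k + i];
			b[i] += a[i];
			c[i] += b[i];
			d[i] += c[i];
		}
	}

	/* Fold the per-lane sums back into the scalar result (mod 2^64). */
	for (uint64_t i = 0; i < LANES; i++) {
		s.a += a[i];
		s.b += 8 * b[i] - i * a[i];
		s.c += 64 * c[i] - (28 + 8 * i) * b[i] + c2[i] * a[i];
		s.d += 512 * d[i] - (448 + 64 * i) * c[i] +
		    (56 + 24 * i + 4 * i * i) * b[i] - c3[i] * a[i];
	}
	return (s);
}
```

The inner loop over i is what a vector implementation would execute as a
single instruction per sum. The assert reflects why the buffer size has to be
aligned to 8 elements (32 bytes) for this variant.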

Note: the byteswap method can be implemented more efficiently once avx512bw
hardware becomes available. Currently, it is about 2x slower than the native
method.

The table shows results of the full (native) fletcher4 calculation for different buffer sizes:

fletcher4    4KB   16KB   64KB  128KB  256KB    1MB   16MB
[scalar]    1213   1228   1231   1231   1225   1200   1160
[sse2]      2374   2442   2459   2456   2462   2250   2220
[avx2]      4288   4753   4871   4893   4900   4050   3882
[avx512f]   5975   8445   9196   9221   9262   6307   5620

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4952

Details

Provenance
Gvozden Neskovic <neskovic@gmail.com> Authored on Jul 6 2016, 11:42 AM
Brian Behlendorf <behlendorf1@llnl.gov> Committed on Aug 16 2016, 9:11 PM
Parents
rG32ffaa3de589: Add support for AVX-512 family of instruction sets

Event Timeline

Brian Behlendorf <behlendorf1@llnl.gov> committed rG70b258fc962f: Fletcher4 implementation using avx512f instruction set (authored by Gvozden Neskovic <neskovic@gmail.com>). Aug 16 2016, 9:11 PM