Page MenuHomeFreeBSD

strcmp optimization for powerpc64
ClosedPublic

Authored by alexandre.yamashita_eldorado.org.br on Apr 27 2018, 8:55 PM.
Referenced Files
Unknown Object (File)
Fri, Nov 8, 5:28 AM
Unknown Object (File)
Thu, Nov 7, 7:22 AM
Unknown Object (File)
Tue, Nov 5, 7:31 AM
Unknown Object (File)
Oct 17 2024, 12:23 AM
Unknown Object (File)
Oct 16 2024, 2:51 PM
Unknown Object (File)
Oct 16 2024, 5:38 AM
Unknown Object (File)
Oct 14 2024, 11:02 PM
Unknown Object (File)
Oct 14 2024, 11:01 PM

Details

Summary

Optimize strcmp for powerpc64.
Data is loaded by double words and cmpb intruction is used to find '\0'.

Some performance gain rates between the current and the optimized solution:

String size (Bytes)<=8<= 1632641282565121024
Gain rate--0.81 %1.21 %3.32 %7.52 %14.86 %27.18 %42.08 %
Gain rate (27/02/2019)0.59%1.92 %3.02 %5.60 %10.16 %18.05 %30.18 %42.82 %

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 16322
Build 16260: arc lint + arc unit

Event Timeline

One thing I am wondering if newer compilers have intrinsic heuristics for common lib calls like this? Do you see the same performance delta with gcc8 from ports?

One thing I am wondering if newer compilers have intrinsic heuristics for common lib calls like this? Do you see the same performance delta with gcc8 from ports?

The gcc8 really improves the performance of the lib.
However, this optimization has a better performance than the current code implemented in C.
I improved the previous code and re-executed the tests using gcc8.
These are the gain rates obtained:

String size (Bytes)<= 8163264128256512102420484096
Gain rate-0.27 %1.46 %2.49 %4.20 %6.57 %12.75 %21.76 %34.71 %49.47 %62.09 %

I am implementing a version of this code with vectorization and I will send this code later.
Thanks, @kbowling .

My question was a little bit more naive, I don't know a lot about industrial compilers. I wondered if for the compiler/rtlib hosted its own builtins for things like this and how that worked. After digging through llvm I think the answer to that is no. llvm was a bit easier to digest than gcc, with what it calls builtins it will simplify library calls and can do other things like inline them with MI and MD information, but I believe it would fall back to the OS libc like this for the machine dependent implementation.

My question was a little bit more naive, I don't know a lot about industrial compilers. I wondered if for the compiler/rtlib hosted its own builtins for things like this and how that worked. After digging through llvm I think the answer to that is no. llvm was a bit easier to digest than gcc, with what it calls builtins it will simplify library calls and can do other things like inline them with MI and MD information, but I believe it would fall back to the OS libc like this for the machine dependent implementation.

I implemented a vectorized version of this code, but I could not obtain significant improvements in the performance rates.
I shared this version on https://github.com/PPC64/freebsd/blob/41388369cb7db55b995debd99c2a514409edd56a/lib/libc/powerpc64/string/strcmp.S .

Could you review this code without vectorization?

My question was a little bit more naive, I don't know a lot about industrial compilers. I wondered if for the compiler/rtlib hosted its own builtins for things like this and how that worked. After digging through llvm I think the answer to that is no. llvm was a bit easier to digest than gcc, with what it calls builtins it will simplify library calls and can do other things like inline them with MI and MD information, but I believe it would fall back to the OS libc like this for the machine dependent implementation.

I implemented a vectorized version of this code, but I could not obtain significant improvements in the performance rates.
I shared this version on https://github.com/PPC64/freebsd/blob/41388369cb7db55b995debd99c2a514409edd56a/lib/libc/powerpc64/string/strcmp.S .

Could you review this code without vectorization?

Hi. Could someone review my patch please?
Basically, this code loads and compares the chars by byte until their addresses are aligned.
After, we load and compare the chars by double word until a \0 or a difference in the chars is found.
Finally, we compute the differences between the chars.

Hi. Could someone review my patch please?
Basically, this code loads and compares the chars by byte until their addresses are aligned.
After, we load and compare the chars by double word until a \0 or a difference in the chars is found.
Finally, we compute the differences between the chars.

Hi Alexandre,

The code looks fine to me, but I do have one question.

Not to diminish the work you put into this, but has any effort been put into finding just how often strcmp() is run over strings longer than 64 characters, in the wild? If the vast majority of the comparisons are on "short" strings, the performance improvements seem less effective. But, if it does turn out that moderate to large strings are often compared, then this is certainly a big win.

Hi. Could someone review my patch please?
Basically, this code loads and compares the chars by byte until their addresses are aligned.
After, we load and compare the chars by double word until a \0 or a difference in the chars is found.
Finally, we compute the differences between the chars.

Hi Alexandre,

The code looks fine to me, but I do have one question.

Not to diminish the work you put into this, but has any effort been put into finding just how often strcmp() is run over strings longer than 64 characters, in the wild? If the vast majority of the comparisons are on "short" strings, the performance improvements seem less effective. But, if it does turn out that moderate to large strings are often compared, then this is certainly a big win.

I couldn't find a solution to estimate this proportion in the wild.
The comparisons on short strings are expected to be less effective because we have less double words to load.
I implement this code, trying to maximize the performance in both short and large strings.
In the worst cases (strings with less than 8 bytes), we lose almost nothing on performance (-0.27 %).

This revision is now accepted and ready to land.Jan 13 2019, 2:42 AM

Improved strcmp performance

  • Using ldu to avoid incrementing indexes in word loop.
  • Moved check of aligment to avoid trying to align one address

while the other cannot be aligned.

  • Removed branch tips.

With current changes strcmp has at least the same performance of
original implementation, including short strings.

This revision now requires review to proceed.Feb 27 2019, 2:14 PM

Looks good! Sorry for the delay in reviewing.

This revision is now accepted and ready to land.Apr 17 2019, 3:20 PM
This revision was automatically updated to reflect the committed changes.