lockprof: add contested-only profiling
This allows tracking all wait times with much smaller runtime impact.
For example when doing -j 104 buildkernel on tmpfs:
no profiling: 2921.70s user 282.72s system 6598% cpu 48.562 total
all acquires: 2926.87s user 350.53s system 6656% cpu 49.237 total
contested only: 2919.64s user 290.31s system 6583% cpu 48.756 total
(cherry picked from commit a0842e69aa5f86d61c072544b540ef21b2015211)