The latency to acquire and release a process-shared (pshared) mutex on
FreeBSD is roughly 3.4x that of a normal/default mutex, whereas on Linux
there is no measurable difference.
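For reference, a pshared mutex is one created with PTHREAD_PROCESS_SHARED
in memory mapped into multiple processes.  A minimal sketch of setting one
up (standard POSIX calls; the helper name is made up for illustration):

    #include <pthread.h>
    #include <sys/mman.h>

    static pthread_mutex_t *
    make_pshared_mutex(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t *m;

        /* Shared anonymous mapping so a forked child sees the same lock. */
        m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_ANON, -1, 0);
        if (m == MAP_FAILED)
            return (NULL);
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return (m);
    }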
Worse, in the face of highly concurrent pshared lock lookups for
completely independent locks, the latency grows with the number of
threads.  For example, with 2 threads affined to different NUMA
nodes the latency grows to 16x that of a normal/default mutex,
with 4 threads to 35x, and with 8 threads to 66x.
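A hedged sketch of the kind of microbenchmark behind such numbers (not
the actual test program): each thread hammers its own, completely
independent pshared mutex (created with make_pshared_mutex() from the
sketch above), so any slowdown relative to one thread comes from the
lookup path itself.  Thread-to-NUMA-node affinity would be set
externally, e.g. with cpuset(1).

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ITERS 10000000

    static void *
    worker(void *arg)
    {
        pthread_mutex_t *m = arg;    /* this thread's private pshared mutex */

        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(m);
            pthread_mutex_unlock(m);
        }
        return (NULL);
    }

    int
    main(int argc, char **argv)
    {
        int i, nthr;
        pthread_t thr[64];
        struct timespec a, b;

        nthr = argc > 1 ? atoi(argv[1]) : 1;
        if (nthr < 1 || nthr > 64)
            nthr = 1;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (i = 0; i < nthr; i++)    /* one independent lock per thread */
            pthread_create(&thr[i], NULL, worker, make_pshared_mutex());
        for (i = 0; i < nthr; i++)
            pthread_join(thr[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("%.1f ns per lock+unlock\n",
            ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / ITERS);
        return (0);
    }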
The high latency is likely due to the fact that a lookup must touch/read
at least 3 pages and many cachelines, and must modify several different
cachelines.  The poor performance under high concurrency is likely due
to cacheline thrashing on the single r/w lock protecting the
pshared_hash[] hash table, which every lookup acquires for reading.
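To illustrate the bottleneck, here is a simplified sketch of the lookup
path, loosely modeled on libthr's thr_pshared.c (names and sizes are
illustrative, not the exact code).  Note that even a read acquisition of
the r/w lock writes the lock word, so concurrent lookups for unrelated
keys all bounce the one cacheline holding pshared_lock:

    #include <sys/queue.h>
    #include <pthread.h>
    #include <stdint.h>

    #define PSHARED_SIZE 128    /* illustrative bucket count */

    struct psh {
        LIST_ENTRY(psh) link;
        void *key;    /* address of the user-visible lock */
        void *val;    /* off-page shared lock object */
    };

    /* The single global r/w lock guarding every bucket. */
    static pthread_rwlock_t pshared_lock = PTHREAD_RWLOCK_INITIALIZER;
    static LIST_HEAD(psh_head, psh) pshared_hash[PSHARED_SIZE];

    static void *
    pshared_lookup(void *key)
    {
        struct psh *h;
        void *val = NULL;

        pthread_rwlock_rdlock(&pshared_lock);    /* shared-cacheline write */
        LIST_FOREACH(h, &pshared_hash[(uintptr_t)key % PSHARED_SIZE], link) {
            if (h->key == key) {
                val = h->val;
                break;
            }
        }
        pthread_rwlock_unlock(&pshared_lock);    /* another write */
        return (val);
    }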
The concurrency issue could be completely mitigated by using a
per-bucket r/w lock on the pshared_hash[] hash table, but that
requires numerous code changes, doesn't improve the single-threaded
latency, and requires much more memory for a properly aligned hash table.
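For comparison, that rejected per-bucket alternative would look roughly
like the following; each bucket's lock must sit on its own cacheline to
avoid false sharing, which is where the extra memory goes (sizes
illustrative):

    struct psh_bucket {
        pthread_rwlock_t lock;             /* one r/w lock per bucket */
        LIST_HEAD(, psh) head;
    } __attribute__((__aligned__(64)));    /* pad each bucket to a cacheline */

    /* 128 buckets * 64 bytes = 8 KiB, vs. one lock plus bare list heads. */
    static struct psh_bucket pshared_hash_pb[PSHARED_SIZE];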
Instead, by implementing a small, per-thread pshared lock lookup cache
we can reduce the lookup latency by roughly 57% in the single-threaded
case, 91% for 2 threads, 97% for 4 threads, 98% for 8 threads, and so on.
I.e., the latency of a cache hit is roughly constant irrespective of
the number of threads performing pshared lock lookups.
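A hedged sketch of the idea (illustrative names and size, not the
committed code): a tiny direct-mapped, thread-local array of (key, val)
pairs consulted before the hash table.  A hit touches only
thread-private memory, hence the near-constant latency; a real
implementation also has to invalidate entries when a pshared lock is
destroyed, which is omitted here:

    #define PSH_CACHE_SIZE 8    /* assumption: small power of two */

    struct psh_cache_ent {
        void *key;
        void *val;
    };

    static _Thread_local struct psh_cache_ent psh_cache[PSH_CACHE_SIZE];

    static void *
    pshared_lookup_cached(void *key)
    {
        struct psh_cache_ent *c;
        void *val;

        c = &psh_cache[((uintptr_t)key >> 4) & (PSH_CACHE_SIZE - 1)];
        if (c->key == key)
            return (c->val);    /* hit: no locks, no shared cachelines */
        val = pshared_lookup(key);    /* slow path from the sketch above */
        if (val != NULL) {
            c->key = key;    /* fill so the next lookup hits */
            c->val = val;
        }
        return (val);
    }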
Similarly, in a test where a single thread acquires two completely
independent nested locks (an r/w read lock followed by a mutex), we see
a throughput improvement of roughly 2.3x (3.4x for 2 threads, 3.7x for
4 threads, 4.3x for 8 threads, and so on).
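The nested-lock test's inner loop, again as a hedged sketch: every
iteration performs two independent pshared lookups, one for the r/w
lock and one for the mutex, so a cache hit on both is what drives the
throughput gain.  Both locks are assumed to have been created
PTHREAD_PROCESS_SHARED, as in the first sketch:

    struct nested_locks {
        pthread_rwlock_t *rw;    /* pshared r/w lock */
        pthread_mutex_t *mtx;    /* independent pshared mutex */
    };

    static void *
    nested_worker(void *arg)
    {
        struct nested_locks *l = arg;

        for (int i = 0; i < ITERS; i++) {
            pthread_rwlock_rdlock(l->rw);    /* first pshared lookup */
            pthread_mutex_lock(l->mtx);      /* second, independent lookup */
            pthread_mutex_unlock(l->mtx);
            pthread_rwlock_unlock(l->rw);
        }
        return (NULL);
    }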