We only need to include sys/_atomic_subword.h on arm64 to provide
atomic_testandset_acq_long. Add an implementation in the arm64 atomic.h
based on the existing atomic_testandset macro.
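For reference, a minimal LL/SC-style sketch of what such an implementation could look like. The function name and layout are illustrative only; the real arm64 atomic.h builds the operation from a macro and, as discussed below, also has an LSE path.

#include <sys/types.h>

/*
 * Sketch only: a simplified LL/SC version of atomic_testandset_acq_long,
 * modeled on the existing atomic_testandset pattern.  Not the actual
 * macro-generated code.
 */
static __inline int
atomic_testandset_acq_long_sketch(volatile u_long *p, u_int bit)
{
	u_long mask, old, tmp;
	int res;

	mask = 1UL << (bit & 0x3f);
	__asm __volatile(
	    "1:	ldaxr	%0, [%3]\n"		/* load-acquire exclusive */
	    "	orr	%1, %0, %4\n"		/* set the requested bit */
	    "	stxr	%w2, %1, [%3]\n"	/* attempt the store-exclusive */
	    "	cbnz	%w2, 1b\n"		/* retry if the reservation was lost */
	    : "=&r" (old), "=&r" (tmp), "=&r" (res)
	    : "r" (p), "r" (mask)
	    : "memory");
	return ((old & mask) != 0);
}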
Details
- Reviewers: markj, kib, manu, alc
- Group Reviewers: arm64
- Commits:
  - rG713b7f1a3b6d: Add atomic_testandset_acq_* on arm64
  - rG02c16e2174ba: Add atomic_testandset_acq_* on arm64

Diff Detail
- Repository: rG FreeBSD src repository
- Lint: Not Applicable
- Unit Tests: Not Applicable

Event Timeline
It is strange; does anybody need atomic_testand{set,clear}_rel_<type>? It seems that the LSE instructions LDCLRA and LDEORA have both acquire and release semantics; am I right that we end up using them?
The only user of testand{set,clear} with a barrier in the kernel is sys/dev/random/fenestrasX/fx_pool.c.
Based on the description of the LDCLRA instruction, LD*A and LD*AL will have acquire semantics, and LD*AL and LD*L will have release semantics, so only LD*AL will have both. With this change the kernel can use LD{SET,CLR}A, so it will have acquire semantics.
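For illustration, a hedged sketch of the acquire-only LSE form: ldseta (an LD*A form) has acquire semantics, while ldsetal (LD*AL) would have both acquire and release. The function name is made up, and the .arch_extension directive is only there so a baseline armv8-a assembler accepts the LSE instruction; how the real header selects the LSE path is not shown here.

#include <sys/types.h>

/* Sketch only: acquire-only LSE form of testandset. */
static __inline int
atomic_testandset_acq_long_lse_sketch(volatile u_long *p, u_int bit)
{
	u_long mask, old;

	mask = 1UL << (bit & 0x3f);
	__asm __volatile(
	    ".arch_extension lse\n"
	    "ldseta	%2, %0, [%1]\n"	/* old = *p; *p |= mask; acquire on the load */
	    : "=&r" (old)
	    : "r" (p), "r" (mask)
	    : "memory");
	return ((old & mask) != 0);
}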
Just a general comment about atomic_testandset_acq_* and similar RMW operations - I'm fairly sure that operations like this do not behave as described in atomic(9):
For example, to add two long integers ensuring that all prior loads and stores are completed before the addition is performed, use atomic_add_rel_long().
In particular, for both the LL/SC and LSE variants of the arm64 implementation, the acquire semantics apply only to the load portion of the operation. In other words, there is nothing preventing a subsequent load from being reordered with the store. But that example seems to suggest that the store will be visible before subsequent loads.
Oops, I meant to quote this section instead:
For example, to subtract two integers ensuring that the subtraction is completed before any subsequent loads and stores are performed, use atomic_subtract_acq_int().
The C17 standard defines _acq as
For memory_order_acquire, memory_order_acq_rel, and memory_order_seq_cst, a load operation performs an acquire operation on the affected memory location.
For me this sounds exactly as implemented, i.e., for acquire, the load portion of the RMW is an acquire. Symmetrically, for release, the store portion of the RMW is a release.
There is no reordering terminology there. Probably our man page requires some tweaking.
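To make that distinction concrete, a small C17 sketch using the standard <stdatomic.h> API rather than the kernel one (the names are made up): the acquire applies to the load half of the fetch_or, so nothing orders the later load against its store half.

#include <stdatomic.h>

static _Atomic unsigned long bits;
static _Atomic int flag;

int
acquire_is_load_only(void)
{
	/* The acquire applies to the load half of the RMW... */
	atomic_fetch_or_explicit(&bits, 1UL, memory_order_acquire);

	/*
	 * ...so this later load may still be satisfied before the store
	 * half of the fetch_or above becomes visible to other CPUs.
	 */
	return (atomic_load_explicit(&flag, memory_order_relaxed));
}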
There was a discussion on the arch list about this a few years ago: https://lists.freebsd.org/pipermail/freebsd-arch/2014-October/016160.html.
With LL/SC atomics we may move a later load before the store-exclusive, but that will be the successful store, as any failed store-exclusive would rerun the load-exclusive, which has acquire semantics.
The one place we need to be careful is if we have both acquire and release semantics (accesses from before and after may be reordered to be within the read-modify-write sequence, and so be reordered with each other); however, as we don't have these in atomic(9), it shouldn't be an issue.
Was there a discussion about the specific issue Mark noted? The points from the mail you reference have all been gone for a long time. atomic_load_acq and atomic_store_rel on x86 fully utilize the TSO model provided by the hardware. We also have the atomic_thread_fence_XXX() primitives, which provide the desired memop-less fences without breaking the C11 memory model.
With LL/SC atomics we may move a later load before the store-exclusive, but that will be the successful store, as any failed store-exclusive would rerun the load-exclusive, which has acquire semantics.
The one place we need to be careful is if we have both acquire and release semantics (accesses from before and after may be reordered to be within the read-modify-write sequence, and so be reordered with each other); however, as we don't have these in atomic(9), it shouldn't be an issue.
This is the common (allowed) behaviour of an acq_rel operation; basically, this is one of the aspects in which it differs from a true sequentially consistent op (but there are more).
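A sketch of that acq_rel window, again with made-up names and standard C atomics: the release half only orders the earlier store before the RMW's store, and the acquire half only orders the later load after the RMW's load, so the two plain accesses can still be reordered with each other; a seq_cst RMW would forbid this.

#include <stdatomic.h>

static _Atomic int before, after, rmw;

int
acq_rel_window(void)
{
	/* Release half: only orders this store before the RMW's store. */
	atomic_store_explicit(&before, 1, memory_order_relaxed);

	atomic_fetch_add_explicit(&rmw, 1, memory_order_acq_rel);

	/*
	 * Acquire half: only orders this load after the RMW's load, so it
	 * may still be reordered with the store to "before" above.
	 */
	return (atomic_load_explicit(&after, memory_order_relaxed));
}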
Yes, I think atomic(9) needs to be updated to reflect that acquire semantics for atomic_<op>_acq_<type>() apply specifically to the load portion of the operation. At least one consumer, smr_enter(), gets this wrong. There, the store must complete before any subsequent loads. On amd64 this is true since a load will not be reordered before a locked instruction (i.e., atomic_add_acq_int() behaves like __storeload_barrier()), but on arm64 there is no guarantee.
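If a consumer really does need the store half to be visible before subsequent loads, one option (sketched below with made-up names, not the actual smr(9) code) is to follow the operation with an explicit sequentially consistent fence; on amd64 the locked RMW already implies this, so the fence mainly matters on arm64 and other weakly ordered architectures.

#include <sys/types.h>
#include <machine/atomic.h>

static volatile u_int seq;
static volatile u_int goal;

u_int
publish_then_read(void)
{
	/* The store half of the RMW is not ordered against later loads... */
	atomic_add_int(&seq, 1);
	/* ...so force store->load ordering with an explicit fence. */
	atomic_thread_fence_seq_cst();
	return (atomic_load_int(&goal));
}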