⚙ D1945 Buf ring cleanups

kmacy updated this revision to Diff 3916.Feb 23 2015, 12:47 AM

kmacy retitled this revision from to Buf ring cleanups.

kmacy updated this object.

kmacy edited the test plan for this revision. (Show Details)

kmacy added reviewers: imp, rpaulo, andrew, zbb.

kmacy added a subscriber: emaste.

kmacy updated this revision to Diff 3917.Feb 23 2015, 1:13 AM

kmacy updated this object.

These changes look good to me. The only issues I see are if the queue size is 1. And even then I'm not sure they are issues!

sys/sys/buf_ring.h
2	2007-2015 is project style.
2	Happiness from the anonymous nitpickers society.

This revision is now accepted and ready to land.Feb 23 2015, 1:28 AM

make sure tabbing is correct
multi-line comments need to be re-worked (

This revision now requires review to proceed.Feb 23 2015, 2:01 AM

Thinking about this some more I think I've been overzealous with the memory barriers.

There are five variables in question:

prod_head: this is only read by the enqueue routine, if the latter were to initially read a stale value for it the cmpxchg (atomic_cmpset_acq_32) would fail. However, the implied memory barrier in cmpxchg would cause the subsequent read of prod_head to read the up to date value permitting the cmpxchg to succeed the second time. Conclusion - no additional memory barriers needed.
prod_tail: this value is the one used by dequeue to determine the effective producer index. The absence of a memory barrier preceding it made it possible for a new value to be visible to dequeue before the corresponding store to br_ring[idx] was visible in enqueue. Reading a stale value in enqueue could prevent progress, but would not violate correctness. Conclusion - atomic_store_rel_32() needed when doing update in enqueue and atomic_load_acq_32() needed when reading the value in dequeue. Correctness will not be violated by reading a stale value in enqueue so the initial read can be without a memory barrier, if it doesn't match the preceding value it may be stale so we need the while loop to read with an atomic_load_acq_32().
cons_head: this value is used only by dequeue, it is either updated atomically (dequeue_mc) or protected by a mutex (dequeue_sc). Conclusion - no memory barriers needed
cons_tail: This is used to communicate the latest consumer index between dequeue and enqueue. Reading a stale value in enqueue can cause an enqueue fail when it should not have. This was fixed in a recent commit by requiring a memory barrier before reading it again and returning ENOBUFS. There is no risk of reading a _new_ value of cons_tail before br_ring[idx] has been read because the value is not updated until following the read. Conclusion - forcing ordered reads necessary. Ordered rights don't matter here.
br_ring[idx] : The consistency of this value is addressed with prod_tail. This value needs to reach memory before prod_tail does. I believe the latest code handles this correctly but will need to look at it again.

rpaulo added inline comments.Feb 23 2015, 8:31 PM

sys/sys/buf_ring.h
109	_acq_32
112	This is the part that I don't understand: cons_tail and prod_head don't share a cache line, so how can the read memory barrier help?

kmacy added inline comments.Feb 23 2015, 11:10 PM

sys/sys/buf_ring.h
112	So this bit of confusion leads me to believe that the implied specificity of FreeBSD's acquire / release API is a bit of sophistry that actually falls short of the more straightforward rmb()/wmb()/mb() in Linux. So I know of no architecture where memory barriers apply only to specific cache lines. Can you name one?

rpaulo added inline comments.Feb 23 2015, 11:48 PM

sys/sys/buf_ring.h
112	Actually, you're right and this seems to work as intended.

Add further discussion of memory ordering expectations
Remove gratuitous memory barriers

Fix comment in dequeue_sc

Now I'm wondering if we need to guarantee that the load from ring needs to complete before subsequent changes that could expose it to destructive updates

I think I can guarantee that the ring load completes before it is exposed to destructive updates by just enforcing release semantics on cons_tail. This is obviously _much_ cheaper on x86 being an ordinary store.

Only do atomic_load_acq to guarantee that stale values can't cause unwarranted failures. The use of atomic_store_rel_32 on the tail indices should assure that reading a stale value can't violate correctness.

kmacy added subscribers: • ian, adrian.Feb 24 2015, 3:07 AM

From Wojciech Macek:
One thing, here is what's i think is going on:

// 1. Core reads cons_head first
cons_head = br->br_cons_head;
// 2. It knows it might need a br->br_ring[cons_head]. The index is already known, so early-load data into read buffers.
// 3. Some kind of stall, pagetable walk, mesi wait or whatever
// 4. Other core enqueues something to the ring and updates prod_tail, data in br->br_ring[cons_head] changes.
// 5. The core now is running, just read prod_tail
prod_tail = br->br_prod_tail;

cons_next = (cons_head + 1) & br->br_cons_mask;

<some stuff here>

// 6. Yeah, there is something in the ring!
if (cons_head == prod_tail) 
    return (NULL);

<other even more important stuff>

br->br_cons_head = cons_next;
// 7. Oh, I've already early-loded this value... it would be great to use it now, wouldn't it?
buf = br->br_ring[cons_head];


<....>

// 8. Return the old value
return (buf);

I understand now. The problem is the relaxed consistency of ARM whereby loads can be ordered ahead of loads. When we enter the function br_ring[cons_head] is not valid. By the time we check prod_tail the value in memory of br_ring[cons_head] _is_ valid, but the value in our ROB is invalid. This could be mitigated, but not avoided entirely by making it a critical section. So you really do need to enforce that prod_tail is read before br_ring[cons_head] since program ordering of reads is not guaranteed. Unfortunately acq_load is expensive on x86 and loads can't be ordered loads. Thus I'd rather add an architecture specific define for relaxed consistency architectures so that x86 doesn't suffer a gratuitous atomic (while those that do need it will have it).

Why not have a new API that more accurately captures the nuance here?

But I think #4 in Wojciech's comment actually speaks volumes. "data in br->br_ring[cons_head] changes." cons_head shouldn't be changing behind the scenes. If it can, then you've just dropped data. Why is it being updated in the #6 step? I mean, I understand why it needs to do that for the algorithm, but that means that this isn't lockless.

Also, in #7, why is it using the *new* value of cons_head rather than the old value? Or is that just bad English and that happens on the other core.

This code is also fragile when cache line size is larger than the compiled constant CACHE_LINE_SIZE, but that isn't changed by these patches...

Is this more descriptive example ?

`Core(0) - buf_ring_enqueue()                                       Core(1) - buf_ring_dequeue_sc()
-----------------------------------------                                       ----------------------------------------------

                                                                                cons_head = br->br_cons_head;
atomic_cmpset_acq_32(&br->br_prod_head, ...));
                                                                                buf = br->br_ring[cons_head];     <see <1>>
br->br_ring[prod_head] = buf;
atomic_store_rel_32(&br->br_prod_tail, ...);
                                                                                prod_tail = br->br_prod_tail;
                                                                                if (cons_head == prod_tail) 
                                                                                        return (NULL);
                                                                                <condition is false and code uses invalid(old) buf>`

<1>
Load (on core 1) from br->br_ring[cons_head] can be reordered (speculative readed) by CPU.

And, imho, the compiler can do the same, because non-volatile load can be moved
across volatile load. But I'm not 100% sure here.

In D1945#30, @meloun-miracle-cz wrote:

Is this more descriptive example ?

`Core(0) - buf_ring_enqueue()                                       Core(1) - buf_ring_dequeue_sc()
-----------------------------------------                                       ----------------------------------------------

                                                                                cons_head = br->br_cons_head;
atomic_cmpset_acq_32(&br->br_prod_head, ...));
                                                                                buf = br->br_ring[cons_head];     <see <1>>
br->br_ring[prod_head] = buf;
atomic_store_rel_32(&br->br_prod_tail, ...);
                                                                                prod_tail = br->br_prod_tail;
                                                                                if (cons_head == prod_tail) 
                                                                                        return (NULL);
                                                                                <condition is false and code uses invalid(old) buf>`

<1>
Load (on core 1) from br->br_ring[cons_head] can be reordered (speculative readed) by CPU.

Load re-ordering is the fundamental issue here.

And, imho, the compiler can do the same, because non-volatile load can be moved
across volatile load. But I'm not 100% sure here.

My, possibly naive, understanding of volatile is that loads from memory only have to happen once - not necessarily that they can be re-ordered. However, re-ordering them is only a small step. On the practical side, scalarizing an arbitrary sized global array is not something that LLVM is currently capable of. Load re-ordering of scalars on the other hand is probably common when, for example, lifting loads out of a loop.

In D1945#29, @imp wrote:

Why not have a new API that more accurately captures the nuance here?

According to http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf the only re-ordering possible with normal memory operations on x86 is that stores can be re-ordered after loads.

But I think #4 in Wojciech's comment actually speaks volumes. "data in br->br_ring[cons_head] changes." cons_head shouldn't be changing behind the scenes. If it can, then you've just dropped data. Why is it being updated in the #6 step? I mean, I understand why it needs to do that for the algorithm, but that means that this isn't lockless.

The validity of the data in the ring is predicated on the value of prod_tail. If loads were executed in program order then this could never happen. Either prod_tail would indicate that the ring is empty and the function would return right away, or prod_tail would indicate that the ring was not empty - guaranteeing that value in the br_ring[cons_head] was valid. The problem only occurs because load re-ordering permits us to read br_ring[cons_head] before prod_tail says it is safe to do so.

Also, in #7, why is it using the *new* value of cons_head rather than the old value? Or is that just bad English and that happens on the other core.

The value of cons_head is irrelevant. It is controlled by dequeue and updates are serialized by a mutex.

This code is also fragile when cache line size is larger than the compiled constant CACHE_LINE_SIZE, but that isn't changed by these patches...

Why? CACHE_LINE_SIZE alignment is only to avoid gratuitous coherency traffic. I actually have an optimization related to that that I'd like to discuss but I'd like to clear this up first.

Improve prod_tail and cons_tail ordering comments
Enforce read barrier after prod_tail read in dequeue

Store release doesn't intrinsically serialize loads with stores. Update ring on dequeue to guarantee that the update of cons_tail happens last on all architectures.

kmacy added a subscriber: rstone.Feb 24 2015, 8:47 PM

Remove gratuitous re-ordering of assignments and fix PREFETCH by initializing the now conditional prod_tail

Make ring entries volatile void * to protect against any future compiler optimizations
Comment on alignment
make ring entries structs to simplify (possible) aligning of entries
comment on prod_tail usage in PREFETCH_DEFINED
remove incorrect comment above putback
missing a dequeue is not as disruptive as missing an enqueue, remove the double check

fix bug on RMO arches in dequeue_mc

the cmpset_acq in dequeue_mc should provide the memory barrier we need

I see there was huge amount of good work put into fixing the race, fruitful discussion, but also no updates since last month. Frankly speaking, I am waiting to see this fix in mainline, instead of using patch from with Wojciech Macek. Are there any more issues with this or can it be merged?

Looks like a good fix to have in the tree. Any other holdups besides the usual suspect (time)?

cheers,
Hiren

rhenjau_gmail.com added a subscriber: rhenjau_gmail.com.May 8 2015, 9:00 AM

jpa-semihalf.com commandeered this revision.May 11 2015, 5:01 PM

jpa-semihalf.com added a reviewer: kmacy.

alc added a subscriber: alc.Jul 30 2015, 6:22 PM

alc added inline comments.

sys/sys/buf_ring.h
60–94	As of this week, you no longer need this. The excessive, i.e., unnecessary, ordering by load_acq on x86 no longer exists.
131	Is there a release operation on &br->br_prod_head anywhere in this code? I didn't see one. So, I'm not sure what this is intended to order.
177–178	That is guaranteed by the release semantics on the store in line 252. Preceding (by program order) loads as well as stores are ordered by a release store.
223–224	Same comment as before.
260–268	Same comment as before.

sbahra_repnop.org added a subscriber: sbahra_repnop.org.Sep 19 2015, 11:41 PM

sbahra_repnop.org added inline comments.

sys/sys/buf_ring.h
166	If br_prod_tail is re-ordered prior to load of br_cons_head, bad things will happen. Let's assume (a, b) = (cons_head, prod_tail). Initial state: (0, 0), indicating empty. T0: Producer transitions us from (0,0) to (0, 1). T1: A consumer comes in and transitions us to (1, 1). This is an empty state. T2: Begins this operation, reads prod_tail when it is 0 and reads cons_head after T1 has completed. It has observed (1, 0) despite T1 having rendered the ring buffer empty and will go ahead and successfully apply a CAS speculation on cons_head being 1. This scenario was also confirmed with fault injection on Power (RMO).

ck_ring has slightly different semantics (it is completely lock-free on consumer, but assumes UINT_MAX operations don't occur during a delay in dequeue) but I took a stab at the buf_ring MP semantics today. I have tested this on RMO architectures and verified the fences with fault injection.

Will likely be cleaning up excessive macro use on M*_DEFINE, but it's up here in the mean time: https://github.com/concurrencykit/ck/blob/master/include/ck_ring.h - it seems like the old variation in ordering guarantee are MPMC fencing of shared producer / consumer counters. I hope it's useful.

sys/sys/buf_ring.h
115	Similar race to my previous comment. We must order this read or risk invalidly storing a value.

zbb resigned from this revision.Jan 25 2016, 3:41 PM

zbb removed a reviewer: zbb.

rpaulo resigned from this revision.Jun 4 2016, 5:12 AM

rpaulo removed a reviewer: rpaulo.

rpaulo added inline comments.

sys/sys/buf_ring.h
191	This should be under #else, right?

alc mentioned this in D6986: ARM, ARM64: Workaround for buf_ring reordering.Jun 27 2016, 5:25 PM

alc mentioned this in D8637: buf_ring.h: fix memory order issues..Dec 5 2016, 5:50 PM

Buf ring cleanups
Needs ReviewPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 3916

sys/sys/buf_ring.h

Buf ring cleanupsNeeds ReviewPublicActions