This change is mostly a fix for the ARMv7 architecture. It makes sure
that the correct value is observable on the dequeue side.
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Differential Revision: D1833
Add memory barriers to buf_ring (authored by jpa-semihalf.com on Feb 13 2015, 12:21 PM).
Comment: Damn, we don't have an atomic way to assign a pointer value (i.e., expressing it via atomic acq/rel semantics)?

Comment: Hmm, I think these memory barriers were in the tree but were removed recently. I think it would be nice to use the atomic load acq/rel, because this change could negatively impact x86.

Comment: An atomic_store_rel would put the dmb() before writing the producer ring head, the opposite of what this change proposes (which reinforces my vague feeling that this change isn't quite right somehow). On the consumer side, it's not clear what piece of data would be the subject of an atomic_load_acq(), because I can't really tell from the current placement of the rmb() what reordering it's trying to prevent.
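[Editor's note: for orientation while reading the barrier-placement discussion above, here is a rough, non-verbatim sketch of the producer path. Field and function names follow sys/sys/buf_ring.h of that era; error accounting, critical sections, and debug checks are elided. It shows where the release store on br_prod_tail places the dmb() on ARM relative to the ring write.]

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/buf_ring.h>
#include <machine/atomic.h>
#include <machine/cpu.h>

/*
 * Simplified sketch of the multi-producer enqueue path; not the verbatim
 * FreeBSD code.
 */
static __inline int
buf_ring_enqueue_sketch(struct buf_ring *br, void *buf)
{
	uint32_t prod_head, prod_next;

	do {
		prod_head = br->br_prod_head;
		prod_next = (prod_head + 1) & br->br_prod_mask;
		if (prod_next == br->br_cons_tail)
			return (ENOBUFS);		/* ring full */
	} while (!atomic_cmpset_acq_int(&br->br_prod_head, prod_head,
	    prod_next));

	br->br_ring[prod_head] = buf;			/* store the element */

	/* Wait for earlier in-flight enqueues to publish their slots first. */
	while (br->br_prod_tail != prod_head)
		cpu_spinwait();

	/*
	 * Release store: on ARM the implied dmb() comes before this store,
	 * ordering the br_ring[] write ahead of the tail update that makes
	 * the element visible to the consumer.
	 */
	atomic_store_rel_int(&br->br_prod_tail, prod_next);
	return (0);
}

[The release store is meant to guarantee that the element is globally visible before the tail that advertises it.]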
Comment: I'm mostly echoing Ian Lepore's comments here:
Are you actually seeing problems that are fixed by this change?

Comment: From W. Macek: It was observed on an ARM Cortex-A15 that the buf_ring_dequeue_sc function sometimes returned an old pointer left over from a previous pass through the buffer ring. With this fix, the issue disappeared. Inside buf_ring_dequeue_sc, after cons_head is read, the CPU's memory access predictors are capable of preloading values from memory for future use, such as br->br_ring[cons_head]. However, that value might not be valid. There is a small "hole" in the logic, between lines 165 and 166: if the enqueue function puts a new entry into the ring just after the core prefetches buf = br->br_ring[cons_head] and before br->br_prod_tail is read, dequeue will return the wrong (old) value. To prevent this, a full memory barrier must be added before reading br->br_ring[cons_head]. However, that change would negatively impact x86, so it is better to use atomic_load_acq_32 on the br->br_prod_tail variable: on Intel it changes nothing (apart from adding a compiler barrier), but on ARM it places a full dmb() after br->br_prod_tail is read. As Andrew suggested, the atomic operation was also added for the br->br_cons_head variable.

Comment: Hmm, IMHO this lockless buf ring code should be reconsidered much more thoroughly because of the presented issue. It is lockless, so there is no way to get a stable snapshot of the buf ring variables; a race is always present, and every value read should be considered potentially stale at any moment. So what is the real issue here? Is it the order of reading, i.e. prod_tail being pre-read before cons_head? It must be, because in any other case, IMHO, nothing can help. But if it is a reordering issue, which other variables are involved? These two reads are very close, but nobody ensures the reordering could not happen over a larger distance. At the very least, the same scenario exists in buf_ring_advance_sc(), and similarly in buf_ring_dequeue_mc() and buf_ring_peek(). IMHO, if there is a presumption that some variables are changed in a defined order and should be read in the same order, then we must re-check the code for all of them.

Comment: Getting the memory ordering correct is difficult, yes, but the semantics are not ambiguous. Consumers are (at least for the most part) using it correctly in the network stack.

Comment: This change looks good on its surface, but I haven't looked deeply enough to see whether it has any bad implications.
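[Editor's note: to make the race W. Macek describes easier to follow, here is a rough, non-verbatim sketch of the single-consumer dequeue path with the acquire load proposed in this review. Field names follow sys/sys/buf_ring.h; debug assertions are elided.]

#include <sys/param.h>
#include <sys/buf_ring.h>
#include <machine/atomic.h>

/*
 * Simplified sketch of buf_ring_dequeue_sc() with the proposed fix; not the
 * verbatim FreeBSD code.
 */
static __inline void *
buf_ring_dequeue_sc_sketch(struct buf_ring *br)
{
	uint32_t cons_head, cons_next, prod_tail;
	void *buf;

	/*
	 * The original patch also loaded br_cons_head atomically; reviewers
	 * later concluded that this is unnecessary because only the single
	 * (or lock-protected) consumer updates it.
	 */
	cons_head = br->br_cons_head;

	/*
	 * Acquire load: on ARM this places a dmb() after the read of
	 * br_prod_tail, so the read of br_ring[cons_head] below cannot be
	 * satisfied by a value prefetched before the tail was observed.
	 * On x86 it reduces to a compiler barrier.
	 */
	prod_tail = atomic_load_acq_32(&br->br_prod_tail);

	cons_next = (cons_head + 1) & br->br_cons_mask;
	if (cons_head == prod_tail)
		return (NULL);				/* ring is empty */

	br->br_cons_head = cons_next;
	buf = br->br_ring[cons_head];			/* must not be stale */
	br->br_cons_tail = cons_next;
	return (buf);
}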
Comment: Even though I'm not able to evaluate all aspects, I see some serious defects here, at least from an ARM point of view. Unfortunately, buf_ring_dequeue_sc() has no defined read ordering and the code can see stale data; buf = atomic_load_rel_32(&br->br_ring[cons_head]); looks more correct to me. On ARM, all stores to a variable referenced by atomic_cmpset() must be done by atomic_store() (or by another atomic operation). Oh, and note: the current atomic_store() implementation on ARM is broken too, but the fix is easy (see atomic_store_*_64()).

Comment: I'm commenting on just this aspect of Michal's comment, not the overall change. I'm not sure I agree that a normal store fails to clear the exclusive monitor, but even more importantly in this context, the discussion of that and any possible fixes for it shouldn't hold up this change right now. I'm specifically not commenting on (or formally reviewing) the overall behavior of the ring buffer code; I'm not qualified to do that without a lot more studying of the code and just don't have time right now.

Comment: The update to prod_tail happens _after_ the store to br_ring and is done with an atomic_store_rel_int; subsequently, in dequeue_sc, the read of prod_tail will be done with an atomic_load_acq_int before loading from br_ring. Since there is a memory barrier after the update in enqueue and one before the read in dequeue, where is the problem?
Great, but the only variable that is updated with atomic_cmpset is prod_head, to atomically acquire the right to store into that index in br_ring. Please clarify what the actual problem is; this change fixes the only real issue that I can glean from your comment.

Comment: First of all, I was wrong about the behavior of atomic operations on ARM. After yesterday's long session on IRC with Ian and Andy, we concluded that a pure (non-atomic) store does affect (clear) the state of the exclusive monitor, so our implementation of atomics is fine on ARM. I apologize for my mistake; please ignore everything about atomic_cmpset() in my previous message. Back to your first question: I meant it only as a technical note (because the misordered read occurs on br_ring[]). So I'm fine with the proposed patch.

Comment: Reading this again, cons_head is used only by dequeue and is thus protected by the lock or an atomic update, hence the atomic_load_acq_32 isn't needed for it. On the other hand, there are other places where prod_tail needs to be read with a memory barrier. I think there are a few other potential races in the code since the 'tail' values are not mutex protected. Please see ...

Comment: I've been spending more time looking at this than I care to admit. I don't understand why we need the load_acq_32 on prod_tail. The atomic_store_rel_int in enqueue should guarantee that the store to the ring happens before the update to prod_tail, so the consumer should _never_ see a prod_tail that is newer than the ring[prod_next] value. I suspect that something else is going on that we don't understand.

Comment: kmacy was so kind as to offer posting the final solution to this ticket, so I'm changing the status to abandoned.
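[Editor's note: as a closing aside on the acquire/release pairing argued over above, here is a minimal, hypothetical illustration. This is not buf_ring code; 'slot' and 'tail' are made-up names. It shows the guarantee a release store in the producer is expected to give a matching acquire load in the consumer.]

#include <sys/types.h>
#include <machine/atomic.h>

/* Hypothetical single-slot example. */
static void		*slot;
static volatile uint32_t tail;

static void
producer(void *buf)
{
	slot = buf;				/* (1) write the data        */
	atomic_store_rel_32(&tail, 1);		/* (2) publish with release  */
}

static void *
consumer(void)
{
	if (atomic_load_acq_32(&tail) == 0)	/* (3) observe with acquire  */
		return (NULL);
	return (slot);				/* (4) guaranteed to see (1) */
}

[If the acquire load in (3) observes the value published in (2), the consumer is also guaranteed to observe the write in (1); that pairing is why the later reviewers did not understand the need for the extra barrier and suspected something else was going on.]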