Page MenuHomeFreeBSD

SMR for TCP hostcache.
ClosedPublic

Authored by glebius on Apr 12 2021, 4:55 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Jan 23, 1:15 AM
Unknown Object (File)
Wed, Jan 22, 9:27 AM
Unknown Object (File)
Tue, Jan 21, 2:06 PM
Unknown Object (File)
Sun, Jan 19, 8:12 PM
Unknown Object (File)
Wed, Jan 15, 9:07 AM
Unknown Object (File)
Sun, Jan 12, 10:21 PM
Unknown Object (File)
Sun, Jan 12, 10:00 PM
Unknown Object (File)
Sun, Jan 12, 10:00 PM
Subscribers

Details

Summary

When we receive a SYN flood from a limited set of hosts, the TCP hostcache
is the bottleneck. This is hash row lock contention just to read hostcache
data.

The core of the change is that only update requests do mtx_lock. The read
requests use SMR section. This required some changes to writers as well,
e.g. directly reusing the least used entry in a row is no longer possible,
also updates need to be atomic.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 38536
Build 35425: arc lint + arc unit

Event Timeline

sys/netinet/tcp_hostcache.c
288

Really it is somewhat strange to allocate a new SMR zone for each VNET. I understand that it's the pattern we use today, but we should permit VNET zones to share slabs and SMR trackers. This is just an observation.

353

TAILQs are not SMR-safe. You need to use CK_LIST or CK_STAILQ for now. There is a diff to add SMR_LIST_* macros with extra parameters to allow assertions to be embedded, and to distinguish between readers and writers, but they did not land yet.

591

This illustrates why we need special queue macros here. Since reads are now unlocked, something needs to ensure that all stores to the nascent hc_entry are visible to other CPUs before it is added to the queue.

sys/netinet/tcp_hostcache.c
288

I totally agree. Can do this single zone change prior to SMR change.

591

The nature of the hostcache is that a lookup failure isn't fatal. So, I think, the change is safe with TAILQ. As long as SMR guarantees that we won't read entry contents from the previous allocation, we are good.

sys/netinet/tcp_hostcache.c
591

It is not just lookup failures that need to be considered. There are many fields that get initialized before the structure is inserted into the hash bucket. Without synchronization it is possible to match on a half-constructed hostcache entry.

This is not memory safe anyway. There is nothing guaranteeing that the update of the "next" pointer of a new entry will be visible before the update to the queue head, so a concurrent lookup might follow a garbage pointer. It may work fine in practice on amd64 but the standard queue.h macros do not give the right semantics on platforms with weak memory ordering guarantees.

glebius added inline comments.
sys/netinet/tcp_hostcache.c
591

Got it, understood. Thanks, Mark. Will update the revision.

Converted hash buckets to CK_SLIST.

sys/netinet/tcp_hostcache.c
288

It appears that all TCP UMA zones are VNET local. I don't like that, but I would prefer to leave it as is and let VIMAGE fans and experts (I'm an neither) to decide whether they should stay VNET local or become global.

Seems ok to me, most of my comments are minor.

sys/netinet/tcp_hostcache.c
326

I would split this into two functions. There would some duplication, but the checks in the main loop can be lifted into a function to reduce that, and it's nicer to have separate code paths for readers and writers when they use different synchronization primitives.

376

Is there any point in having a branch here? atomic_store_int() is just a plain store.

481

"second to last" or "next to last" is a more standard way of referring to hc_prev, IMO.

528

This comment is a bit misleading to me: plain (aligned) stores are already assumed to be atomic. atomic_store here just ensures that the stores will not be reordered with each other atomic(9) operations, which doesn't seem to be very important. However, it does make it clear that data races are possible and expected, so we should keep them. But then tcp_hc_get() and similar functions should also use atomic_load.

751

How is it guaranteed that hch_length > 0?

755

This can become head->hch_length--.

glebius added inline comments.
sys/netinet/tcp_hostcache.c
326

My original version was like this, but then I collapsed it back in one as to me this looks nicer. Very subjective of course. If more reviewers prefer to split it, I'll do it.

376

My goal was to avoid unnecessary cache line trashing on every lookup. Imagine two CPUs looking up the same entry million times a second (a SYN flood scenario), if each of them updates the entry on each lookup, that would create cache misses. With the check, that would happen only once per prune interval.

528

For the TCP hostcache it is non-fatal to read a stale value, when a new value is being written. However, we don't want any trash value to be read, we need either old or new. On amd64 aligned stores can't produce any intermediate values, and hc_entry is most likely aligned, although there is no enforcement for its each field. So, all these atomics aren't necessary in practice. I don't know if any other or any future arch can write 32-bit word in two steps. I sought advice with Kostik and he suggested to keep these atomics as self documenting code.

751

If we entered the list traversal, then there is at least one entry on the list. This logic was here before and my patch doesn't change it.

glebius marked an inline comment as done.

Address markj@'s comments.

Seems ok to me, with the caveat that I'm not very familiar with the TCP stack.

sys/netinet/tcp_hostcache.c
326

Ok. I just generally dislike using flag variables to determine what kind of synchronization is being used. It is not just a matter of style, this kind of thing makes it harder to read, to write assertions, to get useful results from static analysis, etc.. But this is not my code. :)

528

The compiler will ensure that each field is naturally aligned, unless you specifically force it not to.

I agree with what you wrote. I think you should keep the atomics. The change I'm suggesting is to add corresponding atomic_loads. Again, it shows that data races are possible, and thus provides hints for both the reader and for tools that try to automatically detect data races, like KCSAN. Right now we are not very good about this, but at least we can try to avoid it in new code.

Use atomic loads when reading hostcache values, as suggested by Mark.

sys/netinet/tcp_hostcache.c
375–376
750–754

As Mark suggested split tcp_hc_lookup() into lookup-for-read and
lookup-for-write versions, and then inline the latter into tcp_hc_update().

As Mark says make locking semantics clearer. Also removes passing
pointers to pointers.

markj added inline comments.
sys/netinet/tcp_hostcache.c
495

These checks should probably be in their own function.

sys/netinet/tcp_hostcache.c
495

The cycles are now different. For update we keep the previous pointer and for read lookup we don't.

I can reduce copy-n-paste in the hash calculation, though.

Reduce copy-n-paste: macro for hash and tcp_hc_cmp().

sys/netinet/tcp_hostcache.c
525

This check was added to make sure no undetected race (depite the row lock) exists and leads to accounting errors.

This revision was not accepted when it landed; it landed in state Needs Review.Apr 20 2021, 5:06 PM
This revision was automatically updated to reflect the committed changes.