Page MenuHomeFreeBSD

SMR protection for inpcbs
AbandonedPublic

Authored by glebius on Oct 20 2021, 8:25 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Jan 21, 9:16 AM
Unknown Object (File)
Sat, Jan 18, 5:54 PM
Unknown Object (File)
Tue, Jan 7, 8:48 AM
Unknown Object (File)
Dec 16 2024, 4:27 AM
Unknown Object (File)
Nov 24 2024, 9:55 PM
Unknown Object (File)
Nov 24 2024, 11:32 AM
Unknown Object (File)
Nov 22 2024, 12:38 PM
Unknown Object (File)
Nov 19 2024, 8:15 AM

Details

Summary

Back when we were in the process of moving from the sophisticated combination of locks to protect the inpcb database to the epoch there was a serious concern. The concern was that the epoch is great for stuff that doesn't change all the time - interfaces, addresses, etc, but if we use epoch for connections, then its garbage collector potentially can grow really long purge lists. Back then myself and Jeff came with a plan and a promise. Jeff was already sketching the SMR and promised it to be ready soon. And I promised to use SMR for the inpcbs. As part of the work the KPI between the pcb database and rest of the stack is reviewed and made more solid.

Implementation details:

  • The database is already build on CK lists.
  • SMR section makes it safe to access a pcb after it went through uma_zfree(). We always want a particular pcb locked, so after successful lookup and lock we just check INP_FREED.
  • Transitioning from SMR section to an rwlock is sophisticated, but this all is hidden from normal KPI users.
  • List traversals (e.g. netstat or raw sockets) are now done with an iterator-like KPI, that also hides all SMR complexity.

Despite added complexity in SMR-to-rwlock transitions, the overall change makes the code simpler to understand.

Note. This change requires PCBGROUP to be retired.

The full branch, that has some preparatory changes, removal of PCBGROUP and rewrite of TCP_HPTS referencing model of inpcbs can be found here: https://github.com/glebius/FreeBSD/commits/inpcb

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 42422
Build 39310: arc lint + arc unit

Event Timeline

glebius added a reviewer: network.
glebius retitled this revision from SMR locking for inpcbs to SMR protection for inpcbs.
glebius edited the summary of this revision. (Show Details)

Added Jeff, Mark, Mateusz. Not asking to review change as a whole, but please take a look at inp_smr_lock, inp_list_first, inp_list_next.

I don't see any hpts changes in this review.. which is what I was looking for. I suppose
you have it in another review then??

sys/kern/subr_witness.c
569

Why are these lines changed/re-arranged.. is there some reason for it?

sys/kern/uipc_ktls.c
812–813

Your comment above
"SMR section makes it safe to access a pcb after it went through uma_zfree(). We always want a particular pcb locked, so after successful lookup and lock we just check INP_FREED."

confuses me here. The comment implies that we need to check INP_FREED, but yet here you remove that. Is the comment wrong or misleading? Did you mean we no-longer check INP_FREED maybe??

sys/netinet/in_mcast.c
1591 ↗(On Diff #97185)

These changes here appear to be un-related to the work described.

sys/netinet/in_pcb.c
636

Conceptually the INP_WLOCK() should be done before line 619. I know its not
"lookup-able" at this point but I think for consistency you should not be writing
anything on the inp here without holding the lock.

sys/netinet/in_pcb.h
301–302

Why are we removing the spare pointer?

388

This comment seems to be incomplete like you stopped in the middle of a sentence was more intended?

Just some minor comments so far, I will look further at the locking.

sys/kern/subr_witness.c
569

The order in the table determines the lock order expected by witness.

sys/netinet/in_pcb.c
553

Looks like there's a missing newline here.

1759

refcount_acquire() returns the old value, so you could use that to write a non-racy assertion:

u_int old __diagused;

old = refcount_acquire(&inp->inp_refcount);
KASSERT(old > 0, ...);
1770

No need for this assertion, refcount_release() will panic in this case.

1796

Consider moving the assertions into a subroutine so they aren't duplicated.

2245–2247

Is this unfinished?

sys/netinet/in_pcb.h
54

proc.h sorts before rwlock.h

713

same for the others above.

sys/netinet/ip_divert.c
188

Could be just return (inp->inp_lport == nport);.

glebius added inline comments.
sys/kern/subr_witness.c
569

The locks with not-so-informative names "udp" and "tcp" correspond to the big struct pcbinfo locks. The locking order has changed. Now we no longer use pcbinfo lock to traverse the lists, however when we free or alloc a pcb, we would grab info lock holding a pcb lock.

P.S. I think these WITNESS hint blocks can be removed as a whole. They are needed only for complicated cases, and this case is no longer a complicated one. WITNESS will learn the right order naturally.
P.P.S. In the current patch there are couple places where we use the old order (inp_apply_all, in_pcb_notifyall), this will be addressed in next update of the review.

sys/kern/uipc_ktls.c
812–813

The comment belongs to in_pcb.c only. When we came to a pcb via a CK list pointer, then we must check this flag, and this is what inp_smr_lock() or inp_list_next() do.

Nobody outside in_pcb.c shall come to a pcb via CK list pointers.

sys/netinet/in_mcast.c
1591 ↗(On Diff #97185)

Yep, and they already are committed to FreeBSD/main.

When I rebase the patch, they will disappear. But I'm not rebasing above our Netflix merge point, cause that would break my merging of this branch to a Netflix branch. So they will disappear next week.

sys/netinet/in_pcb.c
636

That's a good conceptual question! It should be either locked right after allocation and even before bzero, cause bzero already writes to the protected fields OR it should be locked right before it could be looked up. I'll leave that open for now and see what others think.

P.S. I will also ask Drew, what if such change can be not only stylistic but affect performance?

1759

Will do. Thanks!

1796

The HPTS-related ones will go away later. So only two MPASS will remain. I'll leave as is for now, if you don't mind.

2245–2247

Will fix, thanks!

sys/netinet/in_pcb.h
301–302

The spare pointers are a remnant of times when inpcb was exported to userland. I'll put it back, to separate structure cleanup from the main commit.

388

Will fix!

glebius marked 6 inline comments as done.

Addressing small issues noticed by Randall and Mark.

Note. This change requires PCBGROUP to be retired.

Have you circulated this proposal in the wider -net and -vendor community? I know of one downstream that uses this feature.

In D32585#737015, @np wrote:

Note. This change requires PCBGROUP to be retired.

Have you circulated this proposal in the wider -net and -vendor community? I know of one downstream that uses this feature.

Can you please connect me to them?

Updated version. Two major changes reduce code a lot, but in practice
are non-functional:

  • Collapse inp_smr_rlock + inp_smr_wlock into one function with "how" argument
  • Collapse hash/list iterator KPI into a single function that can traverse hash slots and global list with read-lock or write-lock semantic.
sys/netinet/in_pcb.h
631

Missing newline. To be fixed in next update.

In D32585#737015, @np wrote:

Note. This change requires PCBGROUP to be retired.

Have you circulated this proposal in the wider -net and -vendor community? I know of one downstream that uses this feature.

It looks like opnsense is trying to make use of pcbgroups/rss, might want to check with @franco_opnsense.org

sys/netinet/in_pcb.c
527–528

inpcbzone_name can be const, BTW.

1531

These backslashes aren't needed.

1664

Maybe call this advance or next_locked instead of next.

1728

There is no guarantee that next is a valid inp at this point. While outside of the SMR section, next could have been freed and reallocated, so it may belong to a different hash bucket, or it may have been moved to the end of the global list (so some elements can be skipped), or the backing slab could have been reallocated for some other purpose (inps are not type-stable), so next is a pointer to random memory.

glebius added inline comments.
sys/netinet/in_pcb.c
527–528

More than that can be improved here. I got a patch in queue that would static-ize most of struct inpcbinfo initialization. Don't want to pile it into this review.

1728

That's a very good catch. Thanks! I will look deeper into here.

In D32585#737015, @np wrote:

Note. This change requires PCBGROUP to be retired.

Have you circulated this proposal in the wider -net and -vendor community? I know of one downstream that uses this feature.

It looks like opnsense is trying to make use of pcbgroups/rss, might want to check with @franco_opnsense.org

I don't see much history here https://github.com/opnsense/src/commits/master/sys/netinet/in_pcbgroup.c
Anyway, adding Franco as reviewer.

In D32585#737015, @np wrote:

Note. This change requires PCBGROUP to be retired.

Have you circulated this proposal in the wider -net and -vendor community? I know of one downstream that uses this feature.

It looks like opnsense is trying to make use of pcbgroups/rss, might want to check with @franco_opnsense.org

I don't see much history here https://github.com/opnsense/src/commits/master/sys/netinet/in_pcbgroup.c
Anyway, adding Franco as reviewer.

Looking at https://github.com/glebius/FreeBSD/commit/694a5a75a86 in particular seems reasonable so as long as RSS remains in a workable state consider me on board. Probably can pick up any issues on that end should they arise.

Anyway, adding Franco as reviewer.

Looking at https://github.com/glebius/FreeBSD/commit/694a5a75a86 in particular seems reasonable so as long as RSS remains in a workable state consider me on board. Probably can pick up any issues on that end should they arise.

Thanks, Franco!

sys/netinet/in_pcb.c
1675

This line is erroneous. Should have gone with the last update. It will be deleted in next update.

This update:

  • fixes double advance bug in iterator with match
  • adds use to the new iterator in a several other places
  • removes unneeded INP_INFO_WLOCK.

With the latest change to the iterator a theoretical deadlock becomes possible. If two pcbs fall into the same hash slot and are added to this slot in the opposite sequence to their creation, then their order in the hash slot will be opposite to their order in the global list. Then, if iterator on the hash slot can deadlock with iterator on global list. The only thing that iterates on hash slot is raw IPv4. The opposite list order is possible only if an application unbinds then binds a raw socket. For a deadlock to happen an execution of rip_pcblist is needed at the same time as packet processing.

The future plan is to get rid of global list and iterate over all hash slot for global traversals, so this potential problem will go away.

My phabricator skills were not enough to update this review and embed it into a stack of reviews, so abandoning it in favor of https://reviews.freebsd.org/D33022