Page MenuHomeFreeBSD

bge: tell debugnet there are 2 rx rings, not 1,024
ClosedPublic

Authored by vangyzen on Jul 18 2022, 6:13 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 10, 4:41 AM
Unknown Object (File)
Mon, Dec 30, 10:34 PM
Unknown Object (File)
Sun, Dec 29, 10:11 AM
Unknown Object (File)
Nov 29 2024, 5:40 PM
Unknown Object (File)
Nov 24 2024, 9:21 PM
Unknown Object (File)
Nov 23 2024, 12:07 PM
Unknown Object (File)
Nov 22 2024, 6:58 AM
Unknown Object (File)
Nov 15 2024, 3:31 PM
Subscribers

Details

Summary

debugnet provides the network stack for netgdb and netdump. Since it
must operate under panic/debugger conditions and can't rely on dynamic
memory allocation, it preallocates mbufs during boot or network
configuration. At that time, it does not yet know which interface
will be used for debugging, so it does not know the required size and
quantity of mbufs to allocate. It takes the worst-case approach by
calculating its requirements from the largest MTU and largest number
of receive queues across all interfaces that support debugnet.

Unfortunately, the bge NIC driver told debugnet that it supports 1,024
receive queues. It actually supports only 2 queues (with 1,024 slots,
thus the error). This greatly exaggerated debugnet's preallocation,
so with an MTU of 9000 on any interface, it allocated 600 MB of memory.
A tiny fraction of this memory would be used if netgdb or netdump were
invoked; the rest is completely wasted.

MFC after: 1 week
Sponsored by: Dell EMC Isilon

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 46468
Build 43357: arc lint + arc unit

Event Timeline

Ugh. :(

For what it's worth, I am planning to change some of the preallocation logic in light of PR 264191. In particular, when netgdb was added we started preallocating mbufs at link-up time, regardless of whether the system was configured to netdump upon panic. The preallocation logic tries to be clever and doesn't work well at all.

My as-yet unimplemented idea is based on the following observations:

  • We can count the number of clusters consumed (i.e., passed to if_input) during polling of a given rx ring.
  • So long as an rx ring contains at least one mbuf cluster, it can receive data.
  • An rx polling operation both consumes received packets and allocates new clusters. Drivers follow one of the following patterns:
    • consume available packets up to some limit, pass them to if_input; then replenish empty slots in the rx ring
    • consume one available packet, replenish it immediately, pass the received packet to if_input

The idea is that we can get away with preallocating a fixed number of clusters of each size, ignoring the driver, the ifnet MTU and rx ring size, and mostly ignoring the rx ring count. Specifically:

  • create a pool for each cluster size (2KB, 4KB, 9KB, 16KB), populate them with some fixed number (8?) of clusters during boot; preallocation size is count*(2+4+9+16)KB.
  • define a way to limit mbuf cluster allocations to n in a code section:
    • when in such a section, the allocator maintains a count of available clusters: when the count is zero, allocation fails; when it's positive, an allocation causes the count to decrement; when a cluster is consumed and then freed, the count is incremented
    • we enter a new section each time an rx ring is polled, and the initial count is 1
    • the initial count has to be 1 to accommodate drivers like bge, which drop a received packet if they're unable to allocate a buffer to replace it

Assuming that the drivers all work in such a model, which I think is the case but haven't verified, this lets us avoid having to second-guess the driver. Does this idea seem like it'd work?

This revision is now accepted and ready to land.Jul 18 2022, 6:58 PM

I like your plan overall. I'm not intimately familiar with the network drivers, so I can't comment on that aspect.

define a way to limit mbuf cluster allocations to n in a code section

Why is this necessary? Why can't we let the driver allocate all available mbufs? Hmmm, we need some for tx. Is that the reason?

How do multi-queue NICs behave when some rx rings are empty? If a packet arrives on an empty rx ring, does the packet get dropped, or does it get queued to a different rx ring?

I like your plan overall. I'm not intimately familiar with the network drivers, so I can't comment on that aspect.

define a way to limit mbuf cluster allocations to n in a code section

Why is this necessary? Why can't we let the driver allocate all available mbufs?

Because we want to avoid a situation where we exhaust all of our preallocated mbufs to fill populate a single ring. That's the reason we preallocate as much as we do today.

Hmmm, we need some for tx. Is that the reason?

tx is easier to deal with since allocation is done in debugnet/netdump itself. (And doesn't require mbuf clusters, just headers.) My scheme is really just focused on rx.

How do multi-queue NICs behave when some rx rings are empty? If a packet arrives on an empty rx ring, does the packet get dropped, or does it get queued to a different rx ring?

I've always assumed that the packet would get dropped if the ring to which it's steered doesn't have any available buffer space. That might not be true for all devices, but I wouldn't want to rely on that.

define a way to limit mbuf cluster allocations to n in a code section

Why is this necessary? Why can't we let the driver allocate all available mbufs?

Because we want to avoid a situation where we exhaust all of our preallocated mbufs to fill populate a single ring. That's the reason we preallocate as much as we do today.

How do multi-queue NICs behave when some rx rings are empty? If a packet arrives on an empty rx ring, does the packet get dropped, or does it get queued to a different rx ring?

I've always assumed that the packet would get dropped if the ring to which it's steered doesn't have any available buffer space. That might not be true for all devices, but I wouldn't want to rely on that.

This makes sense. Ideally, the iteration over rx queues would happen directly in foo_debugnet_poll. That already seems to be the case for most multi-queue drivers, but not all.

I've just noticed that if_re seems to have the same bug (confusing rx ring count with descriptor count).