Page MenuHomeFreeBSD

pfil: add pfil_mem_{in,out}() and retire pfil_run_hooks()
ClosedPublic

Authored by glebius on Jan 7 2023, 9:02 PM.
Tags
None
Referenced Files
Unknown Object (File)
Dec 27 2024, 7:50 AM
Unknown Object (File)
Nov 6 2024, 6:10 AM
Unknown Object (File)
Oct 17 2024, 11:05 PM
Unknown Object (File)
Oct 3 2024, 5:02 PM
Unknown Object (File)
Oct 1 2024, 5:59 PM
Unknown Object (File)
Sep 30 2024, 10:34 AM
Unknown Object (File)
Sep 30 2024, 6:25 AM
Unknown Object (File)
Sep 30 2024, 2:17 AM

Details

Summary

The 0b70e3e78b0 changed the original design of a single entry point
into pfil(9) chains providing separate functions for the filtering
points that always provide mbufs and know the direction of a flow.
The motivation was to reduce branching. The logical continuation
would be to do the same for the filtering points that always provide
a memory pointer and retire the single entry point.

o Hooks now provide two functions: one for mbufs and optional for

memory pointers.

o pfil_hook_args() has a new member and pfil_add_hook() has a

requirement to zero out uninitialized data. Bump PFIL_VERSION.

o As it was before, a hook function for a memory pointer may realloc

into an mbuf.  Such mbuf would be returned via a pointer that must
be provided in argument.

o The only hook that supports memory pointers is ipfw:default-link.

It is rewritten to provide two functions.

o All remaining uses of pfil_run_hooks() are converted to

pfil_mem_in().

o Transparent union of pfil_packet_t and tricks to fix pointer

alignment are retired. Internal pfil_realloc() reduces down to
m_devget() and thus is retired, too.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

I don't know about the ipfw changes (as in not reviewing that part). I definitely agree with the rest of the patch (modulo the one nit).

Thanks for finishing this, I was going to do it then other work got in the way.

sys/net/pfil.c
234

stray newline?

I didn't notice any performance impact with this patch (I've applied D37976 and D37977 together).
Lab is measuring the forwarding rate on a very simple IPFW firewall in front of 2 packets generators (one legitimate 20Mpps flows, a second illegimitate 20Mpps flows).

x FreeBSD n280910, packets-per-second forwarded (ipfw-at-nic-level)
+ FreeBSD n280910 with D37976 and D37977, packets-per-second forwarded (ipfw-at-nic-level)
+--------------------------------------------------------------------------+
|   *       x  +      * + x      x                                        +|
|       |___________A_M________|                                           |
||____________________M_____A__________________________|                   |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      12683306      12686106      12685002      12684784     1099.0563
+   5      12683316      12689972      12685037      12685578     2566.0973 
No difference proven at 95.0% confidence

that's probably because the branches you would normally ran into got previously sorted out with 0b70e3e78b0279c66be06dea27bcdaf5eadf663d , but chances are there is enough other slowdown for that to not matter on its own

In D37977#873507, @mjg wrote:

that's probably because the branches you would normally ran into got previously sorted out with 0b70e3e78b0279c66be06dea27bcdaf5eadf663d , but chances are there is enough other slowdown for that to not matter on its own

@olivier Have you tested the improvements from the revision mentioned above?

@mjg: Have you measured an improvement from these changes? Did you test with inbound ethernet traffic from physical devices? If you tested using CPU generated traffic, I wonder if the cache misses outweigh any benefit from reducing branching? Alternatively, I wonder if the CPU branch predictor is just really good..

What cache-misses? I did not see any point in benching this change, it is the original minus some branches and no extra memory accesses.

In D37977#873536, @mjg wrote:

What cache-misses? I did not see any point in benching this change, it is the original minus some branches and no extra memory accesses.

My theory as to why Olivier does not see any benefit from this change is that he's benchmarking using incoming traffic, which is cold in cache. And so the cpu time is likely dominated by cache misses on packet header fields, and any benefit from a reduction in branches is so small that it gets lost in the noise.

I was thinking you might have seen a benefit if you were using CPU generated traffic that was hot in cache.

In D37977#873536, @mjg wrote:

What cache-misses? I did not see any point in benching this change, it is the original minus some branches and no extra memory accesses.

I do see point in benching, since it is a followup on your API change that was driven by supposed performance issue with the old API. Now that we don't see any changes with cutting off extra branches for the driver level hooks, which in principle can process more traffic than stack level hooks, I'm curious on measured performance impact of 0b70e3e78b0 itself. What was it?

cy added a subscriber: cy.

As far as I can see and also based on my testing, though not benchmarking, here at my site, I see no issues with this.

In D37977#873536, @mjg wrote:

What cache-misses? I did not see any point in benching this change, it is the original minus some branches and no extra memory accesses.

I do see point in benching, since it is a followup on your API change that was driven by supposed performance issue with the old API. Now that we don't see any changes with cutting off extra branches for the driver level hooks, which in principle can process more traffic than stack level hooks, I'm curious on measured performance impact of 0b70e3e78b0 itself. What was it?

The issue with the old API was adding work when it is known at compilation time it wont be needed.

For single-threaded performance, cache and branch prediction misses are the biggest problems. Even a correctly predicted avoidable branch is work which did not have to be done.

The few branches which got whacked in said commit do very little on its own, only because of massive slowdowns everywhere else. For that reason I saw 0 benefit from benching.

As such the thing to do is to incrementally whack the avoidable problems and stop adding new ones, at least.

On that point, I recall a benchmark by olivier showing about 2% perf increase from compiling without KDTRACE_HOOKS, which in turned whacked all the SDT_* probes. sdt probes compile to a __predict_false branch on one global var. This whacked enough branches to warrant benchmarking.

I also note previously I worked on path lookup in the vfs layer. After removing tons of bloat (but still keeping a lot!) and introducing lockless lookup it was at about 2.7 mln/s. Dilligent whacking of branches which did not need to be there brought it up to about 3.4 mln/s, and even then there is perf left on the table. The path was in the lines of /usr/obj/usr/src/sys/GENERIC/vnode_if.c, so quite a few components as well.

Anyhow, bottom line:

  1. things are slow
  2. just because a step has negligible impact on its own does not mean it should not be taken
  3. can we please stop adding trivially avoidable overhead in fast paths
This revision was not accepted when it landed; it landed in state Needs Review.Feb 14 2023, 6:05 PM
This revision was automatically updated to reflect the committed changes.