Problem statement
Currently FreeBSD uses radix (a compressed binary trie) to perform all unicast route manipulations, including lookups.
Radix stores the key length in each item, allowing the use of sockaddrs and literally any address family.
This flexibility comes at a cost: radix is slow, cache-unfriendly, and adds locking to the hot path.
There is an extremely high bar to switching from radix to something else.
Some effort has been made to reduce the coupling; however, radix is still closely tied to the rest of the system. Fixed locking semantics, the rtentry format, and iteration assumptions restrict the list of possible solutions.
Algo overview
For small tables (VMs, potentially embedded), a class of read-only data structures can be used, as it is cheap to rebuild the entire data structure from scratch on each change.
For large tables, there are far more efficient algorithms tailored to IPv4 and IPv6 lookups. Algorithms like DIR24-8, Lulea or DXR use 3-5 bytes per prefix, compared to ~192 bytes per prefix in radix for large-table use cases.
They also limit lookups to 2-3 memory accesses (IPv4), while radix can be notably worse.
Some of these algorithms require complex update procedures, so using them assumes some form of update batching to reduce the change overhead.
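To illustrate why DIR24-8-style lookups are so cheap, here is a minimal userland sketch. This is not the D27412 code: the names, the 15-bit nexthop index layout and the insert helper are invented for illustration, and the insert path ignores longest-prefix ordering and prefixes longer than /24.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DIR24_EXT_FLAG 0x8000   /* entry points into tbl8, not a nexthop */

struct dir24_8 {
	uint16_t *tbl24;        /* 2^24 entries: nexthop idx or tbl8 group */
	uint16_t *tbl8;         /* groups of 256 entries for >/24 prefixes */
};

/* Dataplane lookup: at most two dependent memory accesses. */
static uint16_t
dir24_8_lookup(const struct dir24_8 *d, uint32_t dst)
{
	uint16_t e = d->tbl24[dst >> 8];

	if (e & DIR24_EXT_FLAG)
		e = d->tbl8[((e & 0x7FFF) << 8) | (dst & 0xFF)];
	return (e);
}

/*
 * Control-plane insert for prefixes of length <= 24 (illustration only;
 * a real implementation must respect longest-prefix ordering).
 */
static void
dir24_8_add_route(struct dir24_8 *d, uint32_t prefix, int plen, uint16_t nh)
{
	uint32_t first = prefix >> 8;
	uint32_t n = 1u << (24 - plen);

	for (uint32_t i = 0; i < n; i++)
		d->tbl24[first + i] = nh;
}
```

The key property is visible directly: a match on a /24-or-shorter prefix costs one array read, and only longer prefixes pay for a second, dependent read.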
Goals
- Lower the bar for introducing new route lookup algorithms
- Make routing lookups for existing families fast and lockless
Proposed solution
Add a framework that allows attaching lookup algorithms via kernel modules so they can be used for dataplane LPM lookups.
As most of the efficient algorithms perform a one-way compilation of prefixes into custom data structures, their implementations rely on a separate "control-plane" copy of the prefixes to drive data structure updates. This approach keeps radix as the "control-plane" source of truth, simplifying the actual algorithm implementations. It also serves as an abstraction layer over current routing-code details such as lock-ordering requirements and control-plane performance.
Algorithms
As a baseline, the following algorithms will be provided out of the box:
IPv4:
- lockless radix (small numbers of routes, rebuilt on each change, in-kernel)
- DPDK rte_lpm (DIR24-8 variation, large tables, kernel module) (D27412)
- "base" radix (to serve as a fallback, in-kernel)
IPv6:
- lockless radix (small numbers of routes, rebuilt on each change, in-kernel)
- DPDK rte_lpm6 (DIR24-8 variation, large tables, kernel module) (D27412)
- "base" radix (to serve as a fallback, in-kernel)
Implementation details
The framework takes care of initial synchronisation, route subscription, nhop/nhop-group referencing and indexing, dataplane attachment, and per-fib algorithm instance setup/teardown.
Retries
The framework is built to be resilient to failures. It explicitly allows an algorithm to request a "rebuild" if it is unable to perform an in-place modification. For example, a memory allocation may fail, or the algorithm/framework may run out of object indexes.
A rebuild simply builds a new algorithm instance, potentially fetching data from the old instance, and switches the dataplane pointers.
This approach simplifies the implementation of read-only data structures and of update batching.
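The retry contract can be sketched in userland as follows. The enum values are modeled on the framework's flm_op_result; the toy algorithm and the framework side are otherwise invented. An algorithm signals a rebuild when it cannot modify in place, and the framework responds by building a larger instance, re-populating it from the old one and switching over:

```c
#include <assert.h>
#include <stdlib.h>

/* Result codes modeled after the framework's flm_op_result. */
enum flm_op_result { FLM_SUCCESS, FLM_REBUILD, FLM_ERROR };

/* Toy algorithm instance with a fixed-capacity index space. */
struct toy_algo {
	int used;
	int capacity;
};

/* In-place change: request a rebuild once we run out of object indexes. */
static enum flm_op_result
toy_change(struct toy_algo *a)
{
	if (a->used >= a->capacity)
		return (FLM_REBUILD);
	a->used++;
	return (FLM_SUCCESS);
}

/*
 * Framework side: on FLM_REBUILD, build a fresh, larger instance,
 * re-dump state from the old one and switch to it. In the kernel the
 * old instance is reclaimed via an epoch(9) callback, not freed inline.
 */
static struct toy_algo *
framework_change(struct toy_algo *a)
{
	if (toy_change(a) == FLM_REBUILD) {
		struct toy_algo *na = malloc(sizeof(*na));

		na->capacity = a->capacity * 2;
		na->used = a->used;     /* re-dump from the old instance */
		free(a);                /* stand-in for deferred reclamation */
		a = na;
		toy_change(a);          /* replay the change that failed */
	}
	return (a);
}
```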
Automatic algorithm selection
As different workloads may have different route scales, including the framework in GENERIC requires supporting all scales without human intervention. The framework implements automatic algorithm selection and switchover the following way:
- each algorithm has a get_preference() callback, returning its relative preference (0..255) for the provided routing table scale
- after a routing table change, a callback is scheduled to re-evaluate the currently used algorithm against the others. The callback executes after N=30 seconds or M=100 route changes, whichever happens first
- a new algorithm's preference has to be X=5% better than the current one's to trigger a switchover
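A sketch of how such preference callbacks might look. The struct and field names are assumed stand-ins and the 1000-route threshold is invented; only the 0..255 range and the X=5% hysteresis come from the description above:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct rib_rtable_info (field name assumed). */
struct rib_rtable_info {
	uint32_t num_prefixes;
};

/* A small-table algorithm: ideal up to ~1000 routes, then fades. */
static uint8_t
small_algo_get_pref(const struct rib_rtable_info *rinfo)
{
	return (rinfo->num_prefixes <= 1000 ? 255 : 50);
}

/* A large-table algorithm: pays off only at scale. */
static uint8_t
large_algo_get_pref(const struct rib_rtable_info *rinfo)
{
	return (rinfo->num_prefixes > 1000 ? 255 : 100);
}

/* Switch only when the candidate is at least X=5% better than the
 * current algorithm, providing hysteresis against flapping. */
static int
should_switch(uint8_t cur, uint8_t best)
{
	return ((uint32_t)best * 100 > (uint32_t)cur * 105);
}
```

The hysteresis check is the interesting part: without the 5% margin, two algorithms with near-equal preferences could ping-pong on every re-evaluation.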
Nexthop referencing and indexing
The framework provides wrappers to automatically reference nexthops, ensuring they can be safely returned and their refcount is non-zero.
It also maintains an idx->nhop pointer array, transparently handling nhop/nhop-group indexes and allowing algorithms to store 16- or 32-bit indexes instead of pointers.
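A toy model of that indexing scheme (all structures here are invented stand-ins; the real nhop objects and refcounting live in the kernel's nexthop code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in nexthop; the real struct nhop_object is kernel-internal. */
struct nhop_object {
	uint32_t refcount;
};

/* Framework-maintained idx -> nhop pointer table (fields illustrative). */
struct nhop_map {
	struct nhop_object *ptrs[64];
	uint32_t used;
};

/*
 * Indexing a nexthop takes a reference, keeping the refcount non-zero
 * for as long as an algorithm may return that index from a lookup.
 */
static uint32_t
nhop_map_add(struct nhop_map *m, struct nhop_object *nh)
{
	nh->refcount++;
	m->ptrs[m->used] = nh;
	return (m->used++);
}

/* Dataplane side: resolve a compact 16/32-bit index back to a pointer. */
static struct nhop_object *
nhop_by_idx(const struct nhop_map *m, uint32_t idx)
{
	return (idx < m->used ? m->ptrs[idx] : NULL);
}
```

Storing indexes instead of pointers halves (or quarters) the per-entry footprint of the lookup structures, which matters for the byte-per-prefix numbers quoted earlier.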
Dataplane pointers
Instead of the two-dimensional rnh array accessed via `rt_table_get_rnh()`, the framework uses a per-family linear array of the following structures:
struct fib_dp { flm_lookup_t *f; void *arg; };
The function is the algorithm's lookup function and the data is a pointer to the algorithm-specific data structure.
Changing the function/data pointers is implemented by creating another copy of the array, switching to it, and reclaiming the old copy via an epoch(9) callback.
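The copy-and-switch can be sketched in userland as follows. The fib_dp layout matches the snippet above; the global array, the function name and the plain free() in place of epoch(9) deferral are simplifications for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct nhop_object;                     /* opaque in this sketch */
struct flm_lookup_key { uint32_t addr; };
typedef struct nhop_object *flm_lookup_t(void *algo_data,
    const struct flm_lookup_key key, uint32_t scopeid);

struct fib_dp { flm_lookup_t *f; void *arg; };

/* Active per-fib dataplane array; readers dereference a single pointer. */
static struct fib_dp *fib4_dp;

/*
 * Publish a new func/arg pair for one fib by copy-and-switch. In the
 * kernel the pointer store is atomic and the old array is reclaimed via
 * an epoch(9) callback once no reader can still reference it.
 */
static void
replace_dp(uint32_t nfibs, uint32_t fibnum, struct fib_dp newdp)
{
	struct fib_dp *new = malloc(nfibs * sizeof(*new));

	memcpy(new, fib4_dp, nfibs * sizeof(*new));
	new[fibnum] = newdp;

	struct fib_dp *old = fib4_dp;
	fib4_dp = new;                  /* atomic store in-kernel */
	free(old);                      /* stand-in for epoch deferral */
}
```

Readers never see a half-updated entry: they either load the old array or the new one, which is what makes the lookup path lockless.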
Callbacks
Effectively, the lookup module needs to implement the following callbacks, with nearly all table interaction handled by the framework:
# Lookup: return nexthop pointer for address specified by key and scopeid
typedef struct nhop_object *flm_lookup_t(void *algo_data, const struct flm_lookup_key key, uint32_t scopeid);

# Create base datastructures for the instance tied to a specific RIB
typedef enum flm_op_result flm_init_t(uint32_t fibnum, struct fib_data *fd, void *_old_data, void **new_data);

# Free algorithm-specific datastructures
typedef void flm_destroy_t(void *data);

# Callback for initial synchronisation, called for each route in the routing table as a part of "rebuild"
# called under rib write lock
typedef enum flm_op_result flm_dump_t(struct rtentry *rt, void *data);

# Callback for providing the datapath func/pointer to be used in lookups
# called under rib write lock
typedef enum flm_op_result flm_dump_end_t(void *data, struct fib_dp *dp);

# Callback notifying of a single route table change
# called under rib write lock
typedef enum flm_op_result flm_change_t(struct rib_head *rnh, struct rib_cmd_info *rc, void *data);

# Callback for determining relative algorithm preference based on the routing table data
typedef uint8_t flm_get_pref_t(const struct rib_rtable_info *rinfo);
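To show how the callbacks fit together, here is a hedged, self-contained toy algorithm in the spirit of the read-only small-table case: a linear scan rebuilt from scratch. The kernel types are replaced by minimal stand-ins, and the init/dump/dump_end signatures are simplified relative to the prototypes above:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-ins for kernel types so the callback flow runs in userland. */
struct nhop_object { int id; };
struct rtentry { uint32_t prefix; int plen; struct nhop_object *nh; };
struct flm_lookup_key { uint32_t addr; };
typedef struct nhop_object *flm_lookup_t(void *algo_data,
    const struct flm_lookup_key key, uint32_t scopeid);
struct fib_dp { flm_lookup_t *f; void *arg; };
enum flm_op_result { FLM_SUCCESS, FLM_REBUILD, FLM_ERROR };

/* Algorithm instance: a small immutable array of routes. */
struct toy_fib { struct rtentry rts[16]; int n; };

/* flm_init_t (simplified): allocate the instance. */
static enum flm_op_result
toy_init(void **new_data)
{
	*new_data = calloc(1, sizeof(struct toy_fib));
	return (*new_data != NULL ? FLM_SUCCESS : FLM_ERROR);
}

/* flm_destroy_t: free algorithm-specific data. */
static void
toy_destroy(void *data)
{
	free(data);
}

/* flm_dump_t: called once per route during the initial dump. */
static enum flm_op_result
toy_dump(struct rtentry *rt, void *data)
{
	struct toy_fib *f = data;

	if (f->n == 16)
		return (FLM_REBUILD);   /* out of slots: ask for a rebuild */
	f->rts[f->n++] = *rt;
	return (FLM_SUCCESS);
}

/* flm_lookup_t: longest-prefix match by linear scan. */
static struct nhop_object *
toy_lookup(void *data, const struct flm_lookup_key key, uint32_t scopeid)
{
	struct toy_fib *f = data;
	struct nhop_object *best = NULL;
	int best_plen = -1;

	(void)scopeid;
	for (int i = 0; i < f->n; i++) {
		int plen = f->rts[i].plen;
		uint32_t mask = plen ? ~0u << (32 - plen) : 0;

		if ((key.addr & mask) == f->rts[i].prefix && plen > best_plen) {
			best = f->rts[i].nh;
			best_plen = plen;
		}
	}
	return (best);
}

/* flm_dump_end_t: hand the datapath func/arg pair to the framework. */
static enum flm_op_result
toy_dump_end(void *data, struct fib_dp *dp)
{
	dp->f = toy_lookup;
	dp->arg = data;
	return (FLM_SUCCESS);
}
```

The framework drives this as: init, then dump per route, then dump_end to obtain the fib_dp used for lockless lookups; a failed dump (FLM_REBUILD) would restart the sequence against a fresh instance.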