Page MenuHomeFreeBSD

Stage 2: Introduce scalable route multipath
ClosedPublic

Authored by melifaro on Sep 15 2020, 11:15 PM.
Tags
None
Referenced Files
F102760952: D26449.diff
Sat, Nov 16, 7:26 PM
F102732101: D26449.id77371.diff
Sat, Nov 16, 11:59 AM
Unknown Object (File)
Fri, Nov 15, 4:45 PM
Unknown Object (File)
Thu, Nov 14, 6:01 AM
Unknown Object (File)
Wed, Nov 13, 10:43 PM
Unknown Object (File)
Wed, Nov 13, 5:58 AM
Unknown Object (File)
Sun, Nov 10, 2:00 PM
Unknown Object (File)
Sun, Nov 10, 2:00 PM

Details

Summary

This is the second part of routing subsystem changes initially described in D24141.
First part of these changes has been landed in D24232.
Support for outbound hashing is in D26523.

NOTE: I'm going to land this diff on Saturday, October 3 unless I get any objections.

Implementation

  • Follows the same implementation as nexthop objects and reuses parts of the infrastructure.
  • Nexthop groups are stored in the resizable hash and are indexed in the same way nexthops do.
  • Similarly, there is a private part (contains pointes to nexthops and relative wights) and dataplane-visible part, consisting of array of nexthops to choose from.
  • Max nhgrp wifth is set to 64 (no limitations here, can be bumped futher)
  • Lazy initialisation: allocate index/hash memory at first attempt to add multipath route.
  • rt_nhop can now point to either nexthop or nexthop group (distinguished by NHF_MULTIPATH flag.
  • all dataplane functions handle nexthop selection internally
  • routing table notifications can now notify on switching between nexthop groups -> rib_decompose_notification() has been created to decompose such notifications to a set of old "simple" add/del/change operations.

User-visible changes

  • Backward compatible: all non-multipath functionality continues to work as is
  • All routes comes up with weights, default is 1, max is 16M (2^24).
  • route delete <prefix> will work for non-multipath prefixes, specifying gateway is required for multipath ones

Hashing

  • Relies on mbuf flowid for both transit and outbound traffic.
  • For locally-originated connections we don’t currently calculate inp_flowid. As performance optimization, start calculating these hashes after inserting first multipath route (V_hash_outbound).

Rollout stages

  1. Commit with net.route.multipath=0 by default even with ROUTE_MPATH
  2. Enable ROUTE_MPATH in amd64 GENERIC.
  3. Turn on net.route.multipath=1 by default
Test Plan
divert:ipdivert_ip_input_local_success  ->  skipped: ipdivert module is not loaded  [0.027s]
divert:ipdivert_ip_output_remote_success  ->  skipped: ipdivert module is not loaded  [0.021s]
fibs:fibs_ifroutes1_success  ->  passed  [0.073s]
forward:fwd_ip_icmp_gw_fast_success  ->  passed  [1.241s]
forward:fwd_ip_icmp_gw_slow_success  ->  passed  [1.105s]
forward:fwd_ip_icmp_iface_fast_success  ->  passed  [1.051s]
forward:fwd_ip_icmp_iface_slow_success  ->  passed  [1.075s]
ip_reass_test:ip_reass__large_fragment  ->  passed  [0.014s]
ip_reass_test:ip_reass__multiple_last_fragments  ->  passed  [0.030s]
ip_reass_test:ip_reass__zero_length_fragment  ->  passed  [0.015s]
lpm:lpm_test1_success  ->  passed  [0.151s]
lpm:lpm_test2_success  ->  passed  [0.174s]
output:output_raw_flowid_mpath_success  ->  passed  [0.426s]
output:output_raw_success  ->  passed  [0.082s]
output:output_tcp_flowid_mpath_success  ->  passed  [1.850s]
output:output_tcp_setup_success  ->  passed  [0.154s]
output:output_udp_flowid_mpath_success  ->  passed  [9.339s]
output:output_udp_setup_success  ->  passed  [1.221s]
redirect:valid_redirect  ->  passed  [1.114s]
so_reuseport_lb_test:basic_ipv4  ->  passed  [1.566s]
so_reuseport_lb_test:basic_ipv6  ->  passed  [0.903s]
socket_afinet:socket_afinet  ->  passed  [0.002s]
socket_afinet:socket_afinet_bind_ok  ->  passed  [0.002s]
socket_afinet:socket_afinet_bind_zero  ->  passed  [0.002s]

fibs6:fibs6_ifroutes1_success  ->  passed  [1.576s]
forward6:fwd_ip6_gu_icmp_gw_gu_fast_success  ->  passed  [2.318s]
forward6:fwd_ip6_gu_icmp_gw_gu_slow_success  ->  passed  [2.706s]
forward6:fwd_ip6_gu_icmp_gw_ll_fast_success  ->  passed  [3.020s]
forward6:fwd_ip6_gu_icmp_gw_ll_slow_success  ->  passed  [3.046s]
forward6:fwd_ip6_gu_icmp_iface_fast_success  ->  passed  [2.620s]
forward6:fwd_ip6_gu_icmp_iface_slow_success  ->  passed  [2.677s]
lpm6:lpm6_test1_success  ->  passed  [1.890s]
lpm6:lpm6_test2_success  ->  passed  [1.990s]
mld:mldraw01  ->  passed  [4.191s]
output6:output6_raw_flowid_mpath_success  ->  failed: Balancing failure: 1: 37 2: 4  [2.348s]
output6:output6_raw_success  ->  passed  [1.814s]
output6:output6_tcp_flowid_mpath_success  ->  passed  [2.239s]
output6:output6_tcp_setup_success  ->  passed  [2.089s]
output6:output6_udp_flowid_mpath_success  ->  passed  [4.973s]
output6:output6_udp_setup_success  ->  passed  [3.244s]
redirect:valid_redirect  ->  passed  [2.363s]
scapyi386:scapyi386  ->  passed  [4.396s]

test_rtsock_l3:rtm_add_v4_gu_ifa_ordered_success  ->  passed  [0.118s]
test_rtsock_l3:rtm_add_v4_gw_direct_success  ->  passed  [0.118s]
test_rtsock_l3:rtm_add_v4_no_rtf_host_failure  ->  passed  [0.125s]
test_rtsock_l3:rtm_add_v4_temporal1_success  ->  passed  [0.131s]
test_rtsock_l3:rtm_add_v6_gu_gw_gu_direct_success  ->  passed  [0.151s]
test_rtsock_l3:rtm_add_v6_gu_ifa_hostroute_success  ->  passed  [0.124s]
test_rtsock_l3:rtm_add_v6_gu_ifa_ordered_success  ->  passed  [0.132s]
test_rtsock_l3:rtm_add_v6_gu_ifa_prefixroute_success  ->  passed  [0.122s]
test_rtsock_l3:rtm_add_v6_temporal1_success  ->  passed  [0.124s]
test_rtsock_l3:rtm_change_v4_gw_success  ->  passed  [0.126s]
test_rtsock_l3:rtm_change_v4_mtu_success  ->  passed  [0.126s]
test_rtsock_l3:rtm_change_v6_gw_success  ->  passed  [0.130s]
test_rtsock_l3:rtm_change_v6_mtu_success  ->  passed  [0.122s]
test_rtsock_l3:rtm_del_v4_gu_ifa_prefixroute_success  ->  passed  [0.132s]
test_rtsock_l3:rtm_del_v4_prefix_nogw_success  ->  passed  [0.136s]
test_rtsock_l3:rtm_del_v6_gu_ifa_hostroute_success  ->  passed  [0.121s]
test_rtsock_l3:rtm_del_v6_gu_ifa_prefixroute_success  ->  passed  [0.127s]
test_rtsock_l3:rtm_del_v6_gu_prefix_nogw_success  ->  passed  [0.125s]
test_rtsock_l3:rtm_get_v4_empty_dst_failure  ->  passed  [0.003s]
test_rtsock_l3:rtm_get_v4_exact_success  ->  passed  [0.122s]
test_rtsock_l3:rtm_get_v4_hostbits_failure  ->  passed  [0.119s]
test_rtsock_l3:rtm_get_v4_lpm_success  ->  passed  [0.121s]
test_rtsock_lladdr:rtm_add_v4_gu_lle_success  ->  passed  [0.130s]
test_rtsock_lladdr:rtm_add_v6_gu_lle_success  ->  passed  [0.138s]
test_rtsock_lladdr:rtm_add_v6_ll_lle_success  ->  passed  [0.119s]
test_rtsock_lladdr:rtm_del_v4_gu_lle_success  ->  passed  [0.121s]
test_rtsock_lladdr:rtm_del_v6_gu_lle_success  ->  passed  [0.120s]
test_rtsock_lladdr:rtm_del_v6_ll_lle_success  ->  passed  [0.118s]

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 33584
Build 30835: arc lint + arc unit

Event Timeline

Add outbound hashing & other fixes.

melifaro retitled this revision from Introduce scalable multipath to Stage 2: Introduce scalable route multipath.Sep 17 2020, 11:04 PM
melifaro edited the summary of this revision. (Show Details)
melifaro edited the test plan for this revision. (Show Details)
melifaro added reviewers: network, glebius, ae, olivier.

Heavy patch, looks promising overall.

sys/net/route/route_ctl.c
564–567

So in the case of ! ROUTE_MPATH, the call to uma_zfree is suppressed now?

Look like the review contains a patch to file usr.bin/netstat/nhgrp.c that doesn't exist on -head

Look like the review contains a patch to file usr.bin/netstat/nhgrp.c that doesn't exist on -head

Yeah, that's weird. raw patch is different from the one I have in git..
I'll update the patch shortly by cleaning some stuff in nhgrp.c, hopefully this will be enough for phabricator to stop thinking it's a copied file.

usr.bin/netstat/nhgrp.c
193 ↗(On Diff #77196)

I have a build error here:

--- all_subdir_usr.bin/netstat ---
/usr/src/usr.bin/netstat/nhgrp.c:193:11: error: use of undeclared identifier 'NET_RT_NHGROUPS'
        mib[4] = NET_RT_NHGROUPS;
sys/net/route/route_ctl.c
564–567

Not exactly. We have error set to EEXIST above, which renders this condition to be true.

Update nhgrp kernel <> userland interface.
Fix corner cases in next hop group compilations.

Fix rib_walk_del() multipath route handling.
Reword some comments.

usr.bin/netstat/nhgrp.c
193 ↗(On Diff #77196)

Should be fixed now.

Tested on my ECMP lab (https://bsdrp.net/documentation/examples/ecmp#testing_load_balancing) and the flow-id load-balancing seems working great.
Would you like test using bird and FRR too ?

Update to the latest HEAD to remove already-committed parts.

Tested on my ECMP lab (https://bsdrp.net/documentation/examples/ecmp#testing_load_balancing) and the flow-id load-balancing seems working great.
Would you like test using bird and FRR too ?

As far as I understand, bird doesn't have bits to support multipath with rtsock, some amount of work in krt_send_route() has to be done to make it happen.
Haven't checked FRR code yet, probably it also may need some updates.

Btw, If you by any chance have cycles/capability/willingness to test locally-originated traffic (performance-wise), that would we awesome :-)

Split out outbound hashing part.
Rebase to latest HEAD.

Add forgotten netstat header.

Tested with net/frr7 compiled with MULTIPATH option, and it is working great:

[root@router1]~# vtysh

Hello, this is FRRouting (version 7.4).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

router1# sh ip route 10.0.24.0/24
Routing entry for 10.0.24.0/24
  Known via "ospf", distance 110, metric 20, best
  Last update 00:06:39 ago
  * 10.0.112.2, via igb1, weight 1
  * 10.0.212.2, via igb2, weight 1

route1# sh ipv6 route 2001:db8:24::/64
Routing entry for 2001:db8:24::/64
  Known via "ospf6", distance 110, metric 20, best
  Last update 00:02:42 ago
  * fe80::20d:b9ff:fe45:7ad5, via igb1, weight 1
  * fe80::20d:b9ff:fe45:7ad6, via igb2, weight 1


router1# exit

[root@router1]~# netstat -rn4 | grep 10.0.24.0/24
10.0.24.0/24       10.0.212.2         UG1        igb2
10.0.24.0/24       10.0.112.2         UG1        igb1

[root@router1]~# netstat -rn6 | grep 2001:db8:24::/64
2001:db8:24::/64                  fe80::20d:b9ff:fe45:7ad6%igb2 UG1        igb2
2001:db8:24::/64                  fe80::20d:b9ff:fe45:7ad5%igb1 UG1        igb1
sys/net/route.h
217

Looks like three descriptions in comments need to be fixed.

sys/net/route/nhgrp.c
232

One malloc could succeed, other fail. Suggested fix - call free() on both before returning.

sys/net/route/nhgrp_ctl.c
69

New style suggest to use _Static_assert() compiler builtin.

83

Extra newline.

244

May be add KASSERT to assure we don't overflow dst->nhg_size?

melifaro added inline comments.
sys/net/route/nhgrp.c
232

Sure, it can happen. I'd say that it is not necessary to tie hash scaling to index array scaling. In situation where some allocation fails we will still make incremental progress.

If you don't mind I'd prefer to leave it as is.

Address Gleb's comments.
Rebase to the latest HEAD.

Fix build with ROUTE_MPATH.
Set net.ip.route.multipath to 0 by default.

This revision was not accepted when it landed; it landed in state Needs Review.Oct 3 2020, 10:47 AM
This revision was automatically updated to reflect the committed changes.

Can we have an updated netstat help and man page, for the new "-O" option, please?