D30398 Route IPv4 packets via IPv6 next-hops

In D30398#682930, @melifaro wrote:

First of all, thanks for working on this! This is the important feature we need to have in base, preferably turned on by default.
I'd love to land this.

Glad to get your response!
I'm interested in scalable networks, and have recently been exploring the Clos network architecture described in RFC 7938. It needs a large number of IP addresses and a lot of configuration,
and BGP unnumbered (RFC 5549) seems to come to the rescue.
Thanks to your effort on FreeBSD's routing component, it is not too hard to implement such a feature in the dataplane :)

I spent some time thinking about the potential implementation option some time ago. Let me try to summarise my thoughts below.

I'd really prefer to avoid adding additional complexity to the datapath. I'm really a proponent of making the fast path as fast / as simple as possible, with the obvious tradeoff of moving all the complexity to the control plane.

Reasonable.
I'm no expert in hardware routers, but it seems they keep the datapath (the switch core) simple and fast, and leave the complexity to the control plane. Though software routers do not compete with hardware routers, it is still valuable to improve datapath performance in areas such as cloud compute and fusion compute. PS. I've noticed that some vendors have products built around a router-on-NIC concept. I'm not sure whether FreeBSD currently supports them.

I see 2 problems that need to be solved in order to integrate this smoothly:

(1) Pass packet family to the if_output() routine somehow

(2) Have a proper prepend header.

The potential rough edges:

Source address selection

For source address selection, if the outbound interface has no address of the corresponding family (for example, IPv4 packets are routed via an IPv6 next-hop and the outbound interface has no globally routable IPv4 address, or vice versa), I think it can be treated the same as an unnumbered interface, and we can borrow the rules from the current RFCs.
For IPv6 it is RFC 6724, and for IPv4 I think it is RFC 1122 section 3.3.4.3 and RFC 1812.

inpcb LLE caching

If I remember correctly, there is LLE caching in route object, and inpcb caches the route.

slow path - sending packets when there is no LLE entry

In the current implementation, sending is blocked when there is no LLE entry; I wonder if we can re-queue the packets.

It may be also worth mentioning that, depending on the implementation approach, certain optimisations may be worth considering.
For example, it is not too hard to couple nexthops with the relevant lle, allowing nexthop to store pre-calculated prepend as a pointer and pass the prepend header to the if_output(), optimizing performance for all gateway-based routes.

Good idea!

In that case, the only scenario that needs to be addressed is the "slow path" one.

In a bit more detail:

For (1) we actually have at least 2 scenarios - IPv4 over IPv6 and vice versa.
So, one of the problems we need to solve is getting the "right" family inside the if_output(). I'm afraid that using mbuf flags to address both of the combinations will make the code more complex and add a notable number of branches to the fast path.

I agree. This is a draft, and I did not want to touch other components such as pfil when I started working on it. If there is enough interest in this feature, I'd like to elaborate on it.
I have considered adding one more parameter, af, to if_output() directly, as the original design of if_output() presumes that the gateway address family is the same as the packet's.

What if we leverage some spare fields in the struct route header instead? We can pre-fill in this for the inbcb_route and update callers like ip_output_send() to include on-stack struct route if it was not passed before. Additionally, we can consider updating the KBI for if_output() and require a shortened version of struct route, w/o ro_dst, to make it family-agnostic and reduce on-stack usage.

It is good to reduce on-stack usage, but I think it needs profiling.
To implement this feature, currently only if_output() needs the address family of the packets. We end up with these options:

Obtain af from the mbuf.

It works but is not performant.

Update mbuf and add an af member.

The af member is redundant but performant; since mbufs are passed across layers, this needs further discussion.

Update the KBI for if_output() and add an af parameter.

This assumes the caller knows the exact address family. On-stack usage seems to increase.
If if_output() always needs the address family, then it makes no difference. The performance of some wrapper interfaces such as if_vlan might be affected.

The solution you proposed above.

It works in theory. We also need to distinguish the on-stack route from cached ones, e.g. inp_route.

For (2) - proper prepend header - I was thinking of either
(a) having "stub" LLEs that can be looked up via arpresolve() with a "proper" prepend

I don't follow this. What are "stub" LLEs?

or (b) not using PCB caching if encap family differs from packet family, leveraging to-be-added nexthop ptr to the prepend data.

I don't follow this either. What is the encap family?

What do you think?

English is not my native tongue; hopefully I expressed myself clearly :)

gbe added a subscriber: gbe. May 24 2021, 11:38 AM

zlei updated this revision to Diff 90724. Jun 11 2021, 11:29 AM

Herald added a reviewer: cy. Jun 11 2021, 11:29 AM

Herald added a subscriber: donner.

Reuse route.ro_dst to get the address family of outbound / forwarded packets.

This version looks pretty neat, please see some comments inline.

sys/contrib/dpdk_rte_lpm/dpdk_lpm.c
144 ↗	(On Diff #90724)	Does it matter? The only thing that should be relevant for the lookup algos is the nexthop index.
sys/dev/iicbus/if_ic.c
374	worth having as a macro? something like `hdr = RO_GET_FAMILY(ro, dst)`
sys/net/route/route_ctl.c
637–638	Probably worth moving to a separate function to improve readability
sys/netinet/in_fib_dxr.c
341 ↗	(On Diff #90724)	This code doesn't care about nexthop internals other than nhop index.
sys/netinet/ip_output.c
536	Given `rt_update_ro_flags()` is static, it's probably worth just updating its signature to include `nh` parameter
sys/netpfil/ipfw/ip_fw_table_algo.c
3917 ↗	(On Diff #90724)	IIRC we don't care about the nexthop here at all
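To make the `RO_GET_FAMILY(ro, dst)` suggestion above concrete, here is a minimal sketch of how the packet family might be derived from `ro_dst` when the gateway family differs from the packet family. It is not the code from the diff: `struct sketch_route`, `SKETCH_RT_HAS_GW` and the helper names are hypothetical stand-ins, and the assumption that a flag marks "`ro_dst` holds the packet destination" is mine.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical, simplified stand-in for the kernel's struct route. */
#define SKETCH_RT_HAS_GW	0x01	/* assumption: ro_dst is the packet dst, gateway differs */

struct sketch_route {
	int		ro_flags;
	struct sockaddr	ro_dst;		/* destination of the packet being sent */
};

/*
 * Return the address family of the packet.  When a route with a gateway
 * is supplied, trust ro_dst (the packet destination); otherwise fall back
 * to the family of the sockaddr handed to the output routine.
 */
static inline int
sketch_ro_get_family(const struct sketch_route *ro, const struct sockaddr *dst)
{
	assert(dst != NULL);
	if (ro != NULL && (ro->ro_flags & SKETCH_RT_HAS_GW) != 0)
		return (ro->ro_dst.sa_family);
	return (dst->sa_family);
}

#define SKETCH_RO_GET_FAMILY(ro, dst)	sketch_ro_get_family((ro), (dst))
```

With an IPv4 packet routed via an IPv6 gateway, the sockaddr passed to the output routine is AF_INET6, while the helper still reports AF_INET from `ro_dst`.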

In D30398#683204, @zlei.huang_gmail.com wrote:

inpcb LLE caching

If I remember correctly, there is LLE caching in route object, and inpcb caches the route.

slow path - sending packets when there is no LLE entry

In the current implementation, sending is blocked when there is no LLE entry; I wonder if we can re-queue the packets.

ether_resolve_addr() will consume an mbuf and add it to the sending queue of the particular lle we're looking to resolve.
Once lle is resolved we iterate through this queue and re-send these using if_output() (end of arp_check_update_lle() for IPv4). Here we need to distinguish between IPv4 and IPv6 packets somehow - either by inspecting the IP version field or by having separate queues for IPv4/IPv6. Personally I'd experiment with the former approach.
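As an illustration of the "inspect the IP version field" option, the sketch below classifies a queued packet by the version nibble in its first byte. It works on a plain byte buffer rather than an mbuf, and `sketch_packet_family()` is a hypothetical helper, not a function from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/*
 * Classify a queued packet by the IP version nibble (the top four bits
 * of the first byte of both the IPv4 and the IPv6 header).
 * Returns AF_INET, AF_INET6, or AF_UNSPEC if the buffer is empty or
 * carries an unknown version.
 */
static int
sketch_packet_family(const uint8_t *pkt, size_t len)
{
	if (pkt == NULL || len < 1)
		return (AF_UNSPEC);

	switch (pkt[0] >> 4) {
	case 4:
		return (AF_INET);
	case 6:
		return (AF_INET6);
	default:
		return (AF_UNSPEC);
	}
}
```

The appeal of this approach is that a single hold queue per lle can be kept; the cost is one extra byte read per re-sent packet, which only happens on the slow path.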

For (1) we actually have at least 2 scenarios - IPv4 over IPv6 and vice versa.
So, one of the problems we need to solve is getting the "right" family inside the if_output(). I'm afraid that using mbuf flags to address both of the combinations will make the code more complex and add a notable number of branches to the fast path.

I agree. This is a draft, and I did not want to touch other components such as pfil when I started working on it. If there is enough interest in this feature, I'd like to elaborate on it.
I have considered adding one more parameter, af, to if_output() directly, as the original design of if_output() presumes that the gateway address family is the same as the packet's.

What if we leverage some spare fields in the struct route header instead? We can pre-fill in this for the inbcb_route and update callers like ip_output_send() to include on-stack struct route if it was not passed before. Additionally, we can consider updating the KBI for if_output() and require a shortened version of struct route, w/o ro_dst, to make it family-agnostic and reduce on-stack usage.

It is good to reduce on-stack usage, but I think it needs profiling.
To implement this feature, currently only if_output() needs the address family of the packets. We end up with these options:

Obtain af from the mbuf.

It works but is not performant.

Update mbuf and add an af member.

The af member is redundant but performant; since mbufs are passed across layers, this needs further discussion.

Update the KBI for if_output() and add an af parameter.

This assumes the caller knows the exact address family. On-stack usage seems to increase.
If if_output() always needs the address family, then it makes no difference. The performance of some wrapper interfaces such as if_vlan might be affected.

The solution you proposed above.

It works in theory. We also need to distinguish the on-stack route from cached ones, e.g. inp_route.

Always having struct route pointer in ip_output() seems to simplify things and I do like how this looks in the diff.

For (2) - proper prepend header - I was thinking of either
(a) having "stub" LLEs that can be looked up via arpresolve() with a "proper" prepend

I don't follow this. What are "stub" LLEs?

Sorry, I should have described it in a bit more detailed fashion.
Basically, my idea was to have separate LLEs for a combination of (ip, upper_layer_family), so each can store the correct prepend and be cacheable by the PCB layer.

or (b) not using PCB caching if encap family differs from packet family, leveraging to-be-added nexthop ptr to the prepend data.

I don't follow this either. What is the encap family?

Different wording, sorry. Packet family - upper layer family (e.g. IPv6 for IPv6 packet), encap family - family of the gw we look in the LLE table.

What do you think?

English is not my native tongue; hopefully I expressed myself clearly :)

Updated as @melifaro suggested.

zlei marked 4 inline comments as done. Jun 15 2021, 7:18 AM

zlei added inline comments.

sys/netinet/ip_output.c
541	Currently, source address selection in the IPv4 stack is simple, but in some cases (e.g. an unnumbered interface, or an interface that has only IPv6 addresses or IPv4 link-local addresses) it does not work well. ip_output and icmp_reflect are known to be affected. As the issue predates this feature, I'm planning to fix it in a separate diff.

zlei added inline comments. Jun 15 2021, 7:32 AM

sys/net/route/route_ctl.c
110	Routing IPv6 packets via IPv4 next-hops would not require much effort. If it is useful in some cases, I think we can put it in a separate diff. I'm no expert on this, so I'd appreciate it if someone could share use cases for routing IPv6 packets via IPv4 next-hops.

zlei added inline comments. Jun 15 2021, 7:37 AM

sys/net/if_ethersubr.c
377	This is a HACK to fix the link-layer type. It would be better for the lle cache to have the correct type. I'm still investigating it.

LGTM, I guess the biggest remaining piece now is lle handling, especially sending queued LLE packets upon successful resolution.

In D30398#691776, @melifaro wrote:

LGTM, I guess the biggest remaining piece now is lle handling, especially sending queued LLE packets upon successful resolution.

Done! The solution looks ugly, though.

In D30398#691776, @melifaro wrote:

LGTM, I guess the biggest remaining piece now is lle handling, especially sending queued LLE packets upon successful resolution.

sys/netinet6/nd6.c
2219 ↗	(On Diff #91340)	How about we reuse `is_gw` and add some flags like 'ENCAP_IPV4' ?
2489 ↗	(On Diff #91340)	Maybe we can stash only the encap family
sys/ofed/drivers/infiniband/core/ib_addr.c
408	Might it be better if `nd6_resolve()` returned the correct ether type?

melifaro added inline comments. Jun 25 2021, 10:00 PM

sys/netinet6/nd6.c
2219 ↗	(On Diff #91340)	I'd avoid having > 7 arguments here, especially given we don't need to pass anything except the upper layer family. Maybe just dedicating 1 byte of `is_gw` to pass the family would work.
2388–2389 ↗	(On Diff #91340)	I'd rather exit here with something like:
2489 ↗	(On Diff #91340)	I'd just store encap family, so it can be made unified across IPv4/IPv6. We can perfectly allocate structure on-stack, as there are no performance requirements.
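One way to read the "dedicate 1 byte of `is_gw`" suggestion is to pack the upper-layer family next to the gateway flag in a single integer argument. The sketch below shows that idea only; the bit layout and all `sketch_*` / `SKETCH_*` names are assumptions made for illustration, not the layout used by the diff.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical layout: bit 0 = "next hop is a gateway", bits 8..15 = family. */
#define SKETCH_GW_FLAG		0x01u
#define SKETCH_FAMILY_SHIFT	8
#define SKETCH_FAMILY_MASK	0xffu

/* Combine the gateway flag and the upper-layer family into one argument. */
static inline uint32_t
sketch_pack_gw_family(int is_gw, int family)
{
	assert(family >= 0 && family <= (int)SKETCH_FAMILY_MASK);
	return (((uint32_t)family << SKETCH_FAMILY_SHIFT) |
	    (is_gw ? SKETCH_GW_FLAG : 0u));
}

/* Recover the gateway flag from the packed value. */
static inline int
sketch_unpack_is_gw(uint32_t packed)
{
	return ((packed & SKETCH_GW_FLAG) != 0);
}

/* Recover the upper-layer family from the packed value. */
static inline int
sketch_unpack_family(uint32_t packed)
{
	return ((int)((packed >> SKETCH_FAMILY_SHIFT) & SKETCH_FAMILY_MASK));
}
```

This keeps the argument count of the resolve routine unchanged, at the price of callers having to build the combined value explicitly.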

I have some WIP patch on making nd6_resolve() return LLEs with proper encap. Hope to publish it later this week.

Rework sending queued LLE packets.

melifaro added inline comments. Aug 1 2021, 2:48 PM

sys/net/if_tuntap.c
1405	Do we need it?
sys/netgraph/ng_iface.c
374	Do we need it?

melifaro added a child revision: D31379: [lltable] Add support for "child" LLEs holding encap for IPv4oIPv6 entries. Aug 2 2021, 8:27 PM

I've added D31379 with the lltable support for IPvX over IPvY.
Basically, the diff creates per-family "child" lle entries attached to the main lle entry. The purpose of each child entry is to have an object with a proper encap, so it can be referenced just like standard lle.
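The "child" entry idea can be pictured with a small, self-contained data-structure sketch: a parent entry for the resolved neighbour, plus per-family children that each carry their own prepend. This is not the D31379 code; `struct sketch_lle` and its helpers are hypothetical, and an Ethernet link layer is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define SKETCH_ETHER_ADDR_LEN	6
#define SKETCH_ETHER_HDR_LEN	14

/* Hypothetical, heavily simplified link-layer entry. */
struct sketch_lle {
	int			 family;	/* upper-layer family this encap serves */
	uint8_t			 encap[SKETCH_ETHER_HDR_LEN];
	struct sketch_lle	*children;	/* head of the child list (parent only) */
	struct sketch_lle	*next;		/* next sibling (child only) */
};

/* Build an Ethernet header whose ethertype matches the upper-layer family. */
static void
sketch_lle_fill_encap(struct sketch_lle *lle, int family,
    const uint8_t *dst_mac, const uint8_t *src_mac)
{
	uint16_t etype = (family == AF_INET) ? 0x0800 : 0x86dd;

	assert(family == AF_INET || family == AF_INET6);
	lle->family = family;
	memcpy(&lle->encap[0], dst_mac, SKETCH_ETHER_ADDR_LEN);
	memcpy(&lle->encap[6], src_mac, SKETCH_ETHER_ADDR_LEN);
	lle->encap[12] = (uint8_t)(etype >> 8);
	lle->encap[13] = (uint8_t)(etype & 0xff);
}

/* Attach a per-family child entry to the main (parent) entry. */
static void
sketch_lle_add_child(struct sketch_lle *parent, struct sketch_lle *child)
{
	child->next = parent->children;
	parent->children = child;
}

/* Return the entry whose encap matches the packet family, or NULL. */
static struct sketch_lle *
sketch_lle_for_family(struct sketch_lle *parent, int family)
{
	struct sketch_lle *c;

	if (parent->family == family)
		return (parent);
	for (c = parent->children; c != NULL; c = c->next) {
		if (c->family == family)
			return (c);
	}
	return (NULL);
}
```

In this picture an IPv6 neighbour entry keeps its usual IPv6 ethertype, while an attached AF_INET child carries the same MAC addresses with the IPv4 ethertype, so a caller can cache and use the child exactly as it would a regular entry.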

I've updated the aforementioned D31379 to reflect the committed parts.
If you could update this review to use the new functionality (e.g. nd6_resolve() returning lle with the proper encap), that would be awesome.

melifaro added inline comments. Aug 7 2021, 11:05 AM

sys/net/route/route_ctl.c
110	let's remove V6 over V4 for now.
122	Could you also add feature(3) knob here so the userland can check if the support for the functionality exists?
598	Mind renaming to something like `check_gateway_family()` to match verb_description pattern for other static functions? Also: if you have cycles it would be good to spin up a separate diff, targeting moving the existing checks to a separate function. I can land it before this diff, thus simplifying this one.
sys/netgraph/netflow/netflow.c
367	Probably worth explicitly stating we're leaving an empty gateway here for IPv6 nexthops.

melifaro added a child revision: D31451: Simplify nhop operations in ip_output(). Aug 7 2021, 11:26 AM

melifaro added inline comments. Aug 8 2021, 9:43 AM

sys/dev/cxgbe/tom/t4_listen.c
1119–1122

In D30398#708795, @melifaro wrote:

I've updated the aforementioned D31379 to reflect the committed parts.
If you could update this review to use the new functionality (e.g. nd6_resolve() returning lle with the proper encap), that would be awesome.

I'll manage it in a few days :)

melifaro mentioned this in D31595: Route IPv4 packets via IPv6 next-hops - D30398 variation. Aug 17 2021, 10:55 PM

Rebased on latest main branch

@melifaro Sorry for late response ;)

I removed the LLE part, it should be easy to apply D31379 .

zlei marked 3 inline comments as done. Aug 18 2021, 4:47 AM

zlei added inline comments.

sys/net/if_tuntap.c
1405	I'll test P-t-P devices and report later.
sys/net/route/route_ctl.c
122	I'm new to this feature(3) knob. Can you guide me please ?
sys/ofed/drivers/infiniband/core/ib_addr.c
408	This is in D31379 IIUC.

.

sys/net/if_tuntap.c
1405	If we intend to support a route like `route add x.x.x.x -inet6 yy::`, where `yy::` is the IPv6 address of the peer of a P-t-P interface, we still need it. Otherwise bpf will consume the wrong address family.
sys/netgraph/ng_iface.c
374	I think it is the same as for `sys/net/if_tuntap.c` above.

melifaro added inline comments. Aug 18 2021, 7:51 AM

sys/net/if_infiniband.c
390	Not needed, we should pass proper family to `infiniband_resolve_addr()`

melifaro added inline comments. Aug 18 2021, 7:56 AM

sys/net/route/route_ctl.c
122	Something like `FEATURE(ipv4_rfc5549_support, "Route IPv4 packets via IPv6 nexthops");` You can check `sysctl kern.features` for the features that currently exist.

melifaro added inline comments. Aug 18 2021, 8:18 AM

sys/net/if_ethersubr.c
238–243
375	Not needed anymore.

In D30398#712270, @zlei.huang_gmail.com wrote:

@melifaro Sorry for late response ;)

I removed the LLE part, it should be easy to apply D31379 .

It should be the other way round :-) e.g. D31379 is a pre-requisite.
Could you try to apply it first, build on top and test?

Cleaned up some comments.

Added feature(3) knob.

In D30398#712308, @melifaro wrote:

In D30398#712270, @zlei.huang_gmail.com wrote:

@melifaro Sorry for late response ;)

I removed the LLE part, it should be easy to apply D31379 .

It should be the other way round :-) e.g. D31379 is a pre-requisite.
Could you try to apply it first, build on top and test?

OK, I'll test and report later.

You also need to add a bit of family wrapping logic inside fill_nhop_from_info(), so we get a proper family for the nexthop.

Also: will you write a commit message, or do you prefer me doing it?

In D30398#712308, @melifaro wrote:

In D30398#712270, @zlei.huang_gmail.com wrote:

@melifaro Sorry for late response ;)

I removed the LLE part, it should be easy to apply D31379 .

It should be the other way round :-) e.g. D31379 is a pre-requisite.
Could you try to apply it first, build on top and test?

So far so good :)

Pass correct AF to nd6_resolve()

.

In D30398#713040, @melifaro wrote:

You also need to add a bit of family wrapping logic inside fill_nhop_from_info(), so we get a proper family for the nexthop.

I'll look at it.

In D30398#713041, @melifaro wrote:

Also: will you write a commit message, or do you prefer me doing it?

This is quite a large change to the codebase. I'm not sure I can do it well.
I would appreciate it if you could do it.

melifaro accepted this revision. Aug 20 2021, 9:38 PM

melifaro added reviewers: network, ae, olivier, hselasky.

This revision is now accepted and ready to land. Aug 20 2021, 9:39 PM

hselasky added inline comments. Aug 21 2021, 7:18 PM

sys/net/if_infiniband.c
374	Maybe move the "int af = ..." down here, if this is the only place it is used.

Rebased on latest main branch.
Moved down RO_GET_FAMILY()

This revision now requires review to proceed. Aug 22 2021, 3:47 PM

Done.

This revision was not accepted when it landed; it landed in state Needs Review. Aug 22 2021, 10:58 PM

Closed by commit rG62e1a437f328: routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549). (authored by zlei, committed by melifaro).

This revision was automatically updated to reflect the committed changes.

melifaro added a commit: rG62e1a437f328: routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549)..

@melifaro Thanks very much!

melifaro added a commit: rGe8df60a69a0e: routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549). Sep 7 2021, 9:31 PM

melifaro mentioned this in D18581: Add ability to forward IPv4 packets trough IPv6 only router. Dec 10 2021, 9:47 PM

zlei mentioned this in D49172: kern: wg: remove overly-restrictive address family check. Sun, Mar 2, 1:51 AM


Diff 94034

sys/contrib/ipfilter/netinet/ip_fil_freebsd.c

sys/dev/cxgbe/tom/t4_listen.c

sys/dev/iicbus/if_ic.c

sys/net/debugnet.c

sys/net/if_disc.c

sys/net/if_ethersubr.c

sys/net/if_fwsubr.c

sys/net/if_gif.c

sys/net/if_gre.c

sys/net/if_infiniband.c

sys/net/if_loop.c

sys/net/if_me.c

sys/net/if_spppsubr.c

sys/net/if_tuntap.c

sys/net/route.h

sys/net/route/route_ctl.c

sys/netgraph/netflow/netflow.c

sys/netgraph/ng_iface.c

sys/netinet/ip_fastfwd.c

sys/netinet/ip_input.c

sys/netinet/ip_output.c

sys/netinet/toecore.c

sys/ofed/drivers/infiniband/core/ib_addr.c
