Page MenuHomeFreeBSD

Streamline infiniband code according to ethernet code
ClosedPublic

Authored by hselasky on Dec 16 2020, 12:10 PM.
Tags
None
Referenced Files
F102654169: D27631.id81062.diff
Fri, Nov 15, 9:53 AM
F102639697: D27631.id81062.diff
Fri, Nov 15, 5:00 AM
Unknown Object (File)
Thu, Nov 14, 10:45 PM
Unknown Object (File)
Thu, Nov 14, 5:39 PM
Unknown Object (File)
Thu, Nov 14, 3:59 PM
Unknown Object (File)
Thu, Nov 14, 3:08 PM
Unknown Object (File)
Thu, Nov 14, 10:02 AM
Unknown Object (File)
Thu, Nov 14, 9:58 AM
Subscribers

Details

Summary

MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking

Test Plan

Test OK

Server side:

  1. kldload mlx5en mlx5ib ipoib
  2. ifconfig ib0 inet6 1::2/64 up
  3. iperf -V -s

Client side:

  1. kldload mlx5en mlx5ib ipoib
  2. ifconfig ib0 inet6 1::1/64 up
  3. opensm -d mlx5_0 -B
  4. iperf -V -i 1 -t 10 -c 1::2

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Why ?

RFC 4391 section 8 IPv6 Stateless Autoconfiguration is quite clear IMO. There are local IPoIB addresses, with recommended format.

It might be the issue is somewhere else, but I see an issue with IPv6 neighbour solicitation being broken. Basically the Link-local address is not generated according the specification you mentioned, so the IB switch doesn't recognize it and distribute it as it should. Ping6 works (ignore the duplicate response). See below:

1 0.000000 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
2 1.024207 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
3 1.000161 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
4 1.199860 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
5 1.000025 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
6 0.999998 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
7 9.683235 1::1 1::2 ICMPv6 70 Echo (ping) request id=0x22e9, seq=0, hop limit=64 (no response found!)
8 0.000154 1::1 1::2 ICMPv6 70 Echo (ping) request id=0x22e9, seq=0, hop limit=64 (reply in 9)
9 0.000600 1::2 1::1 ICMPv6 70 Echo (ping) reply id=0x22e9, seq=0, hop limit=64 (request in 8)
10 0.000060 1::2 1::1 ICMPv6 70 Echo (ping) reply id=0x22e9, seq=0, hop limit=64
11 1.015989 1::1 1::2 ICMPv6 70 Echo (ping) request id=0x22e9, seq=1, hop limit=64 (no response found!)
12 0.000074 1::1 1::2 ICMPv6 70 Echo (ping) request id=0x22e9, seq=1, hop limit=64 (reply in 13)
13 0.000310 1::2 1::1 ICMPv6 70 Echo (ping) reply id=0x22e9, seq=1, hop limit=64 (request in 12)
14 0.000046 1::2 1::1 ICMPv6 70 Echo (ping) reply id=0x22e9, seq=1, hop limit=64

I do not understand the later claim as well.

NS is distributed to the members of specific multicast groups, which all hosts must join. It is not distributed based on the local address. Also, IB switches should not deal with IPoIB addresses at all, IPoIB is something like L4, out of scope.

Also note that the format of local addresses from RFC is recommended but not required. But I believe that LL is required, so you cannot disable it at whim.

IB switches should not deal with IPoIB addresses

IB switches has no concept of a L2. There may only be one uniq IP per port from what I know, which the switch needs to learn.

I will investigate a bit more before concluding this is the right way to go.

--HPS

This patch is wrong. I'll updated. It is multicast not link local IPv6, which is the issue. My bad. I'll update this patch.

hselasky retitled this revision from Disable link-local address generation for IPv6 over infiniband to Fix transmission of multicast ICMPv6 traffic for IPoIB.
hselasky edited the summary of this revision. (Show Details)

Updated patch.

sys/net/if_infiniband.c
255

This seems weird as well to me; I would assume that non-directed ICMPv6 traffic has M_MCAST set the below clause would handle that?

Why is ICMPv6 special at all compared to TCP or UDP? While link-layer address resolution works on ICMPv6 level it is not a separate protocol like ARP for IPv4 but above IPv6.

sys/net/if_infiniband.c
186–187

@bz: Have a look here, this is what we do for IPv4.

I'll have a closer look at it.

sys/net/if_infiniband.c
255

It might be the answer is that multicast handling in IPoIB is broken. I need to investigate this.

hselasky retitled this revision from Fix transmission of multicast ICMPv6 traffic for IPoIB to Fix handling of broadcast traffic for IP over infiniband.
hselasky edited the summary of this revision. (Show Details)

I think I got it right now, but need to test it a bit more.

sys/net/if_infiniband.c
254

Given there is no BCAST in IPv6, I wonder what this means?

hselasky retitled this revision from Fix handling of broadcast traffic for IP over infiniband to Broadcast fixes for IP over infiniband.
hselasky edited the summary of this revision. (Show Details)

Address comments from @bz .

hselasky added inline comments.
sys/net/if_infiniband.c
254

I just wanted to cases to be symmetric, but like you say there is no BCAST in IPv6, so now the packet is simply dropped.

sys/net/if_infiniband.c
184–185

Sorry, side note: would it be possible to move arp/nd encap logic to if_requesencap callback, so we have it consistent with ether_output() and simplify datapath code?

hselasky added inline comments.
sys/net/if_infiniband.c
184–185

Yes. Weren't you supposed to do that? ;-)

hselasky retitled this revision from Broadcast fixes for IP over infiniband to Streamline infiniband code according to ethernet code.
hselasky edited the summary of this revision. (Show Details)
hselasky added a reviewer: melifaro.

Implement suggestion from @melifaro .

Fix some typos. Zero reserved field.

Fix spelling.

iperf IPv4 and IPv6 test OK.

LGTM.
If it passed IPv4/IPv6 tests and solves the original problem than - good to go? :-)

sys/net/if_infiniband.c
184–185

Yes. Weren't you supposed to do that? ;-)

Well, I guess we misread our dialogue in D26254 :-( Also, unfortunately I don't have IB-capable HW handy.

This revision is now accepted and ready to land.Dec 22 2020, 6:30 PM

It looks good so far, but I want to test it a bit more first.