
netlink: add netlink support
ClosedPublic

Authored by melifaro on Jul 31 2022, 10:49 AM.
Referenced Files
F102428483: D36002.id111273.diff
Tue, Nov 12, 3:39 AM

Details

Summary

What is netlink?

Netlink is a communication protocol currently used in the Linux kernel to modify, read and subscribe to nearly all networking state. Interface state, addresses, routes, firewall, rules, FIBs, etc. are controlled via netlink.
It is an asynchronous, TLV-based protocol providing 1-1 and 1-many communication.
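
To make "TLV-based" concrete, here is a minimal sketch of the wire layout: a 16-byte message header followed by type-length-value attributes, each padded to a 4-byte boundary. The header and attribute layouts follow the standard netlink definitions; the message type and attribute numbers below are hypothetical values chosen only for illustration, and nothing here is taken from the code in this diff.

```python
import struct

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")
# struct nlattr: nla_len, nla_type
NLATTR_HDR = struct.Struct("=HH")


def align4(n):
    """Round n up to the next multiple of 4 (netlink alignment)."""
    return (n + 3) & ~3


def pack_attr(attr_type, value):
    """Encode one attribute: header, value, then zero padding."""
    length = NLATTR_HDR.size + len(value)  # nla_len excludes the padding
    padding = b"\0" * (align4(length) - length)
    return NLATTR_HDR.pack(length, attr_type) + value + padding


def pack_message(msg_type, flags, seq, payload):
    """Prepend a netlink header to an already-encoded payload."""
    total = NLMSG_HDR.size + len(payload)
    return NLMSG_HDR.pack(total, msg_type, flags, seq, 0) + payload


def parse_attrs(data):
    """Walk a buffer of attributes and return (type, value) pairs."""
    attrs = []
    offset = 0
    while offset + NLATTR_HDR.size <= len(data):
        length, attr_type = NLATTR_HDR.unpack_from(data, offset)
        if length < NLATTR_HDR.size or offset + length > len(data):
            break  # malformed attribute, stop parsing
        value = data[offset + NLATTR_HDR.size:offset + length]
        attrs.append((attr_type, value))
        offset += align4(length)
    return attrs


# Hypothetical message and attribute types, for illustration only.
DEMO_MSG_TYPE = 100
DEMO_ATTR_ADDR = 1
DEMO_ATTR_NAME = 2

payload = (pack_attr(DEMO_ATTR_ADDR, bytes([10, 0, 0, 33]))
           + pack_attr(DEMO_ATTR_NAME, b"vtnet0"))
message = pack_message(DEMO_MSG_TYPE, 0x01, 1, payload)  # 0x01 = NLM_F_REQUEST
decoded = parse_attrs(message[NLMSG_HDR.size:])
```

Because every attribute carries its own type and length, a parser can skip attributes it does not understand, which is what makes the format extensible.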

Why is netlink important for FreeBSD?

POSIX defines an API for base functions and system calls. There is no such standard for the plethora of protocol-level, device-level and subsystem-level ioctls. Each subsystem or driver invents its own protocol and handles formatting and compatibility on its own.
Netlink changes that by providing a standard communication layer and basic extensible message formatting. It can serve as a "broker", automatically combining requested data from different sources in a single request (example: interface state dump).

For example, devd can easily be switched to netlink, retiring its one-off protocol. Tools like jail and pfilctl can be converted to use netlink instead of a bunch of private ioctls. It will be easier for app developers to interact with our network stack.

Immediate drivers for netlink

Nexthop and nexthop-group-related changes in the routing stack opened the way for more effective and feature-rich route-related interaction between userland and kernel. Extending our existing protocol, rtsock(4), is not easy: to provide efficient multipath signalling, one needs to introduce a new type of message, other than RTM_ADD/RTM_DEL, for signalling route changes. I did a test implementation with extended rtsock and net/bird. De facto it ended up being its own TLV-based protocol, sharing nothing with rtsock except the socket and base message header. Pushing out a new protocol that is not even shared by other BSDs doesn't look promising.
Instead, netlink was chosen as a transport.

Implementation overview

The initial implementation was written during GSoC 2021, based on Luigi's work from 2015. It was not possible to write a full-featured netlink implementation in the time allocated for GSoC, especially given that it was shorter than previous GSoCs. As a result, the initial implementation delivered some working core code, allowing netlink sockets to be used in the kernel and route table manipulations to be performed. The code in this diff derives from that implementation, but has largely been rewritten to address large netlink message support, large dump support, locking, socket/vnet specifics and so on.

Netlink is implemented as a loadable/unloadable kernel module that touches few other kernel parts.
To support asynchronous operations such as interface creation, a dedicated taskqueue is created for each netlink socket. All message processing is handled within these taskqueues.
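
As a rough userland analogy for the per-socket queueing model (not the kernel code, which uses kernel taskqueues rather than threads), the sketch below gives each socket its own work queue and worker, so messages from one socket are processed in order without blocking the sender. The class and handler names are hypothetical.

```python
import queue
import threading


class SocketTaskQueue:
    """One FIFO work queue and one worker per (simulated) socket."""

    def __init__(self, handler):
        self._handler = handler
        self._queue = queue.Queue()
        self.results = []
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, message):
        """Called by the sender; returns immediately."""
        self._queue.put(message)

    def _run(self):
        # Messages are handled strictly in arrival order.
        while True:
            message = self._queue.get()
            if message is None:  # sentinel: socket is closing
                break
            self.results.append(self._handler(message))

    def close(self):
        """Drain outstanding work, then stop the worker."""
        self._queue.put(None)
        self._worker.join()


def demo_handler(message):
    # Stand-in for real message processing.
    return "processed " + message


tq = SocketTaskQueue(demo_handler)
for name in ("create vtnet0.77", "dump routes"):
    tq.enqueue(name)
tq.close()
```

A slow request on one socket only delays later requests on that same socket, which is the property the per-socket design is after.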

Handling messages to/from Linux processes requires modifying them (address family and rtableid rewrites), which is performed by a transparent intercept layer. This layer can rewrite messages, up to and including full message reconstruction.
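
A minimal sketch of the address family part of such a rewrite, assuming a route-style message whose payload starts with a one-byte family field (as struct rtmsg does). AF_INET6 is 10 on Linux and 28 on FreeBSD, while AF_INET is 2 on both. The function name and translation table are hypothetical; rtableid rewrites and full message reconstruction are not modelled here.

```python
import struct

NLMSG_HDRLEN = 16  # size of struct nlmsghdr

# Linux -> FreeBSD address family numbers (only the two used here).
LINUX_TO_FREEBSD_AF = {2: 2, 10: 28}


def rewrite_family(message, table):
    """Return a copy of message with the leading family byte translated."""
    if len(message) < NLMSG_HDRLEN + 1:
        raise ValueError("message too short to carry a family byte")
    family = message[NLMSG_HDRLEN]
    new_family = table.get(family, family)  # unknown families pass through
    return (message[:NLMSG_HDRLEN]
            + bytes([new_family])
            + message[NLMSG_HDRLEN + 1:])


# Sample request as a Linux binary would send it:
# nlmsghdr (RTM_GETROUTE = 26, NLM_F_REQUEST = 1) + rtmsg with family 10.
rtmsg = struct.pack("=BBBBBBBBI", 10, 0, 0, 0, 0, 0, 0, 0, 0)
linux_msg = struct.pack("=IHHII", NLMSG_HDRLEN + len(rtmsg), 26, 1, 1, 0) + rtmsg

freebsd_msg = rewrite_family(linux_msg, LINUX_TO_FREEBSD_AF)
```

Replies travelling in the other direction would need the inverse table, and any attribute that embeds a family number would need the same treatment.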

What works

  • Dumps:
    • routes
    • nexthops / nexthop groups
    • interfaces
    • interface addresses
    • neighbours (arp/ndp)
    • genetlink families & ops
  • Notifications
    • interface arrival/departure
    • interface address arrival/departure
    • route addition/deletion (from kernel, netlink and rtsock)
  • Modifications
    • adding/deleting routes
    • adding/deleting nexthops/nexthop groups
    • adding/deleting neighbours (arp/ndp)
    • adding/deleting interfaces (only basic version, no properties/updates supported)

Next steps

  • To add nhop/nhg support
  • To add interface change notifications
  • To add rtsock -> netlink notifications
  • To add netlink mpath route creation w/o user nexthop groups
  • To add generic netlink
  • To add ext ack support
  • Simplify netlink<>rtsock bridge
  • Neighbour notifications
  • Ifaddr change notifications
  • Genetlink dummy module
  • Tests
  • Manual pages

Test Plan

For now, the latest Linux iproute2 ip binary is used as an integration test tool. A dedicated set of lower-level tests will be added later.

Commands:

-> ip -V
ip utility, iproute2-5.18.0

-> kldload netlink

-> ip r show
0.0.0.0/0 nhid 4 via 10.0.0.1 dev vtnet0 proto static
10.0.0.0/24 nhid 2 dev vtnet0 proto kernel
10.0.0.8 nhid 3 dev lo0 proto static
127.0.0.1 nhid 1 dev lo0 proto kernel

-> ip nexthop add id 137 via 10.0.0.33 dev vtnet0
-> ip nexthop add id 138 via 10.2.0.44 dev vtnet0.2
-> ip nexthop add id 24 group 137,5/138,10
-> ip nexthop
id 24 group 137,5/138,10
id 137 via 10.0.0.33 dev vtnet0
id 138 via 10.2.0.44 dev vtnet0.2

-> ip route add 11.0.0.0/24 nhid 24
-> ip r sh
0.0.0.0/0 via 10.0.0.1 dev vtnet0 proto static
10.0.0.0/24 dev vtnet0 proto kernel
10.0.0.4 dev lo0 proto static
10.2.0.0/24 dev vtnet0.2 proto kernel
10.2.0.1 dev lo0 proto static
11.0.0.0/24 nhid 24 proto unspec
	nexthop via 10.0.0.33 dev vtnet0 weight 5
	nexthop via 10.2.0.44 dev vtnet0.2 weight 10
127.0.0.1 dev lo0 proto kernel


-> ip -6 r sh
prohibit ::/96 nhid 6 via ::1 dev lo0 proto static
::/0 nhid 7 via 2a01:4f8:13a:70c:ffff::1 dev vtnet0 proto static
::1 nhid 1 dev lo0 proto static
prohibit ::ffff:0.0.0.0/96 nhid 6 via ::1 dev lo0 proto static
2a01:4f8:13a:70c:ffff::/96 nhid 5 dev vtnet0 proto kernel
2a01:4f8:13a:70c:ffff::8 nhid 4 dev lo0 proto static
prohibit fe80::/10 nhid 6 via ::1 dev lo0 proto static
fe80::/64 nhid 5 dev vtnet0 proto kernel
fe80::5054:ff:fe14:e319 nhid 4 dev lo0 proto static
fe80::/64 nhid 3 dev lo0 proto kernel
fe80::1 nhid 2 dev lo0 proto static
prohibit ff02::/16 nhid 6 via ::1 dev lo0 proto static

-> ip r add 11.0.0.0/24 via 10.0.0.9
-> ip r show
0.0.0.0/0 nhid 4 via 10.0.0.1 dev vtnet0 proto static
10.0.0.0/24 nhid 2 dev vtnet0 proto kernel
10.0.0.8 nhid 3 dev lo0 proto static
11.0.0.0/24 nhid 5 via 10.0.0.9 dev vtnet0 proto kernel
127.0.0.1 nhid 1 dev lo0 proto kernel

-> ip r del 11.0.0.0/24
-> ip r show
0.0.0.0/0 nhid 4 via 10.0.0.1 dev vtnet0 proto static
10.0.0.0/24 nhid 2 dev vtnet0 proto kernel
10.0.0.8 nhid 3 dev lo0 proto static
127.0.0.1 nhid 1 dev lo0 proto kernel

-> ip l add link vtnet0 name vtnet0.77 type vlan id 77
-> ip l show vtnet0.77
3: vtnet0.77: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/[135] 52:54:00:14:e3:19 brd ff:ff:ff:ff:ff:ff

-> ifconfig vtnet0.77
vtnet0.77: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=480403<RXCSUM,TXCSUM,LRO,LINKSTATE,TXCSUM_IPV6>
	ether 52:54:00:14:e3:19
	groups: vlan
	vlan: 77 vlanproto: 802.1q vlanpcp: 0 parent interface: vtnet0
	media: Ethernet autoselect (10Gbase-T <full-duplex>)
	status: active
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

-> ip l del vtnet0.77
-> ip l
1: vtnet0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 52:54:00:14:e3:19 brd ff:ff:ff:ff:ff:ff
2: lo0: <LOOPBACK,MULTICAST,UP> mtu 16384 qdisc noqueue state UP qlen 1000
    link/ieee1394 08


-> ip a sh
1: vtnet0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 52:54:00:14:e3:19 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.4 peer 10.0.0.255/24 scope global dynamic vtnet0
    inet6 fe80::5054:ff:fe14:e319/64 scope link dynamic vtnet0
    inet6 2a01:4f9:3a:fa00:5054:ff:fe14:e319/64 scope global dynamic vtnet0
2: lo0: <LOOPBACK,MULTICAST,UP> mtu 16384 qdisc noqueue state UP qlen 1000
    link/ieee1394 08
    inet6 ::1/128 scope host dynamic lo0
    inet6 fe80::1/64 scope link dynamic lo0
    inet 127.0.0.1/8 scope global dynamic lo0

-> ip neigh sh
10.0.0.1 dev vtnet0 lladdr 52:54:00:8c:63:e9 REACHABLE
10.0.0.4 dev vtnet0 lladdr 52:54:00:14:e3:19 REACHABLE
2a01:4f9:3a:fa00:5054:ff:fe14:e319 dev vtnet0 lladdr 52:54:00:14:e3:19 REACHABLE
fe80::5054:ff:fe14:e319 dev vtnet0 lladdr 52:54:00:14:e3:19 REACHABLE
fe80::5054:ff:fe8c:63e9 dev vtnet0 lladdr 52:54:00:8c:63:e9 router STALE

-> kldunload netlink
->

Events:

-> ifconfig vtnet0.2 create
3: vtnet0.2: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 08

-> ifconfig vtnet0.2 inet 10.11.0.1/24
3: vtnet0.2    inet 10.11.0.1 peer 10.11.0.255/24 scope global dynamic vtnet0.2
3: vtnet0.2    inet6 fe80::5054:ff:fe14:e319/64 scope link dynamic vtnet0.2

-> ifconfig vtnet0.2 inet6 2a02:6b8::35/64
3: vtnet0.2    inet6 2a02:6b8::35/64 scope global dynamic vtnet0.2

-> ifconfig vtnet0.2 inet6 2a02:6b8::35 delete
Deleted 3: vtnet0.2    inet6 2a02:6b8::35/64 scope global dynamic vtnet0.2

-> ifconfig vtnet0.2 -alias 10.11.0.1
Deleted 3: vtnet0.2    inet 10.11.0.1 peer 10.11.0.255/24 scope global dynamic vtnet0.2

-> ifconfig vtnet0.2 destroy
Deleted 3: vtnet0.2    inet6 fe80::5054:ff:fe14:e319/64 scope link dynamic vtnet0.2
Deleted 3: vtnet0.2: <BROADCAST,SLAVE,DYNAMIC,200000> mtu 1500 state UNKNOWN
    link/[135] 52:54:00:14:e3:19

-> route -n monitor
-> ip r add 10.0.0.33 via 10.0.0.2

got message of size 184 on Mon Aug 29 18:26:43 2022
RTM_ADD: Add Route: len 184, pid: 0, seq 0, errno 0, flags:<UP,GATEWAY,HOST,DONE>
locks:  inits:
sockaddrs: <DST,GATEWAY>
 10.0.0.33 10.0.0.2

-> ip r del 10.0.0.33

got message of size 184 on Mon Aug 29 18:26:47 2022
RTM_DELETE: Delete Route: len 184, pid: 0, seq 0, errno 0, flags:<GATEWAY,HOST,DONE>
locks:  inits:
sockaddrs: <DST,GATEWAY>
 10.0.0.33 10.0.0.2

Generic netlink:

-> genl-ctrl-list -d
0x0010 nlctrl version 2
    hdrsize 0 maxattr 10
      op GETFAMILY (0x03) <admin_perm,has_doit,has_dump>
      grp notify (0x30)
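
For reference, a sketch of the kind of request a tool would send to obtain the listing above: a GETFAMILY command addressed to the nlctrl family (id 0x10), with the family name carried as an attribute. The constants are the standard generic netlink values; the function name is hypothetical, and using version 2 in the genl header is an assumption based on the listing, not something this diff specifies.

```python
import struct

GENL_ID_CTRL = 0x10        # nlctrl family id, as shown in the listing
CTRL_CMD_GETFAMILY = 3     # GETFAMILY op, as shown in the listing
CTRL_ATTR_FAMILY_NAME = 2  # attribute carrying the family name
NLM_F_REQUEST = 0x01


def build_getfamily(name, seq):
    """Encode nlmsghdr + genlmsghdr + one family-name attribute."""
    value = name.encode("ascii") + b"\0"  # NUL-terminated string
    attr_len = 4 + len(value)
    attr = struct.pack("=HH", attr_len, CTRL_ATTR_FAMILY_NAME) + value
    attr += b"\0" * (-attr_len % 4)       # pad to a 4-byte boundary
    # struct genlmsghdr: cmd, version, reserved
    genl = struct.pack("=BBH", CTRL_CMD_GETFAMILY, 2, 0)
    total = 16 + len(genl) + len(attr)
    header = struct.pack("=IHHII", total, GENL_ID_CTRL, NLM_F_REQUEST, seq, 0)
    return header + genl + attr


request = build_getfamily("nlctrl", 1)
```

The resulting bytes could be written to a generic netlink socket; the reply would carry the family id, version and op list as attributes.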

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

There are a very large number of changes, so older changes are hidden.
share/man/man4/netlink.4
140 ↗(On Diff #111199)
149 ↗(On Diff #111199)
150 ↗(On Diff #111199)

Why not .Dv?

155–156 ↗(On Diff #111199)
171 ↗(On Diff #111199)

Most, not all?

180–182 ↗(On Diff #111199)
This revision now requires changes to proceed. Sep 29 2022, 8:16 PM
share/man/man4/netlink.4
40 ↗(On Diff #111199)

Don't use .Tn, it's deprecated (see mdoc(7)). bcr's "mandoc -T lint" comment is also applicable here.

What should I use instead (if any)?

Will go through rtnetlink.4 later. For now, after fixing typos I reported, could you spellcheck as https://docs.freebsd.org/en/books/fdp-primer/manual-pages/#manual-pages-testing explains?

share/man/man4/netlink.4
40 ↗(On Diff #111199)

Short answer: nothing

[pauamma@gadfly] ~/FreeBSD/src% git grep -E 'Intel|AMD|Linux' | grep -E '\.[1-9]:' | grep -Ev '^contrib/' | wc -l
    1382
[pauamma@gadfly] ~/FreeBSD/src% git grep -E '\.Tn +(Intel|AMD|Linux)' | grep -E '\.[1-9]:' | grep -Ev '^contrib/' | wc -l
     133

Using 3 common trademarked names as a sample, this points to .Tn being the exception and no markup being the overwhelming majority.

mdoc(7) says this about .Tn:

Supported only for compatibility, do not use this in new manuals.
Even though the macro name (“tradename”) suggests a semantic
function, historic usage is inconsistent, mostly using it as a
presentation-level macro to request a small caps font.

And https://docs.freebsd.org/en/books/fdp-primer/manual-pages/#manual-pages-markup recommends not using those:

There is some appearance-based markup which is usually best avoided.

183 ↗(On Diff #111199)
191 ↗(On Diff #111199)
192 ↗(On Diff #111199)
197 ↗(On Diff #111199)
199 ↗(On Diff #111199)
201 ↗(On Diff #111199)

Is that what you meant?

212 ↗(On Diff #111199)
215 ↗(On Diff #111199)
217 ↗(On Diff #111199)
221 ↗(On Diff #111199)
224 ↗(On Diff #111199)
226 ↗(On Diff #111199)
227 ↗(On Diff #111199)
228 ↗(On Diff #111199)
230 ↗(On Diff #111199)

Unless all-uppercase is needed here.

233 ↗(On Diff #111199)
235 ↗(On Diff #111199)
238 ↗(On Diff #111199)
242 ↗(On Diff #111199)
244 ↗(On Diff #111199)
254 ↗(On Diff #111199)

.Fa again?

257 ↗(On Diff #111199)
260 ↗(On Diff #111199)
267 ↗(On Diff #111199)

Does

#define NETLINK_FIREWALL	3	/* not supported */

mean there's no firewall (yet?), or no sockopt for it?

272 ↗(On Diff #111199)
278 ↗(On Diff #111199)
283 ↗(On Diff #111199)

"per-file"? Do you mean "per-socket"?

304 ↗(On Diff #111199)

Should say how it reports success. error = 0?

321 ↗(On Diff #111199)
333 ↗(On Diff #111199)

Does that exist already?

melifaro marked 45 inline comments as done.
  • Use STAILQ mbuf macros in nl_io_queue
  • Update man pages to reflect the comments
share/man/man4/netlink.4
171 ↗(On Diff #111199)

Some messages, such as NLMSG_DONE, simply put the error code as a single u32 value immediately after the header instead of enclosing it in a TLV. There are just 2-3 such occurrences in the protocol, but they do exist.
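
A small sketch of what that exception looks like to a parser, assuming the 32-bit value sits directly after the 16-byte header. NLMSG_DONE is message type 3 and NLM_F_MULTI is flag 0x02 in the standard netlink headers; reading the value as a signed integer is an assumption made here, and the function name is hypothetical.

```python
import struct

NLMSG_DONE = 3
NLM_F_MULTI = 0x02
HDR = struct.Struct("=IHHII")  # len, type, flags, seq, pid


def parse_done(message):
    """Return the status value carried by an NLMSG_DONE message."""
    length, msg_type, _flags, _seq, _pid = HDR.unpack_from(message)
    if msg_type != NLMSG_DONE:
        raise ValueError("not an NLMSG_DONE message")
    if length < HDR.size + 4 or len(message) < HDR.size + 4:
        raise ValueError("NLMSG_DONE without a status value")
    # No TLV wrapper: the value follows the header directly.
    (status,) = struct.unpack_from("=i", message, HDR.size)
    return status


# End-of-dump marker with status 0 (success).
done_msg = HDR.pack(HDR.size + 4, NLMSG_DONE, NLM_F_MULTI, 7, 0) + struct.pack("=i", 0)
done_status = parse_done(done_msg)
```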

201 ↗(On Diff #111199)

Yes, thank you!

267 ↗(On Diff #111199)

Whoops. Part of this section was copied from ipfw(8) and not rephrased :-) Should be better now. Thank you for noticing!

283 ↗(On Diff #111199)

No, it's indeed per-(source)-file:

13:23 [0] m@devel2 sysctl net.netlink.debug
net.netlink.debug.nl_linux_debug_level: 7
net.netlink.debug.nl_route_debug_level: 7
net.netlink.debug.nl_nhop_debug_level: 9
net.netlink.debug.nl_neigh_debug_level: 7
net.netlink.debug.nl_iface_drivers_debug_level: 7
net.netlink.debug.nl_iface_debug_level: 7
net.netlink.debug.nl_route_core_debug_level: 7
net.netlink.debug.nl_generic_debug_level: 9
net.netlink.debug.nl_writer_debug_level: 7
net.netlink.debug.nl_parser_debug_level: 7
net.netlink.debug.nl_io_debug_level: 7
net.netlink.debug.nl_domain_debug_level: 7
net.netlink.debug.nl_mod_debug_level: 7
333 ↗(On Diff #111199)

Nyet. Should be added soon.

share/man/man4/netlink.4
171 ↗(On Diff #111199)

Would something like "Most messages encode their attributes as type-length-value pairs" be better, assuming that's actually the case -- that most messages are all TLV, except a few that are entirely non-TLV?

283 ↗(On Diff #111199)

Source file granularity is probably not obvious to end users though, something like controllable "per functional area" or so might be clearer?

melifaro marked an inline comment as done.

Address emaste@ manpage comments.

melifaro added inline comments.
share/man/man4/netlink.4
127 ↗(On Diff #111199)

It's the field name - so to me, it's closer to the variable name than the variable type.

Address the remaining manpage comments.

Done for this round of reviewing.

share/man/man4/netlink.4
247 ↗(On Diff #111199)
350 ↗(On Diff #111199)
353 ↗(On Diff #111199)
share/man/man4/rtnetlink.4
41 ↗(On Diff #111199)

Or maybe "is intended as the".

65 ↗(On Diff #111199)
69 ↗(On Diff #111199)
72 ↗(On Diff #111199)

Maybe "origin routing protocol" instead of "originator" if that's what you mean?

77 ↗(On Diff #111199)

Or "the user"

78 ↗(On Diff #111199)

Can you clarify "self-originated"? If you mean like "datagrams for 10.11.12.0/24 are routed through em0 (whose address is 10.11.12.13)", I would use "local" instead.

79 ↗(On Diff #111199)
84–85 ↗(On Diff #111199)

Can you clarify those? I'm guessing RT_SCOPE_LINK means directly reachable through an interface (like "route add -interface") and RT_SCOPE_UNIVERSE means all other routes including the default route if there's one, but if I'm right, that needs to be explicitly stated, because "scope" sent me on the IPv6 address scope garden path.

88 ↗(On Diff #111199)
101 ↗(On Diff #111199)
104 ↗(On Diff #111199)
110 ↗(On Diff #111199)
113 ↗(On Diff #111199)
116 ↗(On Diff #111199)

Empty?

133–137 ↗(On Diff #111199)

Empty?

144–147 ↗(On Diff #111199)

Empty?

155–158 ↗(On Diff #111199)

Empty?

175 ↗(On Diff #111199)
178 ↗(On Diff #111199)
This revision was not accepted when it landed; it landed in state Needs Review. Oct 1 2022, 2:19 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
melifaro marked 11 inline comments as done.