Page MenuHomeFreeBSD

sysctl(9): Enable vnet sysctl variables be loader tunable
ClosedPublic

Authored by zlei on Apr 18 2023, 4:40 AM.
Tags
None
Referenced Files
F107105491: D39638.diff
Fri, Jan 10, 4:54 AM
Unknown Object (File)
Fri, Dec 27, 6:06 PM
Unknown Object (File)
Fri, Dec 27, 2:52 AM
Unknown Object (File)
Thu, Dec 26, 1:07 PM
Unknown Object (File)
Tue, Dec 24, 2:44 PM
Unknown Object (File)
Thu, Dec 19, 1:07 PM
Unknown Object (File)
Nov 27 2024, 7:06 PM
Unknown Object (File)
Nov 26 2024, 7:10 PM

Details

Summary

Complete phase two of 3da1cf1e88f8.

In 3da1cf1e88f8, the meaning of the CTLFLAG_TUN flag is extended to automatically check if there is a kernel environment variable which shall initialize the SYSCTL during early boot. It works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTLs which belong to VNETs.

This change extends the meaning of the CTLFLAG_TUN flag further, to allow it works for the SYSCTLs which belong to VNETs. A typical usage is CTLFLAG_VNET | CTLFLAG_RWTUN. CTLFLAG_VNET | CTLFLAG_RDTUN is also supported but that is not much useful, as per VNET readonly tunables can be shared with a global one.

Note that the values of SYSCTLs belong to VNETs will not reflect the change of kernel environment variable after the kernel modules been initialized. This behavior is currently the same with that of SYSCTLs not belong to VNETs.

The following SYSCTLs belong to VNETs have CTLFLAG_TUN flag, and will be fixed by this change:

net.add_addr_allfibs
net.bpf.optimize_writers
net.inet.tcp.fastopen.ccache_buckets
net.key.spdcache.maxentries
net.key.spdcache.threshold
net.link.bridge.pfil_onlyip
net.link.bridge.pfil_bridge
net.link.bridge.pfil_ipfw_arp
net.link.bridge.pfil_member
net.link.bridge.pfil_local_phys
net.link.bridge.log_stp
net.link.bridge.bridge_inherit_mac
net.link.lagg.default_use_flowid
net.link.lagg.default_use_numa
net.link.lagg.default_flowid_shift
net.link.lagg.lacp.debug
net.link.lagg.lacp.default_strict_mode

The values of these SYSCTLs belong to VNETs are loaded by TUNABLE_XXX_FETCH, thus are not affected by this change:

net.inet.ip.reass_hashsize
net.inet.tcp.hostcache.cachelimit
net.inet.tcp.hostcache.hashsize
net.inet.tcp.hostcache.bucketlimit
net.inet.tcp.syncache.bucketlimit
net.inet.tcp.syncache.cachelimit
net.inet.tcp.syncache.hashsize

Given TUNABLE_XXX_FETCH has maximum flexibility, those SYSCTLs in newly created vnets will have new values when corresponding kernel environment variables change. Ideally the CTLFLAG_TUN flag of those SYSCTLs can be removed, or we step further, automatically update the 'master' copy of the vnet SYSCTLs and newly created vnets can have new values, then we can eliminate TUNABLE_XXX_FETCH as much as possible.

Discussed with: hselasky, glebius
Fixes: 3da1cf1e88f8 Extend the meaning of the CTLFLAG_TUN flag ...
MFC after: 3 weeks
Relnotes: yes

Test Plan

Test the following cases, on x86-64 and riscv hardwares:

  1. kenv, load modules, create new vnets
  2. create vnet, then kenv && load modules
  3. kenv && load modules, check vnet0 (the host)
#!/bin/sh

# unload it incase it is already loaded
kldstat -qm if_bridge && kldunload if_bridge

j1=$( jail -ic vnet persist )

# The default is 1, we set it to 0 in system envionment
echo "set system envionment"
kenv net.link.bridge.pfil_member=0

kldload if_bridge

echo "sysctl in host (vnet0)"
sysctl net.link.bridge.pfil_member

echo "sysctl in vnet $j1"
jexec $j1 sysctl net.link.bridge.pfil_member

j2=$( jail -ic vnet persist )
echo "sysctl in new vnet $j2"
jexec $j2 sysctl net.link.bridge.pfil_member

kldunload if_bridge
jail -R $j2
jail -R $j1

Before this change:

set system envionment
net.link.bridge.pfil_member="0"
sysctl in host (vnet0)
net.link.bridge.pfil_member: 1
sysctl in vnet 1
net.link.bridge.pfil_member: 1
sysctl in new vnet 2
net.link.bridge.pfil_member: 1

after:

set system envionment
net.link.bridge.pfil_member="0"
sysctl in host (vnet0)
net.link.bridge.pfil_member: 0
sysctl in vnet 1
net.link.bridge.pfil_member: 0
sysctl in new vnet 2
net.link.bridge.pfil_member: 0

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

zlei requested review of this revision.Apr 18 2023, 4:40 AM
sys/kern/link_elf.c
1934

XXX link_elf.c is not tested.

sys/kern/linker_if.m
158

This #ifdef requires D36882 .

sys/kern/link_elf.c
1937

Technically this variable should be "ssize_t".

1943

What is the reason behind this check? I find it odd!

sys/kern/link_elf.c
1937

Technically this variable should be "ssize_t".

void
vnet_data_copy(void *start, int size)
{
...
}

vnet_data_copy() should also be updated accordingly.

1943

What is the reason behind this check? I find it odd!

I've no idea. It is copy-pasted from parse_vnet().

#ifdef VIMAGE
static int
parse_vnet(elf_file_t ef)
{
...
#if defined(__i386__)
	/* In case we do find __start/stop_set_ symbols double-check. */
	if (size < 4) {
		uprintf("Kernel module '%s' must be recompiled with "
		    "linker script\n", ef->lf.pathname);
		return (ENOEXEC);
	}

	/* Padding from linker-script correct? */
	pad = *(uint32_t *)((uintptr_t)ef->vnet_stop - sizeof(pad));
	if (pad != LS_PADDING) {
		uprintf("Kernel module '%s' must be recompiled with "
		    "linker script, invalid padding %#04x (%#04x)\n",
		    ef->lf.pathname, pad, LS_PADDING);
		return (ENOEXEC);
	}
	/* If we only have valid padding, nothing to do. */
	if (size == 4)
		return (0);
#endif
...
}

Some analysis:

# egrep -r '(CTLFLAG_R[DW]TUN\s*\|\s*CTLFLAG_VNET)|(CTLFLAG_VNET\s*\|\s*CTLFLAG_R[DW]TUN)' sys

Affected and fixed:

net.add_addr_allfibs		( sys/net/route/route_ifaddrs.c )
net.link.bridge.pfil_onlyip	( sys/net/if_bridge.c)
               .pfil_bridge
               .pfil_ipfw_arp
               .pfil_member
               .pfil_local_phys
               .log_stp
               .bridge_inherit_mac

net.link.lagg.default_use_flowid	( sys/net/if_lagg.c )
             .default_use_numa
             .default_flowid_shift

net.link.lagg.lacp.debug		( sys/net/ieee8023ad_lacp.c )
                  .default_strict_mode

net.bpf.optimize_writers		( sys/net/bpf.c )

net.key.spdcache.maxentries		( sys/netipsec/key.c )
                .threshold

net.inet.tcp.fastopen.ccache_buckets	( sys/netinet/tcp_fastopen.c )

Not affected ( loaded by TUNABLE_XXX_FETCH )

net.inet.tcp.hostcache.cachelimit	( sys/netinet/tcp_hostcache.c )
                      .hashsize
                      .bucketlimit

net.inet.tcp.syncache.bucketlimit	( sys/netinet/tcp_syncache.c )
                     .cachelimit
                     .hashsize

net.inet.ip.reass_hashsize		( sys/netinet/ip_reass.c )

Conceptually LGTM, it would be nice to have a high-level description of what the feature does in the commit message. It took some time to me to understand that, for example, 'kenv net.inet.ip.fw.default_to_accept=0' won't change the relevant value in the existing VNETs.

Maybe add a word about it in UPDATING ?

Conceptually LGTM, it would be nice to have a high-level description of what the feature does in the commit message. It took some time to me to understand that, for example, 'kenv net.inet.ip.fw.default_to_accept=0' won't change the relevant value in the existing VNETs.

Yes I planned to.
I mainly test the change in stable/13, as my test VM's root fs is ZFS and current/14 has some problem with it.
I'll prepare a new VM with UFS to test the change, and I think link_elf.c should also be test covered.

Maybe add a word about it in UPDATING ?

I see previous sections in UPDATING are mainly for breaking changes and new features.
In this change the kernel linkers are affected, so it should be significant and deserves a word about it in UPDATING.

Can releng comment on it?

andrew added inline comments.
sys/kern/linker_if.m
158

It looks safe to MFC. On 13 and earlier the script that parses the .m files will ignore the #ifdef so you'll get the propagate_vnets definition, however it nothing implements it or calls it the only issue will be a slightly larger kernel file.

Don't have expertise in the linker, especially I don't understand the i386 code snippet in there. But I like what the change does. Thanks!

  1. link_elf.c: remove unneeded check against i386
  2. link_elf_obj.c: break on propagating vnets, there should be only one vnet_set

The changes to link_elf.c is tested with a minimal foo.ko which is linked to DYN elf type.

root@:~ # readelf -h foo.ko
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 09 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            FreeBSD
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices x86-64
  Version:                           0x1
  Entry point address:               0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          10352 (bytes into file)
  Flags:                             0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         7
  Size of section headers:           64 (bytes)
  Number of section headers:         23
  Section header string table index: 20
sys/kern/link_elf.c
1941

parse_vnet() has checked the size, so it is safe here.

zlei marked 3 inline comments as done.May 6 2023, 1:47 AM
zlei added inline comments.
sys/kern/link_elf.c
1934

Tested with QEMU riscv emulator.

zlei edited the test plan for this revision. (Show Details)
zlei added reviewers: rwatson, bz.
zlei edited the summary of this revision. (Show Details)

Added an UPDATING entry.

Sorry for mistakenly included D39852, corrected now.

sys/kern/kern_sysctl.c
514

This may tigger panic if SYSCTL nodes are declared with SYSCTL_PROC and without flag CTLFLAG_NOFETCH, such as

SYSCTL_PROC(_net_inet_carp, OID_AUTO, allow,
    CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RWTUN | CTLFLAG_MPSAFE,
    &VNET_NAME(carp_allow), 0, carp_allow_sysctl, "I",
    "Accept incoming CARP packets");

in sys/netinet/ip_carp.c .

I will post separated fixes for them.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 9 2023, 8:21 AM
This revision was automatically updated to reflect the committed changes.