
ng_eiface(4) and ng_iface(4) should play better with vnet(9)
Abandoned · Public

Authored by dave_freedave.net on Thu, Feb 27, 4:56 PM.

Details

Reviewers
glebius
Summary

Now these nodes move to a jail along with their contained struct ifnet.
Additionally, add a 'setvnethome' message so that nodes are automatically
shut down when the vnet(9) is destroyed instead of being returned.
Finally, provide a sample script, jei, to create an ng_eiface(4), connect it,
send it to a jail, rename it, and optionally set its MAC address (order matters).

Test Plan

Having learned my lesson on D44615, I have been testing on CURRENT with INVARIANTS etc.
I also ran kyua(1) for sys/netgraph/bridge and sys/netgraph/basic, as they seem to apply.
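
For reference, that kyua(1) run looks something like this (assuming the stock /usr/tests layout; the bridge and basic test names are the ones mentioned above):

# cd /usr/tests/sys/netgraph
# kyua test bridge basic
# kyua report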

I have also created a script to verify this doesn't leak permissions into the jail, and I ran that too.
For my testing I use a fairly simple jail.conf, and yes, all three jails share the same path:

mount.devfs;
mount.fdescfs;
mount.procfs;
exec.clean;
path="/jail/test";
host.hostname="$name.jail";
exec.consolelog="/var/tmp/$name";

test {
  vnet;
  exec.created = "/usr/local/bin/jei $name tst1 br0:link 00:0C:29:39:B5:4C";
  exec.start = "/bin/sh /etc/rc";
  exec.stop = "/bin/sh /etc/rc.shutdown jail";
}

# you must create this interface and name it before starting this jail.
traditional {
  vnet;
  vnet.interface = trad0;
  exec.start = "/bin/sh /etc/rc";
  exec.stop = "/bin/sh /etc/rc.shutdown jail";
}

# This is for use with test script to verify permissions aren't leaked.
bare {
  vnet;
  exec.start = "/bin/sh /etc/rc";
  exec.stop = "/bin/sh /etc/rc.shutdown jail";
}

Before trying my changes you need something like the traditional jail above to even see the problem. Here is how you could go about creating the trad0 interface for it:

# ngctl
+ mkpeer . bridge b uplink
+ name .:b br0
+ msg br0: setpersistent
+ rmhook . b
+ mkpeer . eiface e ether
+ name .:e trad0
+ rmhook . e
+ connect br0: trad0: link ether

This creates our br0 and trad0 nodes but has not changed the interface name for ifconfig(8). Let's do that next:

# ifn=$(ngctl msg trad0: getifname | sed '1d' | awk -F '"' '{ print $2 }')
# ifconfig $ifn name trad0

Now go ahead and start our traditional netgraph(8) jail with jail -cv traditional (still without my changes). Right away we can see something is wrong:

# ngctl show -n trad0:
  Name: trad0             Type: eiface          ID: 00000012   Num hooks: 1
# jexec traditional ngctl show -n trad0:
  Name: trad0             Type: ether           ID: 00000001   Num hooks: 0

Already it's lying to both the main system and the jail. The main system has an eiface node but not the interface in ifconfig(8). The jail got an ether node attached to its interface. There is even a comment in ng_ether.c about how this is supposed to be prevented, saying not to create or attach to netgraph nodes[1]. It does not account for vnet(9).

This gets worse once you think about that code, though. We can intentionally confuse it by renaming the interface in the jail, then shutting the jail down and poking about:
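
# jexec traditional ifconfig trad0 name foo0
# jail -rv traditional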

# ngctl ls -n | grep foo0
  Name: foo0            Type: ether           ID: 00000028   Num hooks: 0

Yikes! It magically created another node for us, and now we have an ether and an eiface pointing at the same struct ifnet.

I have been using ng_eiface(4) with jails since FreeBSD 9-ish and only recently noticed this. If you shut down the eiface, the ether goes away too, and it does not panic. But then I don't usually change interface names in jails, so maybe it eventually would.

But if we have ng_eiface(4) play nice with vnet(9), not only can you change the interface name for ifconfig(8), you can change the node name (independently) with ngctl(8) too. I was initially inclined to add a "shutdown all" to ngctl(8) that could run in the jail, so scripts would not have to care about the names or whether somebody renamed them. That almost works, but you have to put it in exec.stop after /bin/sh /etc/rc.shutdown jail, and the nodes can be destroyed before shutdown times out. For this solution to work we would need another state in jail.conf(5) that runs after exec.stop but before exec.poststop.

That is why I settled on a new message to set the ng_eiface(4) or ng_iface(4) vnet(9) "home". Now they get cleaned up with the rest of the vnet(9), and that happens after all exec.stop (whether they succeed or time out). The jei script uses this after it has sent a node to the jail vnet(9), which causes vnet_netgraph_uninit to do its job and destroy the node when the vnet(9) is destroyed. I tried setting if_home_vnet every time an interface moved, but that would break the traditional netgraph jail (the interface would never come back), so I settled on a new message.
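
For example, with the trad0 node from above (a sketch of jei's effect; its exact internals may differ):

# ifconfig trad0 vnet traditional
# jexec traditional ngctl msg trad0: setvnethome

Now tearing down the traditional jail destroys the node along with its vnet(9) instead of returning it.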

This method lets you "undo" as well. Just ifconfig trad0 -vnet traditional, then ngctl msg trad0: setvnethome, and it's back from the jail with its home vnet(9) reset too. Or you can have your jail gift the node to a sub-jail and set its home there (or not, as you choose).
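
Spelled out:

# ifconfig trad0 -vnet traditional
# ngctl msg trad0: setvnethome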

Getting these nodes to play with vnet(9) means that a node ID is now unique across all jails (much like a process ID). Changing this should not break any users, as they never get to set the ID anyway. The same is true for the device ID in ng_eiface(4) and ng_iface(4).

Finally, there is my test script, which uses the bare configuration above and manually creates interfaces to verify connectivity, but more importantly to verify that no new permissions were granted by moving nodes into vnet(9)s. In netgraph(8) you often reference one node by way of another; that is how we could name our ng_eiface(4) above, using the path ".:e" until we managed to name it. Once you move an interface into a jail you can see that it is connected to "br0", you just can't message "br0" in any way from the jail. From the host system you can't message the interface via "br0:link4" either. You can disconnect from your own side (in the jail you disconnect the ether hook; on the main system, one of the bridge linkX hooks).
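
For example, both of these should be denied (a sketch; the jail name matches the script below, and the exact linkN hook number is illustrative):

# jexec bare ngctl msg tst1:ether getstats 1
# ngctl msg br0:link1 getifname

Each command addresses a node on the other side of the vnet(9) boundary, so it fails with EACCES.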

Here is my permissions testing script:

#! /bin/sh
#
# For purposes of testing we need to create this picture:
#                                ╎
#                  <host system> ╎ <jail test>
#                                ╎
#          +-------------+       ╎               N ifaces sent
#          |             |       ╎
#  +---+---+---+   +-----+-----+ ╎ +-----------+   +-----------+   +-----------+
#  | br0       |   | tst0      | ╎ | tst1      |   | tst...    |   | tstN      |
#  | ng_bridge |   | ng_eiface | ╎ | ng_eiface |   | ng_eiface |   | ng_eiface |
#  +-+---+---+-+   +-----------+ ╎ +-----+-----+   +-----+-----+   +-----+-----+
#     |  |   |                   ╎       |               |               |
#     |  |   +---------------------------+               |               |
#     |  |                       ╎                       |               |
#     |  +-----------------------------------------------+               |
#     |                          ╎                                       |
#     +------------------------------------------------------------------+
#                                ╎
#                          +-----------+
#                         /      ╎      \
#                  +-----+----+  ╎  +----+-----+
#                  | ppNh     |  ╎  | ppNj     |
#                  | ng_iface |  ╎  | ng_iface |
#                  +----------+  ╎  +----------+
#                                ╎
#
# We have to create both ng_eiface(4) and ng_iface(4) nodes because they both
# have a `struct ifnet` and because we changed both and want to test both.
#
# For this to just run through and verify we didn't grant extra permissions
# we have to make some extras and do basic connectivity verification.
#
# Basically a jail should not be able to reference nodes in other virtual nets.
# Nor can the host. True before my changes, still true after. How can they be
# linked then? They must be linked BEFORE moving into the jail. What changed is
# you can now see nodes that have links going outside the jail, and those must
# be protected from incorrect access now.
# Before changes `ifconfig <ifname> vnet <jailname>` did this for ng_eiface:
#       move the `struct ifnet` to jail
#       incorrectly attach ng_ether to `struct ifnet` (does hide access)
# Now that results in:
#       move ng_eiface to jail
#       move `struct ifnet` to jail (so care needed for access now)
#
# The way this works is that all netgraph commands have to look up where they
# are (or assume `.`, the root of that curvnet). Each node also knows which
# vnet it is in. So when the netgraph subsystem parses a path to look up "here"
# in the code, it must be in curvnet or EACCES is returned. Netgraph parsing of
# paths only allows 2 components, so you can't work your way back to yourself
# like
#       tst1:ether:link3
# That is an invalid arg, so no need to test these.
#
# But enforcing that the node is ultimately in the same curvnet is why you
# can't monkey with anything on "the other side". You can have an effect,
# though. The host can remove a link from a bridge (and the other side may be
# in a jail). And a jail admin can remove the link from an ng_eiface to the
# bridge. But neither can shut down a node on the other side using a path
# through a hook.
#
# There are no checks in code for data traversing links. That is unchanged.

BR="br0"
JL="bare"

AWK="/usr/bin/awk"
GREP="/usr/bin/grep"
IFCONFIG="/sbin/ifconfig"
JAIL="/usr/sbin/jail"
JEXEC="/usr/sbin/jexec"
NGCTL="/usr/sbin/ngctl"
PING="/sbin/ping"
SED="/usr/bin/sed"
TR="/usr/bin/tr"

S=0
F=0
QUIET=1

# Check Success
CS()
{
        if [ ${1} -eq 0 ]; then
                [ $QUIET -ne 0 ] && printf 'SUCCESS\n'
                S=$((S + 1))
        else
                [ $QUIET -ne 0 ] && printf 'FAIL\n'
                F=$((F + 1))
        fi
}

# Expect Success
ES()
{
        local func=$1
        shift
        if [ $QUIET -eq 0 ]; then
                $func "$@"
        else
                echo -n "$func $@ => "
                $func "$@" > /dev/null 2>&1
        fi
        CS $?
}

# Check Failure
CF()
{
        if [ ${1} -eq 0 ]; then
                [ $QUIET -ne 0 ] && printf 'FAIL\n'
                F=$((F + 1))
        else
                [ $QUIET -ne 0 ] && printf 'SUCCESS\n'
                S=$((S + 1))
        fi
}

# Expect Failure
EF()
{
        local func=$1
        shift
        if [ $QUIET -eq 0 ]; then
                $func "$@"
        else
                echo -n "$func $@ => "
                $func "$@" > /dev/null 2>&1
        fi
        CF $?
}

printf "creating bridge => "
${NGCTL} -f- <<- EOF  > /dev/null 2>&1
        mkpeer bridge l link
        name .:l ${BR}
        msg ${BR}: setpersistent
EOF
CS $?

# Pass in a node name, e.g. "tst1", and the matching eiface or iface interface
# will be renamed to match. This works because both respond the same to the
# `getifname` message.
rename_iface()
{
        local name
        name=$(${NGCTL} msg ${1}: getifname | ${SED} '1d' | ${AWK} -F'"' '{print $2}')
        ES ${IFCONFIG} ${name} name ${1}
}

make_eiface()
{
        printf "creating ng_eiface ${2} => "
        ${NGCTL} -f- <<- EOF > /dev/null 2>&1
                mkpeer eiface e ether
                name .:e ${2}
                rmhook . e
                connect ${1}: ${2}: link ether
        EOF
        CS $?
}

# Matrix of tests
# In addition to just working as interfaces, we have to try access through
# hooks and back again. Some should work; others will not.
#
# iface |S/F| where | command
#-------+---+-------|----
#  tst0 | ✅| host  | ifconfig tst0 inet ... ping
#  tst1 | ✅| jail  | ifconfig tst1 inet ... ping
#
#  tst2 | ✅| jail  | ngctl rmhook tst2: ether
#  tst3 | ❌| jail  | ngctl rmhook tst3:ether link3
#  tst4 | ✅| jail  | ngctl shutdown tst4:
#  tst5 | ❌| jail  | ngctl shutdown tst5:ether
#  tst6 | ✅| host  | ngctl rmhook ${BR}: link6
#  tst7 | ❌| host  | ngctl rmhook ${BR}:link7 ether
#  tst8 | ❌| host  | ngctl shutdown ${BR}:link8
#
#       | ✅| host  | ngctl shutdown ${BR}:

for n in 0 1 2 3 4 5 6 7 8; do
        make_eiface ${BR} tst${n}
        rename_iface tst${n}
done

make_iface_pair()
{
        printf "creating $1 <--> $2 ng_iface pair => "
        ${NGCTL} -f- <<- EOF  > /dev/null 2>&1
                mkpeer iface i inet
                name .:i ${1}
                rmhook . i
                mkpeer iface i inet
                name .:i ${2}
                rmhook . i
                connect ${1}: ${2}: inet inet
        EOF
        CS $?
}

# similar matrix of things to verify
#
# pair |S/F| where | command
#------+---+-------|----
#  pp0 | ✅| host  | ifconfig pp0h inet ... ping
#  pp0 | ✅| jail  | ifconfig pp0j inet ... ping
#
#  pp1 | ✅| host  | ngctl rmhook pp1h: inet
#  pp2 | ❌| host  | ngctl rmhook pp2h:inet inet
#  pp3 | ✅| jail  | ngctl rmhook pp3j: inet
#  pp4 | ❌| jail  | ngctl rmhook pp4j:inet inet
#  pp5 | ✅| host  | ngctl shutdown pp5h:
#  pp6 | ❌| host  | ngctl shutdown pp6h:inet
#  pp7 | ✅| jail  | ngctl shutdown pp7j:
#  pp8 | ❌| jail  | ngctl shutdown pp8j:inet

for n in 0 1 2 3 4 5 6 7 8; do
        make_iface_pair pp${n}h pp${n}j
        rename_iface pp${n}h
        rename_iface pp${n}j
done

# Initialize interfaces that will remain on the host: tst0, pp0h.
# Don't bother initializing the others, as we just need them for access tests.

ES ${IFCONFIG} tst0 inet 192.168.0.1 netmask 255.255.255.0
ES ${IFCONFIG} pp0h inet 192.168.1.1 192.168.1.2

# start jail and configure its interfaces
ES ${JAIL} -cv ${JL}

# send tst1...N (not tst0).
for n in 1 2 3 4 5 6 7 8; do
        ES ${IFCONFIG} tst${n} vnet ${JL}
done
# send all pp?j to jail.
for n in 0 1 2 3 4 5 6 7 8; do
        ES ${IFCONFIG} pp${n}j vnet ${JL}
done

# initialize tst1 and pp0j on jail for connectivity (should still work).
ES ${JEXEC} ${JL} ${IFCONFIG} tst1 inet 192.168.0.2 netmask 255.255.255.0
ES ${JEXEC} ${JL} ${IFCONFIG} pp0j inet 192.168.1.2 192.168.1.1

# verify host can get to jail
ES ${PING} -c 1 192.168.0.2
ES ${PING} -c 1 192.168.1.2

# verify jail can get to host
ES ${JEXEC} ${JL} ${PING} -c 1 192.168.0.1
ES ${JEXEC} ${JL} ${PING} -c 1 192.168.1.1

# Pick up the matrix for ng_eiface after the ifconfig lines
ES ${JEXEC} ${JL} ${NGCTL} rmhook tst2: ether
EF ${JEXEC} ${JL} ${NGCTL} rmhook tst3:ether link3
ES ${JEXEC} ${JL} ${NGCTL} shutdown tst4:
EF ${JEXEC} ${JL} ${NGCTL} shutdown tst5:ether
ES ${NGCTL} rmhook ${BR}: link6
EF ${NGCTL} rmhook ${BR}:link7 ether
EF ${NGCTL} shutdown ${BR}:link8

# Now pick up the matrix for ng_iface after the ifconfig lines
ES ${NGCTL} rmhook pp1h: inet
EF ${NGCTL} rmhook pp2h:inet inet
ES ${JEXEC} ${JL} ${NGCTL} rmhook pp3j: inet
EF ${JEXEC} ${JL} ${NGCTL} rmhook pp4j:inet inet
ES ${NGCTL} shutdown pp5h:
EF ${NGCTL} shutdown pp6h:inet
ES ${JEXEC} ${JL} ${NGCTL} shutdown pp7j:
EF ${JEXEC} ${JL} ${NGCTL} shutdown pp8j:inet

# There is a bunch of clutter left, but the jail should still be able to shut
# down, which means all netgraph nodes that weren't already shut down are back
# in the host.
ES ${JAIL} -rv ${JL}

printf "\n\ntotals:\n\tSUCCESS=%d\n\tFAIL=%d\n" $S $F

# And now we just attempt to destroy all the ifaces, even though some are
# already gone.
${NGCTL} shutdown ${BR}: > /dev/null 2>&1
for n in 0 1 2 3 4 5 6 7 8; do
        ${NGCTL} shutdown tst${n}: > /dev/null 2>&1
done
for n in 0 1 2 3 4 5 6 7 8; do
        ${NGCTL} shutdown pp${n}h: > /dev/null 2>&1
        ${NGCTL} shutdown pp${n}j: > /dev/null 2>&1
done

Running it should produce only the expected results and no failures.

Finally, you can copy /usr/share/examples/jails/jei to /usr/local/bin and run jail -cv test to see jei in action. You must create the bridge br0 as before, although jei will connect to any valid netgraph(8) path, not just a bridge. Whatever "it" is, it must already exist before the new eiface can be connected to it. jei uses a specific order and doesn't name your ng_eiface(4) or your interface for ifconfig(8) until after it has been moved to the jail. I exploit this to have a lan0 and jail0 in all my jails, and the rc.conf is the same for all of them.
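
That is:

# cp /usr/share/examples/jails/jei /usr/local/bin/
# jail -cv test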

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

By design, moving an eiface ifnet from one vnet to another has always implied that its netgraph node will remain attached in the parent vnet. This is not an omission or a mistake, but a well-established mode of operation on which certain applications heavily depend, and which this patch proposes to change, for reasons not clearly stated.

Having an ng_eiface node in a parent vnet and an ng_ether node in a child vnet attached to the same ifnet is perfectly legal and has its uses; why exactly do you propose to disable this?

Why exactly do you want to devirtualize V_ng_iface_unit?

What exactly do you want to accomplish by destroying the ifnet and its ng_eiface node on child vnet shutdown, instead of returning it to the parent vnet?

In general, the patch and the accompanying text are overly complex and should be broken into smaller parts.

glebius requested changes to this revision. Thu, Feb 27, 6:22 PM

I do not like the plan. The picture drawn shows that a netgraph node in one vnet is connected to a node in a different vnet. This is basically a violation of the idea of vnet. Virtualized stacks should communicate with each other via network protocols, not kernel pointers. The only legal exclusion is epair(4). You may create a new netgraph node for your purpose - a node that is present in two vnets; that would be a second legal exclusion to the virtualization rule.

P.S. I will once again reiterate that the concept of if_vmove() and the concept of "a parent vnet" are a source of problems, and we should work towards getting rid of them rather than piling up more dependencies on these flawed concepts.

This revision now requires changes to proceed. Thu, Feb 27, 6:22 PM

I think I'll just accept that my understanding is flawed now, to save time, and withdraw this. Thank you.

I do not like the plan. The picture drawn shows that a netgraph node in one vnet is connected to a node in a different vnet. This is basically a violation of the idea of vnet. Virtualized stacks should communicate with each other via network protocols, not kernel pointers. The only legal exclusion is epair(4).

Loaning an ng_eiface ifnet to a child vnet has been the cornerstone mechanism for creating inter-vnet communication via netgraph for 20+ years, long before epair(4) came into existence. Now you're here to tell us this suddenly became illegal - wonderful!

You may create a new netgraph node for your purpose - a node that is present in two vnets; that would be a second legal exclusion to the virtualization rule.

Creating a new "wormhole" netgraph node with endpoints in two vnets could be doable, but it would introduce unnecessary processing overhead compared to the already available and established mechanism of loaning an ng_eiface node to a child vnet via if_vmove().

P.S. I will once again reiterate that the concept of if_vmove() and the concept of "a parent vnet" are a source of problems, and we should work towards getting rid of them rather than piling up more dependencies on these flawed concepts.

Your obsession with getting rid of the "flawed" if_vmove() is noted, but for the sake of other people who may have a different view, and who have applications relying on this very concept for 20+ years, please do not take that route.

In D49158#1121374, @zec wrote:

Your obsession with getting rid of the "flawed" if_vmove() is noted, but for the sake of other people who may have a different view, and who have applications relying on this very concept for 20+ years, please do not take that route.

I know we don't agree on that. In my view it was 20 years of fixing bugs due to the original "moving" concept. And I believe some bugs remain. Once I have some code to demonstrate, I'd like to have a discussion with you, but today it doesn't make sense to flood this particular review with it.

@dave_freedave.net What do you think: a netgraph node that is facing two vnets, as Marko nicely named it "a wormhole node", is it going to help your setup?

I vastly underappreciated that folks rely upon the ng_eiface not moving with the struct ifnet, probably because I've been using them in jails for over a decade and only recently noticed this myself.

I encountered that situation because I was exploring the ordering of ng_eiface creation, the move to a jail, and the rename. The idea was that I could have the same interface names in jails, if I waited until after the move to rename them, and therefore use the same config for dhcpcd. I was then surprised at what happened when an interface was renamed in a jail and the jail was then shut down. I went off with the wrong assumption that moving the ng_eiface to the new vnet had simply been missed.

Anyway, breaking others' existing configs is reason enough for me to close this. A "wormhole" would work, but sometimes I have to accept I'm chasing the wrong thing and switch directions; this feels like one of those times.