Page MenuHomeFreeBSD

Add qemufwcfg driver and FUSE filesystem.
Needs ReviewPublic

Authored by theraven on Jul 17 2024, 12:39 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Jan 23, 6:41 PM
Unknown Object (File)
Thu, Jan 23, 6:35 PM
Unknown Object (File)
Thu, Jan 23, 4:16 PM
Unknown Object (File)
Wed, Jan 22, 5:20 PM
Unknown Object (File)
Tue, Jan 14, 9:03 PM
Unknown Object (File)
Fri, Jan 10, 2:50 PM
Unknown Object (File)
Fri, Jan 10, 2:49 PM
Unknown Object (File)
Fri, Jan 10, 2:49 PM
Subscribers

Details

Reviewers
markj
asomers
Summary

The QEMU firmware configuration device (also exposed by bhyve) is a
device for injecting data into virtual machines. It provides a simple
map from 16-bit numbers (selectors) to blobs. A small number of
selectors have well-defined meanings, including one that provides names
(which may contain slashes and be modelled as paths) for the dynamically
allocated set.

This commit includes two parts:

  • A NetBSD-compatible qemufwcfg driver.
  • A FUSE filesystem that talks to the driver.

Linux provides an in-kernel filesystem for translating qemu firmware
configuration blobs things into a directory tree of files. NetBSD
adopts a model that better respects the principle of least privilege and
provides a very simple in-kernel driver that handles the low-level
device I/O but delegates the filesystem parts to userspace.

This kernel driver is not a port of the NetBSD one but is directly
inspired by it and provides a compatible interface. The NetBSD FUSE
filesystem can run on top of it.

The NetBSD FUSE filesystem brings in a lot of complexity via libfuse,
which is not in the FreeBSD base system. libfuse is designed for
implementing high-performance multi-threaded FUSE filesystems. This is
overkill for a read-only filesystem intended for providing rarely-read
small data. Instead, this implements a very simple single-threaded
version with no dependencies outside of the base system. This runs in
Capsicum mode, with no ability to access anything other than stderr, the
(opened) fuse device and the qemufwcfg device.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 58673
Build 55561: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
theraven added inline comments.
sys/dev/qemufwcfg/qemufwcfg.c
256

I've made this an assert. If I read the man page correctly, all of the ways that it can fail depend on passing flags in mda_flags. We pass MAKEDEV_CHECKNAME, which enables some possible failures, but they should not actually be possible (or should happen all of the time).

Add a default BMAP. The kernel's FUSE seems to require it.

sys/dev/qemufwcfg/qemufwcfg.c
133

There was an effort to allow for oneliners, but prohibit them for multi-line comments. I'll have to go try to find what happened to it. I'd follow that here, in general, though the / clearly has a meaning to doxygen and shouldn't be eliminated.

157

Why not store all that junk in the softc rather than recompute it every time? seems like overkill.

174

same comment here. Just store these in softc, compute it once and then use the softc data members directly.

225

mulit-line comment.

226

This is out of place here. It should be centralized so people can find it in debuggers.

241

sc will be freed entirely after we return. We're single threaded there, so there's no reason to set this to NULL.

244

So if it's really 4 characters, why the indirection an obfuscation with the different const stuff?
Also, printing characters here are a bad idea. You should print the hex values since if they aren't ASCII, they are hard to dig out of the dmesg (especially if they are NULs).

335

blank line between decl and write_selecctor()

sys/modules/Makefile
334

This needs to be conditional... It can only exist on x86 and arm64...

sbin/mount_qemufwcfg/mount_qemufwcfg.cc
6

You can omit the rest of the boiler plate.

sys/dev/qemufwcfg/qemufwcfg.c
133

/// and /* are doxygen comments and, importantly, are picked up by LSP things so if you're editing in a modern editor (vim, vs code, and so on) they can pop up the docs. I generally prefer /* for multi-line comments but missed some where they stated as single-line comments and were then wrapped. Please flag any where I missed them.

157

I'm not sure I follow. We need to 'recompute' the index here because this is where we're setting the index.

174

I'm not sure what you see is being computed here. We have one offset for MMIO, one for device port things. We could store two fields, but you'll end up with basically the same number of instructions, more memory, and three things that have to be in sync when one is trivially computable from the other.

sc->io_type is invariant after it's set.

sys/dev/qemufwcfg/qemufwcfg.c
174

You have exactly one offset at any time. since sc->io_type is invariant after attach, you can precompute all the offsets when you set that type in attach. There's no need to check each time you go to use it.
So you'd have sc->data_offset and sc->write_offset that you'd use here and elsewhere.

207

You only ever set io_type here. Store the data and write offsets here too in new softc members. Then you can use them directly everywhere rather than having different ifs.

This looks like a cool project. I've only reviewed tinyfuse.hh so far; I'll come back for the rest later. BTW, if you want to see an example of a fuse file system written in C++ that doesn't use libfuse, you could check tests/sys/fs/fusefs/mockfs.cc . It basically wraps googlemock.

sbin/mount_qemufwcfg/tinyfuse.hh
137

Some people would complain about your comment style here: // vs /* . Fair point about the protocol, though; it does have a number of bugs.

172

If error is meant to be an errno, then you should invert it here. The FUSE protocol uses negative error numbers.

404

BTW, if you don't want to call mount_fusefs, it's possible to use nmount directly. For an example, see https://github.com/freebsd/freebsd-src/blob/a4469a0d19b64bf518c12e8c24c81ec513a45e7d/tests/sys/fs/fusefs/mockfs.cc#L521 .

465

FYI, the FUSE_OPEN request is optional. In my experience, it's often not necessary because it's somewhat redundant with FUSE_LOOKUP. If a fuse server doesn't want to implement it, it can set FUSE_NO_OPEN_SUPPORT in fuse_init_out.flags. In that case, this function should return ENOSYS. So I suggest that the base class return ENOSYS here.

465

FYI, FUSE_OPEN is optional. In my experience, it's not usually necessary because it's somewhat redundant with FUSE_LOOKUP. If a file system doesn't want to implement it, it can set FUSE_NO_OPEN_SUPPORT in fuse_init_out.flags and return ENOSYS here. So I suggest that the base class's implementation return ENOSYS.

475

Ditto here, except that the relevant flag is FUSE_NO_OPENDIR_SUPPORT.

490

Rather than allow anything, a better default is ENOSYS. The kernel will interpret that as "allow anything, and assume that all future accesses will succeed, so don't bother sending FUSE_ACCESS again".

532

FWIW, libfuse returns a default value of ENOSYS here and for every other request handler. There's nothing intelligent that the kernel can do with that, though.

560

The default should return ENOSYS. The kernel will interpret that as "success, and don't bother ever sending FUSE_FLUSH again".

574

This isn't valid. The kernel requires that out.attr.mode & S_IFMT always return the same value for every inode. Also, if out.attr.size differs from what's already cached, then the kernel will throw away any data cached for the file. I don't think it's possible to provide a correct default implementation for this request.

585

Is this a dupe of line 537?

618

The default implementation should return ENOSYS. The kernel will interpret that as "return the input block number, and don't bother to send FUSE_BMAP again".

This revision now requires changes to proceed.Jul 18 2024, 12:25 AM
theraven marked 8 inline comments as done.

Address asomers' review.

Same features, now with 5% less code!

Thanks @asomers, I now understand FUSE slightly more, and that let me delete around 5% of the userspace code.

sbin/mount_qemufwcfg/tinyfuse.hh
172

Aha, thank you! That explains why the kernel kept rejecting my error codes. I should have realised it used the Linux syscall convention. I've now made the types consistent and ensured that positive values are negated here (after doing Linux things for a bit, I have a habit of accidentally adding a - in front of any errno value I'm returning from a function, so it's nice to have those fixed automatically).

404

To be honest, I quite like the mount_fusefs bits, and it also avoids having to parse all of the default arguments (noatime and so on). One extra exec at mount time isn't too bad.

465

Does that mean open / opendir are there just to support filesystems that want to track some per-descriptor state? I've updated the code to set those flags if the defaults are not overridden (which should mean that this function is never called and exists just for type inference / documentation). This let me delete a load of code, thanks!

475

More code deleted, yay!

490

Thanks. I did that originally, but it turned out I had two bugs in the logic that produced error returns. This is now done.

574

Now that I've fixed the error returns, ENOSYS works fine here.

585

Yup, copy-and-paste error.

theraven added inline comments.
sbin/mount_qemufwcfg/mount_qemufwcfg.cc
6

This is taken from share/examples/etc/bsd-style-copyright, is there a newer template I should be using?

sys/dev/qemufwcfg/qemufwcfg.c
174

The thing I'm not understand is why, on a code path that does a VM exit that takes (conservatively) thousands of cycles, you would prefer to shave off a single conditional move and maybe one other instruction to materialise offsets and grow a structure by two fields.

244

So if it's really 4 characters, why the indirection an obfuscation with the different const stuff?

I don't know what this means.

Also, printing characters here are a bad idea. You should print the hex values since if they aren't ASCII, they are hard to dig out of the dmesg (especially if they are NULs).

On a valid platform, they will be the string "QEMU". Printing hex values makes this less clear. I would imagine that anything else that looks like this device will also provide a string. This is roughly the same interface as x86 CPUID, which we print as characters.

sys/modules/Makefile
334

The MMIO interface is generic. Linux enables this device on Arm PA RISC, PowerPC, SPARC, x86, and RISC-V. I think that's a superset of architectures that we support and there's no reason that it wouldn't be expected to work on others.

theraven marked an inline comment as done.

Fix some review comments, remove a stale comment.

asomers requested changes to this revision.Jul 18 2024, 6:36 PM

The changes to tinyfuse.hh look good. Now I've reviewed mount_qemu_fwcfg, too.

sbin/mount_qemufwcfg/mount_qemufwcfg.cc
29

After 91d35fb663e0276abc4963ae559e4759ea929842 , you can remove all of the WITHOUT_CAPSICUM parts.

263

Is this necessary? The kernel will also cache fusefs files' contents, subject to the vfs.fusefs.data_cache_mode sysctl.

380

1 second? Perhaps you forgot to uncomment the following line?

459–465

I don't understand this statement. Are you assuming that a dirent's ino field should be equal to the off field of the next dirent? That isn't true in general. There's no requirement for a directory's dirents to be sorted by ino. I think the only requirement surrounding off is that the dirents must be sorted in ascending order by off, and each dirent's off must be unique. A simple incrementing counter should suffice, for read-only file systems like this one. And I don't think there's any requirement to return -1 for the offset of the final entry.

536

This will only cache attributes for 1 second. You should set it to the max, too.

563

I don't understand this comment. Won't there be regular files too?

575

Is nlink == 1 valid for directories with subdirectories?

591

f_bsize is tricky; it means one thing in FreeBSD's statfs, but another in fuse.

sbin/mount_qemufwcfg/tinyfuse.hh
450

Cool.

465

Yes. For example, a server could theoretically allow read-only access to a file via one descriptor but read-write via another. But I don't think it's very useful. It's actually a pretty bad idea to rely on the descriptor, because sometimes the kernel doesn't know which file descriptor to associate with a FUSE request. For example, when using aio_write or when the writeback cache is in use.

500

BTW, the kernel will allow all accesses unless the daemon sets the "default_permissions" mount option. If that option is set, then the kernel will do the usual permission checking. You'll usually want to set "allow_other" at the same time. If "allow_other" is unset, then only the user running the daemon will be allowed to access anything.

This revision now requires changes to proceed.Jul 18 2024, 6:36 PM
sbin/mount_qemufwcfg/mount_qemufwcfg.cc
263

I keep changing my mind on this. On the one hand, the underlying device doesn’t support seeking and requires a VM exit to read 1-8 bytes (depending on the architecture), so when someone reads a 100 KiB file in a bunch of small reads, it adds a lot of overhead. On the other hand, I doubt anyone actually cares about the performance.

Then again, the total amount of data typically exposed here is < 1 MiB, so reading each file the first time it’s accessed is not adding much memory overhead and it simplifies the logic quite a bit to just read the entire file into memory once and service requests by returning a sub range into that buffer.

459–465

If I didn’t return -1, I saw infinite loops where the kernel kept issuing readdir requests at offset 0.

If I set the inode to the file’s inode, I got other failures.

If I set the offset to anything other than the offset of the next one as a concrete value, I also got failures.

I traced the NetBSD version with trace to figure out what it was doing and found the values that it used. Nothing else worked for me.

563

Copy and paste error. When I factored this out into a separate function, I left the comment from the copy in opendir.

575

It doesn’t seem to cause errors in the kernel. Possibly it should include the number of .. links, I’m not sure if anything actually cares?

sys/dev/qemufwcfg/qemufwcfg.c
241

Every secure coding guide I've ever read says (for good reasons) that it's a bad idea to leave dangling pointers, even if you know that they won't be read. Unless the free is in the same function, leaving a dangling pointer is just asking for someone to introduce a UAF bug in a subsequent change.

Provide an option to disable caching.

sys/dev/qemufwcfg/qemufwcfg.c
241

It's useless work and never or rarely done for error åpaths in the kernel. It's just as likely to lead to kernel panics as it is to uaf issues. It's very out of place. Both are bad since you also assume it is non null in many places, It's a jump ball which one you'd theoretically see,

244

Better to have useful data.. the error message can say expected vs actual to know what is expected. Having a hex value that happens to be QEMU is fine. There is no need to make that clear its ascii value.

For similar reasons, its a magic 4 byte value. Why not just use uint32_t and do simple compares. The strinh stuff is more error prone.

theraven marked 2 inline comments as done.

Address some more review comments.

sbin/mount_qemufwcfg/mount_qemufwcfg.cc
263

I've made it a compile-time option. It does improve performance a little bit, but not enough to be on by default.

380

The following line fails. I'd like to set this to the largest allowed timeout, but the kernel rejects the maximum value of this type. I guess this is an unsigned type that is interpreted as signed in the kernel somewhere? Dividing it by two works.

sbin/mount_qemufwcfg/mount_qemufwcfg.cc
263

"simplifies the logic quite a bit" sounds like a good enough reason to me.

380

What do you mean by "the kernel rejects the maximum value of this type"? The fusefs test suite contains a test that sets attr_valid to UINT64_MAX, and that test works.

459–465

Infinite loops probably happened because your readdir implementation kept returning the same dirent over and over. You should be returning something unique and sorted for off, and on entry you should skip over any entries whose offset is less than or equal to the offset provided by the kernel in readIn.

575

To be POSIX-compliant it should include the number of .. links. But I'm not aware of anything that cares, either.