Page MenuHomeFreeBSD

Implement getrandom(2) and getentropy(3)
ClosedPublic

Authored by cem on Feb 25 2018, 2:44 AM.
Tags
None
Referenced Files
F108441276: D14500.id40023.diff
Fri, Jan 24, 7:20 PM
Unknown Object (File)
Tue, Jan 21, 12:32 AM
Unknown Object (File)
Mon, Jan 20, 3:16 PM
Unknown Object (File)
Mon, Jan 20, 10:43 AM
Unknown Object (File)
Sat, Jan 18, 10:02 PM
Unknown Object (File)
Sat, Jan 18, 5:47 AM
Unknown Object (File)
Sat, Jan 11, 6:08 PM
Unknown Object (File)
Thu, Jan 9, 3:55 AM

Details

Summary

The general idea here is to provide userspace programs with well-defined
sources of entropy, in a fashion that doesn't require opening a new file
descriptor (ulimits) or accessing paths (/dev/urandom may be restricted
by chroot or capsicum).

getrandom(2) is the more general API, and comes from the Linux world.
Since our urandom and random devices are identical, the GRND_RANDOM flag
is ignored.

getentropy(3) is added as a compatibility shim for the OpenBSD API.

PR: 194204

Test Plan

Unit tests included.

LTP tests pass. Here is an LTP emulation layer sufficient to run these tests on FreeBSD:

$ cat lapi/getrandom.h
#include <sys/random.h>

$ cat lapi/syscalls.h
#include <sys/syscall.h>

$ cat tst_test.h
#include <sys/param.h>
#include <err.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long TEST_RETURN;
static int TEST_ERRNO;

static inline void
TEST(long a)
{
        TEST_RETURN = a;
        TEST_ERRNO = errno;
}

#define tst_syscall syscall
#define __NR_getrandom SYS_getrandom

#define TPASS 1
#define TFAIL 0
#define TPASSMASK 1
#define TTERRNO 2

static size_t tot_pass, tot_fail;

static inline void
tst_res(int flags, const char *fmt, ...)
{
        va_list ap;

        if ((flags & TPASSMASK) == TPASS) {
                printf("PASS ");
                tot_pass++;
        } else {
                printf("FAIL ");
                tot_fail++;
        }

        va_start(ap, fmt);
        vprintf(fmt, ap);
        va_end(ap);

        if (flags & TTERRNO)
                printf(": %d (%s)", errno, strerror(errno));

        puts("\n");
}

#define ARRAY_SIZE nitems

static inline void
SAFE_GETRLIMIT(int res, struct rlimit *out)
{
        int ret;

        ret = getrlimit(res, out);
        if (ret != 0)
                err(1, "getrlimit");
}

static inline void
SAFE_SETRLIMIT(int res, const struct rlimit *in)
{
        int ret;

        ret = setrlimit(res, in);
        if (ret != 0)
                err(1, "setrlimit");
}

struct tst_test {
        size_t tcnt;
        void (*test)(unsigned);
        void (*test_all)(void);
};

static struct tst_test test;

int
main(int argc __unused, char **argv __unused)
{
        size_t i;

        for (i = 0; i < test.tcnt; i++)
                test.test(i);
        if (test.test_all != NULL)
                test.test_all();

        printf("Pass: %zu/%zu\n", tot_pass, tot_pass + tot_fail);
        return (tot_fail != 0);
}

Under this emulation, all LTP getrandom(2) tests pass. (Compile with, e.g., gcc6 -Wall -Wextra -O2 -g -I. getrandom04.c.)

Truss:

$ LD_LIBRARY_PATH=$(cd lib/libsysdecode ; make -V .OBJDIR) $(cd usr.bin/truss ; make -V .OBJDIR)/truss.full $(cd tests/sys/kern ; make -V .OBJDIR)/sys_getrandom.full getrandom_randomness
...
getrandom("@\M^Pb\M-"\M^S\M-d\^Br\^A\^A1$"...,4096,0) = 4096 (0x1000)
mmap(0x0,266240,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34367328256 (0x80073d000)
getrandom("\aJql\M-Ej\M-S\M^]\M-+\f^\M-e\^Y"...,4096,1) = 4096 (0x1000)
getrandom("\M-_\M^\_\M-K\M-I\M^I.l\M-S\M^]"...,4096,2) = 4096 (0x1000)
getrandom("G\M-1Y\M^N\M^_3Y[4\M^I\M^Z\^S"...,4096,3) = 4096 (0x1000)
passed
writev(1,[{"passed",6},{"\n",1}],2)              = 7 (0x7)
...

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I really really don't like the Linux getrandom(2) API in general.

This implementation of getrandom(2) allowed way too big read per call, allowing it in general could, although not necessarily allowing a (known and effectively useful) attack at this moment, expose continuous output of kernel PRNG to one single caller. The call itself is computational expensive, and because it was done in the kernel, it could easily allow a DoS attack the way it is currently constructed.

Note that it seems, at least according to the manual page of Linux getrandom() they allowed a quite big read too, and implementing getrandom() in kernel means it's hard to mitigate this. If we must implement the same semantics, we should probably further penalize the callers by adding voluntary yields to other threads, but for compatibility I don't see any good reason not to implement it with arc4random().

The OpenBSD getentropy() API has restricted allowable maximum to 256, thereby limited the userland's ability to obtain long, continuous output sequence from the kernel and it's much narrower usage would discourage misuse too.

This revision now requires changes to proceed.Feb 25 2018, 10:07 AM

What specific changes do you want made?

This implementation of getrandom(2) allowed way too big read per call, allowing it in general could, although not necessarily allowing a (known and effectively useful) attack at this moment, expose continuous output of kernel PRNG to one single caller.

The same criticism can be leveled at /dev/urandom. This uses the exact same kernel interface (READ_RANDOM_UIO).

(Additionally, this is an orthogonal and theoretical concern — if our PRNG is vulnerable to a single caller reading continuous output, even getentropy()'s 256 bytes can be repeatedly polled by a single attacker. Additionally, it probably means SHA256 is broken.)

The call itself is computational expensive, and because it was done in the kernel, it could easily allow a DoS attack the way it is currently constructed.

How? This uses the exact same interface as /dev/urandom. If random(4) can easily allow a DoS already, that problem seems orthogonal to this change.

Note that it seems, at least according to the manual page of Linux getrandom() they allowed a quite big read too, and implementing getrandom() in kernel means it's hard to mitigate this. If we must implement the same semantics, we should probably further penalize the callers by adding voluntary yields to other threads,

There are already voluntary yields every PAGE_SIZE bytes in READ_RANDOM_UIO via uiomove().

but for compatibility I don't see any good reason not to implement it with arc4random().

arc4random() doesn't provide the blocking/non-blocking behavior essential to the API. If arc4random()'s chacha20 + reseed is superior to raw Fortuna, it could be moved inside READ_RANDOM_UIO and used by /dev/urandom as well.

cem edited the test plan for this revision. (Show Details)
In D14500#304139, @cem wrote:

What specific changes do you want made?

My preference would be only implementing a blocking getentropy() in kernel, but am open with other proposals as long as there is compelling reasons.

This implementation of getrandom(2) allowed way too big read per call, allowing it in general could, although not necessarily allowing a (known and effectively useful) attack at this moment, expose continuous output of kernel PRNG to one single caller.

The same criticism can be leveled at /dev/urandom. This uses the exact same kernel interface (READ_RANDOM_UIO).

Yes, but my understanding is this new API is intended for all applications (while /dev/urandom access may be blocked with permission), e.g. an application in capability mode should have access to it.

(Additionally, this is an orthogonal and theoretical concern — if our PRNG is vulnerable to a single caller reading continuous output, even getentropy()'s 256 bytes can be repeatedly polled by a single attacker. Additionally, it probably means SHA256 is broken.)

Agreed.

The call itself is computational expensive, and because it was done in the kernel, it could easily allow a DoS attack the way it is currently constructed.

How? This uses the exact same interface as /dev/urandom. If random(4) can easily allow a DoS already, that problem seems orthogonal to this change.

This will not go through any permission checks.

Note that it seems, at least according to the manual page of Linux getrandom() they allowed a quite big read too, and implementing getrandom() in kernel means it's hard to mitigate this. If we must implement the same semantics, we should probably further penalize the callers by adding voluntary yields to other threads,

There are already voluntary yields every PAGE_SIZE bytes in READ_RANDOM_UIO via uiomove().

You are right here, but still, why must this be done inside the kernel?

but for compatibility I don't see any good reason not to implement it with arc4random().

arc4random() doesn't provide the blocking/non-blocking behavior essential to the API. If arc4random()'s chacha20 + reseed is superior to raw Fortuna, it could be moved inside READ_RANDOM_UIO and used by /dev/urandom as well.

Would you mind elaborating the scenario where non-blocking behavior is desirable on FreeBSD, given that we have RANDOM_ATTACH entropy source and additional entropy can be preloaded by loader?

In D14500#304139, @cem wrote:

The same criticism can be leveled at /dev/urandom. This uses the exact same kernel interface (READ_RANDOM_UIO).

Yes, but my understanding is this new API is intended for all applications (while /dev/urandom access may be blocked with permission), e.g. an application in capability mode should have access to it.

I guess so, but the concern is still a completely hypothetical attack against SHA256 and Fortuna. And maybe a local DoS — I'm still unclear on that.

How? This uses the exact same interface as /dev/urandom. If random(4) can easily allow a DoS already, that problem seems orthogonal to this change.

This will not go through any permission checks.

I suppose. The only checks on random(4) today are the VFS ones, though, and probably those only exist because of the constraints of the unix filesystem — every file must have permissions.

I don't see it being especially useful to restrict access to Fortuna output on a UID basis. Do you have a use in mind?

There are already voluntary yields every PAGE_SIZE bytes in READ_RANDOM_UIO via uiomove().

You are right here, but still, why must this be done inside the kernel?

Why must it not be done inside the kernel? The only distinction is that a smaller per-syscall threshold imposes a slightly higher CPU cost on a buggy, malicious or compromised userspace application requesting lots of bytes. It isn't an effective or principled means of resource-constraining the userspace application.

I suspect but have not empirically demonstrated that a userspace application repeatedly polling the getentropy() interface for 256 bytes at a time uses approximately the same time in the kernel, if not more, as a userspace application making large read requests.

Additionally, userspace CSPRNGs have a strong tradition of being broken in subtle ways.

arc4random() doesn't provide the blocking/non-blocking behavior essential to the API. If arc4random()'s chacha20 + reseed is superior to raw Fortuna, it could be moved inside READ_RANDOM_UIO and used by /dev/urandom as well.

Would you mind elaborating the scenario where non-blocking behavior is desirable on FreeBSD,

I don't have a specific scenario in mind. It seems desirable to conform to the intent of userspace applications. If an application has decided that a non-blocking request is desirable, I don't think blocking is the right thing to do — this is the kind of behavior that drives application developers crazy. I suppose we could document that GRND_NONBLOCK is not allowed and return EINVAL if it is used?

On the other hand, the random(4) manual page specifically calls this out in the second paragraph:

The generator will start in an unseeded state, and will block reads until
it is seeded for the first time.  This may cause trouble at system boot
when keys and the like are generated from random so steps should be taken
to ensure a seeding as soon as possible.

The prominent placement makes it seem like unseeded state could be a real problem.

given that we have RANDOM_ATTACH entropy source and additional entropy can be preloaded by loader?

I doubt RANDOM_ATTACH provides very much entropy at all, especially on embedded hardware. Is it actually sufficient to seed random(4) alone?

rc.d/random and /var/db/entropy-file + /boot/entropy may provide sufficient entropy to seed random(4) most of the time. There is a chance the on-disk entropy doesn't exist — either in the installer, or if there is filesystem corruption, or an administrator simply deletes the files, or any other possible cause. /boot may be readonly and we may be unable to write out any entropy there.

There may be various reasons an appliance or embedded system may want to avoid writing some entropy to disk and thus may struggle to re-seed the random device soon after reboot.

In D14500#304788, @cem wrote:
In D14500#304139, @cem wrote:

The same criticism can be leveled at /dev/urandom. This uses the exact same kernel interface (READ_RANDOM_UIO).

Yes, but my understanding is this new API is intended for all applications (while /dev/urandom access may be blocked with permission), e.g. an application in capability mode should have access to it.

I guess so, but the concern is still a completely hypothetical attack against SHA256 and Fortuna. And maybe a local DoS — I'm still unclear on that.

Plus it has loosen the permission check from something that can be restricted with file system permissions or devfs rules to unrestricted.

How? This uses the exact same interface as /dev/urandom. If random(4) can easily allow a DoS already, that problem seems orthogonal to this change.

This will not go through any permission checks.

I suppose. The only checks on random(4) today are the VFS ones, though, and probably those only exist because of the constraints of the unix filesystem — every file must have permissions.

I don't see it being especially useful to restrict access to Fortuna output on a UID basis. Do you have a use in mind?

No but you can restrict this with file system permissions or devfs rules (e.g. in a jail) now, with the proposed change, it's gone, and I'm asking for a good reason for that change.

There are already voluntary yields every PAGE_SIZE bytes in READ_RANDOM_UIO via uiomove().

You are right here, but still, why must this be done inside the kernel?

Why must it not be done inside the kernel?

See my comments above.

The only distinction is that a smaller per-syscall threshold imposes a slightly higher CPU cost on a buggy, malicious or compromised userspace application requesting lots of bytes. It isn't an effective or principled means of resource-constraining the userspace application.

It's easier to discipline CPU usage when the code is in userland than in kernel, and there is no good reason so far that you have raised to support doing it in kernel.

I suspect but have not empirically demonstrated that a userspace application repeatedly polling the getentropy() interface for 256 bytes at a time uses approximately the same time in the kernel, if not more, as a userspace application making large read requests.

Additionally, userspace CSPRNGs have a strong tradition of being broken in subtle ways.

But how can this be fixed by doing it in kernel?

arc4random() doesn't provide the blocking/non-blocking behavior essential to the API. If arc4random()'s chacha20 + reseed is superior to raw Fortuna, it could be moved inside READ_RANDOM_UIO and used by /dev/urandom as well.

Would you mind elaborating the scenario where non-blocking behavior is desirable on FreeBSD,

I don't have a specific scenario in mind. It seems desirable to conform to the intent of userspace applications.

This seems to be a non-goal.

If an application has decided that a non-blocking request is desirable, I don't think blocking is the right thing to do

Well, two things:

  1. There are enough places in FreeBSD that requires a unblocked entropy device in the first place to work. You should expect the entropy device be unblocked quite early, and more importantly:
  1. The only sane outcome of an application that needs cryptographically secure entropy data, requested non-blocking behavior and when the kernel do tell them to try again, is to either abort, or retry (which is essentially the blocking behavior), because userland application can never reliably generate any meaningful entropy themselves.

Allowing non-blocking behavior means the application developer is free to do creative and harmful things that they have no idea what it could be, and we should not give them the chance to make this kind of mistakes.

— this is the kind of behavior that drives application developers crazy. I suppose we could document that GRND_NONBLOCK is not allowed and return EINVAL if it is used?

Or, that API should not exist at all.

If the goal is to provide compatibility, then it's better to emulate the same API but ignore the GRND_NONBLOCK request and treat it as a blocking request.

On the other hand, the random(4) manual page specifically calls this out in the second paragraph:

The generator will start in an unseeded state, and will block reads until
it is seeded for the first time.  This may cause trouble at system boot
when keys and the like are generated from random so steps should be taken
to ensure a seeding as soon as possible.

The prominent placement makes it seem like unseeded state could be a real problem.

It *can* be unseeded, but in practice it *should never* be, and when it happens (for instance on a misconfigured system), it's very visible.

given that we have RANDOM_ATTACH entropy source and additional entropy can be preloaded by loader?

I doubt RANDOM_ATTACH provides very much entropy at all, especially on embedded hardware. Is it actually sufficient to seed random(4) alone?
rc.d/random and /var/db/entropy-file + /boot/entropy may provide sufficient entropy to seed random(4) most of the time. There is a chance the on-disk entropy doesn't exist — either in the installer, or if there is filesystem corruption, or an administrator simply deletes the files, or any other possible cause. /boot may be readonly and we may be unable to write out any entropy there.

There may be various reasons an appliance or embedded system may want to avoid writing some entropy to disk and thus may struggle to re-seed the random device soon after reboot.

I can't assert about RANDOM_ATTACH would work 100% on embedded hardware that I haven't seen and being able to seed random(4) alone, but since we are limited by the key size, it's really not that many, and the embedded systems that I ever own that runs FreeBSD would have unblocked way before mounting /, and if they don't, existing applications would break anyway.

Applications demanding for cryptographically secure entropy should be blocked until the entropy device until the entropy device is unblocked. You can not both choose being secure and take the liberty of not properly feeding the entropy source properly at the same time.

Please do not remove O3: Kernel Random Numbers Generator from reviewers, the proposed change have material impact to the kernel random number generator's behavior and I don't think it's Okay to bypass it.

I think mapping NONBLOCK behavior to blocking transparently would be a pretty big POLA violation. If we don't want to support non-blocking operation I would much rather that we fail the request with EINVAL. That said, if existing 3rd party applications are using getrandom() with the non-blocking flag now, then the end result will be that developers decide FreeBSD is broken and just don't use the API at all on FreeBSD via #ifdef's or the like. We have limited control on what we can force application developers to do. I'm also not sure how many administrators choose to restrict access to /dev/random and /dev/urandom in practice. In our default devfs rulesets used for jails we treat /dev/random as being as fundamentally needed as /dev/null and /dev/zero. We could perhaps make access to getrandom a jail permission though such that an admin could create jails which do not permit access to the system call?

Meta comment: This code review has turned into a culture war over whether the feature should exist at all. I don't think that's the function of code review. If you want to argue about this stuff all day, please take it to arch@. Ideally someone other than me actually takes a look at the code changes at some point and ACKs or points out problems.

I reckon it is perfectly valid to review the code as-is, but request pre-commit approval from core or consensus of arch. Please, uh, focus on the code. Thank you.

Plus it has loosen the permission check from something that can be restricted with file system permissions or devfs rules to unrestricted.

I think the restriction here is accidental. You're looking at urandom as if it a potential attack vector exposed to untrusted applications. And in a sense, it is. In another sense, though, it is userland crypto's only hope of providing a secure implementation. That is why it is so important it be available to all consumers. The syscall was not created on a whim.

LibreSSL has been written to use /dev/urandom, but also to have a fallback if there is an exhaustion of file descriptors (which an attacker might try to arrange) or there is some other reason that the library can't open the file.

https://lwn.net/Articles/606141/

There are problems with using fd-based random sources from library code:

The initial patch included a set of emulation functions so that getrandom() calls would always work; they would read the data from /dev/random or /dev/urandom as appropriate. Doing so involved keeping open file descriptors to those devices (lest later calls fail if the application does a chroot()). But using file descriptors in libraries is always fraught with perils; applications may have their own ideas of which descriptors are available, or may simply run a loop closing all descriptors. So the code took pains to use high-numbered descriptors that applications presumably don't care about, and it used fstat() to ensure that the application had not closed and reopened its descriptors between calls.

https://lwn.net/Articles/711013/

Anyway, FreeBSD is not Linux nor is it OpenBSD. And a serious discussion of potential security impact is warranted. But I don't think the ability to restrict access to the devfs node is an intentional feature we are losing — it's just a historical accident.

No but you can restrict this with file system permissions or devfs rules (e.g. in a jail) now, with the proposed change, it's gone, and I'm asking for a good reason for that change.

What is the justification for needing to run a jail without access to a good random source?

John suggested a compromise position where jails can restrict use of this syscall. I'm not sure I buy that it's useful, but would that be an ok compromise for you?

Why must it not be done inside the kernel?

See my comments above.

Sorry, maybe I'm dumb. But if they answered my question, I wouldn't be asking it. Can you clarify?

The only distinction is that a smaller per-syscall threshold imposes a slightly higher CPU cost on a buggy, malicious or compromised userspace application requesting lots of bytes. It isn't an effective or principled means of resource-constraining the userspace application.

It's easier to discipline CPU usage when the code is in userland than in kernel, and there is no good reason so far that you have raised to support doing it in kernel.

Consider userspace program consisting of while(1) { getentropy(buf, 256); }. The CPU use is all in kernel — either in READ_RANDOM_UIO or syscall entry/leave code.

I suspect but have not empirically demonstrated that a userspace application repeatedly polling the getentropy() interface for 256 bytes at a time uses approximately the same time in the kernel, if not more, as a userspace application making large read requests.

Additionally, userspace CSPRNGs have a strong tradition of being broken in subtle ways.

But how can this be fixed by doing it in kernel?

Doing it in kernel (as is random(4)) has the advantage of being the smaller change and least likelihood to introduce a regression.

I don't have a specific scenario in mind. It seems desirable to conform to the intent of userspace applications.

This seems to be a non-goal.

I don't follow. If we don't care about userspace programs, why do we provide a syscall interface at all? Why implement kevent() or O_NONBLOCK and instead just do all I/O blocking, since it is easier?

The kernel exists to serve userspace applications. A secure kernel that doesn't do anything is worthless.

If an application has decided that a non-blocking request is desirable, I don't think blocking is the right thing to do

  1. There are enough places in FreeBSD that requires a unblocked entropy device in the first place to work. You should expect the entropy device be unblocked quite early, and more importantly:

Well, you hope. But it is not guaranteed.

  1. The only sane outcome of an application that needs cryptographically secure entropy data, requested non-blocking behavior and when the kernel do tell them to try again, is to either abort, or retry (which is essentially the blocking behavior), because userland application can never reliably generate any meaningful entropy themselves.

I'm not sure I buy that these are the only options.

Allowing non-blocking behavior means the application developer is free to do creative and harmful things that they have no idea what it could be, and we should not give them the chance to make this kind of mistakes.

Ok — what about the compromise of EINVAL on NONBLOCK requests?

— this is the kind of behavior that drives application developers crazy. I suppose we could document that GRND_NONBLOCK is not allowed and return EINVAL if it is used?

Or, that API should not exist at all.

Please at least try to work with me here. A flat "no" without justification isn't persuasive.

The API exists in the real world; the question is whether we make application developers and porters emulate it on FreeBSD repeatedly, in potentially broken ways, or whether we provide native support. I am ok ignoring your NACK if this is your only feedback. I'm trying to find common ground and compromise, though.

If the goal is to provide compatibility, then it's better to emulate the same API but ignore the GRND_NONBLOCK request and treat it as a blocking request.

Strongly disagree. I'd rather signal the calling program with ABRT or TERM or whatever than silently block NONBLOCK requests.

It *can* be unseeded, but in practice it *should never* be, and when it happens (for instance on a misconfigured system), it's very visible.

In other worse, this API *can* be useful? Misconfigured systems are real systems we cannot just ignore.

Here's an example (on Linux) of early boot blocking causing problems: https://lwn.net/Articles/693189/

I can't assert about RANDOM_ATTACH would work 100% on embedded hardware that I haven't seen and being able to seed random(4) alone, but since we are limited by the key size, it's really not that many, and the embedded systems that I ever own that runs FreeBSD would have unblocked way before mounting /, and if they don't, existing applications would break anyway.

Applications demanding for cryptographically secure entropy should be blocked until the entropy device until the entropy device is unblocked.

They get that behavior with the ordinary flags=0 invocation.

You can not both choose being secure and take the liberty of not properly feeding the entropy source properly at the same time.

Non-block mode does not return insecure bytes; it just returns an error code instead of blocking.

Please do not remove O3: Kernel Random Numbers Generator from reviewers, the proposed change have material impact to the kernel random number generator's behavior and I don't think it's Okay to bypass it.

I don't see how this change impacts sys/dev/random number generation in any way. Please clarify.

In D14500#305728, @jhb wrote:

I think mapping NONBLOCK behavior to blocking transparently would be a pretty big POLA violation. If we don't want to support non-blocking operation I would much rather that we fail the request with EINVAL.

+1

That said, if existing 3rd party applications are using getrandom() with the non-blocking flag now, then the end result will be that developers decide FreeBSD is broken and just don't use the API at all on FreeBSD via #ifdef's or the like.

A similar result will happen if we block when nonblocking was requested, except it will take them longer to debug the problem and they will be mad at us for wasting their time. I'd rather they find out sooner than later. Ideally, they even file a bug and ask for the behavior to be added.

We have limited control on what we can force application developers to do. I'm also not sure how many administrators choose to restrict access to /dev/random and /dev/urandom in practice. In our default devfs rulesets used for jails we treat /dev/random as being as fundamentally needed as /dev/null and /dev/zero.

+1

We could perhaps make access to getrandom a jail permission though such that an admin could create jails which do not permit access to the system call?

That is a compromise I can live with. Of course, there's no point implementing this if delphij will not concede any ground.

And to clarify, I'd much prefer to have code review first, but lacking any compelling reason not to, I plan to go ahead and commit this in the next few days.

Adding cperciva and jmg who may be able to help provide some opinions about how they think this should go.

sys/kern/sys_getrandom.c
101 ↗(On Diff #39698)

The freebsd32 one should probably be in sys/compat/freebsd32/freebsd32_misc.c. For system calls we put the freebsd32 syscalls in sys/compat/freebsd32 that are standard. (aio used to be optional which is why they are still in vfs_aio.c rather than under sys/compat/freebsd32). That said, I don't think you actually need the freebsd32_getrandom() at all but can just reuse sys_getrandom() for freebsd32 as is just as we do for read() and write().

Not looking at the code, but on the design front:

  1. We absolutely need to have a system call for getting entropy from the kernel. I think it should be unconditionally available, because there's going to be someone who doesn't check their return codes and just assumes that their buffer got filled with random bits.
  1. It should fill an arbitrarily large buffer; performance should not be an issue (with per-CPU entropy output, locking is a non-issue, and we should run at almost memcpy speeds), and allowing the user to read GB of entropy at a time is perfectly safe with modern ciphers. Also, *someone* writing code on Linux is going to assume that they can fill large buffers, and not supporting it would make things break very badly.
  1. Trying to guess if we have enough entropy to safely unblock the entropy output is never going to work. It is mathematically impossible to figure out how much entropy we have.

Adding cperciva and jmg who may be able to help provide some opinions about how they think this should go.

Thanks! Is jmg still active? I haven't heard from him in a couple years.

  1. We absolutely need to have a system call for getting entropy from the kernel. I think it should be unconditionally available, because there's going to be someone who doesn't check their return codes and just assumes that their buffer got filled with random bits.

Both APIs are unconditionally available, as-written.

Not checking return codes is a harder problem to solve. How do you propose to handle the cases where these APIs would produce an error code today? Signal the process (with a default-terminate signal), like kern.trap_enotcap=1?

I think it's an interesting suggestion. It deviates pretty far from our conventional approach to system calls, including read(2) of the random device. But, if there is consensus I am willing.

  1. It should fill an arbitrarily large buffer;

Agree.

performance should not be an issue (with per-CPU entropy output, locking is a non-issue, and we should run at almost memcpy speeds),

Performance goals are a totally orthogonal concern to the API. We can worry about that separately. Current random(4) device performance is totally adequate (hundreds of MB/s).

and allowing the user to read GB of entropy at a time is perfectly safe with modern ciphers.

Agreed.

Also, *someone* writing code on Linux is going to assume that they can fill large buffers, and not supporting it would make things break very badly.

Yes. There's no point providing a compatible API if it is not compatible.

  1. Trying to guess if we have enough entropy to safely unblock the entropy output is never going to work. It is mathematically impossible to figure out how much entropy we have.

I'm not sure I follow. Our random(4) Fortuna already does track when it has sufficient input to consider itself seeded, and this revision does not change that in any way. The method is:

READ_RANDOM_UIO() invokes the ra_pre_read() method of the underlying algorithm (e.g., Fortuna) repeatedly until that algorithm returns true via the ra_seeded() method. (READ_RANDOM() is similar, but does not block until the algorithm is seeded -- instead returning a zero byte result.)

random_fortuna_pre_read() checks fs_pool[0].fsp_length must be >= the effective min pool size, which is default DEFPOOLSIZE = 64 bytes. If this is the case, it may invoke random_fortuna_reseed_internal(). random_fortuna_reseed_internal() unblocks the random(4) device, as well as triggering ra_seeded() to return true.

fs_pool[0].fsp_length begins as zero at boot. It is modified in the Fortuna ra_event_processor method, random_fortuna_process_event(), if the event's he_destination (mod NPOOLS = 32) is zero. In that case, the event's he_size is added to fsp_length .

The ra_event_processor method is invoked from a couple different avenues. One possible source of entropy is random(4) device writes, via randomdev_write() -> randomdev_accumulate(). This source is restricted to PAGE_SIZE input bytes every 100 milliseconds. After hashing, the output entropy (2 32-bit words, or 8 bytes) is fed to he_destination of zero (as well as 1-7). So: a single thread write(/dev/random) takes ~700 milliseconds to seed our Fortuna.

The other source of entropy is random_harvestq_fast_process_event(). This is invoked from a couple places.

random_kthread walks through a global ring of collected entropy events and passes them to ra_event_processor via random_harvestq_fast_process_event. At boot, this ring is empty.

random_harvestq_prime looks for a module name "boot_entropy_cache" and divvies up the contents 8 bytes at a time to incrementing he_destinations. It also subtracts 256 bytes from the front arbitrarily. For Fortuna's 32 pools, this means at least 256 + 1744 = 2000 bytes from this source are sufficient to unblock the random device.

Finally, random_harvest_direct will send 8 bytes of entropy to the next he_destination round-robin. It is invoked from random_kthread when any harvest_context.hc_entropy_fast_accumulator.buf[] entry is non-zero. It is also invoked from random_sources_feed(), which invokes rs_read() methods on any registered random sources, up to some "read rate" + 1 multiple of the poolcount. (The read rate will be zero until random(4) device is successfully unblocked, so for our purposes, rs_read() is invoked only 32 times per random_sources_feed().)

random_sources_feed() is invoked from random_kthread().

Some notes about random_kthread(): It starts at SI_SUB_KICK_SCHEDULER, and it blocks for 100ms every iteration, regardless of whether the underlying algorithm is seeded yet or not.

Anyway: none of this is affected by this revision. Our random device already blocks until it is seeded, and I do not suggest changing that behavior.

sys/kern/sys_getrandom.c
101 ↗(On Diff #39698)

Ok, it can be moved.

That said, I don't think you actually need the freebsd32_getrandom() at all but can just reuse sys_getrandom() for freebsd32 as is just as we do for read() and write().

Hm, and that works fine with userspace's 32-bit pointers? I guess so.

sys/kern/sys_getrandom.c
101 ↗(On Diff #39698)

Yes, what happens is that the freebsd32 layer will actually build a UAP block from the 64-bit registers zeroing the upper 32-bits. Only when that UAP isn't compatible with the native system call is a freebsd32 wrapper needed. (Usually things like merging the two halves of an off_t argument or having to translate structures pointed to by one of the arguments.)

cem planned changes to this revision.Mar 6 2018, 6:44 PM
cem added inline comments.
sys/kern/sys_getrandom.c
101 ↗(On Diff #39698)

Perfect, thanks. I'll update the revision.

cem marked 3 inline comments as done.

Remove duplicate freebsd32_getrandom and just re-use sys_getrandom -- jhb's
suggestion.

Please take a look at comments inline.

I'm still not in favor of the getrandom() API but I will not object it either, as long as a cryptographer signs off it.

lib/libc/gen/Makefile.inc
60 ↗(On Diff #40006)

The symbol in this .c should be added to Symbol.map.

lib/libc/gen/getentropy.c
53 ↗(On Diff #40006)

Not sure if I have followed, but it seems that this would only happen when entropy device is uninitialized and the tsleep returned EINTR. If that's the case, why should we continue the loop?

Note that ENOSYS should be handled by fallback with kern.arandom or an explicit abort() should be done.

lib/libc/sys/getrandom.2
71 ↗(On Diff #39698)

How was this guaranteed? (And where does this 256 come from?)

lib/libc/tests/gen/Makefile
66 ↗(On Diff #39698)

I don't think you really need to explicitly use LIBADD=c here?

sys/kern/sys_getrandom.c
79 ↗(On Diff #40006)

Please change "flags & GRND_NONBLOCK" to "((flags & GRND_NONBLOCK) == GRND_NONBLOCK)".

tests/sys/kern/Makefile
27 ↗(On Diff #40006)

Similar to the userland test case -- I don't think 'c' is needed here.

cem planned changes to this revision.Mar 6 2018, 11:16 PM
cem marked 2 inline comments as done.

Please take a look at comments inline.

I will address the feedback, thanks!

lib/libc/gen/getentropy.c
53 ↗(On Diff #40006)

Not sure if I have followed, but it seems that this would only happen when entropy device is uninitialized and the tsleep returned EINTR. If that's the case, why should we continue the loop?

getentropy(2) API on OpenBSD explicitly and intentionally does not return EINTR. Do you have a suggestion for how to mimic that better?

Note that ENOSYS should be handled by fallback with kern.arandom or an explicit abort() should be done.

Yes, nice catch. Thanks.

lib/libc/sys/getrandom.2
71 ↗(On Diff #39698)

It is guaranteed by the behavior of READ_RANDOM_UIO.

256 originated from OpenBSD getentropy(), but the Linux folks decided to adopt it as well in getrandom(). In practice, we actually provide up to PAGE_SIZE random bytes without interruption, but I don't think it's a good idea to make a stronger guarantee than the original Linux API.

lib/libc/tests/gen/Makefile
66 ↗(On Diff #39698)

It is needed — if it were not, I wouldn't have added it :-).

Without the explicit LIBADD, the test code is linked against the system libc instead of the OBJDIR version. Ditto for the other test below.

sys/kern/sys_getrandom.c
79 ↗(On Diff #40006)

jhb@ has argued in the past for allowing this form of implicit boolean conversion (bitmask test).

The == GRND_NONBLOCK is excessively verbose. I'm willing to add != 0 to explicitly make the conversion to boolean explicit.

cem marked 6 inline comments as done.

Incorporate delphij's suggestions:

  • Add getentropy to Symbol.map
  • Explicit abort() in getentropy(3) in the case that new libc is installed on old kernel
  • Explicitly convert GRND_NONBLOCK check to boolean

Add getrandom to truss(1).

lib/libc/sys/getrandom.2
71 ↗(On Diff #39698)

I think (I could be wrong here, @jhb please correct me) the only guaranteed behavior is "will not be interrupted by signals" but not the number of bytes being read (in other words, if you ask for e.g. 1MB worth of data, the kernel will keep reading up to PAGE_SIZE from the current p_random_alg_context each time, and uiomove() it, but at the time no signal is being checked?

By the way, D5788 has set devfs_iosize_max_clamp to 1 which limits DEVFS_IOSIZE_MAX to INT_MAX (2GiB) by default, and enforced INT_MAX for 32-bit applications. There is no similar restriction in kern_getrandom, which may or may not be a problem (I'd defer that to @cperciva ) but you might want to document the difference in comments in kern_getrandom at least, if this is intentional.

My reading of Linux manual page is that they do allow arbitrary size but the process can be interrupted by signal when the request is large, but I haven't read their code to find out.

lib/libc/tests/gen/Makefile
66 ↗(On Diff #39698)

(No action requested) Hmm good to know. If the test suite for C library is using system C library and not the code being tested itself, should that be considered as a build system bug?

sys/kern/sys_getrandom.c
94 ↗(On Diff #40023)

Note that the default devfs limit of size is DEVFS_IOSIZE_MAX (by default, INT_MAX or 2GiB) while buflen would accept up to SSIZE_MAX. If I understood the code correctly, the current code would ignore signal once the read started.

79 ↗(On Diff #40006)

[No action requested] Ok this is better. An alternative would be to use "!!(flags & GRND_NONBLOCK)' by the way, but the difference is less important.

I was going to test the behavior re: interruption on Linux with the following test program, but I do not have enough free RAM on my Linux system because it is also running Chrome 😂 😂 😂 . If anyone else has 2GB of RAM spare on a Linux system, the following test might be interesting (try pressing Ctrl+C after starting it):

#include <sys/random.h>
#include <err.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        void *buf;
        ssize_t ret;

        buf = malloc(INT_MAX-1);
        if (buf == NULL)
                err(1, "malloc");

        ret = getrandom(buf, INT_MAX, 0);
        printf("ret: %zd\n", ret);
        if (ret < 0)
                err(1, "getrandom");
        return 0;
}
lib/libc/sys/getrandom.2
71 ↗(On Diff #39698)

I don't understand the question around guarantees. Can you clarify what the question is?

By the way, D5788 has set devfs_iosize_max_clamp to 1 which limits DEVFS_IOSIZE_MAX to INT_MAX (2GiB) by default, and enforced INT_MAX for 32-bit applications. There is no similar restriction in kern_getrandom, which may or may not be a problem

There is a similar restriction present in kern_getrandom. It is this code:

61         if (buflen > IOSIZE_MAX)
62                 return (EINVAL);

I believe I borrowed it from sys_read().

IOSIZE_MAX is similar to DEVFS_IOSIZE_MAX except that the corresponding clamp sysctl is false by default. The getrandom(2) API is not part of devfs so I don't know how much it makes sense to use the DEVFS_ macro, but I do not feel strongly about it and would be happy to make that change if you (or Colin) prefers.

My reading of Linux manual page is that they do allow arbitrary size but the process can be interrupted by signal when the request is large, but I haven't read their code to find out.

Yep. Arbitrary size is constrained somewhat by the return type of the API (ssize_t). I think we should make the same guarantees (and no stronger) in our manual pages as Linux (and I have tried to do so).

It is probably desirable to check for signals occasionally during large reads in case the user wants to abort the running program. We would just want to do it after the first page of output has been copied out. That enhancement could be made, but is maybe orthogonal to this change — similar behavior impacts random(4) today (modulo DEVFS_IOSIZE_MAX vs IOSIZE_MAX as discussed).

lib/libc/tests/gen/Makefile
66 ↗(On Diff #39698)

Yep, it is probably a build system bug.

sys/kern/sys_getrandom.c
94 ↗(On Diff #40023)

Yes, that's a good point.

I get about 200 MB/s out of /dev/urandom with a GENERIC (debug-enabled) kernel on amd64. So an INT_MAX sized read(2) can already uninterrupted block for about 10 seconds on a debug-enabled kernel. That's not great. I can confirm it is uninterruptible today with this command:

dd if=/dev/urandom bs=2147483647 count=1 of=/dev/null

SSIZE_MAX getrandom(2) could block uninterruptibly for even longer (modulo available RAM).

So, I think we want both random(4) and this API to become interruptible.

lib/libc/sys/getrandom.2
71 ↗(On Diff #39698)

It looks like an attempt was made to add signal / EINTR detection to the read path in r273997, but it is non-functional because uiomove() does not return EINTR / ERESTART and no other check for signals mid-read (after the blocking phase) is performed.

sys/kern/sys_getrandom.c
59 ↗(On Diff #40023)

Style: there should be an explicit comparison with 0.

tests/sys/kern/sys_getrandom.c
54 ↗(On Diff #40023)

Style: missing space before "*/". Comments should be sentences.

67 ↗(On Diff #40023)

This can be in the loop header.

101 ↗(On Diff #40023)

You could make the test more robust by filling the entire buffer with 0x7c and checking every unmodified byte after each call to getrandom().

cem marked 3 inline comments as done.Mar 12 2018, 3:22 PM
cem added inline comments.
tests/sys/kern/sys_getrandom.c
67 ↗(On Diff #40023)

I don't believe so — see the continue case. I suppose that case could force ret to zero for the loop increment, but I don't think that's obviously better.

Address CR feedback:

  • Allow interruption of very large random(4) / getrandom(2) read requests (check is performed at the 10MB boundary). (delphij)
  • Style fixes and test robustness enhancement. (markj)

Could you please create a review of just the sys/dev/random/randomdev.c portion of change (see my comments inline) as this is a self-contained bugfix and is orthogonal to the rest part of this review? It's easier to understand smaller changesets and future readers of code history would have an easier time to see the reasoning when they are done in smaller and self-contained units.

lib/libc/sys/getrandom.2
71 ↗(On Diff #39698)

Yes you are indeed right on that non-functional part and that was a nice catch!

Please note that once the sys/dev/random/random_dev.c change gets landed, it would actually provide a greater than 256 bytes guarantee (the code reads in PAGE_SIZE quantities for large requests, and a PCATCH is done each chunk of read).

sys/dev/random/randomdev.c
171 ↗(On Diff #40198)

Please add a comment here that uiomove may yield (after each read of as much as PAGE_SIZE).

173 ↗(On Diff #40198)

s/MB/MiB/

177 ↗(On Diff #40198)

Thanks for doing this!

One suggestion: if I was you I'll use a "whole" number (e.g. 1 << 24 or 1 << 23 instead of 10* (1 << 20)). This way you can easily test the condition using a bitmask instead of division.

[Potential bikeshed warning] It seems to be reasonable to use 2MiB (superpage size on amd64) but I'm fine with other options.

182 ↗(On Diff #40198)

I think the != should be changed to <= here, with the comments below updated (the read is actually done and total_read is being updated regardless of error if the signal happen to deliver right before reading the last page and when that page would end up making a new chunk).

Could you please create a review of just the sys/dev/random/randomdev.c portion of change (see my comments inline) as this is a self-contained bugfix and is orthogonal to the rest part of this review? It's easier to understand smaller changesets and future readers of code history would have an easier time to see the reasoning when they are done in smaller and self-contained units.

Sure, that makes sense. I will do so.

sys/dev/random/randomdev.c
173 ↗(On Diff #40198)

I don't think the distinction matters given the non-specificity of "few."

177 ↗(On Diff #40198)

Sure, it can easily be changed to be a power of two. I'll let the compiler optimizer convert the logical division into a bitmask check.

2 MB would result in polling ~100 times per second on my machine — that seemed too high to me. On the other hand we don't want tiny little MIPS machines to be stuck blocking forever. I was going to round 10 up to 16 MB, but I do not feel strongly about this number.

182 ↗(On Diff #40198)

I am not sure about this check. I think <= is definitely wrong. Keep in mind that uio_resid is modified (decremented) by uiomove() while total_read is simultaneously modified (incremented) in the body of the loop.

I think we only want to return EINTR or RESTART if *no* bytes were read and the error was one of those two. Given the code, I don't think that is possible. So I suspect the right thing to do is to just remove the total_read <=> uio_resid portion of the conditional.

Remove randomdev.c changes from this review.

cem marked 3 inline comments as done.Mar 13 2018, 6:32 PM

New review for randomdev changes is at https://reviews.freebsd.org/D14684 .

Anything else for this review?

This revision was not accepted when it landed; it landed in state Needs Review.Mar 21 2018, 1:16 AM
This revision was automatically updated to reflect the committed changes.