Page MenuHomeFreeBSD

tests/sys/audit: Avoid race caused by starting auditd(8) for testing
ClosedPublic

Authored by arichardson on Feb 1 2021, 7:17 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Oct 24, 3:31 AM
Unknown Object (File)
Thu, Oct 24, 3:30 AM
Unknown Object (File)
Thu, Oct 24, 3:30 AM
Unknown Object (File)
Thu, Oct 24, 3:30 AM
Unknown Object (File)
Thu, Oct 24, 3:30 AM
Unknown Object (File)
Thu, Oct 24, 3:09 AM
Unknown Object (File)
Fri, Oct 18, 7:43 PM
Unknown Object (File)
Fri, Oct 18, 7:29 AM
Subscribers

Details

Summary

In the CheriBSD CI we reproducibly see the first test in sys/audit
(administrative:acct_failure) fail due to a missing startup message.
It appears this is caused by a race condition when starting auditd:
service auditd onestart returns as soon as the initial auditd() parent
exits (after the daemon(3) call).
We can avoid this problem by setting up the auditd infrastructure
in-process: libauditd contains audit_quick_{start,stop}() functions that
look like they are ideally suited to this task.
This patch also avoids forking lots of shell processes for each of the 418
tests by using auditon(A_SENDTRIGGER, &trigger, sizeof(trigger)) to check
for a running auditd(8) instead of using service auditd onestatus.

With these two changes (and D28388 to fix the XFAIL'd test) I can now
boot and run cd /usr/tests/sys/audit && kyua test without any failures
in a single-core QEMU instance. Before there would always be at least one
failed test.

Besides making the tests more reliable in CI, a nice side-effect of this
change is that it also significantly speeds up running them by avoiding
lots of fork()/execve() caused by shell scripts:
Running kyua test on an AArch64 QEMU took 315s before and now takes 68s,
so it's roughly 3.5 times faster. This effect is even larger when running
on a CHERI-RISC-V QEMU since emulating CHERI instructions on an x86 host
is noticeably slower than emulating AArch64.

Test Plan

aarch64+amd64 QEMU no longer fail.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 37053
Build 33942: arc lint + arc unit

Event Timeline

This looks great! One question: what happens if auditd is already running? Will the tests run successfully? And what's "fake" about the fake auditd?

tests/sys/audit/utils.c
299

If two tests are running in parallel, they might both try to start fake auditd. The only thing that saves you is the fact that the audit tests are currently all marked as is_exclusive.

This looks great! One question: what happens if auditd is already running? Will the tests run successfully? And what's "fake" about the fake auditd?

Well, it's not actually a real auditd and doesn't log properly. I'm not sure what else is different. The comment above audit_quick_start() says:

/*
 * Do just what we need to quickly start auditing.  Assume no system logging or
 * notify.  Return:
 *   0	 on success,
 *  -1   on failure.
 */

It also appears to be (completely(?)) in-process, so maybe we don't actually need the cleanup function anymore.

I tried running the tests after running service auditd onestart and they continue to work since the is_auditd_running() returns true and we can use the existing auditd.
However, I'm not sure that is a good idea since at least one test (auditctl_success) changes the running auditd configuration.
Maybe we should be skipping these tests when a running auditd is detected (possibly with an env-var override to test against the running auditd)?

tests/sys/audit/utils.c
299

Ah yes, I keep forgetting that kyua can actually run tests in parallel.
I'm not sure what happens if multiple processes run audit_quick_start: my guess would be one of them fails. Will try to test this with a SMP QEMU.

Running the tests in parallel (by commenting out the is_exclusive) causes failures as expected (0 != audit_quick_start(): Failed to start fake auditd: File exists).

It also then results in a panic:

root@freebsd-amd64:/usr/tests/sys/audit # kyua -v parallelism=16 test
administrative:adjtime_failure  ->  passed  [0.153s]
administrative:adjtime_success  ->  passed  [0.149s]
administrative:auditctl_failure  ->  failed: /local/scratch/alr48/cheri/freebsd/tests/sys/audit/utils.c:300: 0 != audit_quick_start(): Failed to start fake auditd: File exists  [0.148s]
administrative:auditctl_success  ->  passed  [0.145s]
administrative:auditon_default_failure  ->  passed  [0.057s]
administrative:auditon_default_success  ->  failed: /local/scratch/alr48/cheri/freebsd/tests/sys/audit/utils.c:300: 0 != audit_quick_start(): Failed to start fake auditd: File exists  [0.143s]
panic: audit_syscall_enter: td->td_ar != NULL
cpuid = 0
time = 1612265481
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0093bb59d0
vpanic() at vpanic+0x188/frame 0xfffffe0093bb5a20
panic() at panic+0x43/frame 0xfffffe0093bb5a80
audit_syscall_enter() at audit_syscall_enter+0x1b7/frame 0xfffffe0093bb5ac0
amd64_syscall() at amd64_syscall+0x702/frame 0xfffffe0093bb5bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0093bb5bf0
--- syscall (3, FreeBSD ELF64, sys_read), rip = 0x8003e556a, rsp = 0x7fffffffca48, rbp = 0x7fffffffca60 ---
KDB: enter: panic
[ thread pid 809 tid 100093 ]
Stopped at      kdb_enter+0x37: movq    $0,0x10a084e(%rip)

ping? I'm not sure we can make it possible to run tests in parallel, so the is_exclusive needs to stay.

Call atexit() to make it easier to run outside of kyua

I think we can commit this. But please open a bug for the panic you discovered. And it might be a good idea to add a comment to the Makefile explaining the necessity for is_exclusive.

This revision is now accepted and ready to land.Feb 18 2021, 12:27 AM