HomeFreeBSD

OpenZFS 8909 - 8585 can cause a use-after-free kernel panic

Description

OpenZFS 8909 - 8585 can cause a use-after-free kernel panic

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

PROBLEM

There's a race condition that exists if zil_free_lwb races with either
zil_commit_waiter_timeout and/or zil_lwb_flush_vdevs_done.

Here's an example panic due to this bug:

> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only

> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()

If there's an outstanding lwb that's in zil_commit_waiter_timeout
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call zil_free_lwb. If we end up calling zil_free_lwb, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the zil_commit_waiter_t structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the LWB_STATE_ISSUED state, and zil_free_lwb is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
zil_free_lwb, which will result in a use-after-free situation when the
lwb's zio completes, and zil_lwb_flush_vdevs_done is called.

This race condition is prevented in zil_close by calling zil_commit
before zil_free_lwb is called, which will ensure all outstanding (i.e.
all lwb's in the LWB_STATE_OPEN and/or LWB_STATE_ISSUED states)
reach the LWB_STATE_DONE state before the lwb's are freed
(zil_commit will not return untill all the lwb's are
LWB_STATE_DONE).

Further, this race condition is prevented in zil_sync by only calling
zil_free_lwb for lwb's that do not have their lwb_buf pointer set.
All lwb's not in the LWB_STATE_DONE state will have a non-null value
for this pointer; the pointer is only cleared in
zil_lwb_flush_vdevs_done, at which point the lwb's state will be
changed to LWB_STATE_DONE.

This race *is* present in zil_suspend, leading to this bug.

At first glance, it would appear as though this would not be true
because zil_suspend will call zil_commit, just like zil_close, but
the problem is that zil_suspend will set the zilog's zl_suspend
field prior to calling zil_commit. Further, in zil_commit, if
zl_suspend is set, zil_commit will take a special branch of logic
and use txg_wait_synced instead of performing the normal zil_commit
logic.

This call to txg_wait_synced might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the LWB_STATE_DONE state before it returns.
This is because, if there's an lwb "stuck" in
zil_commit_waiter_timeout, waiting for it's lwb to timeout, it will
maintain a non-null value for it's lwb_buf field and thus zil_sync
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.

So, after zil_commit is called from zil_suspend, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used zil_commit_waiter_timeout.

SOLUTION

The solution to this, is to ensure all outstanding lwb's complete before
calling zil_free_lwb via zil_destroy in zil_suspend. This patch
accomplishes this goal by forcing the normal zil_commit logic when
called from zil_sync.

Now, zil_suspend will call zil_commit_impl which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when zil_commit_impl is called will be
guaranteed to reach the LWB_STATE_DONE state by the time it returns.

Further, no new lwb's will be created via zil_commit since the zilog's
zl_suspend flag will be set. This will force all new callers of
zil_commit to use txg_wait_synced instead of creating and issuing
new lwb's.

Thus, all lwb's left on the zilog's lwb list when zil_destroy is
called will be in the LWB_STATE_DONE state, and we'll avoid this race
condition.

OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940

Details

Provenance
Prakash Surya <prakash.surya@delphix.com>Authored on Dec 7 2017, 7:26 PM
Brian Behlendorf <behlendorf1@llnl.gov>Committed on Dec 28 2017, 6:18 PM
Parents
rG823d48bfb182: Call commit callbacks from the tail of the list
Branches
Unknown
Tags
Unknown

Event Timeline

Brian Behlendorf <behlendorf1@llnl.gov> committed rG2fe61a7ecc50: OpenZFS 8909 - 8585 can cause a use-after-free kernel panic (authored by Prakash Surya <prakash.surya@delphix.com>).Dec 28 2017, 6:18 PM