Stop ganging due to past vdev write errors

Description

Problem

While examining a customer's system we noticed unreasonable space
usage from a few snapshots due to gang blocks. On further analysis
we discovered that the pool was creating gang blocks because all of
its disks had non-zero write error counts, so they were skipped for
normal metaslab allocations by the following if-clause in
metaslab_alloc_dva():

	/*
	 * Avoid writing single-copy data to a failing,
	 * non-redundant vdev, unless we've already tried all
	 * other vdevs.
	 */
	if ((vd->vdev_stat.vs_write_errors > 0 ||
	    vd->vdev_state < VDEV_STATE_HEALTHY) &&
	    d == 0 && !try_hard && vd->vdev_children == 0) {
		metaslab_trace_add(zal, mg, NULL, psize, d,
		    TRACE_VDEV_ERROR, allocator);
		goto next;
	}

Proposed Solution

Get rid of the predicate in the if-clause that checks the past
write errors of the selected vdev. We still try to allocate from
HEALTHY vdevs anyway by checking vdev_state, so the past write
errors don't seem to help us (quite the opposite: they can cause
issues in long-lived pools like the one from our customer).
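
To make the effect of dropping that predicate concrete, here is a minimal
standalone model of the skip condition before and after the change. It is
a sketch, not the kernel code: the model_* types and functions are
hypothetical stand-ins for the vdev fields used by the quoted if-clause,
and the only thing assumed about vdev states is that unhealthy states
compare lower than HEALTHY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-ins for the ZFS types used by the
 * quoted if-clause. Only the ordering "degraded < healthy" matters.
 */
enum model_state {
	MODEL_STATE_DEGRADED,
	MODEL_STATE_HEALTHY
};

struct model_vdev {
	uint64_t write_errors;	/* models vd->vdev_stat.vs_write_errors */
	enum model_state state;	/* models vd->vdev_state */
	uint64_t children;	/* models vd->vdev_children */
};

/* The skip condition as quoted in the Problem section. */
bool
model_skip_before(const struct model_vdev *vd, int d, bool try_hard)
{
	return ((vd->write_errors > 0 ||
	    vd->state < MODEL_STATE_HEALTHY) &&
	    d == 0 && !try_hard && vd->children == 0);
}

/* The same condition with the past-write-errors predicate removed. */
bool
model_skip_after(const struct model_vdev *vd, int d, bool try_hard)
{
	return (vd->state < MODEL_STATE_HEALTHY &&
	    d == 0 && !try_hard && vd->children == 0);
}

/* A healthy leaf vdev that has one past write error on record. */
void
model_demo(void)
{
	struct model_vdev leaf = { 1, MODEL_STATE_HEALTHY, 0 };

	/* Before: skipped for the first DVA unless try_hard is set. */
	assert(model_skip_before(&leaf, 0, false));
	/* After: eligible for normal allocation again. */
	assert(!model_skip_after(&leaf, 0, false));
}
```

In this model a leaf vdev that is below HEALTHY is still skipped by both
versions, which matches the intent of continuing to prefer HEALTHY vdevs.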

Testing

I first created a pool with 3 vdevs:

$ zpool list -v volpool
NAME        SIZE  ALLOC   FREE
volpool    22.5G   117M  22.4G
  xvdb     7.99G  40.2M  7.46G
  xvdc     7.99G  39.1M  7.46G
  xvdd     7.99G  37.8M  7.46G

Then I used zinject like so on each one of them:

$ sudo zinject -d xvdb -e io -T write -f 0.1 volpool

And got the vdevs to the following state:

$ zpool status volpool
  pool: volpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...<cropped>..
action: Determine if the device needs to be replaced, and clear the
...<cropped>..
config:

	NAME        STATE     READ WRITE CKSUM
	volpool     ONLINE       0     0     0
	  xvdb      ONLINE       0     1     0
	  xvdc      ONLINE       0     1     0
	  xvdd      ONLINE       0     4     0

I also double-checked their write error counters with sdb:

sdb> spa volpool | vdev | member vdev_stat.vs_write_errors
(uint64_t)0  # <---- this is the root vdev
(uint64_t)2
(uint64_t)1
(uint64_t)1

Then I confirmed that the problem was reproduced in my VM, as the
ganged count reported by zdb kept growing while I was writing more data:

$ sudo zdb volpool | grep gang
        ganged count:              1384

$ sudo zdb volpool | grep gang
        ganged count:              1393

$ sudo zdb volpool | grep gang
        ganged count:              1402

$ sudo zdb volpool | grep gang
        ganged count:              1414

Then I updated my bits with this patch and the gang count stayed the
same.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14003

Details

Provenance
Serapheim Dimitropoulos <serapheim@delphix.com>, Authored on Oct 11 2022, 7:27 PM
Brian Behlendorf <behlendorf1@llnl.gov>, Committed on Nov 1 2022, 7:36 PM
Parents
rG25096e11800a: zvol_wait logic may terminate prematurely

Event Timeline

Brian Behlendorf <behlendorf1@llnl.gov> committed rG37d5a3e04b7b: Stop ganging due to past vdev write errors (authored by Serapheim Dimitropoulos <serapheim@delphix.com>). Nov 1 2022, 7:36 PM