Details

Reviewers

cperciva
mav
chuck

Summary

Users of nvme_completion_poll are all in the initialization path. Most
of the commands they queue and wait for finish quickly, as they involve
no I/O to the drive's media. These commands finish much faster than a
single tick, on the order of a few microseconds. Adaptively polling for
the first tick allows us to return much earlier than we would
otherwise. The cumulative effect of not waiting until the next tick to
re-poll the condition is impressive (~80 of 100 ms saved).

Use the same technique when waiting for RDY state transitions. Those
also complete quickly, and we have to wait for a couple of
them. This saves the rest (~20 ms).

This eliminates almost 100ms of delay on boot in cperciva's EC2 test
harness and makes nvme disappear from the flame graph of boot times.

Tested by: cperciva
Sponsored by: Netflix
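
To illustrate the effect described in the summary, here is a minimal userland sketch that compares re-polling once per tick with polling finely during the first tick. It is not the driver code: the virtual clock, the names `wait_per_tick` and `wait_adaptive`, the 1 ms tick (hz = 1000) and the 1 us spin step are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_US 1000u   /* assumption: hz = 1000, so one tick is 1 ms */

/* Hypothetical simulated command: completes at a fixed virtual time. */
struct sim {
    uint64_t now_us;      /* virtual clock */
    uint64_t done_at_us;  /* time at which completion becomes visible */
};

static bool sim_done(const struct sim *s)
{
    return s->now_us >= s->done_at_us;
}

/* Re-check the condition only once per tick; returns the time we noticed. */
static uint64_t wait_per_tick(struct sim *s, uint64_t timeout_us)
{
    while (!sim_done(s) && s->now_us < timeout_us)
        s->now_us += TICK_US;
    return s->now_us;
}

/* Poll in 1 us steps during the first tick, then fall back to ticks. */
static uint64_t wait_adaptive(struct sim *s, uint64_t timeout_us)
{
    assert(timeout_us >= TICK_US);
    while (!sim_done(s) && s->now_us < TICK_US)
        s->now_us += 1;             /* stand-in for a short busy-wait */
    while (!sim_done(s) && s->now_us < timeout_us)
        s->now_us += TICK_US;       /* stand-in for a one-tick sleep */
    return s->now_us;
}
```

In this model a command that completes after 5 us is noticed at 1000 us by the per-tick loop but at 5 us by the adaptive loop, while a slow command (2500 us) costs the same either way.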

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 41902
Build 38790: arc lint + arc unit

Event Timeline

imp created this revision. Oct 1 2021, 5:33 PM

Herald added a subscriber: dab. Oct 1 2021, 5:33 PM

imp requested review of this revision. Oct 1 2021, 5:33 PM

Harbormaster completed remote builds in B41891: Diff 96087. Oct 1 2021, 5:33 PM

more spin

Harbormaster completed remote builds in B41892: Diff 96088. Oct 1 2021, 5:38 PM

Also do wait_for_ready

Harbormaster completed remote builds in B41896: Diff 96092. Oct 1 2021, 5:42 PM

imp retitled this revision from "adaptive spin" to "nvme: Use adaptive spinning when polling for completion or state change". Oct 1 2021, 8:18 PM

imp edited the summary of this revision.

imp added reviewers: cperciva, mav, chuck.

imp added inline comments. Oct 1 2021, 8:21 PM

sys/dev/nvme/nvme_ctrlr.c
266	TICKS_2_US here too
sys/dev/nvme/nvme_private.h
476	TICKS_2_US here

use TICKS_2_USEC

Harbormaster completed remote builds in B41902: Diff 96105. Oct 1 2021, 8:23 PM

It looks OK, just in the first chunk instead of sanity2 I'd call it timeout2 or something else.

But I think it could look better with pause_sbt() with the interval doubled every call instead of DELAY(). I haven't benchmarked pause_sbt() recently, but when it was developed years ago it was able to sleep for as little as a few microseconds.

implement mav@'s suggestion. Add some basic profiling. Used it to determine that
1.5 is a better scale factor than 2. Most of these commands take between 30us
and 200us, and there's little variance within a drive, but from vendor to vendor
the variance is much larger.
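
The backoff idea discussed here can be sketched in userland as follows. This is a hedged illustration, not the committed driver code: the thread says the real change sleeps with pause_sbt(), whereas this sketch advances a virtual clock. The names `sim_cmd`, `sim_sleep` and `poll_backoff`, the 1 us starting interval and the integer rounding of the 1.5 factor are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simulated command with a virtual clock. */
struct sim_cmd {
    uint64_t now_us;      /* virtual clock */
    uint64_t done_at_us;  /* time at which completion becomes visible */
    unsigned polls;       /* how many times the condition was checked */
};

static bool sim_done(struct sim_cmd *c)
{
    c->polls++;
    return c->now_us >= c->done_at_us;
}

/* Stand-in for a short sleep: only advances the virtual clock. */
static void sim_sleep(struct sim_cmd *c, uint64_t us)
{
    c->now_us += us;
}

/* Poll with a sleep interval that grows by roughly 1.5x per iteration. */
static bool poll_backoff(struct sim_cmd *c, uint64_t timeout_us)
{
    uint64_t delay_us = 1;          /* assumed starting interval */

    assert(timeout_us > 0);
    while (!sim_done(c)) {
        if (c->now_us >= timeout_us)
            return false;           /* gave up waiting */
        sim_sleep(c, delay_us);
        /* Round up so that 1 us grows to 2 us instead of staying at 1. */
        delay_us += (delay_us + 1) / 2;
    }
    return true;
}
```

In this model the intervals run 1, 2, 3, 5, 8, 12, 18, 27, 41 us, so a command that completes after 100 us is noticed at 117 us on the tenth check. A smaller scale factor trades a few extra checks for less overshoot past the completion time.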

Harbormaster completed remote builds in B41908: Diff 96116. Oct 1 2021, 9:42 PM

It looks good to me. I'd definitely remove the NVME_MEASURE_WAIT blocks after initial testing to not pollute the code. Also SBT_1US in the precision field is not really needed, since C_PREL(1) will be bigger after the first couple of iterations, so 0 would work just fine too.

more tweaks

Harbormaster completed remote builds in B41913: Diff 96124. Oct 2 2021, 12:01 AM

In D32259#728245, @mav wrote:

It looks good to me. I'd definitely remove the NVME_MEASURE_WAIT blocks after initial testing to not pollute the code. Also SBT_1US in the precision field is not really needed, since C_PREL(1) will be bigger after the first couple of iterations, so 0 would work just fine too.

Crazy question: Why not just do 5 or 10 microsecond sleeps and not worry about scaling? What does this scaling buy us?

A data point: most commands take between 20 and 200 microseconds. The RDY waiting takes between 5 microseconds and 1.3 seconds (for what I hope is pre-release hardware I got from a vendor).

In D32259#728249, @imp wrote:

Crazy question: Why not just do 5 or 10 microsecond sleeps and not worry about scaling? What does this scaling buy us?

If we have to wait for the full 10 seconds with 5 microsecond sleeps, it may end up not too different from the original tight loop in terms of CPU usage and power consumption. This looks like a good compromise to me.
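
A rough way to see the trade-off in this reply is to count wakeups over a full 10 second wait. The sketch below is a back-of-the-envelope model, not a measurement: the helper names, the 1 us starting interval and the rounded 1.5x growth are assumptions, and it ignores sleep overhead and timer precision.

```c
#include <assert.h>
#include <stdint.h>

/* Wakeups needed to cover total_us with a fixed sleep of step_us. */
static uint64_t wakeups_fixed(uint64_t total_us, uint64_t step_us)
{
    assert(step_us > 0);
    return (total_us + step_us - 1) / step_us;
}

/* Wakeups when the sleep starts at 1 us and grows by about 1.5x each time. */
static uint64_t wakeups_scaled(uint64_t total_us)
{
    uint64_t slept_us = 0, delay_us = 1, n = 0;

    while (slept_us < total_us) {
        slept_us += delay_us;
        delay_us += (delay_us + 1) / 2;  /* rounded-up 1.5x growth */
        n++;
    }
    return n;
}
```

With fixed 5 us sleeps, `wakeups_fixed(10000000, 5)` gives 2,000,000 wakeups for a 10 second wait, while the scaled schedule covers the same span in a few dozen, at the cost of noticing a late completion less promptly.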

This revision is now accepted and ready to land. Oct 2 2021, 12:37 AM

imp mentioned this in rG83581511d947: nvme: Use adaptive spinning when polling for completion or state change. Oct 2 2021, 1:18 AM

83581511d9476ef5084f47e3cc379be7191ae866 should have closed this.

mav mentioned this in rG86721e606c51: nvme: Use adaptive spinning when polling for completion or state change. Jan 21 2022, 2:28 AM

nvme: Use adaptive spinning when polling for completion or state change
ClosedPublic

Revision Contents
Changeset List

Diff 96105

sys/dev/nvme/nvme_ctrlr.c

sys/dev/nvme/nvme_private.h