Register a "block resize" callback to be notified of changes to the
backing storage for the Namespace. Use this to generate an Asynchronous
Event Notification, Namespace Attributes Changed when the guest OS
provides an Asynchronous Event Request.
Details
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 42762 Build 39650: arc lint + arc unit
Event Timeline
The nvmecontrol of FreeBSD guest can output the changed size. but gpart still reports the original size. how can we let kernel reload this change? nvme reset seems no a fix.
root@bsd16:~ # nvmecontrol devlist nvme0: bhyve-NVMe nvme0ns1 (120GB) nvme1: bhyve-NVMe nvme1ns1 (120GB) nvme2: bhyve-NVMe nvme2ns1 (100TB) root@bsd16:~ # diskinfo nvd2 nvd2 512 21474836480 41943040 0 0
When guest boots, nvd2 is 20GB.
Also try debian linux.
here is result
root@deb15:~# nvme list Node SN Model Namespace Usage Format FW Rev ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- -------- /dev/nvme0n1 QNVMe001 bhyve-NVMe 1 85.90 GB / 85.90 GB 512 B + 0 B 1.0 /dev/nvme1n1 QNVMe002 bhyve-NVMe 1 109.95 TB / 109.95 TB 512 B + 0 B 1.0 root@deb15:~# fdisk /dev/nvme1n1 Welcome to fdisk (util-linux 2.36.1). Changes will remain in memory only, until you decide to write them. Be careful before using the write command. Device does not contain a recognized partition table. Created a new DOS disklabel with disk identifier 0xd42a43b0. Command (m for help): p Disk /dev/nvme1n1: 40 GiB, 42949672960 bytes, 83886080 sectors Disk model: bhyve-NVMe Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0xd42a43b0
fdisk cannot report new size ether.
Hmmm, I have patches, I think, that propigate the new size of a namespace as well as a dozen other issues with changing things around dynamically... I'll dust them off and put them up for review and link them here.
I took a quick look yesterday, and my guess was nvd (and perhaps nda) needed to call disk_resize()
Yea, they need to be notified when the size changes, though, and only the nvme driver knows that.
There's a nvme_sim_ns_change callback that's currently registered for nda. This is the same as nvd_new_disk,
but I'm not sure that nvd_new_disk handles the change case well.
I'll have to thread this through. My current ns enhancement work is incomplete and doesn't handle this deatail.
https://reviews.freebsd.org/D32963 is the current state of things, but I don't see any calls to disk_resize() in it.
scsi and ata disks work,but nvme doesn't. I've not worked through all the details of that review, btw, since it
is a work in progress that had other things take priority over and I've not returned to it
Neat! I will defer to Chuck and Warner on reviewing the NVMe specific behavior, but I think it looks fine (and I'm happy the resize callback works well for this without needing refactoring).
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
2984 | Is the NSID always 1? |
Yea, they need to be notified when the size changes, though, and only the nvme driver knows that.
There's a nvme_sim_ns_change callback that's currently registered for nda. This is the same as nvd_new_disk,
but I'm not sure that nvd_new_disk handles the change case well.
I'll have to thread this through. My current ns enhancement work is incomplete and doesn't handle this deatail.https://reviews.freebsd.org/D32963 is the current state of things, but I don't see any calls to disk_resize() in it.
scsi and ata disks work,but nvme doesn't. I've not worked through all the details of that review, btw, since it
is a work in progress that had other things take priority over and I've not returned to it
after some dig, I found nvd driver will not response disk resize, I use nda to have a try. when host resize the disk, guest panic.
panic: sleepq_add: td 0xfffffe00e228b020 to sleep on wchan 0xffffffff81684b85 with sleeping pro cpuid = 5 time = 1637052864 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e21e7b00 vpanic() at vpanic+0x181/frame 0xfffffe00e21e7b50 panic() at panic+0x43/frame 0xfffffe00e21e7bb0 sleepq_add() at sleepq_add+0x359/frame 0xfffffe00e21e7c00 _sleep() at _sleep+0x1f5/frame 0xfffffe00e21e7ca0 pause_sbt() at pause_sbt+0xfe/frame 0xfffffe00e21e7cd0 nvme_ns_construct() at nvme_ns_construct+0xd3/frame 0xfffffe00e21e7d70 nvme_notify_ns() at nvme_notify_ns+0x50/frame 0xfffffe00e21e7da0 nvme_ctrlr_async_event_log_page_cb() at nvme_ctrlr_async_event_log_page_cb+0xf6/frame 0xfffffe0 nvme_qpair_complete_tracker() at nvme_qpair_complete_tracker+0x20f/frame 0xfffffe00e21e7e10 nvme_qpair_process_completions() at nvme_qpair_process_completions+0x19f/frame 0xfffffe00e21e7e ithread_loop() at ithread_loop+0x279/frame 0xfffffe00e21e7ef0 fork_exit() at fork_exit+0x80/frame 0xfffffe00e21e7f30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e21e7f30 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- KDB: enter: panic [ thread pid 12 tid 100098 ] Stopped at kdb_enter+0x37: movq $0,0x1276e2e(%rip)
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
2984 | Yes, NSID will always be 0x1 as the emulation only supports a single namespace. |
So if I take these patches, how do I change the namespace size in bhyve? I can fix nvd and nda
Apologies, I should have put something about how to use this.
- Create a file to back the NVMe device (e.g., truncate -s 1G /vms/nvme-aen/disk1.img)
- Start the VM
- Increase the size of the file backing the NVMe device (e.g., truncate -s +1G /vms/nvme-aen/disk1.img)
After increasing the file size, you should seem something like the following on the console:
nvme0: async event occurred (type 0x2, info 0x00, page 0x04)
I have made a quick patch for nvd, after nvme reconstruct the namespace, nvd will call disk_resize().
please take a look at D33028
panic: sleepq_add: td 0xfffffe00e228b020 to sleep on wchan 0xffffffff81684b85 with sleeping pro cpuid = 5 time = 1637052864 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e21e7b00 vpanic() at vpanic+0x181/frame 0xfffffe00e21e7b50 panic() at panic+0x43/frame 0xfffffe00e21e7bb0 sleepq_add() at sleepq_add+0x359/frame 0xfffffe00e21e7c00 _sleep() at _sleep+0x1f5/frame 0xfffffe00e21e7ca0 pause_sbt() at pause_sbt+0xfe/frame 0xfffffe00e21e7cd0 nvme_ns_construct() at nvme_ns_construct+0xd3/frame 0xfffffe00e21e7d70 nvme_notify_ns() at nvme_notify_ns+0x50/frame 0xfffffe00e21e7da0 nvme_ctrlr_async_event_log_page_cb() at nvme_ctrlr_async_event_log_page_cb+0xf6/frame 0xfffffe0 nvme_qpair_complete_tracker() at nvme_qpair_complete_tracker+0x20f/frame 0xfffffe00e21e7e10 nvme_qpair_process_completions() at nvme_qpair_process_completions+0x19f/frame 0xfffffe00e21e7e ithread_loop() at ithread_loop+0x279/frame 0xfffffe00e21e7ef0 fork_exit() at fork_exit+0x80/frame 0xfffffe00e21e7f30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e21e7f30 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- KDB: enter: panic [ thread pid 12 tid 100098 ] Stopped at kdb_enter+0x37: movq $0,0x1276e2e(%rip)
nvme_ctrlr_async_event_log_page_cb() is called by nvme_qpair_process_completions()
which is a interrupt thread, this thread mark as sleeping prohibited. if we query device and update namesapce here, it will cause a kernel panic.
because query device namespace wll call nvme_complete_poll(),which has a pause_sbt()
Yes. Normally we only query namespaces while we're starting up (hence the poll complete). We
should let this be event driven instead.
Yes. Normally we only query namespaces while we're starting up (hence the poll complete). We
should let this be event driven instead.
Here is my initial thinking, create a thread which listening kevents, when namespace changed, nvme_ctrlr_async_event_log_page_cb() trigger a event, this thread will query nvme device and update the corresponding namespace data. after that, invoke consumer to call disk_resize();
I don't think we need the namespace info before interrupts are enabled, so I don't think we need a kthread for this. At most, we may need a workqueue to run job...
I don't think we need the namespace info before interrupts are enabled, so I don't think we need a kthread for this. At most, we may need a workqueue to run job...
Yes, you are right.
following your comments, I have figured out how to make it works.
- add an extra thread when controller starts.
- add a new async_event_queue to controller structure.
namespace changed async event trigger -> push into async_event_queue -> this extra thread query device namesapce and update namespace data -> notify consumer to disk_resize()
quick test shows this flow works. is it the right direction?
currently only namesapce follow this path.
I'm in the process of porting this change over to illumos, and it looks like we should also be setting the appropriate bit in the oaes field in nvme_controller_data.
At present, nvme identify does not indicate the capability:
root@bhyvetest:~# nvmeadm -v identify nvme0 nvme0: Identify Controller Controller Capabilities and Features Model: bhyve-NVMe Serial: NVME-4-0 ... Optional Asynchronous Events Supported Namespace Attribute Notices: unsupported Firmware Activation Notices: unsupported Asynchronous Namespace Access Change Notices: unsupported Predictable Latency Event Aggregation: unsupported LBA Status Information Notices: unsupported
Here's the patch I'm testing in illumos:
diff +++ b/usr/src/cmd/bhyve/pci_nvme.c @@ -514,6 +514,11 @@ pci_nvme_init_ctrldata(struct pci_nvme_softc *sc) cd->ver = 0x00010300; cd->oacs = 1 << NVME_CTRLR_DATA_OACS_FORMAT_SHIFT; +#define NVME_CTRLR_DATA_OAES_NSCHANGE_SHIFT 8 + cd->oaes = 1 << NVME_CTRLR_DATA_OAES_NSCHANGE_SHIFT; cd->acl = 2; cd->aerl = 4;
Thanks for catching this. Your change will work just fine, but I'm tempted to add a helper wrapper like the below. Thoughts?
diff --git a/usr.sbin/bhyve/pci_nvme.c b/usr.sbin/bhyve/pci_nvme.c index 8cd3a949f916..5caaa9de5c5a 100644 --- a/usr.sbin/bhyve/pci_nvme.c +++ b/usr.sbin/bhyve/pci_nvme.c @@ -280,6 +280,9 @@ typedef enum { PCI_NVME_AE_INFO_MAX, } pci_nvme_async_info; +#define PCI_NVME_AE_SHIFT 8 +#define PCI_NVME_AE_MASK(event) (1 << (event + PCI_NVME_AE_SHIFT)) + /* Asynchronous Event Notifications */ struct pci_nvme_aen { pci_nvme_async_type atype; @@ -536,6 +539,7 @@ pci_nvme_init_ctrldata(struct pci_nvme_softc *sc) cd->cntrltype = NVME_CNTRLTYPE_IO; cd->oacs = 1 << NVME_CTRLR_DATA_OACS_FORMAT_SHIFT; + cd->oaes = PCI_NVME_AE_MASK(PCI_NVME_AE_INFO_NS_ATTR_CHANGED); cd->acl = 2; cd->aerl = 4; @@ -950,8 +954,7 @@ pci_nvme_aen_process(struct pci_nvme_softc *sc) status = NVME_SC_INTERNAL_DEVICE_ERROR; break; } - mask >>= 8; - if (((1 << aen->event_data) & mask) == 0) + if ((PCI_NVME_AE_MASK(aen->event_data) & mask) == 0) continue; switch (aen->event_data) { case PCI_NVME_AE_INFO_NS_ATTR_CHANGED:
I think the whole set of these (OAES) definitions should probably end up in sys/dev/nvme/nvme.h anyway, like the NVME_CTRLR_DATA_OACS_* ones that are already there. That would just make it analogous to the existing oacs assignment. Linking it to the pci_nvme_async_info in bhyve may make sense too though.