
bhyve: Passthrough host's NVMe device health logpage to guests.
Abandoned, Public

Authored by wanpengqian_gmail.com on Jul 29 2020, 8:49 AM.

Details

Reviewers
grehan
jhb
chuck
Group Reviewers
bhyve
Summary

Currently, the emulated NVMe controller only presents the I/O statistics
portion of the SMART data. Some guest OSs check the health log page to
decide whether the NVMe device is healthy.

This patch passes the host NVMe device's health log page through
to the VM.

Within the guest, nvmecontrol logpage -p 2 nvme0 will output
a copy of the values from the host's NVMe device (configurable on the command line).

For example, here is the output within a guest:

root@smart:~ # nvmecontrol logpage -p 2 nvme0
SMART/Health Information Log
============================
Critical Warning State:         0x00
 Available spare:               0
 Temperature:                   0
 Device reliability:            0
 Read only:                     0
 Volatile memory backup:        0
Temperature:                    324 K, 50.85 C, 123.53 F
Available spare:                100
Available spare threshold:      10
Percentage used:                2
Data units (512,000 byte) read: 359129
Data units written:             14856403
Host read commands:             12535684
Host write commands:            326035273
Controller busy time (minutes): 1484
Power cycles:                   36
Power on hours:                 3667
Unsafe shutdowns:               22
Media errors:                   0
No. error info log entries:     2
Warning Temp Composite Time:    72
Error Temp Composite Time:      67
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   0
Total Time For Temperature 2:   0
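
The health log reports the composite temperature as an integer in Kelvin, and the output above also shows it in Celsius and Fahrenheit. As a rough illustration of that arithmetic (the function names are hypothetical and are not taken from nvmecontrol or from the patch):

```c
#include <stdint.h>

/*
 * Hypothetical helpers illustrating the unit conversion seen in the
 * output above. The log page stores the temperature in Kelvin.
 */
double
kelvin_to_celsius(uint16_t kelvin)
{
	return ((double)kelvin - 273.15);
}

double
kelvin_to_fahrenheit(uint16_t kelvin)
{
	return (kelvin_to_celsius(kelvin) * 9.0 / 5.0 + 32.0);
}
```

With the value 324 K from the sample output, these yield approximately 50.85 C and 123.53 F, matching the line printed by the guest.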
Test Plan

Inside guest, check the logpage output.


Event Timeline

Thank you for determining why some guest OS's (I'm guessing Windows?) believe that the device isn't healthy!
A couple of observations:

  1. If the backing storage for the namespace isn't an NVMe device, what will this code do?
  2. Were you able to determine which fields are important to the OS in question? If so, a better approach might be to fix the missing fields in the current implementation. This would have the added benefit of working with a ZVol or file-based backing storage.

When I use an NVMe device as a cache for an XPEnology guest, the guest reports that the NVMe device is not healthy and cannot be used, because these two fields are 0:

Available spare:                100
Available spare threshold:      10

It also monitors the controller's temperature. After setting these 3 fields, the device can be used.

Were you able to determine which fields are important to the OS in question? If so, a better approach might be to fix the missing fields in the current implementation. This would have the added benefit of working with a ZVol or file-based backing storage.

I had a quick fix for this issue earlier; see bhyve: Initial some NVMe controller health log data.
But I think we can find a better solution.
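
For illustration only, a quick fix of that kind might amount to filling in plausible defaults for the three fields mentioned above. The sketch below uses a hypothetical, simplified structure rather than the real log page layout from nvme(4). The spare values are the ones shown earlier in this thread, and the temperature is an assumed placeholder.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the emulated health log page. */
struct health_log {
	uint8_t		critical_warning;
	uint16_t	temperature;			/* Kelvin */
	uint8_t		available_spare;		/* percent */
	uint8_t		available_spare_threshold;	/* percent */
	uint8_t		percentage_used;		/* percent */
};

/* Fill in defaults so a guest does not see an all-zero health page. */
void
health_log_init_defaults(struct health_log *hl)
{
	memset(hl, 0, sizeof(*hl));
	hl->temperature = 296;			/* assumed: roughly room temperature */
	hl->available_spare = 100;		/* value shown in the thread */
	hl->available_spare_threshold = 10;	/* value shown in the thread */
}
```

This only covers the fields the guest was observed to care about; which other fields matter to other guests is not established here.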

If the backing storage for the namespace isn't an NVMe device, what will this code do?

The ioctl(sc->health_passthrough_ctx, NVME_PASSTHROUGH_CMD, &pt) call will fail, and we don't set anything.

I am thinking that if the user provides a device other than NVMe, we can convert its data: for example, fetch the values from a SATA/SAS device and set them.
(We could also have a patch for the AHCI controller later.)

As for this patch, I can improve it later:

  1. The current code lacks Capsicum support.
  2. If the user provides a SATA/SAS SSD, we can read and convert its data.
  3. If the user provides a filename instead of a raw device, we can read/store the SMART/Health data in it. This should be useful for accounting purposes, since these fields indicate how much data the VM reads/writes. (Currently these values start from zero and are not saved frequently.)
Data units (512,000 byte) read: 359129
Data units written:             14856403
Host read commands:             12535684
Host write commands:            326035273
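
To make the accounting idea concrete: nvmecontrol labels each data unit as 512,000 bytes, so the counters above can be turned into byte totals. The helper below is a hypothetical sketch. It uses a 64-bit value for simplicity, which is enough for the numbers shown here even though the real counters are wider.

```c
#include <stdint.h>

/* One data unit as labelled in the nvmecontrol output: 512,000 bytes. */
#define	DATA_UNIT_BYTES	512000ULL

/* Hypothetical helper: convert a data-unit counter to bytes. */
uint64_t
data_units_to_bytes(uint64_t units)
{
	return (units * DATA_UNIT_BYTES);
}
```

For the values above, 359129 units read is 183,874,048,000 bytes (about 184 GB), and 14856403 units written is 7,606,478,336,000 bytes (about 7.6 TB).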

While this is an interesting approach, the fix you proposed in D24202 is more in line with the goal of emulating an NVMe device, and I'd be more comfortable committing those changes.

Your observation that the data units values restart at zero is very valid. My tentative plan for this is to optionally save the current values when the VM is powered down using the new configuration file format. This saved device state could then be used on a subsequent restart to provide the continuity you describe.

Having written code to read Health data from SCSI, ATA, and NVMe devices and present it in a common format, I experienced first hand the complexities this entails. And even after adding this complexity, it would not cover use cases like file or ZVol backing storage. Instead, it might be better to use pptdevs if the VM requires more realistic data.

...
Your observation that the data units values restart at zero is very valid. My tentative plan for this is to optionally save the current values when the VM is powered down using the new configuration file format. This saved device state could then be used on a subsequent restart to provide the continuity you describe.

Please do not have bhyve storing VM state persistence data in the config file that is written each time on shutdown, it would be cleaner to create a separate persistence store for that. The same mechanism and tooling could be used to implement it, and by using loading order, config then state, any config item can be overloaded by last state value.
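
A minimal sketch of the loading order described here, assuming both stores can be reduced to simple key/value tables. All names and keys below are hypothetical and do not reflect bhyve's actual configuration code. The state store is consulted first, so its last saved value overrides the config value.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value entry shared by the config and state stores. */
struct kv {
	const char	*key;
	const char	*value;
};

/* Linear search; returns NULL when the key is absent. */
const char *
kv_find(const struct kv *tbl, size_t n, const char *key)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(tbl[i].key, key) == 0)
			return (tbl[i].value);
	}
	return (NULL);
}

/*
 * Config is loaded first and state second, so a key present in the
 * state store wins; otherwise the config value is used.
 */
const char *
effective_value(const struct kv *config, size_t nconfig,
    const struct kv *state, size_t nstate, const char *key)
{
	const char *v = kv_find(state, nstate, key);

	return (v != NULL ? v : kv_find(config, nconfig, key));
}
```

Keeping the two tables separate means the config file itself is never rewritten on shutdown; only the state store changes.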

Having written code to read Health data from SCSI, ATA, and NVMe devices and present it in a common format, I experienced first hand the complexities this entails. And even after adding this complexity, it would not cover use cases like file or ZVol backing storage. Instead, it might be better to use pptdevs if the VM requires more realistic data.

I totally agree with your opinion. The question is whether adding this complex code is worth it.

But I have reasons for this.

  1. FreeBSD is a suitable solution for a NAS and VM host, but some users run consumer hardware. In that case VT-D is not available, so we have to provide other solutions for SMART data.
  2. Currently, if we provide a raw disk, the health data is not passed through either.
  3. Mainstream drive types are SATA/SAS/NVMe; if we support these 3 types, we can cover 99% of scenarios.
  4. Instead of fake SMART data, we provide real SMART data to the VM. I know it is best to monitor SMART on the FreeBSD host, but in that case the user must be familiar with FreeBSD and install extra apps/scripts for that. This way the VM can monitor SMART and alert the user too.
  5. For ZVol or file-based disk images, as I mentioned before, we can provide a small binary file to store controller runtime data across reboots/shutdowns.

Nowadays many people are interested in using FreeBSD/FreeNAS for their home servers. As they are not experienced FreeBSD users, they want such features. (They are not native English speakers, so they cannot post questions to the FreeBSD forums or mailing lists.)

Add Capsicum support to the current patch.

I totally agree with your opinion. The question is whether adding this complex code is worth it.

But I have reasons for this.

  1. FreeBSD is a suitable solution for a NAS and VM host, but some users run consumer hardware. In that case VT-D is not available, so we have to provide other solutions for SMART data.

I had not considered using the NVMe emulation as a bridge between the guest and a physical device when VT-D is not available. It's an interesting approach that I would like to think about more. Let me do some research and get back to you.

After thinking about this and talking it over with some of the other bhyve developers, the better approach would be to create a new device model specifically for passing commands and data between a real NVMe device and a guest. The new device model (perhaps pci_nvme_proxy.c) might have a small amount of overlap with the NVMe device model with respect to hooking PCI reads and writes and probably queue creation. But the majority of the functionality would be handled by either nvme(4) or cam(3) requests. Using nvme(4) ioctls as you have is a good model for Admin commands, but you will want to experiment with how best to implement the I/O path. Let me know if you have questions.