Page MenuHomeFreeBSD

Add EINTEGRITY as new error number
ClosedPublic

Authored by mckusick on Jan 7 2019, 2:11 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 18, 1:01 PM
Unknown Object (File)
Sun, Jan 12, 7:21 PM
Unknown Object (File)
Sat, Jan 4, 11:36 AM
Unknown Object (File)
Fri, Jan 3, 4:41 AM
Unknown Object (File)
Dec 25 2024, 1:20 AM
Unknown Object (File)
Dec 22 2024, 6:14 AM
Unknown Object (File)
Dec 21 2024, 5:39 PM
Unknown Object (File)
Dec 4 2024, 4:25 PM

Details

Summary

This proposal is to create a new EINTEGRITY error number. EINTEGRITY's purpose is to respond to errors where an integrity check such as a check-hash or a cross-correlation has failed. For example, an uncorrectable error has been found in a filesystem or a disk subsystem such as gmirror(8) or graid3(8). These changes make no use of the new error, they just add it. Later proposals will be made for the use of the new error number and it will be added to manual pages as appropriate.

As a side effect, this Phabricator thread provides a template for the rather large set files that must be updated to add a new error number. The four primary files that need to be updated are:

lib/libc/sys/intro.2 - the high-level description of the error
sys/sys/errno.h - the number assigned to the error
lib/libc/gen/errlst.c - the error string assigned to the error
lib/libc/nls/C.msg - template for translations of the error string

Additionally these other subsystems need to be updated:

cddl/lib/libdtrace/errno.d - the DTrace description of the error

contrib/libc++/include/__errc - C++ library number and error string
contrib/libc++/include/errno.h

contrib/openbsm/libbsm/bsm_errno.c - OpenBSM uses of the error
contrib/openbsm/sys/bsm/audit_errno.h
sys/bsm/audit_errno.h
sys/security/audit/bsm_errno.c

stand/liblua/lerrno.c - Lua use of the error

sys/compat/linux/linux_errno.inc - Linux compatibility for the error

contrib/libxo/tests/gettext/strerror.pot - libxo definitions for the error
contrib/libxo/tests/gettext/po/pig_latin/strerror.po

sys/compat/cloudabi/cloudabi_errno.c

Note that FileXX has been added to the front of the path names in the diffs below to force the ordering by area rather than the usual alphabetical by pathname.

Test Plan

gnn - please review DTrace changes in cddl/lib/libdtrace/errno.d
dim - please review C++ changes in contrib/libc++/include/__errc and contrib/libc++/include/errno.h
brueffer - please review changes in OpenBSM files contrib/openbsm/libbsm/bsm_errno.c,

contrib/openbsm/sys/bsm/audit_errno.h, sys/bsm/audit_errno.h, and sys/security/audit/bsm_errno.c

cem - please review Lua changes in stand/liblua/lerrno.c
emaste - please review Linux compatibility in sys/compat/linux/linux_errno.inc
ed - please verify that error should be mapped rather than added in sys/compat/cloudabi/cloudabi_errno.c

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 21853

Event Timeline

This proposal is to create a new EINTEGRITY error number. EINTEGRITY's purpose is to respond to errors where an integrity check such as a check-hash or a cross-correlation has failed. For example, an uncorrectable error has been found in a filesystem or a disk subsystem such as gmirror(8) or graid3(8).

How do you envision making use of this error code? Does it meaningfully differ from EIO? In the case of gmirror in particular, AFAIK it cannot determine bitrot and relies on the underlying volume to signal a bad block with EIO (which it then restarts on any remaining component, I think).

(I believe we could use a tighter integration between UFS and underlying geoms like gmirror/graid, I'm just not sure how this error number fits in to that.)

Thanks!

I'll take a look at the lua bits soon.

In D18765#400403, @cem wrote:

This proposal is to create a new EINTEGRITY error number. EINTEGRITY's purpose is to respond to errors where an integrity check such as a check-hash or a cross-correlation has failed. For example, an uncorrectable error has been found in a filesystem or a disk subsystem such as gmirror(8) or graid3(8).

How do you envision making use of this error code? Does it meaningfully differ from EIO? In the case of gmirror in particular, AFAIK it cannot determine bitrot and relies on the underlying volume to signal a bad block with EIO (which it then restarts on any remaining component, I think).

(I believe we could use a tighter integration between UFS and underlying geoms like gmirror/graid, I'm just not sure how this error number fits in to that.)

Thanks!

I'll take a look at the lua bits soon.

Mirror might be able to read both sides but find that they have different values. So it would not know which one to return. Here EIO is not really the right answer as the media is fine, it is just the data that is inconsistent. So saying EINTEGRITY gives a hint that a higher level tool might be able to figure out which one was right and correct the wrong one. None of that is done today, but it is conceivable that it could be done. That said, my main intent is to return this error from UFS when it detects inconsistencies or when a check-hash fails.

Just checked the LUA bits, but they are fine. My acceptance is only for that.

File12/stand/liblua/lerrno.c
149

This is good.

Mirror might be able to read both sides but find that they have different values. So it would not know which one to return. Here EIO is not really the right answer as the media is fine, it is just the data that is inconsistent. So saying EINTEGRITY gives a hint that a higher level tool might be able to figure out which one was right and correct the wrong one. None of that is done today, but it is conceivable that it could be done. That said, my main intent is to return this error from UFS when it detects inconsistencies or when a check-hash fails.

From the point of view of an application using the filesystem, there seems to be no difference between EINTEGRITY and EIO. In both cases, the requested operation could not be performed because the storage had an internal error. Whether the error is detected in the hard disk firmware or in UFS is not relevant to the application proper, but is in the domain of system administration. This system administration may be manual or automated (such as by recreating a filesystem containing data that can be replaced), and needs richer information than an error code can provide.

The libc++ changes look OK to me, and I'll take care of sending them upstream.

Mirror might be able to read both sides but find that they have different values. So it would not know which one to return.

It is feasible, but I don't believe any gmirror (or RAID1) mode does that today because of both the read IO amplification effect and the CPU/cache effect. (At a minimum, a mirrored implementation would need to do 2x the read IOs per logical read, and memcmp() the results, which takes additional CPU and probably more importantly, cache.) (And if such a mode were added, I doubt it would get much use.)

It seems like it would be more efficient if instead, the filesystem could use its checksums to detect bad blocks and then initiate automated recovery. Here is what I have in mind:

  1. Step 1 is keeping track of which component produced bad data. We have options; here are two "brainstorm" quality ideas:
    1. This one is somewhat fiddly and doesn't scale to multiple stacked gmirror/graid layers: we need to track the source component for a given bio -> buf in cache. In practice, expecting a single layer is probably fine (and could be enforced programmatically). We already have bio_from in bio and could transfer some form of that data to the associated buf object at g_vfs_done time.
    2. Another option would be to push the B_CKHASH / b_ckhashcalc functionality down to the bio layer, and modify it slightly. Instead of calculating a checksum and storing it in the buf (bio), the callback would be invoked from g_vfs_done and take as input the contents of the _bio_, and optionally, a caller-provided checksum (for non-embedded checksums). Instead of storing the result, it would return true / false — is this bio intact, as evaluated by the checksum, or not. This avoids the layering problem of attempting to store the sub-component in the buf cache. And we can automate recovery while still within GEOM, before bufdone is invoked. I think this option is more compelling, although limits recovery to GEOM volumes. That's probably ok.
  2. Step 2 is actually initiating the recovery. Once g_vfs_done detects a bad read, it needs to signal to the underlying GEOM which region was damaged. This could be something like an ioctl or special bio_attribute bio, expressing the logical byte range of data known to be bad.
  3. GEOMs supporting the interface would, at a minimum, blacklist the bad sector(s) on the given component before completing the ioctl or bio. They would probably also go on to evict, resilver, or quarantine the bad disk, depending on properties of the mirror/group and administrative policy. Once the bad sector is blacklisted successfully and the "start recovery" ioctl or bio completes, GEOM could reissue the BIO_READ to attempt to fetch it from another component (or in the case of RAID5/6 type configurations, possibly recompute it from available data and FEC). If any underlying GEOM does not support the "start recovery" ioctl or bio, it's easy for g_vfs_done to detect that (ENOTTY or EOPNOTSUPP) and stop attempting recovery on that volume (flag of some kind).
  4. I should add, from a filesystem hardening perspective, it is probably useful for non-mirror/raid GEOMs to blacklist detected bad sectors, too! At least they can log something useful. It may make sense to clear that blacklist on overwrite, or to start rejecting write operations, or some other recovery mode determined by administrative policy.

Here EIO is not really the right answer as the media is fine, it is just the data that is inconsistent.

If the data was consistent when it was written to the media and now it isn't, it is probably arguable that the media is not fine :-).

So saying EINTEGRITY gives a hint that a higher level tool might be able to figure out which one was right and correct the wrong one. None of that is done today, but it is conceivable that it could be done. That said, my main intent is to return this error from UFS when it detects inconsistencies or when a check-hash fails.

It seems like we have the capability to detect the right mirror component already, when checksums are available and correctly updated. I think we should go ahead and do that recovery automatically, rather than punting on the problem to the end user. Once automatic recovery avenues are exhausted, it makes sense to pass some error up to the user. I'm just not sure if EINTEGRITY vs EIO really helps the user determine how to recover any better. At that point it seems like the outcome is "get ready to buy replacement disks and recover from backups ASAP" either way.

(Lua lerrno.c change looks fine.)

In D18765#400620, @dim wrote:

The libc++ changes look OK to me, and I'll take care of sending them upstream.

Great, thanks.

Could you also take a look at the sanitizer changes? (contrib/compiler-rt/lib/sanitizer_common/sanitizer_errno.cc and contrib/compiler-rt/lib/sanitizer_common/sanitizer_errno_codes.h)

When a mount fails because of a hash-check failure it is helpful to get "Integrity check failed" rather than "Invalid Argument" back. There are numerous reasons that you can get "Invalid argument" including miss-typed disk or mount-point names and other administrative errors. Knowing that it was due to an integrity check gives a strong indication that a run of fsck is needed. Actually in the case of mount we have the ability to print a more complete message, but the low-level functions in the filesystems need to be able to return something other than EINVAL so that the mount-layer function can know how to formulate its error message.

File05/cddl/lib/libdtrace/errno.d
229

The binding version ought to be 1.13 to match dtrace -V on CURRENT.

Add the E in front of INTEGRITY

File15/contrib/compiler-rt/lib/sanitizer_common/sanitizer_errno_codes.h
31

Looking again at these changes, I don't think they are needed for the compiler-rt parts. Only some very basic errno values are required for the operation of the sanitizers, such as ENOMEM, EINVAL, etc. The EOWNERDEAD value is an exception because it is returned for some mutex related functions.

E.g., I don't think a sanitizer interceptor will ever be required to check against EINTEGRITY, and therefore compiler-rt does not need this particular errno value.

mckusick removed a reviewer: dteske.

Update diffs based on reviewer comments:
gnn - corrections to File05 cddl/lib/libdtrace/errno.d
dim - drop unneeded changes to sanitizer, previously File14 and File15
Drop dteske as reviewer as gnn has reviewed the DTrace change

Still awaiting reviews from:
brueffer on OpenBSM
emaste on CloudABI and Linux compatibility

errno.d looks good to me.

This revision is now accepted and ready to land.Jan 9 2019, 2:24 AM
emaste added inline comments.
File13/sys/compat/linux/linux_errno.inc
145

OK

File20/sys/contrib/cloudabi/cloudabi_types_common.h
154

Interesting, CloudABI has a subset of FreeBSD errnos and maps unhandled ones to ENOSYS. I'm a bit surprised that other errnos aren't coerced to a known, closest corresponding value, as is done in the Linuxulator.

cloudabi_types_common.h claims to be automatically generated - I suspect that CLOUDABI_EINTEGRITY should not exist, and EINTEGRITY should be mapped to EINVAL.

CC @ed

File20/sys/contrib/cloudabi/cloudabi_types_common.h
154

Exactly. Omitting this here and mapping it to CLOUDABI_EIO or something in cloudabi_errno.c makes most sense.

File13/sys/compat/linux/linux_errno.inc
145

Hm, if this can be thrown at runtime rather than just mount time, it seems like EIO is a closer fit. But maybe this is intended purely for mount time detection. (I've been thinking of this as a runtime error code, like EIO, but maybe that's why I'm confused.)

If this is only for mounting filesystems, starting mirror/raid and the like, then the new error makes sense to me, including mapping it to EINVAL in the Linux compatibility layer.

The man page addition in intro(2) could make this a bit clearer.

Mount is just one of the places that can return EINTEGRITY. It can also come back when accessing a file or directory whose meta information fails a check-hash.

It is useful to the application or user to distinguish EINTEGRITY from EIO. With EIO they know that the underlying media has failed and there is really nothing they can do. With EINTEGRITY the individual file has become corrupted, but if they still have its contents, they can remove it and rewrite it as the filesystem media itself is still working. Alternatively they can request the system administrator to run a check of the filesystem to attempt to clean up the file.

Mount is just one of the places that can return EINTEGRITY. It can also come back when accessing a file or directory whose meta information fails a check-hash.

Ah, ok. For that overall use, I think EIO would be more appropriate than EINVAL for Linuxemul.

It is useful to the application or user to distinguish EINTEGRITY from EIO. With EIO they know that the underlying media has failed and there is really nothing they can do. With EINTEGRITY the individual file has become corrupted, but if they still have its contents, they can remove it and rewrite it as the filesystem media itself is still working. Alternatively they can request the system administrator to run a check of the filesystem to attempt to clean up the file.

I'm not sure this is a good model of drives. Individual sectors reads can throw EIO with basically the same properties as-described for EINTEGRITY (corrupt reads). The rest of the media may appears to work fine. (And for HDD media, that logical sector may resume "working" after being rewritten, via transparent remapping.)

Either case is a relatively strong signal that the disk is failing or going to fail soon, and if it is surplus to the RAID or mirrorset, should probably be evicted and replaced. (And if it isn't surplus, hope you had backups… and it should still be replaced.)

It may still be interesting to users that their HDD corrupted a sector without detection, because that is usually a worse failure mode than detected corruption (EIO)! But I don't think a sysadmin is going to swoop in and repair the filesystem with a hex editor and tweezers; they're going to replace a failing disk and resilver.

mckusick edited the summary of this revision. (Show Details)
mckusick edited the test plan for this revision. (Show Details)
mckusick added a reviewer: ed.

Based on feedback from Ed Maste and Ed Schouten do not add EINTEGRITY to CloudABI. Rather map it to EINVAL as is done in Linux compatibility.

This revision now requires review to proceed.Jan 10 2019, 12:13 AM

At a minimum EINTEGRITY is useful for mount since it allows the mount command to notify the user that the mount failed because of an integrity error which very likely can be fixed with a run of fsck versus an I/O error which probably cannot.

Lua bits look good

At a minimum EINTEGRITY is useful for mount since it allows the mount command to notify the user that the mount failed because of an integrity error which very likely can be fixed with a run of fsck versus an I/O error which probably cannot.

We should make mount return a different exit value for this perhaps... We could use that to automate a running of fsck in some cases.

EINTEGRITY means the end-to-end notion of what should be there isn't. This is similar to EIO (can't read the sector). The integrity error, though, means that some can't fail sanity check failed, which is a larger class of errors.

This revision is now accepted and ready to land.Jan 10 2019, 4:25 AM
In D18765#401317, @imp wrote:

At a minimum EINTEGRITY is useful for mount since it allows the mount command to notify the user that the mount failed because of an integrity error which very likely can be fixed with a run of fsck versus an I/O error which probably cannot.

We should make mount return a different exit value for this perhaps... We could use that to automate a running of fsck in some cases.

EINTEGRITY means the end-to-end notion of what should be there isn't. This is similar to EIO (can't read the sector). The integrity error, though, means that some can't fail sanity check failed, which is a larger class of errors.

Thank-you. Much more eloquent description of why we need EINTEGRITY.

audit and OpenBSM changes look good to me; I'll make sure they'll be applied upstream as well.

Update the intro(2) manual page to better reflect the meaning as suggested by imp, cem, etal.

All the pieces have now been signed off on with promises to have the libc++ and OpenBSM pieces upstreamed. So once description is agreed on this should be ready to go.

This revision now requires review to proceed.Jan 11 2019, 11:11 PM
This revision is now accepted and ready to land.Jan 16 2019, 5:49 PM
This revision was automatically updated to reflect the committed changes.

Review limited to intro.2.

head/lib/libc/sys/intro.2
477 ↗(On Diff #52915)

Unless you mean the integer values (which doesn't look consistent with the declarations in source files), I would toss "severity" somewhere in there for clarity, perhaps "The severity of" or "falls in severity". (Or if you meant neither, it needs clarifiying too.)

head/sys/sys/errno.h
187 ↗(On Diff #52915)

Since you're touching this. (If too long, maybe "same as" or "Must be largest".)

See suggested clarification.

head/lib/libc/sys/intro.2
477 ↗(On Diff #52915)

Does this change clarify it for you?

The integrity error falls in the kernel I/O stack between

(Any change based on my comments probably should be in a followup commit or review.)

head/lib/libc/sys/intro.2
477 ↗(On Diff #52915)

It does, because it makes it clear that you were speaking of which IO stack components can generate it, not its severity as I thought. I'm not familiar enough with the intro.2 audience to say whether it should also include hints or pointers about its severity or callers' expected handling of it, so I'll leave that decision to you.

Since the change is such a minor edit, I will make it without starting a new Phabricator review.
Thanks for pointing out the need for clarification.