Page MenuHomeFreeBSD

Improvement in UFS/FFS directory placement when doing mkdir(2)
ClosedPublic

Authored by mckusick on Mar 24 2023, 6:10 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Nov 8, 1:01 AM
Unknown Object (File)
Mon, Oct 28, 9:32 AM
Unknown Object (File)
Sun, Oct 20, 3:46 AM
Unknown Object (File)
Oct 4 2024, 8:27 PM
Unknown Object (File)
Oct 4 2024, 12:53 PM
Unknown Object (File)
Oct 1 2024, 6:54 PM
Unknown Object (File)
Sep 28 2024, 1:29 PM
Unknown Object (File)
Sep 27 2024, 9:11 PM
Subscribers

Details

Summary

The algorithm for laying out new directories was devised in the 1980s and markedly improved the performance of the filesystem. In those days large disks had at most 100 cylinder groups and often as few as 10-20. Modern multi-terrabyte disks have thousands of cylinder groups. The original algorithm does not handle these large sizes well. This change attempts to expand the scope of the original algorithm to work well with these much larger disks while still retaining the properties of the original algorithm for small disks.

The filesystem implementation is divided into policy routines and implementation routines. The policy routines can be changed in any way desired without risk of corrupting the filesystem. The policy requests are handled by the implementation layer. If the policy asks for an available resource, it is granted. But if it asks for an already in-use resource, then the implementation will provide an available one nearby the request. Thus it is impossible for a policy to double allocate. This change is limited to the policy implementation.

This change updates the ffs_dirpref() routine which is responsible for selecting the cylinder group in which a new directory should be placed. If we are near the root of the filesystem we aim to spread them out as much as possible. As we descend deeper from the root we cluster them closer together around their parent as we expect them to be more closely interactive. Higher-level directories like usr/src/sys and usr/src/bin should be separated while the directories in these areas are more likely to be accessed together so should be closer. And directories within commands or kernel subsystems should be closer still.

We pick a range of cylinder groups around the cylinder group of the directory in which we are being created. The size of the range for our search is based on our depth from the root of our filesystem. We then probe that range based on how many directories are already present. The first new directory is at 1/2 (middle) of the range; the second is in the first 1/4 of the range, then at 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, 3/16, 5/16, etc.

It is necessary to store the depth of a directory in its on-disk inode. We add a new field di_dirdepth to track the depth of each directory. Because there are few spare fields left in the inode, we choose to share an existing field in the inode rather than having one of our own. Specifically we create a union with the di_freelink field. The di_freelink field is used to track inodes that have been unlinked but remain referenced. It is not needed until a rmdir(2) operation has been done on a directory. At that point, the directory has no contents and even if it is kept active as a current directory is no longer able to have any new directories or files created in it. Thus the use of di_dirdepth and di_freelink will never coincide.

Test Plan

Have Peter Holm run his filesystem and fsck_ffs test suite.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

If I mount a filesystem without running fsck on it at all, then the things are undefined? For instance, I often boot my crash boxes single-user and directly mount some partition r/w.

Also, couldn't we calculate the dir depth on lookup, instead of storing it to disk? I do see a complication with nfs server which might avoid lookups, but lookups seems to be a good and almost free opportunity to re-check integrity of the new field.

sys/ufs/ufs/ufs_vnops.c
2128
In D39246#893800, @kib wrote:

If I mount a filesystem without running fsck on it at all, then the things are undefined? For instance, I often boot my crash boxes single-user and directly mount some partition r/w

The field is initialized to zero when the directory is created. When a depth of zero is found in any directory other than UFS_ROOTINO the old directory layout algorithm is used. So unchecked filesystems will just get the old layout behavior (see lines 1292-1297 in ffs_alloc.c above).

Also, couldn't we calculate the dir depth on lookup, instead of storing it to disk? I do see a complication with nfs server which might avoid lookups, but lookups seems to be a good and almost free opportunity to re-check integrity of the new field.

There is no extra cost to storing the depth in space or time (it is always modified when we otherwise need to write the inode). Checking the depth all the way back to the root is a non-trivial cost for deep directories and is a bunch of additional code to maintain.

In D39246#893800, @kib wrote:

Also, couldn't we calculate the dir depth on lookup, instead of storing it to disk? I do see a complication with nfs server which might avoid lookups, but lookups seems to be a good and almost free opportunity to re-check integrity of the new field.

There is no extra cost to storing the depth in space or time (it is always modified when we otherwise need to write the inode). Checking the depth all the way back to the root is a non-trivial cost for deep directories and is a bunch of additional code to maintain.

I do not mean to check it all way back to the root. I mean, if ufs_lookup(dvp) found vp which is directory, then vp.dirdepth must be dvp.dirdepth + 1 or both must be zero.

I do not mean to check it all way back to the root. I mean, if ufs_lookup(dvp) found vp which is directory, then vp.dirdepth must be dvp.dirdepth + 1 or both must be zero.

Are you proposing that dirdepth would be kept in a field in just the incore part of the inode? That makes the inode structure bigger and introduces extra work on every inode load to check and see if it is a directory and if so to calculate its depth. With my current scheme there is no extra space required or time spent on loading inodes. The only extra work is when a directory is created. The depth is just there and can be used as needed. It is not critical that it be correct though of course the layout will be better if it is correct.

I do not mean to check it all way back to the root. I mean, if ufs_lookup(dvp) found vp which is directory, then vp.dirdepth must be dvp.dirdepth + 1 or both must be zero.

Are you proposing that dirdepth would be kept in a field in just the incore part of the inode? That makes the inode structure bigger and introduces extra work on every inode load to check and see if it is a directory and if so to calculate its depth. With my current scheme there is no extra space required or time spent on loading inodes. The only extra work is when a directory is created. The depth is just there and can be used as needed. It is not critical that it be correct though of course the layout will be better if it is correct.

IMO not adding extra deductible data to the on-disk format is good, or if we do add such data, then we should check it dynamically. I think you would go with the dinode field re-use, but then please check that the condition that I formulated, and do the usual action when the broken metadata detected.

IMO not adding extra deductible data to the on-disk format is good, or if we do add such data, then we should check it dynamically. I think you would go with the dinode field re-use, but then please check that the condition that I formulated, and do the usual action when the broken metadata detected.

The obvious place to put the check is in ffs_vgetf() which knows when it has just loaded a new inode. But we do not know the parent directory in ffs_vgetf() and indeed even our caller may not know it as for example in ffs_fhtovp(). I note that ffs_vget() is called from 28 places so it would be a lot of code to check it in all those places (or at least in those that can make the check). I could just add the check in ufs_lookup() although there are four calls to ffs_vget() in that function of which two or three might be sensible to check. And in ufs_lookup() we don't know if ffs_vget() read a new inode or found an existing one in the cache, so we will end up doing a lot more checking than necessary. It just does not seem worth increasing the length of a hot path in the filesystem to marginally improve hint data that is only used in the infrequently called mkdir(2).

The obvious place to put the check is in ffs_vgetf() which knows when it has just loaded a new inode. But we do not know the parent directory in ffs_vgetf() and indeed even our caller may not know it as for example in ffs_fhtovp(). I note that ffs_vget() is called from 28 places so it would be a lot of code to check it in all those places (or at least in those that can make the check). I could just add the check in ufs_lookup() although there are four calls to ffs_vget() in that function of which two or three might be sensible to check. And in ufs_lookup() we don't know if ffs_vget() read a new inode or found an existing one in the cache, so we will end up doing a lot more checking than necessary. It just does not seem worth increasing the length of a hot path in the filesystem to marginally improve hint data that is only used in the infrequently called mkdir(2).

I am only proposing to add the checks where it is feasible to do so, mainly in the lookup, as I mentioned from the beginning. There could be more uses for the directory depth, which is the reason to want the better assurance of the depth data correctness than just a probability. It is useful for similar consistency checks and miscellaneous optimizations. Obvious candidates would be ufs_checkpath(), which can be terminated if we reached lower level than the target.

But ok, lets postpone this till the actual work comes using this feature.

sys/ufs/ffs/ffs_softdep.c
12517

Why was this code block moved? We no longer update i_freelink unless there is inodedep structure for the inode.

Responding to comment.

sys/ufs/ffs/ffs_softdep.c
12517

The previous location was always setting the di_freelink every time that the inode block got updated thus prematurely wiping out di_dirdepth. By moving it down we can only do the update when the file is being unlinked and thus the di_dirdepth is no longer needed. Since unlinking a file will always create an inodedep we are assured that di_freelink will be set when it is needed.

This revision is now accepted and ready to land.Mar 26 2023, 11:28 PM

The concern I have with this change is about user interface. With this patch, if someone renames a directory that is the root of a subtree containing other directories and then unmounts the file system cleanly, fsck will report that the file system is now corrupted because all of the dirdepth values for directories in that subtree are wrong. In fact, every existing file system will immediately be reported as corrupted the first time that the new fsck is run. There are millions of people in the world who know what the fsck output for a clean and uncorrupted UFS file system looks like, and that clean fsck output is very easy to recognize because it's very short and always the same (modulo the line with the last-mounted-on path and the line at the end with the block/inode usage and fragmentation summary). After this patch, fsck can produce millions of lines of output for file systems that aren't really corrupted but rather are merely not optimal, and it will no longer be so simple to tell if a file system is corrupted or not by looking at the fsck output.

I think that fsck should not report incorrect dirdepth values at all, because an incorrect dirdepth value is not considered corruption. It would still be desirable for fsck to set the dirdepth fields to their optimal values, but producing potentially a huge number of lines of output while doing so would obscure the fact that the file system may otherwise be completely consistent, so it would be best if these updates to dirdepth values were done silently. I realize that this changes a different long-standing tradition that fsck reports (and asks about) every individual change that it makes, but prior to this patch, everything that fsck would change represented either a kernel bug or a hardware failure of some kind (ie. a serious problem), whereas a dirdepth value being wrong is a normal situation that the kernel creates intentionally and is thus nothing to worry about.

In D39246#894573, @chs wrote:

The concern I have with this change is about user interface. With this patch, if someone renames a directory that is the root of a subtree containing other directories and then unmounts the file system cleanly, fsck will report that the file system is now corrupted because all of the dirdepth values for directories in that subtree are wrong. In fact, every existing file system will immediately be reported as corrupted the first time that the new fsck is run. There are millions of people in the world who know what the fsck output for a clean and uncorrupted UFS file system looks like, and that clean fsck output is very easy to recognize because it's very short and always the same (modulo the line with the last-mounted-on path and the line at the end with the block/inode usage and fragmentation summary). After this patch, fsck can produce millions of lines of output for file systems that aren't really corrupted but rather are merely not optimal, and it will no longer be so simple to tell if a file system is corrupted or not by looking at the fsck output.

I think that fsck should not report incorrect dirdepth values at all, because an incorrect dirdepth value is not considered corruption. It would still be desirable for fsck to set the dirdepth fields to their optimal values, but producing potentially a huge number of lines of output while doing so would obscure the fact that the file system may otherwise be completely consistent, so it would be best if these updates to dirdepth values were done silently. I realize that this changes a different long-standing tradition that fsck reports (and asks about) every individual change that it makes, but prior to this patch, everything that fsck would change represented either a kernel bug or a hardware failure of some kind (ie. a serious problem), whereas a dirdepth value being wrong is a normal situation that the kernel creates intentionally and is thus nothing to worry about.

You have a very valid point. I addressed this by having just a single message come out when running with -p (preen mode). When encountering a filesystem with no depths it prints `UPDATING FILESYSTEM TO TRACK DIRECTORY DEPTH'' then updates all the directory depths. When running in interactive mode and -y mode it asks once `UPDATE FILESYSTEM TO TRACK DIRECTORY DEPTH?'' then just does it. No further reporting happens in either case.

You are right about a lot of output when a sub-tree gets moved to another depth. My take would be to just silently fix it when running in preen mode but to keep the extended reporting when running in interactive and -y mode. Does that seem like the right solution to you?

You have a very valid point. I addressed this by having just a single message come out when running with -p (preen mode). When encountering a filesystem with no depths it prints `UPDATING FILESYSTEM TO TRACK DIRECTORY DEPTH'' then updates all the directory depths. When running in interactive mode and -y mode it asks once `UPDATE FILESYSTEM TO TRACK DIRECTORY DEPTH?'' then just does it. No further reporting happens in either case.

Oh I see, your diff already does that for the case of dirdepths not initialized, I read right over that part.

You are right about a lot of output when a sub-tree gets moved to another depth. My take would be to just silently fix it when running in preen mode but to keep the extended reporting when running in interactive and -y mode. Does that seem like the right solution to you?

The other case to consider is "fsck -n", and for that mode I don't think we should report all the individual dirdepths as being wrong in that case, since again that would obscure any real corruption or lack thereof. It would be weird for "fsck -y" to report fixing all the individual dirdepths if "fsck -n" didn't report them all individually... it seems like that would lead to questions about whether fsck is really working correctly, so I think "fsck -y" should also not report dirdepth updates individually. Interactive mode would similarly want to parallel the behavior of -y and -n (though I really doubt that anyone except possibly you ever uses interactive mode intentionally). So if you really want to print something in all these non-preen-mode cases where some dirdepths are initialized but not the desired value then you could print a single message for all of them like you do for the dirdepth == 0 case. But I don't think it would ever be useful to print a separate message for every instance of a dirdepth needing to be updated. I suppose you could add a "-v" verbose flag if you want to give people a way to see all the noise, or maybe add it to the output of the existing "-d" debug flag.