Page MenuHomeFreeBSD

Support 64KB cluster size in MSDOSFS
AbandonedPublic

Authored by se on Mar 12 2024, 5:10 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Dec 29, 5:02 AM
Unknown Object (File)
Tue, Dec 10, 2:01 PM
Unknown Object (File)
Tue, Dec 10, 2:01 PM
Unknown Object (File)
Mon, Dec 9, 1:30 PM
Unknown Object (File)
Thu, Dec 5, 11:37 AM
Unknown Object (File)
Dec 2 2024, 5:58 AM
Unknown Object (File)
Nov 30 2024, 9:15 AM
Unknown Object (File)
Nov 18 2024, 9:49 AM
Subscribers

Details

Reviewers
mckusick
kib
imp
Summary

Depending on the file offset, one more memory page may be allocated in the buffer cache than required to hold the size passed in a read request. The reason is that the buffer cache maps sectors to memory pages starting at logical sector 0, but there are file systems that start data blocks at file system offsets that are not a multiple of the page size.
One such file system is MSDOSFS, which consists of a meta data area consisting of any required number of sectors, followed by a data area consisting of clusters of sectors. This data area may start at a sector number that is not a multiple of the number of sectors per cluster. In this case, a request to read a cluster will put the first requested sector of a cluster into the buffer cache memory page at an offset > 0, and in case of 64KB clusters may need to allocate one more memory page than required for 64KB.
This lead to kernel panics when built with INVARIANTS (or internal data corruption without), see PR #277414 which was addressed in commit 7e4ac11b6076e6a9b by returning an error code instead.
This review remaps sectors read from a MSDOSFS to align the start of a cluster with offset 0 of a buffer cache memory page.
A cluster size of 128 will thus fit into 64KB worth of memory pages in the buffer cache.

The patch introduces a new macro (bntocbn) that adds a file system specific offset to the physical block number to obtain a logical block number for indexing of the block in the buffer cache. This offset is calculated when the file system is mounted, based on the first data block of the file system and the PAGE_SIZE.
Besides preventing a crash (or read error after commit 7e4ac11b6076e6a9b), this reduces the amount of pages allocated per cluster by 1 in the general case.

Test Plan

Apply the patch and perform regression tests of the mounted MSDOSFS.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

se requested review of this revision.Mar 12 2024, 5:10 PM
se created this revision.

Update diff to include the new macros breadb() and breadbn() added to sys/buf.h to allow passing of separate physical and buffer cache block number parameters.

This looks somewhat strange to me. Most of the weirdness is in that we have lblkno != blkno for devvp buffers, which I believe we did not have anywhere before.

Then, should the offset for cluster sectors applied to the dblkno argument of breadn_flags() etc, instead of (l)blkno?
And, what would happen to the buffer containing the first data cluster? Wouldn't there be aliasing with the last buffer of the FAT metadata?

In D44319#1011511, @kib wrote:

This looks somewhat strange to me. Most of the weirdness is in that we have lblkno != blkno for devvp buffers, which I believe we did not have anywhere before.

Yes, and I do not know the buffer cache code in sufficient detail to know whether this violates some assumption.
The purpose of this review is to get feedback on whether this simple approach works or whether it leads to inconsistencies in the buffer cache.

If we want to align the buffers for FAT data clusters with VM page boundaries, such a decoupling is required, since the area preceding the data clusters is sized in sectors, not clusters.

Initially, I had considered a different concept, but it would also have lead to an offset between logical and physical sector numbers.
I have not looked at the case of a FAT file system within a partition or slice and the physical sector addresses used in that case, yet.
This may affect the calculation of the offset. I'm going to look at the code and to perform tests.

My understanding is that the logical block number is used as kind of an index into the buffer cache, and the physical sector is used to read the data in, but also in the returned struct buf to allow a dirty buffer to be written out to the correct sectors.

Then, should the offset for cluster sectors applied to the dblkno argument of breadn_flags() etc, instead of (l)blkno?

That was my first concept: add an offset to the sector number obtained in the cluster to sector macro, but subtract that offset again when accessing disk sectors.
This would have required changes to pcbmap() in addition to the introduction of the breadn() macros (to subtract the offset from the physical sector instead of adding it to the logical sector).

My testing of the patches in this review did not show any issues, but I'm not sure whether blocks that are only written to will have buffers without the logical sector offset (possibly leading to aliasing). I plan to write more edge cases and to add traces to the buffer cache page allocation to check for inconsistent parameters.

And, what would happen to the buffer containing the first data cluster? Wouldn't there be aliasing with the last buffer of the FAT metadata?

The first sector is initially read without an offset, since it contains the parameters required to calculate the offset. But the corresponding buffer is immediately released, and subsequent accesses to sector 0 will use the calculated offset. All read accesses use the same offset of 0 to 7 sectors (for the typical case of 512 byte sectors and 4 KB VM pages).

The last FAT sector will precede the first data cluster (unless there are more reserved sectors following the FAT) and will thus be at the end of a buffer cache page, with the first data cluster staring at offset 0 in the next higher buffer cache page.

In general b_lblkno is for the use of the client of the buffer cache to use for whatever purpose they want. The b_blkno is what the underlying hardware is going to use to locate the requested data. The filesystem generally uses its BMAP routine to convert the b_lblkno to the appropriate b_blkno.

As Kostik points out, when referencing filesystem metadata, b_lblkno and b_blkno are usually equal and there is never any translation. But I do not think that there is any intrinsic reason why they have to be the same. So if the filesystem wants to make some alternate use of b_lblkno when working with metadata, I believe that the buffer cache will still work properly. You do need to ensure that b_blkno is always correct any time that the buffer is going to be passed to an underlying device. And you also have to sure that you do not end up getting aliasing issues as already noted.

I suggest you to ask Peter Holm (pho@) to run the stress2 msdosfs tests over this patch.

This revision is now accepted and ready to land.Mar 17 2024, 8:57 PM

D44319.id135654.diff doesn't seem to work for me:

root@mercat1:/usr/src/tools/test/stress2/misc # ./all.sh -onc msdos11.sh
20240318 13:16:17 all: msdos11.sh
** /dev/md10a
** Phase 1 - Read FAT and checking connectivity
** Phase 2 - Checking Directories
** Phase 3 - Checking for Lost Files
Lost cluster chain at cluster 4
1 Cluster(s) lost
Reconnect? yes
No LOST.DIR directory
Clear? yes
Lost cluster chain at cluster 5
1 Cluster(s) lost
Reconnect? yes
No LOST.DIR directory
Clear? yes
:
:
Lost cluster chain at cluster 14724
1 Cluster(s) lost
Reconnect? yes
No LOST.DIR directory
Clear? yes
Lost cluster chain at cluster 14725
1 Cluster(s) lost
Reconnect? yes
No LOST.DIR directory
Clear? yes
Update FATs? yes
2 files, 4.0 GiB free (4161773 clusters)

***** FILE SYSTEM WAS MODIFIED *****
total 0
-rwxr-xr-x  1 root wheel 0 Dec 30  1979 b
FAIL msdos11.sh exit code 1

Summary: Tests which failed within the last two days.
20240318 11:01:27 PHO main-n268815-e48770de6831-dirty mountro4.sh (new)
20240318 11:13:19 PHO main-n268815-e48770de6831-dirty msdos10.sh (new)
20240318 11:15:46 PHO main-n268815-e48770de6831-dirty msdos12.sh (new)
20240318 11:16:03 PHO main-n268815-e48770de6831-dirty msdos13.sh (new)
20240318 13:18:26 PHO main-n268815-e48770de6831-dirty msdos11.sh
20240318 13:18:27 all.sh done, elapsed 0 day(s), 00:02.10
root@mercat1:/usr/src/tools/test/stress2/misc # uname -a
FreeBSD mercat1.netperf.freebsd.org 15.0-CURRENT FreeBSD 15.0-CURRENT #0 main-n268815-e48770de6831-dirty: Mon Mar 18 10:42:34 CET 2024     pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO amd64
root@mercat1:/usr/src/tools/test/stress2/misc # ls -l `which fsck_msdosfs`
-r-xr-xr-x  1 root wheel 39752 Mar 18 12:54 /sbin/fsck_msdosfs
root@mercat1:/usr/src/tools/test/stress2/misc #
In D44319#1012663, @pho wrote:

D44319.id135654.diff doesn't seem to work for me:

root@mercat1:/usr/src/tools/test/stress2/misc # ./all.sh -onc msdos11.sh
20240318 13:16:17 all: msdos11.sh
** /dev/md10a
** Phase 1 - Read FAT and checking connectivity
** Phase 2 - Checking Directories
** Phase 3 - Checking for Lost Files
Lost cluster chain at cluster 4
1 Cluster(s) lost
Reconnect? yes
No LOST.DIR directory
Clear? yes
Lost cluster chain at cluster 5
1 Cluster(s) lost
Reconnect? yes
No LOST.DIR directory
Clear? yes

Thanks a lot! I had looked for a regression test tool that would report issues with the patches, but did not know about these stress tests.
I had tested copying the kernel sourced to a FAT file-system with "rsync -r" and had verified that "rsync -r -c" did not report any issues.

The cause of the test failure was that getblk() calls needed the same offset between logical and physical sector numbers, and I have completed a patch set that lets the stress2/misc/msdos*.sh tests succeed, except for msdos18.sh (which seems to trigger a code path that has not been covered, yet).

I'm not uploading the fixed patches, since I first want to get msdos18.sh working. I plan to add a number of assertions to verify that the difference with logical and physical sectors is observed in all places.

But it may be impossible to address all buffer operations, since cluster reads and writes seem to only support blkno == lbklno.
I have not yet checked whether it is possible to make cluster operations support the sector offset without a rewrite of these functions.

Instead of adding the offset to all operations, a different approach might be to adjust clusters to natural boundaries by addin reserved sectors to FAT file systems created with newfs_msdos and to reject FAT file systems that would require buffers larger than maxbcachebuf (due to unaligned 64 KB clusters).

Document latest version of this attempt to align FAT sectors with VM Pages.
This version passes all MSDOSFS stress tests, except for one (stress2/misc/msdos10.sh) that uses mmap() calls when running "fsx".

This review will be abandonded, since there are 2 operations that receive only logical sector numbers:

  • cluster_read / cluster_write
  • mmap

While cluster operations can be converted into normal read or write operations, this is not true for mmap().

Since mmap() mandates that physical and logical sector numbers are identical and since it is used in many user-land utilities, sometimes without fall-back to read or write operations, it might be possible to remove the notion of physical sector numbers from all file system and buffer cache code.

This revision now requires review to proceed.Jun 1 2024, 12:18 PM

As stated before, the lack of support of distinct physical and logical sector numbers in mmap() would require to drop mmap support from msdosfs, if this approach was to be committed.

Instead, 2 new revisions will be created:

  1. Make newfs_msdos perform alignment by default, to have cluster boundaries match VM page boundaries.
  2. A planned GEOM module (gfat) that intercepts unaligned FAT file system operations and that adjusts sector numbers seen by the kernel to be aligned with VM pages.

The design of the GEOM module has not been finished, and there may be issues with file systems created with inconsistent parameters (e.g. FAT12/16 root directories that are not filling a whole logical sector). An attempt will be made to support as many such cases as possible and to transparently map cluster boundaries seen by the buffer cache to VM pages.