
tmpfs: increase memory reserve to a percent of available memory + swap
ClosedPublic

Authored by karels on Dec 12 2023, 10:50 PM.

Details

Summary

The tmpfs memory reserve defaulted to 4 MB, and other than that,
all of available memory + swap could be allocated to tmpfs files.
This was dangerous: the page daemon attempts to keep some memory
free, so filling tmpfs would consume swap and then result in
processes being killed.
Increase the reserve to a fraction of available memory + swap at
file system startup time. The limit is expressed as a percentage
of available memory + swap that can be used, and defaults to 95%.
The percentage can be changed via the vfs.tmpfs.mem_percent sysctl,
which recomputes the reserve using the new percentage but the
initial available memory + swap. Note that the reserve can also be set
directly with an existing sysctl, ignoring the percentage. The
previous behavior can be specified by setting vfs.tmpfs.mem_percent
to 100.

PR: 275436
MFC after: 1 month
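
For illustration, here is a minimal sketch of the computation described above. tmpfs_pages_reserved and TMPFS_PAGES_MINRESERVED are names that appear in the review below; the helper name, its argument, and the exact arithmetic are assumptions, not the committed code.

/*
 * Hedged sketch, not the committed code: compute the reserve as the
 * complement of the usable percentage of the initial available
 * memory + swap, never letting it fall below the old 4 MB floor.
 */
static u_int tmpfs_mem_percent = 95;	/* vfs.tmpfs.mem_percent */

static void
tmpfs_set_reserve_from_percent(size_t initial_avail_pages)
{
	size_t reserved;

	reserved = initial_avail_pages * (100 - tmpfs_mem_percent) / 100;
	tmpfs_pages_reserved = max(reserved, TMPFS_PAGES_MINRESERVED);
}

Note that with the percentage set to 100 the computed reserve is 0 and the max() clamp restores the old fixed floor, matching the "previous behavior" note in the summary.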

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Warnings
Severity  Location                     Code    Message
Warning   share/man/man5/tmpfs.5:183   SPELL1  Possible Spelling Mistake
Unit
No Test Coverage
Build Status
Buildable 54958
Build 51847: arc lint + arc unit

Event Timeline

Please reference the bug number in the commit message, because I believe that the discussion there will be needed when looking for the context of these changes.

sys/fs/tmpfs/tmpfs_subr.c
441

this could be tmpfs_pages_reserved = min(reserved, TMPFS_PAGES_MINRESERVED);

442

As was discussed in the bug, I suggest adding a printf that would announce the new current limit.

445
sys/fs/tmpfs/tmpfs_subr.c
441

Or maybe max(...) :)

442

Should this be on bootverbose?

445

I copied this from a page up, but it seems like sysctl definitions are mixed on this. I'll change both in this file.

This certainly makes things better, but I wonder how it gets along with the other memory pigs, ZFS and bhyve?

This revision is now accepted and ready to land. Dec 13 2023, 11:41 PM

This certainly makes things better, but I wonder how it gets along with the other memory pigs, ZFS and bhyve?

bhyve should act like other user-level memory hogs, although I guess there is some kernel memory too. I can try it though. I have a small ZFS partition on my test machine, I'll try that if I can. I expect that to be similar, in that it reduces free memory (which is what is actually checked for available space), but I'll see if I can get ZFS to chew up some memory on that system. If ZFS allocates much memory before tmpfs_init is called, that will reduce the reserve.

I would not spend cycles testing this. I know for a fact that there are issues here. FreeBSD has a collection of memory pigs (bhyve, ZFS, and tmpfs) without any cooperative mechanism to decide who has priority, or who should have certain amounts of "free at boot time" memory reserved for them. A major contention point for those of us running bhyve on ZFS systems is that we almost ALWAYS have to tweak vfs.zfs.arc.max in /boot/loader.conf to keep the ZFS and bhyve memory footprints from clashing. I was not aware that tmpfs could also eat up to 95% of boot-time free memory, adding more contention and leading to OOM killing. I personally protect my bhyve VMs by using the wired-memory option, so that they either get the memory they need or they fail to load. Since this is not the default, we see user complaints about their bhyve VMs or other processes getting killed when memory pressure occurs, most often caused by ZFS eating up all the memory.

I spent a little time testing ZFS yesterday out of curiosity. With increasing memory pressure, it looked like ZFS downsized the ARC somewhat: as I added files to tmpfs, the free space didn't go down by as much as the content added. But ZFS needs management in most cases; the default seems to assume that the box is a ZFS file server (even for local use) and that ZFS owns most of memory. I also limit the ARC manually. But with any substantial competition for memory, the sysadmin really needs to set limits and/or tune. A tmpfs mount can also be limited by setting its size explicitly.
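
For example, an explicit per-mount cap (the size and mount point here are hypothetical) bounds that tmpfs regardless of the global percentage:

# mount -t tmpfs -o size=4g tmpfs /mnt

The analogous knob on the ZFS side is the vfs.zfs.arc.max tunable mentioned above.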

Add sysctls to man page; rename mem_percent sysctl to memory_percent
to be parallel to memory_reserve.

This revision now requires review to proceed. Dec 14 2023, 10:25 PM
sys/fs/tmpfs/tmpfs_subr.c
439

I do not think this printf is useful. I would like to see the default 'size' for tmpfs in case it is not provided by the user, which is ultimately what your patch does.

Perhaps you should just set size from that reserve on a mount without a size option. IMO that is a better UI.

Then, in an ideal world, mount -v could report the size option for tmpfs (instead of the printf), but that is a huge endeavor.

share/man/man5/tmpfs.5
183

I'll fix the typo in the next update.

sys/fs/tmpfs/tmpfs_subr.c
439

I added the printf at your request, although I apparently didn't understand what you wanted. But the size is not fixed; the reserve is fixed. The total size and available space float depending on the memory load and swap utilization. For example, when df does statfs, the size is computed dynamically. I could print the "initial size" if you thought that was useful, or I could take this out again.

sys/fs/tmpfs/tmpfs_subr.c
439

On second thought, the initial size doesn't make sense either. This is not about a specific file system, and there may never be a tmpfs mounted at all. This is currently called from tmpfs_init and the sysctl. I think I should just remove the printf.

sys/fs/tmpfs/tmpfs_subr.c
439

Let me explain what I want, instead of prescribing a specific code change.

I want to see the reason why I get either ENOSPC or SIGBUS from an app when it writes to a tmpfs file or faults on a tmpfs mapping. Right now I know that it can only happen if the size= option was specified at mount time. After your change, there is a transient condition that also causes ENOSPC/SIGBUS. I want it to be 1) non-transient so that I can catch the state 2) explicit so that I can check it.

This is why I initially proposed the printf (of the total size allowed, to get #2) and then gradually moved toward size-like behavior (to get both #1 and #2). Just printing the reservation in pages might be interesting, but it is mostly unusable for correlating with system behavior, especially in an error state.

sys/fs/tmpfs/tmpfs_subr.c
439

I want to see the reason why I get either ENOSPC or SIGBUS from an app when it writes to a tmpfs file or faults on a tmpfs mapping. Right now I know that it can only happen if the size= option was specified at mount time.

No, you can still get ENOSPC when attempting to create a file. And with or without the change, failures are not necessarily due to a transient situation. If the memory exhaustion is due to writing to the tmpfs, it probably is not.

I want it to be 1) non-transient so that I can catch the state 2) explicit so that I can check it.

It can't be made non-transient if the memory load is transient. There is no reserved memory or size for tmpfs; there is a current limit based on reserved free memory. For example, consider this case on an unmodified -current system:

mjk-test# mount -t tmpfs tmpfs /mnt
mjk-test# df /mnt
Filesystem    Size    Used   Avail Capacity  Mounted on
tmpfs          82G    4.1k     82G     0%    /mnt
mjk-test# hacks/use_mem 24g 0.4 &
mjk-test# df /mnt
Filesystem    Size    Used   Avail Capacity  Mounted on
tmpfs          56G    4.1k     56G     0%    /mnt
mjk-test#

Note that both the size and available space have been decreased as the memory utilization increased. It is observable, but may change over time.
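
To make the floating limit concrete, here is a sketch in the style of the helpers in sys/fs/tmpfs/tmpfs_subr.c; the function and global names are recalled from memory and may not match the patch exactly.

/*
 * Hedged sketch: the "size" that df reports is derived on demand from
 * the current free memory + swap, minus the reserve, so it shrinks as
 * other consumers allocate memory, as in the df output above.
 */
size_t
tmpfs_mem_avail(void)
{
	size_t avail;

	avail = vm_free_count() + swap_pager_avail;
	if (avail <= tmpfs_pages_reserved)
		return (0);
	return (avail - tmpfs_pages_reserved);
}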

When the reserve is set (where the printf is now), there is no information on whether a tmpfs file system will be created, whether it will have a size of 0, or whether there will be more than one (with or without a size of 0). No memory is reserved, and the limit floats with memory load (according to df and file create at least, or for writes with the change in the parent review). It would be futile to declare memory to be reserved for a file system without pre-allocation, which is not done even when the size is non-zero.

If we tried to reserve memory when a tmpfs file system was mounted, what would we do when a second tmpfs was mounted? What if swap space was removed? What would we do with other memory allocations, as other processes start and grow?

It might be possible to make writes block if space is not available, although that would be inconsistent with behavior on other file systems. For that matter, file creation could block. Personally, I don't think that is a good idea. It seems to me that processes would back up waiting for this, possibly non-interruptibly. Ideally, blocking would be done before committing to a write, and it would be painful and maybe racy to back out after a timeout (especially where the current write checks are).

Remove useless printf; fix typo in man page

This revision is now accepted and ready to land. Dec 18 2023, 6:33 PM
sys/fs/tmpfs/tmpfs_subr.c
439

What I proposed could be tweaked into a proposal to set size= for a tmpfs mount without a user-supplied size mount option, based on your reserve knob. Also, the size= parameter should be made updatable by mount -u if it is not already (it seems that it is). Similarly, we could probably grow a parameter like inonum and also set it automatically.

This is not ideal either, since e.g. your example of two tmpfs mounts is not gracefully handled, but OTOH it would address your current concerns without surprising ENOSPC.
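
For concreteness, a sketch of that proposal (not what the patch does; size_specified and tm_pages_max are assumptions here):

/*
 * Hedged sketch of the proposal above, not the committed change: on a
 * mount with no user-supplied size= option, freeze the mount's size at
 * the space available at mount time.
 */
if (!size_specified)
	tmp->tm_pages_max = tmpfs_mem_avail();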

I do not insist on my proposal; if you want to go ahead with your current solution, please do. It is just my opinion, so I do not want to endorse the patch in the form of a 'Reviewed by', but I also do not want to block you.

sys/fs/tmpfs/tmpfs_subr.c
439

OK. I think setting a size and limiting only against that would be a regression, as it cannot be honored if there is more than trivial memory competition, and it would (again) be an over-commitment. Note that even with size set, tmpfs will limit file creation.