do not do Direct writes for IO_APPEND if the file is mmap()'d
ClosedPublic

Authored by rmacklem on Dec 16 2021, 11:37 PM.

Details

Summary

Commit 867c27c23a5c modified the NFS client so that it
does IO_APPEND writes as direct writes to the NFS server.

However, this could result in stale data in client pages when
the file is mmap(2)'d.

This patch modifies the code so that it does not do direct writes
when the file is mmap(2)'d, by setting a flag in the NFS vnode to
indicate it is mmap(2)'d.
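
A minimal sketch of that approach (not the actual diff; the hook placement and locking here are assumptions): the client sets a flag, NMMAPPED, in the NFS node when the vnode is mapped, and the IO_APPEND path declines direct writes while it is set.

```c
/*
 * Sketch only, not the committed diff: set an NMMAPPED flag in the
 * NFS node when the vnode is mmap(2)'d, and check it before choosing
 * the direct path for IO_APPEND writes.
 */
static int
nfs_mmapped(struct vop_mmapped_args *ap)
{
	struct nfsnode *np = VTONFS(ap->a_vp);

	NFSLOCKNODE(np);
	np->n_flag |= NMMAPPED;		/* remember that the file was mapped */
	NFSUNLOCKNODE(np);
	return (0);
}

/* In ncl_write(), IO_APPEND writes would go direct only while unmapped: */
	if ((ioflag & IO_APPEND) != 0 && (np->n_flag & NMMAPPED) == 0)
		ioflag |= IO_DIRECT;
```

As the discussion below shows, the NMMAPPED flag was eventually dropped in favor of a VM-level check.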

Test Plan

Run a test program that does 10000 append writes, both
with and without the file mmap(2)'d, while printf()s added
to the kernel indicate whether or not direct writes are done.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

So this basically sets the flag forever. You might consider using is_object_active(), which, although not exact, is still better than a forever flag, IMO.

Well, unless VOP_INACTIVE() doesn't get called (I know there are
unusual cases where it does not), "forever" ends at NFS VOP_INACTIVE().
--> Is a call to is_object_active() going to be better (i.e. find it inactive sooner) than that?


If the vnode is still open but the last mapping is gone, is_object_active() becomes false, while inactivation has still not been done. This should be relevant for the issue at hand, because an append write means that the file is open.
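
For context, the check under discussion is tiny; a sketch modeled on is_object_active() in vm_meter.c:

```c
/*
 * Sketch of the check being discussed: an object counts as active
 * while it has references other than from shadow objects, i.e. while
 * mappings of it remain.
 */
static bool
is_object_active(vm_object_t obj)
{
	return (obj->ref_count > obj->shadow_count);
}
```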

Make is_object_active() a global function called vm_is_object_active()
and use it as an additional test for "able to do Append direct writes".

Does the object lock need to be held when vm_is_object_active() is
called?

I updated my test program so it fork()'d a child. The parent mmap(2)'d
the file and then slept for 15 seconds while the child did 10000 IO_APPEND writes.
(The 10000 IO_APPEND writes take 12 seconds when done direct and about 130 seconds
when not done direct. This test ran in 24 seconds and a printf showed it switched
to direct writes after 15 seconds.)
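
A userland sketch of that test (the file path and write size are illustrative, not from the original program):

```c
/*
 * Sketch of the test described above: the parent mmap(2)'s the file
 * and sleeps; the child does 10000 O_APPEND writes.  While the mapping
 * is live the client should use buffered writes, then switch to direct
 * writes once the parent unmaps.
 */
#include <sys/mman.h>
#include <sys/wait.h>
#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char buf[512];
	void *addr;
	pid_t pid;
	int fd;

	fd = open("/mnt/nfs/appendtest", O_RDWR | O_APPEND | O_CREAT, 0644);
	if (fd < 0)
		err(1, "open");
	memset(buf, 'a', sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != sizeof(buf))	/* ensure nonzero size */
		err(1, "write");

	pid = fork();
	if (pid == 0) {
		/* Child: 10000 append writes. */
		for (int i = 0; i < 10000; i++)
			if (write(fd, buf, sizeof(buf)) != sizeof(buf))
				err(1, "append write");
		_exit(0);
	}
	/* Parent: hold a mapping for 15 seconds, then drop it. */
	addr = mmap(NULL, sizeof(buf), PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		err(1, "mmap");
	sleep(15);
	munmap(addr, sizeof(buf));
	waitpid(pid, NULL, 0);
	return (0);
}
```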

Oh, and I think the change of is_object_active() to a global function
should be a separate commit.

> Make is_object_active() a global function called vm_is_object_active()
> and use it as an additional test for "able to do Append direct writes".
>
> Does the object lock need to be held when vm_is_object_active() is called?

Generally yes, to get a consistent snapshot of ref_count vs. shadow_count. But you cannot get a ref_count for mmap() without also taking the vnode lock, and since the vnode should be exclusively locked for a write, I think it is good enough not to take the object lock there.
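
Given that rationale, a sketch of how the resulting check in the IO_APPEND path might look (simplified; the other conditions in ncl_write() are omitted):

```c
/*
 * Sketch only: the vnode is exclusively locked for the write, which
 * keeps new mappings from appearing, so the object lock is not taken
 * around the check.
 */
ASSERT_VOP_ELOCKED(vp, "ncl_write append");
if ((ioflag & IO_APPEND) != 0 &&
    (vp->v_object == NULL || !vm_is_object_active(vp->v_object))) {
	/* No live mappings: the append write can go directly to the server. */
}
```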

sys/fs/nfsclient/nfs_clvnops.c
157 ↗(On Diff #100189)

I think these bits are no longer needed.

sys/vm/vm_extern.h
134 ↗(On Diff #100189)

Name it vm_object_is_active() and put it into vm/vm_object.h + vm/vm_object.c.

Renamed the function to vm_object_is_active() and moved
it to vm_object.c/vm_object.h as requested by kib@.

Also, tested without NMMAPPED and it appears to work
ok, so I removed that part of the patch, as suggested by kib@.

I intend to commit the nfs_clbio.c change as a separate patch.

This revision is now accepted and ready to land. Dec 17 2021, 7:59 PM
sys/fs/nfsclient/nfs_clbio.c
1015 ↗(On Diff #100216)

Some time ago the code managing shadow chains was refactored and changed so that only OBJ_ANON objects maintain a list of shadowing objects (and shadow_count). Doesn't this mean that vp->v_object->shadow_count is always 0, i.e., the use of this function disables the optimization?

sys/fs/nfsclient/nfs_clbio.c
1015 ↗(On Diff #100216)

Oh, indeed, I forgot about OBJ_ANON. But this breaks vm_meter at least.

I think shadow_count maintenance should be restored, perhaps in a lock-less manner for objects that do not need to maintain the shadow_list. I will do that.
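
A sketch of what that lock-less maintenance could look like (an assumption about the eventual fix: shadow_count kept for all backing objects and updated with atomics, with only OBJ_ANON objects also maintaining the shadow_list):

```c
/*
 * Sketch: maintain shadow_count for every backing object using
 * atomics, so non-OBJ_ANON objects need not take the object lock;
 * only OBJ_ANON objects would also maintain the shadow_list itself.
 */
static void
vm_object_add_shadow(vm_object_t object)
{
	atomic_add_int(&object->shadow_count, 1);
}

static void
vm_object_sub_shadow(vm_object_t object)
{
	KASSERT(object->shadow_count >= 1,
	    ("object %p: shadow_count underflow", object));
	atomic_subtract_int(&object->shadow_count, 1);
}
```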

sys/fs/nfsclient/nfs_clbio.c
1015 ↗(On Diff #100216)

In fact I do not understand why it is sufficient to check ref_count == shadow_count. Consider a file with two copy-on-write mappings, where each mapping has been written to at least once. Then the vnode's VM object has ref_count == shadow_count == 2 (assuming shadow count is maintained). Suppose a process updates the last page in the mappings with a write(). I would expect that write to be visible via either of the mappings, assuming that that page had not already been copied.
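
That scenario is easy to sketch from userland (the file name is illustrative):

```c
/*
 * Sketch of the CoW scenario above: two private mappings, each
 * dirtied once, then a write(2) to a page that has not been copied.
 * The update is expected to be visible through both mappings.
 */
#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	char *m1, *m2;
	int fd;

	fd = open("testfile", O_RDWR);	/* assumed at least two pages long */
	if (fd < 0)
		err(1, "open");
	m1 = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	m2 = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (m1 == MAP_FAILED || m2 == MAP_FAILED)
		err(1, "mmap");
	m1[0] = 'x';			/* each mapping dirtied once, so   */
	m2[0] = 'x';			/* ref_count == shadow_count == 2  */
	if (pwrite(fd, "y", 1, pg) != 1)	/* update the uncopied page */
		err(1, "pwrite");
	/* The second page was never copied, so 'y' should appear in both. */
	printf("m1: %c, m2: %c\n", m1[pg], m2[pg]);
	return (0);
}
```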

sys/fs/nfsclient/nfs_clbio.c
1015 ↗(On Diff #100216)

I do not believe this case was ever specified to work only in that way. For instance, there might be one more object in the chain, and it might have the written page shadowed. I remember Alan mentioning that the real Mach VM contained something that ensured shadowed mappings only see the state of the backing object at the moment of shadowing, but in FreeBSD it was removed.

sys/fs/nfsclient/nfs_clbio.c
1015 ↗(On Diff #100216)

(In my example there is no need for two CoW mappings; I was thinking of object collapse, but that's not relevant here.)

IMO the behaviour in this case should be determined by the VM system, not by individual filesystems, and other filesystems will make the write visible to the shadowing mappings, I believe.

I wonder if NFS could instead use the same mechanism that ZFS does to maintain DMU<->ARC consistency, i.e. update_pages(), which locklessly looks up a page at each index in the written range to see if it needs to be updated. OTOH, ZFS uses an internal rangelock to synchronize VOP_WRITE with VOP_GETPAGES, whereas I cannot see anything equivalent in the NFS client.

With the VM object lock held we could simply look for resident pages in the written range and write directly to the server if none are found, no?
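
A sketch of that check; nfs_range_has_resident_pages() is a hypothetical helper name, and vm_page_find_least() requires the object lock:

```c
/*
 * Sketch of the suggested check: with the object lock held, look for
 * any resident page in the byte range [start, end) being written; if
 * there is none, the append write could go directly to the server.
 */
static bool
nfs_range_has_resident_pages(vm_object_t obj, off_t start, off_t end)
{
	vm_page_t m;
	bool found;

	VM_OBJECT_RLOCK(obj);
	m = vm_page_find_least(obj, OFF_TO_IDX(start));
	found = (m != NULL && m->pindex < OFF_TO_IDX(round_page(end)));
	VM_OBJECT_RUNLOCK(obj);
	return (found);
}
```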

sys/fs/nfsclient/nfs_clbio.c
1015 ↗(On Diff #100216)

Of course there is no such mechanism for NFS, because NFS normally uses the buffer cache and relies on the normal coherence between the buffer cache and the page cache. It is this specific hack/optimization for O_APPEND, created as a reaction to a deficiency in the NFS wire protocol, that breaks the coherency.

I believe that looking for the pages would not be that simple, because pages can be partially valid for NFS (another optimization, to avoid extending transfers over the wire to fill whole pages). And then all these complications are rather pointless, because the next operation after our write would detect the changed mtime and invalidate the cache anyway.

This revision was automatically updated to reflect the committed changes.