Page MenuHomeFreeBSD

zfs: Call vn_generic_copy_file_range() if block_cloning disabled
Needs RevisionPublic

Authored by cy on Apr 4 2023, 8:45 PM.
Tags
None
Referenced Files
F102169418: D39419.diff
Fri, Nov 8, 11:29 AM
Unknown Object (File)
Mon, Nov 4, 3:27 AM
Unknown Object (File)
Sep 24 2024, 2:17 AM
Unknown Object (File)
Sep 13 2024, 8:45 PM
Unknown Object (File)
Sep 5 2024, 6:04 AM
Unknown Object (File)
Sep 5 2024, 6:04 AM
Unknown Object (File)
Sep 5 2024, 6:04 AM
Unknown Object (File)
Sep 5 2024, 5:54 AM

Details

Summary

When block_cloning is disabled zfs_clone_range() will return with EXDEV
indicating the operation is unsupported, resulting in regular file
copies from/to the same dataset with an empty target fail and the above
error. Poudriere package repository corruption and random artifacts in
email also result from this regression.

Reported by: cy
Submitted by: rmacklem, mm
Tested by: cy
Obtained from: rmacklem, mm
Fixes: 2a58b312b62f
X-MFC with: 2a58b312b62f

This patch was written by rmacklem and mm.

Test Plan

Tested locally.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

cy requested review of this revision.Apr 4 2023, 8:45 PM
sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
6258

this is avoidably expensive

instead you can 1. vn_start_write 2. validate the target still has the same mount point set 3. ... then you can get dive into zfs data without locking the vnode

6262

this duplicates the other consumer

just add goto target at the end and have both use it

mm requested changes to this revision.Apr 4 2023, 8:57 PM

I still don't see the benefit of the early feature check. Are mac_vnode_check_write() and vn_rlimit_fsize() expensive?
vn_generic_copy_file_range() needs to be called on unlocked vnodes

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
6245โ€“6268

What is the actual benefit doing it this early? What if we can't lock outvp? What if we can't lock invp and therefore unlock outvp in the later code?

IMO what we save with the early check is calling mac_vnode_check_write() and vn_rlimit_fsize() and probably locking invp. In zfs_clone_range the feature check is done after zfs_enter() on both datasets. Is that on purpose?

6320โ€“6324

The comment vn_generic_copy_file_range() says:

/*
 * Copy a byte range of one file to another.  This function can handle the
 * case where invp and outvp are on different file systems.
 * It can also be called by a VOP_COPY_FILE_RANGE() to do the work, if there
 * is no better file system specific way to do it.
 */

That is actually our case. zfs_clone_range() exits with EXDEV if:

  • source and destination are not on the same pool
  • the block_cloning feature is not enabled
  • input and output files have a different block size
  • offset and len are not at block boundaries
  • length is not a multiple of block size, with except for the end of file
  • we are trying to clone a block created in the current transaction group
  • we are cloning encrypted data not in the same dataset

IMO we can fallback to vn_generic_copy_file_range() in all of these cases.

As of the locks, we need to run vn_generic_copy_file_range() on unlocked vnodes,
just look into the function.
In both fuse_vnops.c and nfs_clvnops.c it does not run on locked vnodes.
Even the comment from pjd in zfs_vnops_os.c says:

/*
 * TODO: If offset/length is not aligned to recordsize, use
 * vn_generic_copy_file_range() on this fragment.
 * It would be better to do this after we lock the vnodes, but then we
 * need something else than vn_generic_copy_file_range().
 */

So IMO it should be at the end after unlock.

This revision now requires changes to proceed.Apr 4 2023, 8:57 PM
In D39419#897277, @mm wrote:

I still don't see the benefit of the early feature check. Are mac_vnode_check_write() and vn_rlimit_fsize() expensive?
vn_generic_copy_file_range() needs to be called on unlocked vnodes

vnode locking is expensive and can be avoided with my suggestion

In D39419#897283, @mjg wrote:
In D39419#897277, @mm wrote:

I still don't see the benefit of the early feature check. Are mac_vnode_check_write() and vn_rlimit_fsize() expensive?
vn_generic_copy_file_range() needs to be called on unlocked vnodes

vnode locking is expensive and can be avoided with my suggestion

Then we might think of another approach, what about putting the feature check inside the loop? That way we can make sure that it has always been done on the locked outvp.

Let me say, I'm certainly willing, if it makes more sense, to abandon this revision and let mm or rmacklem commit this.

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
6258

Would you think simply testing for error == EXDEV and calling vn_generic_copy_file_range() would be a better way handle this regression?

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
6258

Ok, I think I understand this. vn_start_write() ref counts
the mount point and so long as "mp == outvp->v_mount"
after the successful call you are saying that the outzfsvfs
setting should be safe to use?

I'll code up that version and email it as an attachment.

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
6258

you have to obtain it from the mount, not throught he vnode

but past that, yes

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
6258

Ok, so how do I get it through the mount point?
(I can only see macros based on the vnode/znode.)

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
6258

->mnt_data

Latest patch from rmacklem@ for testing. His comments:

Attached is an updated patch that I think addresses the
comments in the review.

  • It now uses the "mp" returned by vn_start_write() to acquire the zfsvfs_t pointer, avoiding doing the vnode lock.
  • Uses a "goto do_generic" instead of having the call in two places.
  • Uses "error2" for the return value for zfs_clone_range() so that the check for EXDEV only applies to the return value for zfs_clone_range().

Good luck with testing,etc, rick
ps: I have no idea how to commit a ZFS change, so hopefully

you or mm@ can do it.
mm requested changes to this revision.Apr 5 2023, 7:59 AM

Just simple typo fix, otherwise LGTM.

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c.new
6256 โ†—(On Diff #119862)

Typo fix

This revision now requires changes to proceed.Apr 5 2023, 7:59 AM

We shouldn't make the code more complex and less readable just to optimize for some rare edge cases. If someone disables block cloning, they are most likely not too concerned with performance and trying to avoid an extra vnode lock at the expense of code readability is, in my opinion, the wrong approach.

What should be done instead, IMHO:

  1. Move vnode locking into the vn_copy_file_range() function, so we don't deal with this in file system-specific code, just like with other syscalls.
  2. Reimplement vn_generic_copy_file_range() to assume vnodes are already properly locked.
  3. Fallback to vn_generic_copy_file_range() only within vn_copy_file_range() if VOP_COPY_FILE_RANGE() cannot be used or fails.

PS. How does vn_generic_copy_file_range() avoid a deadlock when it is called with (srcfd, dstfd) and (dstfd, srcfd) simultaneously?
PS2. From what I reading vn_start_write() it can set mp and fail, so vn_finished_write() should only be called if vn_start_write() succeeded and not if mp != NULL. This bug is in both ZFS and vn_generic_copy_file_range().

In D39419#897434, @pjd wrote:

We shouldn't make the code more complex and less readable just to optimize for some rare edge cases. If someone disables block cloning, they are most likely not too concerned with performance and trying to avoid an extra vnode lock at the expense of code readability is, in my opinion, the wrong approach.

There is no real complexity added from this. Also see below.

What should be done instead, IMHO:

  1. Move vnode locking into the vn_copy_file_range() function, so we don't deal with this in file system-specific code, just like with other syscalls.

This would be a major step backwards. All the "prep" work by vfs needs to die, just like sudden disappearance of ->v_data.

  1. Reimplement vn_generic_copy_file_range() to assume vnodes are already properly locked.

So building on the previous point, vn_generic_copy_file_range never has both vnodes locked at the same time. More, with some extra effort it can *avoid* locking the input vnode -- see vop_pgread. Deciding for the func what locks it should hold and whatnot goes directly against it.

  1. Fallback to vn_generic_copy_file_range() only within vn_copy_file_range() if VOP_COPY_FILE_RANGE() cannot be used or fails.

I already commented on this idea here: D38814

PS. How does vn_generic_copy_file_range() avoid a deadlock when it is called with (srcfd, dstfd) and (dstfd, srcfd) simultaneously?

It does not deadlock because it does not lock these 2 vnodes at the same time.

PS2. From what I reading vn_start_write() it can set mp and fail, so vn_finished_write() should only be called if vn_start_write() succeeded and not if mp != NULL. This bug is in both ZFS and vn_generic_copy_file_range().

There is a routine named vn_lock_pair which can be used to secure access to both vnodes deadlock-free and it would get rid of opencoded loop as well. On top of that the vn_start_write/vn_finished_write trip would also disappear, I don't know why it is there in the first place.

All that said, I'm going to code up my variant later today and post it.

In D39419#897434, @pjd wrote:

We shouldn't make the code more complex and less readable just to optimize for some rare edge cases. If someone disables block cloning, they are most likely not too concerned with performance and trying to avoid an extra vnode lock at the expense of code readability is, in my opinion, the wrong approach.

Well I know nothing about ZFS, but while doing this I learned that block cloning is not enabled
by default. I suspect many other ZFS users are like me and would not know that enabling
block cloning was needed for performance (or even how to enable it). I assume there is a
reason it is not always enabled?
--> At this time, I don't think block cloning being disabled is a rare corner case.

What should be done instead, IMHO:

  1. Move vnode locking into the vn_copy_file_range() function, so we don't deal with this in file system-specific code, just like with other syscalls.
  2. Reimplement vn_generic_copy_file_range() to assume vnodes are already properly locked.

One of the goals of this was to retain holes in sparse files when copying. It happens that VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE)
needs to be done with the vnode unlocked. As such, the code below the VOP_COPY_FILE_RANGE() needs to lock/unlock
vnodes as it progresses.

  1. Fallback to vn_generic_copy_file_range() only within vn_copy_file_range() if VOP_COPY_FILE_RANGE() cannot be used or fails.

PS. How does vn_generic_copy_file_range() avoid a deadlock when it is called with (srcfd, dstfd) and (dstfd, srcfd) simultaneously?
PS2. From what I reading vn_start_write() it can set mp and fail, so vn_finished_write() should only be called if vn_start_write() succeeded and not if mp != NULL. This bug is in both ZFS and vn_generic_copy_file_range().

Yes, looking at vn_start_write(), there are cases where mp is returned non-NULL and error != 0.
For those cases vn_finished_write() should not be called.
However, if vn_start_write() returns mp == NULL and error == 0, vn_finished_write() just
returns, so calling it whenever error == 0 is safe but if mp == NULL, unnecessary.
This is similar to the "should you call free() when the pointer is NULL" debate.
Some say "yes", since free() handles that case. I tend to say "no" because it documents
in the code that the pointer can be NULL at that point, but now follow the "yes"
edict, since that seems to be what FreeBSD prefers these days.

For this code, mp is set NULL before the vn_start_write() call and, as such, will
only be returned non-NULL if vn_start_write() returns 0. (I'm pretty sure.)
--> Using either mp != NULL or returned 0 works.
I consider kib@ (and now mjg@) as the "curators" for the VFS, so I am happy
to do whichever they prefer. (kib@ has reviewed some of this code in the past.)

In D39419#897434, @pjd wrote:

PS2. From what I reading vn_start_write() it can set mp and fail, so vn_finished_write() should only be called if vn_start_write() succeeded and not if mp != NULL. This bug is in both ZFS and vn_generic_copy_file_range().

Yes, looking at vn_start_write(), there are cases where mp is returned non-NULL and error != 0.
For those cases vn_finished_write() should not be called.
However, if vn_start_write() returns mp == NULL and error == 0, vn_finished_write() just
returns, so calling it whenever error == 0 is safe but if mp == NULL, unnecessary.
This is similar to the "should you call free() when the pointer is NULL" debate.
Some say "yes", since free() handles that case. I tend to say "no" because it documents
in the code that the pointer can be NULL at that point, but now follow the "yes"
edict, since that seems to be what FreeBSD prefers these days.

For this code, mp is set NULL before the vn_start_write() call and, as such, will
only be returned non-NULL if vn_start_write() returns 0. (I'm pretty sure.)
--> Using either mp != NULL or returned 0 works.

I think we should make this KPI less confusing. Lets ensure that *mpp is NULL always if vn_finished_write() not need to be called. See D39441

A little off topic, but would it make sense to add
"lktype" arguments to vn_lock_pair() so that it can
optionally acquire LK_SHARED vnode locks?

A little off topic, but would it make sense to add
"lktype" arguments to vn_lock_pair() so that it can
optionally acquire LK_SHARED vnode locks?

I did not initially because there are possible complications with LK_SHARED and recursion. What is your scenario where vn_lock_pair() LK_SHARED is useful?

In D39419#897831, @kib wrote:

A little off topic, but would it make sense to add
"lktype" arguments to vn_lock_pair() so that it can
optionally acquire LK_SHARED vnode locks?

I did not initially because there are possible complications with LK_SHARED and recursion. What is your scenario where vn_lock_pair() LK_SHARED is useful?

This one. invp can be LK_SHARED and, maybe outvp as well.
(I know ZFS allows writing with a LK_SHARED vnode. I do not
know if the same is true for zfs_block_clone()?)

since there is active breakage I'm going to commit my stuff from https://github.com/openzfs/zfs/pull/14723 for now.

danfe@ reported a problem via email where rlimit
resulted in an empty file. I think this is due to the
use of SSIZE_MAX for the length and then checking
it with a call to vn_rlimit_fsize().

I think a patch like this (basically cloned from
vn_generic_copy_file-range() is needed):

  • zfs_vnops_os.c.sav 2023-04-10 08:01:05.905906000 -0700

+++ zfs_vnops_os.c 2023-04-10 08:14:40.563743000 -0700
@@ -6242,7 +6242,8 @@ zfs_freebsd_copy_file_range(struct vop_copy_file_range

	struct mount *mp;
	struct uio io;
	int error;
  • uint64_t len = *ap->a_lenp;

+ uint64_t len;
+ ssize_t r = 0;

	/*
	 * TODO: If offset/length is not aligned to recordsize, use

@@ -6280,9 +6281,14 @@ zfs_freebsd_copy_file_range(struct vop_copy_file_range

	io.uio_offset = *ap->a_outoffp;
	io.uio_resid = *ap->a_lenp;
  • error = vn_rlimit_fsize(outvp, &io, ap->a_fsizetd);

+ error = vn_rlimit_fsizex(outvp, &io, 0, &r, ap->a_fsize_td);

	if (error != 0)
		goto out_locked;

+ len = io.uio_resid;
+ /*
+ * No need to call vn_rlimit_fsizex_res before return,
+ * since the uio is local.
+ */

	error = zfs_clone_range(VTOZ(invp), ap->a_inoffp, VTOZ(outvp),
	    ap->a_outoffp, &len, ap->a_outcred);

If you want the diff as plain text, just email me.

danfe@ reported a problem via email where rlimit
resulted in an empty file. I think this is due to the
use of SSIZE_MAX for the length and then checking
it with a call to vn_rlimit_fsize().

I think a patch like this (basically cloned from
vn_generic_copy_file-range() is needed):

  • zfs_vnops_os.c.sav 2023-04-10 08:01:05.905906000 -0700

+++ zfs_vnops_os.c 2023-04-10 08:14:40.563743000 -0700
@@ -6242,7 +6242,8 @@ zfs_freebsd_copy_file_range(struct vop_copy_file_range

	struct mount *mp;
	struct uio io;
	int error;
  • uint64_t len = *ap->a_lenp;

+ uint64_t len;
+ ssize_t r = 0;

	/*
	 * TODO: If offset/length is not aligned to recordsize, use

@@ -6280,9 +6281,14 @@ zfs_freebsd_copy_file_range(struct vop_copy_file_range

	io.uio_offset = *ap->a_outoffp;
	io.uio_resid = *ap->a_lenp;
  • error = vn_rlimit_fsize(outvp, &io, ap->a_fsizetd);

+ error = vn_rlimit_fsizex(outvp, &io, 0, &r, ap->a_fsize_td);

Oops, typo. It should be ap->a_fsizetd and not ap->a_fsize_td.

	if (error != 0)
		goto out_locked;

+ len = io.uio_resid;
+ /*
+ * No need to call vn_rlimit_fsizex_res before return,
+ * since the uio is local.
+ */

	error = zfs_clone_range(VTOZ(invp), ap->a_inoffp, VTOZ(outvp),
	    ap->a_outoffp, &len, ap->a_outcred);

If you want the diff as plain text, just email me.