When operating in SU or SU+J mode, ffs_syncvnode() might need to instantiate another vnode by inode number while owning the lock of the vnode being synced. Typically this other vnode is the parent of our vnode, but due to renames occurring right before fsync (or during fsync when we drop the syncing vnode lock, see below) it might no longer be the parent.
Moreover, the called function flush_pagedep_deps() needs to lock another vnode while owning the lock of the vnode that owns the buffer whose dependencies are being flushed. This creates another instance of the same LoR as was fixed in softdep_sync().
Put the generic code for safe relocking into a new SU helper, get_parent_vp(), and use it in flush_pagedep_deps(). The case of safe relocking of two vnodes with undefined lock order was extracted into the vn helper vn_lock_pair().
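As an illustration of relocking two locks with undefined order, here is a minimal userspace sketch using pthread mutexes. It is an analogy only and not the kernel code: the name lock_pair() and its signature are hypothetical, and the real vn_lock_pair() operates on vnodes and may differ in details. The idea shown is to never block on one lock while holding the other: try-lock the second, and on failure release the first, block on the contended one, and retry from the other side.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Hypothetical userspace analogy of pair locking (not the kernel code).
 * m1_locked/m2_locked tell which mutexes the caller already holds.
 * On return both are held.  A mutex the caller held on entry may have
 * been dropped and retaken, so the caller must revalidate its state.
 */
static void
lock_pair(pthread_mutex_t *m1, bool m1_locked,
    pthread_mutex_t *m2, bool m2_locked)
{
	assert(m1 != m2);

	if (!m1_locked && !m2_locked) {
		pthread_mutex_lock(m1);
		m1_locked = true;
	}
	for (;;) {
		if (m1_locked && m2_locked)
			return;
		if (m1_locked) {
			/* Hold m1, need m2: only try, never sleep. */
			if (pthread_mutex_trylock(m2) == 0)
				return;
			/* Back off, then sleep on the contended lock. */
			pthread_mutex_unlock(m1);
			m1_locked = false;
			pthread_mutex_lock(m2);
			m2_locked = true;
		} else {
			/* Hold m2, need m1: mirror image of the above. */
			if (pthread_mutex_trylock(m1) == 0)
				return;
			pthread_mutex_unlock(m2);
			m2_locked = false;
			pthread_mutex_lock(m1);
			m1_locked = true;
		}
	}
}
```

Because a thread sleeps only while holding neither lock, two threads taking the same pair in opposite order cannot deadlock on each other. The cost is that a lock held on entry can be transiently released, which is why the callers in this patch have to report ERELOOKUP or recheck for reclamation.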
Due to the call sequence ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(), ffs_syncvnode() indicates with ERELOOKUP that the passed vnode was unlocked in the process, and can return ENOENT if the passed vnode was reclaimed. All callers of the function were inspected.
Because UFS namei lookups store auxiliary information about the directory entry in the in-memory directory inode, and this information is then used by the UFS code that creates/removes the directory entry in the actual mutating VOPs, it is critical that the directory vnode lock is not dropped between the lookup and the VOP.
For softdep_prelink(), which ensures that a later link/unlink operation can proceed without overflowing the journal, calls were moved to places where it is safe to abort processing of the VOP because mutations are not yet applied. Then, ERELOOKUP causes a restart of the whole VFS operation (typically a VFS syscall) at the top level, including the re-lookup of the involved paths.
[Note that we already do the same restart for failing calls to vn_start_write(), so formally this patch does not introduce new behavior]
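The top-level restart can be pictured with the following userspace sketch. All names here (ERELOOKUP_SKETCH, struct op_state, do_lookup_and_vop(), toplevel_op()) are hypothetical stand-ins chosen for illustration; the actual syscall code paths touched by the patch are not reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in value for this sketch; the kernel has its own ERELOOKUP. */
#define ERELOOKUP_SKETCH	(-5)

struct op_state {
	int attempts;		/* how many times lookup + VOP ran */
	int relookups_left;	/* simulated "lock was dropped" events */
};

/*
 * Hypothetical lookup followed by a mutating VOP.  It reports
 * ERELOOKUP_SKETCH when the directory lock had to be dropped before
 * any mutation was applied, so that nothing needs to be undone.
 */
static int
do_lookup_and_vop(struct op_state *st)
{
	st->attempts++;
	if (st->relookups_left > 0) {
		st->relookups_left--;
		return (ERELOOKUP_SKETCH);
	}
	return (0);
}

/* Top level: redo the whole operation, including the path lookup. */
static int
toplevel_op(struct op_state *st)
{
	int error;

	assert(st != NULL);
	do {
		error = do_lookup_and_vop(st);
	} while (error == ERELOOKUP_SKETCH);
	return (error);
}
```

The point of restarting at the top is that the stale lookup results are discarded wholesale instead of being patched up while locks are partially held.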
Similarly, unsafe calls to fsync in the snapshot creation code were plugged. A possible view on these failures is that it does not make sense to continue creating the snapshot if the snapshot vnode was reclaimed due to forced unmount.
The patch adds a framework that, for DIAGNOSTICS builds, tracks the exclusive vnode lock generation count. This count is memoized together with the lookup metadata in the directory inode, and we assert that accesses to the lookup metadata are done under the same lock generation as when they were stored.
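A minimal sketch of the generation check, under the assumption that the counter is bumped on each exclusive lock acquisition. The structure and function names below are hypothetical and do not correspond to the identifiers used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-memory directory inode. */
struct dir_inode_sketch {
	uint64_t lock_gen;	/* bumped on each exclusive lock acquire */
	uint64_t lookup_gen;	/* lock_gen when lookup data was stored */
	long	 lookup_offset;	/* stand-in for the lookup metadata */
};

/* Called whenever the exclusive lock is (re)acquired. */
static void
excl_lock(struct dir_inode_sketch *ip)
{
	ip->lock_gen++;
}

/* Lookup stores its result together with the current generation. */
static void
store_lookup(struct dir_inode_sketch *ip, long offset)
{
	ip->lookup_offset = offset;
	ip->lookup_gen = ip->lock_gen;
}

/* True if the lock was not dropped and retaken since the lookup. */
static bool
lookup_is_valid(const struct dir_inode_sketch *ip)
{
	return (ip->lookup_gen == ip->lock_gen);
}

/* The mutating VOP asserts validity before using the metadata. */
static long
use_lookup(const struct dir_inode_sketch *ip)
{
	assert(lookup_is_valid(ip));
	return (ip->lookup_offset);
}
```

With this scheme, any path that drops and retakes the directory lock between the lookup and the VOP trips the assertion in a diagnostic build, rather than silently acting on stale directory offsets.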
In collaboration with: pho
Reported by: syzkaller (through markj)
This diff contains the following technically independent parts:
- vn_lock_pair()
- ERELOOKUP handling at top level of VFS.
- Move of softdep_prelink() to places in the UFS VOPs flow where it is still safe to abort VOP execution.
- Code for safe instantiation of vnodes by inode number while owning other vnodes' and buffers' locks, mostly the introduction and use of get_parent_vp().
- DIAGNOSTICS framework to track and check UFS vnode exclusive lock generation and corresponding lookup auxiliary data.
- Some local fixes for VOP_FSYNC()/ffs_syncvnode() calls where they cannot be safely done, mostly because we own more locks than ffs_syncvnode() knows about.