Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 13 Jun 2019 16:25:29 -0600
From:      Alan Somers <asomers@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Rick Macklem <rmacklem@uoguelph.ca>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>,  Brooks Davis <brooks@freebsd.org>
Subject:   Re: RFC: should the copy_file_range() syscall be atomic?
Message-ID:  <CAOtMX2gAAHVzneytFKKQkHDEzKpe3vVG3w3A%2BgxRvGo-d1ZMVQ@mail.gmail.com>
In-Reply-To: <20190613220815.GB8697@kib.kiev.ua>
References:  <YQXPR01MB3128EA1CB774995DD5C774FCDDEF0@YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM> <20190613220815.GB8697@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, Jun 13, 2019 at 4:08 PM Konstantin Belousov <kostikbel@gmail.com> w=
rote:
>
> On Thu, Jun 13, 2019 at 09:44:01PM +0000, Rick Macklem wrote:
> > When I first wrote the copy_file_range() syscall, the updating of the f=
ile was atomic,
> > because I locked both vnodes while doing it.
> > kib@ quickly pointed out that this could introduce a LOR/deadlock becau=
se two
> > userland threads could concurrently do the syscall with the file argume=
nts reversed.
> >
> > It turns out that locking two VREG vnodes concurrently to do this isn't=
 easy and
> > would require the implementation of non-blocking versions of:
> >   vn_rangelock_rlock() and vn_rangelock_wlock()
> >   - I am not sure how difficult doing this is, but I'll admit I'd rathe=
r not do this.
> I will help you with this.  It should be relatively easy, even if require=
s
> some code re-shuffling.
>
> >
> > Also, having the vnodes locked precludes use of VOP_IOCTL(..FIOSEEKDATA=
/
> > FIOSEEKHOLE..) calls to find holes in the byte range of the file being =
copied from.
> If you lock ranges, you still can do ioctl.  But in fact it might be bett=
er
> to extract wrapper for FIOSEEK* into kernel function, and use it in the
> ioctl handler and in the copy syscall.
>
> >
> > Without the vnodes locked, it is possible for other threads to write to=
 either of the
> > files concurrently with the copy_file_range(), resulting in an indeterm=
inate results.
> > (cp(1) has never guaranteed atomic copying of files, so is it needed in=
 this syscall?)
> > In summary, doing the syscall non-atomically has the advantages of:
> > - vn_rdwr() takes care of the vnode locking issues, so any changes w.r.=
t. locking
> >   wouldn't require changes to this syscall's code.
> > - VOP_IOCTL() could be used to find holes.
> > - No new rangelock primitives need to be added to the system.
> > - If there were some combination of file system/storage unit where copy=
ing
> >   non-overlapping byte ranges concurrently could result in better perfo=
rmance,
> >   then that could be done. (An atomic syscall would serialize them.)
> >
> > The advantage of an atomic syscall would be consistency with write(2) a=
nd read(2)
> > behaviour.
> >
> > The following comments are copied from phabricator:
> > kib@ - So you relock range for each chunk ? This defeats the purpose of=
 the range locking. Should copy_file_range() be atomic WRT other reads and =
writes ?
> >
> > asomers@ - That has an unfortunate side effect: copy_file_range is no l=
onger atomic if you drop the vnode locks and range locks in the middle. It =
would be possible for two copy_file_range operations to proceed concurrentl=
y, leaving the file in a state where each of the operations was partially s=
uccessful. A better solution would be to add rangelock_trywlock and rangelo=
ck_tryrlock. I don't think it would be very hard (testing them, however, co=
uld be).
> >
> > I don't see anything in the Linux man page w.r.t. atomicity, so I am no=
w asking what
> > others think?
> > (I'll admit I'm biased towards non-atomic, since I have already coded i=
t and can use
> >  the VOP_IOCTL() calls to find the holes in the input file, but...)
>
> AFAIK, linux does not provide atomicity for file reads and writes at all,
> even for normal reads and writes.  I do not want to read Linux code to
> confirm this.

Really?  I find that hard to believe.  But sure enough, here's the
proof: (they claim it's fixed now)
http://man7.org/linux/man-pages/man2/write.2.html .  As for
copy_file_range, if it's not implemented by the file system, Linux
falls back to an implementation based on splice(2).  I can't find
anything that says whether splice has atomic writes, and in a quick
scan of the code I can't find any sign of locking.  However, glibc
also contains a userland implementation of copy_file_range.  It works
8KB at a time and is most definitely NOT atomic.

-Alan



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAOtMX2gAAHVzneytFKKQkHDEzKpe3vVG3w3A%2BgxRvGo-d1ZMVQ>