Date: Thu, 13 Jun 2019 16:25:29 -0600 From: Alan Somers <asomers@freebsd.org> To: Konstantin Belousov <kostikbel@gmail.com> Cc: Rick Macklem <rmacklem@uoguelph.ca>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>, Brooks Davis <brooks@freebsd.org> Subject: Re: RFC: should the copy_file_range() syscall be atomic? Message-ID: <CAOtMX2gAAHVzneytFKKQkHDEzKpe3vVG3w3A%2BgxRvGo-d1ZMVQ@mail.gmail.com> In-Reply-To: <20190613220815.GB8697@kib.kiev.ua> References: <YQXPR01MB3128EA1CB774995DD5C774FCDDEF0@YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM> <20190613220815.GB8697@kib.kiev.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, Jun 13, 2019 at 4:08 PM Konstantin Belousov <kostikbel@gmail.com> w= rote: > > On Thu, Jun 13, 2019 at 09:44:01PM +0000, Rick Macklem wrote: > > When I first wrote the copy_file_range() syscall, the updating of the f= ile was atomic, > > because I locked both vnodes while doing it. > > kib@ quickly pointed out that this could introduce a LOR/deadlock becau= se two > > userland threads could concurrently do the syscall with the file argume= nts reversed. > > > > It turns out that locking two VREG vnodes concurrently to do this isn't= easy and > > would require the implementation of non-blocking versions of: > > vn_rangelock_rlock() and vn_rangelock_wlock() > > - I am not sure how difficult doing this is, but I'll admit I'd rathe= r not do this. > I will help you with this. It should be relatively easy, even if require= s > some code re-shuffling. > > > > > Also, having the vnodes locked precludes use of VOP_IOCTL(..FIOSEEKDATA= / > > FIOSEEKHOLE..) calls to find holes in the byte range of the file being = copied from. > If you lock ranges, you still can do ioctl. But in fact it might be bett= er > to extract wrapper for FIOSEEK* into kernel function, and use it in the > ioctl handler and in the copy syscall. > > > > > Without the vnodes locked, it is possible for other threads to write to= either of the > > files concurrently with the copy_file_range(), resulting in an indeterm= inate results. > > (cp(1) has never guaranteed atomic copying of files, so is it needed in= this syscall?) > > In summary, doing the syscall non-atomically has the advantages of: > > - vn_rdwr() takes care of the vnode locking issues, so any changes w.r.= t. locking > > wouldn't require changes to this syscall's code. > > - VOP_IOCTL() could be used to find holes. > > - No new rangelock primitives need to be added to the system. > > - If there were some combination of file system/storage unit where copy= ing > > non-overlapping byte ranges concurrently could result in better perfo= rmance, > > then that could be done. (An atomic syscall would serialize them.) > > > > The advantage of an atomic syscall would be consistency with write(2) a= nd read(2) > > behaviour. > > > > The following comments are copied from phabricator: > > kib@ - So you relock range for each chunk ? This defeats the purpose of= the range locking. Should copy_file_range() be atomic WRT other reads and = writes ? > > > > asomers@ - That has an unfortunate side effect: copy_file_range is no l= onger atomic if you drop the vnode locks and range locks in the middle. It = would be possible for two copy_file_range operations to proceed concurrentl= y, leaving the file in a state where each of the operations was partially s= uccessful. A better solution would be to add rangelock_trywlock and rangelo= ck_tryrlock. I don't think it would be very hard (testing them, however, co= uld be). > > > > I don't see anything in the Linux man page w.r.t. atomicity, so I am no= w asking what > > others think? > > (I'll admit I'm biased towards non-atomic, since I have already coded i= t and can use > > the VOP_IOCTL() calls to find the holes in the input file, but...) > > AFAIK, linux does not provide atomicity for file reads and writes at all, > even for normal reads and writes. I do not want to read Linux code to > confirm this. Really? I find that hard to believe. But sure enough, here's the proof: (they claim it's fixed now) http://man7.org/linux/man-pages/man2/write.2.html . As for copy_file_range, if it's not implemented by the file system, Linux falls back to an implementation based on splice(2). I can't find anything that says whether splice has atomic writes, and in a quick scan of the code I can't find any sign of locking. However, glibc also contains a userland implementation of copy_file_range. It works 8KB at a time and is most definitely NOT atomic. -Alan
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAOtMX2gAAHVzneytFKKQkHDEzKpe3vVG3w3A%2BgxRvGo-d1ZMVQ>