From owner-freebsd-current@FreeBSD.ORG Mon May 4 07:09:43 2015 Return-Path: Delivered-To: current@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id CBC5C266 for ; Mon, 4 May 2015 07:09:43 +0000 (UTC) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "vps1.elischer.org", Issuer "CA Cert Signing Authority" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 3E5FD15DA for ; Mon, 4 May 2015 07:09:43 +0000 (UTC) Received: from Julian-MBP3.local (ppp121-45-241-118.lns20.per4.internode.on.net [121.45.241.118]) (authenticated bits=0) by vps1.elischer.org (8.14.9/8.14.9) with ESMTP id t446ki13023897 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Sun, 3 May 2015 23:46:48 -0700 (PDT) (envelope-from julian@freebsd.org) Message-ID: <554715CF.2010703@freebsd.org> Date: Mon, 04 May 2015 14:46:39 +0800 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Jilles Tjoelker , Konstantin Belousov CC: "current@freebsd.org" Subject: Re: seekdir/readdir patch .. Call for Review. References: <55432593.6050709@freebsd.org> <20150501161742.GW2390@kib.kiev.ua> <20150503143345.GB70576@stack.nl> In-Reply-To: <20150503143345.GB70576@stack.nl> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 May 2015 07:09:43 -0000 On 5/3/15 10:33 PM, Jilles Tjoelker wrote: > On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote: >> On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: >>> if you are interested in readdir(3), seekdir(3) and telldir(3) then >>> you should look at >>> https://reviews.freebsd.org/D2410 >>> this patches around a problem in seekdir() that breaks Samba. >>> Seekdir(3) will not work as expected when files prior to the point of >>> interest in directory have been deleted since the directory was opened. >>> Windows clients using Samba cause both these things to happen, causing >>> the next readdir(3) after the bad seekdir(3) to skip some entries and >>> return the wrong file. >>> Samba only needs to step back a single directory entry in the case >>> where it reads an entry and then discovers it can't fit it into the >>> buffer it is sending to the windows client. It turns out we can >>> reliably cater to Samba's requirement because the "last returned >>> element" is always still in memory, so with a little care, we can >>> set our filepointer back to it safely. (once) >>> seekdir and readdir (and telldir()) need a complete rewrite along with >>> getdirentries() but that is more than a small edit like this. >> Can you explain your expectations from the whole readdir() vs. parallel >> directory modifications interaction ? From what I understood so far, >> there is unlocked modification of the container and parallel iterator >> over the same container. IMO, in such situation, whatever tweaks you >> apply to the iterator, it is still cannot be made reliable. >> Before making single-purpose changes to the libc readdir and seekdir >> code, or to the kernel code, it would be useful to state exact behaviour >> of the dirent machinery we want to see. No, 'make samba works in my >> situation' does not sound good enough. > Consider the subsequence of entries that existed at opendir() time and > were not removed until now. This subsequence is clearly defined and does > not have concurrency problems. The order of this subsequence must remain > unchanged and seekdir() must be correct with respect to this > subsequence. > > Additionally, two other kinds of entries may be returned. New entries > may be inserted anywhere in between the entries of the subsequence, and > removed entries may be returned as if they were still part of the > subsequence (so that not every readdir() needs a system call). > > A simple implementation for UFS-style directories is to store the offset > in the directory (all bits of it, not masking off the lower 9 bits). > This needs d_off or similar in struct dirent. The kernel getdirentries() > then needs a similar loop as the old libc seekdir() to go from the start > of the 512-byte directory block to the desired entry (since an entry may > not exist at the stored offset within the directory block). At least it needs some more information in struct dirent than there is now.. A cookie is the current fashion.. that assumes however that the filesystems are capable of converting backwards from cookie to 'location'. ZFS claims to be able to do so.. Thee other thing to do would be to store some sort of strong hash of the name and inode# in each telldir entry.. we would seek to the saved seek location and seek forward computing or looking for a matching hash. I woudl also add that the man pages talk about the readdir blocksize a bit and mention the file blocksize (and stat) which is often way dfferent from 512 bytes.. usually 16k or so these days. I found setting the read size to the same as the fs blocksize seemd to work well. > > This means that a UFS-style directory cannot be compacted (existing > entries moved from higher to lower offsets to fill holes) while it is > open for reading. An NFS exported directory is always open for reading. yes so a UFS filesystem that is exported could never do garbage collection. > > This also means that duplicate entries can only be returned if that > particular filename was deleted and created again. > > Without kernel support, it is hard to get telldir/seekdir completely > reliable. The current libc implementation is wrong since the "holes" > within the block just disappear and change the offsets of the following > entries; the kernel cannot fix this using entries with d_fileno = 0 > since it cannot know, in the general case, how long the deleted entry > was in the filesystem-independent dirent format. yes it's the filesystem that knows.. we USED to return empty entries in the dirent list but that was removed recently think. > My previous idea of > storing one d_fileno during telldir() is wrong since it will fail if > that entry is deleted. > > If you do not care about memory usage (which probably is already > excessive with the current libc implementation), you could store at > telldir() time the offset of the current block returned by > getdirentries() and the d_fileno of all entries already returned in the > current block. > > The D2410 patch can conceptually work for what Samba needs, stepping > back one directory entry. I will comment on it. >