From owner-freebsd-current@FreeBSD.ORG Tue May 5 23:33:30 2015
Date: Tue, 5 May 2015 19:33:27 -0400 (EDT)
From: Rick Macklem
To: Julian Elischer
Cc: "current@freebsd.org", Jilles Tjoelker, Konstantin Belousov
Message-ID: <1054739779.32254571.1430868807663.JavaMail.root@uoguelph.ca>
In-Reply-To: <554715CF.2010703@freebsd.org>
Subject: Re: seekdir/readdir patch .. Call for Review.

Julian Elischer wrote:
> On 5/3/15 10:33 PM, Jilles Tjoelker wrote:
> > On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote:
> >> On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote:
> >>> if you are interested in readdir(3), seekdir(3) and telldir(3) then
> >>> you should look at
> >>> https://reviews.freebsd.org/D2410
> >>> this patches around a problem in seekdir() that breaks Samba.
> >>> Seekdir(3) will not work as expected when files prior to the point of
> >>> interest in directory have been deleted since the directory was opened.
> >>> Windows clients using Samba cause both these things to happen, causing
> >>> the next readdir(3) after the bad seekdir(3) to skip some entries and
> >>> return the wrong file.
> >>> Samba only needs to step back a single directory entry in the case
> >>> where it reads an entry and then discovers it can't fit it into the
> >>> buffer it is sending to the windows client. It turns out we can
> >>> reliably cater to Samba's requirement because the "last returned
> >>> element" is always still in memory, so with a little care, we can
> >>> set our filepointer back to it safely. (once)
> >>> seekdir and readdir (and telldir()) need a complete rewrite along with
> >>> getdirentries() but that is more than a small edit like this.
> >> Can you explain your expectations from the whole readdir() vs. parallel
> >> directory modifications interaction?
> >> From what I understood so far, there is unlocked modification of the
> >> container and parallel iterator over the same container. IMO, in such
> >> a situation, whatever tweaks you apply to the iterator, it still
> >> cannot be made reliable.
> >> Before making single-purpose changes to the libc readdir and seekdir
> >> code, or to the kernel code, it would be useful to state exact
> >> behaviour of the dirent machinery we want to see. No, 'make samba
> >> work in my situation' does not sound good enough.
> > Consider the subsequence of entries that existed at opendir() time and
> > were not removed until now. This subsequence is clearly defined and does
> > not have concurrency problems. The order of this subsequence must remain
> > unchanged and seekdir() must be correct with respect to this
> > subsequence.
> >
> > Additionally, two other kinds of entries may be returned. New entries
> > may be inserted anywhere in between the entries of the subsequence, and
> > removed entries may be returned as if they were still part of the
> > subsequence (so that not every readdir() needs a system call).
> >
> > A simple implementation for UFS-style directories is to store the offset
> > in the directory (all bits of it, not masking off the lower 9 bits).
> > This needs d_off or similar in struct dirent. The kernel getdirentries()
> > then needs a similar loop as the old libc seekdir() to go from the start
> > of the 512-byte directory block to the desired entry (since an entry may
> > not exist at the stored offset within the directory block).
> At least it needs some more information in struct dirent than there is
> now..
> A cookie is the current fashion.. that assumes however that the
> filesystems are capable of converting backwards from cookie to
> 'location'. ZFS claims to be able to do so..
My current plan for a patch is...
- d_cookie would be the "physical" file system position of the next
  directory entry
- a ngetdirentries() would take a "physical" cookie as a value/result
  argument. It would indicate where to start and would return the cookie
  for the next entry after what is returned. (It could probably be
  stuffed in uio_offset, but I think it might be clearer to make it a
  separate arg.)
  --> It would pass this physical cookie down to the file system's
      VOP_READDIR(). (For UFS it would be the byte offset in the on-disk
      directory. For ZFS, I believe it is an index for the entry. For
      NFS, it is the cookie that is sent to the server. For others, I
      don't yet know.)
- dd_seek, dd_loc, loc_seek and loc_loc would be replaced by dd_cookie
  and loc_cookie. (For arches where sizeof(long) == 8, I think telldir()
  could just return the cookie and forget about the loc_XX structures?)
  This would get rid of the loc_seek, loc_loc bogosity that no longer
  makes much sense, since the byte offset in what is returned by
  getdirentries() has little to do with the "physical" directory
  position.

I have already done the kernel stuff for some file systems and the libc
changes actually simplify things compared to what is there now.

rick

> The other thing to do would be to store some sort of strong
> hash of the name and inode# in each telldir entry..
> we would seek to the saved seek location and seek forward computing or
> looking for a matching hash. I would also add that the man pages talk
> about the readdir blocksize a bit and mention the file blocksize (and
> stat) which is often way different from 512 bytes.. usually 16k or so
> these days.
> I found setting the read size to the same as the fs blocksize seemed to
> work well.
> >
> > This means that a UFS-style directory cannot be compacted (existing
> > entries moved from higher to lower offsets to fill holes) while it is
> > open for reading. An NFS exported directory is always open for reading.
> yes so a UFS filesystem that is exported could never do garbage
> collection.
> >
> > This also means that duplicate entries can only be returned if that
> > particular filename was deleted and created again.
> >
> > Without kernel support, it is hard to get telldir/seekdir completely
> > reliable. The current libc implementation is wrong since the "holes"
> > within the block just disappear and change the offsets of the following
> > entries; the kernel cannot fix this using entries with d_fileno = 0
> > since it cannot know, in the general case, how long the deleted entry
> > was in the filesystem-independent dirent format.
>
> yes it's the filesystem that knows.. we USED to return empty entries
> in the dirent list but that was removed recently, I think.
> > My previous idea of storing one d_fileno during telldir() is wrong
> > since it will fail if that entry is deleted.
> >
> > If you do not care about memory usage (which probably is already
> > excessive with the current libc implementation), you could store at
> > telldir() time the offset of the current block returned by
> > getdirentries() and the d_fileno of all entries already returned in the
> > current block.
> >
> > The D2410 patch can conceptually work for what Samba needs, stepping
> > back one directory entry. I will comment on it.
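
To make the cookie idea discussed above a bit more concrete, here is a
minimal user-space sketch. It is not the patch and not FreeBSD libc code:
sim_dir, sim_DIR and the sim_*() functions are hypothetical names invented
for this illustration, and a small fixed array stands in for the on-disk
directory. The only assumption it encodes is the one stated in the thread:
an entry's "physical" position does not change while the directory is
open, so a removed entry leaves a hole instead of shifting later entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_NSLOTS 8	/* hypothetical fixed-size "directory" */

/* Hypothetical stand-in for one on-disk directory slot. */
struct sim_slot {
	int	used;		/* 0 once the entry has been removed */
	char	name[16];
};

struct sim_dir {
	struct sim_slot	slot[SIM_NSLOTS];
};

/* Hypothetical per-open state: a single cookie, no seek/loc pair. */
struct sim_DIR {
	const struct sim_dir	*dir;
	long			cookie;	/* "physical" position of next entry */
};

/* Create an entry at a fixed physical position. */
void
sim_add(struct sim_dir *d, int pos, const char *name)
{
	assert(pos >= 0 && pos < SIM_NSLOTS);
	d->slot[pos].used = 1;
	strncpy(d->slot[pos].name, name, sizeof(d->slot[pos].name) - 1);
	d->slot[pos].name[sizeof(d->slot[pos].name) - 1] = '\0';
}

/* Remove an entry; this leaves a hole and moves nothing else. */
void
sim_remove(struct sim_dir *d, int pos)
{
	assert(pos >= 0 && pos < SIM_NSLOTS);
	d->slot[pos].used = 0;
}

/* Return the next live entry's name, or NULL at end of directory. */
const char *
sim_readdir(struct sim_DIR *dp)
{
	while (dp->cookie < SIM_NSLOTS) {
		const struct sim_slot *s = &dp->dir->slot[dp->cookie++];

		if (s->used)
			return (s->name);
	}
	return (NULL);
}

/* telldir/seekdir reduce to reading and writing the cookie. */
long
sim_telldir(const struct sim_DIR *dp)
{
	return (dp->cookie);
}

void
sim_seekdir(struct sim_DIR *dp, long cookie)
{
	dp->cookie = cookie;
}
```

As a usage example: fill slots 0..3 with "a".."d", read two entries, save
sim_telldir(), read "c", then sim_remove() slot 0 and sim_seekdir() back
to the saved value. The next sim_readdir() returns "c" again, because the
removal of an earlier entry did not change any later position. A real
file system would hand back its own cookie (byte offset, index, or NFS
cookie) rather than an array index.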