From owner-freebsd-hackers Tue Oct 31 1:54:49 2000 Delivered-To: freebsd-hackers@freebsd.org Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133]) by hub.freebsd.org (Postfix) with ESMTP id 73DED37B4C5 for ; Tue, 31 Oct 2000 01:54:45 -0800 (PST) Received: (from daemon@localhost) by smtp03.primenet.com (8.9.3/8.9.3) id CAA14655; Tue, 31 Oct 2000 02:52:56 -0700 (MST) Received: from usr02.primenet.com(206.165.6.202) via SMTP by smtp03.primenet.com, id smtpdAAANQaGFC; Tue Oct 31 02:52:45 2000 Received: (from tlambert@localhost) by usr02.primenet.com (8.8.5/8.8.5) id CAA29192; Tue, 31 Oct 2000 02:54:23 -0700 (MST) From: Terry Lambert Message-Id: <200010310954.CAA29192@usr02.primenet.com> Subject: Re: dir-listing bug in linux-emulation To: peter.edwards@openet-telecom.com (Peter Edwards) Date: Tue, 31 Oct 2000 09:54:23 +0000 (GMT) Cc: tpnelson@echidna.stu.cowan.edu.au (Trent Nelson), d.d.v.deijk@student.tue.nl (David van Deijk), freebsd-hackers@FreeBSD.ORG In-Reply-To: <39FAF500.526FFD69@openet-telecom.com> from "Peter Edwards" at Oct 28, 2000 04:47:12 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-hackers@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > Hi, > Here's a patch that fixes the problem for me. can someone review and > possibly commit it? > > Here's my understanding: > > The data returned by VOP_READDIR is not neccessarily the same as that > consumed from the directory vnode. > > linux_getdents fills in a "doff" field in the linux_dirent structure > using the offset from the start of the read data rather than the offset > in the directory vnode that produced that directory entry. > > Then, rather than continuing where it left off, the linux "ls" command > uses this as an offset to seek to before the next time it calls > linux_getdents, (actually seeking back over data it's already read, for > some reason). This is standard incremental backoff behaviour. When you get directory data out of a directory using getdents(2), you pass it a buffer to copy data out into. The way this works is that the data is externalized in a format thats documented in dirent.h; this happens to match that of the struct direct in /sys/ufs/ufs/dir.h; for Linux, this is not true. Because the directory entries are translated on copyout for the Linux programs so that they see their native format, that means you can end up with more or less entries copied out, depending on the relative size of the on disk directory entry, and the Linux representation. Generally, I think the Linux stuff is actually smaller or the same size. If so, then there's only a problem when the return buffer is smaller than 512b. Now here's the problem. It seems that a partial directory block is being scanned and copied out. What this means is that it's going to want to restart the scan. Since the directory entry blocks copied to user space are just "snapshots", and another process could go in and change the block so that it was different from the copy, the code doesn't trust that that's not happened, when it's asked to seek to a non-directory entry block aligned boundary. So what it's supposed to do in the incremental backoff strategy, is to back off to the start of the partial block, and read forward, to verify that the boundary is actually the start of a valid directory entry block. It had to do this, since if the file that was there was deleted, the deletion would have been accomplished by adding the entry to the actual allocation size of the record immediately prior to it. So all it's trying to do is ensure that the record will be valid, rather than being a phantom of a now deleted file. This is actually a little more complicated, since the block may also have been coalesced since the last time the process read it. If that's happened (due to an allocation of an entry after the coelesce space), then the correct action is to read the entry one back, and return it, since it was coelesced back from the end. Worst case, this will result in a duplicate display entry; it's the job of the user space program to do the duplicate elimination. Really, it would probably be most correct to "short change" the caller on the entries returned, and use only precisely increments of even directory entry blocks. This will eliminate the restart problem, unless the Linux program calling the getdents() gives you too small a buffer -- this is unlikely, since the minimum it is required to give you is enough to return one record with the largest possible name, plus other returned directory data, like the inode number, flags, and so on, which live in the entry. In any case, it's legal for your system call to complain that the buffer is too small, and refuse to return data unless given a bigger buffer. In practice, I think you'll find that this will never happen, since the Linux people aren't any more fond of restarting in the middle of a directory block than anyone who doesn't want to deal with duplicate suppression and possibly stale data. The cookie interface was really put in there for NFS, and as you can see above, it wasn't really necessary to put it in. To get around the potential consumer/provider buffer size mismatch, NFS could have used the incremental backoff strategy, and been exactly as reliable in its one-at-a-time directory entry reading as it is with the cookie approach. A more ideal interface would seperate the lookup from the data externalization process, which would let the FS _always_ return some multiple of directory entry block size for whatever the underlying FS was, and then externalize it into a user buffer with another VOP which took the FS specific data, and copied it out into the user visible format in the user's buffer. This would have the advantage of being able to cache FS specific format "snapshot" data for things like NFS, with a short lived cache, such that reasonable clients would never have the stale data problem (clients that started a read, and waited a month before requesting the next entry would be on their own). So anyway, my suggestion is to "short change" the returned data, to avoid having to restart on a non-directory entry block alignment boundary. Using the cookie interface is only going to render you vulnerable to stale data races, which, though rare, will happen on occasion. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-hackers" in the body of the message