From owner-freebsd-hackers  Tue Oct 31  1:54:49 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (Postfix) with ESMTP id 73DED37B4C5
	for <freebsd-hackers@FreeBSD.ORG>; Tue, 31 Oct 2000 01:54:45 -0800 (PST)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.9.3/8.9.3) id CAA14655;
	Tue, 31 Oct 2000 02:52:56 -0700 (MST)
Received: from usr02.primenet.com(206.165.6.202)
 via SMTP by smtp03.primenet.com, id smtpdAAANQaGFC; Tue Oct 31 02:52:45 2000
Received: (from tlambert@localhost)
	by usr02.primenet.com (8.8.5/8.8.5) id CAA29192;
	Tue, 31 Oct 2000 02:54:23 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <200010310954.CAA29192@usr02.primenet.com>
Subject: Re: dir-listing bug in linux-emulation
To: peter.edwards@openet-telecom.com (Peter Edwards)
Date: Tue, 31 Oct 2000 09:54:23 +0000 (GMT)
Cc: tpnelson@echidna.stu.cowan.edu.au (Trent Nelson),
	d.d.v.deijk@student.tue.nl (David van Deijk),
	freebsd-hackers@FreeBSD.ORG
In-Reply-To: <39FAF500.526FFD69@openet-telecom.com> from "Peter Edwards" at Oct 28, 2000 04:47:12 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> Hi,
> Here's a patch that fixes the problem for me. can someone review and
> possibly commit it?
> 
> Here's my understanding:
> 
> The data returned by VOP_READDIR is not neccessarily the same as that
> consumed from the directory vnode.
> 
> linux_getdents fills in a "doff" field in the linux_dirent structure
> using the offset from the start of the read data rather than the offset
> in the directory vnode that produced that directory entry.
> 
> Then, rather than continuing where it left off, the linux "ls" command
> uses this as an offset to seek to before the next time it calls
> linux_getdents, (actually seeking back over data it's already read, for
> some reason).

This is standard incremental backoff behaviour.

When you get directory data out of a directory using getdents(2),
you pass it a buffer to copy data out into.

The way this works is that the data is externalized in a format
thats documented in dirent.h; this happens to match that of the
struct direct in /sys/ufs/ufs/dir.h; for Linux, this is not
true.

Because the directory entries are translated on copyout for the
Linux programs so that they see their native format, that means
you can end up with more or less entries copied out, depending
on the relative size of the on disk directory entry, and the
Linux representation.

Generally, I think the Linux stuff is actually smaller or the
same size.  If so, then there's only a problem when the return
buffer is smaller than 512b.

Now here's the problem.  It seems that a partial directory block
is being scanned and copied out.  What this means is that it's
going to want to restart the scan.  Since the directory entry
blocks copied to user space are just "snapshots", and another
process could go in and change the block so that it was different
from the copy, the code doesn't trust that that's not happened,
when it's asked to seek to a non-directory entry block aligned
boundary.

So what it's supposed to do in the incremental backoff strategy,
is to back off to the start of the partial block, and read forward,
to verify that the boundary is actually the start of a valid
directory entry block.  It had to do this, since if the file that
was there was deleted, the deletion would have been accomplished
by adding the entry to the actual allocation size of the record
immediately prior to it.  So all it's trying to do is ensure that
the record will be valid, rather than being a phantom of a now
deleted file.  This is actually a little more complicated, since
the block may also have been coalesced since the last time the
process read it.  If that's happened (due to an allocation of an
entry after the coelesce space), then the correct action is to
read the entry one back, and return it, since it was coelesced
back from the end.  Worst case, this will result in a duplicate
display entry; it's the job of the user space program to do the
duplicate elimination.


Really, it would probably be most correct to "short change" the
caller on the entries returned, and use only precisely increments
of even directory entry blocks.  This will eliminate the restart
problem, unless the Linux program calling the getdents() gives
you too small a buffer -- this is unlikely, since the minimum it
is required to give you is enough to return one record with the
largest possible name, plus other returned directory data, like
the inode number, flags, and so on, which live in the entry.

In any case, it's legal for your system call to complain that the
buffer is too small, and refuse to return data unless given a bigger
buffer.  In practice, I think you'll find that this will never
happen, since the Linux people aren't any more fond of restarting
in the middle of a directory block than anyone who doesn't want to
deal with duplicate suppression and possibly stale data.

The cookie interface was really put in there for NFS, and as you
can see above, it wasn't really necessary to put it in.  To get
around the potential consumer/provider buffer size mismatch, NFS
could have used the incremental backoff strategy, and been exactly
as reliable in its one-at-a-time directory entry reading as it is
with the cookie approach.

A more ideal interface would seperate the lookup from the data
externalization process, which would let the FS _always_ return
some multiple of directory entry block size for whatever the
underlying FS was, and then externalize it into a user buffer
with another VOP which took the FS specific data, and copied it
out into the user visible format in the user's buffer.  This
would have the advantage of being able to cache FS specific
format "snapshot" data for things like NFS, with a short lived
cache, such that reasonable clients would never have the stale
data problem (clients that started a read, and waited a month
before requesting the next entry would be on their own).

So anyway, my suggestion is to "short change" the returned data,
to avoid having to restart on a non-directory entry block alignment
boundary.  Using the cookie interface is only going to render you
vulnerable to stale data races, which, though rare, will happen on
occasion.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message