From owner-freebsd-current Fri Jul 21 11:44:19 1995
Return-Path: current-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.11/8.6.6) id LAA17086 for current-outgoing; Fri, 21 Jul 1995 11:44:19 -0700
Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.cdrom.com (8.6.11/8.6.6) with ESMTP id LAA17075 for ; Fri, 21 Jul 1995 11:44:10 -0700
Received: (from dfr@localhost) by minnow.render.com (8.6.9/8.6.9) id TAA12680; Fri, 21 Jul 1995 19:46:19 +0100
Date: Fri, 21 Jul 1995 19:46:16 +0100 (BST)
From: Doug Rabson
To: Terry Lambert
cc: peter@haywire.dialix.com, freebsd-current@freebsd.org
Subject: Re: what's going on here? (NFSv3 problem?)
In-Reply-To: <9507211739.AA06208@cs.weber.edu>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: current-owner@freebsd.org
Precedence: bulk

On Fri, 21 Jul 1995, Terry Lambert wrote:

> > No, the bug is that nfs_readdir is making larger protocol requests than
> > it used to and the server is puking.  NFSv3 allows the server to hint at
> > a size to use.  I don't have the rfc for NFSv2 handy so I can't check.
> > It is possible that we are violating the protocol here.  It is also
> > possible that the SVR4 server is returning crap data.
>
> The NFS READDIR is a special case of the getdents call interface.
>
> The getdents call interface is not guaranteed to work on objects
> smaller than a full (512b) directory block, since this is the
> directory compaction boundary in UFS.  This is actually the one
> remaining file system dependency in the directory read code.
>
> The typical behaviour is to use the file system block size, or
> the system page size, whichever is larger, since the directory
> block is guaranteed by the file system interface to be some
> power-of-two value smaller than or equal to the page size.
>
> The problem is *bound* to occur when the VOP uses entry-at-a-time
> retrieval, or odd-entry retrieval over the NFS link with the
> current code.

All this is fine.  I know perfectly well that NFSv2 tends to encourage
non-aligned accesses to directories, and also why that can be a bad
thing for UFS.  AFAIR, the last time this topic came up was when I was
modifying the 2.0 NFS server code to work properly for cd9660
filesystems.  I added code at that time which, at least, stopped it
from jumping into the middle of a directory entry.  If a block is
compacted between requests, the client may receive duplicate entries
or may miss some entries; it will not receive corrupt filenames.

I believe that the original problem that brought the subject up was
completely different to this.  The client had mounted a filesystem
with 1024 byte read/write sizes and then tried to read a directory in
8k blocks.  The server got scared and returned strange values to the
client.  This is backed up by the fact that the problem went away when
the read/write sizes were increased to 8k.  I looked in RFC 1094 and
didn't see any reference to the read size applying to the readdir
request.  Of course, that doesn't mean that some servers don't
interpret the RFC that way.

> The complication in the NFSv3 protocol is the extension that we
> (well, it was me, when I used to work at Novell) added to the
> directory entry retrieval interface to return blocks of entries
> and stat information simultaneously, and which was added to NFSv3
> by Sun right after we demonstrated the code doubling the speed
> of ls in a NUC/NWU (NetWare UNIX Client/NetWare for UNIX)
> environment (several years ago).  The fact is that neither Novell
> nor I originated this idea: it has been present in AppleTalk and
> several Intel protocols from day one... a rather old idea, actually.
>
> The code hacks for the per-entry-at-a-time retrieval for the NFSv2
> code *do not work* for buffer sizes larger than the page size, a
> fact I pointed out when the changes were rolled in (knowing full
> well that I wanted to do NetWare work on FreeBSD and knowing that
> NFSv3 was on its way).

This was an NFSv2 mount.

> This isn't even considering the potential race conditions which
> are caused by the stat operation in the FS itself being separate
> from the readdir operation, or by directory compaction occurring
> between non-block requests.
>
> The first race condition can only be resolved by changing the
> interface; this is probably something that wants to be done
> anyway, since file operations should probably have stat information
> associated at all times.  The potential error here is that another
> caller could delete the file before the stat information was
> obtained and (in the case of only one entry in the return buffer)
> the directory must be locally retraversed on the server from the
> last offset.  Even then, you are relatively screwed if what is
> happening is a copy/unlink/rename operation.
>
> The second race condition, above, can be handled internally only
> with an interface other than readdir, or with a substantial change
> to the operation of readdir, at the very least.  The way you do
> a resynchronization over a (potential) directory compaction is
> you find the block that the next entry offset is in, then read
> entries forward until the offset equals or exceeds the offset
> requested, saving one-behind.  If the offset is equal, you return
> the entry, otherwise you return the entry from the previous offset
> (assuming that the entry was compacted back).  This can result in
> duplicate entries, which the client must filter out, since it has
> state information, and it is unacceptable in the search for an
> exact match to omit the file being searched for.

NFSv3 defines a mechanism to validate the cookies used to read
directory entries.  Each readdir request returns a set of directory
entries, each with a cookie which can be used to start another readdir
just after that entry.  To read from the beginning of the directory,
one passes a NULL cookie.  NFSv3 also returns a 'cookie verifier'
which must be passed with the next readdir, along with the cookie
representing the place to read from.  If the directory block was
compacted, the server should use the verifier to detect this and can
return an error to the client, forcing it to retry the read from the
beginning of the directory.
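For illustration, a minimal sketch of one way a server can implement
that check (this is not the actual FreeBSD nfsd code; the types and
names below are invented, and deriving the verifier from the
directory's modification time is just one common choice).  Only
NFS3ERR_BAD_COOKIE comes from the NFSv3 spec (RFC 1813):

    /*
     * Sketch only: derive the NFSv3 cookie verifier from the
     * directory's modification time.  A non-zero cookie presented
     * with a stale verifier means the directory may have been
     * compacted, so the client is told to restart the readdir
     * from the beginning.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define NFS3ERR_BAD_COOKIE 10003        /* from the NFSv3 spec */

    typedef uint64_t cookie_t;
    typedef uint64_t cookieverf_t;

    struct dir_state {                      /* hypothetical per-directory state */
        uint64_t mtime_usec;                /* last modification time */
    };

    /* Verifier returned with every readdir reply for this directory. */
    static cookieverf_t
    dir_cookieverf(const struct dir_state *dp)
    {
        return (cookieverf_t)dp->mtime_usec;
    }

    /*
     * Called at the start of a readdir request.  A zero cookie means
     * "start of directory" and needs no verifier check.  Returns 0 if
     * the (cookie, verifier) pair is still usable, NFS3ERR_BAD_COOKIE
     * if the client must retry from the beginning.
     */
    static int
    check_readdir_cookie(const struct dir_state *dp, cookie_t cookie,
        cookieverf_t verf)
    {
        if (cookie == 0)
            return 0;
        if (verf != dir_cookieverf(dp))
            return NFS3ERR_BAD_COOKIE;
        return 0;
    }

    int
    main(void)
    {
        struct dir_state d;
        cookieverf_t verf;

        d.mtime_usec = 1000;
        verf = dir_cookieverf(&d);

        d.mtime_usec = 2000;                /* directory modified/compacted */
        printf("%d\n", check_readdir_cookie(&d, 42, verf));  /* prints 10003 */
        return 0;
    }

A real server would keep the verifier with its other per-directory
state and might fold in more than a timestamp, but the principle is
the same: once the verifier no longer matches, cookies handed out
earlier can no longer be trusted.
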
> The buffer crap that got done to avoid a file system top-end user
> presentation layer is totally bogus, and remains the cause of the
> problem.  If no one is interested in fixing it, I suggest reducing
> the transfer size to the page size or smaller.

I can't parse this one.

> And, of course, at the same time eat the increased and otherwise
> unnecessary overhead in the read/write path transfers that will
> result from doing this "fix".

I don't think that any fix is needed.  The NFSv2 behaviour is adequate
and NFSv3 has the mechanism to detect this problem.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail:  dfr@render.com
Phone: +44 171 251 4411
FAX:   +44 171 251 0939
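
A rough sketch of the one-behind resynchronization Terry describes
above: scan the directory block that contains the requested offset and
return either the entry at exactly that offset or, failing that, the
one just before it.  The entry layout and names here are invented for
illustration; this is not the actual UFS or NFS server code:

    /*
     * Sketch only: after a possible directory compaction, scan the
     * 512-byte directory block containing the offset the client asked
     * for and return either the entry at exactly that offset or,
     * failing that, the previous entry ("one behind").
     */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/types.h>

    #define DIRBLKSIZ 512                   /* UFS directory compaction unit */

    struct dentry {                         /* simplified directory entry */
        uint16_t reclen;                    /* length of this record */
        char     name[14];
    };

    /*
     * buf holds one DIRBLKSIZ block whose first entry starts at byte
     * offset blkoff within the directory; want is the offset the client
     * asked to continue from.  Returns the entry to hand back and stores
     * its real offset in *found.
     */
    static const struct dentry *
    resync_entry(const char *buf, off_t blkoff, off_t want, off_t *found)
    {
        off_t off = blkoff, prev = blkoff;
        const struct dentry *dp = (const struct dentry *)buf;
        const struct dentry *pdp = dp;

        while (off < want && off < blkoff + DIRBLKSIZ) {
            if (dp->reclen == 0)            /* corrupt block: give up */
                break;
            prev = off;                     /* save one-behind */
            pdp = dp;
            off += dp->reclen;
            dp = (const struct dentry *)(buf + (off - blkoff));
        }
        if (off == want) {                  /* exact match: entry still there */
            *found = off;
            return dp;
        }
        *found = prev;                      /* entry compacted back: previous one */
        return pdp;
    }

    int
    main(void)
    {
        char blk[DIRBLKSIZ];
        struct dentry e;
        const struct dentry *dp;
        off_t found;

        memset(blk, 0, sizeof(blk));
        e.reclen = 32;
        strcpy(e.name, "a");
        memcpy(blk, &e, sizeof(e));         /* entry at offset 0 */
        strcpy(e.name, "b");
        memcpy(blk + 32, &e, sizeof(e));    /* entry at offset 32 */

        /* The client asks for offset 40, which no longer starts an entry
           after compaction; we fall back to the entry at offset 32. */
        dp = resync_entry(blk, 0, 40, &found);
        printf("resynced to offset %ld, name %s\n", (long)found, dp->name);
        return 0;
    }

The client still has to filter out any duplicate entries this produces,
as Terry notes; the NFSv3 cookie verifier lets a server turn the same
situation into an explicit "retry from the start" instead.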