From owner-freebsd-current Fri Jul 21 11:44:19 1995
Return-Path: current-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.11/8.6.6) id LAA17086 for current-outgoing; Fri, 21 Jul 1995 11:44:19 -0700
Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.cdrom.com (8.6.11/8.6.6) with ESMTP id LAA17075 for ; Fri, 21 Jul 1995 11:44:10 -0700
Received: (from dfr@localhost) by minnow.render.com (8.6.9/8.6.9) id TAA12680; Fri, 21 Jul 1995 19:46:19 +0100
Date: Fri, 21 Jul 1995 19:46:16 +0100 (BST)
From: Doug Rabson
To: Terry Lambert
cc: peter@haywire.dialix.com, freebsd-current@freebsd.org
Subject: Re: what's going on here? (NFSv3 problem?)
In-Reply-To: <9507211739.AA06208@cs.weber.edu>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: current-owner@freebsd.org
Precedence: bulk

On Fri, 21 Jul 1995, Terry Lambert wrote:

> > No, the bug is that nfs_readdir is making larger protocol requests than
> > it used to and the server is puking.  NFSv3 allows the server to hint at
> > a size to use.  I don't have the rfc for NFSv2 handy so I can't check.
> > It is possible that we are violating the protocol here.  It is also
> > possible that the SVR4 server is returning crap data.
>
> The NFS READDIR is a special case of the getdents call interface.
>
> The getdents call interface is not guaranteed to work on objects
> smaller than a full (512b) directory block, since this is the
> directory compaction boundary in UFS.  This is actually the one
> remaining file system dependency in the directory read code.
>
> The typical behaviour is to use the file system block size, or
> the system page size, whichever is larger, since the directory
> block is guaranteed by the file system interface to be some
> power-of-two value smaller than or equal to the page size.
>
> The problem is *bound* to occur when the VOP uses entry-at-a-time
> retrieval, or odd-entry retrieval over the NFS link with the
> current code.

All this is fine.  I know perfectly well that NFSv2 tends to encourage
non-aligned accesses to directories, and also why that can be a bad
thing for UFS.  AFAIR, the last time this topic came up was when I was
modifying the 2.0 NFS server code to work properly for cd9660
filesystems.  I added code at that time which, at least, stopped it
from jumping into the middle of a directory entry.  If a block is
compacted between requests, the client may receive duplicate entries
or may miss some entries; it will not receive corrupt filenames.

I believe that the original problem that brought the subject up was
completely different to this.  The client had mounted a filesystem
with 1024 byte read/write sizes and then tried to read a directory in
8k blocks.  The server got scared and returned strange values to the
client.  This is backed up by the fact that the problem went away when
the read/write sizes were increased to 8k.  I looked in RFC 1094 and
didn't see any reference to the read size applying to the readdir
request.  Of course, that doesn't mean that some servers don't
interpret the RFC that way.

> The complication in the NFSv3 protocol is the extension that we
> (well, it was me, when I used to work at Novell) added to the
> directory entry retrieval interface to return blocks of entries
> and stat information simultaneously, and which was added to NFSv3
> by Sun right after we demonstrated the code doubling the speed
> of ls in a NUC/NWU (NetWare UNIX Client/NetWare for UNIX)
> environment (several years ago).  The fact is that neither Novell
> nor I originated this idea: it has been present in AppleTalk and
> several Intel protocols from day one... a rather old idea, actually.
>
> The code hacks for the per-entry-at-a-time retrieval for the NFSv2
> code *do not work* for buffer sizes larger than the page size, a
> fact I pointed out when the changes were rolled in (knowing full
> well that I wanted to do NetWare work on FreeBSD and knowing that
> NFSv3 was on its way).

This was an NFSv2 mount.

> This isn't even considering the potential race conditions which
> are caused by the stat operation in the FS itself being separate
> from the readdir operation, or by directory compaction occurring
> between non-block requests.
>
> The first race condition can only be resolved by changing the
> interface; this is probably something that wants to be done
> anyway, since file operations should probably have stat information
> associated at all times.  The potential error here is that another
> caller could delete the file before the stat information was
> obtained and (in the case of only one entry in the return buffer)
> the directory must be locally retraversed on the server from the
> last offset.  Even then, you are relatively screwed if what is
> happening is a copy/unlink/rename operation.
>
> The second race condition, above, can be handled internally only
> with an interface other than readdir, or with a substantial change
> to the operation of readdir, at the very least.  The way you do
> a resynchronization over a (potential) directory compaction is
> you find the block that the next entry offset is in, then read
> entries forward until the offset equals or exceeds the offset
> requested, saving one-behind.  If the offset is equal, you return
> the entry, otherwise you return the entry from the previous offset
> (assuming that the entry was compacted back).  This can result in
> duplicate entries, which the client must filter out, since it has
> state information, and it is unacceptable in the search for an
> exact match to omit the file being searched for.

NFSv3 defines a mechanism to validate the cookies used to read
directory entries.  Each readdir request returns a set of directory
entries, each with a cookie which can be used to start another readdir
just after that entry.  To read from the beginning of the directory,
one passes a NULL cookie.  NFSv3 also returns a 'cookie verifier'
which must be passed with the next readdir, along with the cookie
representing the place to read from.  If the directory block was
compacted, the server should use the verifier to detect this and can
return an error to the client, forcing it to retry the read from the
beginning of the directory.
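For illustration, a minimal sketch of one way a server can implement
that check (this is not the actual FreeBSD nfsd code; the types and
names below are invented, and deriving the verifier from the
directory's modification time is just one common choice).  Only
NFS3ERR_BAD_COOKIE comes from the NFSv3 spec (RFC 1813):

    /*
     * Sketch only: derive the NFSv3 cookie verifier from the
     * directory's modification time.  A non-zero cookie presented
     * with a stale verifier means the directory may have been
     * compacted, so the client is told to restart the readdir
     * from the beginning.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define NFS3ERR_BAD_COOKIE 10003        /* from the NFSv3 spec */

    typedef uint64_t cookie_t;
    typedef uint64_t cookieverf_t;

    struct dir_state {                      /* hypothetical per-directory state */
        uint64_t mtime_usec;                /* last modification time */
    };

    /* Verifier returned with every readdir reply for this directory. */
    static cookieverf_t
    dir_cookieverf(const struct dir_state *dp)
    {
        return (cookieverf_t)dp->mtime_usec;
    }

    /*
     * Called at the start of a readdir request.  A zero cookie means
     * "start of directory" and needs no verifier check.  Returns 0 if
     * the (cookie, verifier) pair is still usable, NFS3ERR_BAD_COOKIE
     * if the client must retry from the beginning.
     */
    static int
    check_readdir_cookie(const struct dir_state *dp, cookie_t cookie,
        cookieverf_t verf)
    {
        if (cookie == 0)
            return 0;
        if (verf != dir_cookieverf(dp))
            return NFS3ERR_BAD_COOKIE;
        return 0;
    }

    int
    main(void)
    {
        struct dir_state d;
        cookieverf_t verf;

        d.mtime_usec = 1000;
        verf = dir_cookieverf(&d);

        d.mtime_usec = 2000;                /* directory modified/compacted */
        printf("%d\n", check_readdir_cookie(&d, 42, verf));  /* prints 10003 */
        return 0;
    }

A real server would keep the verifier with its other per-directory
state and might fold in more than a timestamp, but the principle is
the same: once the verifier no longer matches, cookies handed out
earlier can no longer be trusted.
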
> The buffer crap that got done to avoid a file system top-end user
> presentation layer is totally bogus, and remains the cause of the
> problem.  If no one is interested in fixing it, I suggest reducing
> the transfer size to the page size or smaller.

I can't parse this one.

> And, of course, at the same time eat the increased and otherwise
> unnecessary overhead in the read/write path transfers that will
> result from doing this "fix".

I don't think that any fix is needed.  The NFSv2 behaviour is adequate
and NFSv3 has the mechanism to detect this problem.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail:  dfr@render.com
Phone: +44 171 251 4411
FAX:   +44 171 251 0939
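
A rough sketch of the one-behind resynchronization Terry describes
above: scan the directory block that contains the requested offset and
return either the entry at exactly that offset or, failing that, the
one just before it.  The entry layout and names here are invented for
illustration; this is not the actual UFS or NFS server code:

    /*
     * Sketch only: after a possible directory compaction, scan the
     * 512-byte directory block containing the offset the client asked
     * for and return either the entry at exactly that offset or,
     * failing that, the previous entry ("one behind").
     */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/types.h>

    #define DIRBLKSIZ 512                   /* UFS directory compaction unit */

    struct dentry {                         /* simplified directory entry */
        uint16_t reclen;                    /* length of this record */
        char     name[14];
    };

    /*
     * buf holds one DIRBLKSIZ block whose first entry starts at byte
     * offset blkoff within the directory; want is the offset the client
     * asked to continue from.  Returns the entry to hand back and stores
     * its real offset in *found.
     */
    static const struct dentry *
    resync_entry(const char *buf, off_t blkoff, off_t want, off_t *found)
    {
        off_t off = blkoff, prev = blkoff;
        const struct dentry *dp = (const struct dentry *)buf;
        const struct dentry *pdp = dp;

        while (off < want && off < blkoff + DIRBLKSIZ) {
            if (dp->reclen == 0)            /* corrupt block: give up */
                break;
            prev = off;                     /* save one-behind */
            pdp = dp;
            off += dp->reclen;
            dp = (const struct dentry *)(buf + (off - blkoff));
        }
        if (off == want) {                  /* exact match: entry still there */
            *found = off;
            return dp;
        }
        *found = prev;                      /* entry compacted back: previous one */
        return pdp;
    }

    int
    main(void)
    {
        char blk[DIRBLKSIZ];
        struct dentry e;
        const struct dentry *dp;
        off_t found;

        memset(blk, 0, sizeof(blk));
        e.reclen = 32;
        strcpy(e.name, "a");
        memcpy(blk, &e, sizeof(e));         /* entry at offset 0 */
        strcpy(e.name, "b");
        memcpy(blk + 32, &e, sizeof(e));    /* entry at offset 32 */

        /* The client asks for offset 40, which no longer starts an entry
           after compaction; we fall back to the entry at offset 32. */
        dp = resync_entry(blk, 0, 40, &found);
        printf("resynced to offset %ld, name %s\n", (long)found, dp->name);
        return 0;
    }

The client still has to filter out any duplicate entries this produces,
as Terry notes; the NFSv3 cookie verifier lets a server turn the same
situation into an explicit "retry from the start" instead.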