From: Terry Lambert
To: durian@plutotech.com (Mike Durian)
Cc: tlambert@primenet.com, hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: VFS/NFS client wedging problem
Date: Sat, 13 Sep 1997 03:26:24 +0000 (GMT)
Message-Id: <199709130326.UAA07078@usr02.primenet.com>
In-Reply-To: <199709130119.TAA03040@pluto.plutotech.com> from "Mike Durian" at Sep 12, 97 07:19:31 pm

> >If you don't have separate contexts, eventually you'll make a request
> >before the previous one completes.
>
> I serialize.

This is what I figured you had to do, or you'd really be in trouble.

> I used to keep a list of N available sockets and
> use one socket per request, but since I handle commands atomically
> in the user process figured it was silly and dropped down to one
> socket.

If this is a UNIX domain socket, then it's like a pipe.  A pipe does
not guarantee to keep data together across the pipe block size, so if
you are doing writes larger than that, this could be your problem.

You could write:

	AAAAABBBBBCCCCC

And get the data out of order:

	AAAABABBBCBCCCC

Which would account for the failures.

Typically, when I do this, I write data as:

	aAaAaAaAaAbBbBbBbBbBcCcCcCcCcC

where a, b, and c are channel identification tokens.  Then you can
decode:

	aAaAaAaAbBaAbBbBbBcCbBcCcCcCcC

Back into atomic units.  The channel identifiers are per byte.  This
is only one possibility, and depends on the write buffer size.

> The user process is one big select loop, and doesn't
> call select again until it has completed all commands on the
> readable sockets (which is now just one socket).

Did this failure occur when you had separate sockets?  How hard would
it be to go back to a socket per channel as a test case?

> >The NFS export stuff is a bit problematic.  I don't know what to
> >say about it, except that it should be in the common mount code
> >instead of being duplicated per FS.
> >
> >If you can give more architectural data about your FS, and you can
> >give the FS you used as a model of how a VFS should be written, I
> >might be able to give you more detailed help.
> >
> >This is probably something that should be taken off the general
> >-hackers list, and onto fs@freebsd.org
>
> It's really a mish-mash of other file systems.  I grabbed some
> from cd9660 and msdosfs for NFS, socket stuff from portal and
> then nullfs and other miscfs filesystems for general stuff.

This is not going to be a pleasant revelation, I'm afraid.  These are
the worst places to get NFS and VOP_LOCK examples, unfortunately.  The
best place is the ffs/ufs two-layer stack, but it's very complicated
and hard to understand.

The directory code in the msdosfs, in particular, is bad: there is a
race window between unlocking the parent and locking the child.
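Roughly, the window has this shape.  What follows is not the msdosfs
code (or any kernel code at all); it's a toy userland model, with
pthread mutexes standing in for the vnode locks, a string standing in
for the directory entry, and all of the names made up, but it shows
the pattern:

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static pthread_mutex_t parent_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t child_lock  = PTHREAD_MUTEX_INITIALIZER;
static char dir_entry[32] = "foo";	/* the name the "directory" maps */

static void *
lookup_thread(void *arg)
{
	char resolved[32];

	(void)arg;
	/* Resolve the name while holding the parent lock. */
	pthread_mutex_lock(&parent_lock);
	strcpy(resolved, dir_entry);
	pthread_mutex_unlock(&parent_lock);	/* <-- the window opens here */

	sleep(1);		/* widen the window so it always fires */

	/* Now lock the "child", then recheck against the parent. */
	pthread_mutex_lock(&child_lock);
	pthread_mutex_lock(&parent_lock);
	if (strcmp(resolved, dir_entry) != 0)
		printf("stale lookup: resolved \"%s\", but the entry is now \"%s\"\n",
		    resolved, dir_entry);
	pthread_mutex_unlock(&parent_lock);
	pthread_mutex_unlock(&child_lock);
	return (NULL);
}

static void *
rename_thread(void *arg)
{
	(void)arg;
	/* Sneaks in while the lookup holds neither lock. */
	pthread_mutex_lock(&parent_lock);
	strcpy(dir_entry, "bar");
	pthread_mutex_unlock(&parent_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, lookup_thread, NULL);
	usleep(100000);		/* let the lookup drop the parent lock first */
	pthread_create(&t2, NULL, rename_thread, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return (0);
}

The lookup resolves the name under the parent lock, drops that lock,
and only then locks the child; the rename slips in while neither lock
is held, so the lookup ends up holding a lock on something that no
longer matches the name it resolved.  The in-kernel lookup path is
exposed to the same kind of window.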
This is pretty much unavoidable (at present) because the VOP_LOOKUP
code structure pushes things that are better left up top down into the
per-FS code (the msdosfs would be able to deal with it if it didn't
have the VOP_ABORTOP issues on create and rename to contend with).

> I'll take all the detailed stuff off this list and move it to
> freebsd-fs.  I didn't know the fs list existed.

Heh.  Most people don't.  It doesn't see much action, because
modifying the interfaces requires huge code shifts: anything that
needs to do that touches every FS at the same time.

> >That's not strange.  It's a request context that's wedged.  When a
> >request context would be slept, the nfsd on the server isn't slept,
> >the context is.  The nfsd provides an execution context for a
> >different request context at that point.  Try nfsstat instead,
> >and/or iostat, on the server.
>
> I didn't realize that.  I did use nfsstat, but didn't know what
> to look for.  The only thing that seemed interesting to me was
> the 190 server faults.  But I didn't know if that was normal or not.

I have 0 here, but then my stuff is pretty hacked up compared to the
standard distribution, so I have no way of knowing whether faults are
the normal state of affairs or not.  Doug Rabson would know.

> >This proves to us that it isn't async requests over the wire that
> >are hosing you.  That the server is an NFSv3-capable server argues
> >that the v2 protocol is implemented by a v3 engine, which would
> >explain the blockages.
> >
> >Have you tried both TCP and UDP based mounts?
>
> Yes.  UDP locked up faster than TCP (though that is a subjective
> measurement, I didn't actually time things).  TCP had the "server not
> responding"/"responding again" messages.

This rules out "source host not equal to mount host" errors.  It's a
good data point for eliminating an obvious case... even negative data
is still data.  8-(.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.