Date:      Mon, 6 Apr 1998 17:06:27 -0700 (PDT)
From:      Mika Nystrom <mika@cs.caltech.edu>
To:        FreeBSD-gnats-submit@FreeBSD.ORG
Subject:   bin/6231: amd automount fallback to "mount version 1" does not work
Message-ID:  <199804070006.RAA25142@obelix.cs.caltech.edu>

>Number:         6231
>Category:       bin
>Synopsis:       amd automount fallback to "mount version 1" does not work
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Apr  6 17:10:00 PDT 1998
>Last-Modified:
>Originator:     Mika Nystrom
>Organization:
California Institute of Technology
>Release:        FreeBSD 3.0-CURRENT i386
>Environment:

See below.

>Description:

I have sent PRs on this one before, even closed one, thinking that the bug
was in the kernel and not in amd, but I finally realized I would have to
get to the bottom of this....  Quick recap of how amd works:
(I assume amd's home is what we use, "/ufs"; other people may use /a.)
Suppose you try to access an automounted directory that is not currently
mounted, say /ufs/mail.  /ufs appears to the kernel as being the domain
of an NFS server, but it is of course amd that masquerades as one.  When amd
gets the lookup request for /ufs/mail, it forks a process that will mount
(in our case) vlsi:/var/spool/mail on /tmp_mnt/vlsi/var/spool/mail .
Unless that mount returns really quickly (as in the case of an ffs mount),
amd returns nothing to the kernel at this time.  Instead, the forked process
is reaped later by amd, and that just sets a value somewhere in a table.
It is the kernel's responsibility to repeat the lookup RPC (that looks
like it's been "lost" by e.g., a network failure), and if the RPC is
repeated after the filesystem has been successfully mounted (and before
it's been re-umounted---there's a race condition here, but who cares,
we fix this one with timers), the correct response is now returned with
status NFS_OK.  Oh, if things could always be so easy!  

The bug in amd seems to have been introduced by a ten-thumbed programmer
adding support for NFS v3 (although I'm not claiming I could do any
better myself).  Here in the department, we have some old fileservers,
SPARCstation 1s running SunOS 4.1.4.  These machines do not support
NFSv3.  The approach amd takes in this case is:  try a version 3 mount,
and if that fails, retry with a "version 1" mount (I assume this is
NFSv2).  The relevant code is in /usr/src/usr.sbin/amd/amd/nfs_ops.c,
function got_nfs_fh().  This routine is called to "pick up" an RPC
response from the fileserver that's being mounted from (in my example,
that would be the ss1 called "vlsi").  It sets fp->fh_error to the
return value of pickup_rpc_reply, and if it was the first attempt
(version 3 mount), it *schedules* a retry with mount version 1
(XXX: asynchronous!)  But it leaves fp->fh_error set to NFSERR_IO
(that's what the failed version 3 mount returned).  Now if the kernel
sticks its RPC retry in between the failed version 3 mount and the
(coming soon) version 1 mount that will fix this, amd will respond with
the EIO, since the error is cached in the filehandle cache and returned
that way via nfs_init, try_mount, afs_lookuppn... It finally makes its
way to the user as an I/O Error that can be ... extremely frustrating.
My earlier kern PR, I think, just implies that the EIO is cached by the
kernel, which strikes me as a bit strange, but maybe that's not
such a big deal.


P.S. the kernel NFS code is very hard to understand.. maybe someone
who understands how it works could insert a few comments?

>How-To-Repeat:

You can reverse-engineer it from the above; the main thing is that it
requires an NFSv2-only server.  (Note that we can't really work around
this in our amd maps since they are generated with a Perl script from
SunOS 4 automount maps...)

>Fix:

Simply set fp->fh_error to -1 before leaving this part of the
daemon.  -1 is used in various places of amd to denote a "mount
in progress," so this is probably The Right Thing To Do.

I tested this with my machine... the un-fixed amd seemed to have
problems about 30 times in 2,000 mount attempts (roughly 1.5%).  The
fixed amd didn't fail to mount any time in 10,000 attempts to the same
server (and otherwise completely identical setup and test program).
	
*** nfs_ops.c.orig      Mon Apr  6 15:10:29 1998
--- nfs_ops.c   Mon Apr  6 15:15:37 1998
***************
*** 178,183 ****
--- 178,192 ----
  #ifdef DEBUG
                                dlog("mount version 3 refused, retrying with version 1");
  #endif
+                               /*
+                                * At this point, fp->fh_error is meaningless
+                                * since we are already fixing the problem
+                                * We don't want it to get back to the caller!
+                                * Mount is now still "in progress"
+                                *
+                                */
+ 
+                               fp->fh_error = -1;
                                fp->fh_id = FHID_ALLOC();
                                fp->fh_mountres.mr_version = MOUNTVERS;
                                call_mountd(fp, MOUNTPROC_MNT, MOUNTVERS, got_nfs_fh, fp->fh_wchan);


>Audit-Trail:
>Unformatted:



