From owner-freebsd-bugs Mon Apr 6 17:10:04 1998
Return-Path:
Received: (from majordom@localhost)
        by hub.freebsd.org (8.8.8/8.8.8) id RAA08089
        for freebsd-bugs-outgoing; Mon, 6 Apr 1998 17:10:04 -0700 (PDT)
        (envelope-from owner-freebsd-bugs@FreeBSD.ORG)
Received: (from gnats@localhost)
        by hub.freebsd.org (8.8.8/8.8.8) id RAA08081;
        Mon, 6 Apr 1998 17:10:01 -0700 (PDT)
        (envelope-from gnats)
Received: from vlsi.cs.caltech.edu (vlsi.cs.caltech.edu [131.215.131.129])
        by hub.freebsd.org (8.8.8/8.8.8) with SMTP id RAA07806
        for ; Mon, 6 Apr 1998 17:06:47 -0700 (PDT)
        (envelope-from mika@obelix.cs.caltech.edu)
Received: from obelix.cs.caltech.edu by vlsi.cs.caltech.edu (4.1/1.34.1)
        id AA22195; Mon, 6 Apr 98 17:06:28 PDT
Received: (from mika@localhost)
        by obelix.cs.caltech.edu (8.8.8/8.8.7) id RAA25142;
        Mon, 6 Apr 1998 17:06:27 -0700 (PDT)
Message-Id: <199804070006.RAA25142@obelix.cs.caltech.edu>
Date: Mon, 6 Apr 1998 17:06:27 -0700 (PDT)
From: Mika Nystrom
Reply-To: mika@cs.caltech.edu
To: FreeBSD-gnats-submit@FreeBSD.ORG
X-Send-Pr-Version: 3.2
Subject: bin/6231: amd automount fallback to "mount version 1" does not work
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

>Number: 6231
>Category: bin
>Synopsis: amd automount fallback to "mount version 1" does not work
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Mon Apr 6 17:10:00 PDT 1998
>Last-Modified:
>Originator: Mika Nystrom
>Organization: California Institute of Technology
>Release: FreeBSD 3.0-CURRENT i386
>Environment:

See below.

>Description:

I have sent PRs on this one before, and even closed one, thinking that
the bug was in the kernel and not in amd, but I finally realized I would
have to get to the bottom of this.

Quick recap of how amd works (I assume amd's home is what we use,
"/ufs"; other people may use /a): try to access an automounted directory
that is not currently mounted, say /ufs/mail. To the kernel, /ufs appears
to be the domain of an NFS server, but it is of course amd that
masquerades as one. When amd gets the lookup request for /ufs/mail, it
forks a process that will mount (in our case) vlsi:/var/spool/mail on
/tmp_mnt/vlsi/var/spool/mail. Unless that mount returns really quickly
(as in the case of an ffs mount), amd returns nothing to the kernel at
this time. Instead, the forked process is reaped later by amd, and that
just sets a value somewhere in a table. It is the kernel's responsibility
to repeat the lookup RPC (which looks as if it had been "lost" by, e.g.,
a network failure). If the RPC is repeated after the filesystem has been
successfully mounted (and before it has been unmounted again; there's a
race condition here, but who cares, we fix this one with timers), the
correct response is now returned with status NFS_OK.

Oh, if things could always be so easy! The bug in amd seems to have been
introduced by a ten-thumbed programmer adding support for NFS v3
(although I'm not claiming I could do any better myself). Here in the
department, we have some old fileservers, SPARCstation 1s running
SunOS 4.1.4. These machines do not support NFSv3. The approach amd takes
in this case is: try a version 3 mount first, and if that fails, retry
with a "version 1" mount (I assume this is NFSv2).

The relevant code is in /usr/src/usr.sbin/amd/amd/nfs_ops.c, function
got_nfs_fh(). This routine is called to "pick up" an RPC response from
the fileserver that's being mounted from (in my example, that would be
the ss1 called "vlsi"). It sets fp->fh_error to the return value of
pickup_rpc_reply, and if the failed attempt was the first one (version 3
mount), it *schedules* a retry with mount version 1 (XXX: asynchronous!).
But it leaves fp->fh_error set to NFSERR_IO (that's what the failed
version 3 mount returned).
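To make the state involved concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not amd's code: struct fh_entry, on_mount_reply(), answer_lookup() and the clear_stale_error switch are hypothetical names made up for illustration. The only things taken from this report are the fh_error field, the version 3 to version 1 fallback, and the convention (mentioned under >Fix) that -1 denotes a mount in progress. EIO stands in for NFSERR_IO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Convention described in this report: -1 means "mount in progress". */
#define MOUNT_IN_PROGRESS (-1)

/* What the automounter does with a lookup RPC from the kernel. */
enum lookup_result {
    LOOKUP_DROP,   /* say nothing; the kernel will repeat the RPC */
    LOOKUP_OK,     /* filesystem is mounted, answer NFS_OK */
    LOOKUP_ERROR   /* hand the cached error back to the kernel */
};

/* Hypothetical stand-in for a filehandle cache entry. */
struct fh_entry {
    int fh_error;       /* -1: in progress, 0: mounted, >0: errno value */
    int mount_version;  /* version of the mount request outstanding */
};

/*
 * Model of the reply handler. The result of the mount RPC is stored in
 * fh_error. If a version 3 request was refused, a version 1 retry is
 * scheduled (here: just recorded). With clear_stale_error == false the
 * error from the refused request stays in the entry while the retry is
 * still outstanding.
 */
void on_mount_reply(struct fh_entry *fp, int rpc_error, bool clear_stale_error)
{
    assert(fp != NULL);
    fp->fh_error = rpc_error;
    if (rpc_error != 0 && fp->mount_version == 3) {
        if (clear_stale_error)
            fp->fh_error = MOUNT_IN_PROGRESS;
        fp->mount_version = 1;  /* retry is now the outstanding request */
    }
}

/* Model of how a repeated lookup is answered from the cached entry. */
enum lookup_result answer_lookup(const struct fh_entry *fp)
{
    if (fp->fh_error < 0)
        return LOOKUP_DROP;
    if (fp->fh_error == 0)
        return LOOKUP_OK;
    return LOOKUP_ERROR;
}
```

In this model, calling on_mount_reply() with EIO on an entry whose mount_version is 3 and clear_stale_error false leaves answer_lookup() returning LOOKUP_ERROR even though a retry is pending; with the switch set to true the lookup is dropped until the retry's own reply arrives.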
Now if the kernel sticks its RPC retry in between the failed version 3
mount and the (coming soon) version 1 mount that will fix this, amd will
respond with the EIO, since the error is cached in the filehandle cache
and returned that way via nfs_init, try_mount, afs_lookuppn... It finally
makes its way to the user as an I/O Error that can be ... extremely
frustrating.

My earlier kern PR, I think, just implies that the EIO is cached by the
kernel, which strikes me as a bit strange, but maybe that's not such a
big deal.

P.S. The kernel NFS code is very hard to understand. Maybe someone who
understands how it works could insert a few comments?

>How-To-Repeat:

You can reverse-engineer it from the above; the main thing is that it
requires an NFSv2-only server. (Note that we can't really work around
this in our amd maps, since they are generated with a perl script from
SunOS 4 automount maps...)

>Fix:

Simply set the return value to -1 before leaving this part of the daemon.
-1 is used in various places of amd to denote a "mount in progress," so
this is probably The Right Thing To Do.

I tested this on my machine: the un-fixed amd seemed to have problems
about 30 times in 2,000 mount attempts. The fixed amd didn't fail to
mount a single time in 10,000 attempts to the same server (with an
otherwise completely identical setup and test program).

*** nfs_ops.c.orig Mon Apr 6 15:10:29 1998
--- nfs_ops.c Mon Apr 6 15:15:37 1998
***************
*** 178,183 ****
--- 178,192 ----
  #ifdef DEBUG
  		dlog("mount version 3 refused, retrying with version 1");
  #endif
+ 		/*
+ 		 * At this point, fp->fh_error is meaningless
+ 		 * since we are already fixing the problem
+ 		 * We don't want it to get back to the caller!
+ 		 * Mount is now still "in progress"
+ 		 *
+ 		 */
+ 
+ 		fp->fh_error = -1;
  		fp->fh_id = FHID_ALLOC();
  		fp->fh_mountres.mr_version = MOUNTVERS;
  		call_mountd(fp, MOUNTPROC_MNT, MOUNTVERS, got_nfs_fh, fp->fh_wchan);

>Audit-Trail:
>Unformatted:

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message