From owner-freebsd-questions  Fri Jan  7  8:15: 0 2000
Delivered-To: freebsd-questions@freebsd.org
Received: from cc942873-a.ewndsr1.nj.home.com (cc942873-a.ewndsr1.nj.home.com [24.2.89.207])
	by hub.freebsd.org (Postfix) with ESMTP id 8EEEF157B0
	for <freebsd-questions@FreeBSD.ORG>; Fri,  7 Jan 2000 08:14:46 -0800 (PST)
	(envelope-from cjc@cc942873-a.ewndsr1.nj.home.com)
Received: (from cjc@localhost)
	by cc942873-a.ewndsr1.nj.home.com (8.9.3/8.9.3) id LAA23340
	for freebsd-questions@FreeBSD.ORG; Fri, 7 Jan 2000 11:19:16 -0500 (EST)
	(envelope-from cjc)
From: "Crist J. Clark" <cjc@cc942873-a.ewndsr1.nj.home.com>
Message-Id: <200001071619.LAA23340@cc942873-a.ewndsr1.nj.home.com>
Subject: Re: Hung NFS Mount
In-Reply-To: <200001062031.PAA20493@cc942873-a.ewndsr1.nj.home.com> from "Crist J. Clark" at "Jan 6, 2000 03:31:40 pm"
To: freebsd-questions@FreeBSD.ORG (FreeBSD Questions)
Date: Fri, 7 Jan 2000 11:19:16 -0500 (EST)
Reply-To: cjclark@home.com
X-Mailer: ELM [version 2.4ME+ PL54 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

I have a little more info about my hung system. I'm hoping against
hope that someone out there might know how to help me with this.

The client with unkillable processes is hung up, but still alive. It's
been sending out NFS packets every ten seconds for a day now trying to
get the file it wants from the server. I'm beginning to wonder if this
is a client or server issue. I've caught the packet it keeps sending,

# tcpdump -v \( host newmail \&\& port nfs \)
10:40:04.592755 newmail.mydom.org.3388292815 > backmail.mydom.org.nfs: 104 lookup [|nfs] (ttl 64, id 15687)
10:40:14.838168 newmail.mydom.org.3388292815 > backmail.mydom.org.nfs: 104 lookup [|nfs] (ttl 64, id 15736)
10:40:25.088713 newmail.mydom.org.3388292815 > backmail.mydom.org.nfs: 104 lookup [|nfs] (ttl 64, id 15737)
10:40:35.339210 newmail.mydom.org.3388292815 > backmail.mydom.org.nfs: 104 lookup [|nfs] (ttl 64, id 15768)
^C
40 packets received by filter
0 packets dropped by kernel
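
For anyone who wants to check the capture above rather than eyeball it,
here is a minimal Python sketch. It assumes the number after the source
host is the RPC transaction id (that is how tcpdump prints NFS traffic),
and the regular expression is only an illustration fitted to these four
lines, not a general tcpdump parser.

```python
import re
from datetime import datetime

# The four tcpdump lines quoted above, copied verbatim.
CAPTURE = """\
10:40:04.592755 newmail.mydom.org.3388292815 > backmail.mydom.org.nfs: 104 lookup [|nfs] (ttl 64, id 15687)
10:40:14.838168 newmail.mydom.org.3388292815 > backmail.mydom.org.nfs: 104 lookup [|nfs] (ttl 64, id 15736)
10:40:25.088713 newmail.mydom.org.3388292815 > backmail.mydom.org.nfs: 104 lookup [|nfs] (ttl 64, id 15737)
10:40:35.339210 newmail.mydom.org.3388292815 > backmail.mydom.org.nfs: 104 lookup [|nfs] (ttl 64, id 15768)
"""

# timestamp, source host + xid, destination host + "nfs", length, procedure
LINE = re.compile(r"^(\d\d:\d\d:\d\d\.\d{6}) \S+\.(\d+) > \S+\.nfs: (\d+) (\w+)")

def parse(text):
    """Return (timestamp, xid, length, procedure) for each matching line."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            rows.append((stamp, int(m.group(2)), int(m.group(3)), m.group(4)))
    return rows

rows = parse(CAPTURE)
xids = {xid for _, xid, _, _ in rows}
gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(rows, rows[1:])]

print("distinct xids:", sorted(xids))
print("gaps in seconds:", [round(g, 3) for g in gaps])
```

A single distinct xid with evenly spaced gaps would be consistent with
the client retransmitting one request rather than issuing new ones.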

More detail on one of these packets,

00:16:41.324996 newmail.mydom.org.3388292815 > backmail.mydom.org.nfs: 104 lookup fh 25,0/713065181 "net" (ttl 64, id 30022)

The full packet is shown at the bottom.

I have tried a few things. I have scanned the source port on the
client from the NFS port of the server (nmap reports it open
anyway). I have set "unreach" rules on the server's firewall on the
NFS port with a variety of unreachable responses. None of these
stopped the packets from continuing or unhung the processes on the
client. I have also mounted the remote filesystem elsewhere on the
system and then unmounted it with no problem. That is,

# mount -t nfs              # show the hung FS 
backmail:/u1/FreeBSD-3S/ports on /usr/ports
# mount backmail:/u1/FreeBSD-3S/ports /mnt
# ls /mnt
.cvsignore      archivers       deskutils       mail            sysutils
INDEX           astro           devel           math            textproc
LEGAL           audio           distfiles       mbone           www
Makefile        benchmarks      editors         misc            x11
Mk              biology         emulators       net             x11-clocks
README          cad             ftp             news            x11-fm
Templates       comms           games           print           x11-fonts
Tools           converters      graphics        security        x11-toolkits
YEAR2000        databases       lang            shells          x11-wm
# umount backmail:/u1/FreeBSD-3S/ports
umount: /usr/ports: Device busy
# umount /mnt
# ls /mnt
# 

I am kind of curious why the server is not responding at all to
this. I'm used to getting "stale file handle" messages when a server
disappears and comes back up in a new state, and I would think that
that would be the server response in this case.

Does anyone know how to prompt the server to give a response? Or how
to build a packet to send to the client to get it out of this rut? Any
ideas would be much appreciated.
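
As a rough illustration of what such a packet might contain, here is a
minimal Python sketch that builds only the RPC payload of an NFSv3
LOOKUP reply carrying a "stale file handle" status. The layout follows
my reading of RFC 1831 (RPC) and RFC 1813 (NFSv3); the function name is
hypothetical, nothing is sent on the wire, and whether the client would
accept such a reply is exactly the open question.

```python
import struct

MSG_REPLY = 1        # RPC msg_type for a reply (RFC 1831)
MSG_ACCEPTED = 0     # reply_stat: the call was accepted
AUTH_NONE = 0        # verifier flavor, with an empty body
SUCCESS = 0          # accept_stat: the RPC layer itself succeeded
NFS3ERR_STALE = 70   # nfsstat3 value for a stale file handle (RFC 1813)

def stale_lookup_reply(xid):
    """Build the RPC payload of a LOOKUP reply that reports NFS3ERR_STALE.

    Hypothetical helper for illustration only.
    """
    return struct.pack(
        ">8I",
        xid, MSG_REPLY, MSG_ACCEPTED,
        AUTH_NONE, 0,        # verifier: flavor and zero length
        SUCCESS,
        NFS3ERR_STALE,       # NFS-level status
        0,                   # failure body: directory attributes absent
    )

# The transaction id is the one tcpdump shows for the repeated request.
payload = stale_lookup_reply(3388292815)
print(len(payload), payload.hex())
```

To be matched to the outstanding request at all, the payload would
presumably have to arrive over UDP from the server's address and NFS
port, addressed to the source port the client is using.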

Here is a hexdump -C of the NFS packet it keeps sending (IP addresses
masked),

# hexdump -C nfs.tcpdump
00000000  d4 c3 b2 a1 02 00 04 00  00 00 00 00 00 00 00 00  |................|
00000010  00 01 00 00 01 00 00 00  b9 76 75 38 84 f5 04 00  |.........vu8....|
00000020  92 00 00 00 92 00 00 00  00 aa 00 bb 1e 42 00 aa  |.............B..|
00000030  00 6f d7 28 08 00 45 00  00 84 75 46 00 00 40 11  |.o.(..E...uF..@.|
00000040  ae 07 xx xx xx xx xx xx  xx xx 03 fd 08 01 00 70  |...............p|
00000050  30 32 c9 f5 3e cf 00 00  00 00 00 00 00 02 00 01  |02..>...........|
00000060  86 a3 00 00 00 03 00 00  00 03 00 00 00 01 00 00  |................|
00000070  00 18 00 00 00 00 00 00  00 00 00 00 ff fe 00 00  |................|
00000080  ff fe 00 00 00 01 00 00  ff fe 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 1c 00 19  00 00 dd 82 80 2a 0c 00  |.............*..|
000000a0  00 00 00 d9 00 00 0c b0  82 65 00 00 00 00 00 00  |.........e......|
000000b0  00 00 00 00 00 03 6e 65  74 00                    |......net.|
000000ba
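
To make the dump above easier to read, here is a minimal Python sketch
that walks the same bytes: pcap file header, packet record header,
Ethernet, IP, UDP, the RPC call header, and the NFSv3 LOOKUP arguments.
The eight masked address bytes are replaced with 00, which is an
assumption made only so the data parses. The function and key names are
my own, and the field layout assumes a classic little-endian pcap file
with an Ethernet link type, which is what the first bytes suggest.

```python
import struct

# Bytes transcribed from the hexdump above; masked "xx" bytes set to 00.
HEX = """
d4 c3 b2 a1 02 00 04 00  00 00 00 00 00 00 00 00
00 01 00 00 01 00 00 00  b9 76 75 38 84 f5 04 00
92 00 00 00 92 00 00 00  00 aa 00 bb 1e 42 00 aa
00 6f d7 28 08 00 45 00  00 84 75 46 00 00 40 11
ae 07 00 00 00 00 00 00  00 00 03 fd 08 01 00 70
30 32 c9 f5 3e cf 00 00  00 00 00 00 00 02 00 01
86 a3 00 00 00 03 00 00  00 03 00 00 00 01 00 00
00 18 00 00 00 00 00 00  00 00 00 00 ff fe 00 00
ff fe 00 00 00 01 00 00  ff fe 00 00 00 00 00 00
00 00 00 00 00 1c 00 19  00 00 dd 82 80 2a 0c 00
00 00 00 d9 00 00 0c b0  82 65 00 00 00 00 00 00
00 00 00 00 00 03 6e 65  74 00
"""
data = bytes.fromhex("".join(HEX.split()))

def pad4(n):
    """XDR pads variable-length data to a multiple of four bytes."""
    return (n + 3) & ~3

def decode(data):
    # pcap global header (24 bytes), then one record header (16 bytes).
    magic, vmaj, vmin, _zone, _sigfigs, snaplen, linktype = \
        struct.unpack_from("<IHHiIII", data, 0)
    ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from("<IIII", data, 24)
    frame = data[40:40 + incl_len]

    # Ethernet (14 bytes), then IP and UDP, all big-endian.
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4
    ip_id, = struct.unpack_from(">H", ip, 4)
    udp = ip[ihl:]
    sport, dport, udp_len = struct.unpack_from(">HHH", udp, 0)

    # RPC call header: xid, type, RPC version, program, version, procedure,
    # credential flavor, credential length.
    rpc = udp[8:]
    xid, mtype, rpcvers, prog, vers, proc, _cflavor, cred_len = \
        struct.unpack_from(">8I", rpc, 0)
    off = 32 + pad4(cred_len)
    _vflavor, verf_len = struct.unpack_from(">II", rpc, off)
    off += 8 + pad4(verf_len)

    # LOOKUP arguments: directory file handle, then the name to look up.
    fh_len, = struct.unpack_from(">I", rpc, off)
    off += 4 + pad4(fh_len)
    name_len, = struct.unpack_from(">I", rpc, off)
    name = rpc[off + 4:off + 4 + name_len].decode("ascii")

    return {
        "magic": magic, "version": (vmaj, vmin), "linktype": linktype,
        "ts_usec": ts_usec, "ip_id": ip_id, "ttl": ip[8], "proto": ip[9],
        "sport": sport, "dport": dport, "rpc_len": len(rpc),
        "xid": xid, "msg_type": mtype, "prog": prog, "vers": vers,
        "proc": proc, "fh_len": fh_len, "name": name,
    }

info = decode(data)
for key, value in info.items():
    print(f"{key:9} {value}")
```

The decoded transaction id, IP id, TTL, payload length, and the looked-up
name can be compared directly against the detailed tcpdump line earlier
in this message.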


Crist J. Clark wrote,
> A machine of mine had some SCSI hardware problems yesterday. The
> machine does NFS serving to several others. The filesystems exported
> are on drives that were experiencing problems. This was causing local
> hung processes on the machine as well as hung processes on the NFS
> clients.
> 
> Eventually, I was forced to reboot the machine with hardware
> problems. Now, the NFS exports are clean. Most machines that had
> problems noticed the server go down and come up. They responded with
> 'stale NFS handle' messages at access attempts. A simple umount/mount
> of the filesystem fixed this.
> 
> However, one machine is still having problems. It tried to access
> files on the failing server while the NFS daemon was alive, but unable
> to get the files due to the hardware problems. These processes are
> still hanging. Despite the server going up and down and the fact it is
> now alive and well, I cannot get the processes to "unhang."
> 
> Here are some of them,
> 
> root     15083  0.0  0.1   288   16  p0- D    11:51AM    0:00.04 umount /usr/ports
> postman  15288  0.0  2.2   740  488  p1  Ds   12:08PM    0:00.40 -tcsh (tcsh)
> root     15312  0.0  0.1   288   16  p2- D    12:09PM    0:00.03 umount /usr/ports
> root     15820  0.0  0.1   224   16  p2- D    12:42PM    0:00.02 mount /usr/ports
> root     16223  0.0  1.3   240  288  p2- D     1:05PM    0:00.43 / (find)
> root     17693  0.0  0.2   288   36  p0- D     2:53PM    0:00.03 umount -f /usr/ports
> 
> I would really rather not reboot the machine this is happening
> on (and I wonder if the shutdown would even be clean). However, these
> are just a few of the hung processes. I've already had 'file table
> full' errors which I believe are caused by all of the hung processes
> keeping files open.
> 
> I know that hard NFS errors like this are very tough, if not
> impossible, to clear, but I'd try just about anything. I'd build raw
> packets to throw from the NFS server if I thought it would spoof the
> client out of the hangs.
> 
> Any ideas would be great. (But I really think I'll need to
> reboot... after 160 days up too... *sigh*)
> -- 
> Crist J. Clark                           cjclark@home.com


-- 
Crist J. Clark                           cjclark@home.com


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-questions" in the body of the message