Date:      Tue, 23 Nov 1999 02:50:38 +0000 (GMT)
From:      iedowse@maths.tcd.ie
To:        FreeBSD-gnats-submit@freebsd.org
Subject:   kern/15055: Soft NFS mounts can deadlock
Message-ID:  <199911230250.aa05526@walton.maths.tcd.ie>

>Number:         15055
>Category:       kern
>Synopsis:       Soft NFS mounts can deadlock
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Nov 22 19:00:01 PST 1999
>Closed-Date:
>Last-Modified:
>Originator:     Ian Dowse
>Release:        FreeBSD 3.3-STABLE i386
>Organization:
		School of Mathematics,
		Trinity College Dublin
>Environment:
	
	FreeBSD -current or -stable, mounting an NFS filesystem with
	the NFSMNT_SOFT (-s) flag.

>Description:

	Under certain circumstances it is possible for multiple processes
	to reach a deadlock situation when accessing a soft-mount NFS
	filesystem. This problem is triggered when the NFS server becomes
	unavailable for a time, but the processes remain deadlocked even
	after the server comes back. If the mount is also interruptible
	(NFSMNT_INT or -i), then recovery is possible by killing some of
	the affected processes; otherwise a reboot is necessary.

	This problem results from an interaction between the NFS congestion
	window mechanism and the way that soreceive() calls on the NFS
	socket are serialised.

	When the NFS server becomes unavailable and there are outstanding
	requests (new or old), the NFS congestion window quickly shrinks
	back to 1 RPC. Requests then fall into two categories: (a) those
	that managed to get in and send a request before the window closed
	up (R_SENT flag set); and (b) those that missed the window, so are
	waiting for nfs_timer() to transmit their requests later.

	The deadlock occurs when a process with a category (b) request gets
	the receive lock, and subsequently all type (a) requests time out.
	No type (a) requests are transmitted since they have all timed
	out, and the congestion window disallows transmitting type (b)
	requests. The process holding the receive lock will not release it
	until it receives an NFS reply (for any request), but since there
	are no requests being transmitted, this never happens. The
	timed-out requests don't complete either since their processes are all
	waiting for the receive lock!
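
	The accounting behind this can be sketched as a small userspace
	model. This is only an illustration, not the kernel code: the
	struct and function names, the value of NFS_CWNDSCALE and the
	number of type (a) requests are assumptions, and the window test
	is reduced to "nm_sent < nm_cwnd".

```c
#include <stdbool.h>

#define NFS_CWNDSCALE 256   /* assumed value: window units per RPC */
#define R_SENT        0x01  /* request is counted in nm_sent */
#define R_SOFTTERM    0x02  /* request has been timed out */
#define N_TYPE_A      4     /* illustrative number of type (a) requests */

struct mount_model { int nm_cwnd; int nm_sent; };
struct req_model   { int r_flags; };

/* A request that has not been sent yet may go out only if this holds. */
static bool window_open(const struct mount_model *nmp)
{
	return nmp->nm_sent < nmp->nm_cwnd;
}

/* Unpatched timeout: the request is flagged, but nm_sent is not touched. */
static void timeout_unpatched(struct req_model *rep)
{
	rep->r_flags |= R_SOFTTERM;
}

/*
 * Returns true if a type (b) request could be transmitted after every
 * type (a) request has timed out under the unpatched accounting.
 */
static bool type_b_can_send_unpatched(void)
{
	struct mount_model nmp = { .nm_cwnd = N_TYPE_A * NFS_CWNDSCALE, .nm_sent = 0 };
	struct req_model a[N_TYPE_A] = { { 0 } };
	int i;

	for (i = 0; i < N_TYPE_A; i++) {	/* type (a): sent while the window was open */
		a[i].r_flags |= R_SENT;
		nmp.nm_sent += NFS_CWNDSCALE;
	}
	nmp.nm_cwnd = NFS_CWNDSCALE;		/* server gone: window shrinks to 1 RPC */
	for (i = 0; i < N_TYPE_A; i++)
		timeout_unpatched(&a[i]);
	return window_open(&nmp);		/* nm_sent still covers all four requests */
}
```

	In this model type_b_can_send_unpatched() returns false: the
	timed-out requests still occupy the window, so the type (b)
	request held by the receive-lock owner is never transmitted.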

	If the mount is interruptible, then killing the type (b) process
	that currently holds the receive lock will release it. Then all
	the type (a) processes notice that their requests have timed out,
	and return. 
	

>How-To-Repeat:

	mount -o -s,-i someserver:/fs /mnt
	
	# Lots of accesses to push down the NFS RTT estimates
	find /mnt -print > /dev/null

	#  *** Disconnect the server from the client ***

	# Make some type (a) processes
	ls -l /mnt & ls -l /mnt & ls -l /mnt & ls -l /mnt &
	sleep 5
	# Now that the congestion window has closed these will be type (b)
	df /mnt & df /mnt & df /mnt & df /mnt &

	Then wait for a few 'nfs server not responding' errors, and wait
	for the NFS traffic to stop completely with one of the df processes
	waiting on 'sbwait'. When this happens, reconnecting the server
	will not unwedge the processes, but killing the df in 'sbwait' will.

>Fix:
	Apply the following patch to sys/nfs/nfs_socket.c. This causes
	the count of outstanding requests to be decremented as soon as
	a request is marked as timed-out. When all type (a) requests
	have timed out, the congestion window will allow another request
	to be transmitted, so the deadlock is avoided.
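
	The effect of the change can be sketched with a small userspace
	model of the accounting. It is an illustration under stated
	assumptions rather than the kernel code: the *_model names, the
	value of NFS_CWNDSCALE and the four-request scenario are made up
	for the example, and the window test is reduced to
	"nm_sent < nm_cwnd".

```c
#include <stdbool.h>

#define NFS_CWNDSCALE 256   /* assumed value: window units per RPC */
#define R_SENT        0x01  /* request is counted in nm_sent */
#define R_SOFTTERM    0x02  /* request has been timed out */
#define N_TYPE_A      4     /* illustrative number of type (a) requests */

struct mount_model { int nm_cwnd; int nm_sent; };
struct req_model   { struct mount_model *r_nmp; int r_flags; };

/* Patched timeout: give the window slot back as soon as the request dies. */
static void softterm_model(struct req_model *rep)
{
	rep->r_flags |= R_SOFTTERM;
	if (rep->r_flags & R_SENT) {
		rep->r_nmp->nm_sent -= NFS_CWNDSCALE;
		rep->r_flags &= ~R_SENT;
	}
}

/* Reply path: decrement only if the timeout path has not already done so. */
static void reply_model(struct req_model *rep)
{
	if (rep->r_flags & R_SENT) {
		rep->r_flags &= ~R_SENT;
		rep->r_nmp->nm_sent -= NFS_CWNDSCALE;
	}
}

/*
 * Sends N_TYPE_A requests, shrinks the window to 1 RPC, times them all
 * out, optionally delivers late replies, and returns the resulting nm_sent.
 */
static int sent_after_timeouts(bool late_replies)
{
	struct mount_model nmp = { .nm_cwnd = NFS_CWNDSCALE, .nm_sent = 0 };
	struct req_model a[N_TYPE_A];
	int i;

	for (i = 0; i < N_TYPE_A; i++) {
		a[i].r_nmp = &nmp;
		a[i].r_flags = R_SENT;
		nmp.nm_sent += NFS_CWNDSCALE;
	}
	for (i = 0; i < N_TYPE_A; i++)
		softterm_model(&a[i]);
	for (i = 0; late_replies && i < N_TYPE_A; i++)
		reply_model(&a[i]);
	return nmp.nm_sent;
}

/* True if a not-yet-sent (type (b)) request fits in the 1-RPC window. */
static bool type_b_can_send_patched(void)
{
	return sent_after_timeouts(false) < NFS_CWNDSCALE;
}
```

	In this model type_b_can_send_patched() returns true, and
	sent_after_timeouts(true) is 0 rather than negative, because
	clearing R_SENT in the timeout path stops a late reply from
	decrementing the count a second time.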

	Note that while this patch solves the deadlock problem, the code
	still does not guarantee that a process will be made aware quickly
	that its request has timed out. That would require nfs_timer() to
	set some flag in the nfsmount struct, instructing the current holder
	of the receive lock to release it as soon as possible. I'm not sure
	that such a mechanism would be worth the effort. With this patch the
	process will find out eventually (it doesn't need to wait for the
	server to come back) about a timeout, and all waiting processes will
	respond quickly when the server does return.

--- nfs_socket.c.orig	Mon Nov 22 21:58:12 1999
+++ nfs_socket.c	Mon Nov 22 22:43:33 1999
@@ -152,6 +152,7 @@
 static void	nfs_realign __P((struct mbuf **pm, int hsiz));
 static int	nfs_receive __P((struct nfsreq *rep, struct sockaddr **aname,
 				 struct mbuf **mp));
+static void	nfs_softterm __P((struct nfsreq *rep));
 static int	nfs_reconnect __P((struct nfsreq *rep));
 #ifndef NFS_NOSERVER 
 static int	nfsrv_getstream __P((struct nfssvc_sock *,int));
@@ -864,8 +865,10 @@
 					if (nmp->nm_cwnd > NFS_MAXCWND)
 						nmp->nm_cwnd = NFS_MAXCWND;
 				}
-				rep->r_flags &= ~R_SENT;
-				nmp->nm_sent -= NFS_CWNDSCALE;
+				if (rep->r_flags & R_SENT) {
+					rep->r_flags &= ~R_SENT;
+					nmp->nm_sent -= NFS_CWNDSCALE;
+				}
 				/*
 				 * Update rtt using a gain of 0.125 on the mean
 				 * and a gain of 0.25 on the deviation.
@@ -1384,7 +1387,7 @@
 		if (rep->r_mrep || (rep->r_flags & R_SOFTTERM))
 			continue;
 		if (nfs_sigintr(nmp, rep, rep->r_procp)) {
-			rep->r_flags |= R_SOFTTERM;
+			nfs_softterm(rep);
 			continue;
 		}
 		if (rep->r_rtt >= 0) {
@@ -1412,7 +1415,7 @@
 		}
 		if (rep->r_rexmit >= rep->r_retry) {	/* too many */
 			nfsstats.rpctimeouts++;
-			rep->r_flags |= R_SOFTTERM;
+			nfs_softterm(rep);
 			continue;
 		}
 		if (nmp->nm_sotype != SOCK_DGRAM) {
@@ -1491,6 +1494,27 @@
 	nfs_timer_handle = timeout(nfs_timer, (void *)0, nfs_ticks);
 }
 
+/*
+ * Flag a request as being about to terminate (due to NFSMNT_INT/NFSMNT_SOFT).
+ * The nm_sent count is decremented now to avoid deadlocks when the process in
+ * soreceive() hasn't yet managed to send its own request.
+ */
+static void
+nfs_softterm(rep)
+	struct nfsreq *rep;
+{
+	rep->r_flags |= R_SOFTTERM;
+
+	/*
+	 * Decrement the outstanding request count, and clear R_SENT so
+	 * that the decrement doesn't get done again later.
+	 */
+	if (rep->r_flags & R_SENT) {
+		rep->r_nmp->nm_sent -= NFS_CWNDSCALE;
+		rep->r_flags &= ~R_SENT;
+	}
+}
+	
 
 /*
  * Test for a termination condition pending on the process.

>Release-Note:
>Audit-Trail:
>Unformatted:

