From owner-freebsd-performance@FreeBSD.ORG  Sat Feb 19 12:25:19 2005
Date: Sat, 19 Feb 2005 12:23:48 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: David Rice <drice@globat.com>
In-Reply-To: <200502171636.10361.drice@globat.com>
Message-ID: <Pine.NEB.3.96L.1050219121641.67347L-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: freebsd-performance@freebsd.org
Subject: Re: High traffic NFS performance and availability problems


On Thu, 17 Feb 2005, David Rice wrote:

> Typically we have 7 client boxes mounting storage from a single file
> server.  Each client box serves 1000 web sites and associated email.  We
> have done the basic NFS tuning (i.e., read/write size optimization and
> kernel tuning).

How many nfsd processes are you running?

If you run "systat -vmstat 1" on your server under high load, could you send
us the output?  I'm interested in how the system is spending its time, the
paging level, and the I/O throughput on the devices; the systat -vmstat
screen summarizes all of this and more.  A few snapshots of "gstat" output
and a snapshot or two of "top -S" output would also be very helpful.
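
For unattended capture, something along these lines might help.  It is only
a sketch: the function names run_snapshot and collect_nfs_diagnostics are
made up for illustration, and because systat and gstat are interactive
screens, it swaps in the batch tools vmstat, iostat and nfsstat alongside
"top -S -b" as an approximation, not a replacement, for the output requested
above.

```sh
#!/bin/sh
# run_snapshot LOGFILE CMD [ARGS...]
# Append a timestamped header, the command's output (stdout and stderr),
# and its exit status to LOGFILE.  (Hypothetical helper, not a system tool.)
run_snapshot() {
    log=$1; shift
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') :: $*"
        "$@" 2>&1
        echo "--- exit status: $?"
    } >> "$log"
}

# collect_nfs_diagnostics LOGFILE
# One round of batch-mode snapshots, meant to be run on the server
# while it is under high load.
collect_nfs_diagnostics() {
    out=$1
    run_snapshot "$out" top -S -b -d 1       # one batch display, system processes included
    run_snapshot "$out" vmstat -w 1 -c 5     # five one-second samples
    run_snapshot "$out" iostat -w 1 -c 5     # per-device I/O over the same interval
    run_snapshot "$out" nfsstat -s           # NFS server-side counters
}

# Example (the log path is an assumption):
#   collect_nfs_diagnostics /var/tmp/nfs-diag.log
```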

> 2. Client boxes have high load averages and sometimes crash due to
> slow NFS performance. 

Could you be more specific about the crash failure mode?

> 3. File servers that randomly crash with "Fatal trap 12: page fault
> while in kernel mode" 

Could you make sure you're running with at least the latest 5.3 patch
level on the server, which includes some NFS server stability fixes, and
also look at sliding to the head of 5-STABLE?  There are a number of
performance and stability improvements that may be relevant there.

Could you compile the kernel with KDB+DDB and provide serial console output
of the full panic message and trap details, along with a full stack trace?
I'm happy to try to help debug these problems.

> 4. With soft updates enabled during FSCK the fileserver will freeze with
> all NFS processes in the "snaplck" state. We disabled soft updates
> because of this. 

If it's possible to get some more information, it would be quite helpful.
In particular, could you compile the server kernel with
DDB+KDB+BREAK_TO_DEBUGGER, break into the serial debugger when the box
appears wedged, and send the output of "show lockedvnods", "ps", and "trace
<pid>" for any processes listed in the "show lockedvnods" output?  A crash
dump would also be very helpful.  For some hints on the information that is
necessary here, take a look at the handbook chapter on kernel debugging and
reporting kernel bugs, and my recent post to current@ diagnosing a similar
bug.
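
A minimal sketch of adding those options to a kernel configuration file
follows.  The helper name add_debug_options, the configuration name
DEBUGKERN, and the i386 path in the comments are assumptions for
illustration; adjust them for your architecture and source tree.

```sh
#!/bin/sh
# add_debug_options KERNCONF_FILE
# Append the KDB, DDB and BREAK_TO_DEBUGGER options to a kernel
# configuration file, skipping any option that is already present.
# (Hypothetical helper, not part of the base system.)
add_debug_options() {
    conf=$1
    for opt in KDB DDB BREAK_TO_DEBUGGER; do
        grep -Eq "^options[[:space:]]+${opt}([[:space:]]|\$)" "$conf" ||
            printf 'options\t%s\n' "$opt" >> "$conf"
    done
}

# Example, assuming an i386 server and a copy of GENERIC named DEBUGKERN:
#   cd /usr/src/sys/i386/conf && cp GENERIC DEBUGKERN
#   add_debug_options DEBUGKERN
#   cd /usr/src && make buildkernel KERNCONF=DEBUGKERN \
#       && make installkernel KERNCONF=DEBUGKERN
```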

If you re-enable soft updates but leave bgfsck disabled, does that correct
this stability problem?
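
One way to set that up, sketched under assumptions: background fsck is
controlled by the background_fsck variable in /etc/rc.conf, and soft updates
are toggled per filesystem with tunefs.  The helper name set_rc_var and the
device name in the comments are hypothetical.

```sh
#!/bin/sh
# set_rc_var FILE NAME VALUE
# Replace (or add) a NAME="VALUE" assignment in an rc.conf-style file,
# leaving every other line untouched.  (Hypothetical helper.)
set_rc_var() {
    file=$1 name=$2 value=$3
    tmp=$(mktemp) || return 1
    [ -f "$file" ] || : > "$file"
    # Keep every line except an existing assignment of this variable.
    grep -v "^${name}=" "$file" > "$tmp"
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example:
#   set_rc_var /etc/rc.conf background_fsck NO
#   # With the filesystem unmounted or mounted read-only
#   # (the device name is an assumption):
#   tunefs -n enable /dev/da0s1e
```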

In any case, I'm happy to help try to figure out what's going on -- some
of the above information for stability and performance problems would be
quite helpful in tracking it down.

Robert N M Watson