From: Robert Watson
Date: Sat, 19 Feb 2005 12:23:48 +0000 (GMT)
To: David Rice
cc: freebsd-performance@freebsd.org
Subject: Re: High traffic NFS performance and availability problems
In-Reply-To: <200502171636.10361.drice@globat.com>

On Thu, 17 Feb 2005, David Rice wrote:

> Typically we have 7 client boxes mounting storage from a single file
> server. Each client box serves 1000 web sites and associated email. We
> have done the basic NFS tuning (ie: Read write size optimization and
> kernel tuning)

How many nfsd's are you running with?

If you run systat -vmstat 1 on your server under high load, could you
send us the output? In particular, I'm interested in knowing how the
system is spending its time, the paging level, and the I/O throughput on
devices; the systat -vmstat summary screen provides a good summary of
this and more. A few snapshots of "gstat" output would also be very
helpful, as would a snapshot or two of "top -S" output.
This will give us a picture of how the system is spending its time.

> 2. Client boxes have high load averages and sometimes crashes due to
> slow NFS performance.

Could you be more specific about the crash failure mode?

> 3. File servers that randomly crash with "Fatal trap 12: page fault
> while in kernel mode"

Could you make sure you're running with at least the latest 5.3 patch
level on the server, which includes some NFS server stability fixes, and
also look at sliding to the head of 5-STABLE? There are a number of
performance and stability improvements that may be relevant there.

Could you provide serial console output of the full panic message and
trap details, compile the kernel with KDB+DDB, and include a full stack
trace? I'm happy to try to help debug these problems.

> 4. With soft updates enabled during FSCK the fileserver will freeze with
> all NFS processes in the "snaplck" state. We disabled soft updates
> because of this.

If it's possible to get some more information, it would be quite
helpful. In particular, if you could compile the server box with
DDB+KDB+BREAK_TO_DEBUGGER, break into the serial debugger when it
appears wedged, and send the output of "show lockedvnods", "ps", and
"trace " for any processes listed in the "show lockedvnods" output, that
would be great. A crash dump would also be very helpful. For some hints
on the information that is necessary here, take a look at the handbook
chapter on kernel debugging and reporting kernel bugs, and my recent
post to current@ diagnosing a similar bug.

If you re-enable soft updates but leave bgfsck disabled, does that
correct this stability problem?

In any case, I'm happy to help try to figure out what's going on -- some
of the above information for stability and performance problems would be
quite helpful in tracking it down.

Robert N M Watson