From owner-freebsd-fs@FreeBSD.ORG Fri Jun 10 12:59:40 2011
Date: Fri, 10 Jun 2011 12:59:39 +0000
From: John
To: Rick Macklem
Cc: freebsd-fs@freebsd.org
Subject: Re: New NFS server stress test hang
Message-ID: <20110610125939.GA69616@FreeBSD.org>
In-Reply-To: <1069270455.338453.1307636209760.JavaMail.root@erie.cs.uoguelph.ca>
References: <20110609133805.GA78874@FreeBSD.org> <1069270455.338453.1307636209760.JavaMail.root@erie.cs.uoguelph.ca>

----- Rick Macklem's Original Message -----
> John De wrote:
> > ----- Rick Macklem's Original Message -----
> > > John De wrote:
> > > > Hi,
> > > >
> > > > We've been running some stress tests of the new nfs server.
> > > > The system is at r222531 (head), 9 clients, two mounts each
> > > > to the server:
> > > >
> > > > mount_nfs -o udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=2 ${servera}:/vol/datsrc /c/$servera/vol/datsrc
> > > > mount_nfs -o udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=0 ${servera}:/vol/datgen /c/$servera/vol/datgen
> > > >
> > > > The system is still up & responsive, simply no nfs services
> > > > are working. All (200) threads appear to be active, but not
> > > > doing anything. The debugger is not compiled into this kernel.
> > > > We can run any other tracing commands desired. We can also
> > > > rebuild the kernel with the debugger enabled for any kernel
> > > > debugging needed.
> > > >
> > > > --- long logs deleted ---
> > >
> > > How about a:
> > > ps axHlww  <-- With the "H" we'll see what the nfsd server threads are up to
> > > procstat -kka
> > >
> > > Oh, and a couple of nfsstats a few seconds apart. It's what the counts
> > > are changing by that might tell us what is going on. (You can use "-z"
> > > to zero them out, if you have an nfsstat built from recent sources.)
> > >
> > > Also, does a new NFS mount attempt against the server do anything?
> > >
> > > Thanks in advance for help with this, rick
> >
> > Hi Rick,
> >
> > Here's the output. In general, the nfsd processes appear to be in
> > either nfsrvd_getcache (35 instances) or nfsrvd_updatecache (164),
> > sleeping on "nfssrc". The server numbers don't appear to be moving.
> > A showmount from a client system works, but a mount does not (see below).
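For anyone wanting to collect the same diagnostics, a rough sketch of a
collection script is below (assuming an nfsstat recent enough to have "-z";
the 5-second gap between samples is arbitrary):

    #!/bin/sh
    # Rough sketch: snapshot nfsd thread state, kernel stacks, and two
    # nfsstat samples a few seconds apart so the counter deltas are visible.
    ps axHlww > ps.out            # -H lists every nfsd kernel thread
    procstat -kka > procstat.out  # kernel stacks for all threads
    nfsstat > nfsstat.1           # (or run "nfsstat -z" first, with a recent nfsstat)
    sleep 5
    nfsstat > nfsstat.2

Comparing nfsstat.1 and nfsstat.2 shows which counters, if any, are still
moving.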
> Please try the attached patch and let me know if it helps. When I looked
> I found several places where the rc_flag variable was being fiddled without the
> mutex held. I suspect one of these resulted in the RC_LOCKED flag not
> getting cleared, so all the threads got stuck waiting on it.
>
> The patch is at:
>   http://people.freebsd.org/~rmacklem/cache.patch
> in case it gets eaten by the list handler.
>
> Thanks for digging into this, rick

Hi Rick,

   Patch applied. The system has been up and running for about 16 hours
now, and so far it's still handling the load quite nicely.

last pid: 15853;  load averages:  5.36,  4.64,  4.48   up 0+16:08:16  08:48:07
72 processes:  7 running, 65 sleeping
CPU:  % user,  % nice,  % system,  % interrupt,  % idle
Mem: 22M Active, 3345M Inact, 79G Wired, 9837M Buf, 11G Free
Swap:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 2049 root       26  52    0 10052K  1712K CPU3    3  97:21 942.24% nfsd

   I'll follow up again in 24 hours with another status.

   Are there any performance-related numbers or knobs we can provide that
might be of interest?

Thanks, Rick.

-John
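P.S. For anyone else hitting the same hang, a rough sketch of fetching and
applying Rick's patch and rebuilding the kernel follows. The patch(1) strip
level and the GENERIC config name are assumptions here; adjust them for the
local tree and kernel config:

    #!/bin/sh
    # Sketch only: fetch the cache patch, apply it to the source tree,
    # and rebuild/install the kernel.
    cd /usr/src
    fetch http://people.freebsd.org/~rmacklem/cache.patch
    patch < cache.patch
    make buildkernel KERNCONF=GENERIC
    make installkernel KERNCONF=GENERIC
    shutdown -r now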