From: Daniel Braniss
To: Michael Tratz
Cc: freebsd-stable@freebsd.org, Rick Macklem, Steven Hartland
Subject: Re: NFS deadlock on 9.2-Beta1
Date: Sat, 27 Jul 2013 09:42:28 +0300

>
> On Jul 24, 2013, at 5:25 PM, Rick Macklem wrote:
>
> > Michael Tratz wrote:
> >> Two machines (NFS Server: running ZFS / Client: disk-less), both are
> >> running FreeBSD r253506. The NFS client starts to deadlock processes
> >> within a few hours. It usually gets worse from there on. The
> >> processes stay in "D" state.
> >> I haven't been able to reproduce it
> >> when I want it to happen. I only have to wait a few hours until the
> >> deadlocks occur when traffic to the client machine starts to pick
> >> up. The only way to fix the deadlocks is to reboot the client. Even
> >> an ls to the path which is deadlocked, will deadlock ls itself. It's
> >> totally random what part of the file system gets deadlocked. The NFS
> >> server itself has no problem at all to access the files/path when
> >> something is deadlocked on the client.
> >>
> >> Last night I decided to put an older kernel on the system r252025
> >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
> >> the client machine (it should have deadlocked by now). FreeBSD is
> >> working hard like it always does. :-) There are a few changes to the
> >> NFS code from the revision which seems to work until Beta1. I
> >> haven't tried to narrow it down if one of those commits are causing
> >> the problem. Maybe someone has an idea what could be wrong and I can
> >> test a patch or if it's something else, because I'm not a kernel
> >> expert. :-)
> >>
> > Well, the only NFS client change committed between r252025 and r253506
> > is r253124. It fixes a file corruption problem caused by a previous
> > commit that delayed the vnode_pager_setsize() call until after the
> > nfs node mutex lock was unlocked.
> >
> > If you can test with only r253124 reverted to see if that gets rid of
> > the hangs, it would be useful, although from the procstats, I doubt it.
> >
> >> I have run several procstat -kk on the processes including the ls
> >> which deadlocked. You can see them here:
> >>
> >> http://pastebin.com/1RPnFT6r
> >
> > All the processes you show seem to be stuck waiting for a vnode lock
> > or in __utmx_op_wait. (I'm not sure what the latter means.)
> >
> > What is missing is what processes are holding the vnode locks and
> > what they are stuck on.
> >
> > A starting point might be "ps axhl", to see what all the threads
> > are doing (particularly the WCHAN for them all). If you can drop into
> > the debugger when the NFS mounts are hung and do a "show alllocks"
> > that could help. See:
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> >
> > I'll admit I'd be surprised if r253124 caused this, but who knows.
> >
> > If there have been changes to your network device driver between
> > r252025 and r253506, I'd try reverting those. (If an RPC gets stuck
> > waiting for a reply while holding a vnode lock, that would do it.)
> >
> > Good luck with it and maybe someone else can think of a commit
> > between r252025 and r253506 that could cause vnode locking or network
> > problems.
> >
> > rick
> >
> >>
> >> I have tried to mount the file system with and without nolockd. It
> >> didn't make a difference. Other than that it is mounted with:
> >>
> >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
> >>
> >> Let me know if you need me to do something else or if some other
> >> output is required. I would have to go back to the problem kernel
> >> and wait until the deadlock occurs to get that information.
> >>
>
> Thanks Rick and Steven for your quick replies.
>
> I spoke too soon regarding r252025 fixing the problem. The same issue started to show up after about 1 day and a few hours of uptime.
>
> "ps axhl" shows all those stuck processes in newnfs
>
> I recompiled the GENERIC kernel for Beta1 with the debugging options:
>
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>
> ps and debugging output:
>
> http://pastebin.com/1v482Dfw
>
> (I only listed processes matching newnfs, if you need the whole list, please let me know)
>
> The first PID showing up having that problem is 14001. Certainly the "show alllocks" command shows interesting information for that PID.
> I looked through the commit history for those files mentioned in the output to see if there is something obvious to me. But I don't know. :-)
> I hope that information helps you to dig deeper into the issue what might be causing those deadlocks.
>
> I did include the pciconf -lv, because you mentioned network device drivers. It's Intel igb. The same hardware is running a kernel from January 19th, 2013 also as an NFS client. That machine is rock solid. No problems at all.
>
> I also went to r251611. That's before r251641 (The NFS FHA changes). Same problem. Here is another debugging output from that kernel:
>
> http://pastebin.com/ryv8BYc4
>
> If I should test something else or provide some other output, please let me know.
>
> Again thank you!
>
> Michael

Just a quick 'me too'. It usually happens on our ftp server, and it's
been happening for a long time. It's diskless, and it happens randomly,
so it's difficult to reproduce. We have many other diskless servers
running quite smoothly.

danny
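
For anyone else collecting data on this, the "ps axhl" check suggested in the thread (look at the wait channel of every stuck process) could be summarized with a small script. This is only a rough sketch: the function name stuck_by_wchan, the column names MWCHAN/WCHAN and STAT, and the sample rows are assumptions for illustration, not output from the affected machines, so adjust them to whatever your ps actually prints.

```python
from collections import Counter


def stuck_by_wchan(ps_text):
    """Count processes in uninterruptible sleep ("D") per wait channel.

    ps_text is the captured text of a long-format ps listing whose first
    non-empty line is the column header.
    """
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return Counter()
    header = lines[0].split()
    # Assumed column names; check them against your own ps header.
    wchan_col = header.index("MWCHAN") if "MWCHAN" in header else header.index("WCHAN")
    stat_col = header.index("STAT")
    counts = Counter()
    for ln in lines[1:]:
        # Split into as many fields as the header has, so that the
        # COMMAND column keeps its embedded spaces.
        fields = ln.split(None, len(header) - 1)
        if fields == header or len(fields) < len(header):
            continue  # repeated header (ps "h" option) or a short line
        if fields[stat_col].startswith("D"):
            counts[fields[wchan_col]] += 1
    return counts


# Hypothetical sample listing, made up for illustration only.
SAMPLE = """\
UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN STAT TT     TIME COMMAND
  0 14001     1   0  20  0 10000 2000 newnfs D    ??  0:00.10 httpd
  0 14002     1   0  20  0 10000 2000 newnfs D    ??  0:00.05 ls /some/path
  0 14003     1   0  20  0 10000 2000 select S    ??  0:01.00 sshd
"""

for wchan, n in stuck_by_wchan(SAMPLE).most_common():
    print(wchan, n)
```

Saving the ps output to a file while the mounts are hung and passing its contents to the function gives a quick count of how many processes sit on each wait channel. It does not show who holds the vnode locks; for that the "show alllocks" output from the debugger is still needed.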