From: Peter Jeremy <peter.jeremy@alcatel.com.au>
To: freebsd-stable@freebsd.org
Date: Tue, 23 Dec 2003 13:42:05 +1100
Subject: 4.9p1 deadlock on "inode"
Message-ID: <20031223024205.GA45693@gsmx07.alcatel.com.au>

This morning I found one of my systems would not let me log in or issue
commands, but it still seemed to be running.  ddb showed that lots of
processes were waiting on "inode".  I forced a crash dump and found 166
processes in total, 95 of them waiting on "inode" and 94 of those on the
same wchan:

(kgdb) p *(struct lock *)0xc133eb00
$9 = {lk_interlock = {lock_data = 0}, lk_flags = 0x200440,
  lk_sharecount = 0, lk_waitcount = 94, lk_exclusivecount = 1,
  lk_prio = 8, lk_wmesg = 0xc02b0a8a "inode", lk_timo = 101,
  lk_lockholder = 304}
(kgdb)

The lockholder (pid 304) is cron - that process is itself waiting on
"inode", but on a different lock:

(kgdb) p *(struct lock *)0xc1901a00
$10 = {lk_interlock = {lock_data = 0}, lk_flags = 0x200440,
  lk_sharecount = 0, lk_waitcount = 1, lk_exclusivecount = 1,
  lk_prio = 8, lk_wmesg = 0xc02b0a8a "inode", lk_timo = 101,
  lk_lockholder = 15123}
(kgdb)

Pid 15123 is another cron process, waiting on "vlruwk" because there are
too many vnodes in use:

(kgdb) p numvnodes
$12 = 8904
(kgdb) p freevnodes
$13 = 24
(kgdb) p desiredvnodes
$14 = 8879

The vnlru process is waiting on "vlrup", with vnlru_nowhere = 18209.

Looking through the mountlist, mnt_nvnodelistsize was sane on all
filesystems except one (/mnt), where it was 8613 - 97% of all vnodes.
Only one process was actively using files in /mnt, though some other
processes may have been using it for $PWD or similar.  That process was
scanning most of the files in /mnt (about 750,000), checking for files
with identical content - basically, all files that could potentially be
the same (eg the same length) are mmap'd and compared.
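For reference, the comparison step looks roughly like this - a minimal
sketch only, not the actual scanning program; the pairwise structure and
the names here are invented for illustration:

/*
 * Minimal sketch of the comparison step described above: two candidate
 * files already known to have the same length are mmap'd read-only and
 * compared byte for byte.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if both files have identical contents, 0 if they differ. */
static int
same_contents(const char *path1, const char *path2, size_t len)
{
	int fd1, fd2, same;
	void *p1, *p2;

	if ((fd1 = open(path1, O_RDONLY)) == -1)
		err(1, "%s", path1);
	if ((fd2 = open(path2, O_RDONLY)) == -1)
		err(1, "%s", path2);

	/* Each mapping references the file's backing object (and hence
	 * its vnode) until it is munmap'd, even after the descriptors
	 * are closed. */
	p1 = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd1, 0);
	p2 = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd2, 0);
	if (p1 == MAP_FAILED || p2 == MAP_FAILED)
		err(1, "mmap");

	same = (memcmp(p1, p2, len) == 0);

	munmap(p1, len);
	munmap(p2, len);
	close(fd1);
	close(fd2);
	return (same);
}

int
main(int argc, char *argv[])
{
	struct stat sb1, sb2;

	if (argc != 3)
		errx(1, "usage: cmpfiles file1 file2");
	if (stat(argv[1], &sb1) == -1 || stat(argv[2], &sb2) == -1)
		err(1, "stat");
	/* Only files of equal, non-zero size are candidates. */
	if (sb1.st_size != sb2.st_size || sb1.st_size == 0)
		printf("different (size)\n");
	else if (same_contents(argv[1], argv[2], (size_t)sb1.st_size))
		printf("identical\n");
	else
		printf("different\n");
	return (0);
}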
The scanning process had 2816 entries in its vm_map.  (It has just
occurred to me that there would be one set of data appearing in a large
number of files (~30000), but I would have expected that to result in an
error from mmap(), not a deadlock.)

Scanning through the mnt_nvnodelist on /mnt:
  5797 entries were for directories with entries in v_cache_src
  2804 entries were for files with a usecount > 0
    11 entries were for directories with VFREE|VDOOMED|VXLOCK
     1 VNON entry

This means that none of the vnodes in /mnt were available for recycling
(and the total number of vnodes on the other filesystems would not be
enough to reach the hysteresis point that unblocks vnode allocation).

I can understand that an mmap'd file holds a usecount on the file's
vnode, but my understanding is that vnodes with v_cache_src entries
should be able to be recycled (though this will slow down namei()).
If so, should vnlru grow a "try harder" loop that recycles those vnodes
when it winds up stuck like this?

I notice vlrureclaim() contains the comment "don't set kern.maxvnodes
too low".  In this case it is auto-tuned based on 128MB RAM and
"maxusers=0".  Maybe that is too low for my purposes, but it would be
much nicer if the system handled this situation gracefully rather than
by deadlocking.

And finally, a question on vlrureclaim(): why does it scan through
mnt_nvnodelist and perform a TAILQ_REMOVE(), TAILQ_INSERT_TAIL() on
each vnode?  Wouldn't it be cheaper to just scan the list, rather than
moving every node to the end of the list?

Peter
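For reference, the two traversal styles being compared are roughly these,
in a minimal userland sketch built on the <sys/queue.h> TAILQ macros.
The node type, field names and iteration count are invented for
illustration; this is not the actual vlrureclaim() code:

/*
 * Sketch of two ways to walk a TAILQ: rotate each examined node to the
 * tail versus a plain in-place walk.
 */
#include <sys/queue.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
	int			id;
	TAILQ_ENTRY(node)	entries;
};
TAILQ_HEAD(nodelist, node);

/* Style 1: always take the head, rotate it to the tail, then look at it.
 * The loop only ever dereferences the list head, so it does not need to
 * keep an iterator valid between steps; examined nodes migrate to the
 * end of the list. */
static void
scan_rotate(struct nodelist *head, int count)
{
	struct node *n;

	while (count-- > 0 && (n = TAILQ_FIRST(head)) != NULL) {
		TAILQ_REMOVE(head, n, entries);
		TAILQ_INSERT_TAIL(head, n, entries);
		printf("rotate: looking at node %d\n", n->id);
	}
}

/* Style 2: a plain walk; cheaper per step (no list surgery), but the
 * iterator must remain valid for the whole scan. */
static void
scan_foreach(struct nodelist *head)
{
	struct node *n;

	TAILQ_FOREACH(n, head, entries)
		printf("foreach: looking at node %d\n", n->id);
}

int
main(void)
{
	struct nodelist head = TAILQ_HEAD_INITIALIZER(head);
	struct node *n;
	int i;

	for (i = 0; i < 5; i++) {
		n = malloc(sizeof(*n));
		n->id = i;
		TAILQ_INSERT_TAIL(&head, n, entries);
	}
	scan_rotate(&head, 3);
	scan_foreach(&head);
	return (0);
}

One property of the rotate-to-tail style is that it never holds a pointer
into the middle of the list between iterations; whether that property is
worth the extra list manipulation is exactly the question above.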