From: Peter Jeremy <peter.jeremy@alcatel.com.au>
To: freebsd-stable@freebsd.org
Date: Tue, 23 Dec 2003 13:42:05 +1100
Subject: 4.9p1 deadlock on "inode"
Message-ID: <20031223024205.GA45693@gsmx07.alcatel.com.au>

This morning I found one of my systems would not let me log in or issue
commands, but it still seemed to be running.  ddb showed that lots of
processes were waiting on "inode".  I forced a crash dump and found 166
processes in total, 95 of them waiting on "inode" and 94 of those on the
same wchan:

(kgdb) p *(struct lock *)0xc133eb00
$9 = {lk_interlock = {lock_data = 0}, lk_flags = 0x200440,
  lk_sharecount = 0, lk_waitcount = 94, lk_exclusivecount = 1,
  lk_prio = 8, lk_wmesg = 0xc02b0a8a "inode", lk_timo = 101,
  lk_lockholder = 304}
(kgdb)

The lockholder (pid 304) is cron - that process is itself waiting on
"inode", but on a different lock:

(kgdb) p *(struct lock *)0xc1901a00
$10 = {lk_interlock = {lock_data = 0}, lk_flags = 0x200440,
  lk_sharecount = 0, lk_waitcount = 1, lk_exclusivecount = 1,
  lk_prio = 8, lk_wmesg = 0xc02b0a8a "inode", lk_timo = 101,
  lk_lockholder = 15123}
(kgdb)

Pid 15123 is another cron process, waiting on "vlruwk" because there are
too many vnodes in use:

(kgdb) p numvnodes
$12 = 8904
(kgdb) p freevnodes
$13 = 24
(kgdb) p desiredvnodes
$14 = 8879

The vnlru process is waiting on "vlrup", with vnlru_nowhere = 18209.

Looking through the mountlist, mnt_nvnodelistsize was sane on all
filesystems except one (/mnt), where it was 8613 - 97% of all vnodes.
Only one process was actively using files in /mnt, though some other
processes may have been using it for $PWD or similar.  That process was
scanning most of the files in /mnt (about 750,000), checking for files
with identical content - basically, all files that could potentially be
the same (eg the same length) are mmap'd and compared.
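For reference, the comparison step looks roughly like this - a minimal
sketch only, not the actual scanning program; the pairwise structure and
the names here are invented for illustration:

/*
 * Minimal sketch of the comparison step described above: two candidate
 * files already known to have the same length are mmap'd read-only and
 * compared byte for byte.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if both files have identical contents, 0 if they differ. */
static int
same_contents(const char *path1, const char *path2, size_t len)
{
	int fd1, fd2, same;
	void *p1, *p2;

	if ((fd1 = open(path1, O_RDONLY)) == -1)
		err(1, "%s", path1);
	if ((fd2 = open(path2, O_RDONLY)) == -1)
		err(1, "%s", path2);

	/* Each mapping references the file's backing object (and hence
	 * its vnode) until it is munmap'd, even after the descriptors
	 * are closed. */
	p1 = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd1, 0);
	p2 = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd2, 0);
	if (p1 == MAP_FAILED || p2 == MAP_FAILED)
		err(1, "mmap");

	same = (memcmp(p1, p2, len) == 0);

	munmap(p1, len);
	munmap(p2, len);
	close(fd1);
	close(fd2);
	return (same);
}

int
main(int argc, char *argv[])
{
	struct stat sb1, sb2;

	if (argc != 3)
		errx(1, "usage: cmpfiles file1 file2");
	if (stat(argv[1], &sb1) == -1 || stat(argv[2], &sb2) == -1)
		err(1, "stat");
	/* Only files of equal, non-zero size are candidates. */
	if (sb1.st_size != sb2.st_size || sb1.st_size == 0)
		printf("different (size)\n");
	else if (same_contents(argv[1], argv[2], (size_t)sb1.st_size))
		printf("identical\n");
	else
		printf("different\n");
	return (0);
}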
The scanning process had 2816 entries in its vm_map.  (It has just
occurred to me that there would be one set of data appearing in a large
number of files (~30000), but I would have expected that to result in an
error from mmap(), not a deadlock.)

Scanning through the mnt_nvnodelist on /mnt:
  5797 entries were for directories with entries in v_cache_src
  2804 entries were for files with a usecount > 0
    11 entries were for directories with VFREE|VDOOMED|VXLOCK
     1 VNON entry

This means that none of the vnodes in /mnt were available for recycling
(and the total number of vnodes on the other filesystems would not be
enough to reach the hysteresis point that unblocks vnode allocation).

I can understand that an mmap'd file holds a usecount on the file's
vnode, but my understanding is that vnodes with v_cache_src entries
should be able to be recycled (though this will slow down namei()).
If so, should vnlru grow a "try harder" loop that recycles those vnodes
when it winds up stuck like this?

I notice vlrureclaim() contains the comment "don't set kern.maxvnodes
too low".  In this case it is auto-tuned based on 128MB RAM and
"maxusers=0".  Maybe that is too low for my purposes, but it would be
much nicer if the system handled this situation gracefully rather than
by deadlocking.

And finally, a question on vlrureclaim(): why does it scan through
mnt_nvnodelist and perform a TAILQ_REMOVE(), TAILQ_INSERT_TAIL() on
each vnode?  Wouldn't it be cheaper to just scan the list, rather than
moving every node to the end of the list?

Peter
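For reference, the two traversal styles being compared are roughly these,
in a minimal userland sketch built on the <sys/queue.h> TAILQ macros.
The node type, field names and iteration count are invented for
illustration; this is not the actual vlrureclaim() code:

/*
 * Sketch of two ways to walk a TAILQ: rotate each examined node to the
 * tail versus a plain in-place walk.
 */
#include <sys/queue.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
	int			id;
	TAILQ_ENTRY(node)	entries;
};
TAILQ_HEAD(nodelist, node);

/* Style 1: always take the head, rotate it to the tail, then look at it.
 * The loop only ever dereferences the list head, so it does not need to
 * keep an iterator valid between steps; examined nodes migrate to the
 * end of the list. */
static void
scan_rotate(struct nodelist *head, int count)
{
	struct node *n;

	while (count-- > 0 && (n = TAILQ_FIRST(head)) != NULL) {
		TAILQ_REMOVE(head, n, entries);
		TAILQ_INSERT_TAIL(head, n, entries);
		printf("rotate: looking at node %d\n", n->id);
	}
}

/* Style 2: a plain walk; cheaper per step (no list surgery), but the
 * iterator must remain valid for the whole scan. */
static void
scan_foreach(struct nodelist *head)
{
	struct node *n;

	TAILQ_FOREACH(n, head, entries)
		printf("foreach: looking at node %d\n", n->id);
}

int
main(void)
{
	struct nodelist head = TAILQ_HEAD_INITIALIZER(head);
	struct node *n;
	int i;

	for (i = 0; i < 5; i++) {
		n = malloc(sizeof(*n));
		n->id = i;
		TAILQ_INSERT_TAIL(&head, n, entries);
	}
	scan_rotate(&head, 3);
	scan_foreach(&head);
	return (0);
}

One property of the rotate-to-tail style is that it never holds a pointer
into the middle of the list between iterations; whether that property is
worth the extra list manipulation is exactly the question above.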