Message-ID: <4E64933E.8030908@incore.de>
Date: Mon, 05 Sep 2011 11:15:42 +0200
From: Andreas Longwitz
To: freebsd-stable@freebsd.org
Subject: UFS_DIRHASH panics on a dozen server within 30 hours

Hi,

a week ago a dozen of my FreeBSD servers crashed within a span of 30 hours. The servers run very different applications, and some of them were only on standby.
All servers run the same FreeBSD 6-STABLE kernel, and there were no problems for years until this "black Monday". Yes, I know that FreeBSD 6 is out of date now, but I don't like to change a system that runs very well. Another reason is that my hardware needs the amr driver, and because the amr_ioctl problem described in kern/155658 is still unresolved, I cannot upgrade my production systems without changing hardware.

Now I have a dozen core dumps and am trying to understand what happened. All dumps look very similar, and the panic is always a "page fault" in _mtx_lock_sleep called from ufsdirhash_recycle or ufsdirhash_free, because the mtx_object in use has been overwritten with zeros by someone before _mtx_lock_sleep is called. A typical stack trace and some kgdb output follow:

(kgdb) where
#0  doadump () at pcpu.h:165
#1  0xc03c5b25 in boot (howto=260) at ../../../kern/kern_shutdown.c:410
#2  0xc03c5e7d in panic (fmt=0xc05931cb "%s") at ../../../kern/kern_shutdown.c:566
#3  0xc0564606 in trap_fatal (frame=0xec6ed77c, eva=256) at ../../../i386/i386/trap.c:838
#4  0xc0563d1e in trap (frame=
      {tf_fs = 8, tf_es = -328335320, tf_ds = -328335320, tf_edi = -901761536,
       tf_esi = 0, tf_ebp = -328280120, tf_isp = -328280152, tf_ebx = -827089920,
       tf_edx = 0, tf_ecx = 2, tf_eax = 1, tf_trapno = 12, tf_err = 0,
       tf_eip = -1069829895, tf_cs = 32, tf_eflags = 65538, tf_esp = -827089920,
       tf_ss = 2}) at ../../../i386/i386/trap.c:270
#5  0xc054ddda in calltrap () at ../../../i386/i386/exception.s:139
#6  0xc03bb0f9 in _mtx_lock_sleep (m=0xceb39c00, tid=3393205760, opts=0, file=0x0, line=0) at ../../../kern/kern_mutex.c:550
#7  0xc04eb3c5 in ufsdirhash_recycle (wanted=57230) at ../../../ufs/ufs/ufs_dirhash.c:1035
#8  0xc04e981b in ufsdirhash_build (ip=0xca6b6084) at ../../../ufs/ufs/ufs_dirhash.c:173
#9  0xc04ebbdd in ufs_lookup (ap=0xec6ed920) at ../../../ufs/ufs/ufs_lookup.c:202
#10 0xc057116c in VOP_CACHEDLOOKUP_APV (vop=0x1, a=0x0) at vnode_if.c:150
#11 0xc04164fa in
vfs_cache_lookup (ap=0x1) at vnode_if.h:82
#12 0xc05710fb in VOP_LOOKUP_APV (vop=0xc05f90a0, a=0xec6ed9c0) at vnode_if.c:99
#13 0xc041add4 in lookup (ndp=0xec6edbcc) at vnode_if.h:56
#14 0xc041a66a in namei (ndp=0xec6edbcc) at ../../../kern/vfs_lookup.c:216
#15 0xc042ec31 in vn_open_cred (ndp=0xec6edbcc, flagp=0xec6edccc, cmode=384, cred=0xc9bceb80, fdidx=97) at ../../../kern/vfs_vnops.c:183
#16 0xc042e982 in vn_open (ndp=0x0, flagp=0xec6edccc, cmode=384, fdidx=97) at ../../../kern/vfs_vnops.c:91
#17 0xc042749a in kern_open (td=0xca403600, path=0x1
, pathseg=UIO_SYSSPACE, flags=1, mode=438) at ../../../kern/vfs_syscalls.c:1016
#18 0xc04271d2 in open (td=0xca403600, uap=0xec6edd04) at ../../../kern/vfs_syscalls.c:971
#19 0xc056494b in syscall (frame=
      {tf_fs = -1082195909, tf_es = -1082195909, tf_ds = -1082195909,
       tf_edi = -1082141792, tf_esi = -1082155856, tf_ebp = -1082151736,
       tf_isp = -328278684, tf_ebx = -1982551028, tf_edx = 41, tf_ecx = 0,
       tf_eax = 5, tf_trapno = 0, tf_err = 2, tf_eip = -2008413713, tf_cs = 51,
       tf_eflags = 642, tf_esp = -1082155972, tf_ss = 59}) at ../../../i386/i386/trap.c:984
#20 0xc054de2f in Xint0x80_syscall () at ../../../i386/i386/exception.s:200

(kgdb) f 8
#8  0xc04e981b in ufsdirhash_build (ip=0xca6b6084) at ../../../ufs/ufs/ufs_dirhash.c:173
173             if (ufsdirhash_recycle(memreqd) != 0)
(kgdb) p *ip
$1 = {i_nextsnap = {tqe_next = 0x0, tqe_prev = 0x0}, i_vnode = 0xca6c0bb0,
  i_ump = 0xc9bd3300, i_flag = 0, i_dev = 0xc9b4f400, i_number = 4686848,
  i_effnlink = 2, i_fs = 0xc9ba5800, i_dquot = {0x0, 0x0},
  i_modrev = 14753454826293, i_lockf = 0x0, i_count = 24, i_endoff = 112640,
  i_diroff = 72704, i_offset = 73056, i_ino = 3357131, i_reclen = 16,
  i_un = {dirhash = 0x0, snapblklist = 0x0}, i_ea_area = 0x0, i_ea_len = 0,
  i_ea_error = 0, i_mode = 16832, i_nlink = 2, i_size = 112640, i_flags = 0,
  i_gen = -1337636365, i_uid = 60, i_gid = 60,
  dinode_u = {din1 = 0xca6c7d00, din2 = 0xca6c7d00}}

(kgdb) f 7
#7  0xc04eb3c5 in ufsdirhash_recycle (wanted=57230) at ../../../ufs/ufs/ufs_dirhash.c:1035
1035            DIRHASH_LOCK(dh);
(kgdb) p dh
$2 = (struct dirhash *) 0xceb39c00
(kgdb) p *dh
$3 = {dh_mtx = {mtx_object = {lo_class = 0x0, lo_name = 0x0, lo_type = 0x0,
      lo_flags = 0, lo_list = {tqe_next = 0x0, tqe_prev = 0x0},
      lo_witness = 0x0}, mtx_lock = 2, mtx_recurse = 0}, dh_hash = 0x0,
  dh_narrays = 0, dh_hlen = 0, dh_hused = 0, dh_blkfree = 0x0, dh_nblk = 0,
  dh_dirblks = 0, dh_firstfree = { 0 , -16777216, -1 }, dh_seqopt = 1,
  dh_seqoff = 3440, dh_score = 64, dh_onlist = 1,
  dh_list = {tqe_next = 0xcf919a00, tqe_prev
= 0xc063cfb0}}

(kgdb) f 6
#6  0xc03bb0f9 in _mtx_lock_sleep (m=0xceb39c00, tid=3393205760, opts=0, file=0x0, line=0) at ../../../kern/kern_mutex.c:550
550             if (m != &Giant && TD_IS_RUNNING(owner)) {
(kgdb) p m
$4 = (struct mtx *) 0xceb39c00
(kgdb) p *m
$5 = {mtx_object = {lo_class = 0x0, lo_name = 0x0, lo_type = 0x0, lo_flags = 0,
    lo_list = {tqe_next = 0x0, tqe_prev = 0x0}, lo_witness = 0x0},
  mtx_lock = 2, mtx_recurse = 0}
(kgdb) p &Giant
$6 = (struct mtx *) 0xc062a0e0
(kgdb) p owner
$7 = (volatile struct thread *) 0x0
(kgdb) info local
owner = (volatile struct thread *) 0x0
v = 0
(kgdb) list
545              */
546             owner = (struct thread *)(v & MTX_FLAGMASK);
547     #ifdef ADAPTIVE_GIANT
548             if (TD_IS_RUNNING(owner)) {
549     #else
550             if (m != &Giant && TD_IS_RUNNING(owner)) {
551     #endif

The crash occurs at line 550 because owner is zero, whereas it should point to the thread that holds the dirhash mutex. When _mtx_lock_sleep is called, the mtx_object is already filled with zeros; in particular, mtx_lock should be 4 (UNOWNED) or the thread id of the owner.

What could be the reason that the panics never occurred before and then hit a dozen servers within a short time? There have been no further crashes for a week now.

Any hints are welcome.

-- 
Dr. Andreas Longwitz

Data Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
Amtsgericht Lübeck, HRB 318 BS
Geschäftsführer: Wilfried Paepcke, Dr. Andreas Longwitz, Josef Flatau