From owner-freebsd-stable@FreeBSD.ORG  Mon Sep  5 09:35:07 2011
Message-ID: <4E64933E.8030908@incore.de>
Date: Mon, 05 Sep 2011 11:15:42 +0200
From: Andreas Longwitz <longwitz@incore.de>
User-Agent: Thunderbird 2.0.0.19 (X11/20090113)
MIME-Version: 1.0
To: freebsd-stable@freebsd.org
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 8bit
Subject: UFS_DIRHASH panics on a dozen server within 30 hours

Hi,

A week ago a dozen of my FreeBSD servers crashed within a span of
30 hours. The servers run very different applications, and some of
them were only on standby. All servers run the same FreeBSD 6-STABLE
kernel, and there had been no problems for years until that
"black Monday".

Yes, I know that FreeBSD 6 is out of date now, but I do not like to
change a system that runs very well. Another reason is that my
hardware needs the amr driver, and because the amr_ioctl problem
described in kern/155658 is still unresolved, I cannot upgrade my
production systems without changing hardware.

Now I have a dozen core dumps and am trying to understand what
happened. All dumps look very similar, and the panic is always a
"page fault" in _mtx_lock_sleep, called from ufsdirhash_recycle or
ufsdirhash_free, because the mtx_object of the mutex in use has been
overwritten with zeros by someone before _mtx_lock_sleep is called.

A typical stack trace and some kgdb output follow:

(kgdb) where
#0  doadump () at pcpu.h:165
#1  0xc03c5b25 in boot (howto=260)
               at ../../../kern/kern_shutdown.c:410
#2  0xc03c5e7d in panic (fmt=0xc05931cb "%s")
               at ../../../kern/kern_shutdown.c:566
#3  0xc0564606 in trap_fatal (frame=0xec6ed77c, eva=256)
               at ../../../i386/i386/trap.c:838
#4  0xc0563d1e in trap (frame=
      {tf_fs = 8, tf_es = -328335320, tf_ds = -328335320, tf_edi =
      -901761536, tf_esi = 0, tf_ebp = -328280120, tf_isp = -328280152,
      tf_ebx = -827089920, tf_edx = 0, tf_ecx = 2, tf_eax = 1,
      tf_trapno = 12, tf_err = 0, tf_eip = -1069829895, tf_cs = 32,
      tf_eflags = 65538, tf_esp = -827089920, tf_ss = 2})
               at ../../../i386/i386/trap.c:270
#5  0xc054ddda in calltrap () at ../../../i386/i386/exception.s:139
#6  0xc03bb0f9 in _mtx_lock_sleep (m=0xceb39c00, tid=3393205760,
      opts=0, file=0x0, line=0)
               at ../../../kern/kern_mutex.c:550
#7  0xc04eb3c5 in ufsdirhash_recycle (wanted=57230)
               at ../../../ufs/ufs/ufs_dirhash.c:1035
#8  0xc04e981b in ufsdirhash_build (ip=0xca6b6084)
               at ../../../ufs/ufs/ufs_dirhash.c:173
#9  0xc04ebbdd in ufs_lookup (ap=0xec6ed920)
               at ../../../ufs/ufs/ufs_lookup.c:202
#10 0xc057116c in VOP_CACHEDLOOKUP_APV (vop=0x1, a=0x0)
               at vnode_if.c:150
#11 0xc04164fa in vfs_cache_lookup (ap=0x1)
               at vnode_if.h:82
#12 0xc05710fb in VOP_LOOKUP_APV (vop=0xc05f90a0, a=0xec6ed9c0)
               at vnode_if.c:99
#13 0xc041add4 in lookup (ndp=0xec6edbcc)
               at vnode_if.h:56
#14 0xc041a66a in namei (ndp=0xec6edbcc)
               at ../../../kern/vfs_lookup.c:216
#15 0xc042ec31 in vn_open_cred (ndp=0xec6edbcc, flagp=0xec6edccc,
      cmode=384, cred=0xc9bceb80, fdidx=97)
               at ../../../kern/vfs_vnops.c:183
#16 0xc042e982 in vn_open (ndp=0x0, flagp=0xec6edccc, cmode=384,
      fdidx=97)
               at ../../../kern/vfs_vnops.c:91
#17 0xc042749a in kern_open (td=0xca403600, path=0x1 <Address 0x1
       out of bounds>, pathseg=UIO_SYSSPACE, flags=1, mode=438)
               at ../../../kern/vfs_syscalls.c:1016
#18 0xc04271d2 in open (td=0xca403600, uap=0xec6edd04)
               at ../../../kern/vfs_syscalls.c:971
#19 0xc056494b in syscall (frame=
      {tf_fs = -1082195909, tf_es = -1082195909, tf_ds = -1082195909,
      tf_edi = -1082141792, tf_esi = -1082155856, tf_ebp = -1082151736,
      tf_isp = -328278684, tf_ebx = -1982551028, tf_edx = 41,
      tf_ecx = 0, tf_eax = 5, tf_trapno = 0, tf_err = 2, tf_eip =
      -2008413713, tf_cs = 51, tf_eflags = 642, tf_esp = -1082155972,
      tf_ss = 59})
               at ../../../i386/i386/trap.c:984
#20 0xc054de2f in Xint0x80_syscall ()
               at ../../../i386/i386/exception.s:200

(kgdb) f 8
#8  0xc04e981b in ufsdirhash_build (ip=0xca6b6084)
               at ../../../ufs/ufs/ufs_dirhash.c:173
173                     if (ufsdirhash_recycle(memreqd) != 0)
(kgdb) p *ip
$1 = {i_nextsnap = {tqe_next = 0x0, tqe_prev = 0x0}, i_vnode =
  0xca6c0bb0, i_ump = 0xc9bd3300, i_flag = 0, i_dev = 0xc9b4f400,
  i_number = 4686848, i_effnlink = 2, i_fs = 0xc9ba5800, i_dquot
  = {0x0, 0x0}, i_modrev = 14753454826293, i_lockf = 0x0, i_count = 24,
  i_endoff = 112640, i_diroff = 72704, i_offset = 73056, i_ino =3357131,
  i_reclen = 16, i_un = {dirhash = 0x0, snapblklist = 0x0}, i_ea_area
  = 0x0, i_ea_len = 0, i_ea_error = 0, i_mode = 16832, i_nlink = 2,
  i_size = 112640, i_flags = 0, i_gen = -1337636365, i_uid = 60,
  i_gid = 60, dinode_u = {din1 = 0xca6c7d00, din2 = 0xca6c7d00}}

(kgdb) f 7
#7  0xc04eb3c5 in ufsdirhash_recycle (wanted=57230)
               at ../../../ufs/ufs/ufs_dirhash.c:1035
1035                    DIRHASH_LOCK(dh);
(kgdb) p dh
$2 = (struct dirhash *) 0xceb39c00
(kgdb) p *dh
$3 = {dh_mtx = {mtx_object = {lo_class = 0x0, lo_name = 0x0, lo_type
  = 0x0, lo_flags = 0, lo_list = { tqe_next = 0x0, tqe_prev = 0x0},
  lo_witness = 0x0}, mtx_lock = 2, mtx_recurse = 0}, dh_hash = 0x0,
  dh_narrays = 0, dh_hlen = 0, dh_hused = 0, dh_blkfree = 0x0, dh_nblk
  = 0, dh_dirblks = 0, dh_firstfree = { 0 <repeats 46 times>, -16777216,
  -1 <repeats 21 times>}, dh_seqopt = 1, dh_seqoff = 3440, dh_score =64,
  dh_onlist = 1, dh_list = {tqe_next = 0xcf919a00, tqe_prev = 0xc063cfb0

(kgdb) f 6
#6  0xc03bb0f9 in _mtx_lock_sleep (m=0xceb39c00, tid=3393205760, opts=0,
    file=0x0, line=0) at ../../../kern/kern_mutex.c:550
550                     if (m != &Giant && TD_IS_RUNNING(owner)) {
(kgdb) p m
$4 = (struct mtx *) 0xceb39c00
(kgdb) p *m
$5 = {mtx_object = {lo_class = 0x0, lo_name = 0x0, lo_type = 0x0,
  lo_flags = 0, lo_list = {tqe_next = 0x0, tqe_prev = 0x0}, lo_witness
  = 0x0}, mtx_lock = 2, mtx_recurse = 0}
(kgdb) p &Giant
$6 = (struct mtx *) 0xc062a0e0
(kgdb) p owner
$7 = (volatile struct thread *) 0x0
(kgdb) info local
owner = (volatile struct thread *) 0x0
v = 0
(kgdb) list
545                      */
546                     owner = (struct thread *)(v & MTX_FLAGMASK);
547     #ifdef ADAPTIVE_GIANT
548                     if (TD_IS_RUNNING(owner)) {
549     #else
550                     if (m != &Giant && TD_IS_RUNNING(owner)) {
551     #endif

The crash occurs in line 550 because owner is NULL, although it
should point to the thread that holds the dirhash mutex, so
TD_IS_RUNNING(owner) dereferences a NULL pointer. This is consistent
with the small fault address eva=256 in frame #3. When
_mtx_lock_sleep is called, the mtx_object is already filled with
zeros, and mtx_lock in particular should be either 4 (UNOWNED) or the
address of the owning thread.

What could be the reason that these panics never occurred before and
then hit a dozen servers within a short time? There have been no
further crashes for a week now.

Any hints are welcome.

-- 
Dr. Andreas Longwitz

Data Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
Amtsgericht Lübeck, HRB 318 BS
Geschäftsführer: Wilfried Paepcke, Dr. Andreas Longwitz, Josef Flatau