From owner-freebsd-current@FreeBSD.ORG Tue Aug  3 05:31:34 2004
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B6B6816A4CE; Tue, 3 Aug 2004 05:31:34 +0000 (GMT)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 59CB043D2F; Tue, 3 Aug 2004 05:31:34 +0000 (GMT) (envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.12.11/8.12.11) with ESMTP id i735VDNL077963; Mon, 2 Aug 2004 22:31:22 -0700 (PDT) (envelope-from truckman@FreeBSD.org)
Message-Id: <200408030531.i735VDNL077963@gw.catspoiler.org>
Date: Mon, 2 Aug 2004 22:31:13 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: boris@brooknet.com.au
In-Reply-To: <1091504341.729.25.camel@dirk.no.domain>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
cc: freebsd-current@FreeBSD.org
cc: pjd@FreeBSD.org
Subject: Re: processes freezing when writing to gstripe'd device
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Tue, 03 Aug 2004 05:31:34 -0000

On 3 Aug, Sam Lawrance wrote:

>> +> I am observing processes performing operations on a gstripe device
>> +> freeze in state 'bufwait'. An 'rm' process is stuck right now. The
>> +> rest of the system is fine.
>> +>
>> +> What's the best way to look into this? I can't attach to rm with gdb
>> +> (it just ends up waiting for something). I can drop to kdb, but have
>> +> no idea where to go from there.
>>
>> You could use the 'ps' command from DDB to see which processes are
>> asleep. Then you can run 'tr <pid>', where <pid> is the PID of a
>> sleeping process. Look for processes related somehow to this problem.
>>
>> It'll also be great if you can provide the exact procedure which will
>> allow me to reproduce this problem.
>
> Okay, I updated to current as of yesterday and I'm still seeing the same
> problem. I'm new to these bits of the kernel, but it looks like a
> locking problem. This is what I am doing:
>
> dd if=/dev/zero of=sd0 count=20480
> cp sd0 sd1
> mdconfig -a -t vnode -f sd0
> mdconfig -a -t vnode -f sd1
> gstripe label bork md0 md1
> newfs /dev/stripe/bork
> mkdir teststripe
> mount /dev/stripe/bork teststripe
> cd teststripe
>
> Now I repeatedly 'cvs checkout' and 'rm -rf' the FreeBSD src tree.
> Usually it freezes during the first checkout.
>
> SIGINFO shows:
>
> load: 1.14  cmd: cvs 801 [biowr] 0.33u 3.35s 14% 2840k
>
> A trace of the frozen cvs process 801 shows:
>
> KDB: enter: manual escape to debugger
> [thread 100006]
> Stopped at kdb_enter+0x2b: nop
> db> tr 801
> sched_switch(c1a87580,0) at sched_switch+0x12b
> mi_switch(1,0) at mi_switch+0x24d
> sleepq_switch(c63ee500,d0c83814,c06030e9,c63ee500,0) at sleepq_switch+0xe0
> sleepq_wait(c63ee500,0,0,0,c07f3ab7) at sleepq_wait+0xb
> msleep(c63ee500,c08ddd80,4c,c07f40d1,0) at msleep+0x375
> bwait(c63ee500,4c,c07f40d1) at bwait+0x47
> bufwait(c63ee500,c088f1a0,c1cd6318,c63ee500,0) at bufwait+0x2d
> ibwrite(c63ee500,d0c838d8,c071906e,c63ee500,a00) at ibwrite+0x3e2
> bwrite(c63ee500,a00,0,ee,c19b1834) at bwrite+0x32
> ffs_update(c19c3738,1,0,c19b808c,c19c3738) at ffs_update+0x302
> ufs_makeinode(81a4,c199f840,d0c83bf8,d0c83c0c) at ufs_makeinode+0x3a3
> ufs_create(d0c83a74,d0c83b30,c0655238,d0c83a74,c08b8c00) at ufs_create+0x26
> ufs_vnoperate(d0c83a74) at ufs_vnoperate+0x13
> vn_open_cred(d0c83be4,d0c83ce4,1a4,c1d47700,8) at vn_open_cred+0x174
> vn_open(d0c83be4,d0c83ce4,1a4,8,c08ad240) at vn_open+0x1e
> kern_open(c1a87580,8199430,0,602,1b6) at kern_open+0xd2
> open(c1a87580,d0c83d14,3,1be,292) at open+0x18
> syscall(2f,bfbf002f,bfbf002f,8,2836c7f8) at syscall+0x217
> Xint0x80_syscall() at Xint0x80_syscall+0x1f
> --- syscall (5, FreeBSD ELF32, open), eip = 0x282e2437, esp = 0xbfbfdd7c,
> ebp = 0xbfbfdda8 ---
>
> I'll keep poking around - if you have any further suggestions or need
> other information, fire away.

I'm not too familiar with this area of the kernel, but I'd be suspicious
that one or more of the geom kernel threads are getting wedged and
keeping the I/O that the cvs process is waiting for from completing.

I would think that the vnode locks on sd0 and sd1 need to be obtained in
order to do the I/O on the md devices. Maybe it's a deadly embrace, where
one thread holds a lock on sd0, another thread holds a lock on sd1, and
each wants to grab a lock on the other vnode ...

Try 'ps lax | grep g_' and get a DDB backtrace of the g_up, g_down, and
g_event threads.
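
Roughly the sequence I have in mind (just a sketch -- the PIDs below are
placeholders, use whatever your ps output shows, and drop into DDB the
same way you did for the cvs trace):

ps lax | grep g_          <- note the PIDs of g_up, g_down, and g_event

db> ps                    <- the geom threads should show up here as well
db> tr <pid of g_up>
db> tr <pid of g_down>
db> tr <pid of g_event>
db> c                     <- continue, if the machine is still usable

If one of those threads is sitting on a lock or a sleep channel, its
backtrace should point at what is holding things up.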