From owner-freebsd-fs@FreeBSD.ORG  Sun Jul 31 20:29:13 2011
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DFC57106566C;
	Sun, 31 Jul 2011 20:29:13 +0000 (UTC)
	(envelope-from dpd@bitgravity.com)
Received: from mail-pz0-f44.google.com (mail-pz0-f44.google.com
	[209.85.210.44])
	by mx1.freebsd.org (Postfix) with ESMTP id B152A8FC0A;
	Sun, 31 Jul 2011 20:29:13 +0000 (UTC)
Received: by pzk5 with SMTP id 5so30362568pzk.17
	for <multiple recipients>; Sun, 31 Jul 2011 13:29:13 -0700 (PDT)
Received: by 10.68.31.130 with SMTP id a2mr872036pbi.275.1312142781276;
	Sun, 31 Jul 2011 13:06:21 -0700 (PDT)
Received: from [10.1.10.12] (173-13-188-46-sfba.hfc.comcastbusiness.net
	[173.13.188.46])
	by mx.google.com with ESMTPS id i9sm4682298pbk.36.2011.07.31.13.06.19
	(version=TLSv1/SSLv3 cipher=OTHER);
	Sun, 31 Jul 2011 13:06:20 -0700 (PDT)
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: text/plain; charset=us-ascii
From: David P Discher <dpd@bitgravity.com>
X-Priority: 3
In-Reply-To: <04C305AE5F184C6AAC2A67CE23184013@multiplay.co.uk>
Date: Sun, 31 Jul 2011 13:06:18 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <3D893A9B-2CD9-40EB-B4A2-5DBCBB72C62E@bitgravity.com>
References: <A14F1C768A41483C876AD77502A864D6@multiplay.co.uk>
	<0D449EC916264947AB31AA17F870EA7A@multiplay.co.uk>
	<4E3013DF.10803@FreeBSD.org>
	<3D6CEB50BEDD4ACE96FD35C4D085618A@multiplay.co.uk>
	<4E301C55.7090105@FreeBSD.org>
	<5C84E7C8452E489C8CA738294F5EBB78@multiplay.co.uk>
	<4E301F10.6060708@FreeBSD.org>
	<63705B5AEEAD4BB88ADB9EF770AB6C76@multiplay.co.uk>
	<4E302204.2030009@FreeBSD.org>
	<6703F0BB-D4FC-4417-B519-CAFC62E5BC39@bitgravity.com>
	<04C305AE5F184C6AAC2A67CE23184013@multiplay.co.uk>
To: "Steven Hartland" <killing@multiplay.co.uk>
X-Mailer: Apple Mail (2.1084)
Cc: freebsd-fs@FreeBSD.org, Andriy Gapon <avg@freebsd.org>
Subject: Re: zfs process hang on pool access
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 31 Jul 2011 20:29:14 -0000

I've actually found a second issue. My working theory is that it is
related to the *fix* of LBOLT, in zio_wait()/txg_delay() when calling
_cv_wait()/_cv_timedwait().  This may be aggravated by setting
vfs.zfs.txg.timeout=1.  And in fact these functions are using LBOLT
with signed 32-bit ints.
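
To illustrate why signed 32-bit tick values are fragile in timed waits,
here is a minimal sketch. It is not a diagnosis of this hang and it is
not the ZFS code. It assumes hz=1000 and a tick value derived as
nanoseconds * hz / NANOSEC; neither assumption is stated in this thread,
and every function name below is hypothetical.

```c
#include <stdint.h>

/* Assumptions for illustration only; not taken from the thread. */
#define ASSUMED_HZ 1000            /* ticks per second */
#define NANOSEC    1000000000LL    /* nanoseconds per second */

/* Whole seconds of uptime before a signed 32-bit tick count passes INT32_MAX. */
static int64_t
seconds_until_int32_tick_wrap(int64_t hz)
{
	return ((int64_t)INT32_MAX / hz);
}

/*
 * Whole seconds of uptime before the product (nanoseconds * hz)
 * no longer fits in a signed 64-bit integer.
 */
static int64_t
seconds_until_int64_product_overflow(int64_t hz)
{
	return (INT64_MAX / hz / NANOSEC);
}

/* Add a timeout to a tick value, wrapping the way a 32-bit counter does. */
static int32_t
tick_add(int32_t now, int32_t timeout)
{
	return ((int32_t)((uint32_t)now + (uint32_t)timeout));
}

/* Naive deadline check: gives the wrong answer once the deadline has wrapped. */
static int
naive_deadline_passed(int32_t now, int32_t deadline)
{
	return (now >= deadline);
}

/* Wrap-tolerant check: compare the signed difference instead of the values. */
static int
tick_deadline_passed(int32_t now, int32_t deadline)
{
	return ((int32_t)((uint32_t)now - (uint32_t)deadline) >= 0);
}
```

With hz=1000, seconds_until_int32_tick_wrap() gives 2147483 seconds
(about 24.9 days) and seconds_until_int64_product_overflow() gives
9223372 seconds (about 106.75 days). A deadline computed just before
the 32-bit wrap becomes negative, so the naive comparison reports it as
already passed, while the difference-based check does not.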

I got some cores, and ideas, and will dig into the debugging this week.
And of course I will post my findings (and pleas for help) here on
freebsd-fs@.

Rolling back the two patches I posted earlier for the 26+ day and 106+
day bugs seemed to avoid the new issue.

---
David P. Discher
dpd@bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com

On Jul 31, 2011, at 12:50 PM, Steven Hartland wrote:

> Is there a PR related to this so we can track progress? Having to reboot machines
> every 100+ days to ensure they don't break is a bit of a PITA when you've got hundreds
> of machines :(
>
> ----- Original Message ----- From: "David P Discher" <dpd@bitgravity.com>
> To: "Steven Hartland" <killing@multiplay.co.uk>
> Cc: <freebsd-fs@FreeBSD.org>; "Andriy Gapon" <avg@freebsd.org>
> Sent: Wednesday, July 27, 2011 9:41 PM
> Subject: Re: zfs process hang on pool access
>
>
> The way I found this was by breaking into the debugger, doing some
> back traces, continuing, breaking in again, doing some more back
> traces on the hung processes ... seeing what is going on, then
> walking through the code.
>
> Then, once I had specific loops and code locations, I asked the
> higher powers of the freebsd kernel world.
>