From: David P Discher
Date: Sun, 31 Jul 2011 13:06:18 -0700
To: "Steven Hartland"
Cc: freebsd-fs@FreeBSD.org, Andriy Gapon
Subject: Re: zfs process hang on pool access
Message-Id: <3D893A9B-2CD9-40EB-B4A2-5DBCBB72C62E@bitgravity.com>
In-Reply-To: <04C305AE5F184C6AAC2A67CE23184013@multiplay.co.uk>
List-Id: Filesystems

I've actually found a second issue. My working theory is that it is related to
the *fix* of LBOLT, in zio_wait()/txg_delay() when calling
_cv_wait()/_cv_timedwait(). This may be aggravated by setting
vfs.zfs.txg.timeout=1. And in fact these functions are using
LBOLT with signed 32-bit ints.

I got some cores, and ideas, and will dig into the debugging this week.
And of course I will post my findings (and pleas for help) here on
freebsd-fs@.

Rolling back the two patches I posted earlier for the 26+ day and 106+
day bugs seemed to avoid the new issue.

---
David P. Discher
dpd@bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com

On Jul 31, 2011, at 12:50 PM, Steven Hartland wrote:

> Is there a PR related to this so we can track progress? Having to reboot machines
> every 100+ days to ensure they don't break is a bit of a PITA when you've got hundreds
> of machines :(
>
> ----- Original Message -----
> From: "David P Discher"
> To: "Steven Hartland"
> Cc: "Andriy Gapon"
> Sent: Wednesday, July 27, 2011 9:41 PM
> Subject: Re: zfs process hang on pool access
>
>
> The way I found this was breaking into the debugger, doing some back
> traces, continuing, breaking in again, doing some more back traces on the hung
> processes ... seeing what is going on, then walking through the code.
>
> Then, once I had specific loops and code locations, asking the higher
> powers of the freebsd kernel world.
>