From owner-freebsd-fs@FreeBSD.ORG Sun Jul 31 22:10:51 2011
Message-ID: <4E35D2E5.4020108@FreeBSD.org>
Date: Mon, 01 Aug 2011 00:10:45 +0200
From: Martin Matuska
To: David P Discher
Cc: freebsd-fs@FreeBSD.org, Andriy Gapon
In-Reply-To: <3D893A9B-2CD9-40EB-B4A2-5DBCBB72C62E@bitgravity.com>
Subject: Re: zfs process hang on pool access

I walked through all occurrences of ddi_get_lbolt() in the ZFS code, and
this is the only place where its result is stored in a variable of the
wrong type (int instead of clock_t). This is how it should look:

===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c	(revision 224527)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c	(working copy)
@@ -488,7 +488,7 @@
 txg_delay(dsl_pool_t *dp, uint64_t txg, int ticks)
 {
 	tx_state_t *tx = &dp->dp_tx;
-	int timeout = ddi_get_lbolt() + ticks;
+	clock_t timeout = ddi_get_lbolt() + ticks;
 
 	/* don't delay if this txg could transition to quiesing immediately */
 	if (tx->tx_open_txg > txg ||

On 31. 7. 2011 22:06, David P Discher wrote:
> I've actually found a second issue. My working theory is that it is
> related to the *fix* of LBOLT, in zio_wait()/txg_delay() when calling
> _cv_wait()/_cv_timedwait(). This may be aggravated by setting
> vfs.zfs.txg.timeout=1. And in fact these functions are using LBOLT with
> signed 32-bit ints.
>
> I got some cores, and ideas, and will dig into the debugging this week.
> And of course I will post my findings (and pleas for help) here on
> freebsd-fs@.
>
> Rolling back the two patches I posted earlier for the 26+ day and
> 106+ day bugs seemed to avoid the new issue.
>
> ---
> David P. Discher
> dpd@bitgravity.com * AIM: bgDavidDPD
> BITGRAVITY * http://www.bitgravity.com
>
> On Jul 31, 2011, at 12:50 PM, Steven Hartland wrote:
>
>> Is there a PR related to this so we can track progress?
>> Having to reboot machines every 100+ days to ensure they don't break
>> is a bit of a PITA when you've got hundreds of machines :(
>>
>> ----- Original Message ----- From: "David P Discher"
>> To: "Steven Hartland"
>> Cc: ; "Andriy Gapon"
>> Sent: Wednesday, July 27, 2011 9:41 PM
>> Subject: Re: zfs process hang on pool access
>>
>>
>> The way I found this was by breaking into the debugger, doing some
>> backtraces, continuing, breaking in again, doing some more backtraces
>> on the hung processes ... seeing what was going on, then walking
>> through the code.
>>
>> Then, once I had specific loops and code locations, I asked the higher
>> powers of the FreeBSD kernel world.
>>

-- 
Martin Matuska
FreeBSD committer
http://blog.vx.sk
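
P.S. For anyone following the thread, here is a minimal userland sketch of why the declared type of the deadline variable in txg_delay() matters. It is not the kernel code: the helper names (deadline_wide, deadline_narrow, pending_wide, pending_narrow) are hypothetical, and modeling the tick counter as int64_t is an assumption about the width of clock_t. It only shows what happens to "now + ticks" when it is stored in a 32-bit int after the counter has passed 2^31.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins, not the ZFS code. Tick counts are modeled as
 * int64_t, which is an assumption about the width of clock_t.
 */

/* Deadline kept in the full-width type, as in the patched declaration. */
static int64_t
deadline_wide(int64_t now, int ticks)
{
	assert(ticks >= 0);
	return (now + ticks);
}

/*
 * Deadline stored in a 32-bit int, as in the pre-patch declaration.
 * Converting an out-of-range value to a signed type is
 * implementation-defined in C; common compilers keep the low 32 bits.
 */
static int32_t
deadline_narrow(int64_t now, int ticks)
{
	assert(ticks >= 0);
	return ((int32_t)(now + ticks));
}

/* Nonzero while the full-width deadline is still in the future. */
static int
pending_wide(int64_t now, int ticks)
{
	return (deadline_wide(now, ticks) > now);
}

/* Same check, but against the truncated 32-bit deadline. */
static int
pending_narrow(int64_t now, int ticks)
{
	return ((int64_t)deadline_narrow(now, ticks) > now);
}
```

With a small tick count such as now = 1000 and ticks = 100, both checks agree that the deadline is pending. Once now reaches 2147483648 (2^31), the narrow deadline wraps to a negative value on typical two's-complement platforms and no longer compares sensibly against the current tick count, while the wide one still does.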