Date:      Mon, 01 Aug 2011 00:10:45 +0200
From:      Martin Matuska <mm@FreeBSD.org>
To:        David P Discher <dpd@bitgravity.com>
Cc:        freebsd-fs@FreeBSD.org, Andriy Gapon <avg@freebsd.org>
Subject:   Re: zfs process hang on pool access
Message-ID:  <4E35D2E5.4020108@FreeBSD.org>
In-Reply-To: <3D893A9B-2CD9-40EB-B4A2-5DBCBB72C62E@bitgravity.com>
References:  <A14F1C768A41483C876AD77502A864D6@multiplay.co.uk> <0D449EC916264947AB31AA17F870EA7A@multiplay.co.uk> <4E3013DF.10803@FreeBSD.org> <3D6CEB50BEDD4ACE96FD35C4D085618A@multiplay.co.uk> <4E301C55.7090105@FreeBSD.org> <5C84E7C8452E489C8CA738294F5EBB78@multiplay.co.uk> <4E301F10.6060708@FreeBSD.org> <63705B5AEEAD4BB88ADB9EF770AB6C76@multiplay.co.uk> <4E302204.2030009@FreeBSD.org> <6703F0BB-D4FC-4417-B519-CAFC62E5BC39@bitgravity.com> <04C305AE5F184C6AAC2A67CE23184013@multiplay.co.uk> <3D893A9B-2CD9-40EB-B4A2-5DBCBB72C62E@bitgravity.com>

I walked through all occurrences of ddi_get_lbolt() in the ZFS code, and
this is the only place where its result is incorrectly stored in an int.
This is how it should look:

===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c (revision 224527)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c (working copy)
@@ -488,7 +488,7 @@
 txg_delay(dsl_pool_t *dp, uint64_t txg, int ticks)
 {
 	tx_state_t *tx = &dp->dp_tx;
-	int timeout = ddi_get_lbolt() + ticks;
+	clock_t timeout = ddi_get_lbolt() + ticks;
 
 	/* don't delay if this txg could transition to quiesing immediately */
 	if (tx->tx_open_txg > txg ||
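
For illustration of why the int matters: assuming clock_t in the Solaris
compat layer is 64 bits wide (which is what makes the one-word patch
effective) and the common hz=1000, the tick count returned by
ddi_get_lbolt() stops fitting into a signed 32-bit int after about 24.8
days of uptime. A minimal userland sketch of the truncation -- not the
kernel code, and the type name myclock_t is made up for illustration:

#include <limits.h>
#include <stdio.h>

/* stands in for the compat layer's 64-bit clock_t */
typedef long long myclock_t;

int
main(void)
{
	/* tick counter just past INT_MAX, i.e. about 24.8 days at hz=1000 */
	myclock_t lbolt = (myclock_t)INT_MAX + 1000;
	int ticks = 2;				/* delay requested by the caller */

	int bad = (int)(lbolt + ticks);		/* old code: truncated to 32 bits */
	myclock_t good = lbolt + ticks;		/* patched code: full width */

	printf("lbolt           = %lld\n", lbolt);
	printf("int timeout     = %d (wrapped negative, already expired)\n", bad);
	printf("clock_t timeout = %lld\n", good);
	return (0);
}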


On 31. 7. 2011 22:06, David P Discher wrote:
> I've actually found a second issue that, per my working theory, is related to the *fix* of LBOLT, in zio_wait()/txg_delay() when calling _cv_wait()/_cv_timedwait().  This may be aggravated by setting vfs.zfs.txg.timeout=1.  And in fact these functions are using LBOLT with signed 32-bit ints.
>
> I got some cores, and ideas, and will dig into the debugging this week.  And of course I will post my findings (and pleas for help) here on freebsd-fs@.
>
> Rolling back the two patches I posted earlier for the 26+ day and 106+ day bugs seemed to avoid the new issue.
>
> ---
> David P. Discher
> dpd@bitgravity.com * AIM: bgDavidDPD
> BITGRAVITY * http://www.bitgravity.com
>
> On Jul 31, 2011, at 12:50 PM, Steven Hartland wrote:
>
>> Is there a PR related to this so we can track progress? Having to reboot machines
>> every 100+ days to ensure they don't break is a bit of a PITA when you've got hundreds
>> of machines :(
>>
>> ----- Original Message -----
>> From: "David P Discher" <dpd@bitgravity.com>
>> To: "Steven Hartland" <killing@multiplay.co.uk>
>> Cc: <freebsd-fs@FreeBSD.org>; "Andriy Gapon" <avg@freebsd.org>
>> Sent: Wednesday, July 27, 2011 9:41 PM
>> Subject: Re: zfs process hang on pool access
>>
>>
>> The way I found this was to break into the debugger, do some backtraces, continue, break in again, do some more backtraces on the hung processes ... see what is going on, then walk through the code.
>>
>> Then, once I had specific loops and code locations, I started asking the higher powers of the FreeBSD kernel world.
>>
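
As for the second issue David describes above: txg_delay() keeps waiting
on the condition variable until the saved deadline passes, so once lbolt
itself is past INT_MAX, a deadline that has been squeezed through a
signed 32-bit int is wrong before the first wait. A simplified model of
that loop condition -- made-up names, not the actual txg.c code:

#include <limits.h>
#include <stdio.h>

typedef long long myclock_t;	/* 64-bit tick counter */

/*
 * Simplified model of the txg_delay() wait condition: keep sleeping on
 * the cv until the deadline passes.  With the deadline truncated to a
 * negative int the loop exits immediately, and the delay (and with it
 * the throttling it is supposed to provide) silently disappears.
 */
static int
keep_waiting(myclock_t now, int deadline32)
{
	return (now < deadline32);
}

int
main(void)
{
	myclock_t now = (myclock_t)INT_MAX + 1000;	/* ~24.8 days at hz=1000 */
	int deadline32 = (int)(now + 2);		/* truncated deadline */

	printf("now = %lld, deadline = %d, keep waiting: %s\n",
	    now, deadline32, keep_waiting(now, deadline32) ? "yes" : "no");
	return (0);
}

Depending on which side of the comparison gets truncated, the effect can
also go the other way -- a wait that does not expire for a very long time
-- which would match the hangs.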


-- 
Martin Matuska
FreeBSD committer
http://blog.vx.sk



