Date: Wed, 27 Jul 2011 13:41:34 -0700
From: David P Discher <dpd@bitgravity.com>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: freebsd-fs@FreeBSD.org, Andriy Gapon <avg@freebsd.org>
Subject: Re: zfs process hang on pool access
Message-ID: <6703F0BB-D4FC-4417-B519-CAFC62E5BC39@bitgravity.com>
In-Reply-To: <4E302204.2030009@FreeBSD.org>
References: <A14F1C768A41483C876AD77502A864D6@multiplay.co.uk> <0D449EC916264947AB31AA17F870EA7A@multiplay.co.uk> <4E3013DF.10803@FreeBSD.org> <3D6CEB50BEDD4ACE96FD35C4D085618A@multiplay.co.uk> <4E301C55.7090105@FreeBSD.org> <5C84E7C8452E489C8CA738294F5EBB78@multiplay.co.uk> <4E301F10.6060708@FreeBSD.org> <63705B5AEEAD4BB88ADB9EF770AB6C76@multiplay.co.uk> <4E302204.2030009@FreeBSD.org>

The way I found this was by breaking into the debugger, doing some
backtraces, continuing, breaking in again, doing some more backtraces on the
hung processes ... seeing what was going on, then walking through the code.
Then, once I had specific loops and code locations, I asked the higher
powers of the FreeBSD kernel world.

Of course, I had the high CPU and was peeking at the arc_reclaim_thread.

I've seen this nearly like clockwork in production at 106-107 days. If
it goes on too much longer than that, then things deadlock.

But at 112 days, and on 8.2 ... you for sure have the LBOLT overflow.

Otherwise, reboot and patch. However, I have not fully vetted the patch
under heavy load, and I am currently seeing another deadlock issue with
8.1+ zfs v14 - but seemingly during writes after 6-40 hours. Still
investigating.

Note, my proposal of "time_uptime" doesn't work - it causes a
buildworld error in the zfs userland tools.

This is what I'm currently running to fix the 26 day issue with the l2arc
feeder and arc_reclaim_thread with LBOLT in 8.1.

Index: sys/cddl/compat/opensolaris/sys/time.h
===================================================================
--- sys/cddl/compat/opensolaris/sys/time.h (.../8.1-BGOS-20110105) (revision 3322)
+++ sys/cddl/compat/opensolaris/sys/time.h (.../8.1-BGOS-20110613) (working copy)
@@ -38,7 +38,7 @@
 
 typedef longlong_t hrtime_t;
 
-#define LBOLT ((gethrtime() * hz) / NANOSEC)
+#define LBOLT (gethrtime() * (NANOSEC/hz))
 
 #if defined(__i386__) || defined(__powerpc__)
 #define TIMESPEC_OVERFLOW(ts) \
Index: sys/cddl/compat/opensolaris/sys/types.h
===================================================================
--- sys/cddl/compat/opensolaris/sys/types.h (.../8.1-BGOS-20110105) (revision 3322)
+++ sys/cddl/compat/opensolaris/sys/types.h (.../8.1-BGOS-20110613) (working copy)
@@ -34,6 +34,12 @@
  */
 
 #include <sys/stdint.h>
+
+#ifdef _KERNEL
+typedef int64_t clock_t;
+#define _CLOCK_T_DECLARED
+#endif
+
 #include_next <sys/types.h>
 
 #define MAXNAMELEN 256

---
David P. Discher
dpd@bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com

On Jul 27, 2011, at 7:34 AM, Andriy Gapon wrote:

>> Ahh, is there any way to confirm that before I reboot, or any other
>> information we could glean that might be useful?
>
> No quick ideas, unfortunately.
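
As a rough sanity check of the day counts mentioned in this message, the
sketch below works out when the old LBOLT expression would overflow. It
assumes hz = 1000 and a signed 64-bit hrtime_t counting nanoseconds; neither
value is stated in the message. The helper names are hypothetical, and the
divide-first ns_to_ticks() is only an illustration of an ordering that avoids
the large intermediate product. It is not copied from the patch above.

```c
#include <assert.h>
#include <stdint.h>

#define NANOSEC 1000000000LL	/* nanoseconds per second */

/*
 * Seconds of uptime after which (ns * hz) no longer fits in a signed
 * 64-bit integer, i.e. when "(gethrtime() * hz) / NANOSEC" overflows.
 */
static int64_t
lbolt_mul_overflow_secs(int64_t hz)
{
	assert(hz > 0);
	return (INT64_MAX / hz / NANOSEC);
}

/*
 * Seconds of uptime after which a tick count at rate hz no longer fits
 * in a signed 32-bit clock_t (assumption: clock_t was 32 bits wide).
 */
static int64_t
clock32_overflow_secs(int64_t hz)
{
	assert(hz > 0);
	return (INT32_MAX / hz);
}

/*
 * Hypothetical divide-first conversion from nanoseconds to ticks.
 * Assumes NANOSEC is an exact multiple of hz (true for 100 and 1000).
 */
static int64_t
ns_to_ticks(int64_t ns, int64_t hz)
{
	assert(hz > 0 && NANOSEC % hz == 0);
	return (ns / (NANOSEC / hz));
}
```

With hz = 1000, lbolt_mul_overflow_secs() returns 9223372 seconds, or about
106.75 days, which lines up with the 106-107 days seen in production.
clock32_overflow_secs() returns 2147483 seconds, or about 24.9 days, which is
in the neighbourhood of the 26 day issue if a 32-bit clock_t is indeed the
cause.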