Date:      Sun, 23 Oct 2011 16:02:22 +0200
From:      Pawel Jakub Dawidek <pjd@FreeBSD.org>
To:        Harold Paulson <haroldp@internal.org>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: Damaged directory on ZFS
Message-ID:  <20111023140222.GG1697@garage.freebsd.pl>
In-Reply-To: <4D8047A6-930E-4DE8-BA55-051890585BFE@internal.org>
References:  <4D8047A6-930E-4DE8-BA55-051890585BFE@internal.org>


On Mon, Oct 17, 2011 at 05:17:31PM -0700, Harold Paulson wrote:
> Hello,
>
> I've had a server that boots from ZFS panicking for a couple days.  I have worked around the problem for now, but I hope someone can give me some insight into what's going on, and how I can solve it properly.
>
> The server is running 8.2-STABLE (zfs v28) with 8G of ram and 4 SATA disks in a raid10 type arrangement:
>
> # uname -a
> FreeBSD jane.sierraweb.com 8.2-STABLE-201105 FreeBSD 8.2-STABLE-201105 #0: Tue May 17 05:18:48 UTC 2011     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
>
> And zpool status:
>
> 	NAME           STATE     READ WRITE CKSUM
> 	tank           ONLINE       0     0     0
> 	  mirror       ONLINE       0     0     0
> 	    gpt/disk0  ONLINE       0     0     0
> 	    gpt/disk1  ONLINE       0     0     0
> 	  mirror       ONLINE       0     0     0
> 	    gpt/disk2  ONLINE       0     0     0
> 	    gpt/disk3  ONLINE       0     0     0
>
> It started panicking under load a couple days ago.  We replaced RAM and motherboard, but problems persisted.  I don't know if a hardware issue originally caused the problem or what.  When it panics, I get the usual panic message, but I don't get a core file, and it never reboots itself.
>
> http://pastebin.com/F1J2AjSF
>
> While I was trying to figure out the source of the problem, I noticed various stuck processes that peg a CPU and can't be killed, such as:
>
>   PID JID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
> 48735   0 root        1  46    0 11972K   924K CPU3    3 415:14 100.00% find
>
> They are not marked zombie, but I can't kill them, and restarting the jail they are in won't even get rid of them.  truss just hangs with no output on them.  On different occasions, I noticed pop3d processes for the same user getting stuck in this way.  On a hunch I ran a "find" through the files in the user's Maildir and got a panic.  I disabled this account and now the server is stable again.  At least until locate.updatedb walks through that directory, I suppose.  Evidently, there is some kind of hole in the file system below that directory tree causing the panic.
>
> I can move that directory out of the way, and carry on, but is there anything I can do to really *repair* the problem?

Could you run these commands:

	objdump -D /boot/kernel/zfs.ko.symbols | egrep '^[0-9a-f]{8,16} <fzap_cursor_retrieve>' | awk '{printf("0x%s\n", $1)}' | xargs -J ADDR printf "%u + %u\n" ADDR 0x111 | bc | xargs printf "0x%x\n" | xargs addr2line -e /boot/kernel/zfs.ko.symbols

They should convert fzap_cursor_retrieve+0x111 into file:line. Send it
here once you obtain it.
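
In case the one-liner is hard to follow, here is roughly the same thing
split into steps. This is only a sketch: the function names
sym_plus_offset and resolve_line are made up for illustration, the awk
match stands in for the egrep/awk pair above, and the start address in
the last line is an invented example, not the real address of
fzap_cursor_retrieve.

```shell
#!/bin/sh
# Add a hex offset to a symbol's start address and print the sum as hex.
#   $1: start address as objdump prints it (hex digits, no 0x prefix)
#   $2: offset from the panic backtrace, with 0x prefix (0x111 here)
sym_plus_offset() {
    printf '0x%x\n' "$(( 0x$1 + $2 ))"
}

# Resolve symbol+offset in a module with debug symbols to file:line.
#   $1: path to the .symbols file, $2: function name, $3: offset
# Assumes objdump labels the function as "<address> <name>:".
resolve_line() {
    base=$(objdump -D "$1" | awk -v sym="<$2>:" '$2 == sym { print $1; exit }')
    [ -n "$base" ] || { echo "symbol $2 not found" >&2; return 1; }
    addr2line -e "$1" "$(sym_plus_offset "$base" "$3")"
}

# Arithmetic step only, with a made-up start address:
sym_plus_offset 0000000000012340 0x111
```

On the affected machine the call would presumably be
resolve_line /boot/kernel/zfs.ko.symbols fzap_cursor_retrieve 0x111.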

Thanks.

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://yomoli.com
