Date: Tue, 16 Nov 2010 02:01:58 -0500 (EST)
From: Terry Kennedy <TERRY@tmk.com>
In-reply-to: "Your message dated Mon, 15 Nov 2010 22:55:11 -0800"
	<E7621997-3485-43A2-A2EE-A11574054FF6@deman.com>
To: Michael DeMan <freebsd@deman.com>
Message-id: <01NUB3IOMZJW00BNN4@tmk.com>
MIME-version: 1.0
Content-type: TEXT/PLAIN; CHARSET=us-ascii
References: <01NUB1F8POL000BNN4@tmk.com>
Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: ZFS panic after replacing log device

> I am no ZFS kernel-code dude or anything, but it is well known that losing
> the ZIL can corrupt things pretty bad with ZFS.

  First, thanks for writing back!

  I agree that this could be the problem. As I mentioned in my original post,
I followed the steps recommended by "zpool status" - clearing the device and
then doing a replace. The fix may be as simple as testing whether the device
in question is a log device and, if so, erroring out with "You can't do
that".
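
  As a userland illustration of the "is this a log device?" test (not the
kernel-side fix itself), here is a minimal Python sketch that scans "zpool
status" text for a device listed under the "logs" heading. The function name
is_log_device, the sample output, and the pool and device names (data, da0,
da1, da2) are all made up for the example; real output varies between ZFS
versions.

```python
# Hypothetical sample of "zpool status" output; names are illustrative only.
SAMPLE_STATUS = """\
  pool: data
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
        logs        ONLINE       0     0     0
          da0       ONLINE       0     0     0

errors: No known data errors
"""


def is_log_device(status_text, device):
    """Return True if `device` is listed under "logs" in the config section."""
    in_config = False
    in_logs = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True
            continue
        if not in_config or not stripped:
            continue
        if stripped.startswith("errors:"):
            break  # end of the config section
        name = stripped.split()[0]
        if name == "logs":
            in_logs = True
            continue
        if name in ("cache", "spares"):
            in_logs = False  # a different auxiliary section has started
            continue
        if in_logs and name == device:
            return True
    return False


if __name__ == "__main__":
    for dev in ("da0", "da1"):
        if is_log_device(SAMPLE_STATUS, dev):
            print(dev + ": log device, refusing to replace")
        else:
            print(dev + ": not a log device")
```

  A wrapper script could run such a check before calling "zpool replace" and
refuse to continue when it returns True.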

  Also note that multiple scrubs pass with no errors detected - it is only
writes that trigger the panic. It looks like something isn't being cleaned
up in the clear / replace path.

  I would save a crash dump for people to look at, but unfortunately the
last time a crash dump actually worked for me (on dozens of systems) was
back in the FreeBSD 6.2 days.

  There wasn't any data corruption (the filesystem was not being written to
at the time the log device failed) - I have my own checksum files written by
the sysutils/cfv port, and the data all matches.
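
  For readers without cfv, the same kind of check can be sketched in a few
lines of Python. This is not how cfv works internally, and it assumes a
checksum list in md5sum style ("<hex digest>  <filename>" per line), which
is only one of the formats cfv handles. The function name verify_md5_list is
made up for the example.

```python
import hashlib
import os


def verify_md5_list(list_path):
    """Return the names of files whose MD5 differs from the list entry.

    Assumes md5sum-style lines: "<hex digest>  <name>" or "<hex digest> *<name>",
    with names relative to the directory that holds the list file.
    """
    base = os.path.dirname(os.path.abspath(list_path))
    mismatched = []
    with open(list_path, "r") as listing:
        for line in listing:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            expected, _, name = line.partition(" ")
            name = name.lstrip(" *")  # drop the separator / binary-mode marker
            digest = hashlib.md5()
            with open(os.path.join(base, name), "rb") as data:
                # Read in 1 MiB chunks so large files are not held in memory.
                for chunk in iter(lambda: data.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                mismatched.append(name)
    return mismatched
```

  An empty result means every listed file still matches its recorded
checksum. A missing file raises an exception rather than being reported.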

> All in all, if I was in your situation I would give a whirl at installing
> OpenSolaris and going from there, being sure not to upgrade the pool
> version past what is supported by FreeBSD and going from there.

  I have the data on another server (see my prior "snapshots are not
backups" discussion on freebsd-stable if interested). So, fortunately, this
is not a case of data recovery.

> Unfortunately we all find ourselves in a bit of a pickle with ZFS right
> now with the Oracle acquisition of Sun.  For myself, I would stick with
> deploying on FreeBSD but I think its going to be FBSD 9.1 before its
> going to be truly ready for production.

  The problem with hardware on the leading edge is that the software often
needs time to catch up. In this particular case, the ZFS pool is 32TB. I
can't begin to imagine how long a UFS fsck would take on such a partition,
even if it were possible to create one. It was bad enough on the previous
generation of my servers (2TB UFS partitions).

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA