Date: Tue, 16 Nov 2010 02:01:58 -0500 (EST)
From: Terry Kennedy
In-reply-to: "Your message dated Mon, 15 Nov 2010 22:55:11 -0800"
To: Michael DeMan
Message-id: <01NUB3IOMZJW00BNN4@tmk.com>
References: <01NUB1F8POL000BNN4@tmk.com>
Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: ZFS panic after replacing log device
List-Id: Production branch of FreeBSD source code

> I am no ZFS kernel-code dude or anything, but it is well known that losing
> the ZIL can corrupt things pretty bad with ZFS.

First, thanks for writing back!

I agree that this could be the problem. As I mentioned in my original post, I followed the steps recommended by "zpool status" - clearing the device and then doing a replace. The fix may be as simple as testing for whether the device in question is a log device and if so, erroring out with "You can't do that".

Also note that multiple scrubs pass with no errors detected - it is only writes that trigger the panic. It looks like something isn't being cleaned up in the clear / replace path.
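
To make the suggested guard concrete, here is a minimal sketch of the idea of refusing a replace when the target is a log device. It is not ZFS code: the `Vdev` class, the `check_replace_allowed` function, the `ReplaceRefused` exception and the device names are all made up for illustration, and the real check would have to live wherever the replace request is validated.

```python
from dataclasses import dataclass


class ReplaceRefused(Exception):
    """Raised when a replace request is rejected before any state changes."""


@dataclass
class Vdev:
    # Hypothetical stand-in for a pool device; not a real ZFS structure.
    name: str
    is_log: bool = False


def check_replace_allowed(old: Vdev, new_name: str) -> str:
    """Return a description of the replace, or refuse it for log devices."""
    if old.is_log:
        # The guard proposed in the message: error out instead of proceeding.
        raise ReplaceRefused(f"cannot replace {old.name}: it is a log device")
    return f"replace {old.name} with {new_name}"
```

The point of the sketch is only that the check happens up front, before anything in the clear / replace path has been touched, so a refused request leaves the pool exactly as it was.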

I would save a crash dump for people to look at, but unfortunately the last time a crash dump actually worked for me (on dozens of systems) was back in the FreeBSD 6.2 days.

There wasn't any data corruption (the filesystem was not being written at the time the log device failed) - I have my own checksum files written by the sysutils/cfv port, and the data all matches.

> All in all, if I was in your situation I would give a whirl at installing
> OpenSolaris and going from there, being sure not to upgrade the pool version
> past what is supported by FreeBSD and going from there.

I have the data on another server (see my prior "snapshots are not backups" discussion on freebsd-stable if interested). So, fortunately, this is not a case of data recovery.

> Unfortunately we all find ourselves in a bit of a pickle with ZFS right
> now with the Oracle acquisition of Sun. For myself, I would stick with
> deploying on FreeBSD but I think its going to be FBSD 9.1 before its going
> to be truly ready for production.

The problem with hardware on the leading edge is that the software often needs time to catch up. In this particular case, the ZFS pool is 32TB. I can't begin to imagine how long a UFS fsck would take on such a partition, even if it were possible to create one. It was bad enough on the previous generation of my servers (2TB UFS partitions).

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA