From owner-freebsd-stable@FreeBSD.ORG Fri Jul  1 03:33:53 2011
Date: Thu, 30 Jun 2011 20:02:19 -0700
From: Timothy Smith <tts@personalmis.com>
To: freebsd-stable@freebsd.org
Subject: HAST + ZFS: no action on drive failure

First posting here, hopefully I'm doing it right =) I also posted this to the FreeBSD forum, but I know some HAST folks monitor this list regularly and not so much there, so...

Basically, I'm testing failure scenarios with HAST/ZFS. I have two nodes and have scripted up a bunch of checks and failover actions between them. Looking good so far, though more complex than I expected. It would be cool to post it somewhere to get some pointers/critiques, but that's another thing.

Anyway, now I'm just seeing what happens when a drive fails on the primary node. Oddly/sadly, NOTHING! hastd just keeps on ticking and doesn't change the state of the failed drive, so the zpool has no clue the drive is offline. The /dev/hast/ provider remains. hastd does log some errors to the system log like this, but nothing more:

messages.0:Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) Unable to flush activemap to disk: Device not configured.
messages.0:Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) Local request failed (Device not configured): WRITE(4736512, 512).

So I guess the question is: do I have to script a cron job to check for these kinds of errors and then change the HAST resource to 'init' or something to handle this? Or is there some kind of hastd config setting that I need to set? What's the SOP for this? (A rough sketch of the kind of cron check I have in mind is at the end of this mail.)

Related to that: when the zpool in FreeBSD does finally notice that the drive is missing because I have manually changed the HAST resource to INIT (so the /dev/hast/ provider is gone), my raidz2 pool's hot spare doesn't engage, even with autoreplace=on. The zpool status output for the degraded pool seems to indicate that I should manually replace the failed drive (also sketched at the end). If that's the case, it's not really a "hot spare". Does this mean the "FMA Agent" referred to in the ZFS manual is not implemented in FreeBSD?

thanks!
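
P.S. For reference, the cron check I'm picturing is something like the sketch below. It's untested; the resource list and log path are just placeholders for my setup (only ada6 actually comes from the log above), and the grep is naive, so it would keep matching old entries:

    #!/bin/sh
    # Naive sketch: look for hastd local I/O errors in the log and drop the
    # affected resource to init, so its /dev/hast/ provider goes away and
    # ZFS finally sees the disk as missing.
    # RESOURCES and LOG are assumptions for my own test boxes.
    RESOURCES="ada4 ada5 ada6 ada7"
    LOG=/var/log/messages

    for res in $RESOURCES; do
        if grep -q "\[${res}\] (primary) Local request failed" "$LOG"; then
            hastctl role init "$res"
            logger -t hast-check "dropped ${res} to init after local I/O errors"
        fi
    done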
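
And for the spare, what zpool status seems to want me to do by hand is basically this (the pool name "tank" and the spare provider name are just examples, not my real config):

    # Swap the spare in for the failed hast provider, then watch the resilver.
    zpool replace tank hast/ada6 hast/spare0
    zpool status tank

This manual step is exactly what I was hoping autoreplace=on (or the FMA agent on Solaris) would do for me automatically.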