From: Jeremy Chadwick
To: Radio młodych bandytów
Cc: freebsd-fs@freebsd.org
Date: Thu, 11 Apr 2013 14:24:08 -0700
Subject: Re: A failed drive causes system to hang
Message-ID: <20130411212408.GA60159@icarus.home.lan>
In-Reply-To: <51672164.1090908@o2.pl>

On Thu, Apr 11, 2013 at 10:47:32PM +0200, Radio młodych bandytów wrote:
> Seeing a ZFS thread, I decided to write about a similar problem that
> I experience.
> I have a failing drive in my array. I need to RMA it, but don't have
> time and it fails rarely enough to be a yet another annoyance.
> The failure is simple: it fails to respond.
> When it happens, the only thing I found I can do is switch consoles.
> Any command fails, login fails, apps hang.
>
> On the 1st console I see a series of messages like:
>
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED
>
> I use RAIDZ1 and I'd expect that none single failure would cause the
> system to fail...

You need to provide full output from "dmesg", and you need to define
what the word "fails" means (re: "any command fails", "login fails").

I've already demonstrated that loss of a disk in raidz1 (or even 2 disks
in raidz2) does not cause "the system to fail" on stable/9. However,
if you lose enough members or vdevs to cause catastrophic failure, there
may be anomalies depending on how your system is set up:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html

If the pool has failmode=wait, any I/O to that pool will block (wait)
indefinitely. This is the default.

If the pool has failmode=continue, existing write I/O operations will
fail with EIO (I/O error) (and hopefully applications/daemons will
handle that gracefully -- if not, that's their fault) but any subsequent
I/O (read or write) to that pool will block (wait) indefinitely.

If the pool has failmode=panic, the kernel will immediately panic.
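
To make the three failmode cases above easier to compare, here is a
minimal sketch in Python. It is a toy model of what this message
describes, not ZFS code: the names Outcome and io_outcome are
hypothetical, and zpool(8) remains the authority on the exact semantics.
The value actually set on a pool can be checked with
"zpool get failmode <pool>".

```python
from enum import Enum


class Outcome(Enum):
    """Possible results of an I/O request after catastrophic pool failure."""
    BLOCKS = "blocks (waits) indefinitely"
    EIO = "fails with EIO"
    PANIC = "kernel panics"


def io_outcome(failmode: str, in_flight_write: bool = False) -> Outcome:
    """Return the outcome described in this message for one I/O request.

    failmode        -- pool property value: "wait", "continue" or "panic"
    in_flight_write -- True for a write that already existed when the pool
                       failed; False for any subsequent read or write
    (Both parameters are illustrative assumptions, not a ZFS interface.)
    """
    if failmode == "wait":
        # Default: every I/O to the pool blocks.
        return Outcome.BLOCKS
    if failmode == "continue":
        # Existing writes get EIO; anything issued afterwards blocks.
        return Outcome.EIO if in_flight_write else Outcome.BLOCKS
    if failmode == "panic":
        return Outcome.PANIC
    raise ValueError(f"unknown failmode: {failmode!r}")


if __name__ == "__main__":
    for mode in ("wait", "continue", "panic"):
        for existing in (True, False):
            kind = "existing write" if existing else "subsequent I/O"
            print(f"failmode={mode:<8} {kind:<14} -> {io_outcome(mode, existing).value}")
```

Note that with either "wait" or "continue", new I/O ends up blocked,
which would look like "commands hang" from the console.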

If the CAM layer is what's wedged, that may be a different issue (and
not related to ZFS). I would suggest running stable/9 as many
improvements in this regard have been committed recently (some related
to CAM, others related to ZFS and its new "deadman" watcher).

Bottom line: terse output of the problem does not help. Be verbose,
provide all output (commands you type, everything!), as well as any
physical actions you take.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |