From: Radio młodych bandytów
Date: Fri, 12 Apr 2013 23:52:31 +0200
To: Jeremy Chadwick, Quartz
Cc: freebsd-fs@freebsd.org
Subject: Re: A failed drive causes system to hang
Message-ID: <5168821F.5020502@o2.pl>
In-Reply-To: <20130411212408.GA60159@icarus.home.lan>
References: <51672164.1090908@o2.pl> <20130411212408.GA60159@icarus.home.lan>

On 11/04/2013 23:24, Jeremy Chadwick wrote:
> On Thu, Apr 11, 2013 at 10:47:32PM +0200, Radio młodych bandytów wrote:
>> Seeing a ZFS thread, I decided to write about a similar problem that
>> I experience.
>> I have a failing drive in my array.
>> I need to RMA it, but don't have
>> time and it fails rarely enough to be a yet another annoyance.
>> The failure is simple: it fails to respond.
>> When it happens, the only thing I found I can do is switch consoles.
>> Any command fails, login fails, apps hang.
>>
>> On the 1st console I see a series of messages like:
>>
>> (ada0:ahcich0:0:0:0): CAM status: Command timeout
>> (ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
>> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED
>>
>> I use RAIDZ1 and I'd expect that none single failure would cause the
>> system to fail...
>
> You need to provide full output from "dmesg", and you need to define
> what the word "fails" means (re: "any command fails", "login fails").

Fails = hangs. When trying to log in, I can type my user name, but after
I press enter the password prompt never appears.
As to dmesg, tough luck. I have 2 photos on my phone and their
transcripts are all I can give until the problem reappears (which should
take up to 2 weeks). The photos are blurry and in many cases I'm not
sure what exactly is there.

Screen 1:
(ada0:ahcich0:0:0:0): FLUSHCACHE40. ACB: (ea?) 00 00 00 00 (cut?)
(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-qu (cut)
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 05 d3(cut) 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 03 7b(cut) 00
(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-qu (cut)
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 03 d0(cut) 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated

Screen 2:
ahcich0: Timeout on slot 29 port 0
ahcich0: (unreadable, lots of numbers, some text)
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: (cc?) 00 (cut)
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error (5?), Retry was blocked
ahcich0: Timeout on slot 29 port 0
ahcich0: (unreadable, lots of numbers, some text)
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: (cc?) 00 (cut)
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error (5?), Retry was blocked
ahcich0: Timeout on slot 30 port 0
ahcich0: (unreadable, lots of numbers, some text)
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 01 (cut)
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 01 (cut)

Both are from the same event. In general, the messages:

(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED.

are the most common. I once waited for more than half an hour and the
system didn't return to a working state; the messages kept flowing and
pretty much nothing was working.

What's interesting, I remember that it happened to me even when I was
using an installer (the PC-BSD one), before the actual installation
began, so the disk stored no program data. And I *think* there was no
ZFS yet anyway.

>
> I've already demonstrated that loss of a disk in raidz1 (or even 2 disks
> in raidz2) does not cause "the system to fail" on stable/9.  However,
> if you lose enough members or vdevs to cause catastrophic failure, there
> may be anomalies depending on how your system is set up:
>
> http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html
>
> If the pool has failmode=wait, any I/O to that pool will block (wait)
> indefinitely.  This is the default.
>
> If the pool has failmode=continue, existing write I/O operations will
> fail with EIO (I/O error) (and hopefully applications/daemons will
> handle that gracefully -- if not, that's their fault) but any subsequent
> I/O (read or write) to that pool will block (wait) indefinitely.
>
> If the pool has failmode=panic, the kernel will immediately panic.
>
> If the CAM layer is what's wedged, that may be a different issue (and
> not related to ZFS).  I would suggest running stable/9 as many
> improvements in this regard have been committed recently (some related
> to CAM, others related to ZFS and its new "deadman" watcher).

Yeah, because of the installer failure, I don't think it's related to
ZFS. Even if it is, for now I won't set any ZFS properties in hope it
repeats and I can get better data.

>
> Bottom line: terse output of the problem does not help.  Be verbose,
> provide all output (commands you type, everything!), as well as any
> physical actions you take.
>

Yep. In fact having little data was what made me hesitate to write about
it; since I did already, I'll do my best to get more info, though for
now I can only wait for a repetition.

On 12/04/2013 00:08, Quartz wrote:
>
>> Seeing a ZFS thread, I decided to write about a similar problem that I
>> experience.
>
> I'm assuming you're referring to my "Failed pool causes system to hang"
> thread. I wonder if there's some common issue with zfs where it locks up
> if it can't write to disks how it wants to.
>
> I'm not sure how similar your problem is to mine. What's your pool setup
> look like? Redundancy options? Are you booting from a pool? I'd be
> interested to know if you can just yank the cable to the drive and see
> if the system recovers.
>
> You seem to be worse off than me- I can still login and run at least a
> couple commands. I'm booting from a straight ufs drive though.
>
> ______________________________________
> it has a certain smooth-brained appeal
>

Like I said, I don't think it's ZFS-specific, but just in case...:
RAIDZ1, root on ZFS. I should reduce the severity of a pool loss before
pulling cables, so no tests for now.
--
Twoje radio
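
Since the console photos only capture fragments, it may help to tally the CAM status lines once a complete log can be saved. Below is a minimal Python sketch, assuming the log keeps the "(periph:channel:...): CAM status: ..." shape transcribed above. The function name tally_cam_status and the file name console.log are hypothetical and only serve the illustration.

```python
import re
from collections import Counter

# Matches lines such as "(ada0:ahcich0:0:0:0): CAM status: Command timeout".
# The pattern is an assumption based on the transcribed console messages.
CAM_STATUS = re.compile(
    r"^\((\w+):(\w+):\d+:\d+:\d+\): CAM status: (.+?)\s*$"
)


def tally_cam_status(lines):
    """Count CAM status messages per (peripheral, channel, status) triple."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


def format_report(counts):
    """Return one human-readable line per triple, most frequent first."""
    return [
        f"{n:6d}  {periph} on {channel}: {status}"
        for (periph, channel, status), n in counts.most_common()
    ]
```

With a saved log, something like `print("\n".join(format_report(tally_cam_status(open("console.log")))))` would show whether timeouts are confined to one peripheral (here ada0 on ahcich0) or spread across channels.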