Date:      Fri, 12 Apr 2013 23:52:31 +0200
From:      Radio młodych bandytów <radiomlodychbandytow@o2.pl>
To:        Jeremy Chadwick <jdc@koitsu.org>, Quartz <quartz@sneakertech.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: A failed drive causes system to hang
Message-ID:  <5168821F.5020502@o2.pl>
In-Reply-To: <20130411212408.GA60159@icarus.home.lan>
References:  <mailman.11.1365681601.78138.freebsd-fs@freebsd.org> <51672164.1090908@o2.pl> <20130411212408.GA60159@icarus.home.lan>

On 11/04/2013 23:24, Jeremy Chadwick wrote:
> On Thu, Apr 11, 2013 at 10:47:32PM +0200, Radio młodych bandytów wrote:
>> Seeing a ZFS thread, I decided to write about a similar problem that
>> I experience.
>> I have a failing drive in my array. I need to RMA it, but don't have
>> time and it fails rarely enough to be yet another annoyance.
>> The failure is simple: it fails to respond.
>> When it happens, the only thing I found I can do is switch consoles.
>> Any command fails, login fails, apps hang.
>>
>> On the 1st console I see a series of messages like:
>>
>> (ada0:ahcich0:0:0:0): CAM status: Command timeout
>> (ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
>> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED
>>
>> I use RAIDZ1 and I'd expect that no single failure would cause the
>> system to fail...
>
> You need to provide full output from "dmesg", and you need to define
> what the word "fails" means (re: "any command fails", "login fails").
Fails = hangs. When trying to log in, I can type my user name, but after 
I press Enter the password prompt never appears.
As for dmesg, tough luck. Two photos on my phone and their transcripts 
are all I can give until the problem reappears (which may take up to two 
weeks). The photos are blurry and in many places I'm not sure exactly 
what they say.
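
Next time I'll try to capture the text itself instead of photographing the 
screen; roughly something like this, assuming the box still accepts input 
and /var/tmp is on a disk that still takes writes (the file name is just 
an example):

dmesg -a > /var/tmp/dmesg-hang.txt   # dump the kernel message buffer to a file
grep ahcich0 /var/log/messages       # syslogd normally records the same kernel messages here,
                                     # so they may survive a reboot even if the console scrolls away

For now, the photo transcripts are all I have: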

Screen 1:
(ada0:ahcich0:0:0:0): FLUSHCACHE40. ACB: (ea?) 00 00 00 00 (cut?)
(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-qu (cut)
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 05 d3(cut)
00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 03 7b(cut)
00
(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-qu (cut)
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 03 d0(cut)
00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated


Screen 2:
ahcich0: Timeout on slot 29 port 0
ahcich0: (unreadable, lots of numbers, some text)
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: (cc?) 00 (cut)
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error (5?), Retry was blocked
ahcich0: Timeout on slot 29 port 0
ahcich0: (unreadable, lots of numbers, some text)
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: (cc?) 00 (cut)
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error (5?), Retry was blocked
ahcich0: Timeout on slot 30 port 0
ahcich0: (unreadable, lots of numbers, some text)
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 01 (cut)
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 01 (cut)

Both are from the same event. In general, messages:

(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED.

are the most common.

Once I waited for more than half an hour and the system never returned to 
a working state; the messages kept flowing and pretty much nothing worked. 
Interestingly, I remember it happening even while I was running the 
installer (the PC-BSD one), before the actual installation began, so the 
disk held no program data. And I *think* there was no ZFS involved yet 
anyway.

>
> I've already demonstrated that loss of a disk in raidz1 (or even 2 disks
> in raidz2) does not cause ""the system to fail"" on stable/9.  However,
> if you lose enough members or vdevs to cause catastrophic failure, there
> may be anomalies depending on how your system is set up:
>
> http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html
>
> If the pool has failmode=wait, any I/O to that pool will block (wait)
> indefinitely.  This is the default.
>
> If the pool has failmode=continue, existing write I/O operations will
> fail with EIO (I/O error) (and hopefully applications/daemons will
> handle that gracefully -- if not, that's their fault) but any subsequent
> I/O (read or write) to that pool will block (wait) indefinitely.
>
> If the pool has failmode=panic, the kernel will immediately panic.
>
> If the CAM layer is what's wedged, that may be a different issue (and
> not related to ZFS).  I would suggest running stable/9 as many
> improvements in this regard have been committed recently (some related
> to CAM, others related to ZFS and its new "deadman" watcher).

Yeah, because of the installer failure I don't think it's related to ZFS.
Even if it is, for now I won't change any ZFS properties, in the hope that 
the problem repeats and I can get better data.
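
For reference, the property Jeremy describes is read and changed with 
zpool(8); a minimal sketch, where "tank" stands in for whatever the pool 
is actually called:

zpool get failmode tank            # default is "wait": I/O to a faulted pool blocks indefinitely
zpool set failmode=continue tank   # pending writes fail with EIO instead of blocking
zpool set failmode=panic tank      # or panic the kernel immediately on catastrophic failure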
>
> Bottom line: terse output of the problem does not help.  Be verbose,
> provide all output (commands you type, everything!), as well as any
> physical actions you take.
>
Yep. In fact, having so little data was what made me hesitate to write 
about this in the first place; since I did anyway, I'll do my best to get 
more info, though for now I can only wait for it to happen again.
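
When it does repeat, my rough plan is to grab at least the following from 
another console before rebooting, assuming anything still executes at all 
(smartctl comes from sysutils/smartmontools; ada0 is the drive from the 
messages above):

zpool status -v          # how ZFS sees the pool and its raidz1 members at that moment
camcontrol devlist       # which devices the CAM layer still sees
smartctl -a /dev/ada0    # SMART attributes and error log of the suspect drive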


On 12/04/2013 00:08, Quartz wrote:
 >
 >> Seeing a ZFS thread, I decided to write about a similar problem that I
 >> experience.
 >
 > I'm assuming you're referring to my "Failed pool causes system to hang"
 > thread. I wonder if there's some common issue with zfs where it locks up
 > if it can't write to disks how it wants to.
 >
 > I'm not sure how similar your problem is to mine. What's your pool setup
 > look like? Redundancy options? Are you booting from a pool? I'd be
 > interested to know if you can just yank the cable to the drive and see
 > if the system recovers.
 >
 > You seem to be worse off than me; I can still log in and run at least a
 > couple of commands. I'm booting from a straight UFS drive though.
 >
 > ______________________________________
 > it has a certain smooth-brained appeal
 >
Like I said, I don't think it's ZFS-specific, but just in case: RAIDZ1, 
with root on ZFS. I'd want to reduce the impact of losing the pool before 
I start pulling cables, so no tests for now.
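
If it matters, the root-on-ZFS part is easy to double-check even now; a 
quick sketch, nothing in it is specific to my machine:

mount | grep ' / '    # the filesystem mounted on / should be a ZFS dataset, not a /dev/adaXpY device
df -h /               # same information, with sizes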
-- 
Twoje radio


