Date:      Wed, 5 May 2010 08:17:07 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>
Subject:   Re: ZFS (zpool) doesn't detect failed drive
Message-ID:  <20100505151707.GA68166@icarus.home.lan>
In-Reply-To: <4BE18729.3050209@omnilan.de>
References:  <4BE16784.8050400@omnilan.de> <4BE18729.3050209@omnilan.de>

On Wed, May 05, 2010 at 04:56:41PM +0200, Harald Schmalzbauer wrote:
> Harald Schmalzbauer schrieb am 05.05.2010 14:41 (localtime):
> >Hello,
> >
> >one drive of my mirror failed today, but 'zpool status' shows it "online".
> >Every process using a ZFS mount hangs. Also 'zpool offline
> >/dev/ad1' hangs indefinitely.
> ...
> Sorry, I made an error with zpool create. Somehow the little word
> "mirror" must have been lost. So the pool wasn't a mirror but a
> stripe. Then of course I can't take one vdev offline. Sorry for the
> noise.
> But I took the opportunity to do some tests with that failing drive
> and created a _real_ mirror. That works without failures, but using
> the mirror again leads to:
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ata3: port is not ready (timeout 10000ms) tfd = 00000080
> ata3: hardware reset timeout
> ad1: FAILURE - device detached
> 
> Now zpool reports the vdev ad1 as still online although it has been
> detached, and 'atacontrol list' doesn't show it anymore:
> 
> zpool status
>   pool: URUBAmirrorP1
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
> 
>         NAME           STATE     READ WRITE CKSUM
>         URUBAmirrorP1  ONLINE       0     0     0
>           mirror       ONLINE       0     0     0
>             ad1        ONLINE       3  302K     0
>             ad2        ONLINE       0     0     0
> 
> errors: No known data errors
> 
> atacontrol list
> ATA channel 2:
>     Master:  ad0 <TRANSCEND/20090520> SATA revision 1.x
>     Slave:       no device present
> ATA channel 3:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 4:
>     Master:  ad2 <SAMSUNG HD154UI/1AG01118> SATA revision 2.x
>     Slave:       no device present
> ATA channel 5:
>     Master:  ad3 <ST3750640NS/3.AEG> SATA revision 1.x
>     Slave:       no device present
> 
> How should such a failure be handled?
> Do I have to manually mark the drive offline for zpool?

You shouldn't have to; this should happen automatically when the
underlying device goes away.  GEOM should see the device gone, and ZFS
should therefore be marking the pool as DEGRADED and the ad1 disk as
FAULTED (or possibly OFFLINE).
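
Since the pool state alone is misleading here (ad1 is listed ONLINE
with 3 read and 302K write errors), a script can look at the per-device
error counters instead.  What follows is only a minimal sketch: the
function name flag_vdev_errors is made up for illustration, and it
assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown in the
quoted 'zpool status' output.

```shell
#!/bin/sh
# Print every vdev line whose READ, WRITE, or CKSUM counter is non-zero.
# Reads `zpool status` output on stdin. The function name is hypothetical;
# the column layout is assumed to match the output quoted in this thread.
flag_vdev_errors() {
    awk '
        NF == 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1 ": read=" $3 " write=" $4 " cksum=" $5
        }'
}

# Possible use on a live system:
#   zpool status URUBAmirrorP1 | flag_vdev_errors
```

Counters are compared as strings, so abbreviated values such as 302K
are still treated as non-zero.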

Is AHCI enabled in the BIOS and actually in use on this system?  If not,
I could see that being a potential problem, though I have no idea where
it should be fixed.  If AHCI is available and in use, can you try setting
ahci_load="yes" in /boot/loader.conf[1] to see if CAM handles this
situation better?

Quick atacontrol<-->camcontrol conversion chart:

atacontrol list          = camcontrol devlist
atacontrol cap <disk>    = camcontrol identify <disk>
atacontrol detach <chan> = not needed AFAIK (just yank the disk)
atacontrol attach <chan> = may not be needed, but if disk doesn't
                           reappear try "camcontrol reset" or
                           "camcontrol rescan"

[1]: WARNING: this will change your device names from ad0->ada0,
ad1->ada1, etc., so you may have to boot single-user and fix /etc/fstab.
No need to mess with ZFS after the device naming changes; ZFS will taste
metadata on all disks attached and automatically load the pools (one
thing about ZFS I greatly appreciate.  :-) )
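
For the /etc/fstab part, the rename could be scripted roughly as below.
This is a sketch only: the function name rename_ad_to_ada is invented
for the example, and it assumes the unit numbers carry over unchanged
(ad0->ada0, ad1->ada1), which may not hold on every system, so check
the result against "camcontrol devlist" before installing it.

```shell
#!/bin/sh
# Rewrite legacy ata(4) device names (/dev/adN) to CAM names (/dev/adaN).
# Reads fstab-style text on stdin and writes the result to stdout;
# nothing is modified in place. Assumes unit numbers stay the same.
rename_ad_to_ada() {
    # "ad" must be followed directly by a digit, so existing adaN
    # entries are left untouched.
    sed -E 's#/dev/ad([0-9]+)#/dev/ada\1#g'
}

# Possible use from single-user mode (review the output first):
#   rename_ad_to_ada < /etc/fstab > /etc/fstab.new
```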

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



