Date:      Mon, 7 Jun 2010 02:08:50 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Andriy Gapon <avg@icyb.net.ua>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: zfs i/o error, no driver error
Message-ID:  <20100607090850.GA49166@icarus.home.lan>
In-Reply-To: <4C0CB3FC.8070001@icyb.net.ua>
References:  <4C0CAABA.2010506@icyb.net.ua> <20100607083428.GA48419@icarus.home.lan> <4C0CB3FC.8070001@icyb.net.ua>

On Mon, Jun 07, 2010 at 11:55:24AM +0300, Andriy Gapon wrote:
> on 07/06/2010 11:34 Jeremy Chadwick said the following:
> > On Mon, Jun 07, 2010 at 11:15:54AM +0300, Andriy Gapon wrote:
> >> During a recent zpool scrub, one read error was detected and "128K repaired".
> >>
> >> In system log I see the following message:
> >> ZFS: vdev I/O failure, zpool=tank
> >> path=/dev/gptid/536c6f78-e4f3-11de-b9f8-001cc08221ff offset=284456910848
> >> size=131072 error=5
> >>
> >> On the other hand, there are no other errors, nothing from geom, ahci, etc.
> >> Why would that happen? What kind of error could this be?
> > 
> > I believe this indicates silent data corruption[1], which ZFS can
> > auto-correct if the pool is a mirror or raidz (otherwise it can detect
> > the problem but not fix it).
> 
> This pool is a mirror.
> 
> > This can happen for a lot of reasons, but
> > tracking down the source is often difficult.  Usually it indicates the
> > disk itself has some kind of problem (cache going bad, some sector
> > remaps which didn't happen or failed, etc.).
> 
> Please note that this is not a CKSUM error, but a READ error.

Okay, then it indicates reading some data off the disk failed.  ZFS
auto-corrected it by reading the data from the other member in the pool
(ada0p4).  That's confirmed here:

> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> 
>         NAME                                            STATE     READ WRITE CKSUM
>         tank                                            ONLINE       0     0     0
>           mirror                                        ONLINE       0     0     0
>             ada0p4                                      ONLINE       0     0     0
>             gptid/536c6f78-e4f3-11de-b9f8-001cc08221ff  ONLINE       1     0     0  128K repaired
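
Once the disk checks out, the per-vdev error counters can be reset so
that any future non-zero count stands out.  Rough sketch (pool and vdev
names taken from your status output above):

    # re-check that nothing else is pending
    zpool status -v tank

    # clear the READ counter on the repaired vdev; omit the vdev
    # argument to clear counters pool-wide
    zpool clear tank gptid/536c6f78-e4f3-11de-b9f8-001cc08221ff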

> > - Full "smartctl -a /dev/XXX" for all disk members of zpool "tank"
> 
> The output for both disks is "perfect".
> I monitor them regularly; smartd is also running and has no complaints.

Most people I know of do not know how to interpret SMART statistics, and
that's not their fault -- which is why I requested them.  :-)  In this
case, I'd like to see "smartctl -a" output for the disk that's
associated with the above GPT ID.  There may be some attributes or data
in the SMART error log which could indicate what's going on.  smartd
does not know how to interpret data; it just logs what it sees.
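
If it helps, something like this should map that GPT ID back to the
underlying device node so you can run smartctl against it (sketch only;
"ada1" below is just a placeholder for whatever glabel reports):

    # find which provider backs the gptid label
    glabel status | grep 536c6f78

    # then pull full SMART data from the disk it points at
    smartctl -a /dev/ada1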

> > Furthermore, what made you decide to scrub the pool on a whim?
> 
> Why on a whim? It was a regularly scheduled scrub (bi-weekly).

I'm still trying to figure out why people do this.  ZFS will
automatically detect and correct errors of this sort when it encounters
them during normal operation.  It's good that you caught an error ahead
of time, but ZFS would have dealt with this on its own.

It's important to remember that scrubs are *highly* intensive on both
the system itself as well as on all pool members.  Disk I/O activity is
very heavy during a scrub; it's not considered "normal use".
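
If you do keep the bi-weekly run, kicking a scrub off by hand during
quiet hours and watching the load makes the cost obvious.  Rough sketch
(the pool name is yours; everything else is generic):

    # start a scrub (or have cron do it off-hours)
    zpool scrub tank

    # check progress and the estimated completion time
    zpool status tank

    # watch per-disk I/O while the scrub runs
    gstat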

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
