Date:      Sat, 26 Jan 2008 11:30:56 -0700
From:      Joe Peterson <joe@skyrush.com>
To:        freebsd-stable@freebsd.org
Subject:   Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Message-ID:  <479B7C60.7000800@skyrush.com>
In-Reply-To: <20080126012124.GA53400@eos.sc1.parodius.com>
References:  <479A0731.6020405@skyrush.com> <20080125162940.GA38494@eos.sc1.parodius.com> <479A3764.6050800@skyrush.com> <3803988D-8D18-4E89-92EA-19BF62FD2395@mac.com> <479A4CB0.5080206@skyrush.com> <20080126003845.GA52183@eos.sc1.parodius.com> <479A86E5.5060806@skyrush.com> <20080126012124.GA53400@eos.sc1.parodius.com>

I performed a ZFS scrub, which finished yesterday, and no new
/var/log/messages errors were reported during that time.  However, the scrub
found something interesting:


crater# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Fri Jan 25 12:52:32 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     2
          ad0s1d    ONLINE       1     3     2

errors: Permanent errors have been detected in the following files:


/home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3



Note that I have not touched this file since copying it to this drive.

So, it seems one file failed a checksum check during the scrub.  As expected,
I now get errors when trying to read this file, presumably ZFS reporting the
corruption.  When I logged in tonight, two more WRITE_DMA48 TIMEOUT/FAILURE
disk messages appeared in /var/log/messages just as I was typing my password,
which might be a coincidence.
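
To tell coincidence from correlation, the per-device READ/WRITE/CKSUM counters
from `zpool status` could be recorded each time a timeout message shows up.
The following is only a rough sketch, assuming the config table layout shown
above; `parse_zpool_counters` is a hypothetical helper, not part of any ZFS
tool, and it takes the text of the status output as a string.

```python
def parse_zpool_counters(status_text):
    """Return {device_name: (read, write, cksum)} from `zpool status` text.

    Assumes the config table starts with a NAME/STATE/READ/WRITE/CKSUM
    header row and ends at the "errors:" line, as in the output above.
    Rows whose counters are not plain integers are skipped.
    """
    counters = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config:
            continue
        if line.startswith("errors:"):
            break
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            counters[fields[0]] = tuple(int(f) for f in fields[2:])
    return counters
```

Comparing two such snapshots taken before and after a WRITE_DMA48 message
would show whether the pool's counters move at the same time.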

Also, smartctl still shows PASSED; however, this is interesting:

195 Hardware_ECC_Recovered  0x001a   061   046   000    Old_age   Always       -       9070

The number is much *smaller* now!  It was "6" a few minutes before this...
wrap around?  Hmm, I'm really not sure, at this point, what is going on.
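
One way to check for a wrap-around would be to log the raw value at intervals
and watch how it changes.  A minimal sketch, assuming the usual ten-column
attribute row that `smartctl -A` prints (ID, name, flag, value, worst,
threshold, type, updated, when-failed, raw value); `SmartAttr` and
`parse_smart_row` are made-up names for illustration.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value
    worst: int   # worst normalized value seen
    thresh: int  # failure threshold
    raw: str     # raw value, kept as text since its format is vendor-specific


def parse_smart_row(row):
    """Split one attribute row from `smartctl -A` into named fields."""
    # Split into at most 10 fields so a raw value containing spaces stays whole.
    fields = row.split(None, 9)
    if len(fields) < 10:
        raise ValueError("expected 10 columns, got %d" % len(fields))
    return SmartAttr(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=fields[9].strip(),
    )
```

Feeding it the Hardware_ECC_Recovered row every few minutes and printing
`raw` with a timestamp would show whether the counter resets to a small
number and then climbs again.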

So I have started a "SeaTools" (disk scanner from Seagate) "long test" of the
drive.  The short test passed already.  The results should be interesting.  If
it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
bugs that just happen to look like drive problems.  I already did a long read
of the disk contents under Linux and got no messages about anything wrong.

If I can turn on any debugging info to help determine if this is
software-related, let me know the magic keywords to use.  :)

							-Joe


