Date:      Wed, 27 Feb 2008 04:11:29 -0800
From:      Jeremy Chadwick <koitsu@freebsd.org>
To:        Stephen Hurd <shurd@sasktel.net>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE
Message-ID:  <20080227121129.GA76419@eos.sc1.parodius.com>
In-Reply-To: <47C52948.2070500@sasktel.net>
References:  <47C52948.2070500@sasktel.net>

On Wed, Feb 27, 2008 at 01:11:36AM -0800, Stephen Hurd wrote:
> ...  The corrupted sync message scared the heck out of me:
> Waiting (max 60 seconds) for system process `vnlru' to stop...done
> Waiti
> Synncgi n(gm adxi sk6s0,  svencoodnedss )r efmoari nsiynsgte.m. .pr1o0c ess 
> `syncer' to stop...8 7 8 3 3 3 1 0 0 0 0 done

http://lists.freebsd.org/pipermail/freebsd-current/2007-October/078145.html
http://lists.freebsd.org/pipermail/freebsd-current/2007-November/079130.html
http://lists.freebsd.org/pipermail/freebsd-current/2007-November/079131.html
http://lists.freebsd.org/pipermail/freebsd-stable/2007-December/038727.html


> And after the reboot, the READ_DMA timeouts were back.

You're not the only one seeing this behaviour.  There have been many
past posts reporting similar problems.  Here's the breakdown:

* Some reporting this problem have been told to replace their ATA or
  SATA cables (which have previously been known to be working, but cables
  going bad does happen) -- and this has fixed the problem for a couple.

* Some have checked their SMART stats and found their disks to be in
  perfect condition.

* Some have switched to alternate operating systems (usually Linux) for
  a short while and seen no sign of DMA timeouts.

* Some have replaced the storage controller to no avail, and some have
  replaced the entire motherboard to no avail.  In some cases (myself
  included), replacing the motherboard did in fact help.

However: in your case, your disk does look to have problems based on the
SMART output you provided.  It does not matter how new/old the disk is,
by the way.  I'll point out the problematic stats.  You need to replace
the disk ASAP.

BTW, any SMART attribute you see labelled "Offline" will not have its
numbers updated until you perform an offline test (smartctl -t short or
smartctl -t long).

> The only "odd" think I can think of about my system is an unusually high HZ 
> value (2386) I'm building a kernel now with 1000 to check if that makes a 
> difference.

This is not the cause, rest assured.

> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       4

This shows you've had 4 reallocated sectors, meaning your disk does in
fact have bad blocks.  In 90% of the cases out there, bad blocks
continue to "grow" over time, due to whatever reason (I remember reading
an article explaining it, but I can't for the life of me find the URL).

> 194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48

This is excessive, and may be contributing to problems.  A hard disk
running at 48C is not a good sign.  This should really be somewhere
between the high 20s and mid 30s.

> 195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       11498

This implies a large number of ECC (error correction) activities have
occurred, but all were successful.
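
To make the checks above concrete, here is a minimal Python sketch that
parses attribute rows in the layout quoted from smartctl and flags the
two things I called out: a non-zero reallocated sector count and a high
temperature.  The names (SmartAttr, parse_attributes, find_warnings) are
made up for illustration, and the 35C limit is only my "mid 30s" rule of
thumb from above, not a vendor specification.

```python
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int    # normalised value
    worst: int
    thresh: int
    updated: str  # "Always" or "Offline"
    raw: int


# The three attribute rows quoted in this message.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       4
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       11498
"""


def parse_attributes(text: str) -> List[SmartAttr]:
    """Parse whitespace-separated attribute rows; skip anything else."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and a numeric ID first.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs.append(SmartAttr(
                attr_id=int(fields[0]),
                name=fields[1],
                value=int(fields[3]),
                worst=int(fields[4]),
                thresh=int(fields[5]),
                updated=fields[7],
                raw=int(fields[9]),
            ))
        except ValueError:
            # Rows whose numeric columns do not parse are ignored.
            continue
    return attrs


def find_warnings(attrs: List[SmartAttr], max_temp_c: int = 35) -> List[str]:
    """Return human-readable notes; max_temp_c is an assumed limit."""
    notes = []
    for a in attrs:
        if a.name == "Reallocated_Sector_Ct" and a.raw > 0:
            notes.append(f"{a.raw} reallocated sectors")
        elif a.name == "Temperature_Celsius" and a.raw > max_temp_c:
            notes.append(f"temperature {a.raw}C exceeds {max_temp_c}C")
        # A normalised value at or below a non-zero threshold means the
        # drive itself considers the attribute failed.
        if a.thresh > 0 and a.value <= a.thresh:
            notes.append(f"{a.name} is at or below its failure threshold")
    return notes


if __name__ == "__main__":
    for note in find_warnings(parse_attributes(SAMPLE)):
        print(note)
```

Note that none of the normalised values here have crossed their
thresholds, which is why the drive does not report itself as failing
even though the raw numbers are worth worrying about.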

> Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>   When the command that caused the error occurred, the device was in an unknown state.
> Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>   When the command that caused the error occurred, the device was in an unknown state.

These are automated SMART log entries confirming the DMA failures.  The
fact that SMART saw them means that the disk is also aware of said
issues.  These may have been caused by the reallocated sectors.  It's
also interesting that the LBAs are different than the ones FreeBSD
reported issues with.

My advice to you is: replace the disk ASAP.  This problem will only get
worse.  Try another hard disk brand too (I don't have anything "against"
Maxtor, but usually it's recommended to avoid a brand you've had
problems with until the next time you have issues, then switch brands
again, and so on).  I'm very fond of Western Digital's SE16, RE, and RE2
series currently.  But avoid Fujitsu and Samsung (both have a long track
record of buggy drive firmware, forcing vendors to make custom
workarounds for issues); stick with Seagate, Western Digital, or Maxtor.

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |



