Date:      Sun, 14 Mar 2004 15:12:26 -1000
From:      Clifton Royston <cliftonr@lava.net>
To:        kuku@kukulies.org
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: off topic - disk crash
Message-ID:  <20040315011225.GA8554@lava.net>
In-Reply-To: <20040314200051.18FBF16A4FD@hub.freebsd.org>
References:  <20040314200051.18FBF16A4FD@hub.freebsd.org>

> From: "Christoph P. Kukulies" <kuku@kukulies.org>
> Subject: Re: off topic - disk crash
> To: freebsd-hackers@freebsd.org
> Message-ID: <20040314104717.GA16158@gilberto.physik.rwth-aachen.de>
> Content-Type: text/plain; charset=iso-8859-1
> 
> On Fri, Mar 12, 2004 at 03:58:16PM +0100, Dag-Erling Smørgrav wrote:
> > Clifton Royston <cliftonr@tikitechnologies.com> writes:
> > > > Today an important (no backup of course) 46 GB IBM Deskstar
> > > > IDE disk crashed.
> > > This specific line of drives is infamous for a failure rate that's at
> > > least a full order of magnitude above the industry average for ATA
> > > drives.  Google a bit for it.
> > 
> > Not the entire DeskStar line, just the 75GXP series.  I still have
> > several 16Gs and at least one 60GXP that have never given me any
> > trouble, and they were fast and silent for their time, head and
> > shoulders ahead of the competition.  These days I mostly buy WD...
> > 
> > > > The disk boots into FreeBSD but already at power on time the disk does
> > > > seek retries or some recalibration noise.
> > 
> > Also known as the "click of death"...
> 
> 
> Thanks for all the helpful tips so far. It is a DTLA 307045 (3.5")
> Don't know whether this is a 75GXP.

  Yes it is.  AFAIK all the 46GB Deskstars for a couple of years were
75GXPs, and this particular model (DTLA 307045) is the exact one I was
researching and replacing the week before.  (BTW, des@ is correct that
not all Deskstars are bad; I have a later-model Deskstar in my home
FreeBSD machine which has been fine.  It was the first generation of
drives with higher-density platters that had the huge mortality rate.)

> I'm getting either these:
> 
> ad2: TIMEOUT - READ_DMA retrying (2 retries left)  LBA=30583
> 
> Which don't stop the dd process.
> 
> And these,
> 
> ad2: FAILURE  -  READ_DMA status=51<READY, DSC, ERROR> error=40<UNCORRECTABLE> LBA=9156
> 
> leading to termination.
> 
> Also the transfer rate is terribly slow: (80 KB/s)
 
  This is because the drive firmware itself is retrying over and over
before it reports errors back up to the controller.

> I was able to save 18 MB (of 46 GB) (not much so far)
> 
> Any other suggestions? 
> 
> Could I increase the retry count? Or enforce continuation even in case of
> hard errors? 

  I see you got responses later on to your other questions on
continuing past hard errors; sounds like those are on the right track. 
If anything, you might want to cut down the driver-level retries,
because by the time the failure is returned, the drive itself has
already retried exhaustively, but I don't know how you might do that.
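
  For illustration only, here is a minimal Python sketch of the
"continue past hard errors" idea: read the source block by block, and
when a block fails, write zeros in its place so that everything after it
stays at the right offset.  This is roughly what dd's conv=noerror,sync
does.  The names rescue_copy, read_block and total_size are hypothetical,
os.pread is assumed to be available (Unix), and for a raw device you
would likely have to pass total_size yourself, since fstat may not
report a device's size.

```python
import os

def rescue_copy(src_path, dst_path, block_size=512, total_size=None,
                read_block=os.pread):
    """Copy src to dst block by block, zero-filling unreadable blocks.

    Returns the list of byte offsets whose reads failed.
    read_block(fd, length, offset) is injectable so it can be tested.
    """
    bad_offsets = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        if total_size is None:
            # Works for regular files; a device may report 0 here.
            total_size = os.fstat(src).st_size
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < total_size:
                want = min(block_size, total_size - offset)
                try:
                    data = read_block(src, want, offset)
                except OSError:
                    # Hard error: note the offset and keep going.
                    data = b""
                    bad_offsets.append(offset)
                # Pad failed or short reads so later offsets stay aligned.
                dst.write(data.ljust(want, b"\0"))
                offset += want
    finally:
        os.close(src)
    return bad_offsets
```

  A small block size loses less data around each bad sector, at the cost
of more read calls; the returned offsets tell you which parts of the
copy are zero-filled rather than real data.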

[merged with following post]
> > I'm about to get me a second identical model and maybe I then can dd
> > the whole image including partition table so that I will not have to
> > scan the disk for the start of the filesystems.
> 
> Don't get another DTLA/AVER IBM disk, you will just have the same problem 
> again sometime in the future, stay away from IBM/Hitachi disks that are 
> based on these models (I don't know much about the newer disks from 
> Hitachi and frankly I won't waste my money on them to find out).

  If you get the same model/line of DeskStars, you run a high risk of
the same problem.  Get later models and you're probably OK, but I don't
think those include 46GBs.  IBM did solve the manufacturing problem,
but not before their initial coverups and lies about it had completely
ruined a once-proud reputation.  (BTW, the 18GB Ultrastar SCSI drives
from the same period have much the same problem.  I've had some of
those die in servers within a month or two, whereas I had run many
9.1GB IBM SCSI drives for years of continuous duty without a single
failure.)

[...]
> Another question is whether the read error occurs on the actual data
> or only during the fstat or directory read. Is it possible to mount a 
> FS with an alternate superblock as information base or do I have to fsck 
> (write back to the disk risking that things get worse)

  You would want to avoid running fsck on the old disk, or doing
anything else that would cause writes to it.  I don't think there is any
way to specify an alternate superblock for a read-only mount (though it
would sure be a slick idea if you could).  However, if you succeed in
dd'ing the raw partitions off the disk to a new drive, then you can fsck
those *copies* using an alternate superblock if necessary.  I think
people were implicitly suggesting that as part of the recovery
approach.

  If you want to be extra sure about recovering everything you possibly
can, treat the partitions you dd'ed from the original disk onto another
drive as read-only reference copies: rather than fscking them, copy
them to yet another partition, try fscking that copy with different
options, and see what recovers the most.  Depending on how much you've
recovered so far and how much the data is worth to you, this may be
more effort than you're prepared to put in, but it does let you try
things out with the least further damage to the original disk.
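
  As a rough sketch of that reference-copy workflow, assuming the
rescued partition has been saved as an ordinary image file: mark the
reference image read-only, clone it to a scratch file, and check that
the clone is byte-identical before experimenting on it.  The function
names and file layout are hypothetical, not anything from the thread.

```python
import hashlib
import os
import shutil
import stat

def sha256_of(path, chunk=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def make_working_copy(reference, work):
    """Protect the reference image and clone it to a scratch file."""
    # Owner read-only, so a stray write to the reference fails
    # (root can still override this).
    os.chmod(reference, stat.S_IRUSR)
    # copyfile copies contents only, not the read-only mode bits,
    # so the scratch copy stays writable.
    shutil.copyfile(reference, work)
    if sha256_of(reference) != sha256_of(work):
        raise IOError("working copy does not match reference image")
    return work
```

  Each fsck experiment (for example fsck_ffs with -b to name an
alternate superblock) would then run against a fresh scratch copy, never
against the reference image or the failing disk.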

  -- Clifton

-- 
          Clifton Royston  --  cliftonr@tikitechnologies.com 
         Tiki Technologies Lead Programmer/Software Architect
Did you ever fly a kite in bed?  Did you ever walk with ten cats on your head?
  Did you ever milk this kind of cow?  Well we can do it.  We know how.
If you never did, you should.  These things are fun, and fun is good.
                                                                 -- Dr. Seuss


