Date:      Tue, 6 Jun 1995 09:18:49 -0700 (PDT)
From:      "Rodney W. Grimes" <rgrimes@gndrsh.aac.dev.com>
To:        dufault@hda.com (Peter Dufault)
Cc:        taob@gate.sinica.edu.tw, freebsd-hackers@freebsd.org
Subject:   Re: Quantum hardware errors (Deferred Error: HARDWARE FAILURE asc:87,0)
Message-ID:  <199506061618.JAA26313@gndrsh.aac.dev.com>
In-Reply-To: <199506061211.IAA16611@hda.com> from "Peter Dufault" at Jun 6, 95 08:11:18 am

> 
> Brian Tao writes:
> > 
> >     I mknod'd a 13:536870912 character special file and now it works.
> > Did the 2.0.5 installation mess up the minor device numbers?  I don't
> > see any *.ctl devices on my pre-2.0.5 machines, so I presume this is a
> > new feature (and thus couldn't have been left over from previous
> > installs)?  Anyhow, you were right about the write-cache:
> > 
> > # scsi -f rsd0.ctl -m 8
> > WCE:  1     <--- (write cache enabled, for the -hackers folks)
> > MF:  0 
> > RCD:  0 
> > Demand Retention Priority:  0 
> > Write Retention Priority:  0 
> > Disable Pre-fetch Transfer Length:  0    -> yours is 65535
> > Minumum Pre-fetch:  8                    -> yours is 0
> > Maximum Pre-fetch:  128 
> > Maximum Pre-fetch Ceiling:  128 
> > 
> >     So this is the firmware on the *drive* and not the controller
> > then... is this something we have to watch out for in 2.0.5, or has it
> > just been lurking around for the past six months on my 2.0 systems
> > without causing any trouble?
> 
> No, your drive has always had this setting.
> 
> You were living dangerously in the event of a power failure.

No, this is not true.  Have you studied the exact mechanism used to
do this in the drive?  They use the inertial energy of the rotating
disk to turn the spindle motor into a generator in case of power
failure and can actually write the cache after power goes down.  You
only cache on-cylinder data in the write-behind buffers, so you don't
need the energy to do a seek.

This technology is 3 years old now.  I worked for the company that
did the motor controller circuits for Conner when they first decided
to do this, and it works; in fact we had it to the point that you could
do +/- 2 track seeks and still have the power to write the cache!
It is amazing what you can do with SmartPower ASIC circuits and
a little creativity :-).

> Everyone should check their drives and disable this.

Wrong, or at least wrong for Quantum drives.  I have been running them
that way for 3 years in mass quantities (every Quantum drive shipped has
this turned on as far as I know) and have never seen a data loss that
you could attribute to a write-behind cache deferment.  AWRE should
correct the problem if it could not write the data due to a sector
going bad...

> Be sure you also have Automatic Write Reallocation Enabled and
> Automatic Read Reallocation Enabled in mode page 1.

This is true, and most manufacturers ship the drives with these
off, including Quantum.  Though I actually make sure it is off
before burn-in testing, and then turn it back on at the end
of burn-in.  This is to watch for growing defects on new drives
that I want to know about.  Any drive that grows an error this
early in its life is returned to the manufacturer as a DOA drive.
[And yes, I have returned a few, very few, but even 1 counts in
my book as a significant reduction in user risk of reliability.]
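
For anyone checking these two settings by hand: in the SCSI-2
read-write error recovery page (mode page 1), AWRE and ARRE are the
top two bits of byte 2.  A minimal sketch in Python of decoding that
byte follows.  The function name and the sample value 0xC0 are made up
for illustration; they are not taken from any drive in this thread.

```python
# Bit layout of byte 2 of the SCSI-2 read-write error recovery page
# (mode page 1), listed from bit 0 up to bit 7.
FLAGS = ["DCR", "DTE", "PER", "EER", "RC", "TB", "ARRE", "AWRE"]

def decode_recovery_flags(byte2):
    """Return a dict of flag name -> 0/1 for byte 2 of mode page 1."""
    return {name: (byte2 >> bit) & 1 for bit, name in enumerate(FLAGS)}

# Hypothetical sample: both reallocation bits set, everything else clear.
flags = decode_recovery_flags(0xC0)
print("AWRE:", flags["AWRE"], "ARRE:", flags["ARRE"])
```

With AWRE clear, a write that hits a bad sector is reported instead of
being silently remapped, which is the behaviour wanted during burn-in.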

> The nature of the deferred error is that the OS thought the write
> succeeded and then later on reports back an error (in this case,
> it said the write succeeded when it cached the data on the disk,
> and then reported the failure when it couldn't transfer the cache
> to disk for some reason).

I suspect turning on AWRE and running a verify operation on this
disk would clear up this person's problem if this is in fact what is
causing it.  Also dump the grown defect list; you may have a
zone that has no more spares left in it due to lots of grown
defects :-(.
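
The deferred-error sequence Peter describes can be sketched as a toy
model in Python.  This is only an illustration of the ordering of
events (write acknowledged, flush fails later, error reported on a
later command).  The class and method names are invented here and do
not correspond to any real firmware or driver interface.

```python
class WriteBackDrive:
    """Toy model of a drive with a write-behind cache (not real firmware)."""

    def __init__(self, bad_sectors=()):
        self.bad_sectors = set(bad_sectors)
        self.cache = {}      # sector -> data acknowledged but not yet on media
        self.media = {}      # sector -> data actually written
        self.deferred = []   # sectors whose flush to media failed

    def write(self, sector, data):
        # With WCE set, success is reported as soon as the data is cached.
        self.cache[sector] = data
        return "GOOD"

    def flush(self):
        # Later, the drive moves cached data to the media.
        for sector, data in sorted(self.cache.items()):
            if sector in self.bad_sectors:
                self.deferred.append(sector)
            else:
                self.media[sector] = data
        self.cache.clear()

    def next_command_status(self):
        # A failed flush surfaces on some later, unrelated command.
        if self.deferred:
            return "DEFERRED ERROR (sector %d)" % self.deferred.pop(0)
        return "GOOD"

drive = WriteBackDrive(bad_sectors=[7])
print(drive.write(7, b"data"))       # acknowledged immediately
drive.flush()
print(drive.next_command_status())   # the failure shows up only now
```

In this model the error is raised only if the sector cannot be
written; a drive with AWRE enabled would instead try to remap the
sector to a spare before giving up.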

> In 2.1 the system will sanity check the mode page settings during
> boot and will comment on anything it thinks is set wrong.

Great, and just what are the ``sanity'' values going to be??

> >     I then used "scsi -f /dev/rsd2.ctl -m 8 -e -P 3" to turn off
> > write-cache enable.  The minimum pre-fetch on my drive was set to 8,
> > but I noticed yours is 0.  I suppose this value means the drive will
> > always try to grab 8 blocks at once?  It is set to 0 now, to match
> > yours, if it makes any difference.
> 
> Mine may not be correct; it is just set as the vendor shipped it.
> Quantum may have some reason they want the drives set that way.

You just killed the read-ahead cache in the drive; this may adversely
affect performance.  I suggest you reset it to what the drive says
the default value is (-P 4??).
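
For reference, the fields printed by "scsi -m 8" earlier in this
message map onto the SCSI-2 caching page (mode page 8) roughly as in
the Python sketch below.  The function name is an assumption for
illustration, and the sample bytes are packed by hand to match the
values Brian quoted; they are not a dump from his drive.

```python
import struct

def decode_caching_page(page):
    """Decode the first 12 bytes of a SCSI-2 caching page (mode page 8)."""
    (code, length, flags, retention,
     disable_len, min_pf, max_pf, ceiling) = struct.unpack(">BBBBHHHH", page[:12])
    if code & 0x3F != 0x08:
        raise ValueError("not a caching page")
    return {
        "WCE": (flags >> 2) & 1,          # write cache enable
        "MF": (flags >> 1) & 1,           # multiplication factor
        "RCD": flags & 1,                 # read cache disable
        "demand_retention": retention >> 4,
        "write_retention": retention & 0x0F,
        "disable_prefetch_len": disable_len,
        "min_prefetch": min_pf,
        "max_prefetch": max_pf,
        "max_prefetch_ceiling": ceiling,
    }

# Hand-built sample: WCE=1, minimum pre-fetch 8, maximum and ceiling 128.
sample = struct.pack(">BBBBHHHH", 0x08, 0x0A, 0x04, 0x00, 0, 8, 128, 128)
for name, value in decode_caching_page(sample).items():
    print(name, value)
```

Comparing a decode of the current values against a decode of the
drive's default values is one way to see exactly which fields were
changed before putting them back.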

> >     Noticed another error from syslog that happened about 8 hours after:
> > 
> > /kernel: sd0(ncr0:6:0): Deferred Error: HARDWARE FAILURE asc:87,0
> > [...8 hours...]
> > /kernel: vnode_pager_input: I/O read error
> > /kernel: vm_fault: pager input (probably hardware) error, PID 163 failure
> > 
> >     Same cause?  Or something completely different?  As for the first
> > error, I'll see if my dealer has the tech books from Quantum (unless
> > someone knows Quantum's number in Taipei).
> 
> Something completely different, I think.

I agree, too much time between the disk failure and the vm_fault.

> You are saying there is an 8 hour delay between the hardware failure
> and the read complaint from the vnode pager?  It looks to be a bug
> that there was no read fail message immediately before the vnode
> message.
> 
> Peter


-- 
Rod Grimes                                      rgrimes@gndrsh.aac.dev.com
Accurate Automation Company                   Custom computers for FreeBSD


