FreeBSD Mail Archives

Date:      Fri, 3 Oct 97 21:55 CDT
From:      uhclem.bsd@nemesis.lonestar.org
To:        FreeBSD-gnats-submit@FreeBSD.ORG
Subject:   kern/4686: SCSI driver gradually remaps entire drive/false read errors? - FDIV073
Message-ID:  <m0xHKND-000twaC@nemesis.lonestar.org>
Resent-Message-ID: <199710040640.XAA01382@hub.freebsd.org>



>Number:         4686
>Category:       kern
>Synopsis:       SCSI driver gradually remaps entire drive/false read errors? - FDIV073
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Oct  3 23:40:00 PDT 1997
>Last-Modified:
>Originator:     Frank Durda IV
>Organization:
None
>Release:        FreeBSD 2.1-STABLE i386
>Environment:

All entries Oct  3	("/kernel" removed from lines for readability)
17:05:11 cabal1: FreeBSD 2.2.5-971003-BETA #0: Fri 16:51:38 CDT 1997
17:05:11 cabal1:     uhclem@handsoff:/usr/src/sys.new/sys/compile/CABAL1
17:05:11 cabal1: CPU: Pentium (132.96-MHz 586-class CPU)
17:05:11 cabal1:   Origin = "GenuineIntel"  Id = 0x52c  Stepping=12
17:05:11 cabal1:   Features=0x1bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8>
17:05:11 cabal1: real memory  = 134217728 (131072K bytes)
17:05:11 cabal1: avail memory = 129269760 (126240K bytes)
17:05:11 cabal1: ahc0: aic7880 Single Channel, SCSI Id=7, 16 SCBs
17:05:11 cabal1: (ahc0:0:0): "QUANTUM FIREBALL ST4.3S 0F0C" type 0 fixed SCSI 2
17:05:11 cabal1: sd0(ahc0:0:0): Direct-Access 4136MB (8471232 512 byte sectors)
17:05:11 cabal1: (ahc0:3:0): "SEAGATE ST19171N 0017" type 0 fixed SCSI 2
17:05:11 cabal1: sd1(ahc0:3:0): Direct-Access 8683MB (17783112 512 byte sectors)

>Description:

On four separate but identically configured systems (shown above), fitted
with different makes of drives, the SCSI driver gets into a mode where it
apparently starts remapping bad sectors because of real or imagined media
problems.

Once in a while, the driver gets into a state where it substitutes one
block after another as fast as it can.  At other times, it goes anywhere
from two minutes to two hours after the first remapping "event" before it
does another.  From that point on until reboot, each disk problem results
in an apparent media reassignment, until there are no spare blocks left.

Using a logic analyzer to monitor the SCSI bus, we found that a SCSI bus
RESET (actually two) is performed immediately before each message is
displayed.  This happens even while multiple drives are active, which seems
dangerous.

Later, when the system is shut down and the Adaptec media scan is run, it
finds no errors, OR the errors it does find are nowhere near the block
numbers supposedly reassigned.

We modified a kernel from a month ago (after the last big batch of changes
to the AIC drivers) to simply panic rather than reassign a block, so that we
could perform a media scan using the Adaptec diags at that moment.
Again, the diags found no flaws, or found some far away from the block number
reported in the driver error message.

It may be possible that an errant SCSI bus RESET is resulting in other
parts of the driver thinking the media is flawed.

Another curiosity is that the errors are almost always reported on
drives 1 thru 3 in a four drive configuration (usually the highest
drive number), or drive 1 in a two drive configuration, regardless of
which physical media is placed on that drive select and cable position.
This makes us tend to disbelieve the device being blamed even more.
Because of the application, 99.5% of disk I/O is to drives other than 0.

The media is Quantum 4.3GB Fireballs with high velocity fans blowing
directly on their circuit boards (we found that they start mis-handling
SCSI commands otherwise), or 9GB Seagate Barracudas.  We have cycled through
16 different brand new Quantums during this process and two Seagates.
We have used four different 2940 or 2940U SCSI adapters, different
cables, power supplies, motherboards (all Intel), and memory.  Each
type of drive and every drive select has had the opportunity to be the
last drive on the cable and be terminated.  We have even used discrete
termination blocks just to eliminate that.  A 250MHz scope and a logic
analyzer were used to check for ringing or glitches on the SCSI bus.

Note that several of the Quantums became unusable during use in this
application, so much so that the Adaptec BIOS could not mark blocks bad
(you get the Media-Check red screen) and the BIOS could not perform a
low-level format (the command would not even start).  These are now
collecting dust in the corner.

We finally tried using a 1542CP SCSI controller with some of the same
drives that had been reporting recoverable read errors on the 2940.
The errors stopped during the period when the 1542CP was in use,
suggesting that not all of these errors are real and that some are
actually a problem with the 2940 driver/sequencer.  Unfortunately, the
1542 ISA controller is too slow for the application and we really need
the PCI speed of the 2940.  Note that running in or out of Ultra mode made
no difference to the error rate.  We even tried setting all drives to
Async mode on the 2940.  We still got errors.

Note that a disk-exercising application running on a DECstation under
NetBSD with the same drives doesn't encounter any media problems either.
It performs random writes totaling 10X the system cache size, followed by
reads and compares.
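
For anyone wanting to reproduce that kind of check, the write/read/compare
cycle can be sketched roughly as follows.  This is not the program that was
run on the DECstation; the function names, the seed handling and the
shuffled write order are assumptions made for illustration only.

```python
import os
import random

BLOCK_SIZE = 512  # sector size reported in the probe messages


def pattern_for(block_no, seed):
    """Return a deterministic pseudo-random payload for one block."""
    rng = random.Random(seed * 1_000_003 + block_no)
    return bytes(rng.getrandbits(8) for _ in range(BLOCK_SIZE))


def write_pass(path, total_blocks, seed=0):
    """Write every block once, in shuffled order, and force it to disk."""
    order = list(range(total_blocks))
    random.Random(seed).shuffle(order)
    with open(path, "r+b") as f:
        for block_no in order:
            f.seek(block_no * BLOCK_SIZE)
            f.write(pattern_for(block_no, seed))
        f.flush()
        os.fsync(f.fileno())


def verify_pass(path, total_blocks, seed=0):
    """Read every block back and return the block numbers that differ."""
    bad = []
    with open(path, "rb") as f:
        for block_no in range(total_blocks):
            f.seek(block_no * BLOCK_SIZE)
            if f.read(BLOCK_SIZE) != pattern_for(block_no, seed):
                bad.append(block_no)
    return bad
```

The target must already exist and be at least total_blocks * BLOCK_SIZE
bytes long.  For the read pass to hit the media rather than the cache, the
amount written has to be well beyond the cache size (10X in the test
described above).  Pointing this at a raw device destroys its contents.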

>How-To-Repeat:

The application used to kill these drives on the 2940 was Diablo, with
drive 0 being the / and /usr partitions and three striped (ccd driver)
4.3GB Quantum drives or one 9GB Seagate for the /news partition.  Failures
would begin in as little as two minutes once articles started coming in:

17:05:11 cabal1: FreeBSD 2.2.5-971003-BETA #0: Fri Oct  3 16:51:38 CDT 1997
17:37:40 cabal1 login: ROOT LOGIN (root) ON ttyv0
(Diablo application started here)
18:42:19 cabal1: sd1(ahc0:3:0): RECOVERED ERROR info:0x468f80 asc:17,1 Recovered data with retries field replaceable unit: ea sks:80,1
18:42:19 cabal1: , retries:4
18:48:36 cabal1: sd1(ahc0:3:0): RECOVERED ERROR info:0x8d7c80 asc:17,1 Recovered data with retries field replaceable unit: ea sks:80,1
18:48:36 cabal1: , retries:4
19:36:25 cabal1: sd1(ahc0:3:0): RECOVERED ERROR info:0x8febcc asc:17,1 Recovered data with retries field replaceable unit: d2 sks:80,1
19:36:25 cabal1: , retries:4
19:44:33 cabal1: sd1(ahc0:3:0): RECOVERED ERROR info:0x4b8a77 asc:18,1
19:44:33 cabal1: sd1(ahc0:3:0):  Recovered data with error correction & retries applied field replaceable unit: ea sks:80,1
19:44:33 cabal1: , retries:4
20:42:36 cabal1: sd1(ahc0:3:0): RECOVERED ERROR info:0x70ab40 asc:10,0 Id CRC or ECC error field replaceable unit: f1 sks:80,1
20:42:36 cabal1: , retries:4
20:56:03 cabal1: sd1(ahc0:3:0): RECOVERED ERROR info:0x52cc84 asc:17,1 Recovered data with retries field replaceable unit: ea sks:80,1
20:56:03 cabal1: , retries:4

It's 21:15 now.  Based on previous tests, drive 1 will accumulate enough
of these messages to be reported unusable in roughly 24 hours, at least
until I reboot and/or low-level format it.
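
To compare the reported block numbers against a later media scan, the
messages above can be tallied with a small script along these lines.  It is
only a sketch: it assumes the info: field is the logical block address of
the error, that the asc: values are printed in hex, and that the sector
count in the sd1 probe message is the limit to check against.  The names
are made up for illustration.

```python
import re
from collections import Counter

SD1_SECTORS = 17783112  # sd1 capacity from the probe message

# Texts exactly as they appear in the kernel messages quoted above.
ASC_TEXT = {
    (0x10, 0x00): "Id CRC or ECC error",
    (0x17, 0x01): "Recovered data with retries",
    (0x18, 0x01): "Recovered data with error correction & retries applied",
}

EVENT = re.compile(
    r"(\d\d:\d\d:\d\d) \S+ (sd\d+)\([^)]*\): RECOVERED ERROR "
    r"info:0x([0-9a-f]+) asc:([0-9a-f]+),([0-9a-f]+)"
)


def parse_events(lines):
    """Return (time, device, block, (asc, ascq)) for each error message."""
    events = []
    for line in lines:
        m = EVENT.search(line)
        if m:
            time, dev, info, asc, ascq = m.groups()
            code = (int(asc, 16), int(ascq, 16))
            events.append((time, dev, int(info, 16), code))
    return events


def summarize(events, sectors=SD1_SECTORS):
    """Count events per sense code and list any block beyond the drive."""
    by_code = Counter(code for _, _, _, code in events)
    out_of_range = [e for e in events if e[2] >= sectors]
    return by_code, out_of_range


def describe(code):
    """Map an (asc, ascq) pair to the text seen in the log, if known."""
    return ASC_TEXT.get(code, "unknown")
```

Feeding it the syslog lines gives a list of block numbers that can be set
next to whatever the Adaptec media scan reports for the same drive.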

Clearly this problem isn't easy to run into or there would be a lot more
people reporting it.  I gave this PR a lower priority because of this, 
even though for this application, this is a killer problem.

>Fix:
	
Not known.

>Audit-Trail:
>Unformatted:

