Date:      Sat, 03 Aug 1996 15:38:03 +0200
From:      "Julian H. Stacey" <jhs@freebsd.org>
To:        grog@lemis.de (Greg Lehey)
Cc:        scsi@freebsd.org, fabio@cesar.unicamp.br, fty@mcnc.org, gcrutchr@nightflight.com, j@uriah.heep.sax.de, jc@irbs.com, julian@freebsd.org, kuku@gilberto.physik.rwth-aachen.de, mrm@Sceard.com, nikm@ixa.net, tomppa@fidata.fi, wilko@yedi.iaf.nl, Scott Kelly <scott@relay.forest.com>
Subject:   Re: 8 * 0xFF bytes at intermittent multiples of 0x1000 
Message-ID:  <199608031338.PAA01488@vector.jhs.no_domain>
In-Reply-To: Your message of "Sun, 14 Jul 1996 16:23:13 +0200." <199607141423.QAA22112@allegro.lemis.de>

Hi, Reference:
> From: grog@lemis.de (Greg Lehey) 
> Date: Sun, 14 Jul 1996 16:23:13 +0200 (MET DST) 
>
> In early June 1996, Julian H. Stacey wrote:
> >
> > To scsi@freebsd.org
> > Cc Adaptec 1542A SCSI Adapter People, Julian Elischer.
> >
> > [	I last posted to +1542A owners + bugs@ ,
> > 	but scsi@ now seems more appropriate than bugs@.
> > 	I & some other 1542A people are most probably not on scsi@ list,
> > 	so please be careful if trimming CC line.
> > ]
> >
> > I (Julian Stacey <jhs@freebsd.org>) did a load more hardware changes & tests,
> > including swapping my Adaptec 1542A for a 1542B, & swapping sd0 & sd1,
> > & eventually deduced it was not my 1542A that was mis-behaving,
> > 	(returning 8 * 0xFF bytes at intermittent multiples of 0x1000),
> > but was one of 2 HP 97548S SCSI 1 633MB disks.
> >
> > Either the disk is faulty, or maybe the scsi code might not be
> > allowing for some strange sequence, or some such.
> >
> > __HOWEVER__
> > We can't dismiss it as an isolated equipment fault, as
> > 	- tomppa@fidata.fi detects similar data corruptions,
> > 	- scott@relay.forest.com seems to be having similar problems,
> > 	  but with a 1542B,
> > 	- perhaps other people are suffering similar corruption
> > 	  without realising it.
> >
> > Partial Conclusion:
> > 	1542A people can `relax',  to the extent that 1542B seems to be
> > 	able to trigger the fault too (I don't have a 1542C or 2940 etc)
> 
> I've just run into this same problem, but I can't confirm your
> findings.

I wasn't clear which findings you can't confirm, so I read ahead,
& conclude you mean you can't confirm my disc hardware error suspicion;
I conclude you suspect software error, like I used to ?

> I'm putting together a machine out of old junk parts.
>
> Currently it has a 486/66 with 16 MB and two full-height 5 1/4"
> drives:
> 
> (aha0:0:0): "CDC 94161-9 6226" type 0 fixed SCSI 1
> sd0(aha0:0:0): Direct-Access 148MB (304605 512 byte sectors)
> (aha0:1:0): "CDC 94171-9 5836" type 0 fixed SCSI 1
> sd1(aha0:1:0): Direct-Access 308MB (631017 512 byte sectors)
> 
> Although these drives both claim to be CDC, the second one has a
> Seagate label on it.

My good drive is:
	"HP 97548S 8928" type 0 fixed SCSI 1
	Direct-Access 633MB (1296512 512 byte sectors)
My flaky drive is:
	"HP 97548S C023" type 0 fixed SCSI 1
	Direct-Access 633MB (1296512 512 byte sectors)
(`good` & `flaky` being independent of 1542A or 1542B,
  also independent of sd0 & sd1 physical allocation,
  also independent of whether running 2.0.5 Rel or 2.1.0 Rel )


> I installed 2.1-RELEASE on the machine from CD-ROM, and immediately
> after booting lots of programs SIGSEGVed.  I compared them with the
> original and found almost exactly the same symptoms you describe:
> here's the result of comparing /usr/bin at a later time:
> 
> /usr/bin/cu bin/cu differ: char 40961, line 131
> /usr/bin/uucp bin/uucp differ: char 32769, line 97
> /usr/bin/uupick bin/uupick differ: char 32769, line 102
> /usr/bin/uustat bin/uustat differ: char 32769, line 111
> /usr/bin/as bin/as differ: char 81921, line 185
> /usr/bin/awk bin/awk differ: char 32769, line 83
> /usr/bin/bc bin/bc differ: char 32769, line 134
> /usr/bin/cvs bin/cvs differ: char 212993, line 725
> /usr/bin/gdb bin/gdb differ: char 475137, line 5209
> /usr/bin/grep bin/grep differ: char 32771, line 107
> /usr/bin/egrep bin/egrep differ: char 32771, line 107
> /usr/bin/fgrep bin/fgrep differ: char 32771, line 107
> (many more)
> 
> It's interesting to note how many come immediately after the first 32
> KB.  In the cases I looked at, a number of bytes had been replaced by
> 0xff; the total size of the executable didn't change.  In most other
> cases, too, the corruption was at or immediately after the beginning
> of a memory page.

Ah ! new perspective :-) I'd been thinking only in terms of disc PCB ICs,
& the size of the buffer chips on the disc's card.
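
To check that reading against the cmp output above: cmp numbers bytes
from 1, so "char 32769" is file offset 32768 = 0x8000, exactly on a
0x1000 boundary, and "char 32771" is 2 bytes past one. A minimal sketch
of the arithmetic (the function names and BLKSZ are made up here for
illustration, and 0x1000 is assumed to be the relevant block size):

```c
#define BLKSZ 0x1000UL	/* assumed block size: 4 KB, as in the Subject line */

/* cmp(1) counts bytes from 1; convert to a 0-based file offset. */
static unsigned long
cmp_to_offset(unsigned long cmp_char)
{
	return cmp_char - 1;
}

/* Number of bytes past the start of the enclosing 0x1000 block. */
static unsigned long
offset_in_block(unsigned long cmp_char)
{
	return cmp_to_offset(cmp_char) % BLKSZ;
}
```

Fed the cmp figures quoted above, offset_in_block() gives 0 for cu, uucp,
as, cvs & gdb, and 2 for grep/egrep/fgrep, which fits "at or immediately
after the beginning of a memory page".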


> Another point: I've only seen this corruption on the second disk.

Yes that's what I first saw, but then, observations changed,
can't explain that !

> Considering that they're almost identical, that's interesting.  I
> don't know how to explain it, except that maybe it's a coincidence.
> 
> The big difference from your experience is that I replaced the 1542A
> with a 1542B, and the problems completely disappeared.  Let's look at
> the other responders:
> 
> >> Date: Tue, 11 Jun 1996 16:56:50 -0400
> >> From: Scott Kelly <scott@relay.forest.com>
> >> To: jhs@freebsd.org
> >> Subject: Adaptec 1542A Users (from 12 Apr 1996)
> >>
> >>
> >> I seem to be having similar problems, but with a 1542B... Do you know if there
> >> has been a driver update since April?
> 
> Are you sure that these are the exact problems?  What other hardware
> are you running?
> 
> > For reference, I'll append parts of my <jhs> last mail:
> >> Tomi Vainio <tomppa@fidata.fi>
> >> Has confirmed he sees the same Adaptec 1542A SCSI adapter bug that I do.
> >>
> >> > I connected sd1 to my 1542A and here are results:
> >> >
> >> > 1. No problems if testblock is only one that generates disk activity.
> >> > 2. I launched couple find processes to sd0 and at same time I
> >> >    run testblock. Testblock failed only 1/10 of test runs.
> >> > 3. I copied files with cp to sd1 when running testblock on
> >> >    sd1. Testblock failed on every time.
> 
> Yes, I had a vague feeling that it was related to the amount of disk
> activity.
> 
> 
> >> So it looks like a generic bug in FreeBSD code:
> >> 	With a 1542A (& not a 1542B, which seems OK),
> >> 	In simultaneous multiple task write mode to sd1 (or 2 or 3 or 4),
> >> 	At random multiples of 0x1000 bytes,
> >> 	The first 8 bytes of a block get forced to 0xFF.
> >> (Of course it may well be that FreeBSD code is not `in error' but merely
> >> doesn't allow for some wart in the 1542A, that's fixed in the 1542B,
> >> but whatever, we need a fix).
> >
> > As above in this mail, I think I'm wrong there, it's not 1542A specific,
> > I get it with 2 different 1542B's as well
> 
> Do you have 1542Bs with which you don't get it?

No, I only have 2 1542Bs & 1 A, all show error on same drive.


> When I get a bit of time, I intend to install BSD/OS on the same
> configuration and see if it has the same problems.

Let us know your further deductions from that please :-)

> Greg

I used to feel I had found a bug in the driver, but now tend to view
my problem here as a bad disc, but it's worrying when I hear you observe
the same things I do, & others see similar things too !
I have NetBSD src/ here (but no bins & no BSD/OS), but not much time, & anyway
I seem to recall Julian Elischer wrote scsi for both Net & Free, so if
Free & Net have presumably similar scsi code,
it'd be a less meaningful test than you trying BSD/OS on your system.

Anyone else who even just suspects misbehaving discs,
is welcome to a copy of my testblock.c & .man
(it runs in user not root mode, & won't destroy your file systems & data :-)
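
For anyone who just wants a feel for the kind of check involved, here is
a minimal sketch. It is NOT testblock.c; the names BLKSZ, NBAD,
ff_run_at() & count_ff_blocks() are invented for illustration, and it
assumes the data read back should never legitimately start a 0x1000
block with 8 bytes of 0xFF:

```c
#include <stddef.h>
#include <stdio.h>

#define BLKSZ	0x1000	/* corruption seen at multiples of this (assumed) */
#define NBAD	8	/* number of bytes forced to 0xFF */

/* Return 1 if the NBAD bytes starting at buf[off] are all 0xFF, else 0. */
static int
ff_run_at(const unsigned char *buf, size_t len, size_t off)
{
	size_t i;

	if (off > len || len - off < NBAD)
		return 0;		/* not enough bytes left to test */
	for (i = 0; i < NBAD; i++)
		if (buf[off + i] != 0xFF)
			return 0;
	return 1;
}

/*
 * Walk buf in BLKSZ steps, report each block that starts with NBAD
 * bytes of 0xFF, and return how many such blocks were found.
 */
static size_t
count_ff_blocks(const unsigned char *buf, size_t len)
{
	size_t off, hits = 0;

	for (off = 0; off < len; off += BLKSZ)
		if (ff_run_at(buf, len, off)) {
			printf("0xFF run at offset 0x%lx\n",
			    (unsigned long)off);
			hits++;
		}
	return hits;
}
```

The idea would be to write a known pattern to an ordinary file on the
suspect disc while other processes keep the disc busy, read it back into
a buffer, & pass that buffer to count_ff_blocks(); any non-zero count
points at the symptom discussed in this thread.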

Julian
--
Julian H. Stacey	jhs@freebsd.org  	http://www.freebsd.org/~jhs/


