Date:      Tue, 13 Oct 1998 00:59:05 -0600
From:      "Justin T. Gibbs" <gibbs@plutotech.com>
To:        Terry Lambert <tlambert@primenet.com>
Cc:        gibbs@plutotech.com (Justin T. Gibbs), Don.Lewis@tsc.tdk.com, julian@whistle.com, freebsd-fs@FreeBSD.ORG, freebsd-scsi@FreeBSD.ORG
Subject:   Re: filesystem safety and SCSI disk write caching 
Message-ID:  <199810130705.BAA12205@pluto.plutotech.com>
In-Reply-To: Your message of "Mon, 12 Oct 1998 22:58:15 -0000." <199810122258.PAA11377@usr02.primenet.com> 

>> >} 2) Use a drive with non-bogus firmware.  Recent Seagate and IBM
>> >} drives should work just fine.  I haven't validated any Quantum
>> >} drives in this regard yet.
>> >
>> >But how can I tell if the firmware is non-bogus?
>> 
>> Ask Terry since he has stated that he 'doesn't have any drives with
>> non-bogus firmware'.
>
>A)	Run soft updates
>B)	Press "reset" occasionally
>C)	Note any anomalies in the resulting fsck when the machine
>	comes back up
>D)	if count < 200, goto B
>E)	if # of anomalies > 0, print "bad firmware".

You're missing a large step here.  You can't prove that the 'anomaly'
is related to the drive firmware without a trace of all transactions
on the SCSI bus.  It could well be a missing dependency in the soft
update code.  I'd be more than happy to reproduce your failure scenario
while recording a SCSI bus trace so that the fault is easy to interpret.
Just send me any *modern* drive that you think fails.
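
The quoted procedure (steps A through E) and the objection to it can be
sketched roughly as below. This is only an illustration: `reset_and_fsck`
is a hypothetical callable standing in for "press reset, reboot, run fsck,
count the anomalies", and `have_bus_trace` is an assumed flag, not part of
any real tool. The point it shows is that a nonzero anomaly count alone
does not say whether the firmware or the soft update code is at fault.

```python
from typing import Callable, List


def run_reset_trials(reset_and_fsck: Callable[[], int],
                     trials: int = 200) -> List[int]:
    """Steps B through D: repeat the reset and record fsck anomaly counts.

    `reset_and_fsck` is a hypothetical hook; it should reset the machine,
    wait for it to come back up, and return the number of anomalies that
    fsck reported.
    """
    return [reset_and_fsck() for _ in range(trials)]


def verdict(anomalies: List[int], have_bus_trace: bool) -> str:
    """Step E, amended: do not blame the firmware without a bus trace."""
    if sum(anomalies) == 0:
        return "no anomalies observed"
    if not have_bus_trace:
        # The quoted step E would print "bad firmware" here. Without a
        # trace of the SCSI transactions, a missing dependency in the
        # soft update code is an equally valid explanation.
        return "inconclusive: firmware or soft updates"
    return "anomalies observed: compare against bus trace"
```

With a recorded trace, each anomaly can be checked against what the drive
actually acknowledged before the reset, which is what makes the fault
attributable.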

You should also ensure that your reset button does not cause any power
spikes on the drive power lines.  That would be cheating.

>It's very hard to do this in software, without providing a mechanism
>to actually break into the latency link between the drive reporting
>a write cached operation has been written, and the actual writing.

If you can cause this failure to occur by hitting your reset button, I
should be able to cause it to occur by using a paper-clip if the reset
condition (caused by the SCSI card BIOS in the reset button case) is the
event that causes cache corruption.  Both are non-deterministic methods of
error injection.

>Such a latency link only exists on drives which Justin has identified
>as having broken firmware due to the behaviour reported by Don Lewis.

I'm still unclear as to whether Don was turning off power or hitting what I
consider the reset button.  His comment about UPS use makes me think he
was testing power outage scenarios.

>I would be much more interested in knowing what drives and firmware
>revisions of those drives Justin has, since both mine and Don Lewis's
>are demonstrably broken.

Since you were able to test 4 drives so quickly, I'd love to see
well-documented information on exactly how the file system was
inconsistent in the failure cases.

--
Justin


