Date:      Mon, 01 Mar 1999 14:41:58 -0500
From:      Andrew Heybey <ath@niksun.com>
To:        Mike Smith <mike@smith.net.au>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Advice wanted on tracking down bug (or hw problem?) in 3.1R 
Message-ID:  <199903011941.OAA28487@stiegl.niksun.com>
In-Reply-To: Your message of Fri, 26 Feb 1999 12:22:25 -0800. <199902262022.MAA09175@dingo.cdrom.com> 


>> >>On Fri, 26 Feb 1999 09:52:33 -0800, Mike Smith <mike@smith.net.au> said:
>>   >> I have just submitted PR kern/10243, but I thought I would ask
>>   >> for some advice on hackers as well.
>>   >> 
>>   >> The bug is that under certain loads, read(2) can return corrupted
>>   >> data (ie data that are not in the file on disk).  The instances I
>>   >> have seen are relatively small amounts (8-64 bytes) of corrupt
>>   >> data at the end of a 4k page.  The corrupt data is from a file
>>   >> previously read or another position in the current file.  I have
>>   >> also seen this problem in 3.0-RELEASE but not in 2.2.8-RELEASE.
>> 
>>   mike> Can you look at the corrupt data and see if you can identify
>>   mike> it?  In particular, look for objects that look like IP
>>   mike> addresses, MAC addresses, pointers into kernel space, ascii
>>   mike> text, etc.  This is usually the best way to work out where the
>>   mike> data is coming from.
>> 
>> The data is always (in every instance that I have examined) from some
>> other part of the file currently being read or some other file in my
>> set of test files.  How my test setup works is that I have 30 50MB
>> files.  The files are filled with sequential integers (counting over
>> the entire 1.5GB).  My test program reads from the files (in order,
>> starting over at file #0 when it reaches file #29) and compares what
>> read(2) returns to what should be there (based on file number and file
>> offset).
>> 
>> One other possible clue: This morning I hooked my disks up to the
>> regular Ultra SCSI (40MB/s) port of the 7890 controller rather than
>> the Ultra/2 (80MB/s) port and I haven't seen the bug yet.  I am not
>> 100% positive since I have only run it for a few hours so far, but
>> before I could almost always make the bug happen within 10-15
>> minutes.
>
>Could you try bzero'ing your buffers before every read?  This sniffs 
>very much like short transfers rather than sniping...
>
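
For reference, the check discussed above (zero the buffer, read, compare
against the expected sequence) can be sketched roughly as follows.  This
is not the actual test program.  It assumes the files hold 32-bit
integers in host byte order, numbered from 0 across the whole file set,
and that every file is exactly 50MB; the names expected_value() and
read_and_check() are made up for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumptions, not taken from the real test program: */
#define FILE_SIZE  (50L * 1024 * 1024)   /* each test file is 50MB */
#define BLOCK_SIZE 4096                  /* read one 4k page at a time */

/* Value expected at byte offset `off` of file number `file_no`. */
static uint32_t
expected_value(int file_no, off_t off)
{
	uint64_t global = (uint64_t)file_no * FILE_SIZE + (uint64_t)off;

	return (uint32_t)(global / sizeof(uint32_t));
}

/*
 * Read one block from the current position of fd (the caller passes the
 * matching file offset in `off`) and compare it to the expected values.
 * The buffer is cleared first, so bytes that the read never filled in
 * show up as zeros rather than as leftovers from an earlier read.
 * Returns whatever read(2) returned; *bad_off is the file offset of the
 * first mismatching integer, or -1 if the block checked out.
 */
static ssize_t
read_and_check(int fd, int file_no, off_t off, off_t *bad_off)
{
	uint32_t buf[BLOCK_SIZE / sizeof(uint32_t)];
	size_t i, words;
	ssize_t n;

	*bad_off = -1;
	memset(buf, 0, sizeof(buf));          /* same effect as bzero(3) */
	n = read(fd, buf, sizeof(buf));
	if (n <= 0)
		return n;                     /* EOF or error */

	words = (size_t)n / sizeof(uint32_t);
	for (i = 0; i < words; i++) {
		off_t pos = off + (off_t)(i * sizeof(uint32_t));

		if (buf[i] != expected_value(file_no, pos)) {
			*bad_off = pos;
			break;
		}
	}
	return n;
}
```

With the buffer cleared, a mismatch that reads back as zeros would
suggest the data never arrived in the user buffer, while a mismatch
containing valid-looking sequence numbers from elsewhere would suggest
the wrong data was delivered.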

More information:

I ran a test where I stopped all activity on the system as soon as the
first test program observed the bug.  That is, I stopped the other
programs reading the disk and turned off the packet generator that had
been raising the network load.  Then I read the file with the garbage
data again, and it still contained the same garbage at the same offset.
If I do enough disk I/O to flush it from the cache and then read it
again, it is fine.
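
The re-read check can be sketched as below.  It is only an
illustration: reread_matches() is a made-up name and the 4k block size
is an assumption.  It reads the same block twice on an otherwise idle
system and reports whether both reads returned identical bytes.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096   /* assumed: one 4k page per read */

/*
 * Read the block at byte offset `off` twice and compare the results.
 * Returns 1 if both reads returned the same bytes, 0 if they differ,
 * and -1 on a seek error, read error, or short read.
 */
static int
reread_matches(int fd, off_t off)
{
	unsigned char a[BLOCK_SIZE], b[BLOCK_SIZE];
	unsigned char *bufs[2];
	int pass;

	bufs[0] = a;
	bufs[1] = b;
	for (pass = 0; pass < 2; pass++) {
		memset(bufs[pass], 0, BLOCK_SIZE);
		if (lseek(fd, off, SEEK_SET) == (off_t)-1)
			return -1;
		if (read(fd, bufs[pass], BLOCK_SIZE) != BLOCK_SIZE)
			return -1;
	}
	return memcmp(a, b, BLOCK_SIZE) == 0;
}
```

If the block is known to be bad and this returns 1, the bad data is
being served consistently (presumably from the cache) and was not
produced by the read call itself.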

This behavior seems to confirm that it isn't a race condition (because
then I would expect the subsequent read of the file to return the
correct data).  Rather, it seems that the buffer cache has become
corrupted because of a short DMA.  Any other suggestions?  Would this
more likely be a driver bug or a hw bug?

It still seems to be the case that I cannot duplicate the bug with the 
disk connected to the 40MB/sec SCSI bus.

andrew

