Date:      Wed, 14 Feb 2001 14:35:47 -0800 (PST)
From:      mjh@aciri.org
To:        freebsd-gnats-submit@FreeBSD.org
Subject:   kern/25104: file corruption with Adaptec 29160 SCSI adapter
Message-ID:  <200102142235.f1EMZlP79180@freefall.freebsd.org>


>Number:         25104
>Category:       kern
>Synopsis:       file corruption with Adaptec 29160 SCSI adapter
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Feb 14 14:40:01 PST 2001
>Closed-Date:
>Last-Modified:
>Originator:     Mark Handley
>Release:        4.2-RELEASE
>Organization:
ACIRI
>Environment:
gaur.aciri.org: uname -a
FreeBSD gaur.aciri.org 4.2-RELEASE FreeBSD 4.2-RELEASE #1: Sat Jan 20 20:49:54 PST 2001     root@gaur.aciri.org:/usr/src/sys/compile/ACIRI-4.2-USB  i386   
>Description:
I've got five 1.1GHz Athlon systems, running FreeBSD 4.2R with 512MB
RAM, Asus A7V motherboards, Adaptec 29160 U160 SCSI adaptors, and
SEAGATE ST318451LW 18GB drives.  I'm seeing file corruption when I
write large (approximately 512MB or larger) files, especially when I
write them rapidly.  I can't guarantee it doesn't happen with smaller
files, but I wrote a thousand 100MB files, and not one of them was
corrupted.

The problem is that the files get 64-byte chunks (usually 64 bytes,
sometimes smaller) of other data in the middle of them.  I first
noticed the problem with scp, but it also happens with moderate
repeatability when simply writing a big file rapidly by redirecting
stdout.


Here's the quick-hack test program:

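#include <stdio.h>
#define FSIZE (1000*1024*1024)   /* total bytes to write */
int main(void) {
  int i, j;
  int buf[1024];                 /* 4096-byte buffer (32-bit ints) */
  j = 0;
  for (i = 0; i < FSIZE/4; i++) {
    buf[j] = i;                  /* each 32-bit word holds its own index */
    if (j == 1023) {
      fwrite(buf, 1024, 4, stdout);   /* write out the full buffer */
      j = 0;
    } else {
      j++;
    }
  }
  return 0;
}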

Basically it's writing 1000MB to stdout, writing incrementing values
to each 32-bit word.  I direct stdout to a file.  The MD5 checksum of
the output file should be 1da068574fdb3e3b9ffc3b2022cca171, but
sometimes (somewhere between 1-in-3 and 1-in-10 tries) the file gets
corrupted.

The program to read this back is:

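#include <stdio.h>
#define FSIZE (1000*1024*1024)   /* expected file size in bytes */
int main(void) {
  int i;
  int j, prev = 0;
  int mode = 0;                  /* 0 = data good, 1 = inside a bad run */
  for (i = 0; i < FSIZE/4; i++) {
    if (fread(&j, 1, 4, stdin) != 4) {   /* stop on EOF or short read */
      printf("short read at word: %d\n", i);
      return 1;
    }
    if (mode == 0) {
      if (i != j) {              /* first bad word of a run */
        printf("-----------------------------\n");
        printf("problem start at word: %d\n", i);
        printf("got value %d instead of %d\n", j, i);
        mode = 1;
      }
    } else {
      if (i == j) {              /* first good word after a run */
        printf("-----------------------------\n");
        printf("last word of problem : %d\n", i-1);
        printf("got value %d instead of %d\n", prev, i-1);
        mode = 0;
      }
    }
    prev = j;
  }
  return 0;
}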

Here's one sample output, where there are two separate corruptions:

gaur.aciri.org: ./unfoo3 < t4
-----------------------------
problem start at word: 114561360
got value 909456435 instead of 114561360

got value 171522103 instead of 114561361
got value 875770417 instead of 114561362
got value 943142453 instead of 114561363
got value 842074681 instead of 114561364
got value 909456435 instead of 114561365
got value 171522103 instead of 114561366
got value 875770417 instead of 114561367
got value 943142453 instead of 114561368
got value 842074681 instead of 114561369
got value 909456435 instead of 114561370
got value 171522103 instead of 114561371
got value 875770417 instead of 114561372
got value 943142453 instead of 114561373
got value 842074681 instead of 114561374
got value 909456435 instead of 114561375
-----------------------------
last word of problem : 114561375
got value 909456435 instead of 114561375
-----------------------------
problem start at word: 237338864
got value 112460016 instead of 237338864
 
got value 112460017 instead of 237338865
got value 112460018 instead of 237338866
got value 112460019 instead of 237338867
got value 112460020 instead of 237338868
got value 112460021 instead of 237338869
got value 112460022 instead of 237338870
got value 112460023 instead of 237338871
got value 112460024 instead of 237338872
got value 112460025 instead of 237338873
got value 112460026 instead of 237338874
got value 112460027 instead of 237338875
got value 112460028 instead of 237338876
got value 112460029 instead of 237338877
got value 112460030 instead of 237338878
got value 112460031 instead of 237338879
-----------------------------
last word of problem : 237338879
got value 112460031 instead of 237338879

In this case, there are two corruptions.  The first corruption seems
to be some random chunk of data; the second (more typical) corruption
seems to be a copy of an earlier piece of the file.
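
One way to look more closely at the first corruption is to unpack the
reported word values into their individual bytes.  The following is a
minimal sketch, assuming a little-endian machine (as the i386 systems
here are); the helper name words_to_text is made up for illustration
and is not part of the test programs above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack n 32-bit words into 4*n bytes, least significant byte first
 * (the order in which a little-endian machine stores an int on disk).
 * Non-printable bytes are replaced by '.', so the result can be printed.
 * out must have room for 4*n + 1 bytes; it is NUL-terminated. */
static void words_to_text(const uint32_t *words, size_t n, char *out)
{
    assert(words != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        for (int b = 0; b < 4; b++) {
            unsigned char c = (unsigned char)((words[i] >> (8 * b)) & 0xff);
            out[4 * i + b] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
        }
    }
    out[4 * n] = '\0';
}
```

Applied to the first five values of the first corruption (909456435,
171522103, 875770417, 943142453, 842074681), this gives
"3456789.123456789.12", where each '.' stands for a newline byte (0x0a).
So that chunk looks like lines of ASCII digits rather than arbitrary
binary data, which would also explain why the values repeat every five
words.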

In most cases, the corruption seems to be of a 64-byte
chunk of the file replaced with some other data, typically (but not
always) an earlier chunk of the same file.  I've never seen more than
64 bytes corrupted, but on one of the machines I've seen
smaller corruptions.
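
The word indices printed by the read-back program can be turned into
byte offsets to check where in the file each corrupted chunk sits.
This is only a sketch of that arithmetic; the function names and the
two constants are illustrative, with the 64-byte chunk size taken from
the description above.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BYTES  4u   /* the test file is a sequence of 32-bit words */
#define CHUNK_BYTES 64u  /* corruption size reported above */

/* Byte offset in the file of the word with the given index. */
static uint64_t word_to_byte_offset(uint64_t word_index)
{
    return word_index * WORD_BYTES;
}

/* Return 1 if the run of words [first, last] is exactly one
 * CHUNK_BYTES-sized block that starts on a CHUNK_BYTES boundary,
 * and 0 otherwise. */
static int is_aligned_chunk(uint64_t first, uint64_t last)
{
    assert(last >= first);
    uint64_t start = word_to_byte_offset(first);
    uint64_t len = (last - first + 1) * WORD_BYTES;
    return start % CHUNK_BYTES == 0 && len == CHUNK_BYTES;
}
```

For the sample output above, the two corrupted runs start at byte
offsets 458245440 and 949355456.  Both are multiples of 64 and both
runs are 16 words long, so is_aligned_chunk() returns 1 for each.  The
source of the second run's data (words 112460016 to 112460031) is a
64-byte-aligned block as well.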


I originally thought this was a hardware problem, but I've reproduced
it on the three identical machines I've tried, so if it is a hardware
fault, it's in the whole batch.  I've also tried to reproduce it on
an additional 1GHz Athlon/A7V machine with an Adaptec 2940 SCSI
adaptor, but that machine doesn't suffer from the same problem, so
I'm beginning to suspect that an interaction between the Adaptec 29160
driver and the filesystem when writing large files is a possible
cause.

Here's the dmesg.boot from one of the problem machines in case it helps.

gaur.aciri.org: more /var/run/dmesg.boot
Copyright (c) 1992-2000 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 4.2-RELEASE #1: Sat Jan 20 20:49:54 PST 2001
    root@gaur.aciri.org:/usr/src/sys/compile/ACIRI-4.2-USB
Timecounter "i8254"  frequency 1193182 Hz
CPU: AMD Athlon(tm) Processor (1109.89-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x642  Stepping = 2
  Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
  AMD Features=0xc0440000<<b18>,AMIE,DSP,3DNow!>
real memory  = 536788992 (524208K bytes)
avail memory = 518864896 (506704K bytes)
Preloaded elf kernel "kernel" at 0xc03c8000.
Pentium Pro MTRR support enabled
md0: Malloc disk
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
pcib2: <PCI to PCI bridge (vendor=1106 device=8305)> at device 1.0 on pci0
pci1: <PCI bus> on pcib2
isab0: <VIA 82C686 PCI-ISA bridge> at device 4.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <VIA 82C686 ATA66 controller> port 0xd800-0xd80f at device 4.1 on pci0
ata1: at 0x170 irq 15 on atapci0
pci0: <VIA 83C572 USB controller> at 4.2 irq 12
pci0: <VIA 83C572 USB controller> at 4.3 irq 12
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0xa400-0xa43f mem 0xd6800000-0xd68fffff,0xd7000000-0xd7000fff irq 10 at device 11.0 on pci0
fxp0: Ethernet address 00:02:b3:10:b4:67
pci0: <3D Labs model 000a graphics accelerator> at 12.0 irq 11
ahc0: <Adaptec 29160 Ultra160 SCSI adapter> port 0xa000-0xa0ff mem 0xd5800000-0xd5800fff irq 12 at device 13.0 on pci0
aic7892: Wide Channel A, SCSI Id=7, 32/255 SCBs
atapci1: <Promise ATA100 controller> port 0x8400-0x843f,0x8800-0x8803,0x9000-0x9007,0x9400-0x9403,0x9800-0x9807 mem 0xd5000000-0xd501ffff irq 10 at device 17.0 on pci0
pcib1: <Host to PCI bridge> on motherboard
pci2: <PCI bus> on pcib1
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
DUMMYNET initialized (000608)
IP packet filtering initialized, divert disabled, rule-based forwarding disabled, default to deny, logging disabled
acd0: CDROM <SONY CDU4811> at ata1-master using PIO4
Waiting 5 seconds for SCSI devices to settle
Mounting root from ufs:/dev/da0s1a
da0 at ahc0 bus 0 target 0 lun 0
da0: <SEAGATE ST318451LW 0003> Fixed Direct Access SCSI-3 device
da0: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da0: 17501MB (35843671 512 byte sectors: 255H 63S/T 2231C)  
>How-To-Repeat:
Write several very large files rapidly (see above).   Some fraction 
of them will be corrupted (I see between 5% and 25% of 512MB files
get corrupted).
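
For repeated runs, the write and the check can be combined into two
helpers that are called in a loop over several file names.  This is a
rough sketch and not the program used for the results above: the names
write_pattern and count_bad_words are hypothetical, and it assumes the
read-back actually reaches the disk (a read served from the buffer
cache may not show on-disk corruption).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write nwords incrementing 32-bit words (0, 1, 2, ...) to path.
 * Returns 0 on success, -1 on any I/O error. */
static int write_pattern(const char *path, uint32_t nwords)
{
    assert(path != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    for (uint32_t i = 0; i < nwords; i++) {
        if (fwrite(&i, sizeof i, 1, f) != 1) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Read path back and count the words that differ from their index.
 * Returns the number of bad words, or -1 if the file cannot be
 * opened or holds fewer than nwords words. */
static long count_bad_words(const char *path, uint32_t nwords)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    long bad = 0;
    for (uint32_t i = 0; i < nwords; i++) {
        uint32_t v;
        if (fread(&v, sizeof v, 1, f) != 1) {
            fclose(f);
            return -1;
        }
        if (v != i)
            bad++;
    }
    fclose(f);
    return bad;
}
```

A 512MB file corresponds to nwords = 134217728.  A non-zero result from
count_bad_words() on a file that write_pattern() reported as written
successfully would indicate the kind of corruption described above.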
>Fix:


>Release-Note:
>Audit-Trail:
>Unformatted:

