Date:      Fri, 14 Dec 2001 16:43:19 -0800 (PST)
From:      John Polstra <jdp@polstra.com>
To:        hardware@freebsd.org
Subject:   Question about a strange hardware problem
Message-ID:  <XFMail.011214164319.jdp@polstra.com>

I've got an intermittent hardware problem on one of the CVSup mirror
sites, and I would appreciate some experienced opinions about whether
it's likely to be in the SCSI controller, the SCSI cable, the hard
drive, or elsewhere.  The symptom is that the checkouts.cvs file
which maintains state between CVSup updates occasionally gets 1-bit
errors at random places in it.  I haven't seen any similar errors in
the actual content on the mirror; but it is on a different drive,
its access patterns are different, and errors there might be less
noticeable.  The errors in checkouts.cvs cause updates to break until
I intervene manually, so I notice those pretty quickly.
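
A minimal sketch of how such 1-bit errors could be located, assuming a
known-good copy of the file is available to compare against (for
example, one regenerated after manual intervention).  The function
name find_bit_flips is hypothetical and is not part of CVSup.

```python
def find_bit_flips(good: bytes, bad: bytes):
    """Return a list of (byte_offset, bit_index) for every differing bit.

    bit_index 0 is the least significant bit of the byte.
    """
    if len(good) != len(bad):
        # A length change is not a pure bit-flip corruption.
        raise ValueError("inputs differ in length")
    flips = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b          # set bits mark the positions that differ
        bit = 0
        while diff:
            if diff & 1:
                flips.append((offset, bit))
            diff >>= 1
            bit += 1
    return flips


if __name__ == "__main__":
    import sys

    # Usage (assumed): python bitflips.py good_copy corrupted_copy
    with open(sys.argv[1], "rb") as f:
        good_data = f.read()
    with open(sys.argv[2], "rb") as f:
        bad_data = f.read()
    for offset, bit in find_bit_flips(good_data, bad_data):
        print(f"byte offset {offset}, bit {bit}")
```

A single reported pair per incident would match the 1-bit symptom;
many flips in one byte or one region would point to a different kind
of corruption.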

The motherboard is an Asus P2B-LS board with an on-board Adaptec chip.
Here's the relevant part of the dmesg output:

ahc0: <Adaptec aic7890/91 Ultra2 SCSI adapter> port 0xd000-0xd0ff mem 0xe2000000-0xe2000fff irq 10 at device 6.0 on pci0
aic7890/91: Wide Channel A, SCSI Id=7, 32/255 SCBs
da1 at ahc0 bus 0 target 1 lun 0
da1: <IBM DNES-309170W SAH0> Fixed Direct Access SCSI-3 device 
da1: 80.000MB/s transfers (40.000MHz, offset 30, 16bit), Tagged Queueing Enabled
da1: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C)
da0 at ahc0 bus 0 target 0 lun 0
da0: <IBM DNES-309170W SA30> Fixed Direct Access SCSI-3 device 
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C)

The files with the 1-bit errors are on da0, which also has all of
the OS files.  The mirror content is on da1.

The OS version is FreeBSD-4.2-STABLE from around last January, plus
a few security patches.

The system has been up for 317 days.  If the problem were in the RAM
(the obvious place), I would expect it to have crashed by now.  I
can't remember whether it has ECC memory or not, and the system isn't
physically accessible to me.  If the errors were on the SCSI cable,
parity checking ought to detect them.  And if the media were bad, I
should be seeing some disk errors in the dmesg output.  But I have
never seen even one.  The errors don't show up in specific disk
blocks -- they appear to be at random places.
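
One way to check the "random places" observation is to bucket the
byte offsets of the observed errors by 512-byte sector and look for
repeats.  This is only a sketch: the offsets are file-relative, the
filesystem decides where each part of the file lands on disk, and the
file is rewritten between updates, so a repeat here is a hint rather
than proof of a bad block.  The function names are hypothetical.

```python
from collections import Counter

SECTOR_SIZE = 512  # matches the "512 byte sectors" in the dmesg output


def sectors_hit(offsets, sector_size=SECTOR_SIZE):
    """Count how many errors fall into each file-relative sector."""
    return Counter(offset // sector_size for offset in offsets)


def repeated_sectors(offsets, sector_size=SECTOR_SIZE):
    """Return the sorted sector indexes that saw more than one error."""
    counts = sectors_hit(offsets, sector_size)
    return sorted(sector for sector, n in counts.items() if n > 1)
```

An empty result from repeated_sectors over a log of error offsets
would be consistent with errors at random places; a short list of
recurring sectors would suggest looking at the media after all.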

Given all that, it seems to me that the problem must be in the drive
electronics of da0.  What do you think?  As a test, I've moved the
files that always show the errors over to da1 for a while, to see if
that fixes it.

John




