Date:      Fri, 23 Oct 1998 18:31:10 -0700 (PDT)
From:      Chris Timmons <skynyrd@opus.cts.cwu.edu>
To:        freebsd-scsi@FreeBSD.ORG
Subject:   Thrashing CAM on SMP
Message-ID:  <Pine.BSF.3.96.981023181248.28551A-100000@opus.cts.cwu.edu>

I recently tried to reproduce the problems Mark Murray is having with CAM
and SMP (a panic with X running and lots of filesystem activity).  I
couldn't get a panic, but the machine did wedge with recurring,
non-recoverable device timeouts on the system and swap disks.  The machine
is a server and doesn't have a workstation video card.  Of course, I
forgot BREAK_TO_DEBUGGER, so I couldn't get a dump.

Using an SMP -CURRENT from just before the 3.0 release, I set up three
256MB bonnies on different spindles, an md5 of a 280MB file, and a
'make -j 12 buildworld', all in loops to repeat over and over.  The
buildworld loop also unmounted, newfs-ed and remounted /usr/obj after each
pass.  The machine is a dual-PII 266 Tyan Tiger.

The system lasted for a couple of days with a load average between 5 and
12.  The activity lights on the three bonnie drives were almost always
solid green and the box sounded like a popcorn popper.

<IBM DDRS-34560W S71D>             at scbus0 target 0 lun 0 (pass0,da0)
<IBM DDRS-34560W S71D>             at scbus0 target 1 lun 0 (pass1,da1)
<SEAGATE ST34572W 0718>            at scbus1 target 0 lun 0 (pass2,da2)
<SEAGATE ST34572W 0784>            at scbus1 target 1 lun 0 (pass3,da3)
<QUANTUM XP34550W LXY4>            at scbus1 target 4 lun 0 (pass4,da4)

While it was alive, the bonnies were running on da2, da3, and da4.  The
only trouble I had was device timeouts on the firmware-buggy Atlas-II
(da4), and an occasional hiccup on the Seagates.  I'm using 40MHz xfer
rates and Adaptec cables with the active terminators, with drive
termination off.

midtest3:/root#> grep BDR /var/log/messages
Oct 21 04:16:13 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 05:27:29 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 14:44:18 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 17:10:31 midtest3 /kernel: (da3:ahc1:0:1:0): BDR message in message buffer
Oct 21 17:11:31 midtest3 /kernel: (da3:ahc1:0:1:0): BDR message in message buffer
Oct 21 17:12:30 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 19:47:07 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 21 20:04:24 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 01:38:54 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 02:51:36 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 04:10:12 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 05:51:51 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 07:41:04 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 07:47:00 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 09:22:32 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 10:50:59 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 11:06:40 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 13:34:20 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 15:12:07 midtest3 /kernel: (da2:ahc1:0:0:0): BDR message in message buffer
Oct 22 15:13:07 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 15:28:40 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 15:43:34 midtest3 /kernel: (da2:ahc1:0:0:0): BDR message in message buffer
Oct 22 15:44:34 midtest3 /kernel: (da2:ahc1:0:0:0): BDR message in message buffer
Oct 22 15:45:34 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 16:18:28 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
Oct 22 17:06:45 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB
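
For anyone who wants a per-device tally of the BDR messages rather than
eyeballing the grep output, here is a minimal Python sketch.  It is not
part of the original test; the function name count_bdr_events and the
regular expression are assumptions made for illustration, and the pattern
is only matched against the message format shown above.

```python
import re
from collections import Counter

# Assumed pattern: pulls the CAM device name (e.g. "da4") out of lines like
# "... /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB".
BDR_LINE = re.compile(r"\((da\d+):[^)]*\): .*BDR")


def count_bdr_events(lines):
    """Return a Counter mapping device name to number of BDR log lines."""
    counts = Counter()
    for line in lines:
        match = BDR_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the log above, including one wrapped message.
# The wrapped "buffer" continuation line carries no device and is ignored.
SAMPLE = [
    "Oct 21 14:44:18 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB",
    "Oct 21 17:10:31 midtest3 /kernel: (da3:ahc1:0:1:0): BDR message in message",
    "buffer",
    "Oct 21 17:12:30 midtest3 /kernel: (da4:ahc1:0:4:0): Queuing a BDR SCB",
]
```

To run it against a real log, pass an open file, for example
count_bdr_events(open("/var/log/messages")), since the function accepts
any iterable of lines.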


When it finally died, I'd swear it was telling me that da0 and/or da1 kept
timing out, in messages to the serial console which I of course didn't
capture.  The machine would respond to pings and print out the BDR timeout
messages, but would not do anything else, so it was apparently stuck at a
fairly high spl.

I'm updating to current sources now, having noticed Ken's recent
mega-commit.  I'll be able to break in with ddb this time, and can take a
dump if the situation recurs.  The system is in a mega rack-mount case
with multiple cooling fans blowing directly on the drives, which were cool
to the touch during the middle of the run, so I don't think we overheated.

-Chris


