From owner-freebsd-scsi@FreeBSD.ORG Wed May 21 21:22:02 2003
Delivered-To: freebsd-scsi@freebsd.org
Message-ID: <000301c3201a$481cee80$bd140343@compaq7058>
From: "cp"
Date: Wed, 21 May 2003 21:26:13 -0700
Subject: AIC7902 Failures
List-Id: SCSI subsystem

I'm trying to ready this system for production as a webserver. It is too fragile for that, apparently because of SCSI timeouts(?).

The hardware configuration is: Supermicro 7043A-8R (X5DA8 motherboard, 2 Xeon 2.6 GHz, 2 GB, AIC7902, Super GEM 318, E7505), 2 Seagate Cheetah ST336753LC and 1 WD1201TB IDE.

I purchased the 5.0 CDs (January 2003 #0) and installed with few BIOS changes.
The install failed during the ports, so I reinstalled. Occasionally I could make it through, but after repeating the install 10-15 times I found it failed more often than not. The transfers would slow and then I would get the following error:

    panic: page fault
    syncing disks, buffers remaining...
    panic: allocbuf: buffer not busy
    Uptime: ....
    (da0:ahd0:0:0:0): Synchronize cache failed, status==0x5b,scsi status==0x0
    Terminate ACPI
    panic: bwrite:buffer is not busy???
    Uptime...

Removing one drive would change the scenario, but it would still fail at times. I did a single-drive install, built the new kernel and ran it that way for a couple of days running 30 disk read/write scripts. It would still occasionally fail and showed timeouts in the debug messages. Of course even one failure is too many for a business. On 5.0, any of the failures would dismount the pack, and a power off was necessary to bring it back.

I concluded (wrongly, as it turned out) that one disk was bad and made the vendor replace it. I still had the same results. I ran through every combination of BIOS disables and hardware strap disables, but found that only setting the drives to 80 MB/s (40 MHz) gave me what seemed to be a stable system. At this setting (discounting CPU waste) the very expensive SCSI drives run slower than the IDE drive in my tests.

It was suggested that I try 4.8, and indeed, while searching the net, I had seen one instance where a developer had recommended a driver change on an older version of this machine; it was dated February or March. I purchased the 4.8 CDs and started over. I went through the same process as I had with 5.0 and, while installations never failed, I could make the system fail while running. The curious thing on 4.8 was that I always received a Dump Card State message in the midst of boot. This was consistent, so I could capture it. I ran this way for a couple of weeks, finding again that I needed to drop the disks to 80/40 to make the system stable.
Even after giving in on what seemed to be a performance problem, I still could not justify putting the system into production. Of its 6 FreeBSD predecessors, only one ever crashed, and that was my own fault in kernel configuration.

Just to make sure I didn't have a problem with hardware, I once again installed Windows 2000 Pro with Adaptec drivers and ran the system as hard as Windows can run. It didn't fail throughout the weekend. The system was QA'd on W2K Server prior to my receiving it, so that wasn't a surprise. It also runs in hard testing without failure on 4.8 running from the IDE drive.

Below is the dmesg for 5.0 (this is an unsanitized SMP kernel, so I apologize for the mess; I tossed 5.0 back on just to get it for this message) and the dmesg for 4.8 (note the Dump Card State on 4.8 from the ahd driver, which hopefully will be helpful). If there is something more I can collect, please let me know. I've invested many hours into this system, it can't go back to the vendor, and I am willing to debug it any way I can.

************5.0 dmesg**************
Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 5.0-RELEASE #0: Tue May 20 05:51:50 PDT 2003
    root@tshed2..../usr/obj/usr/src/sys/TSHEDSMP
Preloaded elf kernel "/boot/kernel/kernel" at 0xc04aa000.
Preloaded elf module "/boot/kernel/acpi.ko" at 0xc04aa0a8.
Timecounter "i8254" frequency 1193182 Hz
CPU: Pentium 4 (2665.92-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf27  Stepping = 7
  Features=0xffffffffbfebfbff>
real memory = 2146893824 (2047 MB)
avail memory = 2085244928 (1988 MB)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
Programming 24 pins in IOAPIC #1
Programming 24 pins in IOAPIC #2
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): apic id: 0, version: 0x00050014, at 0xfee00000
 cpu1 (AP): apic id: 6, version: 0x00050014, at 0xfee00000
 cpu2 (AP): apic id: 1, version: 0x00050014, at 0xfee00000
 cpu3 (AP): apic id: 7, version: 0x00050014, at 0xfee00000
 io0 (APIC): apic id: 2, version: 0x00178020, at 0xfec00000
 io1 (APIC): apic id: 3, version: 0x00178020, at 0xfec80000
 io2 (APIC): apic id: 4, version: 0x00178020, at 0xfec80100
Initializing GEOMetry subsystem
Pentium Pro MTRR support enabled
npx0: on motherboard
npx0: INT 16 interface
acpi0: on motherboard
    ACPI-0625: *** Info: GPE Block0 defined as GPE0 to GPE31
Using $PIR table, 21 entries at 0xc00fde70
acpi0: power button is handled as a fixed feature programming model.
Timecounter "ACPI-fast" frequency 3579545 Hz
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
acpi_cpu0: on acpi0
acpi_cpu1: on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
IOAPIC #0 intpin 17 -> irq 2
agp0: mem 0xf4000000-0xf7ffffff at device 0.0 on pci0
pci0: at device 0.1 (no driver attached)
pcib1: mem 0xf8000000-0xfbffffff at device 1.0 on pci0
pci1: on pcib1
IOAPIC #0 intpin 21 -> irq 11
pci1: at device 0.0 (no driver attached)
pcib2: at device 2.0 on pci0
pcib2: could not get PCI interrupt routing table for \\_SB_.PCI0.HLB_ - AE_NOT_FOUND
pci2: on pcib2
pci2: at device 28.0 (no driver attached)
pcib3: at device 29.0 on pci2
pci3: on pcib3
IOAPIC #2 intpin 6 -> irq 16
em0: port 0x3000-0x303f mem 0xf2100000-0xf211ffff irq 16 at device 3.0 on pci3
em0: Speed:N/A Duplex:N/A
pci2: at device 30.0 (no driver attached)
pcib4: at device 31.0 on pci2
pci4: on pcib4
IOAPIC #1 intpin 8 -> irq 17
IOAPIC #1 intpin 9 -> irq 18
ahd0: port 0x4000-0x40ff,0x4400-0x44ff mem 0xf2200000-0xf2201fff irq 17 at device 3.0 on pci4
aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs
ahd1: port 0x4800-0x48ff,0x4c00-0x4cff mem 0xf2202000-0xf2203fff irq 18 at device 3.1 on pci4
aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs
pcib5: at device 30.0 on pci0
pci5: on pcib5
isab0: at device 31.0 on pci0
isa0: on isab0
atapci0: port 0x2440-0x244f,0x374-0x377,0x170-0x177,0x3f4-0x3f7,0x1f0-0x1f7 mem 0xf0000000-0xf00003ff irq 0 at device 31.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pci0: at device 31.3 (no driver attached)
pci0: at device 31.5 (no driver attached)
acpi_button0: on acpi0
atkbdc0: port 0x64,0x60 irq 1 on acpi0
atkbd0: flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
fdc0: port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
orm0: