Date: Tue, 21 Sep 2010 14:36:03 -0500
From: Scott Lambert
To: freebsd-scsi@freebsd.org
Message-ID: <20100921193603.GA18674@sysmon.tcworks.net>
Subject: Controller is no longer running

This problem has occurred about five times in the past year, since we
moved to 8.x. It happened with 7.x also, but the machine wasn't as
critical back then, so I didn't care as much and hoped 8 would make it
all better. The machine wasn't loaded as heavily then, and it probably
happened three times in two years. Occurrences may be two hours, two
days, or five months apart.
I haven't been able to figure out a set of conditions that applies every
time it happens. It does tend to happen while the backups are running
(amanda dump or tar). I think that just provides the critical disk I/O
load level that makes the problem more likely.

I swear I took a picture of the error messages on the console the
previous time it happened, but I can't find it now. This morning I had
remote hands power cycle the machine while I was en route to the office.
The message on screen was, or was very close to, "The controller is no
longer running". From the previous occurrence, I remember messages about
commands to the RAID controller timing out after something like 15
seconds.

The firmware on the controller is from 2006 and is the latest I could
find. Is this a known problem with Adaptec 2120S type RAID cards? Or do
I just have bad hardware?

The array is always intact after a power cycle, but fsck has to fix many
things. The machine is now a cyrus-imapd mail server.

FreeBSD 8.1-STABLE #0: Thu Aug 19 19:41:51 CDT 2010
    root@cyrus.example.com:/usr/obj/usr/src/sys/GENERIC i386
CPU: Intel(R) Xeon(TM) CPU 2.80GHz (2793.02-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0xf48 Family = f Model = 4 Stepping = 8
  Features=0xbfebfbff
  Features2=0x649d
  AMD Features=0x20100000
  AMD Features2=0x1
  TSC: P-state invariant
real memory = 2147483648 (2048 MB)
Physical memory chunk(s):
0x0000000000001000 - 0x000000000009dfff, 643072 bytes (157 pages)
0x0000000000100000 - 0x00000000003fffff, 3145728 bytes (768 pages)
0x0000000001026000 - 0x000000007db8afff, 2092322816 bytes (510821 pages)
avail memory = 2090995712 (1994 MB)
aac0: mem 0xf8000000-0xfbffffff irq 50 at device 9.0 on pci3
aac0: Reserved 0x4000000 bytes for rid 0x10 type 3 at 0xf8000000
aac0: Enable Raw I/O
aac0: New comm. interface enabled
ioapic2: routing intpin 2 (PCI IRQ 50) to lapic 0 vector 51
aac0: [MPSAFE]
aac0: [ITHREAD]
aac0: i960 80303 100MHz, 64MB memory (48MB cache, 16MB execution), optional battery present
aac0: Kernel 4.2-0, Build 8205, S/N 503926
aac0: Supported Options=31d7e
aac0: Adaptec 2120S, aac driver 2.1.9-1
aacp0: on aac0
aacd0: on aac0
aacd0: 279962MB (573362176 sectors)
GEOM: new disk aacd0
(probe0:aacp0:0:0:0): Data overrun
(probe0:aacp0:0:0:0): Retrying command
(probe0:aacp0:0:0:0): Data overrun
(probe0:aacp0:0:0:0): Retrying command
(probe0:aacp0:0:0:0): Data overrun
(probe0:aacp0:0:0:0): Retrying command
(probe0:aacp0:0:0:0): Data overrun
(probe0:aacp0:0:0:0): Retrying command
(probe0:aacp0:0:0:0): Data overrun
(probe0:aacp0:0:0:0): Error 5, Retries exhausted
(probe0:aacp0:0:2:0): Data overrun
(probe0:aacp0:0:2:0): Retrying command
(probe0:aacp0:0:2:0): Data overrun
(probe0:aacp0:0:2:0): Retrying command
(probe0:aacp0:0:2:0): Data overrun
(probe0:aacp0:0:2:0): Retrying command
(probe0:aacp0:0:2:0): Data overrun
(probe0:aacp0:0:2:0): Retrying command
(probe0:aacp0:0:2:0): Data overrun
(probe0:aacp0:0:2:0): Error 5, Retries exhausted
(probe0:aacp0:0:3:0): Data overrun
(probe0:aacp0:0:3:0): Retrying command
(probe0:aacp0:0:3:0): Data overrun
(probe0:aacp0:0:3:0): Retrying command
(probe0:aacp0:0:3:0): Data overrun
(probe0:aacp0:0:3:0): Retrying command
(probe0:aacp0:0:3:0): Data overrun
(probe0:aacp0:0:3:0): Retrying command
(probe0:aacp0:0:3:0): Data overrun
(probe0:aacp0:0:3:0): Error 5, Retries exhausted
(probe0:aacp0:0:4:0): Data overrun
(probe0:aacp0:0:4:0): Retrying command
(probe0:aacp0:0:4:0): Data overrun
(probe0:aacp0:0:4:0): Retrying command
(probe0:aacp0:0:4:0): Data overrun
(probe0:aacp0:0:4:0): Retrying command
(probe0:aacp0:0:4:0): Data overrun
(probe0:aacp0:0:4:0): Retrying command
(probe0:aacp0:0:4:0): Data overrun
(probe0:aacp0:0:4:0): Error 5, Retries exhausted
(probe0:aacp0:0:6:0): Data overrun
(probe0:aacp0:0:6:0): Retrying command
(probe0:aacp0:0:6:0): Data overrun
(probe0:aacp0:0:6:0): Retrying command
(probe0:aacp0:0:6:0): Data overrun
(probe0:aacp0:0:6:0): Retrying command
(probe0:aacp0:0:6:0): Data overrun
(probe0:aacp0:0:6:0): Retrying command
(probe0:aacp0:0:6:0): Data overrun
(probe0:aacp0:0:6:0): Error 5, Retries exhausted
pass0 at aacp0 bus 0 scbus0 target 0 lun 0
pass0: Fixed Uninstalled SCSI-3 device
pass0: 3.300MB/s transfers
pass1 at aacp0 bus 0 scbus0 target 2 lun 0
pass1: Fixed Uninstalled SCSI-3 device
pass1: 3.300MB/s transfers
pass2 at aacp0 bus 0 scbus0 target 3 lun 0
pass2: Fixed Uninstalled SCSI-3 device
pass2: 3.300MB/s transfers
pass3 at aacp0 bus 0 scbus0 target 4 lun 0
pass3: Fixed Uninstalled SCSI-3 device
pass3: 3.300MB/s transfers
pass4 at aacp0 bus 0 scbus0 target 6 lun 0
pass4: Fixed Uninstalled SCSI-2 device
pass4: 3.300MB/s transfers
ses0 at aacp0 bus 0 scbus0 target 6 lun 0
ses0: Fixed Uninstalled SCSI-2 device
ses0: 3.300MB/s transfers
ses0: SAF-TE Compliant Device
pass0 at aacp0 bus 0 scbus0 target 0 lun 0
pass0: Fixed Uninstalled SCSI-3 device
pass0: 3.300MB/s transfers
pass1 at aacp0 bus 0 scbus0 target 2 lun 0
pass1: Fixed Uninstalled SCSI-3 device
pass1: 3.300MB/s transfers
pass2 at aacp0 bus 0 scbus0 target 3 lun 0
pass2: Fixed Uninstalled SCSI-3 device
pass2: 3.300MB/s transfers
pass3 at aacp0 bus 0 scbus0 target 4 lun 0
pass3: Fixed Uninstalled SCSI-3 device
pass3: 3.300MB/s transfers
Trying to mount root from ufs:/dev/aacd0s1a
WARNING: / was not properly dismounted

-- 
Scott Lambert  KC5MLE  Unix SysAdmin
lambert@lambertfam.org