Message-ID: <3EDA3982.5040202@btc.adaptec.com>
Date: Sun, 01 Jun 2003 11:36:02 -0600
From: Scott Long
To: "Marc G. Fournier"
Cc: freebsd-scsi@freebsd.org
Subject: Re: Critical bug in Adaptec(aac) driver ...
References: <20030601131404.P6572@hub.org>
In-Reply-To: <20030601131404.P6572@hub.org>

Marc G. Fournier wrote:
> As those on this list will have seen over the past few months, I have a
> server that had (past tense) an Adaptec 2120s controller in her that was
> giving a lot of grief ... about 3 weeks ago, the server it was in *really*
> blew up ... one drive was reported as down (in a RAID5 array), and when we
> tried to bring it back up, a second drive started to "fail" ... I got the
> techs to shut her down, and literally rushed to the remote location to see
> if there was anything that I could do to at least recover the data ...
>
> When I got there to bring it back up, the server reported that a 3rd drive
> had failed ... and within a few hours, a 4th drive failed ... the result
> being that we lost all of the data on that server, which turned out to be
> quite painful to recover ...
>
> While down there, we replaced the Adaptec controller with an Intel one,
> reformatted the exact same drives, in the exact same chassis, and she's
> been running fine since ...
>
> On my trip back, I had a chat with a friend that does development work in
> the Linux world, and who had had that server previous to myself, and
> apparently there is a "known bug" in Linux that he says sounds exactly
> like what I experienced (they hit it right in the middle of developing on
> that box) and that there are apparently two Linux kernel patches that they
> had to apply (after rebuilding from scratch) to correct the problem ...
>
> The way he explained the problem to me, he made it sound like the kernel
> driver was interacting with the BIOs and causing some corruption ... not
> sure at what level, but since trying to swap in a new controller didn't
> restore things, I'm suspecting at the hard drive level ... ?
>
> Scott, while down there, I tried just about everything I could think to
> ... we replaced the SCSI cable, put the drives/controller into a second
> identical chassis, swap host controller cards themselves (I had brought
> spares) ... and that server, as I mentioned, is currently running quite
> happily with an Intel host controller in it :( So, unless the same
> "failure" was hitting two host controllers, hardware failure doesn't seem
> to have been the cause ...
>

I understand your frustration and wish there was more I could do to help.
Please send me whatever information you have.

Scott