From owner-freebsd-scsi@FreeBSD.ORG  Sun Jun  1 10:38:45 2003
Message-ID: <3EDA3982.5040202@btc.adaptec.com>
Date: Sun, 01 Jun 2003 11:36:02 -0600
From: Scott Long <scott_long@btc.adaptec.com>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.3) Gecko/20030414
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: "Marc G. Fournier" <scrappy@hub.org>
References: <20030601131404.P6572@hub.org>
In-Reply-To: <20030601131404.P6572@hub.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
cc: freebsd-scsi@freebsd.org
Subject: Re: Critical bug in Adaptec(aac) driver ...

Marc G. Fournier wrote:
> As those on this list will have seen over the past few months, I have a
> server that had (past tense) an Adaptec 2120s controller in her that was
> giving a lot of grief ... about 3 weeks ago, the server it was in *really*
> blew up ... one drive was reported as down (in a RAID5 array), and when we
> tried to bring it back up, a second drive started to "fail" ... I got the
> techs to shut her down, and literally rushed to the remote location to see
> if there was anything that I could do to at least recover the data ...
> 
> When I got there to bring it back up, the server reported that a 3rd drive
> had failed ... and within a few hours, a 4th drive failed ... the result
> being that we lost all of the data on that server, which turned out to be
> quite painful to recover ...
> 
> While down there, we replaced the Adaptec controller with an Intel one,
> reformatted the exact same drives, in the exact same chassis, and she's
> been running fine since ...
> 
> On my trip back, I had a chat with a friend that does development work in
> the Linux world, and who had had that server previous to myself, and
> apparently there is a "known bug" in Linux that he says sounds exactly
> like what I experienced (they hit it right in the middle of developing on
> that box) and that there are apparently two Linux kernel patches that they
> had to apply (after rebuilding from scratch) to correct the problem ...
> 
> The way he explained the problem to me, he made it sound like the kernel
> driver was interacting with the BIOS and causing some corruption ... not
> sure at what level, but since trying to swap in a new controller didn't
> restore things, I'm suspecting at the hard drive level ... ?
> 
> Scott, while down there, I tried just about everything I could think of
> ... we replaced the SCSI cable, put the drives/controller into a second
> identical chassis, swapped the host controller cards themselves (I had
> brought spares) ... and that server, as I mentioned, is currently running
> quite happily with an Intel host controller in it :(  So, unless the same
> "failure" was hitting two host controllers, hardware failure doesn't seem
> to have been the cause ...
> 

I understand your frustration and wish there was more I could do to 
help.  Please send me whatever information you have.

Scott