From: Tony Byrne
Organization: ByrneHQ
Date: Tue, 7 Mar 2006 22:06:55 +0000
To: freebsd-stable@freebsd.org, freebsd-scsi@freebsd.org
Subject: MegaRAID lockups under 5.5 PRELEASE
Message-ID: <1867339507.20060307220655@byrnehq.com>

Folks,

We have a pair of servers, each with an Intel SCRU42X branded MegaRAID controller installed. Both cards have battery-backed cache. The servers form a 2-node Slony cluster for PostgreSQL, with each node having a single RAID5 array consisting of 3 live disks with a hot standby disk.

About a week ago we upgraded FreeBSD on both boxes from a point on RELENG_5 (1 July 2005) to 5.5-PRERELEASE, and now one of the boxes has started behaving badly. In the space of two days we've had about half a dozen occurrences of random processes blocking and rendering the machine unusable. 'top' shows the stricken processes in state 'ffsfsn' and they cannot be killed. The affected machine is a Slony subscriber and so isn't directly used by customers, but it is still an important component in our system. PostgreSQL seems to be the most likely to block, but sshd has also blocked, preventing logins.

The interesting thing is that the main box, which has an almost identical configuration and a greater workload, has remained unaffected so far. The only difference between the two machine configurations is that the misbehaving machine has only 128MB of on-board battery-backed cache, while the main machine has 256MB. The firmware on both machines is Intel's version 413Y.

This is not the first time that we've had problems like this with FreeBSD and this model of controller. For background see:

http://docs.freebsd.org/cgi/getmsg.cgi?fetch=172108+0+archive/2004/freebsd-stable/20041231.freebsd-stable
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=126427+0+archive/2004/freebsd-stable/20041121.freebsd-stable

The current problem is very reminiscent of the latter issue, which had been causing headaches for us over a year ago, but which disappeared after some driver improvements by Scott Long (many thanks Scott :-). I see from a diff of the driver code from RELENG_5 on 1 July 2005 and the latest RELENG_5 version that some further changes have been made to the driver. I would have been moderately happy if I could have pinned the reemergence of the problem on these changes, because then I would have had a specific cause. However, as a test, I reverted to the *previous* kernel from 1 July 2005 and the box blocked in sshd within a few hours, preventing login.

At this stage I'm looking at upgrading the firmware of the RAID card to the latest and greatest, and if that doesn't resolve it, I plan to make a jump to FreeBSD 6, which appears to have substantial changes to the amr driver and which might solve the problem. Before I leap though, I'd be interested in hearing if anyone is familiar with the behavior that I've described and can point me in the right direction. Alternatively, if someone can say "hey we use FreeBSD 6.x with the SCRU42X and it works great" then I'd be obliged.

Our woes may well be caused by some faulty hardware, especially since the other box has remained stable after the same upgrade, but I remain suspicious because the problem arose so soon after an upgrade of FreeBSD and after many months of uptime.

Many thanks,

Regards,

Tony.

-- 
Tony Byrne