Date:      Tue, 7 Mar 2006 22:06:55 +0000
From:      Tony Byrne <freebsd-stable@byrnehq.com>
To:        freebsd-stable@freebsd.org, freebsd-scsi@freebsd.org
Subject:   MegaRAID lockups under 5.5 PRELEASE
Message-ID:  <1867339507.20060307220655@byrnehq.com>
Resent-Message-ID: <200603072206.k27M6XqL044909@schubert.byrnehq.com>

Folks,

We have a pair of servers, each with an Intel SCRU42X branded
MegaRAID controller installed. The cards both have battery backed
cache. The servers form a 2-node Slony cluster for PostgreSQL, with
each node having a single RAID5 array consisting of 3 live disks with
a hot standby disk.

About a week ago we upgraded FreeBSD on both boxes from a point on
RELENG_5 (1 July 2005) to 5.5-PRERELEASE and now one of the boxes has
started behaving badly.

In the space of two days we've had about half a dozen occurrences of
random processes blocking and rendering the machine unusable.  'top'
shows the stricken processes in state 'ffsfsn' and they cannot be killed.
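
For anyone who wants to watch for the same symptom, here is a minimal
sketch of how the stuck processes could be picked out of 'ps' output by
wait channel. It is not something we run in production. The function
name find_blocked and the sample listing are made up for illustration,
and the column layout assumes output from
'ps -axo pid,state,wchan,command'.

```python
def find_blocked(ps_output, wchan="ffsfsn"):
    """Return (pid, command) pairs for processes waiting on `wchan`.

    Assumes four columns: PID, STAT, WCHAN, COMMAND, with one header line.
    """
    blocked = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header line
        fields = line.split(None, 3)  # command may contain spaces
        if len(fields) < 4:
            continue  # ignore malformed or truncated lines
        pid, _state, chan, command = fields
        if chan == wchan:
            blocked.append((int(pid), command))
    return blocked


# Hypothetical sample listing, not captured from the affected machine.
SAMPLE = """\
  PID STAT WCHAN  COMMAND
  512 Ds   ffsfsn postgres: writer process
  734 Is   select /usr/sbin/sshd
  901 D    ffsfsn sshd: tony [priv]
"""

for pid, command in find_blocked(SAMPLE):
    print(pid, command)
```

On a live system the input could come from
subprocess.run(["ps", "-axo", "pid,state,wchan,command"],
capture_output=True, text=True).stdout rather than the sample string.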

The affected machine is a Slony subscriber and so isn't directly used
by customers, but is still an important component in our system.
PostgreSQL seems to be the most likely to block, but sshd has also
blocked, preventing logins.

The interesting thing is that the main box, which has an almost
identical configuration and a greater work load, has remained
unaffected, so far. The only difference between the two machine
configurations is that the misbehaving machine only has 128MB of
on-board battery backed cache, while the main machine has 256MB. The
firmware on both machines is Intel's version 413Y.

This is not the first time that we've had problems like this with
FreeBSD and this model of controller.  For background see:

http://docs.freebsd.org/cgi/getmsg.cgi?fetch=172108+0+archive/2004/freebsd-stable/20041231.freebsd-stable

http://docs.freebsd.org/cgi/getmsg.cgi?fetch=126427+0+archive/2004/freebsd-stable/20041121.freebsd-stable

The current problem is very reminiscent of the latter issue, which had
been causing headaches for us over a year ago, but which disappeared
after some driver improvements by Scott Long (many thanks Scott :-).

I see from a diff of the driver code from RELENG_5 on 1 July 2005 and
the latest RELENG_5 version that some further changes have been made
to the driver. I would have been moderately happy if I could have
pinned the reemergence of the problem on these changes, because then I
would have had a specific cause. However, as a test, I reverted to the
*previous* kernel from 1 July 2005 and the box blocked in sshd within
a few hours, preventing login.

At this stage I'm looking at upgrading the firmware of the RAID card
to the latest and greatest and if that doesn't resolve it, I plan to
make a jump to FreeBSD 6, which appears to have substantial changes to
the amr driver and which might solve the problem.

Before I leap though, I'd be interested in hearing if anyone is
familiar with the behavior that I've described and can point me in the
right direction. Alternatively, if someone can say "hey we use FreeBSD
6.x with the SCRU42X and it works great" then I'd be obliged.

Our woes may well be caused by some faulty hardware, especially since
the other box has remained stable after the same upgrade, but I remain
suspicious because the problem arose so soon after an upgrade of
FreeBSD and after many months of uptime.

Many thanks,

Regards,

Tony.

-- 
Tony Byrne