Date:      Mon, 30 Jun 2008 10:28:33 -0600
From:      Scott Long <scottl@samsco.org>
To:        bseklecki@collaborativefusion.com
Cc:        Sean McAfee <smcafee@collaborativefusion.com>, scottl@freebsd.org, Jason Thomson <jason.thomson@mintel.com>, "freebsd-hardware@freebsd.org" <freebsd-hardware@freebsd.org>, Benjie Chen <benjie@addgene.org>
Subject:   Re: PERC5 (LSI MegaSAS) Patrol Read crashes
Message-ID:  <486909B1.3020309@samsco.org>
In-Reply-To: <1214840198.18670.43.camel@soundwave.ws.pitbpa0.priv.collaborativefusion.com>
References:  <20071114122210.42E8613C4BB@mx1.freebsd.org>	 <1195160114.4042.154.camel@new-host> <1214840198.18670.43.camel@soundwave.ws.pitbpa0.priv.collaborativefusion.com>

Brian A. Seklecki wrote:
> On Thu, 2007-11-15 at 15:55 -0500, Brian A Seklecki (Mobile) wrote:
>> Normally I'd be praising Dell, but I think a little vendor bashing is
>> due here.
> 
> All:
> 
> Just to follow up, we've been running these 1st-generation 2950s in our
> lab with RHEL 5.2 x86_64 for ~3 weeks w/o any disk or I/O problems.
> 
> It must have been some random bug with the FreeBSD mfi(4) that only
> affected that revision of the PERC5, or, since the motherboard/CPU
> family/chipset is entirely different in R2 and R3, something with
> FreeBSD and how it was handling the controller (ACPI?)
> 
> We never had any stability problems with R2 and R3 on RELENG_6_3 on the
> 2950 or 1950.
> 
> From now on we'll wait for R2 before we go anywhere near new Dell
> gear.  
> 
> What do you think the chances of them dumping LSI for Acera and Broadcom
> for Intel? :)
> 
> ~BAS
> 
>> It's a software bug (driver).  It can probably be easily fixed.  I
>> think there's a PR on it somewhere (will check).

The problem is a firmware bug in the MegaRAID SAS controller.  It seems
that while the controller can handle 512 or more concurrent commands,
it can only handle 128 concurrent commands to each array.  Patrol
reads aren't the primary cause; they just help the problem appear.  When
a patrol read cycle runs, it tends to slow down i/o enough that commands
to the array get backed up, and you tend to reach the 128 limit.

I don't know if there is a firmware fix from Dell/LSI, or if there will
ever be a fix.  FreeBSD drivers tend to stress hardware a lot more
than Linux and Windows do, and since the latter two are used as the
QA yardstick, anything that doesn't affect them doesn't usually get
fixed.  An easy work-around for the driver is to change the following
line in /sys/dev/mfi/mfi.c::mfi_alloc_commands()

ncmds = sc->mfi_max_fw_cmds;

to

ncmds = 128;
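
As an illustration only, the same cap can be written as a clamp rather
than a hard-coded assignment, so that a controller whose firmware
advertises fewer than 128 commands is never handed a larger value.  The
struct and function names below are hypothetical stand-ins, not the
actual code in mfi.c; only the field name mfi_max_fw_cmds and the value
128 come from the message above.

```c
#include <assert.h>
#include <stdint.h>

/* Per-array limit described in the message (constant name is hypothetical). */
#define MFI_PER_ARRAY_CMD_LIMIT 128

/* Stand-in for the single softc field used here; not the real struct mfi_softc. */
struct mfi_softc_sketch {
	uint32_t mfi_max_fw_cmds;	/* command count advertised by the firmware */
};

/* Return the number of commands to allocate, capped at the per-array limit. */
static uint32_t
mfi_clamped_ncmds(const struct mfi_softc_sketch *sc)
{
	uint32_t ncmds = sc->mfi_max_fw_cmds;

	if (ncmds > MFI_PER_ARRAY_CMD_LIMIT)
		ncmds = MFI_PER_ARRAY_CMD_LIMIT;

	/* Sanity check: the result never exceeds the cap. */
	assert(ncmds <= MFI_PER_ARRAY_CMD_LIMIT);
	return (ncmds);
}
```

With this shape, a firmware value above 128 is reduced to 128 and a
smaller value is passed through unchanged.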

A more complete solution would require me to write an i/o scheduler in
the driver, something that would take quite a bit of effort.
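
The message does not say how such a scheduler would work.  Purely as an
assumption, one possible shape is per-array accounting: count the
commands in flight to each array, hold back anything beyond 128, and
release a held command whenever one completes.  Every name below is
hypothetical and nothing here is taken from mfi(4); it is a minimal
sketch of the idea, without the locking or real command queues a driver
would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ARRAY_CMD_LIMIT 128	/* per-array limit reported in the message */

/* Hypothetical per-array accounting; not a structure from mfi(4). */
struct array_queue {
	uint32_t inflight;	/* commands currently issued to the controller */
	uint32_t deferred;	/* commands held back in the driver */
};

/* Return true if the command may be issued now, false if it was deferred. */
static bool
array_submit(struct array_queue *aq)
{
	if (aq->inflight < ARRAY_CMD_LIMIT) {
		aq->inflight++;
		return (true);
	}
	aq->deferred++;
	return (false);
}

/*
 * Account for one completed command.  Return true if a deferred command
 * was moved into the freed slot, false if nothing was waiting.
 */
static bool
array_complete(struct array_queue *aq)
{
	assert(aq->inflight > 0);
	aq->inflight--;
	if (aq->deferred > 0) {
		aq->deferred--;
		aq->inflight++;
		return (true);
	}
	return (false);
}
```

Unlike the global cap, this would let the controller as a whole run
more than 128 commands as long as no single array exceeds the limit.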

With all this said, I still stand behind LSI controllers.  This bug,
while unfortunate, is relatively minor and easy to work around, and
it's the only significant bug that has turned up in over two and a half
years with this hardware.

Scott



