Date:      Sat, 29 Sep 2007 21:18:17 -0400
From:      "Benjie Chen" <benjie@addgene.org>
To:        "Sean McAfee" <smcafee@collaborativefusion.com>
Cc:        freebsd-hardware@freebsd.org
Subject:   Re: PERC5 (LSI MegaSAS) Patrol Read crashes
Message-ID:  <c53be070709291818u5b7b81d7l5ac6f318336f2101@mail.gmail.com>
In-Reply-To: <46FD6E94.2080608@collaborativefusion.com>
References:  <46FD6E94.2080608@collaborativefusion.com>

I can confirm this problem on a PE1950 with the 6.2 i386 kernel as well.
Manual patrol read started by megacli crashes the system.
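
For anyone who wants a timestamp on the write stall Sean describes below (he saw it through an scp from another machine), a local probe along these lines could be left running on a console or ssh session while the patrol read is started. This is only a sketch: the names probe_write and watch, the 4 KB payload, and the thresholds are illustrative assumptions, not anything from megacli or the mfi driver. It prints before each write, so if fsync never returns, the last line printed marks when writes stopped.

```python
import os
import time


def probe_write(path, payload=b"x" * 4096):
    """Append payload to path, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the data to the device, not just the cache
    finally:
        os.close(fd)
    return time.monotonic() - start


def watch(path, interval=1.0, threshold=5.0, rounds=None):
    """Probe repeatedly and report slow writes.

    rounds=None loops until interrupted. Returns the list of latencies
    (in seconds) for the probes that completed.
    """
    latencies = []
    done = 0
    while rounds is None or done < rounds:
        # Printed first: if the write hangs, this is the last line shown.
        print(f"{time.strftime('%H:%M:%S')} writing...", flush=True)
        elapsed = probe_write(path)
        latencies.append(elapsed)
        if elapsed >= threshold:
            print(f"  slow write: {elapsed:.1f}s", flush=True)
        time.sleep(interval)
        done += 1
    return latencies
```

Pointing watch() at a scratch file on the affected array (any path you choose) and then starting the manual patrol read should show roughly how long after the start the writes stop completing.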

Thanks,
Benjie

On 9/28/07, Sean McAfee <smcafee@collaborativefusion.com> wrote:
>
>
> We first became aware of this problem about a month ago.  A database
> server was up but was completely unresponsive to anything other than
> pings.  I power cycled it via the DRAC and after we couldn't find
> anything suspicious in the logs, we figured it was a fluke.
>
> Until the next day, when its twin did the exact same thing.  This time,
> I was able to get a screen shot through the DRAC console.  Using old
> daily outputs and that screenshot, we correlated the crashes to patrol
> reads.  Since then, we've only seen it "in the wild" on one other
> machine, a 1950, but I've been trying to chase the problem down without
> much luck.
>
> I'm fortunate to have three machines at my disposal for this testing, so
> I was able to try a variety of combinations:
>
> Server 1:
> Chassis:          2950 v1
> System BIOS:      1.1.0
> PERC firmware:    1.00.01-0088 PERC F/W (from the 5.0.1-0030 A00 package)
> OS:               6.2-R_p7, 6-STABLE
>
> Server 2:
> Chassis:          2950 v1
> System BIOS:      1.1.0
> PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
> OS:               6.2-R_p7, 6-STABLE
>
> Server 3:
> Chassis:          2950 v2
> System BIOS:      1.5.1
> PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
> OS:               6.2-R_p7
>
> They're all running amd64, and each combination was tried with and
> without the linux_mfi.ko patches found in PR-113232.  For disks, they all
> have 2x36GB RAID1 and 4x73GB RAID10 (all SAS).  We use
> linux_mfi.ko + linux-megacli for management.
>
> The original problem occurred during automatic patrol reads coupled with
> heavy disk load.  I've changed the delay interval for the automatic
> patrol reads and tried to reproduce it but haven't had enough success to
> make it useful for troubleshooting.  Since the automatic reads are meant
> to be as unobtrusive as possible, I've been running a manual patrol
> read (megacli -AdpPR -Start -a0), which triggers a crash regardless
> of the I/O load.
>
> The behavior has little to no variation; shortly after the read is
> started, disk writes immediately cease (shown via an scp from another
> machine).  After a minute, the console will begin to fill up with lines
> such as:
>
> mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS
>
> The first 8 hex digits never change.  I bring that up because I
> suspect the problem has something to do with the enclosure, which is
> attached at 8, 255, or fffffff, depending on where you're looking.
>
> I've let it go up to 6000 seconds, but it eventually ends in a kernel
> panic.  That just seems to be a side effect of the original problem
> (processes with nowhere to write data), so I'm not too hung up on that.
>
> There's never anything pertaining to it in the controller's event log.
>
> Besides the platform version differences I mentioned above, I've tried:
> - Reducing the patrol read rate
> - Pulling down and modifying the patches from PR-115133 (which seems to
>   set an upper boundary at 0xffffffff)
> - Invoking a0/aALL interchangeably
> - Changing the cache flush interval
> - Disabling disk coercion
> - A bunch of other long-shot settings from megacli that aren't worth
>   listing
>
> Nothing has shown any appreciable difference in the behavior.
>
> Does anyone have an idea about what could be going on or anything else
> we can try?  For now, I'll probably just disable patrol reads and set
> them to auto with a 1 hour delay during outage windows only, but I'm
> hoping that someone is able to help with this.  At the very least, maybe
> I can save someone a whole bunch of time.
>
> Thanks in advance for any help.
>
> --
> Sean McAfee
> Collaborative Fusion, Inc.
>   smcafee@collaborativefusion.com
>   412-422-3463 x 4025
>
> 1710 Murray Avenue, Suite 320
> Pittsburgh, PA 15217
>
>
>
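The timeout lines quoted above can be tallied with a short script when comparing captures from several machines. The following is a rough sketch that assumes every line has the same shape as the single example in Sean's message; the function names are hypothetical, and nothing here comes from the driver source. It records the longest timeout seen per command address and splits the address into its upper and lower 32 bits, which makes it easy to check the observation that the first 8 hex digits never change.

```python
import re
from collections import defaultdict

# Matches lines shaped like the one quoted in this thread:
#   <dev>: COMMAND 0x<hex address> TIMEOUT AFTER <n> SECONDS
# The device name is matched loosely because only one sample line is known.
TIMEOUT_RE = re.compile(
    r"^(?P<dev>\w+): COMMAND 0x(?P<addr>[0-9a-fA-F]+) "
    r"TIMEOUT AFTER (?P<secs>\d+) SECONDS"
)


def summarize_timeouts(lines):
    """Return {(device, command address): longest timeout in seconds}."""
    worst = defaultdict(int)
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m is None:
            continue  # ignore anything that is not a timeout line
        key = (m.group("dev"), int(m.group("addr"), 16))
        worst[key] = max(worst[key], int(m.group("secs")))
    return dict(worst)


def split_address(addr):
    """Split a 64-bit value into its upper and lower 32 bits."""
    return addr >> 32, addr & 0xFFFFFFFF
```

Feeding summarize_timeouts() the lines of a saved console capture gives one entry per stuck command, which shows at a glance whether a single command is timing out repeatedly or several are.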



-- 
Benjie Chen, Ph.D.
Addgene, a better way to share plasmids
www.addgene.org


