From owner-freebsd-scsi@FreeBSD.ORG  Wed Nov  2 08:56:30 2011
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 05000106564A
	for <freebsd-scsi@freebsd.org>; Wed,  2 Nov 2011 08:56:30 +0000 (UTC)
	(envelope-from peter.maloney@brockmann-consult.de)
Received: from moutng.kundenserver.de (moutng.kundenserver.de
	[212.227.126.171])
	by mx1.freebsd.org (Postfix) with ESMTP id A41808FC0C
	for <freebsd-scsi@freebsd.org>; Wed,  2 Nov 2011 08:56:29 +0000 (UTC)
Received: from [10.3.0.26] ([141.4.215.32])
	by mrelayeu.kundenserver.de (node=mrbap0) with ESMTP (Nemesis)
	id 0MWhTP-1RSWZ91Dsx-00XIsw; Wed, 02 Nov 2011 09:43:52 +0100
Message-ID: <4EB102C7.8080401@brockmann-consult.de>
Date: Wed, 02 Nov 2011 09:43:51 +0100
From: Peter Maloney <peter.maloney@brockmann-consult.de>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US;
	rv:1.9.2.18) Gecko/20110617 Thunderbird/3.1.11
MIME-Version: 1.0
To: Jason Wolfe <nitroboost@gmail.com>
References: <CAAAm0r2-pXLEZVoG7g_dkym6MzLJXggjOQh3a8t5QO90vPJvfw@mail.gmail.com>	<4EAEF431.7090108@brockmann-consult.de>
	<CAAAm0r1T1ifTQt5A5O+jwUoKoGjzcbho606wCt4SpM3AQ-WM3Q@mail.gmail.com>
In-Reply-To: <CAAAm0r1T1ifTQt5A5O+jwUoKoGjzcbho606wCt4SpM3AQ-WM3Q@mail.gmail.com>
X-Enigmail-Version: 1.1.2
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: freebsd-scsi@freebsd.org
Subject: Re: mps/LSI SAS2008 controller crashes when smartctl is run with
 upped disk tags
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Nov 2011 08:56:30 -0000

On 11/01/2011 09:32 PM, Jason Wolfe wrote:
> On Mon, Oct 31, 2011 at 12:17 PM, Peter Maloney
> <peter.maloney@brockmann-consult.de
> <mailto:peter.maloney@brockmann-consult.de>> wrote:
>
>     Dear Jason,
>
>     I get a similar problem on a system with an LSI 9211-8i with 20 SATA
>     disks attached (2 SSDs and 18 spinning disks). My system doesn't hang,
>     panic, or reset though. I just lose access to one disk, which is then
>     considered FAULTED in my zpool status (with the ZFS file system). If I
>     physically remove the FAULTED disk and run "gpart recover da0", I
>     get a panic. Otherwise, the system keeps running in a degraded state.
>     When I reboot and resilver, some data is found damaged and repaired,
>     not just refreshed with the latest state. The server has 1 HBA and 2
>     backplanes, and I have the 2 mirrored root disks on different
>     backplanes. Maybe that is why mine runs degraded and yours hangs.
>
>     This happened twice so far (in around a month or two), and both
>     times it was one of the mirrored root disks (SSDs) that faulted.
>
>     My tags are set to 255. I will try reproducing it as you said, and
>     then if it fails, rebooting and trying again setting tags to 2 as
>     you suggested.
>
>     And *thank you very much for this information*. This is the last
>     outstanding issue with this server. I hope this workaround helps.
>
>     # camcontrol tags /dev/da0
>     (pass0:mps0:0:7:0): device openings: 255
>
>
> Peter,
>
> This happens 'randomly' for you, or do you have some automated process
> running smartctl that trips the drives up occasionally?
It appears to be completely random, but it could be something specific
going on that I just didn't think of. I don't know how to trigger it. I
once wrote a script that made a single pass over the disks with smartctl
(which I installed from ports) and recorded the device id, size of the
disks, etc. It didn't cause a crash, but I didn't try looping it
constantly to provoke one.
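
A minimal sketch of such a one-pass survey script, for anyone who wants
to try reproducing the problem. It is not the script described above:
the function names are made up for illustration, and it assumes that
"smartctl -i" prints its identity fields as "Key:   value" lines (for
example "Serial Number:"). The command runner is a parameter so the
parsing can be tried without touching real disks.

```python
import subprocess


def parse_smartctl_info(text):
    """Collect 'Key:   value' lines from `smartctl -i` output into a dict.

    Assumption: a field line has whitespace right after the first colon,
    which skips lines that merely contain a URL such as 'http://...'.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value[:1].isspace() and value.strip():
            info[key.strip()] = value.strip()
    return info


def survey(disks, run=subprocess.run):
    """Run `smartctl -i` once per disk and return {disk: parsed fields}."""
    results = {}
    for disk in disks:
        proc = run(["smartctl", "-i", "/dev/" + disk],
                   capture_output=True, text=True)
        results[disk] = parse_smartctl_info(proc.stdout)
    return results
```

Calling survey(["da0", "da1"]) makes a single pass; wrapping that call
in a loop would be the "looping it constantly" variant.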

The system uses "zfs send" to send the whole pool to another machine. It
uses rsync to back up some servers onto it. It serves a bunch of data
over NFS and also has Samba running, though not in use. The primary user
of the NFS shares is VMware ESXi, which has a terrible problem with
synchronous writes that might put a heavier load on the system.
> The way I'm getting around it currently is to move
> /usr/local/sbin/smartctl elsewhere and replace it with a wrapper
> that simply drops the tags to 1, executes smartctl at its new location
> with the options passed, then sets the tags back to whatever you
> prefer. There will obviously be a small performance penalty, but it
> should be brief and hopefully not even noticeable in your case.
In my reading, I found that people think reducing the I/O queues
(via kernel parameters) for ZFS actually improves performance (moving
the queue to the OS, I guess), so if tags work similarly, I wouldn't
expect much of a drop. Luckily, this system of mine is not a
performance machine... just a huge file server. So if it is slower but
more stable that way, I will leave tags set to 2 forever.
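
The wrapper idea quoted above can be sketched roughly as follows. This
is an illustration only, not Jason's actual wrapper: REAL_SMARTCTL is a
hypothetical path for the relocated binary, the helper names are made
up, and it assumes that "camcontrol tags <dev>" reports a line like
"device openings: 255" and that "camcontrol tags <dev> -N <n>" sets the
count. The current value is saved first so it can be restored even if
smartctl fails.

```python
import re
import subprocess

REAL_SMARTCTL = "/usr/local/libexec/smartctl.real"  # hypothetical location
LOW_TAGS = 1  # tag count to use while smartctl runs


def parse_openings(output):
    """Extract N from a 'device openings: N' line, or None if absent."""
    match = re.search(r"device openings:\s*(\d+)", output)
    return int(match.group(1)) if match else None


def find_device(args):
    """Return the bare name of the first /dev/... argument, or None."""
    for arg in args:
        if arg.startswith("/dev/"):
            return arg[len("/dev/"):]
    return None


def run_wrapped(args, run=subprocess.run):
    """Lower the tags, run the real smartctl, then restore the tags."""
    dev = find_device(args)
    saved = None
    if dev is not None:
        out = run(["camcontrol", "tags", dev],
                  capture_output=True, text=True).stdout
        saved = parse_openings(out)
    if saved is not None:
        run(["camcontrol", "tags", dev, "-N", str(LOW_TAGS)],
            capture_output=True, text=True)
    try:
        return run([REAL_SMARTCTL] + list(args)).returncode
    finally:
        # Restore the original tag count even if smartctl raised.
        if saved is not None:
            run(["camcontrol", "tags", dev, "-N", str(saved)],
                capture_output=True, text=True)
```

An installed wrapper would end with something like
sys.exit(run_wrapped(sys.argv[1:])). If no /dev/ argument is given or
the current tag count cannot be read, the tags are left untouched and
smartctl runs as usual.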
>
> If smartctl is not triggering these events for you, any idea what is?
I have no real clue, but my guess is that some NFS shares are using the
ZIL (the ZFS log device) a lot, and since that device is horribly
inefficient (scoring around 1500 IOPS during ZIL use on a disk that
scores 50-140k on other tests), it overloads the I/O system and
triggers the failure, purely based on load rather than something
particular like smartctl. So for now, I have disabled my ZIL to see if
it still crashes.

Also on my list of things to try:
- change to the IT firmware instead of IR, since ZFS prefers to have no
  RAID in there at all
- change the tags to 2
- try the LSI driver for the 9210-8i:
  <http://www.lsi.com/products/storagecomponents/Pages/LSISAS9210-8i.aspx>

Here is my forum thread about it:

http://forums.freebsd.org/showthread.php?t=26656

Are you using ZFS? Is your root volume in hardware RAID or software
RAID? I am curious because you say your systems hang, and mine just runs
degraded.
>
> Jason


Peter

-- 

--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney@brockmann-consult.de
Internet: http://www.brockmann-consult.de
--------------------------------------------