From owner-freebsd-scsi@FreeBSD.ORG  Wed Nov  2 18:05:47 2011
In-Reply-To: <CAAAm0r2-pXLEZVoG7g_dkym6MzLJXggjOQh3a8t5QO90vPJvfw@mail.gmail.com>
References: <CAAAm0r2-pXLEZVoG7g_dkym6MzLJXggjOQh3a8t5QO90vPJvfw@mail.gmail.com>
Date: Wed, 2 Nov 2011 11:05:44 -0700
Message-ID: <CAAAm0r2TDHEcdN43MATU-ERzoDr=2Hy029YUTjuxh+9CBni1vw@mail.gmail.com>
From: Jason Wolfe <nitroboost@gmail.com>
To: freebsd-scsi@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Subject: Re: mps/LSI SAS2008 controller crashes when smartctl is run with
 upped disk tags

On Tue, Nov 1, 2011 at 11:13 AM, Jason Wolfe <nitroboost@gmail.com> wrote:

> Luckily remote syslogging is enabled, so while nothing is kept locally, we
> see messages similar to these transmitted before the server hangs,
> requiring a power cycle:
>
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 510
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 713
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 942
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 356
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 492
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 339
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 746
> (da5:mps0:0:6:0): SCSI command timeout on device handle 0x000f SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 613
> (da2:mps0:0:3:0): SCSI command timeout on device handle 0x000c SMID 16
> (da10:mps0:0:11:0): SCSI command timeout on device handle 0x0014 SMID 305
> (da1:mps0:0:2:0): SCSI command timeout on device handle 0x000b SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 594
>
> In some cases that would be followed by this, which would usually be the
> last transmission, though we don't see this in all cases.  It may just be
> the system isn't always alive long enough to transmit:
>
> kernel: mps0: IOC Fault 0x40006003, Resetting
>
>
Hello,

Testing with the LSI-supplied driver, it appears to have a code path that
handles the condition that causes our driver to crash.  Here are two sets of
messages:

mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80003fb000 cm 0xffffff800040bdf8
(da0:mpslsi0:0:8:0): WRITE(10). CDB: 2a 0 55 bf 5a 3f 0 1 0 0 length 131072 SMID 97 command timeout cm 0xffffff800040bdf8 ccb 0xffffff00
mpslsi0: mpssas_alloc_tm freezing simq
mpslsi0: timedout cm 0xffffff800040bdf8 allocated tm 0xffffff8000409070
(da0:mpslsi0:0:8:0): READ(10). CDB: 28 0 55 96 48 7f 0 0 80 0 length 65536 SMID 171 completed cm 0xffffff80004105a8 ccb 0xffffff03c3443y
(da0:mpslsi0:0:8:0): READ(10). CDB: 28 0 54 f8 a4 3f 0 0 80 0 length 65536 SMID 762 completed cm 0xffffff8000434230 ccb 0xffffff001317ay
(da0:mpslsi0:0:8:0): WRITE(10). CDB: 2a 0 55 bf 5a 3f 0 1 0 0 length 131072 SMID 97 completed timedout cm 0xffffff800040bdf8 ccb 0xffff1
(noperiph:mpslsi0:0:8:0): SMID 50 finished recovery after aborting TaskMID 97
mpslsi0: mpssas_free_tm releasing simq


mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80003fb000 cm 0xffffff8000441e18
(da7:mpslsi0:0:15:0): WRITE(10). CDB: 2a 0 33 76 29 ef 0 1 0 0 length 131072 SMID 989 command timeout cm 0xffffff8000441e18 ccb 0xfffff0
mpslsi0: mpssas_alloc_tm freezing simq
mpslsi0: timedout cm 0xffffff8000441e18 allocated tm 0xffffff80004063e0
(da7:mpslsi0:0:15:0): READ(10). CDB: 28 0 71 14 a1 4f 0 1 0 0 length 131072 SMID 857 completed cm 0xffffff8000439e38 ccb 0xffffff001316y
(da7:mpslsi0:0:15:0): READ(10). CDB: 28 0 71 e4 98 57 0 0 80 0 length 65536 SMID 300 completed cm 0xffffff80004182a0 ccb 0xffffff0392f0y
(da7:mpslsi0:0:15:0): WRITE(10). CDB: 2a 0 33 76 29 ef 0 1 0 0 length 131072 SMID 989 completed timedout cm 0xffffff8000441e18 ccb 0xff1
(noperiph:mpslsi0:0:15:0): SMID 4 finished recovery after aborting TaskMID 989
mpslsi0: mpssas_free_tm releasing simq
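
For anyone who wants to tally these events from a syslog capture, here is a
rough Python sketch.  It is not part of either driver: the function name
pair_timeouts and both regular expressions are illustrative assumptions based
only on the mpslsi message formats shown above.  It pairs each "command
timeout" SMID with the later "finished recovery after aborting TaskMID" line,
and assumes one log message per line, so mail-wrapped lines need to be
rejoined first.

```python
import re

# Assumption: patterns are derived only from the message formats quoted above.
TIMEOUT_RE = re.compile(
    r"\((?P<periph>\w+):(?P<sim>\w+):\d+:\d+:\d+\): "
    r".*SMID (?P<smid>\d+) command timeout"
)
RECOVERY_RE = re.compile(
    r"SMID \d+ finished recovery after aborting TaskMID (?P<smid>\d+)"
)


def pair_timeouts(lines):
    """Return (recovered, unrecovered) lists of (device, smid) tuples.

    Expects one log message per line.
    """
    pending = {}    # smid -> device name, timeouts still awaiting recovery
    recovered = []  # timeouts that were later reported as aborted/recovered
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            pending[int(m.group("smid"))] = m.group("periph")
            continue
        m = RECOVERY_RE.search(line)
        if m:
            smid = int(m.group("smid"))
            if smid in pending:
                recovered.append((pending.pop(smid), smid))
    unrecovered = [(dev, smid) for smid, dev in pending.items()]
    return recovered, unrecovered
```

Feeding it the lines of a log file, e.g. pair_timeouts(open(path)) with path
being wherever the remote syslog lands, gives a quick view of which timeouts
the driver recovered from and which were still outstanding when the log ended.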

The server ran for 10 minutes with these happening every 10-30 seconds.
With our community driver, the first instance of commands timing out during
this smartctl storm would cause the server to hang and sometimes the
controller to reset.  Hopefully this is helpful to someone.

Jason