Date:      Wed, 2 Nov 2011 11:05:44 -0700
From:      Jason Wolfe <nitroboost@gmail.com>
To:        freebsd-scsi@freebsd.org
Subject:   Re: mps/LSI SAS2008 controller crashes when smartctl is run with upped disk tags
Message-ID:  <CAAAm0r2TDHEcdN43MATU-ERzoDr=2Hy029YUTjuxh%2B9CBni1vw@mail.gmail.com>
In-Reply-To: <CAAAm0r2-pXLEZVoG7g_dkym6MzLJXggjOQh3a8t5QO90vPJvfw@mail.gmail.com>
References:  <CAAAm0r2-pXLEZVoG7g_dkym6MzLJXggjOQh3a8t5QO90vPJvfw@mail.gmail.com>

On Tue, Nov 1, 2011 at 11:13 AM, Jason Wolfe <nitroboost@gmail.com> wrote:

> Luckily remote syslogging is enabled, so while nothing is kept locally, we
> see messages similar to these transmitted before the server hangs,
> requiring a power cycle:
>


> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 510
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 713
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 942
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 356
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 492
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 339
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 746
> (da5:mps0:0:6:0): SCSI command timeout on device handle 0x000f SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 613
> (da2:mps0:0:3:0): SCSI command timeout on device handle 0x000c SMID 16
> (da10:mps0:0:11:0): SCSI command timeout on device handle 0x0014 SMID 305
> (da1:mps0:0:2:0): SCSI command timeout on device handle 0x000b SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 594
>
> In some cases that would be followed by this, which would usually be the
> last transmission, though we don't see this in all cases.  It may just be
> the system isn't always alive long enough to transmit:
>
> kernel: mps0: IOC Fault 0x40006003, Resetting
>
>
Hello,

Testing with the LSI-supplied driver, it appears they have a code path that
handles the condition that causes our driver to crash.  Here are two sets
of messages:

mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80003fb000 cm
0xffffff800040bdf8
(da0:mpslsi0:0:8:0): WRITE(10). CDB: 2a 0 55 bf 5a 3f 0 1 0 0 length 131072
SMID 97 command timeout cm 0xffffff800040bdf8 ccb 0xffffff00
mpslsi0: mpssas_alloc_tm freezing simq
mpslsi0: timedout cm 0xffffff800040bdf8 allocated tm 0xffffff8000409070
(da0:mpslsi0:0:8:0): READ(10). CDB: 28 0 55 96 48 7f 0 0 80 0 length 65536
SMID 171 completed cm 0xffffff80004105a8 ccb 0xffffff03c3443y
(da0:mpslsi0:0:8:0): READ(10). CDB: 28 0 54 f8 a4 3f 0 0 80 0 length 65536
SMID 762 completed cm 0xffffff8000434230 ccb 0xffffff001317ay
(da0:mpslsi0:0:8:0): WRITE(10). CDB: 2a 0 55 bf 5a 3f 0 1 0 0 length 131072
SMID 97 completed timedout cm 0xffffff800040bdf8 ccb 0xffff1
(noperiph:mpslsi0:0:8:0): SMID 50 finished recovery after aborting TaskMID
97
mpslsi0: mpssas_free_tm releasing simq


mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80003fb000 cm
0xffffff8000441e18
(da7:mpslsi0:0:15:0): WRITE(10). CDB: 2a 0 33 76 29 ef 0 1 0 0 length
131072 SMID 989 command timeout cm 0xffffff8000441e18 ccb 0xfffff0
mpslsi0: mpssas_alloc_tm freezing simq
mpslsi0: timedout cm 0xffffff8000441e18 allocated tm 0xffffff80004063e0
(da7:mpslsi0:0:15:0): READ(10). CDB: 28 0 71 14 a1 4f 0 1 0 0 length 131072
SMID 857 completed cm 0xffffff8000439e38 ccb 0xffffff001316y
(da7:mpslsi0:0:15:0): READ(10). CDB: 28 0 71 e4 98 57 0 0 80 0 length 65536
SMID 300 completed cm 0xffffff80004182a0 ccb 0xffffff0392f0y
(da7:mpslsi0:0:15:0): WRITE(10). CDB: 2a 0 33 76 29 ef 0 1 0 0 length
131072 SMID 989 completed timedout cm 0xffffff8000441e18 ccb 0xff1
(noperiph:mpslsi0:0:15:0): SMID 4 finished recovery after aborting TaskMID
989
mpslsi0: mpssas_free_tm releasing simq

The server ran for 10 minutes with these happening every 10-30 seconds.
With our community driver, the first instance of commands timing out during
this smartctl storm would cause the server to hang and sometimes the
controller to reset.  Hopefully this is helpful to someone.
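
For anyone who wants to tally these messages from their own remote syslog,
here is a minimal Python sketch.  It is not part of either driver.  It
assumes the two message formats shown above, and the function names
(flatten, count_timeouts, unrecovered_smids) are hypothetical helpers made
up for illustration.  Since SMIDs can be reused over time, the recovery
check is only a rough approximation for short log excerpts.

```python
import re
from collections import Counter

# Community mps driver format (assumed from the messages quoted above):
# "(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 510"
TIMEOUT_RE = re.compile(
    r"\((da\d+):mps\d+:\d+:\d+:\d+\): SCSI command timeout "
    r"on device handle (0x[0-9a-fA-F]+) SMID (\d+)"
)

# LSI driver format (assumed from the messages above):
# "SMID 97 command timeout cm ..." and
# "... finished recovery after aborting TaskMID 97"
LSI_TIMEOUT_RE = re.compile(r"SMID (\d+) command timeout")
LSI_RECOVERED_RE = re.compile(r"finished recovery after aborting TaskMID (\d+)")


def flatten(log_text):
    """Strip mail quote markers and rejoin wrapped lines into one string."""
    lines = (re.sub(r"^\s*>\s?", "", line) for line in log_text.splitlines())
    return " ".join(" ".join(lines).split())


def count_timeouts(log_text):
    """Count community-driver command timeouts per device (da0, da11, ...)."""
    flat = flatten(log_text)
    return Counter(m.group(1) for m in TIMEOUT_RE.finditer(flat))


def unrecovered_smids(log_text):
    """Return SMIDs that timed out under the LSI driver with no matching
    'finished recovery' message.  Ignores SMID reuse (a simplification)."""
    flat = flatten(log_text)
    timed_out = set(LSI_TIMEOUT_RE.findall(flat))
    recovered = set(LSI_RECOVERED_RE.findall(flat))
    return timed_out - recovered
```

Feeding the saved syslog text to count_timeouts() gives a per-disk count,
and an empty set from unrecovered_smids() suggests every timed-out command
in that excerpt was aborted and recovered.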

Jason


