From owner-freebsd-scsi@freebsd.org  Tue Jul  7 12:02:24 2015
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Date: Tue, 7 Jul 2015 13:24:16 +0200
From: Yamagi Burmeister <lists@yamagi.org>
To: freebsd-scsi@freebsd.org
Subject: Device timeouts(?) with LSI SAS3008 on mpr(4)
Message-Id: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org>
X-Mailer: Sylpheed 3.4.2 (GTK+ 2.24.27; amd64-portbld-freebsd10.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Hello,
I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each
adapter serves 8 Intel DC S3700 SSDs. The operating system is FreeBSD
10.1-STABLE as of r283938 on 2 servers and r285196 on the last one.

The controllers identify themselves as:

----

mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
mpr0: IOCFacts  :
        MsgVersion: 0x205
        HeaderVersion: 0x2300
        IOCNumber: 0
        IOCExceptions: 0x0
        MaxChainDepth: 128
        NumberOfPorts: 1
        RequestCredit: 10240
        ProductID: 0x2221
        IOCRequestFrameSize: 32
        MaxInitiators: 32
        MaxTargets: 1024
        MaxSasExpanders: 42
        MaxEnclosures: 43
        HighPriorityCredit: 128
        MaxReplyDescriptorPostQueueDepth: 65504
        ReplyFrameSize: 32
        MaxVolumes: 0
        MaxDevHandle: 1106
        MaxPersistentEntries: 128
mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>

----

08.00.00.00 is the latest available firmware.


Since day one, 'dmesg' has been cluttered with CAM errors:

----

mpr1: Sending reset from mprsas_send_abort for target ID 5
        (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
(da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 terminated ioc 804b scsi 0 state c xfer 0
(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
(da11:mpr1:0:5:0): CAM status: Command timeout
mpr1: Unfreezing devq for target ID 5
(da11:mpr1:0:5:0): Retrying command
(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
(da11:mpr1:0:5:0): CAM status: SCSI Status Error
(da11:mpr1:0:5:0): SCSI status: Check Condition
(da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da11:mpr1:0:5:0): Retrying command (per sense data)
(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00
(da11:mpr1:0:5:0): CAM status: SCSI Status Error
(da11:mpr1:0:5:0): SCSI status: Check Condition
(da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da11:mpr1:0:5:0): Retrying command (per sense data)
(noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30

mpr1: Sending reset from mprsas_send_abort for target ID 2
        (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
(da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
mpr1: Unfreezing devq for target ID 2
(da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
(da8:mpr1:0:2:0): CAM status: Command timeout
(da8:mpr1:0:2:0): Retrying command
(da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
(da8:mpr1:0:2:0): CAM status: SCSI Status Error
(da8:mpr1:0:2:0): SCSI status: Check Condition
(da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da8:mpr1:0:2:0): Retrying command (per sense data)
(da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
(da8:mpr1:0:2:0): CAM status: SCSI Status Error
(da8:mpr1:0:2:0): SCSI status: Check Condition
(da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da8:mpr1:0:2:0): Retrying command (per sense data)
(noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660

----
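
As a reading aid, the CDBs in the log above can be decoded with a short
Python sketch. It is not part of any FreeBSD tool; the function names are
made up for illustration, and the field offsets assume the standard SBC
READ(10)/WRITE(10) layout and the SAT ATA PASS-THROUGH(16) layout.

```python
def parse_cdb(text):
    """Turn a hex dump such as '2a 00 4c ...' into a list of byte values."""
    return [int(b, 16) for b in text.split()]


def decode_rw10(cdb):
    """Decode READ(10)/WRITE(10): LBA in bytes 2-5, block count in bytes 7-8."""
    op = {0x28: "READ(10)", 0x2A: "WRITE(10)"}[cdb[0]]
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return op, lba, blocks


def decode_ata_pt16(cdb):
    """Decode selected ATA PASS-THROUGH(16) fields (assumed SAT layout)."""
    if cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,   # 6 would be DMA
        "extend": cdb[1] & 0x01,            # 1 = 48-bit command
        "features": (cdb[3] << 8) | cdb[4],
        "count": (cdb[5] << 8) | cdb[6],
        "command": cdb[14],                 # ATA command opcode
    }


if __name__ == "__main__":
    write = parse_cdb("2a 00 4c 15 1f 88 00 00 08 00")
    ata = parse_cdb("85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00")
    print(decode_rw10(write))
    print(decode_ata_pt16(ata))
```

For the first WRITE(10) this gives a count of 8 blocks, which matches the
logged "length 4096" at 512 bytes per block. The pass-through CDB decodes
to ATA command 0x06 with feature bit 0 set, which, if the layout assumed
here is right, would be DATA SET MANAGEMENT with TRIM.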

ZFS doesn't like this and sees read errors or even write errors. In
extreme cases the device is marked as FAULTED:

----

  pool: examplepool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	examplepool DEGRADED     0     0     0
	  raidz1-0  ONLINE       0     0     0
	    da3p1   ONLINE       0     0     0
	    da4p1   ONLINE       0     0     0
	    da5p1   ONLINE       0     0     0
	logs
	  da1p1     FAULTED      3     0     0  too many errors
	cache
	  da1p2     FAULTED      3     0     0  too many errors
	spares
	  da2p1     AVAIL   

errors: No known data errors

----
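
To check many pools or machines for this state, the config table of
'zpool status' can be scanned with a small Python sketch like the one
below. The function name and the set of "bad" states are illustrative
assumptions; the counters are kept as strings because zpool may abbreviate
large values.

```python
BAD_STATES = {"DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def unhealthy_devices(status_text):
    """Return (name, state, read, write, cksum) rows that are not healthy."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True        # the device table starts after this header
            continue
        if not in_config or len(fields) < 5:
            continue                # section labels, spares, blank lines
        if fields[1] in BAD_STATES:
            rows.append(tuple(fields[:5]))
    return rows


SAMPLE = """\
	NAME        STATE     READ WRITE CKSUM
	examplepool DEGRADED     0     0     0
	  raidz1-0  ONLINE       0     0     0
	    da3p1   ONLINE       0     0     0
	logs
	  da1p1     FAULTED      3     0     0  too many errors
	cache
	  da1p2     FAULTED      3     0     0  too many errors
	spares
	  da2p1     AVAIL
"""

if __name__ == "__main__":
    for row in unhealthy_devices(SAMPLE):
        print(row)
```

In practice the text would come from the output of 'zpool status' rather
than from the embedded sample.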

The problems arise on all 3 machines and all SSDs nearly daily, so I
strongly suspect a software issue. Does anyone have an idea what's going
on and what I can do to solve these problems? More information can be
provided if necessary.
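
To quantify "all SSDs, nearly daily", the timeouts can be counted per
controller and device with a Python sketch along these lines. It assumes
unwrapped kernel log lines (for example from /var/log/messages) in the
format shown above; the regular expression and function name are
illustrative only.

```python
import re
from collections import Counter

# Matches e.g. "(da11:mpr1:0:5:0): CAM status: Command timeout"
TIMEOUT_RE = re.compile(
    r"\((da\d+):(mpr\d+):\d+:\d+:\d+\): CAM status: Command timeout"
)


def count_timeouts(lines):
    """Count 'Command timeout' messages per (controller, device) pair."""
    counts = Counter()
    for line in lines:
        for dev, ctrl in TIMEOUT_RE.findall(line):
            counts[(ctrl, dev)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "(da11:mpr1:0:5:0): CAM status: Command timeout",
        "(da8:mpr1:0:2:0): CAM status: Command timeout",
        "(da8:mpr1:0:2:0): CAM status: SCSI Status Error",
    ]
    for (ctrl, dev), n in sorted(count_timeouts(sample).items()):
        print(ctrl, dev, n)
```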

Regards,
Yamagi

-- 
Homepage:  www.yamagi.org
XMPP:      yamagi@yamagi.org
GnuPG/GPG: 0xEFBCCBCB