Date: Tue, 7 Jul 2015 13:24:16 +0200
From: Yamagi Burmeister
To: freebsd-scsi@freebsd.org
Subject: Device timeouts(?) with LSI SAS3008 on mpr(4)
Message-Id: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org>

Hello,

I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each
adapter serves 8 Intel DC S3700 SSDs. The operating system is 10.1-STABLE
as of r283938 on 2 servers and r285196 on the last one.

The controllers identify themselves as:

----
mpr0: port 0x6000-0x60ff mem 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
mpr0: IOCFacts :
        MsgVersion: 0x205
        HeaderVersion: 0x2300
        IOCNumber: 0
        IOCExceptions: 0x0
        MaxChainDepth: 128
        NumberOfPorts: 1
        RequestCredit: 10240
        ProductID: 0x2221
        IOCRequestFrameSize: 32
        MaxInitiators: 32
        MaxTargets: 1024
        MaxSasExpanders: 42
        MaxEnclosures: 43
        HighPriorityCredit: 128
        MaxReplyDescriptorPostQueueDepth: 65504
        ReplyFrameSize: 32
        MaxVolumes: 0
        MaxDevHandle: 1106
        MaxPersistentEntries: 128
mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
mpr0: IOCCapabilities: 7a85c
----

08.00.00.00 is the latest available firmware.

Since day one, 'dmesg' has been cluttered with CAM errors:

----
mpr1: Sending reset from mprsas_send_abort for target ID 5
(da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
(da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00 minated ioc 804b scsi 0 state c xfer 0
(da11:mpr1:0:5:0): CAM status: Command timeout
mpr1: (da11:Unfreezing devq for target ID 5 mpr1:0:5:0): Retrying command
(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
(da11:mpr1:0:5:0): CAM status: SCSI Status Error
(da11:mpr1:0:5:0): SCSI status: Check Condition
(da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da11:mpr1:0:5:0): Retrying command (per sense data)
(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00
(da11:mpr1:0:5:0): CAM status: SCSI Status Error
(da11:mpr1:0:5:0): SCSI status: Check Condition
(da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da11:mpr1:0:5:0): Retrying command (per sense data)
(noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
mpr1: Sending reset from mprsas_send_abort for target ID 2
(da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
(da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
mpr1: Unfreezing devq for target ID 2
(da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
(da8:mpr1:0:2:0): CAM status: Command timeout
(da8:mpr1:0:2:0): Retrying command
(da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
(da8:mpr1:0:2:0): CAM status: SCSI Status Error
(da8:mpr1:0:2:0): SCSI status: Check Condition
(da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da8:mpr1:0:2:0): Retrying command (per sense data)
(da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
(da8:mpr1:0:2:0): CAM status: SCSI Status Error
(da8:mpr1:0:2:0): SCSI status: Check Condition
(da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da8:mpr1:0:2:0): Retrying command (per sense data)
(noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
----

ZFS doesn't like this and sees read errors or even write errors. In
extreme cases the device is marked as FAULTED:

----
  pool: examplepool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        examplepool   DEGRADED     0     0     0
          raidz1-0    ONLINE       0     0     0
            da3p1     ONLINE       0     0     0
            da4p1     ONLINE       0     0     0
            da5p1     ONLINE       0     0     0
        logs
          da1p1       FAULTED      3     0     0  too many errors
        cache
          da1p2       FAULTED      3     0     0  too many errors
        spares
          da2p1       AVAIL

errors: No known data errors
----

The problems arise on all 3 machines and on all SSDs nearly daily, so I
strongly suspect a software issue. Does anyone have an idea what's going
on and what I can do to solve these problems? More information can be
provided if necessary.

Regards,
Yamagi

-- 
Homepage:  www.yamagi.org
XMPP:      yamagi@yamagi.org
GnuPG/GPG: 0xEFBCCBCB