From owner-freebsd-current@FreeBSD.ORG  Tue Sep  6 23:23:09 2011
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BF4281065673
	for <freebsd-current@freebsd.org>; Tue,  6 Sep 2011 23:23:09 +0000 (UTC)
	(envelope-from tjg@soe.ucsc.edu)
Received: from mail-01.cse.ucsc.edu (mail-01.cse.ucsc.edu [128.114.48.32])
	by mx1.freebsd.org (Postfix) with ESMTP id 73BA88FC0A
	for <freebsd-current@freebsd.org>; Tue,  6 Sep 2011 23:23:09 +0000 (UTC)
Received: from mail-01.cse.ucsc.edu (mail-01.cse.ucsc.edu [128.114.48.32])
	by mail-01.cse.ucsc.edu (Postfix) with ESMTP id BC355774C005
	for <freebsd-current@freebsd.org>; Tue,  6 Sep 2011 16:04:26 -0700 (PDT)
Date: Tue, 6 Sep 2011 16:04:26 -0700 (PDT)
From: Tim Gustafson <tjg@soe.ucsc.edu>
To: freebsd-current@freebsd.org
Message-ID: <1922360058.114440.1315350266688.JavaMail.root@mail-01.cse.ucsc.edu>
In-Reply-To: <2050180973.114414.1315349796607.JavaMail.root@mail-01.cse.ucsc.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [128.114.49.22]
X-Mailer: Zimbra 6.0.9_GA_2686 (ZimbraWebClient - FF3.0
	([unknown])/6.0.9_GA_2686)
Subject: RELENG_8 / mpt / zpool Errors
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 06 Sep 2011 23:23:09 -0000

Hi all,

I'm running RELENG_8:

----------
root@bsd-03: uname -a
FreeBSD bsd-03 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Aug 22 14:58:58 PDT 2011     root@bsd-03:/usr/obj/usr/src/sys/GENERIC  amd64
----------

We've got an MPT controller installed with 32 drives attached:

----------
root@bsd-03: dmesg | grep mpt
mpt0: <LSILogic SAS/SATA Adapter> port 0xec00-0xecff mem 0xef3fc000-0xef3fffff,0xef3e0000-0xef3effff irq 32 at device 0.0 on pci3
mpt0: [ITHREAD]
mpt0: MPI Version=1.5.19.0
ses0 at mpt0 bus 0 scbus1 target 32 lun 0
ses1 at mpt0 bus 0 scbus1 target 33 lun 0
da5 at mpt0 bus 0 scbus1 target 0 lun 0
.....SNIP.....
da36 at mpt0 bus 0 scbus1 target 31 lun 0
----------

We have a zpool on those drives (four 8-disk raidz1 vdevs), configured as one large ZFS file system:

----------
root@bsd-03: zpool status
  pool: jails
 state: ONLINE
 scan: resilvered 5.51M in 0h12m with 0 errors on Tue Sep  6 15:10:23 2011
config:

	NAME        STATE     READ WRITE CKSUM
	jails       ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	    da8     ONLINE       0     0     0
	    da9     ONLINE       0     0     0
	    da10    ONLINE       0     0     0
	    da11    ONLINE       0     0     0
	    da12    ONLINE       0     0     0
	  raidz1-1  ONLINE       0     0     0
	    da13    ONLINE       0     0     0
	    da14    ONLINE       0     0     0
	    da15    ONLINE       0     0     0
	    da16    ONLINE       0     0     0
	    da17    ONLINE       0     0     0
	    da18    ONLINE       0     0     0
	    da19    ONLINE       0     0     0
	    da20    ONLINE       0     0     0
	  raidz1-2  ONLINE       0     0     0
	    da21    ONLINE       0     0     0
	    da22    ONLINE       0     0     0
	    da23    ONLINE       0     0     0
	    da24    ONLINE       0     0     0
	    da25    ONLINE       0     0     0
	    da26    ONLINE       0     0     0
	    da27    ONLINE       0     0     0
	    da28    ONLINE       0     0     0
	  raidz1-3  ONLINE       0     0     0
	    da29    ONLINE       0     0     0
	    da30    ONLINE       0     0     0
	    da31    ONLINE       0     0     0
	    da32    ONLINE       0     0     0
	    da33    ONLINE       0     0     0
	    da34    ONLINE       0     0     0
	    da35    ONLINE       0     0     0
	    da36    ONLINE       0     0     0

errors: No known data errors
----------

We're seeing some occasional oddness.  About every two weeks, the controller seems to temporarily lose connectivity with the drives, and the zpool goes a bit bonkers and reports a dozen or so corrupted files.  A "zpool scrub" then runs to completion and reports that everything has been fixed, and everything seems OK again (I have not yet 100% confirmed that there is no file corruption, but I'm giving ZFS's checksumming logic the benefit of the doubt here).

When we have problems, they tend to be accompanied by the following in dmesg:

----------
(da20:mpt0:0:15:0): READ(10). CDB: 28 0 90 b0 6b dd 0 0 9 0 
(da20:mpt0:0:15:0): CAM status: SCSI Status Error
(da20:mpt0:0:15:0): SCSI status: Check Condition
(da20:mpt0:0:15:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da17:mpt0:0:12:0): READ(10). CDB: 28 0 90 b0 6c e 0 0 2 0 
(da17:mpt0:0:12:0): CAM status: SCSI Status Error
(da17:mpt0:0:12:0): SCSI status: Check Condition
(da17:mpt0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
mpt0: request 0xffffff800080b520:10990 timed out for ccb 0xffffff013227b000 (req->ccb 0xffffff013227b000)
mpt0: attempting to abort req 0xffffff800080b520:10990 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x0
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff800080b520:10990
mpt0: mpt_cam_event: 0x1b
mpt0: mpt_cam_event: 0x1b
mpt0: SAS discovery error: Port: 0x00 Status: 0x00004002
mpt0: SAS discovery error: Port: 0x00 Status: 0x00000010
mpt0: request 0xffffff8000811310:54341 timed out for ccb 0xffffff000897a000 (req->ccb 0xffffff000897a000)
mpt0: attempting to abort req 0xffffff8000811310:54341 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff8000811310:54341
mpt0: mpt_cam_event: 0x1b
mpt0: mpt_cam_event: 0x1b
----------
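
For reference, the CDB bytes in the two READ(10) lines above can be decoded by hand.  Here is a minimal Python sketch, assuming the standard 10-byte READ(10) layout (opcode 0x28, a flags byte, a 4-byte big-endian LBA, a group byte, a 2-byte big-endian transfer length, and a control byte) and assuming dmesg prints each byte in hex without leading zeros.  The function name decode_read10 is made up for illustration.

```python
def decode_read10(cdb_text):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Assumes the standard 10-byte layout:
    byte 0 opcode, byte 1 flags, bytes 2-5 LBA, byte 6 group,
    bytes 7-8 transfer length, byte 9 control.
    """
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return lba, blocks


# The two CDBs from the dmesg output above.
for name, text in (("da20", "28 0 90 b0 6b dd 0 0 9 0"),
                   ("da17", "28 0 90 b0 6c e 0 0 2 0")):
    lba, blocks = decode_read10(text)
    print(f"{name}: LBA {lba:#x}, {blocks} blocks")
```

Under those assumptions, the two failed reads start at LBAs 0x90b06bdd and 0x90b06c0e, which are only 0x31 blocks apart.  Both disks are in raidz1-1, so this may simply be one logical read spread across that vdev, but that is a guess on my part.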

So, is this an OS/driver issue?  Is it a bad controller?  Bad cables?  Bad disks?
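
One thing that might help narrow this down is a tally of which devices actually log sense errors across incidents.  Below is a rough Python sketch, assuming the dmesg line format shown above.  The regular expression and the function name tally_sense_errors are my own, not anything provided by FreeBSD.

```python
import re
from collections import Counter

# Matches lines such as:
# (da20:mpt0:0:15:0): SCSI sense: UNIT ATTENTION asc:29,0 (...)
SENSE_RE = re.compile(
    r"^\((da\d+):mpt\d+:\d+:\d+:\d+\): SCSI sense: (.+?) asc:"
)


def tally_sense_errors(lines):
    """Count (device, sense key) pairs found in dmesg-style lines."""
    counts = Counter()
    for line in lines:
        match = SENSE_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample input copied from the dmesg output above.
sample = [
    "(da20:mpt0:0:15:0): SCSI status: Check Condition",
    "(da20:mpt0:0:15:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)",
    "(da17:mpt0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)",
    "mpt0: mpt_wait_req(1) timed out",
]

for (device, sense), count in sorted(tally_sense_errors(sample).items()):
    print(f"{device}: {sense} x{count}")
```

If the same one or two disks turned up every time, I would lean toward suspecting those disks or their cabling; if the errors were spread across many disks, the controller or driver would look more likely.  That is only a heuristic, though.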

As always, any help is greatly appreciated.  Thanks!

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Tim Gustafson                                                tjg@soe.ucsc.edu
Baskin School of Engineering                                     831-459-5354
UC Santa Cruz                                         Baskin Engineering 317B
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-