From owner-freebsd-scsi@FreeBSD.ORG  Tue Oct 25 19:33:04 2011
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7454B106564A
	for <freebsd-scsi@freebsd.org>; Tue, 25 Oct 2011 19:33:04 +0000 (UTC)
	(envelope-from ken@kdm.org)
Received: from nargothrond.kdm.org (nargothrond.kdm.org [70.56.43.81])
	by mx1.freebsd.org (Postfix) with ESMTP id 3B0738FC08
	for <freebsd-scsi@freebsd.org>; Tue, 25 Oct 2011 19:33:03 +0000 (UTC)
Received: from nargothrond.kdm.org (localhost [127.0.0.1])
	by nargothrond.kdm.org (8.14.2/8.14.2) with ESMTP id p9PJX3LE037869;
	Tue, 25 Oct 2011 13:33:03 -0600 (MDT)
	(envelope-from ken@nargothrond.kdm.org)
Received: (from ken@localhost)
	by nargothrond.kdm.org (8.14.2/8.14.2/Submit) id p9PJX2s9037868;
	Tue, 25 Oct 2011 13:33:02 -0600 (MDT) (envelope-from ken)
Date: Tue, 25 Oct 2011 13:33:02 -0600
From: "Kenneth D. Merry" <ken@freebsd.org>
To: Karli Sj?berg <Karli.Sjoberg@slu.se>
Message-ID: <20111025193302.GA30409@nargothrond.kdm.org>
References: <82B38DBF-DD3A-46CD-93F6-02CDB6506E05@slu.se>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <82B38DBF-DD3A-46CD-93F6-02CDB6506E05@slu.se>
User-Agent: Mutt/1.4.2i
Cc: "freebsd-scsi@freebsd.org" <freebsd-scsi@freebsd.org>, fs@freebsd.org
Subject: Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Oct 2011 19:33:04 -0000

On Thu, Oct 20, 2011 at 13:28:17 +0200, Karli Sjöberg wrote:
> Hi,
> 
> I'm in the process of vacating a Sun/Oracle system to another Supermicro/FreeBSD system, doing zfs send/recv between them. Two times now, the system has panicked while not doing anything at all, and it's throwing a lot of SCSI/CAM-related errors while doing IO-intensive operations, like send/recv and resilver, and zpool has sometimes reported read/write errors on the hard drives. Best part is that the errors in messages are about all hard drives at one time or another, and they are connected with separate cables, controllers and caddies. Specs:
> 
> HW:
> 1x  Supermicro X8SIL-F
> 2x  Supermicro AOC-USAS2-L8i
> 2x  Supermicro CSE-M35T-1B
> 1x  Intel Core i5 650 3,2GHz
> 4x  2GB 1333MHZ DDR3 ECC UDIMM
> 10x SAMSUNG HD204UI (in a raidz2 zpool)
> 1x  OCZ Vertex 3 240GB (L2ARC)
> 
> SW:
> # uname -a
> FreeBSD server 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Oct 10 09:12:25 UTC 2011     root@server:/usr/obj/usr/src/sys/GENERIC  amd64
> # zpool get version pool1
> NAME   PROPERTY  VALUE    SOURCE
> pool1  version   28       default
> 
> I got the panic from the IPMI KVM:
> http://i55.tinypic.com/synpzk.png

Looking at the panic, this is a ZFS panic.  Nothing the disks do should
be able to cause ZFS to panic.  ZFS is panicking in avl_add():

	/*
	 * This is unfortunate.  We want to call panic() here, even for
	 * non-DEBUG kernels.  In userland, however, we can't depend on anything
	 * in libc or else the rtld build process gets confused.  So, all we can
	 * do in userland is resort to a normal ASSERT().
	 */
	if (avl_find(tree, new_node, &where) != NULL)
#ifdef _KERNEL
		panic("avl_find() succeeded inside avl_add()");
#else
		ASSERT(0);
#endif
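
To make that check concrete, here is a minimal sketch.  It is not the ZFS
code: toy_set is a hypothetical sorted array standing in for the AVL tree,
and every name in it is made up for illustration.  The point is only that
an add must never find the element already present; where this sketch
returns -1, the kernel's avl_add() calls panic().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy container: a sorted array standing in for the tree. */
#define TOY_MAX	16

struct toy_set {
	int	keys[TOY_MAX];
	size_t	count;
};

/*
 * Returns 1 if key is present, 0 otherwise.  Either way, *where is set
 * to the index where key is, or where it would have to be inserted.
 */
int
toy_find(const struct toy_set *set, int key, size_t *where)
{
	size_t i;

	assert(set != NULL && where != NULL);
	for (i = 0; i < set->count && set->keys[i] < key; i++)
		;
	*where = i;
	return (i < set->count && set->keys[i] == key);
}

/*
 * Returns 0 on success, -1 if key is already present (the case in which
 * avl_add() panics), and -2 if the toy array is full.
 */
int
toy_add(struct toy_set *set, int key)
{
	size_t where, i;

	assert(set != NULL);
	if (toy_find(set, key, &where))
		return (-1);
	if (set->count == TOY_MAX)
		return (-2);
	/* Shift the tail up by one slot and drop the key into place. */
	for (i = set->count; i > where; i--)
		set->keys[i] = set->keys[i - 1];
	set->keys[where] = key;
	set->count++;
	return (0);
}
```

Adding 5, then 3, then 5 again returns 0, 0 and -1.  In other words, the
panic means some caller tried to insert an element that the tree already
considered present.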

There are certainly command timeouts in the log below, along with two
groups of "terminated ioc" messages from the mps driver.  That does
suggest a hardware or driver problem, but it isn't obvious what it might be.
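
For anyone reading the "terminated ioc 804b" lines, here is a rough
sketch of how that value splits up, assuming the driver is printing the
raw IOCStatus field from the controller's reply.  The constant values are
as I recall them from the LSI MPI 2.0 headers; the shortened names and
the helper functions are mine, so check them against mpi2.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the local names are abbreviated, not the header's. */
#define IOCSTATUS_FLAG_LOG_INFO_AVAILABLE	0x8000
#define IOCSTATUS_MASK				0x7FFF
#define IOCSTATUS_SCSI_TASK_TERMINATED		0x0048
#define IOCSTATUS_SCSI_IOC_TERMINATED		0x004B

struct ioc_status {
	uint16_t	code;		/* status code with the flag bit removed */
	int		log_info;	/* 1 if the controller also supplied log info */
};

/* Split a raw 16-bit status value into its code and its flag bit. */
void
decode_ioc_status(uint16_t raw, struct ioc_status *out)
{
	assert(out != NULL);
	out->code = raw & IOCSTATUS_MASK;
	out->log_info = (raw & IOCSTATUS_FLAG_LOG_INFO_AVAILABLE) != 0;
}

/* Name the two codes this sketch knows about. */
const char *
ioc_status_name(uint16_t code)
{
	switch (code) {
	case IOCSTATUS_SCSI_TASK_TERMINATED:
		return ("SCSI task terminated");
	case IOCSTATUS_SCSI_IOC_TERMINATED:
		return ("SCSI IOC terminated");
	default:
		return ("unknown to this sketch");
	}
}
```

Under those assumptions, 0x804b decodes to code 0x004b (the command was
terminated by the controller) with the log-info flag set.  That is what
you would expect for commands that were outstanding when the aborts in
the log went out.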

I have seen bad behavior with SATA drives behind 3Gb Maxim expanders
talking to 6Gb LSI controllers, but your configuration does not involve
any expanders, so this is not that particular STP issue.

My best guess, and it is a guess, is that either the drives are misbehaving
(i.e. a firmware problem) or you've got a cabling issue.

If you have more hardware available, you might try swapping out the cables
and/or drives to see if you can reproduce the drive errors with a
different setup.  If you swap the drives, I would use a different brand if
you've got them available.

I'm CCing the fs list; perhaps someone there can look at the stack trace
in the screenshot above and figure out what ZFS might be doing.

Again, ZFS should survive any errors from the drives, and the panic above
looks like ZFS is flagging a logic bug somewhere.

> 
> And an extract from /var/log/messages:
> Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): WRITE(10). CDB: 2a 0 6 13 66 f 0 0 f 0 
> Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): CAM status: SCSI Status Error
> Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI status: Check Condition
> Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): WRITE(6). CDB: a 0 1 b2 2 0 
> Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): CAM status: SCSI Status Error
> Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI status: Check Condition
> Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 859
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 495
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 725
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 722
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 438
> Oct 19 17:40:38 fs2-7 kernel: mps1: (1:4:0) terminated ioc 804b scsi 0 state c xfer 0
> Oct 19 17:40:38 fs2-7 last message repeated 3 times
> Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 859 complete
> Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 495
> Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 495 complete
> Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 725
> Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 725 complete
> Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 722
> Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 722 complete
> Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 438
> Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 438 complete
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): WRITE(10). CDB: 2a 0 6 25 4f 75 0 0 b 0 
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): CAM status: SCSI Status Error
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI status: Check Condition
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): WRITE(10). CDB: 2a 0 2d a5 10 ca 0 0 80 0 
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): CAM status: SCSI Status Error
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI status: Check Condition
> Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Oct 19 17:45:40 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
> Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 636
> Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 888
> Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 983
> Oct 19 17:45:41 fs2-7 kernel: mps0: (0:1:0) terminated ioc 804b scsi 0 state c xfer 0
> Oct 19 17:45:41 fs2-7 last message repeated 2 times
> Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 976 complete
> Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 636
> Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 636 complete
> Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 888
> Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 888 complete
> Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 983
> Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 983 complete
> Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): WRITE(10). CDB: 2a 0 6 40 a7 2 0 0 3 0 
> Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): CAM status: SCSI Status Error
> Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI status: Check Condition
> Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): WRITE(10). CDB: 2a 0 6 40 b0 9 0 0 9 0 
> Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): CAM status: SCSI Status Error
> Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): SCSI status: Check Condition
> Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> 
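
In case it helps when comparing the failed writes across drives, here is
a small sketch that pulls the starting LBA and block count out of a
WRITE(10) CDB like the ones logged above.  It assumes the standard
10-byte layout (opcode 0x2a, big-endian LBA in bytes 2-5, transfer length
in bytes 7-8) and that the log prints each byte in hex without leading
zeros.  The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WRITE_10_OPCODE	0x2a

struct write10 {
	uint32_t	lba;	/* starting logical block address */
	uint16_t	blocks;	/* transfer length in blocks */
};

/* Returns 0 on success, -1 if the buffer is not a WRITE(10) CDB. */
int
decode_write10(const uint8_t *cdb, size_t len, struct write10 *out)
{
	assert(cdb != NULL && out != NULL);
	if (len < 10 || cdb[0] != WRITE_10_OPCODE)
		return (-1);
	/* Both fields are stored big-endian in the CDB. */
	out->lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
	    ((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
	out->blocks = (uint16_t)(((unsigned)cdb[7] << 8) | cdb[8]);
	return (0);
}
```

Fed the first logged CDB (2a 0 6 13 66 f 0 0 f 0), this gives LBA
0x0613660f and a length of 15 blocks.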
> What's going on?
> 
> Regards
> Karli Sjöberg

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG