Date: Tue, 6 Sep 2011 16:04:26 -0700 (PDT)
From: Tim Gustafson
To: freebsd-current@freebsd.org
Subject: RELENG_8 / mpt / zpool Errors

Hi all,

I'm running RELENG_8:

----------
root@bsd-03: uname -a
FreeBSD bsd-03 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Aug 22 14:58:58 PDT 2011 root@bsd-03:/usr/obj/usr/src/sys/GENERIC amd64
----------

We've got an MPT controller installed with 32 drives attached:

----------
root@bsd-03: dmesg | grep mpt
mpt0: port 0xec00-0xecff mem 0xef3fc000-0xef3fffff,0xef3e0000-0xef3effff irq 32 at device 0.0 on pci3
mpt0: [ITHREAD]
mpt0: MPI Version=1.5.19.0
ses0 at mpt0 bus 0 scbus1 target 32 lun 0
ses1 at mpt0 bus 0 scbus1 target 33 lun 0
da5 at mpt0 bus 0 scbus1 target 0 lun 0
.....SNIP.....
da36 at mpt0 bus 0 scbus1 target 31 lun 0
----------

We have a zpool on those drives configured into one large zfs file system:

----------
root@bsd-03: zpool status
  pool: jails
 state: ONLINE
  scan: resilvered 5.51M in 0h12m with 0 errors on Tue Sep 6 15:10:23 2011
config:

        NAME        STATE     READ WRITE CKSUM
        jails       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da10    ONLINE       0     0     0
            da11    ONLINE       0     0     0
            da12    ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            da13    ONLINE       0     0     0
            da14    ONLINE       0     0     0
            da15    ONLINE       0     0     0
            da16    ONLINE       0     0     0
            da17    ONLINE       0     0     0
            da18    ONLINE       0     0     0
            da19    ONLINE       0     0     0
            da20    ONLINE       0     0     0
          raidz1-2  ONLINE       0     0     0
            da21    ONLINE       0     0     0
            da22    ONLINE       0     0     0
            da23    ONLINE       0     0     0
            da24    ONLINE       0     0     0
            da25    ONLINE       0     0     0
            da26    ONLINE       0     0     0
            da27    ONLINE       0     0     0
            da28    ONLINE       0     0     0
          raidz1-3  ONLINE       0     0     0
            da29    ONLINE       0     0     0
            da30    ONLINE       0     0     0
            da31    ONLINE       0     0     0
            da32    ONLINE       0     0     0
            da33    ONLINE       0     0     0
            da34    ONLINE       0     0     0
            da35    ONLINE       0     0     0
            da36    ONLINE       0     0     0

errors: No known data errors
----------

We're seeing some occasional oddness. About every two weeks, it seems, the controller temporarily loses connectivity with the drives, and the zpool goes a bit bonkers and reports a dozen or so corrupted files. A "zpool scrub" goes through and reports that everything's been fixed, and everything seems OK again (I have not yet 100% confirmed that there is no file corruption, but I'm giving ZFS's check-summing logic the benefit of the doubt here).

When we have problems, they tend to be accompanied by the following in my dmesg:

----------
(da20:mpt0:0:15:0): READ(10). CDB: 28 0 90 b0 6b dd 0 0 9 0
(da20:mpt0:0:15:0): CAM status: SCSI Status Error
(da20:mpt0:0:15:0): SCSI status: Check Condition
(da20:mpt0:0:15:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da17:mpt0:0:12:0): READ(10). CDB: 28 0 90 b0 6c e 0 0 2 0
(da17:mpt0:0:12:0): CAM status: SCSI Status Error
(da17:mpt0:0:12:0): SCSI status: Check Condition
(da17:mpt0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
mpt0: request 0xffffff800080b520:10990 timed out for ccb 0xffffff013227b000 (req->ccb 0xffffff013227b000)
mpt0: attempting to abort req 0xffffff800080b520:10990 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x0
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff800080b520:10990
mpt0: mpt_cam_event: 0x1b
mpt0: mpt_cam_event: 0x1b
mpt0: SAS discovery error: Port: 0x00 Status: 0x00004002
mpt0: SAS discovery error: Port: 0x00 Status: 0x00000010
mpt0: request 0xffffff8000811310:54341 timed out for ccb 0xffffff000897a000 (req->ccb 0xffffff000897a000)
mpt0: attempting to abort req 0xffffff8000811310:54341 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff8000811310:54341
mpt0: mpt_cam_event: 0x1b
mpt0: mpt_cam_event: 0x1b
----------

So, is this an OS/driver issue? Is it a bad controller? Bad cables? Bad disks?

As always, any help is greatly appreciated. Thanks!

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Tim Gustafson                                                tjg@soe.ucsc.edu
Baskin School of Engineering                                     831-459-5354
UC Santa Cruz                                         Baskin Engineering 317B
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-