From owner-svn-src-all@FreeBSD.ORG Thu Feb 18 14:08:30 2010
Message-ID: <4B7D4962.8070706@freebsd.org>
Date: Fri, 19 Feb 2010 01:06:26 +1100
From: Lawrence Stewart
To: Alexander Motin
Cc: svn-src-stable@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, svn-src-stable-8@FreeBSD.org
References: <201002141938.o1EJcRpx065470@svn.freebsd.org>
In-Reply-To: <201002141938.o1EJcRpx065470@svn.freebsd.org>
Subject: Re: svn commit: r203889 - in stable/8/sys: cam cam/ata cam/scsi dev/ahci dev/asr dev/ata dev/ciss dev/hptiop dev/hptrr dev/mly dev/mpt dev/ppbus dev/siis dev/trm dev/twa dev/usb/storage

Hi Alexander and all,

On 02/15/10 06:38, Alexander Motin wrote:
> Author: mav
> Date: Sun Feb 14 19:38:27 2010
> New Revision: 203889
> URL: http://svn.freebsd.org/changeset/base/203889
>
> Log:
>   MFC r203108:
>   Large set of CAM improvements:

[snip]

I've been having issues with the mpt-driven
LSI SAS adapter in my SunFire X4100 server running FreeBSD 8-STABLE
r202132. Under certain disk workloads, such as an svn update of the src
tree or a kernel compile, the disk subsystem becomes extremely
unresponsive, as if stalled, and /var/log/messages reports a number of
these:

mpt0: mpt_cam_event: 0x16

It does eventually come good after a minute or two, even though the svn
operation or build is still running. The stalled/good cycle may then
repeat a few times, sometimes with minutes between events.

A couple of times it has gotten even more upset, reporting things like
this:

mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x16
mpt0: request 0xffffff80002f1400:54058 timed out for ccb 0xffffff0001c65000 (req->ccb 0xffffff0001c65000)
mpt0: attempting to abort req 0xffffff80002f1400:54058 function 0
mpt0: request 0xffffff80002fd100:54059 timed out for ccb 0xffffff009f3ec800 (req->ccb 0xffffff009f3ec800)
mpt0: request 0xffffff80002efcf0:54060 timed out for ccb 0xffffff0001bd2000 (req->ccb 0xffffff0001bd2000)
mpt0: mpt_recover_commands: IOC Status 0x4a. Resetting controller.
mpt0: mpt_cam_event: 0x0
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff80002f1400:54058
mpt0: completing timedout/aborted req 0xffffff80002fd100:54059
mpt0: completing timedout/aborted req 0xffffff80002efcf0:54060
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16
mpt0: Volume(0:2): Volume Status Changed
mpt0: request 0xffffff80002f8990:0 timed out for ccb 0xffffff009f3cb800 (req->ccb 0)

No ill effects are observed after such an episode and the array remains
in a healthy, normal state. The only observable problem is the stall of
all disk IO while these events occur.

The disk configuration is 2 x 320GB WD3200BEKT 7200RPM SATA HDDs in
RAID1.
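
To see how often each event code shows up and which requests timed out, the messages above can be tallied with a short script. The following is only a rough sketch: the regular expressions are assumptions based on the log lines quoted in this message, the names `summarize`, `EVENT_RE`, `TIMEOUT_RE` and `SAMPLE` are made up for illustration, and it does not try to interpret what the event codes mean.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted above (an assumption, not
# taken from the mpt driver source). search() is used so that a syslog
# timestamp/host prefix in front of "mpt0:" does not matter.
EVENT_RE = re.compile(r"(mpt\d+): mpt_cam_event: (0x[0-9a-fA-F]+)")
TIMEOUT_RE = re.compile(r"(mpt\d+): request (0x[0-9a-fA-F]+):(\d+) timed out")


def summarize(lines):
    """Count mpt_cam_event codes and collect timed-out requests.

    Returns (events, timeouts): events is a Counter keyed by
    (controller, event code); timeouts is a list of
    (controller, request address, serial number) tuples.
    """
    events = Counter()
    timeouts = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            events[(m.group(1), m.group(2).lower())] += 1
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts.append((m.group(1), m.group(2), int(m.group(3))))
    return events, timeouts


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "mpt0: mpt_cam_event: 0x16",
    "mpt0: mpt_cam_event: 0x16",
    "mpt0: request 0xffffff80002f1400:54058 timed out for ccb "
    "0xffffff0001c65000 (req->ccb 0xffffff0001c65000)",
    "mpt0: mpt_recover_commands: IOC Status 0x4a. Resetting controller.",
    "mpt0: mpt_cam_event: 0x0",
]

if __name__ == "__main__":
    events, timeouts = summarize(SAMPLE)
    for (ctrl, code), count in sorted(events.items()):
        print(f"{ctrl} event {code}: {count}")
    for ctrl, req, serial in timeouts:
        print(f"{ctrl} timed-out request {req} (serial {serial})")
```

Because `summarize` accepts any iterable of lines, an open file handle on /var/log/messages can be passed in place of `SAMPLE`.
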
The hardware reports itself as:

mpt0: port 0xa800-0xa8ff mem 0xfc4fc000-0xfc4fffff,0xfc4e0000-0xfc4effff irq 28 at device 3.0 on pci2
mpt0: [ITHREAD]
mpt0: MPI Version=1.5.13.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 1 Active Volume (2 Max)
mpt0: 2 Hidden Drive Members (10 Max)

mpt0@pci0:2:3:0: class=0x010000 card=0x30601000 chip=0x00501000 rev=0x02 hdr=0x00
    vendor     = 'LSI Logic (Was: Symbios Logic, NCR)'
    device     = 'SAS 3000 series, 4-port with 1064 -StorPort'
    class      = mass storage
    subclass   = SCSI

As best I can tell, the hardware is OK: both disks report as fine
without SMART errors and are only 2 months old, so I wanted to rule out
software issues.

On upgrading to recent 8-STABLE, I got a page fault kernel panic on boot
in the mpt driver's mpt_raid0 kproc. After some trial and error, r203888
is the most recent revision that boots fine, whilst r203889 exhibits the
page fault. I should also note that r203888 still sees the
"mpt0: mpt_cam_event: 0x16" messages and associated disk IO stalls.

I compiled DDB into my r203889 kernel. Unfortunately my ILO emulates a
USB keyboard, so I can't do anything in DDB, which is a huge pain, but
here's the info I did get (hand transcribed):

Fatal trap 12: page fault while in kernel mode
current process: mpt_raid0
Stopped at xpt_rescan+0x1d: movq 0x10(%rsi),%rdx

So there are two separate issues here:

1. Any thoughts on how to resolve the regression in the mpt driver with
the r203889 commit?

2. Any thoughts on the behaviour I'm seeing with the mpt_cam_event
messages? Is it possible it's just a driver issue? Is the hardware
likely bad? I'm really hoping they'll go away once the driver issue is
resolved, as the freezes are fairly unacceptable on a production machine
and the hardware appears to pass all checks I've done so far.

Cheers,
Lawrence