From owner-freebsd-scsi@FreeBSD.ORG  Tue Apr 23 14:18:48 2013
Date: Tue, 23 Apr 2013 08:02:37 -0600
From: "Kenneth D. Merry" <ken@freebsd.org>
To: Alexander Motin <mav@freebsd.org>
Subject: Re: Repeated msgs & kernel panic w/ r246437 (Revamp the CAM enclosure
 services driver)
Message-ID: <20130423140237.GA50775@nargothrond.kdm.org>
References: <20130422030053.GA23186@FreeBSD.org> <517641C6.7010905@FreeBSD.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <517641C6.7010905@FreeBSD.org>
User-Agent: Mutt/1.4.2i
Cc: John <jwd@freebsd.org>, FreeBSD SCSI <freebsd-scsi@freebsd.org>

On Tue, Apr 23, 2013 at 11:09:42 +0300, Alexander Motin wrote:
> On 22.04.2013 06:00, John wrote:
> >Hi Folks,
> >
> >    After updating one of our servers to the latest stable image,
> >commit r246437 appears to be causing it to panic.
> >
> >The commit:
> >
> >http://svnweb.freebsd.org/base?view=revision&revision=246437
> >
> >What one of our servers looks like:
> >
> >http://people.freebsd.org/~jwd/zfsnfsserver.jpg
> >
> >The last known working commit:
> >
> >http://people.freebsd.org/~jwd/r246437/dmesg.r246431.clean.txt
> >
> >With commit r246437:
> >
> >http://people.freebsd.org/~jwd/r246437/dmesg.r246437.log.txt
> >
> >Note, most of the dmesg output is related to the ses devices. It
> >repeats itself multiple times before the panic.
> >
> >ses39: ses0,pass20: Element descriptor: '            '
> >ses39: ses0,pass20: SAS Expander: 24 Physses39:  phy 0: connector 255 other 255
> >ses39:  phy 1: connector 255 other 255
> >ses39:  phy 2: connector 255 other 255
> >ses39:  phy 3: connector 255 other 255
> >ses39:  phy 4: connector 255 other 255
> >ses39:  phy 5: connector 255 other 255
> >ses39:  phy 6: connector 255 other 255
> >
> >etc, etc...
> 
> That is not my part of the code, but I think these are just overly 
> verbose debug messages that should be hidden.

Yes, it is probably too verbose, especially on such a large system.
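
Until the messages are quieted, a saved dmesg log can be condensed before
reading it. The following is only a rough sketch: it assumes the per-phy
lines always have the "sesN:  phy N: connector N other N" shape quoted
above, and the names PHY_LINE and condense() are made up for this example.

```python
import re
from collections import Counter

# Assumed shape of the per-phy lines, taken from the quoted dmesg output,
# e.g. "ses39:  phy 1: connector 255 other 255".
PHY_LINE = re.compile(r"^(ses\d+):\s+phy \d+: connector \d+ other \d+\s*$")


def condense(lines):
    """Drop per-phy lines and count how many were dropped per ses device.

    Returns a (kept_lines, phy_counts) tuple. Lines that do not match the
    assumed per-phy shape are kept unchanged and in their original order.
    """
    kept = []
    counts = Counter()
    for line in lines:
        match = PHY_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
        else:
            kept.append(line)
    return kept, counts


# Small sample built from the lines quoted in the report.
sample = [
    "ses39: ses0,pass20: Element descriptor: '            '",
    "ses39:  phy 1: connector 255 other 255",
    "ses39:  phy 2: connector 255 other 255",
]
kept, counts = condense(sample)
for line in kept:
    print(line)
for dev, n in sorted(counts.items()):
    print(f"{dev}: {n} per-phy lines suppressed")
```

Lines where two messages ran together (such as the "24 Physses39:" line in
the report) do not match the pattern and are kept as they are.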

> >After just a few minutes, the system panics. A pair of images
> >of the screen (sorry, no serial console at this time):
> >
> >Panic: http://people.freebsd.org/~jwd/r246437/20130419_160143.jpg
> >
> >bt: http://people.freebsd.org/~jwd/r246437/20130419_110158.jpg
> 
> Although you describe this as the "latest stable image", I believe your 
> kernel is not the latest 9-STABLE. Your backtrace reminds me of locking 
> problems that should already be fixed from several sides. For example, 
> on present 9-STABLE ses_path_iter_devid_callback() doesn't call 
> xpt_create_path(), but calls xpt_create_path_unlocked() instead. If you 
> can reproduce the issue with the latest 9-STABLE, please provide the 
> corresponding information.

I agree.  I added the xpt_create_path_unlocked() call to fix a panic with a
stack trace just like the one above.  The problem looks like the result of
running exactly r246437, without that later fix.
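
For readers unfamiliar with the locked/unlocked split, here is a small
user-space sketch of the general pattern. It is not the CAM code: Bus,
create_path() and create_path_unlocked() are hypothetical stand-ins, and it
assumes (as the name suggests) that the "_unlocked" variant is meant for
callers that do not already hold the lock and takes it itself.

```python
import threading


class Bus:
    """Hypothetical stand-in for an object protected by one non-recursive lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.owner = None  # ident of the thread holding the lock, if any
        self.paths = []

    def acquire(self):
        self.lock.acquire()
        self.owner = threading.get_ident()

    def release(self):
        self.owner = None
        self.lock.release()


def create_path(bus, name):
    """Variant that requires the caller to already hold bus.lock."""
    if bus.owner != threading.get_ident():
        # A kernel would typically trip a lock assertion here instead.
        raise RuntimeError("create_path called without the bus lock held")
    bus.paths.append(name)
    return name


def create_path_unlocked(bus, name):
    """Variant for callers that do not hold bus.lock; it takes the lock itself."""
    bus.acquire()
    try:
        return create_path(bus, name)
    finally:
        bus.release()
```

In this sketch, a callback that runs without the lock held has to use the
"_unlocked" variant; calling the plain one from that context fails the
ownership check.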

> >We are currently running a test to see whether the fact that all our
> >shelves are dual-attached, allowing us to use geom multipath, is
> >related. I.e., we have disabled the 2nd HBA, cutting the total
> >number of da & ses devices in half and thus not executing the
> >code in the commit that tracks duplicate ses devices.
> >
> >Note, if we disable both HBA devices and boot the system up it
> >does not panic or print out the repeated messages, but of course
> >we have no disks :-)
> >
> >I am unclear on the "connector 255 other 255" messages and have not
> >taken the time to look into them yet.
> >
> >I would appreciate any insights folks can provide.
> >
> >Many Thanks,
> >John
> >
> >ps: We've had to seriously increase the console buffer size to
> >capture the complete dmesg output...
> >
> >options   MSGBUF_SIZE=(32768*32)
> >
> >Can we delay starting the kernel daemon until after the system
> >is up and /var/log/messages is available?  Just a thought...
> 
> The goal of this code was to create persistent location-dependent names 
> for devices. It may be better to have them earlier.

Yes, I agree.

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG