From: "Kenneth D. Merry"
Date: Tue, 23 Apr 2013 08:02:37 -0600
To: Alexander Motin
Cc: John, FreeBSD SCSI
Subject: Re: Repeated msgs & kernel panic w/ r246437 (Revamp the CAM enclosure services driver)
Message-ID: <20130423140237.GA50775@nargothrond.kdm.org>
References: <20130422030053.GA23186@FreeBSD.org> <517641C6.7010905@FreeBSD.org>
In-Reply-To: <517641C6.7010905@FreeBSD.org>

On Tue, Apr 23, 2013 at 11:09:42 +0300, Alexander Motin wrote:
> On 22.04.2013 06:00, John wrote:
> >Hi Folks,
> >
> >  After updating one of our servers to the latest stable image,
> >it appears that commit r246437 appears to be causing it to panic.
> >
> >The commit:
> >
> >http://svnweb.freebsd.org/base?view=revision&revision=246437
> >
> >What one of our servers looks like:
> >
> >http://people.freebsd.org/~jwd/zfsnfsserver.jpg
> >
> >The last known working commit:
> >
> >http://people.freebsd.org/~jwd/r246437/dmesg.r246431.clean.txt
> >
> >With commit r246437:
> >
> >http://people.freebsd.org/~jwd/r246437/dmesg.r246437.log.txt
> >
> >Note, most of the dmesg output is related to the ses devices. It
> >repeats itself multiple times before the panic.
> >
> >ses39: ses0,pass20: Element descriptor: ' '
> >ses39: ses0,pass20: SAS Expander: 24 Physses39: phy 0: connector 255
> >other 255
> >ses39: phy 1: connector 255 other 255
> >ses39: phy 2: connector 255 other 255
> >ses39: phy 3: connector 255 other 255
> >ses39: phy 4: connector 255 other 255
> >ses39: phy 5: connector 255 other 255
> >ses39: phy 6: connector 255 other 255
> >
> >etc, etc...
>
> That is not my part of code, but I think it is just too verbose debug
> messages, that should be hidden.

Yes, it is probably too verbose, especially on such a large system.

> >After just a few minutes, the system panics. A pair of images
> >of the screen (sorry, no serial console at this time):
> >
> >Panic: http://people.freebsd.org/~jwd/r246437/20130419_160143.jpg
> >
> >bt: http://people.freebsd.org/~jwd/r246437/20130419_110158.jpg
>
> Despite that you are talking about "latest stable image", I believe your
> kernel is not latest 9-STABLE. Your backtrace reminds me about locking
> problems that should be already fixed from several sides. For example,
> on present 9-STABLE ses_path_iter_devid_callback() doesn't call
> xpt_create_path(), but calls xpt_create_path_unlocked() instead. If you
> can reproduce the issue with latest 9-STABLE, please provide respective
> information.

I agree. I added the xpt_create_path_unlocked() call to fix a panic
with a stack trace just like the one above. It looks like a problem
due to running r246437 exactly.

> >We are currently running a test to see if the fact that all our
> >shelves are dual-attached, allowing us to use geom multipath is
> >related. ie: we have disabled the 2nd HBA thus cutting the total
> >number of da & ses devices in half and thus not executing the
> >code in the commit that tracks duplicate ses devices.
> >
> >Note, if we disable both HBA devices and boot the system up it
> >does not panic or print out the repeated messages, but of course
> >we have no disks :-)
> >
> >I am unclear on the "connector 255 other 255" messages and have not
> >taken the time to look into them yet.
> >
> >I would appreciate any insights folks can provide.
> >
> >Many Thanks,
> >John
> >
> >ps: We've had to seriously increase the console buffer size to
> >capture the complete dmesg output...
> >
> >options MSGBUF_SIZE=(32768*32)
> >
> >Can we delay starting the kernel daemon until after the system
> >is up and /var/log/messages is available? Just a thought...
>
> The goal of this code was to create persistent location-dependent names
> for devices. It may be better to have them earlier.

Yes, I agree.

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG