Date: Mon, 22 Apr 2013 03:00:53 +0000
From: John
To: FreeBSD SCSI
Subject: Repeated msgs & kernel panic w/ r246437 (Revamp the CAM enclosure services driver)
Message-ID: <20130422030053.GA23186@FreeBSD.org>

Hi Folks,

After updating one of our servers to the latest stable image, commit r246437 appears to be causing it to panic.

The commit:

http://svnweb.freebsd.org/base?view=revision&revision=246437

What one of our servers looks like:

http://people.freebsd.org/~jwd/zfsnfsserver.jpg

The last known working revision (dmesg from r246431):

http://people.freebsd.org/~jwd/r246437/dmesg.r246431.clean.txt

With commit r246437:

http://people.freebsd.org/~jwd/r246437/dmesg.r246437.log.txt

Note, most of the dmesg output is related to the ses devices. It repeats itself multiple times before the panic.

ses39: ses0,pass20: Element descriptor: ' '
ses39: ses0,pass20: SAS Expander: 24 Physses39: phy 0: connector 255 other 255
ses39: phy 1: connector 255 other 255
ses39: phy 2: connector 255 other 255
ses39: phy 3: connector 255 other 255
ses39: phy 4: connector 255 other 255
ses39: phy 5: connector 255 other 255
ses39: phy 6: connector 255 other 255
etc, etc...

After just a few minutes, the system panics.

A pair of images of the screen (sorry, no serial console at this time):

Panic: http://people.freebsd.org/~jwd/r246437/20130419_160143.jpg
bt: http://people.freebsd.org/~jwd/r246437/20130419_110158.jpg

We are currently running a test to see whether the problem is related to all our shelves being dual-attached (which is what lets us use geom multipath). That is, we have disabled the 2nd HBA, which cuts the total number of da & ses devices in half, so the code in the commit that tracks duplicate ses devices is not executed.

Note, if we disable both HBA devices and boot the system up, it does not panic or print out the repeated messages, but of course we have no disks :-)

I am unclear on the "connector 255 other 255" messages and have not taken the time to look into them yet. I would appreciate any insights folks can provide.

Many Thanks,
John

ps: We've had to seriously increase the console buffer size to capture the complete dmesg output...

options MSGBUF_SIZE=(32768*32)

Can we delay starting the kernel daemon until after the system is up and /var/log/messages is available? Just a thought...