From owner-freebsd-scsi@FreeBSD.ORG Sat Jun 1 03:25:08 2013 Return-Path: Delivered-To: freebsd-scsi@freebsd.org Received: by hub.freebsd.org (Postfix, from userid 821) id 32DF76E; Sat, 1 Jun 2013 03:25:08 +0000 (UTC) Date: Sat, 1 Jun 2013 03:25:08 +0000 From: John To: FreeBSD SCSI Subject: Re: Repeated msgs & kernel panic w/ r246437 (Revamp the CAM enclosure services driver) Message-ID: <20130601032508.GA72781@FreeBSD.org> References: <20130422030053.GA23186@FreeBSD.org> <517641C6.7010905@FreeBSD.org> <20130423140237.GA50775@nargothrond.kdm.org> <20130508193313.GA65921@FreeBSD.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130508193313.GA65921@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Alexander Motin , "Kenneth D. Merry" X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 01 Jun 2013 03:25:08 -0000 ----- John's Original Message ----- > ----- Kenneth D. Merry's Original Message ----- > > > On 22.04.2013 06:00, John wrote: > > > >Hi Folks, > > > > > > > > After updating one of our servers to the latest stable image, > > > >it appears that commit r246437 appears to be causing it to panic. > > > I agree. I added the xpt_create_path_unlocked() call to fix a > > panic with a stack trace just like the one above. It looks like a problem > > due to running r246437 exactly. > > As noted above, the system panics with the latest stable image. We > simply started backing off commits until we found where the system > no longer panic'd. > > Side note: is there a way to disable the new daemon that I haven't > seen? sysctl? loader config? > > Apologies for the delay - here is the stable information: > > 'working' tag - has 1 of the external LSI cards disabled (so only 1 path through the shelves). > 'broken' tag - has both cards enabled and panics in short order. > > Both of these are with a stable build (r249895) > > http://www.freebsd.org/~jwd/r246437/dmesg.working.txt > http://www.freebsd.org/~jwd/r246437/dmesg.broken.txt Hi Folks, I had a little time to look at this again.. I enabled WITNESS on a 9-stable system from 5/30/2013. It ran for a short while before panic'ing. ses40: ses0,pass20: Element descriptor: ' ' ses40: ses0,pass20: SAS Expander: 24 Physses40: phy 0: connector 255 other 255 ses40: phy 1: connector 255 other 255 ses40: phy 2: connector 255 other 255 ses40: phy 3: connector 255 other 255 ses40: phy 4: connector 255 other 255 ses40: phy 5: connector 255 other 255 ses40: phy 6: connector 255 other 255 ses40: phy 7: connector 255 other 255 ses40: phy 8: connector 255 other 255 ses40: phy 9: connector 255 other 255 ses40: phy 10: connector 255 other 255 ses40: phy 11: connector 255 other 255 ses40: phy 12: connector 255 other 255 ses40: phy 13: connector 255 other 255 ses40: phy 14: connector 255 other 255 ses40: phy 15: connector 255 other 255 ses40: phy 16: connector 255 other 255 ses40: phy 17: connector 255 other 255 ses40: phy 18: connector 255 other 255 ses40: phy 19: connector 255 other 255 ses40: phy 20: connector 255 other 255 ses40: phy 21: connector 255 other 255 ses40: phy 22: connector 255 other 255 ses40: phy 23: connector 255 other 255 Kernel page fault with the following non-sleepable locks held: exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0 (0xffffff8cce0161b8) locked @ /usr/src.2013-05-30_18.01.06/sys/cam/cam_periph.h:192 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffa42c96c480 kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffa42c96c540 _witness_debugger() at _witness_debugger+0x2c/frame 0xffffffa42c96c560 witness_warn() at witness_warn+0x2d2/frame 0xffffffa42c96c6a0 trap_pfault() at trap_pfault+0x6a/frame 0xffffffa42c96c730 trap() at trap+0x42a/frame 0xffffffa42c96c930 calltrap() at calltrap+0x8/frame 0xffffffa42c96c930 --- trap 0xc, rip = 0xffffffff80cc69e8, rsp = 0xffffffa42c96c9f0, rbp = 0xffffffa42c96ca20 --- memcpy() at memcpy+0x8/frame 0xffffffa42c96ca20 xpt_getattr() at xpt_getattr+0x10f/frame 0xffffffa42c96cb30 pass_add_physpath() at pass_add_physpath+0x6d/frame 0xffffffa42c96cb60 taskqueue_run_locked() at taskqueue_run_locked+0x93/frame 0xffffffa42c96cbc0 taskqueue_thread_loop() at taskqueue_thread_loop+0x3e/frame 0xffffffa42c96cbe0 fork_exit() at fork_exit+0x135/frame 0xffffffa42c96cc30 fork_trampoline() at fork_trampoline+0xe/frame 0xffffffa42c96cc30 --- trap 0, rip = 0, rsp = 0xffffffa42c96ccf0, rbp = 0 --- Fatal trap 12: page fault while in kernel mode cpuid = 4; apic id = 12 fault virtual address = 0x0 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80cc69e8 stack pointer = 0x28:0xffffffa42c96c9f0 frame pointer = 0x28:0xffffffa42c96ca20 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 0 (thread taskq) lock order reversal: (Giant after non-sleepable) lock order reversal: (Giant after non-sleepable) 1st 0xffffff8cce0161b8 MPT2SAS lock (MPT2SAS lock) @ /usr/src.2013-05-30_18.01.06/sys/cam/cam_periph.h:192 1st 0xffffff8cce0161b8 MPT2SAS lock (MPT2SAS lock) @ /usr/src.2013-05-30_18.01.06/sys/cam/cam_periph.h:192 2nd 0xffffffff813d1760 Giant (Giant) @ /usr/src.2013-05-30_18.01.06/sys/dev/usb/input/ukbd.c:1942 2nd 0xffffffff813d1760 Giant (Giant) @ /usr/src.2013-05-30_18.01.06/sys/dev/usb/input/ukbd.c:1942 KDB: stack backtrace: KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffa42c96c3b0 kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffa42c96c470 _witness_debugger() at _witness_debugger+0x2c/frame 0xffffffa42c96c490 witness_checkorder() at witness_checkorder+0x875/frame 0xffffffa42c96c550 _mtx_lock_flags() at _mtx_lock_flags+0x80/frame 0xffffffa42c96c580 ukbd_poll() at ukbd_poll+0x27/frame 0xffffffa42c96c5a0 kbdmux_poll() at kbdmux_poll+0x3f/frame 0xffffffa42c96c5c0 cngrab() at cngrab+0x2f/frame 0xffffffa42c96c5e0 kdb_trap() at kdb_trap+0xc6/frame 0xffffffa42c96c640 trap_fatal() at trap_fatal+0x29d/frame 0xffffffa42c96c6a0 trap_pfault() at trap_pfault+0x7a/frame 0xffffffa42c96c730 trap() at trap+0x42a/frame 0xffffffa42c96c930 calltrap() at calltrap+0x8/frame 0xffffffa42c96c930 --- trap 0xc, rip = 0xffffffff80cc69e8, rsp = 0xffffffa42c96c9f0, rbp = 0xffffffa42c96ca20 --- memcpy() at memcpy+0x8/frame 0xffffffa42c96ca20 xpt_getattr() at xpt_getattr+0x10f/frame 0xffffffa42c96cb30 pass_add_physpath() at pass_add_physpath+0x6d/frame 0xffffffa42c96cb60 taskqueue_run_locked() at taskqueue_run_locked+0x93/frame 0xffffffa42c96cbc0 taskqueue_thread_loop() at taskqueue_thread_loop+0x3e/frame 0xffffffa42c96cbe0 fork_exit() at fork_exit+0x135/frame 0xffffffa42c96cc30 fork_trampoline() at fork_trampoline+0xe/frame 0xffffffa42c96cc30 --- trap 0, rip = 0, rsp = 0xffffffa42c96ccf0, rbp = 0 --- [ thread pid 0 tid 100034 ] Stopped at memcpy+0x8: repe movsq (%rsi),%es:(%rdi) db> bt Tracing pid 0 tid 100034 td 0xfffffe002c98c920 memcpy() at memcpy+0x8/frame 0xffffffa42c96ca20 xpt_getattr() at xpt_getattr+0x10f/frame 0xffffffa42c96cb30 pass_add_physpath() at pass_add_physpath+0x6d/frame 0xffffffa42c96cb60 taskqueue_run_locked() at taskqueue_run_locked+0x93/frame 0xffffffa42c96cbc0 taskqueue_thread_loop() at taskqueue_thread_loop+0x3e/frame 0xffffffa42c96cbe0 fork_exit() at fork_exit+0x135/frame 0xffffffa42c96cc30 fork_trampoline() at fork_trampoline+0xe/frame 0xffffffa42c96cc30 --- trap 0, rip = 0, rsp = 0xffffffa42c96ccf0, rbp = 0 --- db> Bits and pieces... 48 enclosure daemons running: Tracing command enc_daemon47 pid 64 tid 100160 td 0xfffffe009dd28490 sched_switch() at sched_switch+0x194/frame 0xffffffa70faa4a40 mi_switch() at mi_switch+0x208/frame 0xffffffa70faa4a90 sleepq_switch() at sleepq_switch+0xfc/frame 0xffffffa70faa4ac0 sleepq_wait() at sleepq_wait+0x4d/frame 0xffffffa70faa4af0 _sleep() at _sleep+0x3d4/frame 0xffffffa70faa4b80 enc_daemon() at enc_daemon+0xde/frame 0xffffffa70faa4be0 fork_exit() at fork_exit+0x135/frame 0xffffffa70faa4c30 fork_trampoline() at fork_trampoline+0xe/frame 0xffffffa70faa4c30 --- trap 0, rip = 0, rsp = 0xffffffa70faa4cf0, rbp = 0 --- And it looks like there are some reversals within the daemon: acquiring duplicate lock of same type: "MPT2SAS lock" 1st MPT2SAS lock @ /usr/src.2013-05-30_18.01.06/sys/cam/cam_periph.h:192 2nd MPT2SAS lock @ /usr/src.2013-05-30_18.01.06/sys/cam/scsi/scsi_xpt.c:2529 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffa70fa4a010 ses27: phy 0: SAS device type 1 id 1 ses27: phy 0: protocols: Initiator( None ) Target( SSP ) ses27: phy 0: parent 50014380019fa87d addr 5000c500345510be ses26: phy 0: SAS device type 1 id 1 ses26: phy 0: protocols: Initiator( None ) Target( SSP ) ses26: phy 0: parent 50014380019f20bd addr 5000c50034642cb2 ses28: phy 0: parent 50014380019f787d addr 5000c50034635216 kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffa70fa4a0d0 _witness_debugger() at _witness_debugger+0x2c/frame 0xffffffa70fa4a0f0 witness_checkorder() at witness_checkorder+0x95e/frame 0xffffffa70fa4a1b0 _mtx_lock_flags() at _mtx_lock_flags+0x80/frame 0xffffffa70fa4a1e0 scsi_action() at scsi_action+0x1d1/frame 0xffffffa70fa4a200 ses_setphyspath_callback() at ses_setphyspath_callback+0x177/frame 0xffffffa70fa4a310 ses_path_iter_devid_callback() at ses_path_iter_devid_callback+0x1c6/frame 0xffffffa70fa4a8b0 ses_devids_iter() at ses_devids_iter+0xb1/frame 0xffffffa70fa4a930 ses_paths_iter() at ses_paths_iter+0x20/frame 0xffffffa70fa4a950 ses_publish_physpaths() at ses_publish_physpaths+0x263/frame 0xffffffa70fa4ab80 enc_daemon() at enc_daemon+0x2a4/frame 0xffffffa70fa4abe0 fork_exit() at fork_exit+0x135/frame 0xffffffa70fa4ac30 fork_trampoline() at fork_trampoline+0xe/frame 0xffffffa70fa4ac30 --- trap 0, rip = 0, rsp = 0xffffffa70fa4acf0, rbp = 0 --- Cheers, John