From owner-freebsd-stable@FreeBSD.ORG Mon Sep 16 18:24:50 2013 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: by hub.freebsd.org (Postfix, from userid 821) id 9B21FBEB; Mon, 16 Sep 2013 18:24:50 +0000 (UTC) Date: Mon, 16 Sep 2013 18:24:50 +0000 From: John To: FreeBSD-Stable Subject: Re: Panic: 9.2-PRERELEASE - enc_daemon Message-ID: <20130916182450.GA60744@FreeBSD.org> References: <20130721134548.GA78666@FreeBSD.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130721134548.GA78666@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Sep 2013 18:24:50 -0000 ----- John's Original Message ----- > Hi Folks, > > I'm seeing a panic with the 9.2-PRERELEASE code. The system > will stay up for anywhere from a couple of seconds to a few hours > and then panic. I have continued to look at this problem and believe I have made some progress but still have no solution... Just questions :-) I have updated my source to r255r73 which includes all the recent MFCs to cam/xpt/scsi. The problem still persists. I have attempted to simplify the problem and have been able to duplicate part of the problem with just two shelves attached serially. I have the following debugging diffs: Index: scsi/scsi_xpt.c =================================================================== --- scsi/scsi_xpt.c (revision 255573) +++ scsi/scsi_xpt.c (working copy) @@ -2445,6 +2445,15 @@ case CDAI_TYPE_PHYS_PATH: if (cdai->flags & CDAI_FLAG_STORE) { if (device->physpath != NULL) { + /* In theory the following should not be the same - see ses_setphyspath_callback() */ + printf("incoming %d %.*s\n",(int)cdai->bufsiz,(int)cdai->bufsiz,cdai->buf); + printf("current %d %.*s %*s\n",(int)device->physpath_len,(int)device->physpath_len,device->physpath,device->serial_num_len,device->serial_num); + free(device->physpath, M_CAMXPT); device->physpath = NULL; } and to ses_devids_iter(): Index: scsi/scsi_enc_ses.c =================================================================== --- scsi/scsi_enc_ses.c (revision 255573) +++ scsi/scsi_enc_ses.c (working copy) @@ -849,6 +849,7 @@ devid->length = sizeof(struct scsi_vpd_id_naa_ieee_reg); memcpy(devid->identifier, phy_addr, devid->length); + printf("callback: i=%d device=%08x\n",i,*phy_addr); callback(enc, elm, devid, callback_arg); } } The first patch shows the currently stored versus the about to be stored physpath. The 2nd patch adds some debug information to show which phys (i value) is being processed. What I find is a constant flip/flop between the device ids of the 2 shelves being stored into what I believe is the first shelves descriptor. Sep 16 17:01:22 testsys1 kernel: callback: i=0 device=00000050 Sep 16 17:01:22 testsys1 last message repeated 23 times Sep 16 17:01:22 testsys1 kernel: callback: i=0 device=00000020 Sep 16 17:01:22 testsys1 kernel: callback: i=1 device=0000002e Sep 16 17:01:22 testsys1 kernel: callback: i=2 device=000000ff Sep 16 17:01:22 testsys1 kernel: callback: i=3 device=000000ff Sep 16 17:01:22 testsys1 kernel: callback: i=4 device=000000ff Sep 16 17:01:22 testsys1 kernel: callback: i=5 device=000000ff Sep 16 17:01:22 testsys1 kernel: callback: i=6 device=00000000 Sep 16 17:01:22 testsys1 kernel: incoming 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 Sep 16 17:01:22 testsys1 kernel: current 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 5C7237P403 Sep 16 17:01:22 testsys1 kernel: callback: i=7 device=00000000 5C7237P403 is the serial of the 1st shelf: # camcontrol inquiry ses0 pass29: Fixed Enclosure Services SCSI-5 device pass29: Serial Number 5C7237P403 pass29: 600.000MB/s transfers, Command Queueing Enabled # camcontrol inquiry ses1 pass55: Fixed Enclosure Services SCSI-5 device pass55: Serial Number 5C7236P2T5 pass55: 600.000MB/s transfers, Command Queueing Enabled A sample grep to show the flip/flop'ing... Sep 16 17:00:22 testsys1 kernel: incoming 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 Sep 16 17:00:22 testsys1 kernel: current 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 5C7237P403 Sep 16 17:00:23 testsys1 kernel: incoming 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 Sep 16 17:00:23 testsys1 kernel: current 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 5C7237P403 Sep 16 17:01:22 testsys1 kernel: incoming 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 Sep 16 17:01:22 testsys1 kernel: current 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 5C7237P403 Sep 16 17:01:23 testsys1 kernel: incoming 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 Sep 16 17:01:23 testsys1 kernel: current 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 5C7237P403 Sep 16 17:02:22 testsys1 kernel: incoming 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 Sep 16 17:02:22 testsys1 kernel: current 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 5C7237P403 Sep 16 17:02:23 testsys1 kernel: incoming 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 Sep 16 17:02:23 testsys1 kernel: current 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 5C7237P403 I'm not sure why this shows up for i=6... After alot of reading & experimenting, I saw the following when running the: sg_ses -p ed ses0 command... # sg_ses -p ed ses0 HP D2700 SAS AJ941A 0147 Primary enclosure logical identifier (hex): 5001438023478280 Element Descriptor In diagnostic page: generation code: 0x0 element descriptor by type list Element type: Array device slot, subenclosure id: 0 [ti=0] Overall descriptor: Element 0 descriptor: ... Element 23 descriptor: Element 24 descriptor: Element type: SAS connector, subenclosure id: 0 [ti=1] Overall descriptor: Element 0 descriptor: ... Element 23 descriptor: Element 24 descriptor: Element type: vendor specific [0xff], subenclosure id: 0 [ti=2] Overall descriptor: Element 0 descriptor: 5C7237P403 Element type: Power supply, subenclosure id: 0 [ti=3] Overall descriptor: Element 0 descriptor: 20LiteOn 5ANLE0CLL3K6VI Element 1 descriptor: 20LiteOn 5ANLE0CLL3K6V7 Element type: Cooling, subenclosure id: 0 [ti=4] Overall descriptor: B PAVCA0B9V3E4I0 Element 0 descriptor: Element 1 descriptor: Element type: Cooling, subenclosure id: 0 [ti=5] Overall descriptor: B PAVCA0B9V3E4HH Element 0 descriptor: Element 1 descriptor: Element type: Enclosure services controller electronics, subenclosure id: 0 [ti=6] Overall descriptor: Element 0 descriptor: 014701B0PAZAV0BTM3F17A 0026 ... Could the enclosure services controller at ti=6 somehow be related to what I'm seeing? If anyone has any ideas I'd appreciate any ideas, pointers,etc. Thanks, John ps: full output from smp_discover for both shelves.. # smp_discover ses0 phy 0:D:attached:[500003944822dd2a:00 t(SSP)] 6 Gbps phy 1:D:attached:[500003944822dc96:00 t(SSP)] 6 Gbps phy 2:D:attached:[500003944822dcbe:00 t(SSP)] 6 Gbps phy 3:D:attached:[500003944822f8a2:00 t(SSP)] 6 Gbps phy 4:D:attached:[500003944822d876:00 t(SSP)] 6 Gbps phy 5:D:attached:[500003944822d776:00 t(SSP)] 6 Gbps phy 6:D:attached:[5000039448232726:00 t(SSP)] 6 Gbps phy 7:D:attached:[500003944823b8aa:00 t(SSP)] 6 Gbps phy 8:D:attached:[500003944823b846:00 t(SSP)] 6 Gbps phy 9:D:attached:[500003944822d7ae:00 t(SSP)] 6 Gbps phy 10:D:attached:[500003944823b2f2:00 t(SSP)] 6 Gbps phy 11:D:attached:[500003944823b04a:00 t(SSP)] 6 Gbps phy 12:D:attached:[5000039448239ff2:00 t(SSP)] 6 Gbps phy 13:D:attached:[500003944823db72:00 t(SSP)] 6 Gbps phy 14:D:attached:[500003944823b7a6:00 t(SSP)] 6 Gbps phy 15:D:attached:[500003944823dcc6:00 t(SSP)] 6 Gbps phy 16:D:attached:[500003944823787a:00 t(SSP)] 6 Gbps phy 17:D:attached:[500003944823b30a:00 t(SSP)] 6 Gbps phy 18:D:attached:[500003944823b0de:00 t(SSP)] 6 Gbps phy 19:D:attached:[500003944823b506:00 t(SSP)] 6 Gbps phy 20:D:attached:[500003944823e7b6:00 t(SSP)] 6 Gbps phy 21:D:attached:[5000039448232e7a:00 t(SSP)] 6 Gbps phy 22:D:attached:[500003944823cfde:00 t(SSP)] 6 Gbps phy 23:D:attached:[500003944823dc52:00 t(SSP)] 6 Gbps phy 24:D:attached:[500003944823c9e6:00 t(SSP)] 6 Gbps phy 25:S:attached:[500605b0050a7660:03 i(SSP+STP+SMP)] 6 Gbps phy 26:S:attached:[500605b0050a7660:02 i(SSP+STP+SMP)] 6 Gbps phy 27:S:attached:[500605b0050a7660:01 i(SSP+STP+SMP)] 6 Gbps phy 28:S:attached:[500605b0050a7660:00 i(SSP+STP+SMP)] 6 Gbps phy 29:T:attached:[500143802344173f:25 exp i(SMP) t(SMP)] 6 Gbps phy 30:T:attached:[500143802344173f:26 exp i(SMP) t(SMP)] 6 Gbps phy 31:T:attached:[500143802344173f:27 exp i(SMP) t(SMP)] 6 Gbps phy 32:T:attached:[500143802344173f:28 exp i(SMP) t(SMP)] 6 Gbps phy 33: inaccessible (phy vacant) phy 34: inaccessible (phy vacant) phy 35: inaccessible (phy vacant) phy 36:D:attached:[50014380234782be:36 V i(SSP) t(SSP)] 6 Gbps # smp_discover ses1 phy 0:D:attached:[500003944823cbfa:00 t(SSP)] 6 Gbps phy 1:D:attached:[500003944823ddbe:00 t(SSP)] 6 Gbps phy 2:D:attached:[500003944823d9ee:00 t(SSP)] 6 Gbps phy 3:D:attached:[500003944823c2f6:00 t(SSP)] 6 Gbps phy 4:D:attached:[500003944823c2ca:00 t(SSP)] 6 Gbps phy 5:D:attached:[500003944823d8c2:00 t(SSP)] 6 Gbps phy 6:D:attached:[500003944823cc3e:00 t(SSP)] 6 Gbps phy 7:D:attached:[500003944823d3e2:00 t(SSP)] 6 Gbps phy 8:D:attached:[500003944823d8da:00 t(SSP)] 6 Gbps phy 9:D:attached:[500003944823c39a:00 t(SSP)] 6 Gbps phy 10:D:attached:[500003944823d896:00 t(SSP)] 6 Gbps phy 11:D:attached:[500003944823dbb6:00 t(SSP)] 6 Gbps phy 12:D:attached:[500003944823c6a2:00 t(SSP)] 6 Gbps phy 13:D:attached:[500003944823dc86:00 t(SSP)] 6 Gbps phy 14:D:attached:[500003944823e7e2:00 t(SSP)] 6 Gbps phy 15:D:attached:[500003944823cae6:00 t(SSP)] 6 Gbps phy 16:D:attached:[500003944823d44a:00 t(SSP)] 6 Gbps phy 17:D:attached:[500003944823d27a:00 t(SSP)] 6 Gbps phy 18:D:attached:[5000039448234122:00 t(SSP)] 6 Gbps phy 19:D:attached:[500003944823b17e:00 t(SSP)] 6 Gbps phy 20:D:attached:[500003944823c6fe:00 t(SSP)] 6 Gbps phy 21:D:attached:[500003944823435e:00 t(SSP)] 6 Gbps phy 22:D:attached:[500003944823c776:00 t(SSP)] 6 Gbps phy 23:D:attached:[50000394482341da:00 t(SSP)] 6 Gbps phy 24:D:attached:[500003944823cc62:00 t(SSP)] 6 Gbps phy 25:S:attached:[50014380234782bf:29 exp i(SMP) t(SMP)] 6 Gbps phy 26:S:attached:[50014380234782bf:30 exp i(SMP) t(SMP)] 6 Gbps phy 27:S:attached:[50014380234782bf:31 exp i(SMP) t(SMP)] 6 Gbps phy 28:S:attached:[50014380234782bf:32 exp i(SMP) t(SMP)] 6 Gbps phy 33: inaccessible (phy vacant) phy 34: inaccessible (phy vacant) phy 35: inaccessible (phy vacant) phy 36:D:attached:[500143802344173e:36 V i(SSP) t(SSP)] 6 Gbps # camcontrol smpmaninfo ses0 Report Manufacturer Information Expander Change count: 1815 SAS 1.1 Format: Yes Component Vendor: PMCSIERA Component ID: 0x8005 Component Revision: 0x2 Vendor Specific: 0x0000000000000000 > Fatal trap 12: page fault while in kernel mode > cpuid = 31; apic id = 2f > fault virtual address = 0x0 > fault code = supervisor read data, page not present > instruction pointer = 0x20:0xffffffff80d2b018 > stack pointer = 0x28:0xffffffbfd0fea080 > frame pointer = 0x28:0xffffffbfd0fea0b0 > code segment = base 0x0, limit 0xfffff, type 0x1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = interrupt enabled, resume, IOPL = 0 > current process = 25 (enc_daemon7) > > and: > > db:0:kdb.enter.default> show pcpu > cpuid = 31 > dynamic pcpu = 0xffffff807f203880 > curthread = 0xfffffe0032f53920: pid 25 "enc_daemon7" > curpcb = 0xffffffbfd0feabc0 > fpcurthread = none > idlethread = 0xfffffe002600b920: tid 100034 "idle: cpu31" > curpmap = 0xffffffff8141b510 > tssp = 0xffffffff81489e98 > commontssp = 0xffffffff81489e98 > rsp0 = 0xffffffbfd0feabc0 > gs32p = 0xffffffff81487fd0 > ldt = 0xffffffff81488010 > tss = 0xffffffff81488000 > > > > This looks like a bug I started tracing down a while back with > the new enclosure services (r246437 and later). I added witness > into the kernel and received the following LOR: > > > Kernel page fault with the following non-sleepable locks held: > exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0 (0xffffff8003c851b8) locked @ cam/cam_periph.h:192 > KDB: stack backtrace: > db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffbfd0f3cb20 > kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffbfd0f3cbe0 > _witness_debugger() at _witness_debugger+0x2c/frame 0xffffffbfd0f3cc00 > witness_warn() at witness_warn+0x2d2/frame 0xffffffbfd0f3cd40 > trap_pfault() at trap_pfault+0x6a/frame 0xffffffbfd0f3cdd0 > trap() at trap+0x344/frame 0xffffffbfd0f3cfd0 > calltrap() at calltrap+0x8/frame 0xffffffbfd0f3cfd0 > --- trap 0xc, rip = 0xffffffff80ca8478, rsp = 0xffffffbfd0f3d090, rbp = 0xffffffbfd0f3d0c0 --- > memcpy() at memcpy+0x8/frame 0xffffffbfd0f3d0c0 > ses_setphyspath_callback() at ses_setphyspath_callback+0xb3/frame 0xffffffbfd0f3d1d0 > ses_path_iter_devid_callback() at ses_path_iter_devid_callback+0x1c6/frame 0xffffffbfd0f3d770 > ses_devids_iter() at ses_devids_iter+0xb1/frame 0xffffffbfd0f3d7f0 > ses_paths_iter() at ses_paths_iter+0x20/frame 0xffffffbfd0f3d810 > ses_publish_physpaths() at ses_publish_physpaths+0x264/frame 0xffffffbfd0f3da40 > enc_daemon() at enc_daemon+0x2a4/frame 0xffffffbfd0f3daa0 > fork_exit() at fork_exit+0x11d/frame 0xffffffbfd0f3daf0 > fork_trampoline() at fork_trampoline+0xe/frame 0xffffffbfd0f3daf0 > --- trap 0, rip = 0, rsp = 0xffffffbfd0f3dbb0, rbp = 0 --- > > > Fatal trap 12: page fault while in kernel mode > cpuid = 8; apic id = 08 > fault virtual address = 0x0 > fault code = supervisor read data, page not present > instruction pointer = 0x20:0xffffffff80ca8478 > stack pointer = 0x28:0xffffffbfd0f3d090 > frame pointer = 0x28:0xffffffbfd0f3d0c0 > code segment = base 0x0, limit 0xfffff, type 0x1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = interrupt enabled, resume, IOPL = 0 > current process = 30 (enc_daemon12) > lock order reversal: (Giant after non-sleepable) > 1st 0xffffff8003c851b8 MPT2SAS lock (MPT2SAS lock) @ cam/cam_periph.h:192 > 2nd 0xffffffff8139bc80 Giant (Giant) @ dev/usb/input/ukbd.c:1942 > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > I'm wondering if there is a bad interaction here. > > > The system has 8 DS2700 shelves dual attached to a pair > of LSI 8e cards, thus the kernel configuration with an increased > msgbuf size. > > Kernel conf: > > include GENERIC > ident ZFS > options DDB > options KDB > options WITNESS > options MSGBUF_SIZE=(32768*16) > > And some ddb output: > > db:0:kdb.enter.default> run lockinfo > db:1:lockinfo> show locks > exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0 (0xffffff8003c851b8) locked @ cam/cam_periph.h:192 > db:1:locks> show alllocks > Process 30 (enc_daemon12) thread 0xfffffe003421a000 (100155) > exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0 (0xffffff8003c851b8) locked @ cam/cam_periph.h:192 > db:1:alllocks> show lockedvnods > Locked vnodes > db:0:kdb.enter.default> show pcpu > cpuid = 8 > dynamic pcpu = 0xffffff807f1e4800 > curthread = 0xfffffe003421a000: pid 30 "enc_daemon12" > curpcb = 0xffffffbfd0f3dbc0 > fpcurthread = none > idlethread = 0xfffffe0021ffe490: tid 100011 "idle: cpu8" > curpmap = 0xffffffff81399590 > tssp = 0xffffffff815a5640 > commontssp = 0xffffffff815a5640 > rsp0 = 0xffffffbfd0f3dbc0 > gs32p = 0xffffffff815a3778 > ldt = 0xffffffff815a37b8 > tss = 0xffffffff815a37a8 > spin locks held: > db:0:kdb.enter.default> bt > Tracing pid 30 tid 100155 td 0xfffffe003421a000 > memcpy() at memcpy+0x8/frame 0xffffffbfd0f3d0c0 > ses_setphyspath_callback() at ses_setphyspath_callback+0xb3/frame 0xffffffbfd0f3d1d0 > ses_path_iter_devid_callback() at ses_path_iter_devid_callback+0x1c6/frame 0xffffffbfd0f3d770 > ses_devids_iter() at ses_devids_iter+0xb1/frame 0xffffffbfd0f3d7f0 > ses_paths_iter() at ses_paths_iter+0x20/frame 0xffffffbfd0f3d810 > ses_publish_physpaths() at ses_publish_physpaths+0x264/frame 0xffffffbfd0f3da40 > enc_daemon() at enc_daemon+0x2a4/frame 0xffffffbfd0f3daa0 > fork_exit() at fork_exit+0x11d/frame 0xffffffbfd0f3daf0 > fork_trampoline() at fork_trampoline+0xe/frame 0xffffffbfd0f3daf0 > --- trap 0, rip = 0, rsp = 0xffffffbfd0f3dbb0, rbp = 0 --- > > Any thoughts/ideas are appreciated. I've reviewed the code and > don't see anything obvious. > > Thanks, > John > > >