Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 16 Sep 2013 18:27:47 +0000
From:      John <jwd@FreeBSD.org>
To:        FreeBSD-SCSI <freebsd-scsi@freebsd.org>
Subject:   Re: Panic: 9.2-PRERELEASE - enc_daemon
Message-ID:  <20130916182747.GB60744@FreeBSD.org>
In-Reply-To: <20130916182450.GA60744@FreeBSD.org>
References:  <20130721134548.GA78666@FreeBSD.org> <20130916182450.GA60744@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help

Meant to post this to -scsi not -stable as I just did...

----- John's Original Message -----
> Hi Folks,
> 
>    I'm seeing a panic with the 9.2-PRERELEASE code. The system
> will stay up for anywhere from a couple of seconds to a few hours
> and then panic.

   I have continued to look at this problem and believe I have made
some progress but still have no solution... Just questions :-)

   I have updated my source to r255r73 which includes all the recent
MFCs to cam/xpt/scsi. The problem still persists.

   I have attempted to simplify the problem and have been able to
duplicate part of the problem with just two shelves attached serially.

   I have the following debugging diffs:

Index: scsi/scsi_xpt.c
===================================================================
--- scsi/scsi_xpt.c	(revision 255573)
+++ scsi/scsi_xpt.c	(working copy)
@@ -2445,6 +2445,15 @@
 	case CDAI_TYPE_PHYS_PATH:
 		if (cdai->flags & CDAI_FLAG_STORE) {
 			if (device->physpath != NULL) {
+				/* In theory the following should not be the same - see ses_setphyspath_callback() */
+				printf("incoming %d %.*s\n",(int)cdai->bufsiz,(int)cdai->bufsiz,cdai->buf);
+				printf("current  %d %.*s %*s\n",(int)device->physpath_len,(int)device->physpath_len,device->physpath,device->serial_num_len,device->serial_num);
+
 				free(device->physpath, M_CAMXPT);
 				device->physpath = NULL;
 			}

and to ses_devids_iter():

Index: scsi/scsi_enc_ses.c
===================================================================
--- scsi/scsi_enc_ses.c	(revision 255573)
+++ scsi/scsi_enc_ses.c	(working copy)
@@ -849,6 +849,7 @@
 		devid->length	     = sizeof(struct scsi_vpd_id_naa_ieee_reg);
 		memcpy(devid->identifier, phy_addr, devid->length);
 
+		printf("callback: i=%d device=%08x\n",i,*phy_addr);
 		callback(enc, elm, devid, callback_arg);
 	}
 }


   The first patch shows the currently stored versus the about
to be stored physpath. The 2nd patch adds some debug information
to show which phys (i value) is being processed.

   What I find is a constant flip/flop between the device ids of
the 2 shelves being stored into what I believe is the first shelves
descriptor.

Sep 16 17:01:22 testsys1 kernel: callback: i=0 device=00000050
Sep 16 17:01:22 testsys1 last message repeated 23 times
Sep 16 17:01:22 testsys1 kernel: callback: i=0 device=00000020
Sep 16 17:01:22 testsys1 kernel: callback: i=1 device=0000002e
Sep 16 17:01:22 testsys1 kernel: callback: i=2 device=000000ff
Sep 16 17:01:22 testsys1 kernel: callback: i=3 device=000000ff
Sep 16 17:01:22 testsys1 kernel: callback: i=4 device=000000ff
Sep 16 17:01:22 testsys1 kernel: callback: i=5 device=000000ff
Sep 16 17:01:22 testsys1 kernel: callback: i=6 device=00000000
Sep 16 17:01:22 testsys1 kernel: incoming 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002
Sep 16 17:01:22 testsys1 kernel: current  60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 5C7237P403      
Sep 16 17:01:22 testsys1 kernel: callback: i=7 device=00000000

   5C7237P403 is the serial of the 1st shelf:

# camcontrol inquiry ses0
pass29: <HP D2700 SAS AJ941A 0147> Fixed Enclosure Services SCSI-5 device 
pass29: Serial Number 5C7237P403      
pass29: 600.000MB/s transfers, Command Queueing Enabled
# camcontrol inquiry ses1
pass55: <HP D2700 SAS AJ941A 0147> Fixed Enclosure Services SCSI-5 device 
pass55: Serial Number 5C7236P2T5      
pass55: 600.000MB/s transfers, Command Queueing Enabled

   A sample grep to show the flip/flop'ing...

Sep 16 17:00:22 testsys1 kernel: incoming 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002
Sep 16 17:00:22 testsys1 kernel: current  60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 5C7237P403      
Sep 16 17:00:23 testsys1 kernel: incoming 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002
Sep 16 17:00:23 testsys1 kernel: current  60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 5C7237P403      
Sep 16 17:01:22 testsys1 kernel: incoming 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002
Sep 16 17:01:22 testsys1 kernel: current  60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 5C7237P403      
Sep 16 17:01:23 testsys1 kernel: incoming 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002
Sep 16 17:01:23 testsys1 kernel: current  60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 5C7237P403      
Sep 16 17:02:22 testsys1 kernel: incoming 60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002
Sep 16 17:02:22 testsys1 kernel: current  60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002 5C7237P403      
Sep 16 17:02:23 testsys1 kernel: incoming 60 id1,enc@n5001438023441700/type@9/slot@1/elmdesc@900080050002
Sep 16 17:02:23 testsys1 kernel: current  60 id1,enc@n5001438023478280/type@9/slot@1/elmdesc@900080050002 5C7237P403      

   I'm not sure why this shows up for i=6...


   After alot of reading & experimenting, I saw the following
when running the: sg_ses -p ed ses0 command...


# sg_ses -p ed ses0
  HP        D2700 SAS AJ941A  0147
  Primary enclosure logical identifier (hex): 5001438023478280
Element Descriptor In diagnostic page:
  generation code: 0x0
  element descriptor by type list
    Element type: Array device slot, subenclosure id: 0 [ti=0]
      Overall descriptor: <empty>
      Element 0 descriptor: <empty>
...
      Element 23 descriptor: <empty>
      Element 24 descriptor: <empty>
   Element type: SAS connector, subenclosure id: 0 [ti=1]
      Overall descriptor: <empty>
      Element 0 descriptor: <empty>
...
      Element 23 descriptor: <empty>
      Element 24 descriptor: <empty>
    Element type: vendor specific [0xff], subenclosure id: 0 [ti=2]
      Overall descriptor: <empty>
      Element 0 descriptor:     5C7237P403      
    Element type: Power supply, subenclosure id: 0 [ti=3]
      Overall descriptor: <empty>
      Element 0 descriptor:   20LiteOn           5ANLE0CLL3K6VI 
      Element 1 descriptor:   20LiteOn           5ANLE0CLL3K6V7 
    Element type: Cooling, subenclosure id: 0 [ti=4]
      Overall descriptor:       B PAVCA0B9V3E4I0  
      Element 0 descriptor: <empty>
      Element 1 descriptor: <empty>
    Element type: Cooling, subenclosure id: 0 [ti=5]
      Overall descriptor:       B PAVCA0B9V3E4HH  
      Element 0 descriptor: <empty>
      Element 1 descriptor: <empty>
    Element type: Enclosure services controller electronics, subenclosure id: 0 [ti=6]
      Overall descriptor: <empty>
      Element 0 descriptor: 014701B0PAZAV0BTM3F17A  0026
...

   Could the enclosure services controller at ti=6 somehow be
related to what I'm seeing?


   If anyone has any ideas I'd appreciate any ideas, pointers,etc.

Thanks,
John

ps: full output from smp_discover for both shelves..

# smp_discover ses0
  phy   0:D:attached:[500003944822dd2a:00  t(SSP)]  6 Gbps
  phy   1:D:attached:[500003944822dc96:00  t(SSP)]  6 Gbps
  phy   2:D:attached:[500003944822dcbe:00  t(SSP)]  6 Gbps
  phy   3:D:attached:[500003944822f8a2:00  t(SSP)]  6 Gbps
  phy   4:D:attached:[500003944822d876:00  t(SSP)]  6 Gbps
  phy   5:D:attached:[500003944822d776:00  t(SSP)]  6 Gbps
  phy   6:D:attached:[5000039448232726:00  t(SSP)]  6 Gbps
  phy   7:D:attached:[500003944823b8aa:00  t(SSP)]  6 Gbps
  phy   8:D:attached:[500003944823b846:00  t(SSP)]  6 Gbps
  phy   9:D:attached:[500003944822d7ae:00  t(SSP)]  6 Gbps
  phy  10:D:attached:[500003944823b2f2:00  t(SSP)]  6 Gbps
  phy  11:D:attached:[500003944823b04a:00  t(SSP)]  6 Gbps
  phy  12:D:attached:[5000039448239ff2:00  t(SSP)]  6 Gbps
  phy  13:D:attached:[500003944823db72:00  t(SSP)]  6 Gbps
  phy  14:D:attached:[500003944823b7a6:00  t(SSP)]  6 Gbps
  phy  15:D:attached:[500003944823dcc6:00  t(SSP)]  6 Gbps
  phy  16:D:attached:[500003944823787a:00  t(SSP)]  6 Gbps
  phy  17:D:attached:[500003944823b30a:00  t(SSP)]  6 Gbps
  phy  18:D:attached:[500003944823b0de:00  t(SSP)]  6 Gbps
  phy  19:D:attached:[500003944823b506:00  t(SSP)]  6 Gbps
  phy  20:D:attached:[500003944823e7b6:00  t(SSP)]  6 Gbps
  phy  21:D:attached:[5000039448232e7a:00  t(SSP)]  6 Gbps
  phy  22:D:attached:[500003944823cfde:00  t(SSP)]  6 Gbps
  phy  23:D:attached:[500003944823dc52:00  t(SSP)]  6 Gbps
  phy  24:D:attached:[500003944823c9e6:00  t(SSP)]  6 Gbps
  phy  25:S:attached:[500605b0050a7660:03  i(SSP+STP+SMP)]  6 Gbps
  phy  26:S:attached:[500605b0050a7660:02  i(SSP+STP+SMP)]  6 Gbps
  phy  27:S:attached:[500605b0050a7660:01  i(SSP+STP+SMP)]  6 Gbps
  phy  28:S:attached:[500605b0050a7660:00  i(SSP+STP+SMP)]  6 Gbps
  phy  29:T:attached:[500143802344173f:25 exp i(SMP) t(SMP)]  6 Gbps
  phy  30:T:attached:[500143802344173f:26 exp i(SMP) t(SMP)]  6 Gbps
  phy  31:T:attached:[500143802344173f:27 exp i(SMP) t(SMP)]  6 Gbps
  phy  32:T:attached:[500143802344173f:28 exp i(SMP) t(SMP)]  6 Gbps
  phy  33: inaccessible (phy vacant)
  phy  34: inaccessible (phy vacant)
  phy  35: inaccessible (phy vacant)
  phy  36:D:attached:[50014380234782be:36  V i(SSP) t(SSP)]  6 Gbps

# smp_discover ses1
  phy   0:D:attached:[500003944823cbfa:00  t(SSP)]  6 Gbps
  phy   1:D:attached:[500003944823ddbe:00  t(SSP)]  6 Gbps
  phy   2:D:attached:[500003944823d9ee:00  t(SSP)]  6 Gbps
  phy   3:D:attached:[500003944823c2f6:00  t(SSP)]  6 Gbps
  phy   4:D:attached:[500003944823c2ca:00  t(SSP)]  6 Gbps
  phy   5:D:attached:[500003944823d8c2:00  t(SSP)]  6 Gbps
  phy   6:D:attached:[500003944823cc3e:00  t(SSP)]  6 Gbps
  phy   7:D:attached:[500003944823d3e2:00  t(SSP)]  6 Gbps
  phy   8:D:attached:[500003944823d8da:00  t(SSP)]  6 Gbps
  phy   9:D:attached:[500003944823c39a:00  t(SSP)]  6 Gbps
  phy  10:D:attached:[500003944823d896:00  t(SSP)]  6 Gbps
  phy  11:D:attached:[500003944823dbb6:00  t(SSP)]  6 Gbps
  phy  12:D:attached:[500003944823c6a2:00  t(SSP)]  6 Gbps
  phy  13:D:attached:[500003944823dc86:00  t(SSP)]  6 Gbps
  phy  14:D:attached:[500003944823e7e2:00  t(SSP)]  6 Gbps
  phy  15:D:attached:[500003944823cae6:00  t(SSP)]  6 Gbps
  phy  16:D:attached:[500003944823d44a:00  t(SSP)]  6 Gbps
  phy  17:D:attached:[500003944823d27a:00  t(SSP)]  6 Gbps
  phy  18:D:attached:[5000039448234122:00  t(SSP)]  6 Gbps
  phy  19:D:attached:[500003944823b17e:00  t(SSP)]  6 Gbps
  phy  20:D:attached:[500003944823c6fe:00  t(SSP)]  6 Gbps
  phy  21:D:attached:[500003944823435e:00  t(SSP)]  6 Gbps
  phy  22:D:attached:[500003944823c776:00  t(SSP)]  6 Gbps
  phy  23:D:attached:[50000394482341da:00  t(SSP)]  6 Gbps
  phy  24:D:attached:[500003944823cc62:00  t(SSP)]  6 Gbps
  phy  25:S:attached:[50014380234782bf:29 exp i(SMP) t(SMP)]  6 Gbps
  phy  26:S:attached:[50014380234782bf:30 exp i(SMP) t(SMP)]  6 Gbps
  phy  27:S:attached:[50014380234782bf:31 exp i(SMP) t(SMP)]  6 Gbps
  phy  28:S:attached:[50014380234782bf:32 exp i(SMP) t(SMP)]  6 Gbps
  phy  33: inaccessible (phy vacant)
  phy  34: inaccessible (phy vacant)
  phy  35: inaccessible (phy vacant)
  phy  36:D:attached:[500143802344173e:36  V i(SSP) t(SSP)]  6 Gbps

# camcontrol smpmaninfo ses0
Report Manufacturer Information
Expander Change count: 1815
SAS 1.1 Format: Yes
<HP D2700 SAS AJ941A 0147>
Component Vendor: PMCSIERA
Component ID: 0x8005
Component Revision: 0x2
Vendor Specific: 0x0000000000000000




> Fatal trap 12: page fault while in kernel mode
> cpuid = 31; apic id = 2f
> fault virtual address   = 0x0
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80d2b018
> stack pointer           = 0x28:0xffffffbfd0fea080
> frame pointer           = 0x28:0xffffffbfd0fea0b0
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 25 (enc_daemon7)
> 
> and:
> 
> db:0:kdb.enter.default>  show pcpu
> cpuid        = 31
> dynamic pcpu = 0xffffff807f203880
> curthread    = 0xfffffe0032f53920: pid 25 "enc_daemon7"
> curpcb       = 0xffffffbfd0feabc0
> fpcurthread  = none
> idlethread   = 0xfffffe002600b920: tid 100034 "idle: cpu31"
> curpmap      = 0xffffffff8141b510
> tssp         = 0xffffffff81489e98
> commontssp   = 0xffffffff81489e98
> rsp0         = 0xffffffbfd0feabc0
> gs32p        = 0xffffffff81487fd0
> ldt          = 0xffffffff81488010
> tss          = 0xffffffff81488000
> 
> 
> 
>    This looks like a bug I started tracing down a while back with
> the new enclosure services (r246437 and later). I added witness
> into the kernel and received the following LOR:
> 
> 
> Kernel page fault with the following non-sleepable locks held:
> exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0 (0xffffff8003c851b8) locked @ cam/cam_periph.h:192
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffbfd0f3cb20
> kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffbfd0f3cbe0
> _witness_debugger() at _witness_debugger+0x2c/frame 0xffffffbfd0f3cc00
> witness_warn() at witness_warn+0x2d2/frame 0xffffffbfd0f3cd40
> trap_pfault() at trap_pfault+0x6a/frame 0xffffffbfd0f3cdd0
> trap() at trap+0x344/frame 0xffffffbfd0f3cfd0
> calltrap() at calltrap+0x8/frame 0xffffffbfd0f3cfd0
> --- trap 0xc, rip = 0xffffffff80ca8478, rsp = 0xffffffbfd0f3d090, rbp = 0xffffffbfd0f3d0c0 ---
> memcpy() at memcpy+0x8/frame 0xffffffbfd0f3d0c0
> ses_setphyspath_callback() at ses_setphyspath_callback+0xb3/frame 0xffffffbfd0f3d1d0
> ses_path_iter_devid_callback() at ses_path_iter_devid_callback+0x1c6/frame 0xffffffbfd0f3d770
> ses_devids_iter() at ses_devids_iter+0xb1/frame 0xffffffbfd0f3d7f0
> ses_paths_iter() at ses_paths_iter+0x20/frame 0xffffffbfd0f3d810
> ses_publish_physpaths() at ses_publish_physpaths+0x264/frame 0xffffffbfd0f3da40
> enc_daemon() at enc_daemon+0x2a4/frame 0xffffffbfd0f3daa0
> fork_exit() at fork_exit+0x11d/frame 0xffffffbfd0f3daf0
> fork_trampoline() at fork_trampoline+0xe/frame 0xffffffbfd0f3daf0
> --- trap 0, rip = 0, rsp = 0xffffffbfd0f3dbb0, rbp = 0 ---
> 
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 8; apic id = 08
> fault virtual address   = 0x0
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80ca8478
> stack pointer           = 0x28:0xffffffbfd0f3d090
> frame pointer           = 0x28:0xffffffbfd0f3d0c0
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 30 (enc_daemon12)
> lock order reversal: (Giant after non-sleepable)
>  1st 0xffffff8003c851b8 MPT2SAS lock (MPT2SAS lock) @ cam/cam_periph.h:192
>  2nd 0xffffffff8139bc80 Giant (Giant) @ dev/usb/input/ukbd.c:1942
> 
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> I'm wondering if there is a bad interaction here.
> 
> 
> The system has 8 DS2700 shelves dual attached to a pair
> of LSI 8e cards, thus the kernel configuration with an increased
> msgbuf size.
> 
> Kernel conf:
> 
> include   GENERIC
> ident     ZFS
> options   DDB
> options   KDB
> options   WITNESS
> options   MSGBUF_SIZE=(32768*16)
> 
> And some ddb output:
> 
> db:0:kdb.enter.default>  run lockinfo
> db:1:lockinfo> show locks
> exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0 (0xffffff8003c851b8) locked @ cam/cam_periph.h:192
> db:1:locks>  show alllocks
> Process 30 (enc_daemon12) thread 0xfffffe003421a000 (100155)
> exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0 (0xffffff8003c851b8) locked @ cam/cam_periph.h:192
> db:1:alllocks>  show lockedvnods
> Locked vnodes
> db:0:kdb.enter.default>  show pcpu
> cpuid        = 8
> dynamic pcpu = 0xffffff807f1e4800
> curthread    = 0xfffffe003421a000: pid 30 "enc_daemon12"
> curpcb       = 0xffffffbfd0f3dbc0
> fpcurthread  = none
> idlethread   = 0xfffffe0021ffe490: tid 100011 "idle: cpu8"
> curpmap      = 0xffffffff81399590
> tssp         = 0xffffffff815a5640
> commontssp   = 0xffffffff815a5640
> rsp0         = 0xffffffbfd0f3dbc0
> gs32p        = 0xffffffff815a3778
> ldt          = 0xffffffff815a37b8
> tss          = 0xffffffff815a37a8
> spin locks held:
> db:0:kdb.enter.default>  bt
> Tracing pid 30 tid 100155 td 0xfffffe003421a000
> memcpy() at memcpy+0x8/frame 0xffffffbfd0f3d0c0
> ses_setphyspath_callback() at ses_setphyspath_callback+0xb3/frame 0xffffffbfd0f3d1d0
> ses_path_iter_devid_callback() at ses_path_iter_devid_callback+0x1c6/frame 0xffffffbfd0f3d770
> ses_devids_iter() at ses_devids_iter+0xb1/frame 0xffffffbfd0f3d7f0
> ses_paths_iter() at ses_paths_iter+0x20/frame 0xffffffbfd0f3d810
> ses_publish_physpaths() at ses_publish_physpaths+0x264/frame 0xffffffbfd0f3da40
> enc_daemon() at enc_daemon+0x2a4/frame 0xffffffbfd0f3daa0
> fork_exit() at fork_exit+0x11d/frame 0xffffffbfd0f3daf0
> fork_trampoline() at fork_trampoline+0xe/frame 0xffffffbfd0f3daf0
> --- trap 0, rip = 0, rsp = 0xffffffbfd0f3dbb0, rbp = 0 ---
> 
>    Any thoughts/ideas are appreciated. I've reviewed the code and
> don't see anything obvious.
> 
> Thanks,
> John
> 
> 
> 



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130916182747.GB60744>