Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 24 Sep 2014 08:36:51 -0700
From:      Nathan Whitehorn <nwhitehorn@freebsd.org>
To:        Mark Millard <markmi@dsl-only.net>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Cc:        Justin Hibbits <chmeeedalf@gmail.com>
Subject:   Re: lr=u_trap+0x10 and srr0=k_trap+0x28 for "stopped at 0 illegal instruction 0" before-copyright hang on PowerMac G5's
Message-ID:  <5422E513.6010806@freebsd.org>
In-Reply-To: <37575F94-763C-43BF-8DD9-F648F4A7C09F@dsl-only.net>
References:  <1118046C-0FF7-49FC-82DA-DB9A7A310991@dsl-only.net> <2ED3DB50-B985-4382-8FF2-3B44E7E65453@dsl-only.net> <CAHSQbTBXxrgXQdNeCs=C5wJaT_bmh9FU836O6VnJDbJuqCUujw@mail.gmail.com> <6D729F43-662A-429E-9503-0148EC3250B1@dsl-only.net> <72535F89-3942-45A6-B351-7F746209ED9F@dsl-only.net> <0703EF26-6E33-4446-9273-BBFD0CB72893@dsl-only.net> <37575F94-763C-43BF-8DD9-F648F4A7C09F@dsl-only.net>

next in thread | previous in thread | raw e-mail | index | archive | help
There shouldn't be any exceptions at that point, nested or otherwise. 
What I suspect is happening is that Open Firmware has turned them on for 
some bizarre reason, taken one, and ended up in the kernel's handlers 
but with the Open Firmware environment. Saving and restoring the OF 
interrupt vectors would be a possible solution; flattening the device 
tree in loader so that the kernel doesn't call Open Firmware at all 
would be another. I think Justin may have tried the first at some point.
-Nathan

On 09/24/14 02:04, Mark Millard wrote:
> Now that I've had a kernel/boot crash with a successful DDB bt and 
> show registers (a different submittal) it makes for a good 
> comparison/contrast with what DDB reports for this "before copyright" 
> crash.
>
> Something unique to the "before copyright" context is...
>
> No registers are reported to have values that point into the range 
> between tmpstk and esym.
>
> In other words: There is no valid stack pointer reported as far as I 
> can tell. r1 has the value 0 instead of being a handling a valid stack 
> address. tmpstk=0xbd7000 and esym=0xbdb000 (example for one of my 
> WITH_DEBUG_FILES= and options DDB and GDB builds of 10.1-BETA2). That 
> at least gives a ball park on the range to expect for pointing into 
> the stack even with some build variation.
>
> It leaves me wondering if the DDB report is for a nested exception 
> handling. That could explain why lr points to u_trap+0x10 and srr0 
> points to k_trap+0x28 when normally srr0 would point to the the 
> failing instruction (or the instruction after) and lr to where that 
> routine would normally return to.
>
> The register values that are reported for my 10.1-BETA2 builds that 
> crash before the copyright notice are:
>
> r0: 0
> r1: 0
> r2: 0xc81538 vop_unlock_desc
> r3: 0xd18868
> r4: 0x894b58
> r5: 0
> r6: 0xc1dee0 M_AUDITBSM
> r7: 0xe3f818 ofw_real_mode
> r8: 0x1
> r9: 0xe0f580 __pcpu
> r10: 0x1c35ec0
> r11: 0
> r12: 0x10000000
> r13: 0xdbb290 thread0 (Note: another submittal has this mistyped as 
> 0xdbb290.)
> r14-r19: all 0
> r20: 0x10c1000
> r21: 0x4
> r22: 0x180abd4
> r23: 0x1803a28
> r24: 0xc000000000008760
> r25: 0xcc89b8 smp_no...
> r26: 0xcea108 ofw_rend...
> r27: 0x894b58 ofwcall+0xa8
> r28: 0x894b58 ofwcall+0xa8
> r29: 2400022
> r30: 9000000000001032
> r31: 0xbb7d38
>
> srr0: 0x102720 k_trap+0x28
> srr1: 9000000000001032
> lr: 0x1026f0 u_trap+0x10
> ctr: 0xff846d78
> cr: 2000deb0
> xer: 0
> dar: f...d50 (lots of f's)
> dsisr: 42000000
>
>
>
>
>
>
> ===
> Mark Millard
> markmi at dsl-only.net <http://dsl-only.net>;
>
> On Sep 20, 2014, at 3:42 PM, Mark Millard <markmi at dsl-only.net 
> <http://dsl-only.net>>; wrote:
>
> [I corrected the SSR0 in the subject to be SRR0.]
>
> I did miss a register in my list (it matched the shown r30 value). And 
> it turns out to probably be very important to interpreting what the 
> "show registers" is reporting:
>
> SRR1: 0x9000000000001032
>
> But bits 43-46 of SRR1 are supposed to indicate which type of Program 
> Exception, using a single binary 1 to so. No such 1's are present.
>
> Illegal instruction would have been bit 44 being 1. (PowerPC has the 
> upper bit numbered zero and increases from there.)
>
> So the ddb "show registers" is apparently not reporting the status as 
> of when the "stopped at 0 illegal instruction 0" happened. Thus other 
> things are also likely not from that exact time frame.
>
>
>
> And I misinterpreted the LR value status: The LR value was just left 
> over from the restore_kernsrs returning when it finished. Execution 
> then flowed into k_trap. Nothing unusual involved.
>
>
>
>
>
> ===
> Mark Millard
> markmi@dsl-only.net <mailto:markmi@dsl-only.net>
>
> On Sep 18, 2014, at 8:57 PM, Mark Millard <markmi@dsl-only.net 
> <mailto:markmi@dsl-only.net>> wrote:
>
> I modified DDB to automatically "show registers" even at the early 
> "before Copyright" crash time. The end of this note will show the 
> /usr/src/sys/ddb/db_script.c diff for the hack. While I also had DDB 
> bt, the bt does not actually print a back trace for this context. (It 
> might for others.)
>
> The registers give interesting context despite the lack of a back 
> trace. I do not know if it will be sufficient to be of much immediate 
> help if someone used the information to start looking at the problem.
>
> I'll start with register lr: 0x1026f0 u_trap+0x10.
>
> /usr/src/sys/powerpc/aim/trap_subr64.S has:
>
> s_trap:
>         bf      17,k_trap               /* branch if PSL_PR is false */
>         GET_CPUINFO(%r1)
> u_trap:
>         ld      %r1,PC_CURPCB(%r1)
>         mr      %r27,%r28               /* Save LR, r29 */
>         mtsprg2 %r29
>         bl      restore_kernsrs         /* enable kernel mapping */
>         mfsprg2 %r29
>         mr      %r28,%r27
>
> /*
>  * Now the common trap catching code.
>  */
> k_trap:
>         FRAME_SETUP(PC_TEMPSAVE)
> /* Call C interrupt dispatcher: */
> trapagain:
>
> and so this appears to indicate a pending return to execute the 
> "mfsprg2 %r29" after "bl restore_kernsrs", which indicates that 
> restore_kernsrs should be active.
>
> But register srr0 indicates: 0x102720 k_trap+0x28. (So apparently in 
> FRAME_SETUP(PC_TEMPSAVE) someplace.)
>
> So it appears to me that the processor got to the k_trap code during 
> the supposed restore_kernsrs time frame. (But I'm no expert at these 
> sorts of things or for the processor.)
>
> I'll list the other register values:
>
> r0:  0
> r1:  0
> r2:  0xc1be80 M_AUDITBSM
> r3:  0xb16138
> r4:  0x8926e8 .ofwcall+0xa8
> r5:  0
> r6:  0xbb5f90
> r7:  0xe3d118 ofw_real_mode
> r8:  0x1
> r9:  0xe0ce80 __pcpu
> r10: 0x1c35ec9
> r11: 0
> r12: 0x10000000
> r13: db890    thread0
> r14-r19: all 0
> r20: 0x10bc000
> r21: 0x4
> r22: 0x1801db4
> r23: 0x1803a28
> r24: 0xc000000000008760
> r25: 0xcc6908 smp_no_rendevous_barrier
> r26: 0xec79e0 ofw_rendezvous_dispatch (yep one has v and the other zv)
> r27: 0x8926e8 .ofwcall+0xa8
> r28: 0x8926e8 .ofwcall+0xa8 (yep: same value)
> r29: 0x24000022
> r30: 0x9000000000001032
> r31: 0xc7f488 vop_unlock_desc
>
> ctr: 0xff846d78
> cr:  0x2000d7b0
> xer: 0
> dar: 0xfffffffffffffd50
> dsisr: 0x42000000
>
> (Hopefully this manual transcription from the screen display is 
> complete --and also accurate for what it does present.)
>
>
>
>
> The personal HACK to /usr/src/sys/ddb/db_script.c's 
> db_script_kdbenter(...) to have it show registers and try bt...
>
> $ cd /usr/src/sys/ddb/
> $ svnlite diff .
> Index: db_script.c
> ===================================================================
> --- db_script.c(revision 271610)
> +++ db_script.c(working copy)
> @@ -319,10 +319,25 @@
>  {
> char scriptname[DB_MAXSCRIPTNAME];
>
> +/* HACK!!! : Additional lines to force a basic default script to exist.
> +* Will dump information even if ddb input is not available for early 
> crash.
> +* Used to get more information about PowerMac G5 "before Copyright" 
> hangs.
> +*/
> +struct ddb_script *dsp = db_script_lookup(DB_SCRIPT_KDBENTER_DEFAULT);
> +if (!dsp) db_script_set(DB_SCRIPT_KDBENTER_DEFAULT, "show registers; 
> bt");
> +
> snprintf(scriptname, sizeof(scriptname), "%s.%s",
> DB_SCRIPT_KDBENTER_PREFIX, eventname);
> if (db_script_exec(scriptname, 0) == ENOENT)
> (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
> +
> +/* HACK!!! : Additional lines to always use the default script,
> +*           even if scriptname existed and was executed.
> +* Will dump information even if ddb input is not available for early 
> crash.
> +* Used to get more information about PowerMac G5 "before Copyright" 
> hangs.
> +*/
> +else
> +(void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
>  }
>
>  /*-
>
>
>
> ===
> Mark Millard
> markmi at dsl-only.net <http://dsl-only.net/>;
>
> On Sep 16, 2014, at 9:28 PM, Mark Millard <markmi at dsl-only.net 
> <http://dsl-only.net/>>; wrote:
>
> In part I sent directly to you because of a past exchange (July-27) 
> where you had written:
>
>> Nathan and I both speculate that it's
>> dropping into Open Firmware (we make extensive use of OFW), and then
>> messing something up, taking a page fault or something.
>
> The specific text that I report and its uniformity when it is produced 
> seems to add a little information beyond a speculated "page fault or 
> something" and so might eventually help a little. As I understand the 
> text it is reporting execution reaching address zero without any prior 
> un-handled exceptions or other such that would stop it. A corrupted 
> stack (pointer) so a bad return address or some such? I'd guess there 
> are no explicit jumps to address zero so I expect that indirection is 
> likely involved, with the content for the indirection messed up.
>
> I really wish that I had a logic analyzer configuration for this. I've 
> not found a way to make the failing context visible so far and the 
> extra way of looking at things might have helped.
>
>
>
>
> ===
> Mark Millard
> markmi@dsl-only.net <mailto:markmi@dsl-only.net>
>
> On Sep 16, 2014, at 8:28 PM, Justin Hibbits <chmeeedalf@gmail.com 
> <mailto:chmeeedalf@gmail.com>> wrote:
>
> Hi mark,
>
> I see this on my G5, and I think it's due to the amount of RAM in the 
> machine. More than 4gb seems to confuse open firmware when called by 
> FreeBSD. There is some effort to remove the need of the callbacks but 
> thus far it's not far along. The good news is that after it boots it's 
> solid except when switching vtys, buy earlier this year or last year I 
> added a sysctl hack to disable the call into open firmware on vty 
> switch (don't recall offhand and not at my computer right now, but if 
> you grep the sysctl output for reset and ofw you can find it).
>
> -Justin
>
> On Sep 16, 2014 8:01 PM, "Mark Millard" <markmi@dsl-only.net 
> <mailto:markmi@dsl-only.net>> wrote:
>
>     I've now spent time with rebooting and power-off/power-on for all
>     3 PowerMac G5's (one PowerMac7,2 and two PowerMac11,2's) and all 3
>     get the
>
>>     GDB: no debug ports present
>>     KDB: debugger backends: DDB
>>     KDB: current backend: DDB
>>     [ thread pid -1 tid 1006665719 ]
>>     Stopped at 0: illegal instruction 0
>>     db>
>
>     when they fail just before the Copyright notice would normally be
>     displayed. None fail any earlier. At that spot none have failed
>     any other way. It is the same SSD in all 3. (Happens with other
>     SSD's as well.) Overall there is a mix of Radeon and NVIDIA
>     display boards. Besides the SSD use and RAM upgrades the rest is
>     stock equipment. scons used, not vt. (I've yet to try vt.)
>
>     Seeing a failure after the Copyright notice as been fairly rare in
>     all my experiments from when I started last April or so. The ones
>     that I've noted had Data Storage Interrupt reported. So far no
>     examples of the above have been reported after the Copyright
>     notice. So I'd guess that they are separate issues. Of course it
>     seems that only in the last few days would I have seen the above
>     sort of thing if it did happen after the Copyright notice: The
>     prior history does not count for judgements about that.
>
>     ===
>     Mark Millard
>     markmi at dsl-only.net <http://dsl-only.net/>;
>
>     On Sep 16, 2014, at 8:15 AM, Mark Millard <markmi@dsl-only.net
>     <mailto:markmi@dsl-only.net>> wrote:
>
>     Using 10.1-BETA1 I added "options DDB" and "options GDB" to
>     powerpc64's GENERIC64. (I also used WITH_DEBUG_FILES=,
>     WITHOUT_CLANG=, and WITH_DEBUG= in /etc/make.conf.) So buildworld,
>     kernel was basically just set up to have more of a debugging
>     context around (including for any ports builds).
>
>     The result was new information about the PowerMac G5 boot hangups:
>     The screen is no longer blank when the G5 is hung up without there
>     being a Copyright notice yet. It says...
>
>>     GDB: no debug ports present
>>     KDB: debugger backends: DDB
>>     KDB: current backend: DDB
>>     [ thread pid -1 tid 1006665719 ]
>>     Stopped at 0: illegal instruction 0
>>     db>
>
>     (I had no ability to input at that point.) Normally the Copyright
>     notice would have displayed instead of "[...]" and what follows.
>     (I do not claim to have all the spacing, capitalization, and such
>     correct above.)
>
>     That text is constant from hang to hang when it hangs just before
>     it would normally output the Copyright notice: The numbers do not
>     vary, much less the other text. It has never failed until after
>     the two KDB messages are present. So far I've only tested one
>     PowerMac G5, booting over and over for a few hours.
>
>
>
>     (I do not claim to be set up for remote kernel debugging. I just
>     decided to let GDB go along for the ride when I added DDB.)
>
>     ===
>     Mark Millard
>     markmi at dsl-only.net <http://dsl-only.net/>;
>
>
>
>
>
>




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5422E513.6010806>