Date: Fri, 26 Sep 2014 22:18:34 -0700 From: Mark Millard <markmi@dsl-only.net> To: Nathan Whitehorn <nwhitehorn@freebsd.org> Cc: Justin Hibbits <chmeeedalf@gmail.com>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org> Subject: Re: lr=u_trap+0x10 and srr0=k_trap+0x28 for "stopped at 0 illegal instruction 0" before-copyright hang on PowerMac G5's Message-ID: <7008CDAA-2DA2-419F-9BEC-AD823ECBFCCC@dsl-only.net> In-Reply-To: <34AA4542-56A7-453E-A00E-868EE352C96C@dsl-only.net> References: <1118046C-0FF7-49FC-82DA-DB9A7A310991@dsl-only.net> <2ED3DB50-B985-4382-8FF2-3B44E7E65453@dsl-only.net> <CAHSQbTBXxrgXQdNeCs=C5wJaT_bmh9FU836O6VnJDbJuqCUujw@mail.gmail.com> <6D729F43-662A-429E-9503-0148EC3250B1@dsl-only.net> <72535F89-3942-45A6-B351-7F746209ED9F@dsl-only.net> <0703EF26-6E33-4446-9273-BBFD0CB72893@dsl-only.net> <37575F94-763C-43BF-8DD9-F648F4A7C09F@dsl-only.net> <5422E513.6010806@freebsd.org> <1C02D0D4-14B8-465F-B493-4D3A64E4C35C@dsl-only.net> <FDA60573-A9BD-4793-8273-22E960805CFF@dsl-only.net> <0DF8A9EC-C81C-4E15-9420-6831BA7D5F8E@dsl-only.net> <54248467.4050900@freebsd.org> <D7D2B0EF-7EEB-48E7-9485-990BACC709EE@dsl-only.net> <34AA4542-56A7-453E-A00E-868EE352C96C@dsl-only.net>
next in thread | previous in thread | raw e-mail | index | archive | help
The first send of this was big enough for the moderator to be involved. = So I canceled and am sending with less history included. [I'll note that I seem to have trouble typing 0xdbb290 vs. 0xbdd290. The = actual value is 0xdbb290. The references to the incorrect typing should = say 0xbdd290, which is the wrong value. But I've had both types of = references listing the wrong text... in various notes.] =3D=3D=3D Mark Millard markmi@dsl-only.net On Sep 26, 2014, at 10:11 PM, Mark Millard <markmi@dsl-only.net> wrote: The openfirmware peer crash (i.e., the before Copyright notice crash) = happens during/just-after the MMU setup and the peer pfwcall is the = first ofwcall where pmap_bootstrapped is non-zero at the time. In other = words: the very first ofwcall in the new context fails. And this failure involves some of the same code area that I got a = backtrace for and reported as a separate crash (with the trace listed). = As a reminder for that backtrace that has a difference failure point: .pvo_vaddr_compare+0x14, instruction ld r0, r4, 0x58 [or ld r0,88(r4) in = an alternate notation] .pvo_tree_RB_FIND+0x38 .moea64_dev_direct_mapped_0x90 .pmap_dev_direct_mapped+0x84 ("_dev" was missing in earlier note) .bs_remap_earlyboot_0x6c .moea64_late_bootstrap+0x178 .moea64_bootstrap_native+0x120 .pmap_bootstrap+0xac .powerpc_init+0x514 btext+0xa8 As for the sequence of ofwcall's that I reported: starting at the last = OF_finddevice before the OF_instance_to_package that I reported in the = sequence of ofwcall's from quiesce until the crash... moea64_late_bootstrap does chosen =3D OF_finddevice("/chosen"); if (chosen !=3D -1 && OF_getprop(chosen, "mmu", &mmui, 4) !=3D = -1) { mmu =3D OF_instance_to_package(mmui); if (mmu =3D=3D -1 || (sz =3D OF_getproplen(mmu, = "translations")) =3D=3D -1) sz =3D 0; if (sz > 6144 /* tmpstksz - 2 KB headroom */) panic("moea64_bootstrap: too many ofw translations"); =20 if (sz > 0) moea64_add_ofw_mappings(mmup, mmu, sz); } with moea64_add_ofw_mappings called. Then... moea64_add_ofw_mappings does... bzero(translations, sz); OF_getprop(OF_finddevice("/"), "#address-cells", &acells, sizeof(acells)); if (OF_getprop(mmu, "translations", trans_cells, sz) =3D=3D -1) panic("moea64_bootstrap: can't get ofw translations"); And it is the next ofwcall after that last OF_getprop that fails. (It = happens to be a peer request.) Adding a dump of the pmap_bootstrapped = value with the ofwcall name in my hack for reporting things about the = crash confirmed that peer ofwcall as the first with pmap_bootstrapped = non-zero. I will note here that it is somewhat later than the above code that = pvo_vaddr_compare ends up executing via bs_remap_earlyboot. That earlier = moea64_late_bootstrap code continues after the } from the first if above = with: /* * Calculate the last available physical address. */ for (i =3D 0; phys_avail[i + 2] !=3D 0; i +=3D 2) ; Maxmem =3D powerpc_btop(phys_avail[i + 1]); /* * Initialize MMU and remap early physical mappings */ MMU_CPU_BOOTSTRAP(mmup,0); mtmsr(mfmsr() | PSL_DR | PSL_IR); pmap_bootstrapped++; bs_remap_earlyboot(); (and more). I've not found the peer call yet but it may well be after = the pvo_vaddr_compare shown above as far as execution order goes. =3D=3D=3D Mark Millard markmi at dsl-only.net On Sep 25, 2014, at 2:41 PM, Mark Millard <markmi at dsl-only.net> = wrote: The first boot after make -8 kernel without quiesce also died during = peer, I'd guess the same one. Looks like quiesce does not matter for the issue. (But it is handy for = identifying which peer fails.) =3D=3D=3D Mark Millard markmi at dsl-only.net On Sep 25, 2014, at 2:08 PM, Nathan Whitehorn <nwhitehorn at = freebsd.org> wrote: Can you comment out the call to quiesce? It may not be necessary on your = system. -Nathan On 09/25/14 13:17, Mark Millard wrote: > The "before copyright" hang/exception is during the first openfirmware = "peer" after "quiesce". The ofw_restore_trap_vec(save_trap_init) = completes fine, the ofwcall(args) is made but it does not return = normally. >=20 > Ignoring the ofwcall's from before quiesce, the sequence of ofwcall's = is: >=20 > quiesce > finddevice > parent > getprop > getprop > getprop > finddevice > getprop > instance-to-package > getproplen > finddevice > getprop > getprop > peer >=20 > And when the boot fails before the copyright that ofwcall for peer = ends up resulting in the register dump with no register pointing to the = kernel's normal stack area. >=20 > I still have no clue what is happening during peer. = ofw_restore_trap_vec(save_trap_init) is being called and is returning = before ofwcall is used. For all I know some uses of peer could require = not being quiesce'd in order for peer to be reliable. >=20 > In the form of my display indicating what executed the text reported = ends in: >=20 > <peer>^ >=20 > where the ^ indicates the stage that last completed in the call = sequence inside openfirmware_core. This information is displayed by the >=20 > x/s ofw_name_history >=20 > in the automatically created default script for DDB. I read the = sequence backwards from the end marker (here ^), following the = wraparound if there is that much text and if I care to go back that far. >=20 > FreeBSD FBSDG5M1 10.1-BETA2 FreeBSD 10.1-BETA2 #11 r271944M: Thu Sep = 25 12:14:05 PDT 2014 root@FBSDG5M1:/usr/obj/usr/src/sys/GENERIC64 = powerpc >=20 > My current hacks to get this information are: >=20 > Index: /usr/src/sys/ddb/db_script.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- /usr/src/sys/ddb/db_script.c (revision 271944) > +++ /usr/src/sys/ddb/db_script.c (working copy) > @@ -319,10 +319,25 @@ > { > char scriptname[DB_MAXSCRIPTNAME]; > =20 > + /* HACK!!! : Additional lines to force a basic default script to = exist. > + * Will dump information even if ddb input is not available for = early crash. > + * Used to get more information about PowerMac G5 "before Copyright" = hangs. > + */ > + struct ddb_script *dsp =3D = db_script_lookup(DB_SCRIPT_KDBENTER_DEFAULT); > + if (!dsp) db_script_set(DB_SCRIPT_KDBENTER_DEFAULT, "show registers; = bt; x/s ofw_name_history"); > + > snprintf(scriptname, sizeof(scriptname), "%s.%s", > DB_SCRIPT_KDBENTER_PREFIX, eventname); > if (db_script_exec(scriptname, 0) =3D=3D ENOENT) > (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0); > + > + /* HACK!!! : Additional lines to always use the default script, > + * even if scriptname existed and was executed. > + * Will dump information even if ddb input is not available for = early crash. > + * Used to get more information about PowerMac G5 "before Copyright" = hangs. > + */ > + else > + (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0); > } > =20 > /*- > Index: /usr/src/sys/powerpc/conf/GENERIC64 > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- /usr/src/sys/powerpc/conf/GENERIC64 (revision 271944) > +++ /usr/src/sys/powerpc/conf/GENERIC64 (working copy) > @@ -76,6 +76,8 @@ > # Debugging support. Always need this: > options KDB # Enable kernel debugger support. > options KDB_TRACE # Print a stack trace for a panic. > +options DDB > +options GDB > =20 > # Make an SMP-capable kernel by default > options SMP # Symmetric MultiProcessor Kernel > Index: /usr/src/sys/powerpc/ofw/ofw_machdep.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- /usr/src/sys/powerpc/ofw/ofw_machdep.c (revision 271944) > +++ /usr/src/sys/powerpc/ofw/ofw_machdep.c (working copy) > @@ -324,6 +324,12 @@ > openfirmware(&args); > } > =20 > +/* Part of HACK to have record of ofw call names */ > +#define ofw_name_history_record_size 256 > +char ofw_name_history[ofw_name_history_record_size+1] =3D {}; /* = Initially: automatically '\0' filled */ > +char * ofw_name_history_pos =3D ofw_name_history; > +/* End Part of HACK */ > + > static int > openfirmware_core(void *args) > { > @@ -330,6 +336,42 @@ > int result; > register_t oldmsr; > =20 > + { /* HACK to have record of ofw call names */ > + struct argtype_prefix { > + cell_t name; > + }; > + > + char *name =3D (char*) (uintptr_t) (((struct = argtype_prefix*)args)->name); > +=20 > + int i; > + > + *ofw_name_history_pos =3D '<'; > + > + for(i=3D0; (*name) && i!=3D20; i++) { > + ofw_name_history_pos++; > + if (ofw_name_history_pos =3D=3D = &ofw_name_history[ofw_name_history_record_size]) { > + ofw_name_history_pos =3D ofw_name_history; > + } > + *ofw_name_history_pos =3D *name; > + > + name++; > + } > + > + ofw_name_history_pos++; > + if (ofw_name_history_pos =3D=3D = &ofw_name_history[ofw_name_history_record_size]) { > + ofw_name_history_pos =3D ofw_name_history; > + } > + *ofw_name_history_pos =3D '>'; > + > + ofw_name_history_pos++; > + if (ofw_name_history_pos =3D=3D = &ofw_name_history[ofw_name_history_record_size]) { > + ofw_name_history_pos =3D ofw_name_history; > + } > + *ofw_name_history_pos =3D '@'; > + > + ofw_name_history[ofw_name_history_record_size] =3D '\0'; /* Paranoia = */ > + } /* HACK end */ > + > /* > * Turn off exceptions - we really don't want to end up > * anywhere unexpected with PCPU set to something strange > @@ -337,14 +379,22 @@ > */ > oldmsr =3D intr_disable(); > =20 > + *ofw_name_history_pos =3D '#'; /* HACK */ > + > ofw_sprg_prepare(); > =20 > + *ofw_name_history_pos =3D '$'; /* HACK */ > + > /* Save trap vectors */ > ofw_save_trap_vec(save_trap_of); > =20 > + *ofw_name_history_pos =3D '%'; /* HACK */ > + > /* Restore initially saved trap vectors */ > ofw_restore_trap_vec(save_trap_init); > =20 > + *ofw_name_history_pos =3D '^'; /* HACK */ > + > #if defined(AIM) && !defined(__powerpc64__) > /* > * Clear battable[] translations > @@ -357,13 +407,21 @@ > =20 > result =3D ofwcall(args); > =20 > + *ofw_name_history_pos =3D '&'; /* HACK */ > + > /* Restore trap vecotrs */ > ofw_restore_trap_vec(save_trap_of); > =20 > + *ofw_name_history_pos =3D '*'; /* HACK */ > + > ofw_sprg_restore(); > =20 > + *ofw_name_history_pos =3D '~'; /* HACK */ > + > intr_restore(oldmsr); > =20 > + *ofw_name_history_pos =3D '!'; /* HACK */ > + > return (result); > } >=20 >=20 >=20 >=20 >=20 > =3D=3D=3D > Mark Millard > markmi at dsl-only.net >=20 > On Sep 25, 2014, at 3:46 AM, Mark Millard <markmi at dsl-only.net> = wrote: >=20 > One source code oddity that I notice is the following mixed use of = ofw_real_mode: always tested vs. never tested (#if 0 ... #endif) ... >=20 >> /* >> * Saved SPRG0-3 from OpenFirmware. Will be restored prior to the = callback. >> */ >> register_t ofw_sprg0_save; >>=20 >> static __inline void >> ofw_sprg_prepare(void) >> { >> if (ofw_real_mode) >> return; >>=20 >> /* >> * Assume that interrupt are disabled at this point, or >> * SPRG1-3 could be trashed >> */ >> __asm __volatile("mfsprg0 %0\n\t" >> "mtsprg0 %1\n\t" >> "mtsprg1 %2\n\t" >> "mtsprg2 %3\n\t" >> "mtsprg3 %4\n\t" >> : "=3D&r"(ofw_sprg0_save) >> : "r"(ofmsr[1]), >> "r"(ofmsr[2]), >> "r"(ofmsr[3]), >> "r"(ofmsr[4])); >> } >> =20 >> static __inline void >> ofw_sprg_restore(void) >> { >> #if 0 >> if (ofw_real_mode) >> return; >> #endif >>=20 >> /* >> * Note that SPRG1-3 contents are irrelevant. They are = scratch >> * registers used in the early portion of trap handling when >> * interrupts are disabled. >> * >> * PCPU data cannot be used until this routine is called ! >> */ >> __asm __volatile("mtsprg0 %0" :: "r"(ofw_sprg0_save)); >> } >=20 > It would seem that for ofw_real_mode !=3D 0 that ofw_sprg_prepare = would never set up ofw_sprg0_save (via mfsprg0) for the later = ofw_sprg_restore's always-executed mtsprg0 that is based on = ofw_sprg0_save. >=20 > register_t seems to trace back to __int64_t --and that would leave = ofw_sprg0_save initialized to zero as a global and that would have to be = okay as the SPRG0 value to restore in such a case. (I have not tracked = down what any of the per-processor values for SPRG0 are/should-be.) >=20 >=20 >=20 >=20 >=20 >=20 > =3D=3D=3D > Mark Millard > markmi at dsl-only.net >=20 > On Sep 25, 2014, at 2:12 AM, Mark Millard <markmi at dsl-only.net> = wrote: >=20 > The register dump that has no kernel stack addresses in any registers = does have register contents suggesting a ofwcall use, matching up = reasonably with the code I looked at that is related to ofwcall. ofwcall = is only reached via openfirmware_core from what I can tell. (If there = are other paths into openfirmware than via ofwcall then the register = dump suggests that they are not in use around the crash.) >=20 > And openfirmware_core has logic for exception vector swapping, going = both directions: >=20 >> static int >> openfirmware_core(void *args) >> { >> int result; >> register_t oldmsr; >> =20 >> /* >> * Turn off exceptions - we really don't want to end up >> * anywhere unexpected with PCPU set to something strange >> * or the stack pointer wrong. >> */ >> oldmsr =3D intr_disable(); >> =20 >> ofw_sprg_prepare(); >> =20 >> /* Save trap vectors */ >> ofw_save_trap_vec(save_trap_of); >> =20 >> /* Restore initially saved trap vectors */ >> ofw_restore_trap_vec(save_trap_init); >> =20 >> #if defined(AIM) && !defined(__powerpc64__) >> /* >> * Clear battable[] translations >> */ >> if (!(cpu_features & PPC_FEATURE_64)) >> __asm __volatile("mtdbatu 2, %0\n" >> "mtdbatu 3, %0" : : "r" (0)); >> isync(); >> #endif >>=20 >> result =3D ofwcall(args); >>=20 >> /* Restore trap vecotrs */ >> ofw_restore_trap_vec(save_trap_of); >>=20 >> ofw_sprg_restore(); >>=20 >> intr_restore(oldmsr); >>=20 >> return (result); >> } >=20 > In turn openfirmware_core is used only by ofw_rendezvous_dispatch and = in turn that is used only by openfirmware: only PCPU_GET(cpuid) =3D=3D 0 = does the above. save_trap_init is initialized by powerpc_init using = ofw_save_trap_vec. >=20 > [Note that ofw_restore_trap_vec uses __syncicache which does not use = dcbf after the bcopy but instead uses dcbst: That is part of what lead = my investigation into the distinction --and so to my more overall dcbst = vs. dcbf use questions after proving dcbf would not be sufficient for a = fix to the specific boot issue.] >=20 > Unless the initialization of save_trap_init ends up with the wrong = contents for openfirmware it would appear that the exception vectors are = kept tracking by the above code. But the above does assume that the = openfirmware vectors are unchanged after save_trap_init is initialized: = there is no attempt at tracking of any potential updates to the = openfirmware exception vectors. >=20 > I would infer then that after ofw_restore_trap_vec(save_trap_of) is = executed is when the exception that DDB reports happened: That is when = FreeBSD's exception vectors are again in place. But a stack pointer into = the kernel stack is not then in place in any register (based on DDB's = register dump): stack handling is messed up already by the point of the = reported exception. And that may actually be why an illegal instruction = at address zero was reached: an incorrect stack context used to get an = address to execute at. >=20 >=20 >=20 >=20 > =3D=3D=3D > Mark Millard > markmi at dsl-only.net >=20 > On Sep 24, 2014, at 8:36 AM, Nathan Whitehorn <nwhitehorn at = freebsd.org> wrote: >=20 > There shouldn't be any exceptions at that point, nested or otherwise. = What I suspect is happening is that Open Firmware has turned them on for = some bizarre reason, taken one, and ended up in the kernel's handlers = but with the Open Firmware environment. Saving and restoring the OF = interrupt vectors would be a possible solution; flattening the device = tree in loader so that the kernel doesn't call Open Firmware at all = would be another. I think Justin may have tried the first at some point. > -Nathan >=20 > On 09/24/14 02:04, Mark Millard wrote: >> Now that I've had a kernel/boot crash with a successful DDB bt and = show registers (a different submittal) it makes for a good = comparison/contrast with what DDB reports for this "before copyright" = crash. >>=20 >> Something unique to the "before copyright" context is... >>=20 >> No registers are reported to have values that point into the range = between tmpstk and esym. >>=20 >> In other words: There is no valid stack pointer reported as far as I = can tell. r1 has the value 0 instead of being a handling a valid stack = address. tmpstk=3D0xbd7000 and esym=3D0xbdb000 (example for one of my = WITH_DEBUG_FILES=3D and options DDB and GDB builds of 10.1-BETA2). That = at least gives a ball park on the range to expect for pointing into the = stack even with some build variation. >>=20 >> It leaves me wondering if the DDB report is for a nested exception = handling. That could explain why lr points to u_trap+0x10 and srr0 = points to k_trap+0x28 when normally srr0 would point to the the failing = instruction (or the instruction after) and lr to where that routine = would normally return to. >>=20 >> The register values that are reported for my 10.1-BETA2 builds that = crash before the copyright notice are: >>=20 >> r0: 0 >> r1: 0 >> r2: 0xc81538 vop_unlock_desc >> r3: 0xd18868 >> r4: 0x894b58 >> r5: 0 >> r6: 0xc1dee0 M_AUDITBSM >> r7: 0xe3f818 ofw_real_mode >> r8: 0x1 >> r9: 0xe0f580 __pcpu >> r10: 0x1c35ec0 >> r11: 0 >> r12: 0x10000000 >> r13: 0xdbb290 thread0 (Note: another submittal has this mistyped as = 0xdbb290.) >> r14-r19: all 0 >> r20: 0x10c1000 >> r21: 0x4 >> r22: 0x180abd4 >> r23: 0x1803a28 >> r24: 0xc000000000008760 >> r25: 0xcc89b8 smp_no... >> r26: 0xcea108 ofw_rend... >> r27: 0x894b58 ofwcall+0xa8 >> r28: 0x894b58 ofwcall+0xa8 >> r29: 2400022 >> r30: 9000000000001032 >> r31: 0xbb7d38 >>=20 >> srr0: 0x102720 k_trap+0x28 >> srr1: 9000000000001032 >> lr: 0x1026f0 u_trap+0x10 >> ctr: 0xff846d78 >> cr: 2000deb0 >> xer: 0 >> dar: f...d50 (lots of f's) >> dsisr: 42000000 >>=20 >>=20 >>=20 >>=20 >>=20 >>=20 >> =3D=3D=3D >> Mark Millard >> markmi at dsl-only.net
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?7008CDAA-2DA2-419F-9BEC-AD823ECBFCCC>