Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 26 Sep 2014 22:18:34 -0700
From:      Mark Millard <markmi@dsl-only.net>
To:        Nathan Whitehorn <nwhitehorn@freebsd.org>
Cc:        Justin Hibbits <chmeeedalf@gmail.com>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: lr=u_trap+0x10 and srr0=k_trap+0x28 for "stopped at 0 illegal instruction 0" before-copyright hang on PowerMac G5's
Message-ID:  <7008CDAA-2DA2-419F-9BEC-AD823ECBFCCC@dsl-only.net>
In-Reply-To: <34AA4542-56A7-453E-A00E-868EE352C96C@dsl-only.net>
References:  <1118046C-0FF7-49FC-82DA-DB9A7A310991@dsl-only.net> <2ED3DB50-B985-4382-8FF2-3B44E7E65453@dsl-only.net> <CAHSQbTBXxrgXQdNeCs=C5wJaT_bmh9FU836O6VnJDbJuqCUujw@mail.gmail.com> <6D729F43-662A-429E-9503-0148EC3250B1@dsl-only.net> <72535F89-3942-45A6-B351-7F746209ED9F@dsl-only.net> <0703EF26-6E33-4446-9273-BBFD0CB72893@dsl-only.net> <37575F94-763C-43BF-8DD9-F648F4A7C09F@dsl-only.net> <5422E513.6010806@freebsd.org> <1C02D0D4-14B8-465F-B493-4D3A64E4C35C@dsl-only.net> <FDA60573-A9BD-4793-8273-22E960805CFF@dsl-only.net> <0DF8A9EC-C81C-4E15-9420-6831BA7D5F8E@dsl-only.net> <54248467.4050900@freebsd.org> <D7D2B0EF-7EEB-48E7-9485-990BACC709EE@dsl-only.net> <34AA4542-56A7-453E-A00E-868EE352C96C@dsl-only.net>

next in thread | previous in thread | raw e-mail | index | archive | help
The first send of this was big enough for the moderator to be involved. =
So I canceled and am sending with less history included.

[I'll note that I seem to have trouble typing 0xdbb290 vs. 0xbdd290. The =
actual value is 0xdbb290. The references to the incorrect typing should =
say 0xbdd290, which is the wrong value. But I've had both types of =
references listing the wrong text... in various notes.]

=3D=3D=3D
Mark Millard
markmi@dsl-only.net

On Sep 26, 2014, at 10:11 PM, Mark Millard <markmi@dsl-only.net> wrote:

The openfirmware peer crash (i.e., the before Copyright notice crash) =
happens during/just-after the MMU setup and the peer pfwcall is the =
first ofwcall where pmap_bootstrapped is non-zero at the time. In other =
words: the very first ofwcall in the new context fails.

And this failure involves some of the same code area that I got a =
backtrace for and reported as a separate crash (with the trace listed). =
As a reminder for that backtrace that has a difference failure point:

.pvo_vaddr_compare+0x14, instruction ld r0, r4, 0x58 [or ld r0,88(r4) in =
an alternate notation]
.pvo_tree_RB_FIND+0x38
.moea64_dev_direct_mapped_0x90
.pmap_dev_direct_mapped+0x84 ("_dev" was missing in earlier note)
.bs_remap_earlyboot_0x6c
.moea64_late_bootstrap+0x178
.moea64_bootstrap_native+0x120
.pmap_bootstrap+0xac
.powerpc_init+0x514
btext+0xa8

As for the sequence of ofwcall's that I reported: starting at the last =
OF_finddevice before the OF_instance_to_package that I reported in the =
sequence of ofwcall's from quiesce until the crash...

moea64_late_bootstrap does

        chosen =3D OF_finddevice("/chosen");
        if (chosen !=3D -1 && OF_getprop(chosen, "mmu", &mmui, 4) !=3D =
-1) {
            mmu =3D OF_instance_to_package(mmui);
            if (mmu =3D=3D -1 || (sz =3D OF_getproplen(mmu, =
"translations")) =3D=3D -1)
                sz =3D 0;
            if (sz > 6144 /* tmpstksz - 2 KB headroom */)
                panic("moea64_bootstrap: too many ofw translations");
                       =20
            if (sz > 0)
                moea64_add_ofw_mappings(mmup, mmu, sz);
        }

with moea64_add_ofw_mappings called. Then...

moea64_add_ofw_mappings does...

        bzero(translations, sz);
        OF_getprop(OF_finddevice("/"), "#address-cells", &acells,
            sizeof(acells));
        if (OF_getprop(mmu, "translations", trans_cells, sz) =3D=3D -1)
                panic("moea64_bootstrap: can't get ofw translations");

And it is the next ofwcall after that last OF_getprop that fails. (It =
happens to be a peer request.) Adding a dump of the pmap_bootstrapped =
value with the ofwcall name in my hack for reporting things about the =
crash confirmed that peer ofwcall as the first with pmap_bootstrapped =
non-zero.

I will note here that it is somewhat later than the above code that =
pvo_vaddr_compare ends up executing via bs_remap_earlyboot. That earlier =
moea64_late_bootstrap code continues after the } from the first if above =
with:

        /*
         * Calculate the last available physical address.
         */
        for (i =3D 0; phys_avail[i + 2] !=3D 0; i +=3D 2)
                ;
        Maxmem =3D powerpc_btop(phys_avail[i + 1]);

        /*
         * Initialize MMU and remap early physical mappings
         */
        MMU_CPU_BOOTSTRAP(mmup,0);
        mtmsr(mfmsr() | PSL_DR | PSL_IR);
        pmap_bootstrapped++;
        bs_remap_earlyboot();

(and more). I've not found the peer call yet but it may well be after =
the pvo_vaddr_compare shown above as far as execution order goes.





=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Sep 25, 2014, at 2:41 PM, Mark Millard <markmi at dsl-only.net> =
wrote:

The first boot after make -8 kernel without quiesce also died during =
peer, I'd guess the same one.

Looks like quiesce does not matter for the issue. (But it is handy for =
identifying which peer fails.)



=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Sep 25, 2014, at 2:08 PM, Nathan Whitehorn <nwhitehorn at =
freebsd.org> wrote:

Can you comment out the call to quiesce? It may not be necessary on your =
system.
-Nathan

On 09/25/14 13:17, Mark Millard wrote:
> The "before copyright" hang/exception is during the first openfirmware =
"peer" after "quiesce". The ofw_restore_trap_vec(save_trap_init) =
completes fine, the ofwcall(args) is made but it does not return =
normally.
>=20
> Ignoring the ofwcall's from before quiesce, the sequence of ofwcall's =
is:
>=20
> quiesce
> finddevice
> parent
> getprop
> getprop
> getprop
> finddevice
> getprop
> instance-to-package
> getproplen
> finddevice
> getprop
> getprop
> peer
>=20
> And when the boot fails before the copyright that ofwcall for peer =
ends up resulting in the register dump with no register pointing to the =
kernel's normal stack area.
>=20
> I still have no clue what is happening during peer. =
ofw_restore_trap_vec(save_trap_init) is being called and is returning =
before ofwcall is used. For all I know some uses of peer could require =
not being quiesce'd in order for peer to be reliable.
>=20
> In the form of my display indicating what executed the text reported =
ends in:
>=20
> <peer>^
>=20
> where the ^ indicates the stage that last completed in the call =
sequence inside openfirmware_core. This information is displayed by the
>=20
> x/s ofw_name_history
>=20
> in the automatically created default script for DDB. I read the =
sequence backwards from the end marker (here ^), following the =
wraparound if there is that much text and if I care to go back that far.
>=20
> FreeBSD FBSDG5M1 10.1-BETA2 FreeBSD 10.1-BETA2 #11 r271944M: Thu Sep =
25 12:14:05 PDT 2014     root@FBSDG5M1:/usr/obj/usr/src/sys/GENERIC64  =
powerpc
>=20
> My current hacks to get this information are:
>=20
> Index: /usr/src/sys/ddb/db_script.c
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- /usr/src/sys/ddb/db_script.c (revision 271944)
> +++ /usr/src/sys/ddb/db_script.c (working copy)
> @@ -319,10 +319,25 @@
>  {
>   char scriptname[DB_MAXSCRIPTNAME];
> =20
> + /* HACK!!! : Additional lines to force a basic default script to =
exist.
> +  * Will dump information even if ddb input is not available for =
early crash.
> +  * Used to get more information about PowerMac G5 "before Copyright" =
hangs.
> +  */
> + struct ddb_script *dsp =3D =
db_script_lookup(DB_SCRIPT_KDBENTER_DEFAULT);
> + if (!dsp) db_script_set(DB_SCRIPT_KDBENTER_DEFAULT, "show registers; =
bt; x/s ofw_name_history");
> +
>   snprintf(scriptname, sizeof(scriptname), "%s.%s",
>       DB_SCRIPT_KDBENTER_PREFIX, eventname);
>   if (db_script_exec(scriptname, 0) =3D=3D ENOENT)
>   (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
> +
> + /* HACK!!! : Additional lines to always use the default script,
> +  *           even if scriptname existed and was executed.
> +  * Will dump information even if ddb input is not available for =
early crash.
> +  * Used to get more information about PowerMac G5 "before Copyright" =
hangs.
> +  */
> + else
> + (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
>  }
> =20
>  /*-
> Index: /usr/src/sys/powerpc/conf/GENERIC64
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- /usr/src/sys/powerpc/conf/GENERIC64 (revision 271944)
> +++ /usr/src/sys/powerpc/conf/GENERIC64 (working copy)
> @@ -76,6 +76,8 @@
>  # Debugging support.  Always need this:
>  options   KDB # Enable kernel debugger support.
>  options   KDB_TRACE # Print a stack trace for a panic.
> +options   DDB
> +options   GDB
> =20
>  # Make an SMP-capable kernel by default
>  options   SMP # Symmetric MultiProcessor Kernel
> Index: /usr/src/sys/powerpc/ofw/ofw_machdep.c
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- /usr/src/sys/powerpc/ofw/ofw_machdep.c (revision 271944)
> +++ /usr/src/sys/powerpc/ofw/ofw_machdep.c (working copy)
> @@ -324,6 +324,12 @@
>   openfirmware(&args);
>  }
> =20
> +/* Part of HACK to have record of ofw call names */
> +#define ofw_name_history_record_size 256
> +char ofw_name_history[ofw_name_history_record_size+1] =3D {}; /* =
Initially: automatically '\0' filled */
> +char * ofw_name_history_pos =3D ofw_name_history;
> +/* End Part of HACK */
> +
>  static int
>  openfirmware_core(void *args)
>  {
> @@ -330,6 +336,42 @@
>   int result;
>   register_t oldmsr;
> =20
> + { /* HACK to have record of ofw call names */
> + struct argtype_prefix {
> + cell_t name;
> + };
> +
> + char *name =3D (char*) (uintptr_t) (((struct =
argtype_prefix*)args)->name);
> +=20
> + int i;
> +
> + *ofw_name_history_pos =3D '<';
> +
> + for(i=3D0; (*name) && i!=3D20; i++) {
> + ofw_name_history_pos++;
> + if (ofw_name_history_pos =3D=3D =
&ofw_name_history[ofw_name_history_record_size]) {
> + ofw_name_history_pos =3D ofw_name_history;
> + }
> + *ofw_name_history_pos =3D *name;
> +
> + name++;
> + }
> +
> + ofw_name_history_pos++;
> + if (ofw_name_history_pos =3D=3D =
&ofw_name_history[ofw_name_history_record_size]) {
> + ofw_name_history_pos =3D ofw_name_history;
> + }
> + *ofw_name_history_pos =3D '>';
> +
> + ofw_name_history_pos++;
> + if (ofw_name_history_pos =3D=3D =
&ofw_name_history[ofw_name_history_record_size]) {
> + ofw_name_history_pos =3D ofw_name_history;
> + }
> + *ofw_name_history_pos =3D '@';
> +
> + ofw_name_history[ofw_name_history_record_size] =3D '\0'; /* Paranoia =
*/
> + } /* HACK end */
> +
>   /*
>    * Turn off exceptions - we really don't want to end up
>    * anywhere unexpected with PCPU set to something strange
> @@ -337,14 +379,22 @@
>    */
>   oldmsr =3D intr_disable();
> =20
> + *ofw_name_history_pos =3D '#'; /* HACK */
> +
>   ofw_sprg_prepare();
> =20
> + *ofw_name_history_pos =3D '$'; /* HACK */
> +
>   /* Save trap vectors */
>   ofw_save_trap_vec(save_trap_of);
> =20
> + *ofw_name_history_pos =3D '%'; /* HACK */
> +
>   /* Restore initially saved trap vectors */
>   ofw_restore_trap_vec(save_trap_init);
> =20
> + *ofw_name_history_pos =3D '^'; /* HACK */
> +
>  #if defined(AIM) && !defined(__powerpc64__)
>   /*
>    * Clear battable[] translations
> @@ -357,13 +407,21 @@
> =20
>   result =3D ofwcall(args);
> =20
> + *ofw_name_history_pos =3D '&'; /* HACK */
> +
>   /* Restore trap vecotrs */
>   ofw_restore_trap_vec(save_trap_of);
> =20
> + *ofw_name_history_pos =3D '*'; /* HACK */
> +
>   ofw_sprg_restore();
> =20
> + *ofw_name_history_pos =3D '~'; /* HACK */
> +
>   intr_restore(oldmsr);
> =20
> + *ofw_name_history_pos =3D '!'; /* HACK */
> +
>   return (result);
>  }
>=20
>=20
>=20
>=20
>=20
> =3D=3D=3D
> Mark Millard
> markmi at dsl-only.net
>=20
> On Sep 25, 2014, at 3:46 AM, Mark Millard <markmi at dsl-only.net> =
wrote:
>=20
> One source code oddity that I notice is the following mixed use of =
ofw_real_mode: always tested vs. never tested (#if 0 ... #endif) ...
>=20
>> /*
>>  * Saved SPRG0-3 from OpenFirmware. Will be restored prior to the =
callback.
>>  */
>> register_t      ofw_sprg0_save;
>>=20
>> static __inline void
>> ofw_sprg_prepare(void)
>> {
>>         if (ofw_real_mode)
>>                 return;
>>=20
>>         /*
>>          * Assume that interrupt are disabled at this point, or
>>          * SPRG1-3 could be trashed
>>          */
>>         __asm __volatile("mfsprg0 %0\n\t"
>>                          "mtsprg0 %1\n\t"
>>                          "mtsprg1 %2\n\t"
>>                          "mtsprg2 %3\n\t"
>>                          "mtsprg3 %4\n\t"
>>                          : "=3D&r"(ofw_sprg0_save)
>>                          : "r"(ofmsr[1]),
>>                          "r"(ofmsr[2]),
>>                          "r"(ofmsr[3]),
>>                          "r"(ofmsr[4]));
>> }
>>        =20
>> static __inline void
>> ofw_sprg_restore(void)
>> {
>> #if 0
>>         if (ofw_real_mode)
>>                 return;
>> #endif
>>=20
>>         /*
>>          * Note that SPRG1-3 contents are irrelevant. They are =
scratch
>>          * registers used in the early portion of trap handling when
>>          * interrupts are disabled.
>>          *
>>          * PCPU data cannot be used until this routine is called !
>>          */
>>         __asm __volatile("mtsprg0 %0" :: "r"(ofw_sprg0_save));
>> }
>=20
> It would seem that for ofw_real_mode !=3D 0 that ofw_sprg_prepare =
would never set up ofw_sprg0_save (via mfsprg0) for the later =
ofw_sprg_restore's always-executed mtsprg0 that is based on =
ofw_sprg0_save.
>=20
> register_t seems to trace back to __int64_t --and that would leave =
ofw_sprg0_save initialized to zero as a global and that would have to be =
okay as the SPRG0 value to restore in such a case. (I have not tracked =
down what any of the per-processor values for SPRG0 are/should-be.)
>=20
>=20
>=20
>=20
>=20
>=20
> =3D=3D=3D
> Mark Millard
> markmi at dsl-only.net
>=20
> On Sep 25, 2014, at 2:12 AM, Mark Millard <markmi at dsl-only.net> =
wrote:
>=20
> The register dump that has no kernel stack addresses in any registers =
does have register contents suggesting a ofwcall use, matching up =
reasonably with the code I looked at that is related to ofwcall. ofwcall =
is only reached via openfirmware_core from what I can tell. (If there =
are other paths into openfirmware than via ofwcall then the register =
dump suggests that they are not in use around the crash.)
>=20
> And openfirmware_core has logic for exception vector swapping, going =
both directions:
>=20
>> static int
>> openfirmware_core(void *args)
>> {
>>         int             result;
>>         register_t      oldmsr;
>> =20
>>         /*
>>          * Turn off exceptions - we really don't want to end up
>>          * anywhere unexpected with PCPU set to something strange
>>          * or the stack pointer wrong.
>>          */
>>         oldmsr =3D intr_disable();
>> =20
>>         ofw_sprg_prepare();
>> =20
>>         /* Save trap vectors */
>>         ofw_save_trap_vec(save_trap_of);
>> =20
>>         /* Restore initially saved trap vectors */
>>         ofw_restore_trap_vec(save_trap_init);
>> =20
>> #if defined(AIM) && !defined(__powerpc64__)
>>         /*
>>          * Clear battable[] translations
>>          */
>>         if (!(cpu_features & PPC_FEATURE_64))
>>                 __asm __volatile("mtdbatu 2, %0\n"
>>                                  "mtdbatu 3, %0" : : "r" (0));
>>         isync();
>> #endif
>>=20
>>         result =3D ofwcall(args);
>>=20
>>         /* Restore trap vecotrs */
>>         ofw_restore_trap_vec(save_trap_of);
>>=20
>>         ofw_sprg_restore();
>>=20
>>         intr_restore(oldmsr);
>>=20
>>         return (result);
>> }
>=20
> In turn openfirmware_core is used only by ofw_rendezvous_dispatch and =
in turn that is used only by openfirmware: only PCPU_GET(cpuid) =3D=3D 0 =
does the above. save_trap_init is initialized by powerpc_init using =
ofw_save_trap_vec.
>=20
> [Note that ofw_restore_trap_vec uses __syncicache which does not use =
dcbf after the bcopy but instead uses dcbst: That is part of what lead =
my investigation into the distinction --and so to my more overall dcbst =
vs. dcbf use questions after proving dcbf would not be sufficient for a =
fix to the specific boot issue.]
>=20
> Unless the initialization of save_trap_init ends up with the wrong =
contents for openfirmware it would appear that the exception vectors are =
kept tracking by the above code. But the above does assume that the =
openfirmware vectors are unchanged after save_trap_init is initialized: =
there is no attempt at tracking of any potential updates to the =
openfirmware exception vectors.
>=20
> I would infer then that after ofw_restore_trap_vec(save_trap_of) is =
executed is when the exception that DDB reports happened: That is when =
FreeBSD's exception vectors are again in place. But a stack pointer into =
the kernel stack is not then in place in any register (based on DDB's =
register dump): stack handling is messed up already by the point of the =
reported exception. And that may actually be why an illegal instruction =
at address zero was reached: an incorrect stack context used to get an =
address to execute at.
>=20
>=20
>=20
>=20
> =3D=3D=3D
> Mark Millard
> markmi at dsl-only.net
>=20
> On Sep 24, 2014, at 8:36 AM, Nathan Whitehorn <nwhitehorn at =
freebsd.org> wrote:
>=20
> There shouldn't be any exceptions at that point, nested or otherwise. =
What I suspect is happening is that Open Firmware has turned them on for =
some bizarre reason, taken one, and ended up in the kernel's handlers =
but with the Open Firmware environment. Saving and restoring the OF =
interrupt vectors would be a possible solution; flattening the device =
tree in loader so that the kernel doesn't call Open Firmware at all =
would be another. I think Justin may have tried the first at some point.
> -Nathan
>=20
> On 09/24/14 02:04, Mark Millard wrote:
>> Now that I've had a kernel/boot crash with a successful DDB bt and =
show registers (a different submittal) it makes for a good =
comparison/contrast with what DDB reports for this "before copyright" =
crash.
>>=20
>> Something unique to the "before copyright" context is...
>>=20
>> No registers are reported to have values that point into the range =
between tmpstk and esym.
>>=20
>> In other words: There is no valid stack pointer reported as far as I =
can tell. r1 has the value 0 instead of being a handling a valid stack =
address. tmpstk=3D0xbd7000 and esym=3D0xbdb000 (example for one of my =
WITH_DEBUG_FILES=3D and options DDB and GDB builds of 10.1-BETA2). That =
at least gives a ball park on the range to expect for pointing into the =
stack even with some build variation.
>>=20
>> It leaves me wondering if the DDB report is for a nested exception =
handling. That could explain why lr points to u_trap+0x10 and srr0 =
points to k_trap+0x28 when normally srr0 would point to the the failing =
instruction (or the instruction after) and lr to where that routine =
would normally return to.
>>=20
>> The register values that are reported for my 10.1-BETA2 builds that =
crash before the copyright notice are:
>>=20
>> r0: 0
>> r1: 0
>> r2: 0xc81538 vop_unlock_desc
>> r3: 0xd18868
>> r4: 0x894b58
>> r5: 0
>> r6: 0xc1dee0 M_AUDITBSM
>> r7: 0xe3f818 ofw_real_mode
>> r8: 0x1
>> r9: 0xe0f580 __pcpu
>> r10: 0x1c35ec0
>> r11: 0
>> r12: 0x10000000
>> r13: 0xdbb290 thread0 (Note: another submittal has this mistyped as =
0xdbb290.)
>> r14-r19: all 0
>> r20: 0x10c1000
>> r21: 0x4
>> r22: 0x180abd4
>> r23: 0x1803a28
>> r24: 0xc000000000008760
>> r25: 0xcc89b8 smp_no...
>> r26: 0xcea108 ofw_rend...
>> r27: 0x894b58 ofwcall+0xa8
>> r28: 0x894b58 ofwcall+0xa8
>> r29: 2400022
>> r30: 9000000000001032
>> r31: 0xbb7d38
>>=20
>> srr0: 0x102720 k_trap+0x28
>> srr1: 9000000000001032
>> lr: 0x1026f0 u_trap+0x10
>> ctr: 0xff846d78
>> cr: 2000deb0
>> xer: 0
>> dar: f...d50 (lots of f's)
>> dsisr: 42000000
>>=20
>>=20
>>=20
>>=20
>>=20
>>=20
>> =3D=3D=3D
>> Mark Millard
>> markmi at dsl-only.net




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?7008CDAA-2DA2-419F-9BEC-AD823ECBFCCC>