Date: Wed, 18 Feb 2015 07:45:38 -0800
From: Nathan Whitehorn <nwhitehorn@freebsd.org>
To: freebsd-ppc@freebsd.org
Subject: Re: PowerMac G5 powerpc64: new context where repeatedly booting varies between failing and working
Message-ID: <54E4B3A2.9020106@freebsd.org>
In-Reply-To: <836A3016-D41B-45CB-AD4B-946767212026@dsl-only.net>
References: <7CA43EE3-8C11-4FBD-9F8A-42DF08B82362@dsl-only.net> <ABDD60F1-72C0-41E0-8DFB-4CFDCA9ACA82@dsl-only.net> <C355D814-D486-4644-B9C6-92992092FD55@dsl-only.net> <5FE82152-BBF7-4C6D-932D-AEE70546CACA@dsl-only.net> <36C14790-8E66-4C9D-9F29-A137FB49439D@dsl-only.net> <836A3016-D41B-45CB-AD4B-946767212026@dsl-only.net>
This is a multi-part message in MIME format.
--------------040607000109080202030001
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit

Interesting. I'm assuming this is due to a bug in the 32-/64-bit ABI thunking that is required to call into Open Firmware. Could you see if the attached patch helps?
-Nathan

On 02/18/15 04:51, Mark Millard wrote:
> I modified openfirmware_core to check on the status of the pointer value between most of its stages. With this I've also seen later failures than the usual one, such as after an OF_finddevice use has its ofwcall return.
>
> And the change greatly narrows down the stage at which memory is corrupted when it does fail...
>
> // OKAY HERE
> result = ofwcall(args);
> // SOMETIMES CORRUPTED HERE
>
> Unfortunately, to get this far ofwcall is my variant in order to, for example, enable recovery/retry from observed bad r1/r3 register problems that happened super-early on return from openfirmware in a high percentage of my boot attempts. I have yet to see how close to normal I can get ofwcall to be while still allowing this type of test.
>
>
> The relevant detection code in openfirmware_core is...
>
> /* HACK */
> extern void** authnone_create(void);
> ...
> static __inline void
> ofw_restore_trap_vec(char *restore_trap_vec)
> {
>     if (!ofw_real_mode)
>         return;
>
>     bcopy(restore_trap_vec, (void *)EXC_RST, EXC_LAST - EXC_RST);
>     __syncicache(EXC_RSVD, EXC_LAST - EXC_RSVD);
> }
> ...
> static int
> openfirmware_core(void *args)
> {
>     int result;
>     register_t oldmsr;
>
>     /* HACK */
>     void** jnk1pp;
>     void** jnk2pp;
>     void* jnk = *authnone_create();
>     if (jnk == *authnone_create()) jnk = *authnone_create();
>
>     /*
>      * Turn off exceptions - we really don't want to end up
>      * anywhere unexpected with PCPU set to something strange
>      * or the stack pointer wrong.
>      */
>     oldmsr = intr_disable();
>
>     /* HACK */
>     if (jnk == *authnone_create()) jnk = *authnone_create();
>
>     ofw_sprg_prepare();
>
>     /* HACK */
>     if (jnk == *authnone_create()) jnk = *authnone_create();
>
>     /* Save trap vectors */
>     ofw_save_trap_vec(save_trap_of);
>
>     /* HACK */
>     if (jnk == *authnone_create()) jnk = *authnone_create();
>
>     /* Restore initially saved trap vectors */
>     ofw_restore_trap_vec(save_trap_init);
>
>     /* HACK */
>     jnk1pp = authnone_create();
>
> #if defined(AIM) && !defined(__powerpc64__)
>     /*
>      * Clear battable[] translations
>      */
>     if (!(cpu_features & PPC_FEATURE_64))
>         __asm __volatile("mtdbatu 2, %0\n"
>                          "mtdbatu 3, %0" : : "r" (0));
>     isync();
> #endif
>
>     result = ofwcall(args);
>
>     /* HACK */
>     jnk2pp = authnone_create();
>
>     /* Restore trap vectors */
>     ofw_restore_trap_vec(save_trap_of);
>
>     /* HACK */
>     if (jnk != *jnk1pp) jnk = *authnone_create();
>     if (jnk != *jnk2pp) jnk = *authnone_create();
>     /* Note: *jnk2pp above is what detects the bad pointer value when it goes bad */
>     if (jnk == *authnone_create()) jnk = *authnone_create();
>
>     ofw_sprg_restore();
>
>     /* HACK */
>     if (jnk == *authnone_create()) jnk = *authnone_create();
>
>     intr_restore(oldmsr);
>
>     /* HACK */
>     if (jnk == *authnone_create()) jnk = *authnone_create();
>
>     return (result);
> }
>
> In the compiled code this translates to...
>
> 00000000008a671c <.openfirmware_core+0x168>  bl      00000000007a3de4 <.authnone_create>
> 00000000008a6720 <.openfirmware_core+0x16c>  crmove  4*cr7+so,4*cr7+so
> 00000000008a6724 <.openfirmware_core+0x170>  mr      r28,r3
>
> Note: The above loads r28 with a good address that does not fail when later dereferenced (while FreeBSD's exception vectors are in place).
>
> 00000000008a6728 <.openfirmware_core+0x174>  mr      r3,r29
> 00000000008a672c <.openfirmware_core+0x178>  bl      00000000008ac930 <.ofwcall>
> 00000000008a6730 <.openfirmware_core+0x17c>  crmove  4*cr7+so,4*cr7+so
> 00000000008a6734 <.openfirmware_core+0x180>  mr      r26,r3
> 00000000008a6738 <.openfirmware_core+0x184>  bl      00000000007a3de4 <.authnone_create>
> 00000000008a673c <.openfirmware_core+0x188>  crmove  4*cr7+so,4*cr7+so
> 00000000008a6740 <.openfirmware_core+0x18c>  mr      r29,r3
>
> Note: The above loads r29 with the bad address that is later detected by referencing it. This is the corrupted pointer value.
>
> 00000000008a6744 <.openfirmware_core+0x190>  ld      r3,21216(r2)
> 00000000008a6748 <.openfirmware_core+0x194>  lwz     r0,0(r3)
> 00000000008a674c <.openfirmware_core+0x198>  cmpwi   cr7,r0,0
> 00000000008a6750 <.openfirmware_core+0x19c>  beq+    cr7,00000000008a6778 <.openfirmware_core+0x1c4>
> 00000000008a6754 <.openfirmware_core+0x1a0>  addi    r3,r3,16
> 00000000008a6758 <.openfirmware_core+0x1a4>  li      r4,256
> 00000000008a675c <.openfirmware_core+0x1a8>  li      r5,11776
> 00000000008a6760 <.openfirmware_core+0x1ac>  bl      00000000008c158c <.bcopy>
> 00000000008a6764 <.openfirmware_core+0x1b0>  crmove  4*cr7+so,4*cr7+so
> 00000000008a6768 <.openfirmware_core+0x1b4>  li      r3,0
> 00000000008a676c <.openfirmware_core+0x1b8>  li      r4,12032
> 00000000008a6770 <.openfirmware_core+0x1bc>  bl      00000000008d5358 <.__syncicache>
>
> Note: At this point it is back to FreeBSD exception vectors so kernel debug display will work for bad pointer detection tests.
>
> 00000000008a6774 <.openfirmware_core+0x1c0>  crmove  4*cr7+so,4*cr7+so
> 00000000008a6778 <.openfirmware_core+0x1c4>  ld      r0,0(r28)
>
> Note: The above dereference of the before ofwcall pointer value (in r28) does not detect a bad pointer.
>
> 00000000008a677c <.openfirmware_core+0x1c8>  cmpd    cr7,r0,r30
> 00000000008a6780 <.openfirmware_core+0x1cc>  beq-    cr7,00000000008a6790 <.openfirmware_core+0x1dc>
> 00000000008a6784 <.openfirmware_core+0x1d0>  bl      00000000007a3de4 <.authnone_create>
> 00000000008a6788 <.openfirmware_core+0x1d4>  crmove  4*cr7+so,4*cr7+so
> 00000000008a678c <.openfirmware_core+0x1d8>  ld      r30,0(r3)
> 00000000008a6790 <.openfirmware_core+0x1dc>  ld      r0,0(r29)
>
> It is that last instruction (.openfirmware_core+0x1dc) that "detects" the bad pointer and leads to a kernel debugger display of some of the corrupted memory, including the stored pointer that the above code accessed and dereferenced to detect the problem.
>
> So the pointer was good just before the ofwcall and was bad just after it.
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
> On 2015-Feb-17, at 09:34 PM, Mark Millard <markmi at dsl-only.net> wrote:
>
> [I had sent Nathan W. and Justin H. a picture of a display of a boot-time corrupted memory region. This time I tried to find the start and end of the region and I'm documenting it in a textual form more appropriate to the list. I have also removed prior Email history from this Email, but there is much context for which one must check that history.]
>
> Several of the new values put in place by the .got memory corruption reported below match up with .opd or other types of addresses reported by objdump for my /boot/kernel10.1S/kernel. They are noted below as I list detailed differences.
>
> I made the early-boot-crash display a larger range and the span of the corruption seemed to go as follows for the corruption of part of the .got area. Also I induced a dereference of the bad pointer as soon as it is discovered after the OF_peer(0) in question returns so later code would not be involved when it crashes. (Crash early, crash often...)
>
>
> Overall structure:
>
> 0xd2da37 and before as far as I looked: no corruption found.
>
> The area from 0xd2da38-0xd2dc9F: largely corrupted.
> 0x268 or 616 bytes or so in this corrupted range. 616=77*8.
>
> After that range: good again as far as I looked.
>
>
> The details:
>
> Warning: The below is based on hand transcribed information from screen pictures that I took.
>
> Showing pair of lines (good then corrupted), using x/x style lines:
>
> 0xd2da30: 0, b4fd2c, 0, b4fd70
> 0xd2da30: 0, b4fd2c, 0, 0
>
> 0xd2da40: 0, e28948, 0, e1e460
> 0xd2da40: 0, 24000042, 0, d00058
> (24000042 looks like a cr value?)
> (0000000000d00058 l .opd 0000000000000018 ofw_rendezvous_dispatch)
>
> 0xd2da50: 0, bc7de8, 0, bc7e08
> 0xd2da50: 0, cde110, c0000000, 8740
> (0xc000000000008740 looks like a stack address?)
> (0000000000cde110 g F .opd 0000000000000018 smp_no_rendevous_barrier)
>
> 0xd2da60: 0, cd8470, 0, bd2608
> 0xd2da60: 0, 1, 0, c3a30c
> (0000000000c3a30c g .data 0000000000000000 ofw_sprg0_save)
>
> 0xd2da70: 0, bb5ea0, 0, b70870
> 0xd2da70: 0, 1c35ec0, 0, 0
>
> 0xd2da80: 0, c49918, 0, bc7e18
> 0xd2da80: 0, 44000022, 0, de4b30
> (44000022 looks like a cr value?)
> (0000000000de4b30 g O .bss 0000000000000460 thread0)
>
> 0xd2da90: 0, b720a0, 0, b71370
> 0xd2da90: 900000000, 1032, 0, ff846d78
> (9000000000001032 looks like a SRR1 value.)
> (ff846d78 is openfirmware entry point?)
>
> 0xd2daa0: 0, bc7e30, 0, bc7e58
> 0xd2daa0: 0, e39080, 100000000, 3030
> (0000000000e39080 g O .bss 0000000000020000 __pcpu)
> (1000000000003030 looks like a SRR1 value?)
>
> 0xd2dab0: 0, bc7e80, 0, bc7eb0
> 0xd2dab0: c0000000, 83b0, 0, c3a280
> (0xc0000000000083b0 looks like a stack address?)
> (c3a280 is inside my PowerMac G5 specific hack's ofwstk area: c392a0 up to 0x3a2a0)
> (I've been gathering evidence about early-boot G5 crashes.)
>
> 0xd2dac0: 0, bc7ed0, 0, cf2960
> 0xd2dac0: 0, c40000, 0, c40000
>
> 0xd2dad0: 0, bc7f00, 0, bc7f28
> 0xd2dad0: 0, c40000, 0, c40000
>
> 0xd2dae0: 0, b72400, 0, bc7f28
> 0xd2dae0: c0000000, 8740, 0, cde110
> (0xc000000000008740 looks like a stack address?)
> (0000000000cde110 g F .opd 0000000000000018 smp_no_rendevous_barrier)
>
> 0xd2daf0: 0, cf2b28, 0, b716a0
> 0xd2daf0: 0, d00058, 0, cde110
> (d00058 was also at 0xd2da4c and was followed by cde110 there.)
> (0000000000cde110 g F .opd 0000000000000018 smp_no_rendevous_barrier)
>
> 0xd2db00: 0, cf2b88, 0, cf2b70
> 0xd2db00: 0, e6c280, 0, 0
> (e6c280 is inside the emergency_buffer.7752 area: e6c278 up to e6c378)
>
> 0xd2db10: 0, cf2b58, 0, 8480
> 0xd2db10: 900000000, 1032, c0000000, 8740
> (9000000000001032 looks like a SRR1 value?)
> (0xc000000000008740 looks like a stack address?)
>
> 0xd2db20: 0, c2d920, 0, cf2b10
> 0xd2db20: 0, c2d920, 0, cf2b10 (yep: unchanged!)
>
> 0xd2db30: 0, b71718, 0, c49888
> 0xd2db30: 0, ff846734, 10000000, 3030
> (ff846734 would seem to be an openfirmware code address?)
> (1000000000003030 looks like a SRR1 value?)
>
> 0xd2db40: 0, c498a0, 0, c54000
> 0xd2db40: 0, c498a0, 0, ff846d78
> (Yep: c498a0 was unchanged)
> (ff846d78 is openfirmware entry point?)
>
> 0xd2db50: 0, e313a8, 0, e31608
> 0xd2db50: 24000042, e313a8, 0, 0
> (24000042 looks like a cr value?)
> (Trying to store to address 0x2400004200e313a8 for a specific
> type of 10.1-STABLE build is how the problem was originally
> noticed.)
>
> 0xd2db60: 0, c31f80, 0, bc81e8
> 0xd2db60: 0, c31f80, 0, 0
> (Yep: 0x0000000000c31f80 is unchanged.)
>
> 0xd2db70: 0, e31408, 0, bc8228
> 0xd2db70: 200000, e31408, 0, bc8228
> (Yep: Only the 0x200000 was a change.)
>
> 0xd2db80: 0, c32488, 0, bc8238
> 0xd2db80: 0, 1, 10000000, 3030
> (1000000000003030 looks like a SRR1 value?)
>
> 0xd2db90: 0, e1e460, 0, c31fc0
> 0xd2db90: 0, 0, 0, 7ff7e800
>
> 0xd2dba0: 0, e31608, 0, bc8260
> 0xd2dba0: 0, 1000000a, 0, bc8260
> (Yep: 0x0000000000bc8260 unchanged.)
>
> 0xd2dbb0: 0, e1e460, 0, e1fa60
> 0xd2dbb0: 0, e1e460, 0, e1fa60 (yep: unchanged!)
>
> 0xd2dbc0: 0, bc8288, 0, c32488
> 0xd2dbc0: 111081, 0, fd3c2000, 0
> (fd3c2000 in openfirmware area?)
>
> 0xd2dbd0: 0, e3153c, 0, bc8298
> 0xd2dbd0: 10, 0, 0, 0
>
> Now a few unchanged: 0xd2dbe0-0xd2dc1F
>
> Then a change in the pattern of corruptions for the rest of the corrupted area:
>
> 0xd2dc20: 0, bc8288, 0, bc82e8
> 0xd2dc20: 0, bc8288, 127f500, bc82e8
>
> Note how bc8288 and bc82e8 did not change.
> From here on those two columns are not
> corrupted but the other two are.
>
> 0xd2dc30: 0, bc8300, 0, c32488
> 0xd2dc30: 8000000, bc8300, e7d540, c32488
>
> 0xd2dc40: 0, b4fef0, 0, e31558
> 0xd2dc40: ecc40, b4fef0, 84eec80, e31558
>
> 0xd2dc50: 0, bc8308, 0, cf2f00
> 0xd2dc50: 1e85440, bc8308, 8766200, cf2f00
>
> 0xd2dc60: 0, bc8310, 0, bc8350
> 0xd2dc60: fb9040, bc8310, 93bb000, bc8350
>
> 0xd2dc70: 0, c32038, 0, de5718
> 0xd2dc70: 94f6b00, c32038, 8632600, de5718
>
> 0xd2dc80: 0, de7768, 0, bc3760
> 0xd2dc80: 1fc0f40, de7768, 10f4b40, bc3760
>
> 0xd2dc90: 0, de7768, 0, e1fa00
> 0xd2dc90: 99e5700, cfc658, 228740, e1fa00
>
> And after that things match for as far as I've looked: no corruptions.
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
>
> _______________________________________________
> freebsd-ppc@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-ppc
> To unsubscribe, send any mail to "freebsd-ppc-unsubscribe@freebsd.org"
>

--------------040607000109080202030001
Content-Type: text/plain; charset=us-ascii; name="ofwcall.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="ofwcall.diff"

Index: ofwcall64.S
===================================================================
--- ofwcall64.S	(revision 278935)
+++ ofwcall64.S	(working copy)
@@ -106,7 +106,7 @@
 
 	/* Get OF stack pointer */
 	ld	%r7,TOC_REF(ofwstk)(%r2)
-	addi	%r7,%r7,OFWSTKSZ-32
+	addi	%r7,%r7,OFWSTKSZ-64
 
 	/*
 	 * Set the MSR to the OF value. This has the side effect of disabling
@@ -126,9 +126,9 @@
 	 */
 	mr	%r5,%r1
 	mr	%r1,%r7
-	std	%r5,8(%r1)	/* Save real stack pointer */
-	std	%r2,16(%r1)	/* Save old TOC */
-	std	%r6,24(%r1)	/* Save old MSR */
+	std	%r5,40(%r1)	/* Save real stack pointer */
+	std	%r2,48(%r1)	/* Save old TOC */
+	std	%r6,56(%r1)	/* Save old MSR */
 	li	%r5,0
 	stw	%r5,4(%r1)
 	stw	%r5,0(%r1)
@@ -138,9 +138,9 @@
 	bctrl
 
 	/* Reload stack pointer and MSR from the OFW stack */
-	ld	%r6,24(%r1)
-	ld	%r2,16(%r1)
-	ld	%r1,8(%r1)
+	ld	%r6,56(%r1)
+	ld	%r2,48(%r1)
+	ld	%r1,40(%r1)
 
 	/* Now set the real MSR */
 	mtmsrd	%r6

--------------040607000109080202030001--
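
Editorial note on the attached ofwcall.diff: the patch moves the three saved 64-bit values (real stack pointer, TOC, MSR) from offsets 8/16/24 to 40/48/56 above the stack pointer handed to Open Firmware, and reserves 64 rather than 32 bytes at the top of ofwstk. The C sketch below only models that arithmetic; it is not kernel code. The struct, the function names, and the 32-byte "scratch" figure used in the usage note are hypothetical illustrations, since the thread does not say how many bytes Open Firmware may write above its incoming stack pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model (not kernel code): where ofwcall keeps its three saved
 * 64-bit values, as byte offsets from the stack pointer handed to Open Firmware.
 */
struct ofw_frame_layout {
    size_t top_reserve;   /* N in "addi %r7,%r7,OFWSTKSZ-N" */
    size_t save_sp_off;   /* saved real stack pointer */
    size_t save_toc_off;  /* saved TOC (%r2) */
    size_t save_msr_off;  /* saved MSR */
};

/* Offsets copied from the removed (-) and added (+) lines of the diff. */
const struct ofw_frame_layout layout_before = { 32,  8, 16, 24 };
const struct ofw_frame_layout layout_after  = { 64, 40, 48, 56 };

/* Lowest byte offset used by any of the three 8-byte save slots. */
size_t lowest_save_offset(const struct ofw_frame_layout *l)
{
    size_t lo = l->save_sp_off;
    if (l->save_toc_off < lo) lo = l->save_toc_off;
    if (l->save_msr_off < lo) lo = l->save_msr_off;
    return lo;
}

/* True when every 8-byte slot ends at or below the end of the reserved bytes. */
bool slots_fit(const struct ofw_frame_layout *l)
{
    return l->save_sp_off + 8 <= l->top_reserve &&
           l->save_toc_off + 8 <= l->top_reserve &&
           l->save_msr_off + 8 <= l->top_reserve;
}

/*
 * True when the first `scratch` bytes above the stack pointer hold no save
 * slot. `scratch` is an illustrative assumption, not a documented OF value.
 */
bool slots_clear_of(const struct ofw_frame_layout *l, size_t scratch)
{
    return lowest_save_offset(l) >= scratch;
}

/* Sanity check: both layouts keep their slots inside the reserved bytes. */
void check_layouts(void)
{
    assert(slots_fit(&layout_before));
    assert(slots_fit(&layout_after));
}
```

Under the assumed 32-byte scratch region, slots_clear_of() is false for layout_before (its lowest slot sits at offset 8) and true for layout_after (lowest slot at offset 40). Whether that is the actual mechanism behind the corruption is exactly what the patch is meant to test.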