Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 25 Sep 2014 04:31:49 +0000
From:      "Joe Nosay" <superbisquit@gmail.com>
To:        "Adrian Chadd" <adrian@freebsd.org>, "ISO 8859 1 Q Ermal Lu E7i" <eri@freebsd.org>, "chromium" <chromium@freebsd.org>, "ppc" <ppc@freebsd.org>, "FreeBSD Hackers" <freebsd-hackers@freebsd.org>, "CeDeROM" <cederom@tlen.pl>, "USAR VET" <usvet81@hotmail.com>, "gecko" <gecko@freebsd.org>
Subject:   =?ISO-8859-1?Q?FW=3A=09Adrian_Chadd?=
Message-ID:  <3769C3A941E7152E07B05F1542B29085@edu.lahti.fi>

next in thread | raw e-mail | index | archive | help
http://antiq.co.il/koyzzu/afqiqruhwbgmoaewqtdyitufwiupqnntlbhvhnh
From owner-freebsd-ppc@FreeBSD.ORG  Thu Sep 25 09:13:03 2014
Return-Path: <owner-freebsd-ppc@FreeBSD.ORG>
Delivered-To: freebsd-ppc@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id EE2AB213
 for <freebsd-ppc@freebsd.org>; Thu, 25 Sep 2014 09:13:03 +0000 (UTC)
Received: from asp.reflexion.net (outbound-242.asp.reflexion.net
 [69.84.129.242])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 2C18ECB4
 for <freebsd-ppc@freebsd.org>; Thu, 25 Sep 2014 09:13:02 +0000 (UTC)
Received: (qmail 801 invoked from network); 25 Sep 2014 09:12:55 -0000
Received: from unknown (HELO mail-cs-04.app.dca.reflexion.local) (10.81.19.4)
 by 0 (rfx-qmail) with SMTP; 25 Sep 2014 09:12:55 -0000
Received: by mail-cs-04.app.dca.reflexion.local
 (Reflexion email security v7.30.7) with SMTP;
 Thu, 25 Sep 2014 05:12:55 -0400 (EDT)
Received: (qmail 17020 invoked from network); 25 Sep 2014 09:12:53 -0000
Received: from unknown (HELO iron2.pdx.net) (69.64.224.71)
 by 0 (rfx-qmail) with (DHE-RSA-AES256-SHA encrypted) SMTP;
 25 Sep 2014 09:12:53 -0000
X-No-Relay: not in my network
X-No-Relay: not in my network
X-No-Relay: not in my network
Received: from [192.168.1.8] (c-98-246-178-138.hsd1.or.comcast.net
 [98.246.178.138])
 by iron2.pdx.net (Postfix) with ESMTPSA id D522D1C402C;
 Thu, 25 Sep 2014 02:12:51 -0700 (PDT)
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
Subject: Re: lr=u_trap+0x10 and srr0=k_trap+0x28 for "stopped at 0 illegal
 instruction 0" before-copyright hang on PowerMac G5's
From: Mark Millard <markmi@dsl-only.net>
In-Reply-To: <5422E513.6010806@freebsd.org>
Date: Thu, 25 Sep 2014 02:12:48 -0700
Message-Id: <1C02D0D4-14B8-465F-B493-4D3A64E4C35C@dsl-only.net>
References: <1118046C-0FF7-49FC-82DA-DB9A7A310991@dsl-only.net>
 <2ED3DB50-B985-4382-8FF2-3B44E7E65453@dsl-only.net>
 <CAHSQbTBXxrgXQdNeCs=C5wJaT_bmh9FU836O6VnJDbJuqCUujw@mail.gmail.com>
 <6D729F43-662A-429E-9503-0148EC3250B1@dsl-only.net>
 <72535F89-3942-45A6-B351-7F746209ED9F@dsl-only.net>
 <0703EF26-6E33-4446-9273-BBFD0CB72893@dsl-only.net>
 <37575F94-763C-43BF-8DD9-F648F4A7C09F@dsl-only.net>
 <5422E513.6010806@freebsd.org>
To: Nathan Whitehorn <nwhitehorn@freebsd.org>
X-Mailer: Apple Mail (2.1878.6)
Content-Type: text/plain;
	charset=windows-1252
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1
Cc: Justin Hibbits <chmeeedalf@gmail.com>,
 FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
X-BeenThere: freebsd-ppc@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Porting FreeBSD to the PowerPC <freebsd-ppc.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-ppc>,
 <mailto:freebsd-ppc-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-ppc/>;
List-Post: <mailto:freebsd-ppc@freebsd.org>
List-Help: <mailto:freebsd-ppc-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-ppc>,
 <mailto:freebsd-ppc-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 25 Sep 2014 09:13:04 -0000

The register dump that has no kernel stack addresses in any registers =
does have register contents suggesting a ofwcall use, matching up =
reasonably with the code I looked at that is related to ofwcall. ofwcall =
is only reached via openfirmware_core from what I can tell. (If there =
are other paths into openfirmware than via ofwcall then the register =
dump suggests that they are not in use around the crash.)

And openfirmware_core has logic for exception vector swapping, going =
both directions:

> static int
> openfirmware_core(void *args)
> {
>         int             result;
>         register_t      oldmsr;
> =20
>         /*
>          * Turn off exceptions - we really don't want to end up
>          * anywhere unexpected with PCPU set to something strange
>          * or the stack pointer wrong.
>          */
>         oldmsr =3D intr_disable();
> =20
>         ofw_sprg_prepare();
> =20
>         /* Save trap vectors */
>         ofw_save_trap_vec(save_trap_of);
> =20
>         /* Restore initially saved trap vectors */
>         ofw_restore_trap_vec(save_trap_init);
> =20
> #if defined(AIM) && !defined(__powerpc64__)
>         /*
>          * Clear battable[] translations
>          */
>         if (!(cpu_features & PPC_FEATURE_64))
>                 __asm __volatile("mtdbatu 2, %0\n"
>                                  "mtdbatu 3, %0" : : "r" (0));
>         isync();
> #endif
>=20
>         result =3D ofwcall(args);
>=20
>         /* Restore trap vecotrs */
>         ofw_restore_trap_vec(save_trap_of);
>=20
>         ofw_sprg_restore();
>=20
>         intr_restore(oldmsr);
>=20
>         return (result);
> }

In turn openfirmware_core is used only by ofw_rendezvous_dispatch and in =
turn that is used only by openfirmware: only PCPU_GET(cpuid) =3D=3D 0 =
does the above. save_trap_init is initialized by powerpc_init using =
ofw_save_trap_vec.

[Note that ofw_restore_trap_vec uses __syncicache which does not use =
dcbf after the bcopy but instead uses dcbst: That is part of what lead =
my investigation into the distinction --and so to my more overall dcbst =
vs. dcbf use questions after proving dcbf would not be sufficient for a =
fix to the specific boot issue.]

Unless the initialization of save_trap_init ends up with the wrong =
contents for openfirmware it would appear that the exception vectors are =
kept tracking by the above code. But the above does assume that the =
openfirmware vectors are unchanged after save_trap_init is initialized: =
there is no attempt at tracking of any potential updates to the =
openfirmware exception vectors.

I would infer then that after ofw_restore_trap_vec(save_trap_of) is =
executed is when the exception that DDB reports happened: That is when =
FreeBSD's exception vectors are again in place. But a stack pointer into =
the kernel stack is not then in place in any register (based on DDB's =
register dump): stack handling is messed up already by the point of the =
reported exception. And that may actually be why an illegal instruction =
at address zero was reached: an incorrect stack context used to get an =
address to execute at.




=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Sep 24, 2014, at 8:36 AM, Nathan Whitehorn <nwhitehorn at =
freebsd.org> wrote:

There shouldn't be any exceptions at that point, nested or otherwise. =
What I suspect is happening is that Open Firmware has turned them on for =
some bizarre reason, taken one, and ended up in the kernel's handlers =
but with the Open Firmware environment. Saving and restoring the OF =
interrupt vectors would be a possible solution; flattening the device =
tree in loader so that the kernel doesn't call Open Firmware at all =
would be another. I think Justin may have tried the first at some point.
-Nathan

On 09/24/14 02:04, Mark Millard wrote:
> Now that I've had a kernel/boot crash with a successful DDB bt and =
show registers (a different submittal) it makes for a good =
comparison/contrast with what DDB reports for this "before copyright" =
crash.
>=20
> Something unique to the "before copyright" context is...
>=20
> No registers are reported to have values that point into the range =
between tmpstk and esym.
>=20
> In other words: There is no valid stack pointer reported as far as I =
can tell. r1 has the value 0 instead of being a handling a valid stack =
address. tmpstk=3D0xbd7000 and esym=3D0xbdb000 (example for one of my =
WITH_DEBUG_FILES=3D and options DDB and GDB builds of 10.1-BETA2). That =
at least gives a ball park on the range to expect for pointing into the =
stack even with some build variation.
>=20
> It leaves me wondering if the DDB report is for a nested exception =
handling. That could explain why lr points to u_trap+0x10 and srr0 =
points to k_trap+0x28 when normally srr0 would point to the the failing =
instruction (or the instruction after) and lr to where that routine =
would normally return to.
>=20
> The register values that are reported for my 10.1-BETA2 builds that =
crash before the copyright notice are:
>=20
> r0: 0
> r1: 0
> r2: 0xc81538 vop_unlock_desc
> r3: 0xd18868
> r4: 0x894b58
> r5: 0
> r6: 0xc1dee0 M_AUDITBSM
> r7: 0xe3f818 ofw_real_mode
> r8: 0x1
> r9: 0xe0f580 __pcpu
> r10: 0x1c35ec0
> r11: 0
> r12: 0x10000000
> r13: 0xdbb290 thread0 (Note: another submittal has this mistyped as =
0xdbb290.)
> r14-r19: all 0
> r20: 0x10c1000
> r21: 0x4
> r22: 0x180abd4
> r23: 0x1803a28
> r24: 0xc000000000008760
> r25: 0xcc89b8 smp_no...
> r26: 0xcea108 ofw_rend...
> r27: 0x894b58 ofwcall+0xa8
> r28: 0x894b58 ofwcall+0xa8
> r29: 2400022
> r30: 9000000000001032
> r31: 0xbb7d38
>=20
> srr0: 0x102720 k_trap+0x28
> srr1: 9000000000001032
> lr: 0x1026f0 u_trap+0x10
> ctr: 0xff846d78
> cr: 2000deb0
> xer: 0
> dar: f...d50 (lots of f's)
> dsisr: 42000000
>=20
>=20
>=20
>=20
>=20
>=20
> =3D=3D=3D
> Mark Millard
> markmi at dsl-only.net
>=20
> On Sep 20, 2014, at 3:42 PM, Mark Millard <markmi at dsl-only.net> =
wrote:
>=20
> [I corrected the SSR0 in the subject to be SRR0.]
>=20
> I did miss a register in my list (it matched the shown r30 value). And =
it turns out to probably be very important to interpreting what the =
"show registers" is reporting:
>=20
> SRR1: 0x9000000000001032
>=20
> But bits 43-46 of SRR1 are supposed to indicate which type of Program =
Exception, using a single binary 1 to so. No such 1's are present.
>=20
> Illegal instruction would have been bit 44 being 1. (PowerPC has the =
upper bit numbered zero and increases from there.)
>=20
> So the ddb "show registers" is apparently not reporting the status as =
of when the "stopped at 0 illegal instruction 0" happened. Thus other =
things are also likely not from that exact time frame.
>=20
>=20
>=20
> And I misinterpreted the LR value status: The LR value was just left =
over from the restore_kernsrs returning when it finished. Execution then =
flowed into k_trap. Nothing unusual involved.
>=20
>=20
>=20
>=20
>=20
> =3D=3D=3D
> Mark Millard
> markmi@dsl-only.net
>=20
> On Sep 18, 2014, at 8:57 PM, Mark Millard <markmi@dsl-only.net> wrote:
>=20
> I modified DDB to automatically "show registers" even at the early =
"before Copyright" crash time. The end of this note will show the =
/usr/src/sys/ddb/db_script.c diff for the hack. While I also had DDB bt, =
the bt does not actually print a back trace for this context. (It might =
for others.)
>=20
> The registers give interesting context despite the lack of a back =
trace. I do not know if it will be sufficient to be of much immediate =
help if someone used the information to start looking at the problem.
>=20
> I'll start with register lr: 0x1026f0 u_trap+0x10.
>=20
> /usr/src/sys/powerpc/aim/trap_subr64.S has:
>=20
> s_trap:
>         bf      17,k_trap               /* branch if PSL_PR is false =
*/
>         GET_CPUINFO(%r1)
> u_trap:
>         ld      %r1,PC_CURPCB(%r1)
>         mr      %r27,%r28               /* Save LR, r29 */
>         mtsprg2 %r29
>         bl      restore_kernsrs         /* enable kernel mapping */
>         mfsprg2 %r29
>         mr      %r28,%r27
>=20
> /*
>  * Now the common trap catching code.
>  */
> k_trap:
>         FRAME_SETUP(PC_TEMPSAVE)
> /* Call C interrupt dispatcher: */
> trapagain:
>=20
> and so this appears to indicate a pending return to execute the =
"mfsprg2 %r29" after "bl restore_kernsrs", which indicates that =
restore_kernsrs should be active.
>=20
> But register srr0 indicates: 0x102720 k_trap+0x28. (So apparently in =
FRAME_SETUP(PC_TEMPSAVE) someplace.)
>=20
> So it appears to me that the processor got to the k_trap code during =
the supposed restore_kernsrs time frame. (But I'm no expert at these =
sorts of things or for the processor.)
>=20
> I'll list the other register values:
>=20
> r0:  0
> r1:  0
> r2:  0xc1be80 M_AUDITBSM
> r3:  0xb16138
> r4:  0x8926e8 .ofwcall+0xa8
> r5:  0
> r6:  0xbb5f90
> r7:  0xe3d118 ofw_real_mode
> r8:  0x1
> r9:  0xe0ce80 __pcpu
> r10: 0x1c35ec9
> r11: 0
> r12: 0x10000000
> r13: db890    thread0
> r14-r19: all 0
> r20: 0x10bc000
> r21: 0x4
> r22: 0x1801db4
> r23: 0x1803a28
> r24: 0xc000000000008760
> r25: 0xcc6908 smp_no_rendevous_barrier
> r26: 0xec79e0 ofw_rendezvous_dispatch (yep one has v and the other zv)
> r27: 0x8926e8 .ofwcall+0xa8
> r28: 0x8926e8 .ofwcall+0xa8 (yep: same value)
> r29: 0x24000022
> r30: 0x9000000000001032
> r31: 0xc7f488 vop_unlock_desc
>=20
> ctr: 0xff846d78
> cr:  0x2000d7b0
> xer: 0
> dar: 0xfffffffffffffd50
> dsisr: 0x42000000
>=20
> (Hopefully this manual transcription from the screen display is =
complete --and also accurate for what it does present.)
>=20
>=20
>=20
>=20
> The personal HACK to /usr/src/sys/ddb/db_script.c's =
db_script_kdbenter(...) to have it show registers and try bt...
>=20
> $ cd /usr/src/sys/ddb/
> $ svnlite diff .
> Index: db_script.c
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- db_script.c (revision 271610)
> +++ db_script.c (working copy)
> @@ -319,10 +319,25 @@
>  {
>   char scriptname[DB_MAXSCRIPTNAME];
> =20
> + /* HACK!!! : Additional lines to force a basic default script to =
exist.
> +  * Will dump information even if ddb input is not available for =
early crash.
> +  * Used to get more information about PowerMac G5 "before Copyright" =
hangs.
> +  */
> + struct ddb_script *dsp =3D =
db_script_lookup(DB_SCRIPT_KDBENTER_DEFAULT);
> + if (!dsp) db_script_set(DB_SCRIPT_KDBENTER_DEFAULT, "show registers; =
bt");
> +
>   snprintf(scriptname, sizeof(scriptname), "%s.%s",
>       DB_SCRIPT_KDBENTER_PREFIX, eventname);
>   if (db_script_exec(scriptname, 0) =3D=3D ENOENT)
>   (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
> +
> + /* HACK!!! : Additional lines to always use the default script,
> +  *           even if scriptname existed and was executed.
> +  * Will dump information even if ddb input is not available for =
early crash.
> +  * Used to get more information about PowerMac G5 "before Copyright" =
hangs.
> +  */
> + else
> + (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
>  }
> =20
>  /*-
>=20
>=20
>=20
> =3D=3D=3D
> Mark Millard
> markmi at dsl-only.net
>=20
> On Sep 16, 2014, at 9:28 PM, Mark Millard <markmi at dsl-only.net> =
wrote:
>=20
> In part I sent directly to you because of a past exchange (July-27) =
where you had written:
>=20
>> Nathan and I both speculate that it's
>> dropping into Open Firmware (we make extensive use of OFW), and then
>> messing something up, taking a page fault or something.
>=20
> The specific text that I report and its uniformity when it is produced =
seems to add a little information beyond a speculated "page fault or =
something" and so might eventually help a little. As I understand the =
text it is reporting execution reaching address zero without any prior =
un-handled exceptions or other such that would stop it. A corrupted =
stack (pointer) so a bad return address or some such? I'd guess there =
are no explicit jumps to address zero so I expect that indirection is =
likely involved, with the content for the indirection messed up.
>=20
> I really wish that I had a logic analyzer configuration for this. I've =
not found a way to make the failing context visible so far and the extra =
way of looking at things might have helped.
>=20
>=20
>=20
>=20
> =3D=3D=3D
> Mark Millard
> markmi@dsl-only.net
>=20
> On Sep 16, 2014, at 8:28 PM, Justin Hibbits <chmeeedalf@gmail.com> =
wrote:
>=20
> Hi mark,
>=20
> I see this on my G5, and I think it's due to the amount of RAM in the =
machine. More than 4gb seems to confuse open firmware when called by =
FreeBSD. There is some effort to remove the need of the callbacks but =
thus far it's not far along. The good news is that after it boots it's =
solid except when switching vtys, buy earlier this year or last year I =
added a sysctl hack to disable the call into open firmware on vty switch =
(don't recall offhand and not at my computer right now, but if you grep =
the sysctl output for reset and ofw you can find it).
>=20
> -Justin
>=20
> On Sep 16, 2014 8:01 PM, "Mark Millard" <markmi@dsl-only.net> wrote:
> I've now spent time with rebooting and power-off/power-on for all 3 =
PowerMac G5's (one PowerMac7,2 and two PowerMac11,2's) and all 3 get the
>=20
>> GDB: no debug ports present
>> KDB: debugger backends: DDB
>> KDB: current backend: DDB
>> [ thread pid -1 tid 1006665719 ]
>> Stopped at 0: illegal instruction 0
>> db>
>=20
> when they fail just before the Copyright notice would normally be =
displayed. None fail any earlier. At that spot none have failed any =
other way. It is the same SSD in all 3. (Happens with other SSD's as =
well.) Overall there is a mix of Radeon and NVIDIA display boards. =
Besides the SSD use and RAM upgrades the rest is stock equipment. scons =
used, not vt. (I've yet to try vt.)
>=20
> Seeing a failure after the Copyright notice as been fairly rare in all =
my experiments from when I started last April or so. The ones that I've =
noted had Data Storage Interrupt reported. So far no examples of the =
above have been reported after the Copyright notice. So I'd guess that =
they are separate issues. Of course it seems that only in the last few =
days would I have seen the above sort of thing if it did happen after =
the Copyright notice: The prior history does not count for judgements =
about that.
>=20
> =3D=3D=3D
> Mark Millard
> markmi at dsl-only.net
>=20
> On Sep 16, 2014, at 8:15 AM, Mark Millard <markmi@dsl-only.net> wrote:
>=20
> Using 10.1-BETA1 I added "options DDB" and "options GDB" to =
powerpc64's GENERIC64. (I also used WITH_DEBUG_FILES=3D, WITHOUT_CLANG=3D,=
 and WITH_DEBUG=3D in /etc/make.conf.) So buildworld, kernel was =
basically just set up to have more of a debugging context around =
(including for any ports builds).
>=20
> The result was new information about the PowerMac G5 boot hangups: The =
screen is no longer blank when the G5 is hung up without there being a =
Copyright notice yet. It says...
>=20
>> GDB: no debug ports present
>> KDB: debugger backends: DDB
>> KDB: current backend: DDB
>> [ thread pid -1 tid 1006665719 ]
>> Stopped at 0: illegal instruction 0
>> db>
>=20
> (I had no ability to input at that point.) Normally the Copyright =
notice would have displayed instead of "[...]" and what follows. (I do =
not claim to have all the spacing, capitalization, and such correct =
above.)
>=20
> That text is constant from hang to hang when it hangs just before it =
would normally output the Copyright notice: The numbers do not vary, =
much less the other text. It has never failed until after the two KDB =
messages are present. So far I've only tested one PowerMac G5, booting =
over and over for a few hours.
>=20
>=20
>=20
> (I do not claim to be set up for remote kernel debugging. I just =
decided to let GDB go along for the ride when I added DDB.)
>=20
> =3D=3D=3D
> Mark Millard
> markmi at dsl-only.net
>=20
>=20
>=20
>=20
>=20
>=20





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3769C3A941E7152E07B05F1542B29085>