From owner-freebsd-ppc@FreeBSD.ORG Sun Oct 12 23:51:27 2014
Date: Sun, 12 Oct 2014 16:51:18 -0700
From: Justin Hibbits
To: Matthew Rezny
Cc: freebsd-ppc@freebsd.org
Subject: Re: PowerMac G5 spurious sensor readings
Message-ID: <20141012165118.0b44f322@zhabar.attlocal.net>
In-Reply-To: <20130222220918.00005998@unknown>
References: <51169.1358483910@hexaneinc.com> <20130222220918.00005998@unknown>

On Fri, 22 Feb 2013 22:09:18 +0100
Matthew Rezny wrote:

> On Fri, 18 Jan 2013 05:38:30 +0100
> Matthew Rezny wrote:
>
> > On Thu 13/01/17 21:59 , Matthew Rezny wrote:
> >
> > >I have a G5 of the first model (PowerMac7,2) on which I've been
> > >using FreeBSD/ppc64 for over a year. Today, it suddenly rebooted.
> > >Not the first time by any means, but this is the first time I found
> > >the following log message: Jan 17 17:32:19 powermac kernel:
> > >WARNING: Current temperature (MLB MAX6690 AMB:127.8 C) exceeds
> > >critical temperature (80.0 C)! Shutting down!
> > >
> > >This is the first time I have seen such a message. After reboot,
> > >that sensor shows a temperature near 30C, which seems appropriate.
> > >The reading of 127.8C looks suspiciously like a max value.
> > >My only guess is there was a bad read that resulted in the sensor
> > >value going over the threshold. That raises a question in my mind
> > >as to whether there is any filtering or sanity checking of the
> > >data. Could a single bad read cause the threshold to be exceeded
> > >and trigger shutdown immediately, or would the excessive value
> > >have to be returned from that sensor multiple times for it to be
> > >believed and acted upon?
> > >
> > >$ uname -a
> > >FreeBSD powermac 9.1-RC1 FreeBSD 9.1-RC1 #0: Thu Aug 16 00:43:39
> > >UTC 2012
> > >root@anacreon.physics.wisc.edu:/usr/obj/usr/src/sys/GENERIC64
> > >powerpc
> > >
> > >The build is a bit old, though I wouldn't expect too much change to
> > >the code in question since then. I will update to 9.1-RELEASE or
> > >-STABLE in the next few days, but as this is a problem that has
> > >happened once in over a year, I wouldn't call it resolved just by a
> > >quick failure to reproduce after updating.
> > >
> > >I was already planning to do an update after the box has completed
> > >its current task. I noticed a problem with excessive output
> > >causing the console to hang. A couple days ago I found the machine
> > >apparently hung in that the keyboard and mouse were not responsive,
> > >but I found it was still alive on the network and I could ssh in to
> > >reboot. The only clues were no buffer space for dmesg to output
> > >anything before reboot, and a rather full /var/log/messages file
> > >which had exhausted the drive. Under the same workload (and after
> > >freeing some drive space), the problem reoccurred in a matter of
> > >hours, but this time with me watching. While running ddrescue
> > >against a drive with some bad sectors, read errors flood the
> > >console in spurts. When some dozens of read errors are displayed
> > >at once, the console scrolls whole pages by in a fraction of a
> > >second, and then goes dead. Messages that should go to console are
> > >not shown on screen but are in the log. Attempts to switch virtual
> > >console or to reboot are not successful, but ssh access continues
> > >to work and the box is clearly still processing other workloads.
> > >The only signs of life from the console are the messages about
> > >flushing buffers just before completion of the reboot commanded
> > >via ssh.
> > >
> >
> > Just a few hours later, it strikes again.
> > Jan 17 23:06:11 powermac kernel: WARNING: Current temperature (MLB
> > MAX6690 AMB: 127.0 C) exceeds critical temperature (80.0 C)!
> > Shutting down!
> >
> > I took a peek in smu.c and powermac_thermal.c. In the former,
> > smu_sensor_read() has a check for an error returned from
> > smu_run_cmd() but no checks on the returned data. In the latter,
> > pmac_therm_manage_fans() invokes smu_sensor_read() and considers the
> > returned value as valid if greater than zero. No other sanity checks
> > are performed.
> >
> > Looking at the datasheet[1] for max6690, I see that 127C is the
> > maximum readable temperature, which is represented as 01111111. The
> > value 10000000 is documented as representing a diode fault. As there
> > is no upper range check, the diode fault condition will be
> > interpreted as slightly over 127C. I think it would be appropriate
> > to treat as invalid any raw sensor value with the MSB set.
> > Additionally, the check on line 105 of pmac_therm_manage_fans
> > should really be "if (temp >= 0)" rather than just "if (temp > 0)"
> > as a value of 0 is a valid value for zero degrees and all actual
> > errors are represented as a value of -1.
> >
> > I have not looked at the datasheets for other relevant sensors, but
> > being that there are no range checks in any of the cases in
> > smu_sensor_read(), I currently consider them all suspect pending
> > review.
> >
> > [1] http://datasheets.maximintegrated.com/en/ds/MAX6690.pdf (Page
> > 11, Table 2)
> >
>
> It has been some time since I first looked at this, but something in
> the back of my mind said my first glance at it was flawed, so I
> recently revisited the matter.
>
> I realize I was looking in the wrong place previously. After reading
> through powermac_thermal.c I had done a search for sensor_read() and
> found a match in smu.c so I started from there. However, there is no
> SMU on this early model G5. Searching more than just the same
> directory, I found sensor_read() occurs in many places. The pertinent
> function is really in sys/dev/iicbus/max6690.c as
> max6690_sensor_read().
>
> Looking at the correct function, I see multiple problems.
> max6690_read() is called twice, both times storing the return value
> in err. The value of err is checked only once, after both calls, so
> if the first call returns an error but the second succeeds, the error
> indicator is overwritten and calculations will be done with bad data.
> The check for err < 0 should be done after each call to
> max6690_read(). After that basic check, which would only indicate an
> error if there was a problem with the I2C bus transaction,
> computations are done without any further data validation. As stated
> before, the sensor may signal an error condition by returning the
> value 10000000. There is no check for that reserved value, so when
> the sensor attempts to indicate an error, we end up with a
> temperature of 127C being returned from max6690_sensor_read(). That
> value causes havoc for obvious reasons.
>
> This should be easy to clean up but I don't have time at the moment to
> do the testing of the changes on the one machine to which this is
> applicable. If nobody else deals with it, I'll get around to doing so
> and post a patch when that time comes.

Found this in my email archives. I committed the error checking for the
integer part as r273016. Thanks!
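
For illustration only, the two checks discussed in the quoted analysis
(test the result of every bus read on its own, and reject a raw integer
byte with the MSB set, which covers the 10000000 diode-fault code) can
be sketched in C as below. This is not the code committed as r273016.
The names sketch_sensor_read, reg_read_fn, fake_bus and fake_read are
hypothetical, the return unit (tenths of a degree C) is chosen for the
example, and the layout of the fractional byte (top three bits, 0.125 C
per step) is an assumption that should be checked against the datasheet.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the driver's register read: returns 0 on
 * success and a negative value when the I2C transaction fails.
 */
typedef int (*reg_read_fn)(void *ctx, uint8_t reg, uint8_t *out);

#define SKETCH_MSB 0x80 /* 10000000 is documented as a diode fault */

/*
 * Returns the temperature in tenths of a degree C, or -1 on any error.
 * int_reg and frac_reg are placeholder register numbers.
 */
int
sketch_sensor_read(reg_read_fn rd, void *ctx, uint8_t int_reg,
    uint8_t frac_reg)
{
    uint8_t integer = 0, frac = 0;

    /* Check each read separately so a later success cannot hide
     * an earlier failure. */
    if (rd(ctx, int_reg, &integer) < 0)
        return (-1);
    if (rd(ctx, frac_reg, &frac) < 0)
        return (-1);

    /* Treat any raw value with the MSB set as invalid. */
    if (integer & SKETCH_MSB)
        return (-1);

    /* ASSUMPTION: top three bits of frac count 0.125 C steps. */
    return (integer * 10 + ((frac >> 5) & 0x7) * 125 / 100);
}

/* Tiny test double standing in for the I2C bus (illustration only). */
struct fake_bus {
    uint8_t regs[2];
    int fail_reg; /* register whose read fails, or -1 for none */
};

int
fake_read(void *ctx, uint8_t reg, uint8_t *out)
{
    struct fake_bus *bus = ctx;

    if (reg >= 2 || (int)reg == bus->fail_reg)
        return (-1);
    *out = bus->regs[reg];
    return (0);
}
```

With this shape a caller can accept any result >= 0, so a reading of
exactly zero degrees stays valid while -1 remains the only error value.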
- Justin

From owner-freebsd-ppc@FreeBSD.ORG Mon Oct 13 00:53:59 2014
Date: Sun, 12 Oct 2014 17:53:49 -0700
From: Mark Millard
To: FreeBSD PowerPC ML, Nathan Whitehorn
Cc: Justin Hibbits
Subject: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence
Message-Id: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net>

NOTE: I make no claim that any of the below hacks for ofwcall are
appropriate code for FreeBSD's general context. I only claim that it
seems to make the specific PowerMac G5 problem go away, gives solid
evidence for at least some of what is going on (justifying the
investigative and testing hacks) and so gives evidence for an
appropriate, more general FreeBSD solution.

The big issue is: The PowerMac G5 openfirmware does not always preserve
the %r1 value (the stack pointer contents) that it is initially given,
at least when the early "before copyright" crash problem is happening
but possibly other times as well.

I had the following investigative code in ofwcall, snapshotting the
value of %r1 before and after openfirmware's code is used:

    lis     %r4,openfirmware_entry@ha
    ld      %r4,openfirmware_entry@l(%r4)
    ...
    mr      %r17,%r1    /* ADDED HACK TO RECORD %r1 before... */
    /* Finally, branch to OF */
    mtctr   %r4
    bctrl
    mr      %r18,%r1    /* ADDED HACK TO RECORD %r1 after... */

Then the DDB "show registers" output from the crash script that I had
hacked in shows these values instead of the zeros that r17 and r18
otherwise always display, in addition to what "show registers" has
always shown for r1.

The results were like the following example for every such crash:

r17 = 0xC31400 ofwstk+0xfe0
r18 = 0xd24450
r1  = 0xd24450

Because of that %r1 value the later code such as:

    /* Reload stack pointer and MSR from the OFW stack */
    ld      %r6,24(%r1)
    ld      %r2,16(%r1)
    ld      %r1,8(%r1)

gets garbage-in/garbage-out results, including %r6 being values like
0xbc0568 instead of the saved MSR value that should later be restored:
0x9000000000001032.

So one PowerMac G5 specific hack involved in my working-boots context is
to force the original %r1 value to be used (based on %r17 being a
before-call copy, similar to the above):

    ld      %r6,24(%r17)
    ld      %r2,16(%r17)
    ld      %r1,8(%r17)

But the exception report from DDB has had problems, in part because
sprg0 still has the openfirmware value at the time even though the
exception is after openfirmware returned (the wrong value ends up in the
register loaded by GET_CPUINFO()). So I hacked in a before-exception
restore of FreeBSD's sprg0 inside ofwcall to make the exception handler
code have that much FreeBSD context available at the exception (if it
occurs, anyway). This was really just to help with information
gathering, although I have not tested a build with only the %r17
changes.

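
To make the failure mode above concrete, here is a small C model of the
three reloads, offered as a sketch only. The struct and function names
are hypothetical; only the offsets 8, 16 and 24 and the register each
one feeds come from the quoted instructions. The model shows that the
reload sequence itself is innocent: it hands back whatever sits at the
base pointer it is given, so a changed %r1 yields garbage while the
saved before-call copy yields the saved values.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the words that ofwcall reloads after the firmware returns. */
struct ofw_frame_model {
    uint64_t word0;     /* offset 0: written by the stw pair before the call */
    uint64_t saved_r1;  /* offset 8:  reloaded into %r1 */
    uint64_t saved_r2;  /* offset 16: reloaded into %r2 */
    uint64_t saved_msr; /* offset 24: reloaded into %r6, then mtmsrd */
};

/* Compile-time check that the model matches the offsets in the loads. */
_Static_assert(offsetof(struct ofw_frame_model, saved_r1) == 8, "r1 slot");
_Static_assert(offsetof(struct ofw_frame_model, saved_r2) == 16, "r2 slot");
_Static_assert(offsetof(struct ofw_frame_model, saved_msr) == 24, "msr slot");

struct reloaded_regs {
    uint64_t r1, r2, msr;
};

/* The reload uses whatever base pointer it is handed, valid or not. */
struct reloaded_regs
reload_from(const struct ofw_frame_model *sp)
{
    struct reloaded_regs regs;

    regs.msr = sp->saved_msr; /* ld %r6,24(sp) */
    regs.r2 = sp->saved_r2;   /* ld %r2,16(sp) */
    regs.r1 = sp->saved_r1;   /* ld %r1,8(sp)  */
    return (regs);
}
```

Calling reload_from() with the frame that was set up before the call
corresponds to the %r17 variant; calling it with an unrelated block of
memory corresponds to trusting the %r1 value the firmware left behind.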
So overall, my PowerMac G5 specific hacking changes the ofwcall code as
follows (based on what was reported above):

root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S
Index: /usr/src/sys/powerpc/ofw/ofwcall64.S
===================================================================
--- /usr/src/sys/powerpc/ofw/ofwcall64.S    (revision 272558)
+++ /usr/src/sys/powerpc/ofw/ofwcall64.S    (working copy)
@@ -52,6 +52,12 @@
 GLOBAL(rtas_entry)
     .llong  0 /* RTAS entry point */
 
+    /* HACK: part of having sprg0 in place for trap */
+ofwsprg0save:
+    .space  8 /* sizeof(register_t) */
+GLOBAL(ofw_sprg0_save)
+    .llong  0
+
 /*
  * Open Firmware Real-mode Entry Point. This is a huge pain.
  */
@@ -97,6 +103,10 @@
     lis     %r4,openfirmware_entry@ha
     ld      %r4,openfirmware_entry@l(%r4)
 
+    /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
+    lis     %r14,ofw_sprg0_save@ha
+    ld      %r14,ofw_sprg0_save@l(%r14)
+
     /*
      * Set the MSR to the OF value. This has the side effect of disabling
      * exceptions, which is important for the next few steps.
@@ -123,14 +133,27 @@
     stw     %r5,4(%r1)
     stw     %r5,0(%r1)
 
+    /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
+    lis     %r6,ofwsprg0save@ha
+    std     %r14,ofwsprg0save@l(%r6)
+
+    /* HACK: part of IGNORING the later %r1 value from openfirmware */
+    mr      %r17,%r1
+
     /* Finally, branch to OF */
     mtctr   %r4
     bctrl
 
+    /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
+    lis     %r6,ofwsprg0save@ha
+    ld      %r6,ofwsprg0save@l(%r6)
+    mtsprg0 %r6
+
     /* Reload stack pointer and MSR from the OFW stack */
-    ld      %r6,24(%r1)
-    ld      %r2,16(%r1)
-    ld      %r1,8(%r1)
+    /* HACKED to ignore the %r1 value that results from openfirmware's call */
+    ld      %r6,24(%r17)
+    ld      %r2,16(%r17)
+    ld      %r1,8(%r17)
 
     /* Now set the real MSR */
     mtmsrd  %r6

With these changes I have seen no crashes so far in my testing, not even
on the 16 GByte RAM machine that crashed so much.

NOTE: ofw_machdep.c was changed to use "extern register_t
ofw_sprg0_save;" to match the above.

I still have ps3 disabled in GENERIC64 so that I can also have the sc
options in GENERIC64. And the DDB and GDB options are still present as
well.

And I still have my hack to force a DDB script that does show registers
and shows the ofwcall history information that I hacked in, even for the
very early crashes before input is possible. Not that I'm now getting
such executions of the script. (A before possible-crash backtrace is
also shown by the added code. That still shows up.)

I'll probably next switch to reverting the DDB related code changes and
to removing the DDB/GDB options and see how that goes.

===
Mark Millard
markmi at dsl-only.net

From owner-freebsd-ppc@FreeBSD.ORG Mon Oct 13 02:25:54 2014
Date: Sun, 12 Oct 2014 19:25:44 -0700
From: Nathan Whitehorn
To: Mark Millard, FreeBSD PowerPC ML
Cc: Justin Hibbits
Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence
Message-ID: <543B3828.8070806@freebsd.org>
In-Reply-To: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net>
References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net>

Interesting. If OF is changing the value of r1, there must be some
problem with the ABI thunk the 64-bit kernel uses or a problem with trap
handlers. This is obviously not systematic if loader and the kernel up
to that point have no problems. Does a 32-bit kernel have the same
problems on your hardware? That would test whether it is the ABI
translation.
-Nathan

From owner-freebsd-ppc@FreeBSD.ORG Mon Oct 13 06:20:41 2014
Date: Sun, 12 Oct 2014 23:20:21 -0700
From: Mark Millard
To: Nathan Whitehorn
Cc: Justin Hibbits, FreeBSD PowerPC ML
Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence
Message-Id: <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>
In-Reply-To: <543B3828.8070806@freebsd.org>
References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org>

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on
the G5's except for when !pmap_bootstrapped in (variants of)
powerpc64/GENERIC64. (Only covers when I had enough debug context in
place to know that much. Similarly for other notes.) These ofwcall
related failures are the vast majority of the boot failures that I've
seen.

A nice thing about what I've found is that I can now figure out how to
use a comparison of the before and after stack pointers and to force
DDB's involvement if and only if they are not equal.

That would also report %r1 differences that happen not to produce
failures (if there are such). (There has to be some explanation for why
sometimes it works and sometimes it does not, say, unstable
initializations, race conditions, or something meeting both criteria.)

Which in turn makes the general technique appropriate to powerpc/GENERIC
contexts as well. (Coding details may vary.)

I can not promise how quickly I'll get to any specific part of this. But
I should gradually progress on it.

I should have mentioned some things about the kind of evidence I have
vs. do not (yet) have:

A) The property defining the only context where I have observed the %r1
issue is as noted above.

In all but one of the ofwcall failure cases it was the first ofwcall in
that !pmap_bootstrapped context that had the problem.

The only other ofwcall failure that I've seen happened only once and was
where prior ofwcalls with !pmap_bootstrapped had already happened (as
reported by the ofwcall history list in my debug/DDB hacks). But this
was before the %r1 before and after code was in place: that is a recent
addition to my investigation.

B) While I've not been building debug code variants for powerpc/GENERIC
I've never seen the powerpc/GENERIC code fail to boot the G5's. And I
have spent some sessions doing reboot after reboot to see if I'd get
some failures (in addition to some other more normal uses).

C) So far I've only been looking at "show registers" when it gets a
boot-time exception that DDB processes with the automatic script: the
crashes. I do not (yet) have any observations of what things look like
during such points for successful boots. (I'm figuring out ways to get
and see the evidence spanning early boot time as I go.) And so I've only
been looking with such special debug code where I knew I could reproduce
the failures (3 PowerMac G5's when using variants of
powerpc64/GENERIC64.)

In fact if the hack that I put in place completely masks the problem
then I currently would not ever observe any problem-specific information
from the successful boots. Thus the before/after comparison would seem
to be next for my investigation.

===
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn wrote:

> Interesting. If OF is changing the value of r1, there must be some
> problem with the ABI thunk the 64-bit kernel uses or a problem with
> trap handlers. This is obviously not systematic if loader and the
> kernel up to that point have no problems. Does a 32-bit kernel have
> the same problems on your hardware? That would test whether it is the
> ABI translation.
-Nathan On 10/12/14 17:53, Mark Millard wrote: > NOTE: I make no claim that any of the below hacks for ofwcall are = appropriate code for FreeBSD's general context. I only claim that it = seems to make the specific PowerMac G5 problem go away, gives solid = evidence for at least some of what is going on (justifying the = investigative and testing hacks) and so gives evidence for an = appropriate, more general FreeBSD solution. >=20 >=20 > The big issue is: The PowerMac G5 openfirmware does not always = preserve the %r1 value (the stack pointer contents) that it is initially = given, at least when the early "before copyright" crash problem is = happening but possibly other times as well. >=20 > I had the following investigative code in ofwcall, snapshotting the = value of %r1 before and after openfirmware's code is used: >=20 > lis %r4,openfirmware_entry@ha > ld %r4,openfirmware_entry@l(%r4) > ... > mr %r17,%r1 /* ADDED HACK TO RECORD %r1 before... > /* Finally, branch to OF */ > mtctr %r4 > bctrl > mr %r18,%r1 /* ADDED HACK TO RECORD %r1 after... >=20 > then the DDB show registers from the crash that I'd hacked in would = show these values instead of the zeros they otherwise always display, in = addition to what the show registers has always shown for r1. >=20 > The results were like the following example for every such crash: >=20 > r17 =3D 0xC31400 ofwstk+0xfe0 > r18 =3D 0xd24450 > r1 =3D 0xd24450 >=20 > Because of that %r1 value the later code such as: >=20 > /* Reload stack pointer and MSR from the OFW stack */ > ld %r6,24(%r1) > ld %r2,16(%r1) > ld %r1,8(%r1) >=20 > gets garbage-in/garbage-out results, including %r6 being values like = 0xbc0568 instead of the value saved msr to later be restored: = 0x9000000000001032. 
>
> So one PowerMac G5 specific hack involved in my working-boots context
> is to force the original %r1 value to be used (based on %r17 being a
> before-call copy, similar to the above):
>
>     ld    %r6,24(%r17)
>     ld    %r2,16(%r17)
>     ld    %r1,8(%r17)
>
> But the exception report from DDB has had problems in part because
> sprg0 still has the openfirmware value at the time even though the
> exception is after openfirmware returned (the wrong value results in the
> register for GET_CPUINFO(). So I hacked in a before-exception
> restore of FreeBSD's sprg0 inside ofwcall to make the exception handler
> code have that much FreeBSD context available at the exception (if it
> occurs, anyway). This was really just to help with information
> gathering, although I've not tested only having the %r17 changes.
>
> So overall PowerMac G5 specific hacking the ofwcall code to have
> instead (based on what was reported above):
>
> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S
> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S
> ===================================================================
> --- /usr/src/sys/powerpc/ofw/ofwcall64.S (revision 272558)
> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S (working copy)
> @@ -52,6 +52,12 @@
>  GLOBAL(rtas_entry)
>      .llong 0 /* RTAS entry point */
> +    /* HACK: part of having sprg0 in place for trap */
> +ofwsprg0save:
> +    .space 8 /* sizeof(register_t) */
> +GLOBAL(ofw_sprg0_save)
> +    .llong 0
> +
>  /*
>   * Open Firmware Real-mode Entry Point. This is a huge pain.
>   */
> @@ -97,6 +103,10 @@
>      lis %r4,openfirmware_entry@ha
>      ld %r4,openfirmware_entry@l(%r4)
> +    /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
> +    lis %r14,ofw_sprg0_save@ha
> +    ld %r14,ofw_sprg0_save@l(%r14)
> +
>      /*
>       * Set the MSR to the OF value. This has the side effect of disabling
>       * exceptions, which is important for the next few steps.
> @@ -123,14 +133,27 @@
>      stw %r5,4(%r1)
>      stw %r5,0(%r1)
> +    /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
> +    lis %r6,ofwsprg0save@ha
> +    std %r14,ofwsprg0save@l(%r6)
> +
> +    /* HACK: part of IGNORING the later %r1 value from openfirmware */
> +    mr %r17,%r1
> +
>      /* Finally, branch to OF */
>      mtctr %r4
>      bctrl
> +    /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
> +    lis %r6,ofwsprg0save@ha
> +    ld %r6,ofwsprg0save@l(%r6)
> +    mtsprg0 %r6
> +
>      /* Reload stack pointer and MSR from the OFW stack */
> -    ld %r6,24(%r1)
> -    ld %r2,16(%r1)
> -    ld %r1,8(%r1)
> +    /* HACKED to ignore the %r1 value that results from openfirmware's call */
> +    ld %r6,24(%r17)
> +    ld %r2,16(%r17)
> +    ld %r1,8(%r17)
>      /* Now set the real MSR */
>      mtmsrd %r6
>
> This results in no crashes happening so far in my testing, not even
> the 16 GByte RAM machine that crashed so much.
>
> NOTE: ofw_machdep.c was changed to use "extern register_t
> ofw_sprg0_save;" to match the above.
>
> I still have ps3 disabled in GENERIC64 so that I can also have the sc
> options in GENERIC64. And the DDB and GDB options are still present as
> well.
>
> And I still have my hack to force a DDB script that does show
> registers and shows the ofwcall history information that I hacked in,
> even for the very early crashes before input is possible. Not that I'm
> now getting such executions of the script. (A before possible-crash
> backtrace is also shown by the added code. That still shows up.)
>
> I'll probably next switch to reverting the DDB related code changes
> and to removing the DDB/GDB options and see how that goes.
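The before/after comparison mentioned as the next investigative step can be sketched the same way. The Python below is only a model of the control flow (snapshot the stack pointer, make the call, compare, continue with the snapshot). `call_with_sp_check`, `well_behaved` and `clobbering` are hypothetical names invented for this illustration, and the two addresses are the example values quoted above; none of this is FreeBSD code.

```python
def call_with_sp_check(firmware_call, sp):
    """Model of the hack: keep a copy of the stack pointer across the call.

    firmware_call is a hypothetical stand-in for the branch into Open
    Firmware; it returns the stack pointer value seen after it returns.
    Returns (sp_to_use, changed).
    """
    saved_sp = sp                 # mr %r17,%r1
    sp_after = firmware_call(sp)  # mtctr %r4 ; bctrl
    changed = sp_after != saved_sp
    return saved_sp, changed      # later loads use the saved copy


def well_behaved(sp):
    return sp                     # firmware preserved the stack pointer


def clobbering(sp):
    return 0xD24450               # firmware came back with a different value


for fw in (well_behaved, clobbering):
    sp, changed = call_with_sp_check(fw, 0xC31400)
    print(fw.__name__, hex(sp), "changed" if changed else "unchanged")
```

Reporting the `changed` flag separately from the restore is what would allow a debugger entry (or a boot stop) if and only if the two values differ, including cases where the difference does not happen to cause a crash.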
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
>

From owner-freebsd-ppc@FreeBSD.ORG Mon Oct 13 06:35:00 2014
Return-Path:
Delivered-To: freebsd-ppc@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 3DD69B73
 for ; Mon, 13 Oct 2014 06:35:00 +0000 (UTC)
Received: from asp.reflexion.net (outbound-241.asp.reflexion.net [69.84.129.241])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id C94C5661
 for ; Mon, 13 Oct 2014 06:34:59 +0000 (UTC)
Received: (qmail 13931 invoked from network); 13 Oct 2014 06:34:58 -0000
Received: from unknown (HELO mail-cs-01.app.dca.reflexion.local) (10.81.19.1)
 by 0 (rfx-qmail) with SMTP; 13 Oct 2014 06:34:58 -0000
Received: by mail-cs-01.app.dca.reflexion.local (Reflexion email security v7.30.7)
 with SMTP; Mon, 13 Oct 2014 02:34:58 -0400 (EDT)
Received: (qmail 9035 invoked from network); 13 Oct 2014 06:34:58 -0000
Received: from unknown (HELO iron2.pdx.net) (69.64.224.71)
 by 0 (rfx-qmail) with (DHE-RSA-AES256-SHA encrypted) SMTP; 13 Oct 2014 06:34:58 -0000
X-No-Relay: not in my network
X-No-Relay: not in my network
X-No-Relay: not in my network
Received: from [192.168.1.8] (c-98-246-178-138.hsd1.or.comcast.net [98.246.178.138])
 by iron2.pdx.net (Postfix) with ESMTPSA id B2FBA1C4015;
 Sun, 12 Oct 2014 23:34:50 -0700 (PDT)
Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific
 ofwcall changes with justifying evidence [important typos fixed]
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
Content-Type: text/plain; charset=windows-1252
From: Mark Millard
X-Priority: 1
In-Reply-To: <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>
Date: Sun, 12 Oct 2014 23:34:56 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net>
References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net>
 <543B3828.8070806@freebsd.org>
 <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>
To: Nathan Whitehorn
X-Mailer: Apple Mail (2.1878.6)
Cc: Justin Hibbits , FreeBSD PowerPC ML
X-BeenThere: freebsd-ppc@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Porting FreeBSD to the PowerPC
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 13 Oct 2014 06:35:00 -0000

Fixing stupid typos that reverse what I should have said: removing the
!'s in front of pmap_bootstrapped (from a copy/paste sequence error)...

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on
the G5's except for when pmap_bootstrapped in (variants of)
powerpc64/GENERIC64. (Only covers when I had enough debug context in
place to know that much. Similarly for other notes.) These ofwcall
related failures are the vast majority of the boot failures that I've
seen.

...

The only other ofwcall failure that I've seen happened only once and was
where prior ofwcall's with pmap_bootstrapped had already happened (as
reported by the ofwcall history list in my debug/DDB hacks). But this
was before the %r1 before and after code was in place: that is a recent
addition to my investigation.

===
Mark Millard
markmi@dsl-only.net

On Oct 12, 2014, at 11:20 PM, Mark Millard wrote:

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on
the G5's except for when !pmap_bootstrapped in (variants of)
powerpc64/GENERIC64. (Only covers when I had enough debug context in
place to know that much. Similarly for other notes.) These ofwcall
related failures are the vast majority of the boot failures that I've
seen.

A nice thing about what I've found is that I can now figure out how to
use a comparison of the before and after stack pointers and to force
DDB's involvement if and only if they are not equal.

That would also report %r1 differences that happen not to produce
failures (if there are such). (There has to be some explanation for why
sometimes it works and sometimes it does not, say, unstable
initializations, race conditions, or something meeting both criteria.)

Which in turn makes the general technique appropriate to powerpc/GENERIC
contexts as well. (Coding details may vary.)

I can not promise how quickly I'll get to any specific part of this. But
I should gradually progress on it.

I should have mentioned some things about the kind of evidence I have
vs. do not (yet) have:

A) The property defining the only context where I have observed the %r1
issue is as noted above.

In all but one of the ofwcall failure cases it was the first ofwcall in
that !pmap_bootstrapped context that had the problem.

The only other ofwcall failure that I've seen happened only once and was
where prior ofwcall's with !pmap_bootstrapped had already happened (as
reported by the ofwcall history list in my debug/DDB hacks). But this
was before the %r1 before and after code was in place: that is a recent
addition to my investigation.

B) While I've not been building debug code variants for powerpc/GENERIC
I've never seen the powerpc/GENERIC code fail to boot the G5's. And I
have spent some sessions doing reboot after reboot to see if I'd get
some failures (in addition to some other more normal uses).

C) So far I've only been looking at "show registers" when it gets a
boot-time exception that a DDB processes with the automatic script: the
crashes. I do not (yet) have any observations of what things look like
during such points for successful boots. (I'm figuring out ways to get
and see the evidence spanning early boot time as I go.)
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
>

From owner-freebsd-ppc@FreeBSD.ORG Mon Oct 13 10:39:45 2014
Return-Path:
Delivered-To: freebsd-ppc@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id B8A0BB02
 for ; Mon, 13 Oct 2014 10:39:45 +0000 (UTC)
Received: from asp.reflexion.net (outbound-241.asp.reflexion.net [69.84.129.241])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 56123F72
 for ; Mon, 13 Oct 2014 10:39:44 +0000 (UTC)
Received: (qmail 2179 invoked from network); 13 Oct 2014 10:39:43 -0000
Received: from unknown (HELO mail-cs-01.app.dca.reflexion.local) (10.81.19.1)
 by 0 (rfx-qmail) with SMTP; 13 Oct 2014 10:39:43 -0000
Received: by mail-cs-01.app.dca.reflexion.local (Reflexion email security v7.30.7)
 with SMTP; Mon, 13 Oct 2014 06:39:43 -0400 (EDT)
Received: (qmail 18777 invoked from network); 13 Oct 2014 10:39:42 -0000
Received: from unknown (HELO iron2.pdx.net) (69.64.224.71)
 by 0 (rfx-qmail) with (DHE-RSA-AES256-SHA encrypted) SMTP; 13 Oct 2014 10:39:42 -0000
X-No-Relay: not in my network
X-No-Relay: not in my network
X-No-Relay: not in my network
Received: from [192.168.1.8] (c-98-246-178-138.hsd1.or.comcast.net [98.246.178.138])
 by iron2.pdx.net (Postfix) with ESMTPSA id 9BA741C405B;
 Mon, 13 Oct 2014 03:39:40 -0700 (PDT)
Content-Type: text/plain; charset=windows-1252
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific
 ofwcall changes with justifying evidence [important typos fixed]
From: Mark Millard
In-Reply-To: <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net>
Date: Mon, 13 Oct 2014 03:39:38 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id:
References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net>
 <543B3828.8070806@freebsd.org>
 <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>
 <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net>
To: Nathan Whitehorn
X-Mailer: Apple Mail (2.1878.6)
Cc: Justin Hibbits , FreeBSD PowerPC ML
X-BeenThere: freebsd-ppc@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Porting FreeBSD to the PowerPC
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 13 Oct 2014 10:39:45 -0000

While I do not yet have "show register" or other information displayed
when %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have
now had two cases happen for the same, unmodified boot SSD in the same
PowerMac G5:

A) Boots without failure or finding any changes to %r1 for before vs.
after openfirmware calls.

B) I had it stop the boot after the code finds that %r1 had instead
changed. The usual before-copyright-notice sort of timing for where it
stopped, after pmap_bootstrapped became true. (I need "show register" or
other such to have more detail.)

I still have no examples of unstable/incomplete initialization(s) or
race condition(s) to explain why both ways can and do occur from one
attempt to the next. But both do.

===
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 11:34 PM, Mark Millard wrote:

Fixing stupid typos that reverse what I should have said: removing the
!'s in front of pmap_bootstrapped (from a copy/paste sequence error)...

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on
the G5's except for when pmap_bootstrapped in (variants of)
powerpc64/GENERIC64. (Only covers when I had enough debug context in
place to know that much. Similarly for other notes.) These ofwcall
related failures are the vast majority of the boot failures that I've
seen.

...

The only other ofwcall failure that I've seen happened only once and was
where prior ofwcall's with pmap_bootstrapped had already happened (as
reported by the ofwcall history list in my debug/DDB hacks). But this
was before the %r1 before and after code was in place: that is a recent
addition to my investigation.

===
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 11:20 PM, Mark Millard wrote:

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on
the G5's except for when !pmap_bootstrapped in (variants of)
powerpc64/GENERIC64. (Only covers when I had enough debug context in
place to know that much. Similarly for other notes.) These ofwcall
related failures are the vast majority of the boot failures that I've
seen.

A nice thing about what I've found is that I can now figure out how to
use a comparison of the before and after stack pointers and to force
DDB's involvement if and only if they are not equal.

That would also report %r1 differences that happen not to produce
failures (if there are such). (There has to be some explanation for why
sometimes it works and sometimes it does not, say, unstable
initializations, race conditions, or something meeting both criteria.)

Which in turn makes the general technique appropriate to powerpc/GENERIC
contexts as well. (Coding details may vary.)

I can not promise how quickly I'll get to any specific part of this. But
I should gradually progress on it.

I should have mentioned some things about the kind of evidence I have
vs. do not (yet) have:

A) The property defining the only context where I have observed the %r1
issue is as noted above.

In all but one of the ofwcall failure cases it was the first ofwcall in
that !pmap_bootstrapped context that had the problem.
>=20 >=20 > =3D=3D=3D > Mark Millard > markmi at dsl-only.net >=20 >=20 From owner-freebsd-ppc@FreeBSD.ORG Mon Oct 13 14:57:15 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2B6255A8 for ; Mon, 13 Oct 2014 14:57:15 +0000 (UTC) Received: from mail.titaniotecnologia.com.br (mail2.tsemredes.com.br [177.52.170.133]) by mx1.freebsd.org (Postfix) with ESMTP id 6A738E46 for ; Mon, 13 Oct 2014 14:57:12 +0000 (UTC) Received: (qmail 7187 invoked from network); 13 Oct 2014 11:57:16 -0300 Received: by simscan 1.4.0 ppid: 7180, pid: 7182, t: 0.0747s scanners: attach: 1.4.0 clamav: 0.98.1/m:55/d:18859 Received: from unknown (HELO ?192.168.25.169?) (felipe@felipeoliva.eti.br@189.58.62.139) de/crypted with TLSv1.2: DHE-RSA-AES128-SHA [128/128] DN=unknown by mail.titaniotecnologia.com.br with ESMTPSA; 13 Oct 2014 11:57:15 -0300 Message-ID: <543BE83E.9080106@felipeoliva.eti.br> Date: Mon, 13 Oct 2014 11:57:02 -0300 From: "Felipe N. 
Oliva" User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2 MIME-Version: 1.0 To: Adrian Chadd Subject: Re: Cross compile RB1100ahx2 References: <54381648.5030900@felipeoliva.eti.br> In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1 Cc: FreeBSD PowerPC ML X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Oct 2014 14:57:15 -0000 Compiling kernel file /usr/src/sys/powerpc/conf/MPC85XX and booting from network, return this: *RouterBOOT booter 3.10** ** **RouterBoard 1100AHx2** ** **CPU frequency: 1066 MHz** ** Memory size: 2048 MiB** ** NAND size: 128 MiB** ** **Press any key within 2 seconds to enter setup..** **trying bootp protocol.... OK** **Got IP address: 192.168.10.200** **resolved mac address 08:00:27:BF:9B:62** **Gateway: 192.168.10.1** **transfer started .................................................. transfer ok, time=3.70s** **setting up elf image... OK** **jumping to kernel code* And break. On 11/10/2014 18:10, Adrian Chadd wrote: > Damn, I keep meaning to acquire one of these to run up as a test 11n AP. > > Nathan? ANy ideas? > > > -a > > > On 10 October 2014 10:24, Felipe N. Oliva wrote: >> Regards, >> >> I want to do cross compiling from amd64 to powerpc(RouterBOARD 1100ahx2 >> processor P2020). >> >> Has anyone done this? Works well? Any tips? >> >> Thank you, >> -- >> Felipe N. 
Oliva >> >> _______________________________________________ >> freebsd-ppc@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-ppc >> To unsubscribe, send any mail to "freebsd-ppc-unsubscribe@freebsd.org" From owner-freebsd-ppc@FreeBSD.ORG Tue Oct 14 11:47:24 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 841BFC5B for ; Tue, 14 Oct 2014 11:47:24 +0000 (UTC) Received: from mail.titaniotecnologia.com.br (mail2.tsemredes.com.br [177.52.170.133]) by mx1.freebsd.org (Postfix) with ESMTP id C28AEFAC for ; Tue, 14 Oct 2014 11:47:22 +0000 (UTC) Received: (qmail 14013 invoked from network); 14 Oct 2014 08:47:31 -0300 Received: by simscan 1.4.0 ppid: 14006, pid: 14008, t: 0.0587s scanners: attach: 1.4.0 clamav: 0.98.1/m:55/d:18859 Received: from unknown (HELO ?192.168.25.169?) (felipe@felipeoliva.eti.br@189.58.58.57) de/crypted with TLSv1.2: DHE-RSA-AES128-SHA [128/128] DN=unknown by mail.titaniotecnologia.com.br with ESMTPSA; 14 Oct 2014 08:47:31 -0300 Message-ID: <543D0D46.2020402@felipeoliva.eti.br> Date: Tue, 14 Oct 2014 08:47:18 -0300 From: "Felipe N. 
Oliva" User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2 MIME-Version: 1.0 To: freebsd-ppc@freebsd.org Subject: Re: Cross compile RB1100ahx2 References: <54381648.5030900@felipeoliva.eti.br> <543BE83E.9080106@felipeoliva.eti.br> In-Reply-To: <543BE83E.9080106@felipeoliva.eti.br> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Oct 2014 11:47:24 -0000

Regards,

I adapted this article: https://wiki.freebsd.org/FreeBSD/BuildingMIPS

Building:

make -j4 TARGET=powerpc TARGET_ARCH=powerpc buildworld
make -j4 TARGET=powerpc TARGET_ARCH=powerpc buildkernel KERNCONF=MPC85XX
make -j4 TARGET=powerpc TARGET_ARCH=powerpc DESTDIR="/teste/powerpc/root" installkernel KERNCONF=MPC85XX
make -j4 TARGET=powerpc TARGET_ARCH=powerpc DESTDIR="/teste/powerpc/root" installworld
make -j4 TARGET=powerpc TARGET_ARCH=powerpc DESTDIR="/teste/powerpc/root" distribution

DHCPD:

subnet 192.168.10.0 netmask 255.255.255.0 {
    deny unknown-clients;
    option routers 192.168.10.1;
    option root-path "192.168.10.196:/teste/powerpc/root/";
    # tftp server address and kernel path
    next-server 192.168.10.196;
    filename "root/boot/kernel";
}

host rb1100ahx2 {
    hardware ethernet 4c:5e:0c:3b:a5:c2; # the mac address of the board
    fixed-address 192.168.10.200; # pick an unused address
}

/etc/exports:

/teste/powerpc/root/ -maproot=root -network 192.168.10/24

TFTPD:

tftp dgram udp wait root /usr/libexec/tftpd tftpd -l -s /teste/powerpc

/etc/rc.conf:

dhcpd_enable="YES"
inetd_enable="YES"
rpcbind_enable="YES"
rpc_statd_enable="YES"
rpc_lockd_enable="YES"
nfs_server_enable="YES"
mountd_enable="YES"

Maybe the problem is the serial console. Any idea?

Thanks,
Felipe N. Oliva

On 13/10/2014 11:57, Felipe N.
Oliva wrote:
> Compiling the kernel from /usr/src/sys/powerpc/conf/MPC85XX and booting
> it from the network returns this:
>
> RouterBOOT booter 3.10
>
> RouterBoard 1100AHx2
>
> CPU frequency: 1066 MHz
>   Memory size: 2048 MiB
>     NAND size: 128 MiB
>
> Press any key within 2 seconds to enter setup..
> trying bootp protocol.... OK
> Got IP address: 192.168.10.200
> resolved mac address 08:00:27:BF:9B:62
> Gateway: 192.168.10.1
> transfer started ..................................................
> transfer ok, time=3.70s
> setting up elf image... OK
> jumping to kernel code
>
> And it breaks there.
>
> On 11/10/2014 18:10, Adrian Chadd wrote:
>> Damn, I keep meaning to acquire one of these to run up as a test 11n AP.
>>
>> Nathan? Any ideas?
>>
>>
>> -a
>>
>>
>> On 10 October 2014 10:24, Felipe N. Oliva
>> wrote:
>>> Regards,
>>>
>>> I want to do cross compiling from amd64 to powerpc(RouterBOARD 1100ahx2
>>> processor P2020).
>>>
>>> Has anyone done this? Works well? Any tips?
>>>
>>> Thank you,
>>> --
>>> Felipe N.
Oliva >>> >>> _______________________________________________ >>> freebsd-ppc@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-ppc >>> To unsubscribe, send any mail to "freebsd-ppc-unsubscribe@freebsd.org" > > _______________________________________________ > freebsd-ppc@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-ppc > To unsubscribe, send any mail to "freebsd-ppc-unsubscribe@freebsd.org" From owner-freebsd-ppc@FreeBSD.ORG Tue Oct 14 16:14:21 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 8320880 for ; Tue, 14 Oct 2014 16:14:21 +0000 (UTC) Received: from asp.reflexion.net (outbound-241.asp.reflexion.net [69.84.129.241]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 14FDB144 for ; Tue, 14 Oct 2014 16:14:20 +0000 (UTC) Received: (qmail 30133 invoked from network); 14 Oct 2014 16:14:12 -0000 Received: from unknown (HELO mail-cs-02.app.dca.reflexion.local) (10.81.19.2) by 0 (rfx-qmail) with SMTP; 14 Oct 2014 16:14:12 -0000 Received: by mail-cs-02.app.dca.reflexion.local (Reflexion email security v7.30.7) with SMTP; Tue, 14 Oct 2014 12:14:12 -0400 (EDT) Received: (qmail 11084 invoked from network); 14 Oct 2014 16:14:12 -0000 Received: from unknown (HELO iron2.pdx.net) (69.64.224.71) by 0 (rfx-qmail) with (DHE-RSA-AES256-SHA encrypted) SMTP; 14 Oct 2014 16:14:12 -0000 X-No-Relay: not in my network X-No-Relay: not in my network X-No-Relay: not in my network Received: from [192.168.1.8] (c-98-246-178-138.hsd1.or.comcast.net [98.246.178.138]) by iron2.pdx.net (Postfix) with ESMTPSA id DFF7A1C4052; Tue, 14 Oct 2014 09:14:08 -0700 (PDT) Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Mac OS X 
Mail 7.3 \(1878.6\)) Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed] From: Mark Millard In-Reply-To: Date: Tue, 14 Oct 2014 09:14:10 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> To: Nathan Whitehorn X-Mailer: Apple Mail (2.1878.6) Cc: Justin Hibbits , FreeBSD PowerPC ML X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Oct 2014 16:14:21 -0000

Additional notes from additional experiments... (So far from one G5.)

I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently).

An interesting property resulted: every time %r1 had changed from the before-value (the stack pointer value), %r1 instead ended up with a value equal to what openfirmware put in %r3.

And more than that: for builds with the same ofwstk position the %r3 value involved was fixed for the failures. For example, when 0x30400=ofwstk+0xfe0 (%r1 before) was reported, %r3 and %r1 ended up as 0xd23450 for the failures. When 0x31400=ofwstk+0xfe0, %r3 and %r1 ended up for failure as 0xd24450 instead. Yep: offset by the same amount as ofwstk.
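
The claim that the post-call value moves in step with ofwstk can be checked directly from the four addresses quoted above. What follows is only a sketch in Python: the dictionary layout and the names are illustrative and are not part of the kernel or of the debugging hacks, and the addresses are used exactly as written in this message.

```python
# Addresses exactly as quoted in the message; the names are illustrative.
builds = [
    {"r1_before": 0x30400, "r1_after": 0xd23450},
    {"r1_before": 0x31400, "r1_after": 0xd24450},
]


def shift(key):
    """Difference between the second and the first build for one field."""
    return builds[1][key] - builds[0][key]


before_shift = shift("r1_before")  # how far ofwstk+0xfe0 moved
after_shift = shift("r1_after")    # how far the post-call %r1/%r3 value moved

print(hex(before_shift), hex(after_shift))
print("same shift:", before_shift == after_shift)
```

Both differences come out as 0x1000, which is what "offset by the same amount" means here: the value Open Firmware leaves behind appears to track the position of ofwstk rather than being an unrelated constant.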

And I got one example where the openfirmware %r1-value-change failure was instead much later in the boot, well after pmap_bootstrapped went true: It was just after the message lines...

vgapci0: Boot video device ...
pcib1: ...

with back trace (from OF_peer down):

.OF_peer+0x8c
.cpcht_attach+0x884
.device_attach+0x3ac
.device_probe_and_attach+0x3c
.bus_generic_new_pass+0x12c
.bus_generic_new_pass+0x114
.bus_generic_new_pass+0x114 (yep: listed twice)
.bus_set_pass+0xc0
.root_bus_configure+0x14
.mi_startup+0x10c
btext+0xbc

%r1 before: 0xc30400 ofwstk+0xfe0
%r1 after: 0xd23450
%r3 after: 0xd23450
FreeBSD msr to restore: 0x9000000000001032
ofmsr[0] to restore: 0x1000000000003030

The same after-openfirmware %r1 and %r3 values that had been showing up for the before-copyright examples of ofwcall failures.

And note that it again was a peer request. All the ofwcall-tied boot-failures have been for peer requests as far as I remember.

I later did some experiments where I had it report but not stop when the after-value was different from the before-value for %r1. When this happened for these types of tests it seems to be an isolated example: later calls normally have the stack pointer value still in %r1 after openfirmware returns. In more detail: At most one report was made for such a boot, the rest of the boot went fine. (Of course to get that far my hacked ofwcall code avoids using the after-openfirmware %r1 value to extract the 3 saved values to be restored from the bottom of ofwstk.)

I was not successful at using "capture on" in DDB for this early-boot context. (It hangs things after the first report.) So I've been limited to one screen's report and only when I have it stop at the end of the report (so it does not scroll away). (No input to DDB available that early.) Otherwise the information just scrolls by too quickly for reading any detail.
Still it was useful to see that other reports were not produced after the first (when there was a first). (I can not claim multiple are impossible. It just appears at least infrequent.)

I have not yet investigated making analogous powerpc/GENERIC code and builds.

Nor have I dealt with having it report more detail about the peer requests that fail.

Nor have I seen examples of what "not failing/%r1-unchanged" looks like overall.

I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next, or why different peer requests in the sequence can be where the problem happens.

===
Mark Millard
markmi@dsl-only.net

On Oct 13, 2014, at 3:39 AM, Mark Millard wrote:

While I do not yet have "show register" or other information displayed when %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have now had two cases happen for the same, unmodified boot SSD in the same PowerMac G5:

A) Boots without failure or finding any changes to %r1 for before vs. after openfirmware calls.

B) I had it stop the boot after the code finds that %r1 had instead changed. The usual before-copyright-notice sort of timing for where it stopped, after pmap_bootstrapped became true. (I need "show register" or other such to have more detail.)

I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next. But both do.

===
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 11:34 PM, Mark Millard wrote:

Fixing stupid typos that reverse what I should have said: removing the !'s in front of pmap_bootstrapped (from a copy/paste sequence error)...

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the G5's except for when pmap_bootstrapped in (variants of) powerpc64/GENERIC64.
(Only covers when I had enough debug context in place to know that much. Similarly for other notes.) These ofwcall related failures are the vast majority of the boot failures that I've seen.

...

The only other ofwcall failure that I've seen happened only once and was where prior ofwcall's with pmap_bootstrapped had already happened (as reported by the ofwcall history list in my debug/DDB hacks). But this was before the %r1 before and after code was in place: that is a recent addition to my investigation.

===
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 11:20 PM, Mark Millard wrote:

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the G5's except for when !pmap_bootstrapped in (variants of) powerpc64/GENERIC64. (Only covers when I had enough debug context in place to know that much. Similarly for other notes.) These ofwcall related failures are the vast majority of the boot failures that I've seen.

A nice thing about what I've found is that I can now figure out how to use a comparison of the before and after stack pointers and to force DDB's involvement if and only if they are not equal.

That would also report %r1 differences that happen not to produce failures (if there are such). (There has to be some explanation for why sometimes it works and sometimes it does not, say, unstable initializations, race conditions, or something meeting both criteria.)

Which in turn makes the general technique appropriate to powerpc/GENERIC contexts as well. (Coding details may vary.)

I can not promise how quickly I'll get to any specific part of this. But I should gradually progress on it.

I should have mentioned some things about the kind of evidence I have vs. do not (yet) have:

A) The property defining the only context where I have observed the %r1 issue is as noted above.
In all but one of the ofwcall failure cases it was the first ofwcall in that !pmap_bootstrapped context that had the problem.

The only other ofwcall failure that I've seen happened only once and was where prior ofwcall's with !pmap_bootstrapped had already happened (as reported by the ofwcall history list in my debug/DDB hacks). But this was before the %r1 before and after code was in place: that is a recent addition to my investigation.

B) While I've not been building debug code variants for powerpc/GENERIC I've never seen the powerpc/GENERIC code fail to boot the G5's. And I have spent some sessions doing reboot after reboot to see if I'd get some failures (in addition to some other more normal uses).

C) So far I've only been looking at "show registers" when it gets a boot-time exception that DDB processes with the automatic script: the crashes. I do not (yet) have any observations of what things look like during such points for successful boots. (I'm figuring out ways to get and see the evidence spanning early boot time as I go.) And so I've only been looking with such special debug code where I knew I could reproduce the failures (3 PowerMac G5's when using variants of powerpc64/GENERIC64.)

In fact if the hack that I put in place completely masks the problem then I currently would not ever observe any problem-specific information from the successful boots. Thus the before/after comparison would seem to be next for my investigation.

===
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn wrote:

Interesting. If OF is changing the value of r1, there must be some problem with the ABI thunk the 64-bit kernel uses or a problem with trap handlers. This is obviously not systematic if loader and the kernel up to that point have no problems. Does a 32-bit kernel have the same problems on your hardware? That would test whether it is the ABI translation.
-Nathan

On 10/12/14 17:53, Mark Millard wrote:
> NOTE: I make no claim that any of the below hacks for ofwcall are appropriate code for FreeBSD's general context. I only claim that it seems to make the specific PowerMac G5 problem go away, gives solid evidence for at least some of what is going on (justifying the investigative and testing hacks) and so gives evidence for an appropriate, more general FreeBSD solution.
>
>
> The big issue is: The PowerMac G5 openfirmware does not always preserve the %r1 value (the stack pointer contents) that it is initially given, at least when the early "before copyright" crash problem is happening but possibly other times as well.
>
> I had the following investigative code in ofwcall, snapshotting the value of %r1 before and after openfirmware's code is used:
>
>     lis %r4,openfirmware_entry@ha
>     ld %r4,openfirmware_entry@l(%r4)
>     ...
>     mr %r17,%r1 /* ADDED HACK TO RECORD %r1 before...
>     /* Finally, branch to OF */
>     mtctr %r4
>     bctrl
>     mr %r18,%r1 /* ADDED HACK TO RECORD %r1 after...
>
> then the DDB show registers from the crash that I'd hacked in would show these values instead of the zeros they otherwise always display, in addition to what the show registers has always shown for r1.
>
> The results were like the following example for every such crash:
>
> r17 = 0xC31400 ofwstk+0xfe0
> r18 = 0xd24450
> r1 = 0xd24450
>
> Because of that %r1 value the later code such as:
>
>     /* Reload stack pointer and MSR from the OFW stack */
>     ld %r6,24(%r1)
>     ld %r2,16(%r1)
>     ld %r1,8(%r1)
>
> gets garbage-in/garbage-out results, including %r6 being values like 0xbc0568 instead of the saved msr value that is to be restored later: 0x9000000000001032.
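
To see why loading a garbage word into the MSR is fatal, it can help to decode the two values mentioned above. This is a rough sketch in Python: the table covers only a subset of the PowerPC MSR bits, the short names are chosen here for illustration (FreeBSD's headers use PSL_-prefixed macros for the same masks), and decode_msr is a hypothetical helper, not a kernel function.

```python
# Subset of PowerPC MSR bit masks; short names are illustrative.
MSR_BITS = {
    "SF": 1 << 63,  # 64-bit mode
    "HV": 1 << 60,  # hypervisor state
    "EE": 0x8000,   # external interrupts enabled
    "PR": 0x4000,   # problem (user) state
    "FP": 0x2000,   # floating point available
    "ME": 0x1000,   # machine check enabled
    "IR": 0x20,     # instruction address translation
    "DR": 0x10,     # data address translation
    "RI": 0x2,      # recoverable interrupt
}


def decode_msr(value):
    """Return the names of the table's bits that are set in value."""
    return [name for name, mask in MSR_BITS.items() if value & mask]


saved = decode_msr(0x9000000000001032)  # the MSR FreeBSD expected to reload
garbage = decode_msr(0xbc0568)          # the word actually loaded into %r6

print("saved:  ", saved)
print("garbage:", garbage)
```

With this table the saved value decodes to SF, HV, ME, IR, DR and RI, while the garbage word has only IR set among the listed bits. Under those assumptions, writing it with mtmsrd would drop 64-bit mode and data translation at once, which fits a crash immediately after the call returns.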
>
> So one PowerMac G5 specific hack involved in my working-boots context is to force the original %r1 value to be used (based on %r17 being a before-call copy, similar to the above):
>
>     ld %r6,24(%r17)
>     ld %r2,16(%r17)
>     ld %r1,8(%r17)
>
> But the exception report from DDB has had problems in part because sprg0 still has the openfirmware value at the time even though the exception is after openfirmware returned (the wrong value results in the register for GET_CPUINFO()). So I hacked in a before-exception restore of FreeBSD's sprg0 inside ofwcall to make the exception handler code have that much FreeBSD context available at the exception (if it occurs, anyway). This was really just to help with information gathering, although I've not tested only having the %r17 changes.
>
> So overall, the PowerMac G5 specific hacking changes the ofwcall code to the following instead (based on what was reported above):
>
> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S
> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S
> ===================================================================
> --- /usr/src/sys/powerpc/ofw/ofwcall64.S (revision 272558)
> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S (working copy)
> @@ -52,6 +52,12 @@
>  GLOBAL(rtas_entry)
>      .llong 0 /* RTAS entry point */
> +    /* HACK: part of having sprg0 in place for trap */
> +ofwsprg0save:
> +    .space 8 /* sizeof(register_t) */
> +GLOBAL(ofw_sprg0_save)
> +    .llong 0
> +
>  /*
>   * Open Firmware Real-mode Entry Point. This is a huge pain.
>   */
> @@ -97,6 +103,10 @@
>      lis %r4,openfirmware_entry@ha
>      ld %r4,openfirmware_entry@l(%r4)
> +    /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
> +    lis %r14,ofw_sprg0_save@ha
> +    ld %r14,ofw_sprg0_save@l(%r14)
> +
>      /*
>       * Set the MSR to the OF value.
This has the side effect of disabling
>       * exceptions, which is important for the next few steps.
> @@ -123,14 +133,27 @@
>      stw %r5,4(%r1)
>      stw %r5,0(%r1)
> +    /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
> +    lis %r6,ofwsprg0save@ha
> +    std %r14,ofwsprg0save@l(%r6)
> +
> +    /* HACK: part of IGNORING the later %r1 value from openfirmware */
> +    mr %r17,%r1
> +
>      /* Finally, branch to OF */
>      mtctr %r4
>      bctrl
> +    /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
> +    lis %r6,ofwsprg0save@ha
> +    ld %r6,ofwsprg0save@l(%r6)
> +    mtsprg0 %r6
> +
>      /* Reload stack pointer and MSR from the OFW stack */
> -    ld %r6,24(%r1)
> -    ld %r2,16(%r1)
> -    ld %r1,8(%r1)
> +    /* HACKED to ignore the %r1 value that results from openfirmware's call */
> +    ld %r6,24(%r17)
> +    ld %r2,16(%r17)
> +    ld %r1,8(%r17)
>      /* Now set the real MSR */
>      mtmsrd %r6
>
> This results in no crashes happening so far in my testing, not even on the 16 GByte RAM machine that crashed so much.
>
> NOTE: ofw_machdep.c was changed to use "extern register_t ofw_sprg0_save;" to match the above.
>
> I still have ps3 disabled in GENERIC64 so that I can also have the sc options in GENERIC64. And the DDB and GDB options are still present as well.
>
> And I still have my hack to force a DDB script that does show registers and shows the ofwcall history information that I hacked in, even for the very early crashes before input is possible. Not that I'm now getting such executions of the script. (A before possible-crash backtrace is also shown by the added code. That still shows up.)
>
> I'll probably next switch to reverting the DDB related code changes and to removing the DDB/GDB options and see how that goes.
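
The effect of the %r17 change can be modelled outside the kernel. The sketch below is a toy model in Python, not the kernel code: memory is a plain dictionary, firmware is any callable that returns the %r1 value left behind by Open Firmware, and the saved stack pointer and TOC values are made-up placeholders. Only the MSR value, the ofwstk-relative address and the two post-call values come from the report above.

```python
SAVED_MSR = 0x9000000000001032  # FreeBSD MSR value quoted in the report
OFW_SP = 0xc31400               # ofwstk+0xfe0 in the quoted register dump
BAD_R1 = 0xd24450               # %r1 as left behind by the failing call


def ofwcall(memory, sp, firmware, use_saved_copy):
    """Toy model of the tail of ofwcall; returns the reloaded values."""
    r17 = sp              # mr %r17,%r1 (copy taken before the call)
    r1 = firmware(sp)     # bctrl; Open Firmware may return a changed %r1
    base = r17 if use_saved_copy else r1
    return {
        "msr": memory.get(base + 24),  # ld %r6,24(base)
        "toc": memory.get(base + 16),  # ld %r2,16(base)
        "sp": memory.get(base + 8),    # ld %r1,8(base)
        "r1_changed": r1 != r17,       # the before/after comparison
    }


# Frame stored on the OFW stack before the call. The stack pointer and
# TOC values are hypothetical placeholders.
memory = {
    OFW_SP + 8: 0x1000,      # placeholder kernel stack pointer
    OFW_SP + 16: 0x2000,     # placeholder TOC pointer
    OFW_SP + 24: SAVED_MSR,
    BAD_R1 + 24: 0xbc0568,   # garbage word reported in %r6
}


def clobbering_firmware(sp):
    """Stand-in for a call that returns with %r1 overwritten."""
    return BAD_R1


original = ofwcall(memory, OFW_SP, clobbering_firmware, use_saved_copy=False)
patched = ofwcall(memory, OFW_SP, clobbering_firmware, use_saved_copy=True)
print(hex(original["msr"]), hex(patched["msr"]), patched["r1_changed"])
```

In this model the unpatched path reloads the garbage word as the MSR and finds nothing at the stack pointer slot, while the patched path recovers the saved frame and still records that %r1 changed. That matches the observation that reporting the mismatch and surviving it can be done in the same pass.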
>=20 >=20 > =3D=3D=3D > Mark Millard > markmi at dsl-only.net >=20 >=20 From owner-freebsd-ppc@FreeBSD.ORG Tue Oct 14 16:53:33 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id DA949AC6; Tue, 14 Oct 2014 16:53:32 +0000 (UTC) Received: from mail-la0-x22f.google.com (mail-la0-x22f.google.com [IPv6:2a00:1450:4010:c03::22f]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 1BC887BD; Tue, 14 Oct 2014 16:53:31 +0000 (UTC) Received: by mail-la0-f47.google.com with SMTP id pv20so8990263lab.34 for ; Tue, 14 Oct 2014 09:53:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=jrymTU4r2dgpsF+1CBg91zSfc6OluZuMyMpwCwm62GE=; b=m+WbN1lCtLDd/q+T72gAYtM/Er2QC8qYxN2WCJWYjpkV+6CjCvvxqSvuxu8crxSt5i /69g8IV0NUMMR92RZSDicHTqP4gCRAe1G/EDAWGImxdyxZTLMvEOt86f7kR9DsSN9HH/ EvFX97HJW7wDxI8+jwsnmoTmP4OokhoJFdLSEfKB2viJMNwF+QbuKzcGF2WZFkz5Cc+v Uxj5k+iIVhUuf8Ep6Eu1zUY6JTyMMT+ohVo32BRwbDMzet/aAKZU26vJcQ37qMTCOS79 A8TfkwpoumqPh415puwCkrfFF+nU+8cWsPyZGe0zKyU6xW4Xt5LkV+r5s1Ll+jdgIBZS MRxw== MIME-Version: 1.0 X-Received: by 10.112.146.5 with SMTP id sy5mr4281508lbb.97.1413305608356; Tue, 14 Oct 2014 09:53:28 -0700 (PDT) Received: by 10.25.0.75 with HTTP; Tue, 14 Oct 2014 09:53:28 -0700 (PDT) In-Reply-To: References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> Date: Tue, 14 Oct 2014 09:53:28 -0700 Message-ID: Subject: Re: My PowerMac G5's no longer 
crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed] From: Justin Hibbits To: Mark Millard Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: FreeBSD PowerPC ML X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Oct 2014 16:53:33 -0000

Interesting. Perhaps, instead of using %r1 and relying purely on the stack, we could use yet another (non-volatile) register to hold the MSR. Once we reload the MSR we can get back the saved registers, because the stack will be valid again.

Nathan, thoughts?

- Justin

On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard wrote:
> Additional notes from additional experiments... (So far from one G5.)
>
> I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently).
>
> An interesting property resulted: every time %r1 had changed from having the before-value (stack pointer value) %r1 instead ended up with a value equal to what openfirmware put in %r3.
>
> And more then that: For builds with the same ofwstk position the %r3 value involved was fixed for the failures, for example when 0x30400=ofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as 0xd23450 for the failures. When 0x31400=ofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd24450 instead. Yep: offset by the same amount as ofwstk.
> > And I got one example where the openfirmware %r1-value-change failure was= instead much later in the boot, well after pmap_bootstrapped went true: It= was just after the message lines... > > vgapci0: Boot video device ... > pcib1: ... > > with back trace (from OF_peer down): > > .OF_peer+0x8c > .cpcht_attach+0x884 > .device_attach+0x3ac > .device_probe_and_attach+0x3c > .bus_generic_new_pass+0x12c > .bus_generic_new_pass+0x114 > .bus_generic_new_pass+0x114 (yep: listed twice) > .bus_set_pass+0xc0 > .root_bus_configure+0x14 > .mi_startup+0x10c > btext+0xbc > > %r1 before: 0xc30400 ofwstk+0xfe0 > %r1 after: 0xd23450 > %r3 after: 0xd23450 > FreeBSD msr to restore: 0x9000000000001032 > ofmsr[0] to restore: 0x1000000000003030 > > The same after-openfirmware %r1 and %r3 values that had been showing up f= or the before-copyright examples of ofwcall failures. > > And note that it again was a peer request. All the ofwcall-tied boot-fail= ures have been for peer requests as far as I remember. > > I later did some experiments where I had it report but not stop when the = after-value was different from the before-value for %r1. When this happened= for these types of tests it seem to be an isolated example: later calls no= rmally have the stack pointer value still in %r1 after openfirmware returns= . In more detail: At most one report was made for such a boot, the rest of = the boot went fine. (Of course to get that far my hacked ofwcall code avoid= s using the after-openfirmware %r1 value to extract the 3 saved values to b= e restored from the bottom of ofwstk.) > > > > I was not successful at using "capture on" in DDB for this early-boot con= text. (It hangs things after the first report.) So I've been limited to one= screen's report and only when I have it stop at the end of the report (so = it does not scroll away). (No input to DDB available that early.) Otherwise= the information just scrolls by rather quickly for reading any detail. 
Sti= ll it was useful to see that other reports were not produced after the firs= t (when there was a first). (I can not claim multiple are impossible. It ju= st appears at least infrequent.) > > I have not yet investigated making analogous powerpc/GENERIC code and bui= lds. > > Nor have I dealt with having it report more detail about the peer request= s that fail. > > Nor have I seen examples of what "not failing/%r1-unchanged" looks like o= verall. > > I still have no examples of unstable/incomplete initialization(s) or race= condition(s) to explain why both ways can and do occur from one attempt to= the next --or that difference peer requests in the sequence can be where t= he problem happens. > > =3D=3D=3D > Mark Millard > markmi@dsl-only.net > > On Oct 13, 2014, at 3:39 AM, Mark Millard wrote: > > While I do not yet have "show register" or other information displayed wh= en %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have now had= two cases happen for the same, unmodified boot SSD in the same PowerMac G5= : > > A) Boots without failure or finding any changes to %r1 for before vs. aft= er openfirmware calls. > > B) I had it stop the boot after the code finds that %r1 had instead chang= ed. The usual before-copyright-notice sort of timing for where it stopped, = after pmap_bootstrapped became true. (I need "show register" or other such = to have more detail.) > > > I still have no examples of unstable/incomplete initialization(s) or race= condition(s) to explain why both ways can and do occur from one attempt to= the next. Both both do. > > > > =3D=3D=3D > Mark Millard > markmi at dsl-only.net > > On Oct 12, 2014, at 11:34 PM, Mark Millard wrote= : > > Fixing stupid typos that reverse what I should have said: removing the !'= s in front of pmap_bootstrapped (from a copy/paste sequence error)... 
> > Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the= G5's except for when pmap_bootstrapped in (variants of) powerpc64/GENERIC6= 4. (Only covers when I had enough debug context in place to know that much.= Similarly for other notes.) These ofwcall related failures are the vast ma= jority of the boot failures that I've seen. > > ... > > The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with pmap_bootstrapped had already happened (as repor= ted by the ofwcall history list in my debug/DDB hacks). But this was before= the %r1 before and after code was in place: that is a recent addition to m= y investigation. > > > > > =3D=3D=3D > Mark Millard > markmi at dsl-only.net > > On Oct 12, 2014, at 11:20 PM, Mark Millard wrote= : > > Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the= G5's except for when !pmap_bootstrapped in (variants of) powerpc64/GENERIC= 64. (Only covers when I had enough debug context in place to know that much= . Similarly for other notes.) These ofwcall related failures are the vast m= ajority of the boot failures that I've seen. > > A nice thing about what I've found is that I can now figure out how to us= e a comparison of the before and after stack pointers and to force DDB's in= volvement if and only if they are not equal. > > That would also report %r1 differences that happen to not to produce fail= ures (if there are such). (There has to be some explanation for why sometim= es it works and sometimes it does not, say, unstable initializations, race = conditions, or something meeting both criteria.) > > Which in turn makes the general technique appropriate to powerpc/GENERIC = contexts as well. (Coding details may vary.) > > I can not promise how quickly I'll get to any specific part of this. But = I should gradually progress on it. 
> > > I should have mentioned some things about the kind of evidence I have vs.= do not (yet) have: > > > A) The property defining the only context where I have observed the %r1 i= ssue is as noted above. > > In all but one of the ofwcall failure cases it was the first ofwcall in t= hat !pmap_bootstrapped context that had the problem. > > The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with !pmap_bootstrapped had already happened (as repo= rted by the ofwcall history list in my debug/DDB hacks). But this was befor= e the %r1 before and after code was in place: that is a recent addition to = my investigation. > > > B) While I've not been building debug code variants for powerpc/GENERIC I= 've never seen the powerpc/GENERIC code fail to boot the G5's. And I have s= pent some sessions doing reboot after reboot to see if I'd get some failure= s (in addition to some other more normal uses). > > > C) So far I've only been looking at "show registers" when it gets a boot-= time exception that a DDB processes with the automatic script: the crashes.= I do not (yet) have any observations of what things look like during such = points for successful boots. (I'm figuring out ways to get and see the evid= ence spanning early boot time as I go.) And so I've only been looking with = such special debug code where I knew I could reproduce the failures (3 Powe= rMac G5's when using variants of powerpc64/GENERIC64.) > > In fact if the hack that I put in place completely masks the problem then= I currently would not ever observe any problem-specific information from t= he successful boots. Thus the before/after comparison would seem to be next= for my investigation. > > > > =3D=3D=3D > Mark Millard > markmi at dsl-only.net > > On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn = wrote: > > Interesting. If OF is changing the value of r1, there must be some proble= m with the ABI thunk the 64-bit kernel uses or a problem with trap handlers= . 
This is obviously not systematic if loader and the kernel up to that poin= t have no problems. Does a 32-bit kernel have the same problems on your har= dware? That would test whether it is the ABI translation. > -Nathan > > On 10/12/14 17:53, Mark Millard wrote: >> NOTE: I make no claim that any of the below hacks for ofwcall are approp= riate code for FreeBSD's general context. I only claim that it seems to mak= e the specific PowerMac G5 problem go away, gives solid evidence for at lea= st some of what is going on (justifying the investigative and testing hacks= ) and so gives evidence for an appropriate, more general FreeBSD solution. >> >> >> The big issue is: The PowerMac G5 openfirmware does not always preserve = the %r1 value (the stack pointer contents) that it is initially given, at l= east when the early "before copyright" crash problem is happening but possi= bly other times as well. >> >> I had the following investigative code in ofwcall, snapshotting the valu= e of %r1 before and after openfirmware's code is used: >> >> lis %r4,openfirmware_entry@ha >> ld %r4,openfirmware_entry@l(%r4) >> ... >> mr %r17,%r1 /* ADDED HACK TO RECORD %r1 before... >> /* Finally, branch to OF */ >> mtctr %r4 >> bctrl >> mr %r18,%r1 /* ADDED HACK TO RECORD %r1 after... >> >> then the DDB show registers from the crash that I'd hacked in would show= these values instead of the zeros they otherwise always display, in additi= on to what the show registers has always shown for r1. >> >> The results were like the following example for every such crash: >> >> r17 =3D 0xC31400 ofwstk+0xfe0 >> r18 =3D 0xd24450 >> r1 =3D 0xd24450 >> >> Because of that %r1 value the later code such as: >> >> /* Reload stack pointer and MSR from the OFW stack */ >> ld %r6,24(%r1) >> ld %r2,16(%r1) >> ld %r1,8(%r1) >> >> gets garbage-in/garbage-out results, including %r6 being values like 0xb= c0568 instead of the value saved msr to later be restored: 0x90000000000010= 32. 
>>
>> So one PowerMac G5 specific hack involved in my working-boots context is to force the original %r1 value to be used (based on %r17 being a before-call copy, similar to the above):
>>
>> ld %r6,24(%r17)
>> ld %r2,16(%r17)
>> ld %r1,8(%r17)
>>
>> But the exception report from DDB has had problems in part because sprg0 still has the openfirmware value at the time even though the exception is after openfirmware returned (the wrong value results in the register for GET_CPUINFO(). So I hacked in a before-exception restore of FreeBSD's sprg0 inside ofwcall to make the exception handler code have that much FreeBSD context available at the exception (if it occurs, anyway). This was really just to help with information gathering, although I've not tested only having the %r17 changes.
>>
>> So overall PowerMac G5 specific hacking the ofwcall code to have instead (based on what was reported above):
>>
>> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S
>> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S
>> ===================================================================
>> --- /usr/src/sys/powerpc/ofw/ofwcall64.S (revision 272558)
>> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S (working copy)
>> @@ -52,6 +52,12 @@
>> GLOBAL(rtas_entry)
>> .llong 0 /* RTAS entry point */
>> + /* HACK: part of having sprg0 in place for trap */
>> +ofwsprg0save:
>> + .space 8 /* sizeof(register_t) */
>> +GLOBAL(ofw_sprg0_save)
>> + .llong 0
>> +
>> /*
>> * Open Firmware Real-mode Entry Point. This is a huge pain.
>> */
>> @@ -97,6 +103,10 @@
>> lis %r4,openfirmware_entry@ha
>> ld %r4,openfirmware_entry@l(%r4)
>> + /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
>> + lis %r14,ofw_sprg0_save@ha
>> + ld %r14,ofw_sprg0_save@l(%r14)
>> +
>> /*
>> * Set the MSR to the OF value.
This has the side effect of disabling
>> * exceptions, which is important for the next few steps.
>> @@ -123,14 +133,27 @@
>> stw %r5,4(%r1)
>> stw %r5,0(%r1)
>> + /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
>> + lis %r6,ofwsprg0save@ha
>> + std %r14,ofwsprg0save@l(%r6)
>> +
>> + /* HACK: part of IGNORING the later %r1 value from openfirmware */
>> + mr %r17,%r1
>> +
>> /* Finally, branch to OF */
>> mtctr %r4
>> bctrl
>> + /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
>> + lis %r6,ofwsprg0save@ha
>> + ld %r6,ofwsprg0save@l(%r6)
>> + mtsprg0 %r6
>> +
>> /* Reload stack pointer and MSR from the OFW stack */
>> - ld %r6,24(%r1)
>> - ld %r2,16(%r1)
>> - ld %r1,8(%r1)
>> + /* HACKED to ignore the %r1 value that results from openfirmware's call */
>> + ld %r6,24(%r17)
>> + ld %r2,16(%r17)
>> + ld %r1,8(%r17)
>> /* Now set the real MSR */
>> mtmsrd %r6
>>
>> This results in no crashes happening so far in my testing, not even the 16 GByte RAM machine that crashed so much.
>>
>> NOTE: owf_machdep.c was changed to use "extern register_t ofw_sprg0_save;" to match the above.
>>
>> I still have ps3 disabled in GENERIC64 so that I can also have the sc options in GENERIC64. And the DDB and GDB options are still present as well.
>>
>> And I still have my hack to force a DDB script that does show registers and shows the ofwcall history information that I hacked in, even for the very early crashes before input is possible. Not that I'm now getting such executions of the script. (A before possible-crash backtrace is also shown by the added code. That still shows up.)
>>
>> I'll probably next switch to reverting the DDB related code changes and to removing the DDB/GDB options and see how that goes.
>>
>>
>> ===
>> Mark Millard
>> markmi at dsl-only.net
>>
>>
>
>
>
>
>
From owner-freebsd-ppc@FreeBSD.ORG Tue Oct 14 17:18:16 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 80F4E45F for ; Tue, 14 Oct 2014 17:18:16 +0000 (UTC) Received: from d.mail.sonic.net (d.mail.sonic.net [64.142.111.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 5F18FA2E for ; Tue, 14 Oct 2014 17:18:16 +0000 (UTC) Received: from aurora.physics.berkeley.edu (aurora.Physics.Berkeley.EDU [128.32.117.67]) (authenticated bits=0) by d.mail.sonic.net (8.14.9/8.14.9) with ESMTP id s9EHI533001596 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NOT); Tue, 14 Oct 2014 10:18:05 -0700 Message-ID: <543D5ACD.20901@freebsd.org> Date: Tue, 14 Oct 2014 10:18:05 -0700 From: Nathan Whitehorn User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:31.0) Gecko/20100101 Thunderbird/31.1.0 MIME-Version: 1.0 To: Justin Hibbits , Mark Millard Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed] References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Sonic-CAuth: UmFuZG9tSVb7HpbyZUpWplY3E+RjckjcUutTBCd2jqceh4xADO7lx/BBRov3PYeiGJ3NGszno09EctxUF74yTYShThhpvLfp7OB3b21s2ag= X-Sonic-ID: C;8gZwD8ZT5BG2o3yTE+W37Q== M;7IGgD8ZT5BG2o3yTE+W37Q== X-Spam-Flag: No X-Sonic-Spam-Details: 0.0/5.0 by cerberusd Cc: FreeBSD PowerPC ML X-BeenThere: freebsd-ppc@freebsd.org
X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Oct 2014 17:18:16 -0000 r1 *must be* preserved by the standard and for anything to work. It's being corrupted somehow (Mark's comment about r3 is illuminating), and if r1 is being corrupted, you can't rely on anything. I suspect it might be an exception handling issue since it's non-deterministic, but it's hard to tell. It could also be triggered by the way we've set up the OF stack frame. It would be good to check if that makes sense. -Nathan On 10/14/14 09:53, Justin Hibbits wrote: > Interesting. Perhaps, instead of using %r1, and relying purely on the > stack we use yet another (non-volatile) register to hold the MSR. > Once we reload the MSR we can get back the saved registers, because > the stack will be valid again. > > Nathan, thoughts? > > - Justin > > On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard wrote: >> Additional notes from additional experiments... (So far from one G5.) >> >> I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently). >> >> An interesting property resulted: every time %r1 had changed from having the before-value (stack pointer value) %r1 instead ended up with a value equal to what openfirmware put in %r3. >> >> And more then that: For builds with the same ofwstk position the %r3 value involved was fixed for the failures, for example when 0x30400=ofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as 0xd23450 for the failures. 
When 0x31400=ofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd24450 instead. Yep: offset by the same amount as ofwstk. >> >> And I got one example where the openfirmware %r1-value-change failure was instead much later in the boot, well after pmap_bootstrapped went true: It was just after the message lines... >> >> vgapci0: Boot video device ... >> pcib1: ... >> >> with back trace (from OF_peer down): >> >> .OF_peer+0x8c >> .cpcht_attach+0x884 >> .device_attach+0x3ac >> .device_probe_and_attach+0x3c >> .bus_generic_new_pass+0x12c >> .bus_generic_new_pass+0x114 >> .bus_generic_new_pass+0x114 (yep: listed twice) >> .bus_set_pass+0xc0 >> .root_bus_configure+0x14 >> .mi_startup+0x10c >> btext+0xbc >> >> %r1 before: 0xc30400 ofwstk+0xfe0 >> %r1 after: 0xd23450 >> %r3 after: 0xd23450 >> FreeBSD msr to restore: 0x9000000000001032 >> ofmsr[0] to restore: 0x1000000000003030 >> >> The same after-openfirmware %r1 and %r3 values that had been showing up for the before-copyright examples of ofwcall failures. >> >> And note that it again was a peer request. All the ofwcall-tied boot-failures have been for peer requests as far as I remember. >> >> I later did some experiments where I had it report but not stop when the after-value was different from the before-value for %r1. When this happened for these types of tests it seem to be an isolated example: later calls normally have the stack pointer value still in %r1 after openfirmware returns. In more detail: At most one report was made for such a boot, the rest of the boot went fine. (Of course to get that far my hacked ofwcall code avoids using the after-openfirmware %r1 value to extract the 3 saved values to be restored from the bottom of ofwstk.) >> >> >> >> I was not successful at using "capture on" in DDB for this early-boot context. (It hangs things after the first report.) So I've been limited to one screen's report and only when I have it stop at the end of the report (so it does not scroll away). 
(No input to DDB available that early.) Otherwise the information just scrolls by rather quickly for reading any detail. Still it was useful to see that other reports were not produced after the first (when there was a first). (I can not claim multiple are impossible. It just appears at least infrequent.) >> >> I have not yet investigated making analogous powerpc/GENERIC code and builds. >> >> Nor have I dealt with having it report more detail about the peer requests that fail. >> >> Nor have I seen examples of what "not failing/%r1-unchanged" looks like overall. >> >> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next --or that difference peer requests in the sequence can be where the problem happens. >> >> === >> Mark Millard >> markmi@dsl-only.net >> >> On Oct 13, 2014, at 3:39 AM, Mark Millard wrote: >> >> While I do not yet have "show register" or other information displayed when %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have now had two cases happen for the same, unmodified boot SSD in the same PowerMac G5: >> >> A) Boots without failure or finding any changes to %r1 for before vs. after openfirmware calls. >> >> B) I had it stop the boot after the code finds that %r1 had instead changed. The usual before-copyright-notice sort of timing for where it stopped, after pmap_bootstrapped became true. (I need "show register" or other such to have more detail.) >> >> >> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next. Both both do. >> >> >> >> === >> Mark Millard >> markmi at dsl-only.net >> >> On Oct 12, 2014, at 11:34 PM, Mark Millard wrote: >> >> Fixing stupid typos that reverse what I should have said: removing the !'s in front of pmap_bootstrapped (from a copy/paste sequence error)... 
>> >> Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the G5's except for when pmap_bootstrapped in (variants of) powerpc64/GENERIC64. (Only covers when I had enough debug context in place to know that much. Similarly for other notes.) These ofwcall related failures are the vast majority of the boot failures that I've seen. >> >> ... >> >> The only other ofwcall failure that I've seen happened only once and was where prior ofwcall's with pmap_bootstrapped had already happened (as reported by the ofwcall history list in my debug/DDB hacks). But this was before the %r1 before and after code was in place: that is a recent addition to my investigation. >> >> >> >> >> === >> Mark Millard >> markmi at dsl-only.net >> >> On Oct 12, 2014, at 11:20 PM, Mark Millard wrote: >> >> Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the G5's except for when !pmap_bootstrapped in (variants of) powerpc64/GENERIC64. (Only covers when I had enough debug context in place to know that much. Similarly for other notes.) These ofwcall related failures are the vast majority of the boot failures that I've seen. >> >> A nice thing about what I've found is that I can now figure out how to use a comparison of the before and after stack pointers and to force DDB's involvement if and only if they are not equal. >> >> That would also report %r1 differences that happen to not to produce failures (if there are such). (There has to be some explanation for why sometimes it works and sometimes it does not, say, unstable initializations, race conditions, or something meeting both criteria.) >> >> Which in turn makes the general technique appropriate to powerpc/GENERIC contexts as well. (Coding details may vary.) >> >> I can not promise how quickly I'll get to any specific part of this. But I should gradually progress on it. >> >> >> I should have mentioned some things about the kind of evidence I have vs. 
do not (yet) have: >> >> >> A) The property defining the only context where I have observed the %r1 issue is as noted above. >> >> In all but one of the ofwcall failure cases it was the first ofwcall in that !pmap_bootstrapped context that had the problem. >> >> The only other ofwcall failure that I've seen happened only once and was where prior ofwcall's with !pmap_bootstrapped had already happened (as reported by the ofwcall history list in my debug/DDB hacks). But this was before the %r1 before and after code was in place: that is a recent addition to my investigation. >> >> >> B) While I've not been building debug code variants for powerpc/GENERIC I've never seen the powerpc/GENERIC code fail to boot the G5's. And I have spent some sessions doing reboot after reboot to see if I'd get some failures (in addition to some other more normal uses). >> >> >> C) So far I've only been looking at "show registers" when it gets a boot-time exception that a DDB processes with the automatic script: the crashes. I do not (yet) have any observations of what things look like during such points for successful boots. (I'm figuring out ways to get and see the evidence spanning early boot time as I go.) And so I've only been looking with such special debug code where I knew I could reproduce the failures (3 PowerMac G5's when using variants of powerpc64/GENERIC64.) >> >> In fact if the hack that I put in place completely masks the problem then I currently would not ever observe any problem-specific information from the successful boots. Thus the before/after comparison would seem to be next for my investigation. >> >> >> >> === >> Mark Millard >> markmi at dsl-only.net >> >> On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn wrote: >> >> Interesting. If OF is changing the value of r1, there must be some problem with the ABI thunk the 64-bit kernel uses or a problem with trap handlers. This is obviously not systematic if loader and the kernel up to that point have no problems. 
Does a 32-bit kernel have the same problems on your hardware? That would test whether it is the ABI translation. >> -Nathan >> >> On 10/12/14 17:53, Mark Millard wrote: >>> NOTE: I make no claim that any of the below hacks for ofwcall are appropriate code for FreeBSD's general context. I only claim that it seems to make the specific PowerMac G5 problem go away, gives solid evidence for at least some of what is going on (justifying the investigative and testing hacks) and so gives evidence for an appropriate, more general FreeBSD solution. >>> >>> >>> The big issue is: The PowerMac G5 openfirmware does not always preserve the %r1 value (the stack pointer contents) that it is initially given, at least when the early "before copyright" crash problem is happening but possibly other times as well. >>> >>> I had the following investigative code in ofwcall, snapshotting the value of %r1 before and after openfirmware's code is used: >>> >>> lis %r4,openfirmware_entry@ha >>> ld %r4,openfirmware_entry@l(%r4) >>> ... >>> mr %r17,%r1 /* ADDED HACK TO RECORD %r1 before... >>> /* Finally, branch to OF */ >>> mtctr %r4 >>> bctrl >>> mr %r18,%r1 /* ADDED HACK TO RECORD %r1 after... >>> >>> then the DDB show registers from the crash that I'd hacked in would show these values instead of the zeros they otherwise always display, in addition to what the show registers has always shown for r1. >>> >>> The results were like the following example for every such crash: >>> >>> r17 = 0xC31400 ofwstk+0xfe0 >>> r18 = 0xd24450 >>> r1 = 0xd24450 >>> >>> Because of that %r1 value the later code such as: >>> >>> /* Reload stack pointer and MSR from the OFW stack */ >>> ld %r6,24(%r1) >>> ld %r2,16(%r1) >>> ld %r1,8(%r1) >>> >>> gets garbage-in/garbage-out results, including %r6 being values like 0xbc0568 instead of the value saved msr to later be restored: 0x9000000000001032. 
>>> >>> So one PowerMac G5 specific hack involved in my working-boots context is to force the original %r1 value to be used (based on %r17 being a before-call copy, similar to the above): >>> >>> ld %r6,24(%r17) >>> ld %r2,16(%r17) >>> ld %r1,8(%r17) >>> >>> But the exception report from DDB has had problems in part because sprg0 still has the openfirmware value at the time even though the exception is after openfirmware returned (the wrong value results in the register for GET_CPUINFO(). So I hacked in a before-exception restore of FreeBSD's sprg0 inside ofwcall to make the exception handler code have that much FreeBSD context available at the exception (if it occurs, anyway). This was really just to help with information gathering, although I've not tested only having the %r17 changes. >>> >>> So overall PowerMac G5 specific hacking the ofwcall code to have instead (based on what was reported above): >>> >>> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S >>> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S >>> =================================================================== >>> --- /usr/src/sys/powerpc/ofw/ofwcall64.S (revision 272558) >>> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S (working copy) >>> @@ -52,6 +52,12 @@ >>> GLOBAL(rtas_entry) >>> .llong 0 /* RTAS entry point */ >>> + /* HACK: part of having sprg0 in place for trap */ >>> +ofwsprg0save: >>> + .space 8 /* sizeof(register_t) */ >>> +GLOBAL(ofw_sprg0_save) >>> + .llong 0 >>> + >>> /* >>> * Open Firmware Real-mode Entry Point. This is a huge pain. >>> */ >>> @@ -97,6 +103,10 @@ >>> lis %r4,openfirmware_entry@ha >>> ld %r4,openfirmware_entry@l(%r4) >>> + /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */ >>> + lis %r14,ofw_sprg0_save@ha >>> + ld %r14,ofw_sprg0_save@l(%r14) >>> + >>> /* >>> * Set the MSR to the OF value. This has the side effect of disabling >>> * exceptions, which is important for the next few steps. 
>>> @@ -123,14 +133,27 @@ >>> stw %r5,4(%r1) >>> stw %r5,0(%r1) >>> + /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */ >>> + lis %r6,ofwsprg0save@ha >>> + std %r14,ofwsprg0save@l(%r6) >>> + >>> + /* HACK: part of IGNORING the later %r1 value from openfirmware */ >>> + mr %r17,%r1 >>> + >>> /* Finally, branch to OF */ >>> mtctr %r4 >>> bctrl >>> + /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */ >>> + lis %r6,ofwsprg0save@ha >>> + ld %r6,ofwsprg0save@l(%r6) >>> + mtsprg0 %r6 >>> + >>> /* Reload stack pointer and MSR from the OFW stack */ >>> - ld %r6,24(%r1) >>> - ld %r2,16(%r1) >>> - ld %r1,8(%r1) >>> + /* HACKED to ignore the %r1 value that results from openfirmware's call */ >>> + ld %r6,24(%r17) >>> + ld %r2,16(%r17) >>> + ld %r1,8(%r17) >>> /* Now set the real MSR */ >>> mtmsrd %r6 >>> >>> This results in no crashes happening so far in my testing, not even the 16 GByte RAM machine that crashed so much. >>> >>> NOTE: owf_machdep.c was changed to use "extern register_t ofw_sprg0_save;" to match the above. >>> >>> I still have ps3 disabled in GENERIC64 so that I can also have the sc options in GENERIC64. And the DDB and GDB options are still present as well. >>> >>> And I still have my hack to force a DDB script that does show registers and shows the ofwcall history information that I hacked in, even for the very early crashes before input is possible. Not that I'm now getting such executions of the script. (A before possible-crash backtrace is also shown by the added code. That still shows up.) >>> >>> I'll probably next switch to reverting the DDB related code changes and to removing the DDB/GDB options and see how that goes. 
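The effect of the %r17 change in the diff above can be modelled outside the kernel. The following is a minimal Python sketch, not kernel code and not something posted in the thread. The toy `memory` dictionary, the `reload_from` helper, and the saved stack-pointer and TOC values are hypothetical and exist only for the illustration. The MSR value, the two %r1 values and the 0xbc0568 garbage value are the ones reported above.

```python
# Toy model of the three loads that follow the Open Firmware call:
#   ld %r6,24(base); ld %r2,16(base); ld %r1,8(base)
# Only the two %r1 values, the MSR and the garbage value come from the
# reports in this thread. The saved stack pointer and TOC pointer are
# hypothetical placeholders.

SAVED_MSR = 0x9000000000001032  # FreeBSD MSR that should be restored
R1_BEFORE = 0xC31400            # %r1 handed to Open Firmware (ofwstk+0xfe0)
R1_AFTER = 0xD24450             # %r1 observed after a failing call

memory = {
    R1_BEFORE + 8: 0xC2E000,    # hypothetical saved kernel stack pointer
    R1_BEFORE + 16: 0xBF0000,   # hypothetical saved TOC pointer
    R1_BEFORE + 24: SAVED_MSR,  # MSR saved before the call
    R1_AFTER + 24: 0xBC0568,    # the kind of garbage reported for %r6
}


def reload_from(base, mem):
    """Return the values the three loads would fetch relative to base."""
    return {
        "msr": mem.get(base + 24, 0),
        "toc": mem.get(base + 16, 0),
        "sp": mem.get(base + 8, 0),
    }


unpatched = reload_from(R1_AFTER, memory)  # trusts %r1 as left by the call
patched = reload_from(R1_BEFORE, memory)   # uses the copy kept in %r17

print(hex(unpatched["msr"]), hex(patched["msr"]))
```

In this model the unpatched path hands a garbage value to mtmsrd, while the patched path recovers the saved MSR regardless of what the firmware left in %r1.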
>>> >>> >>> === >>> Mark Millard >>> markmi at dsl-only.net >>> >>> >> >> >> >> From owner-freebsd-ppc@FreeBSD.ORG Tue Oct 14 22:19:00 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 5381979D for ; Tue, 14 Oct 2014 22:19:00 +0000 (UTC) Received: from asp.reflexion.net (outbound-241.asp.reflexion.net [69.84.129.241]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id E9BB6E74 for ; Tue, 14 Oct 2014 22:18:59 +0000 (UTC) Received: (qmail 19155 invoked from network); 14 Oct 2014 22:18:57 -0000 Received: from unknown (HELO mail-cs-02.app.dca.reflexion.local) (10.81.19.2) by 0 (rfx-qmail) with SMTP; 14 Oct 2014 22:18:57 -0000 Received: by mail-cs-02.app.dca.reflexion.local (Reflexion email security v7.30.7) with SMTP; Tue, 14 Oct 2014 18:18:57 -0400 (EDT) Received: (qmail 31376 invoked from network); 14 Oct 2014 22:18:56 -0000 Received: from unknown (HELO iron2.pdx.net) (69.64.224.71) by 0 (rfx-qmail) with (DHE-RSA-AES256-SHA encrypted) SMTP; 14 Oct 2014 22:18:56 -0000 X-No-Relay: not in my network X-No-Relay: not in my network X-No-Relay: not in my network Received: from [192.168.1.8] (c-98-246-178-138.hsd1.or.comcast.net [98.246.178.138]) by iron2.pdx.net (Postfix) with ESMTPSA id 8E3761C4058; Tue, 14 Oct 2014 15:18:51 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed] From: Mark Millard In-Reply-To: <543D5ACD.20901@freebsd.org> Date: Tue, 14 Oct 2014 15:18:55 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: 
<3D4A76B3-431A-4C94-8747-70369A8A1764@dsl-only.net> References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> <543D5ACD.20901@freebsd.org> To: Nathan Whitehorn X-Mailer: Apple Mail (2.1878.6) Cc: Justin Hibbits , FreeBSD PowerPC ML X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Oct 2014 22:19:00 -0000

For openfirmware: is %r3 on return any more than a failed vs. not flag with a particular failed-value? Is there any way to validate that %r3 values for non-failure look reasonable vs. not looking reasonable? (For all I know %r3 could also be corrupt.)

I do not have any documentation for the PowerMac G5 openfirmware API that is in use or the associated ABI as far as I remember. I do not know if it strictly followed Darwin's/Mac OS X's ABI on PowerMac G5's vs. if there was some conversion going back and forth (as there is for FreeBSD, at least for powerpc64). For openfirmware I derive properties from what I see in FreeBSD's code (which has to be more explicit than when a compiler's code generation happens to match at least large parts of an ABI directly).

As I vaguely remember, Apple did not use the TOC for Darwin's/Mac OS X's ABI but FreeBSD does. If true I do not know what other differences there might be (even ignoring the 32 bit vs. 64 bit issues for the kernels). But the point would be an existence proof of at least one difference. My understanding is that %r1 was as in FreeBSD.

I vaguely seem to remember that for Darwin/Mac OS X some register was volatile in leaf functions but non-volatile otherwise, or at least when nested functions were involved.
And that brings to mind that the condition code sets in cr might have had a mix of volatile and non-volatile status despite being in one register? Did Darwin/Mac OS X have something special for register usage for Thread-Specific Storage? Position Independent Code? Indirect Calls? Frame Pointers?

I may have some Darwin/Mac OS X information around but I doubt that it is complete, especially for the 64-bit ABI or for privileged contexts. For the 32-bit ABI (non-privileged) I likely have the information about the above possible ABI properties.

I assume that openfirmware avoids the FPU and other such, but I do not know. But it is privileged code.

Are there any known sources of at least some of the information for the PowerMac G5 openfirmware ABI(s)? What are good references for the FreeBSD PowerPC ABI(s) (32 bit and 64 bit, privileged vs. not)?

[I cut off some of the older history.]

===
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 10:18 AM, Nathan Whitehorn wrote:

r1 *must be* preserved by the standard and for anything to work. It's being corrupted somehow (Mark's comment about r3 is illuminating), and if r1 is being corrupted, you can't rely on anything. I suspect it might be an exception handling issue since it's non-deterministic, but it's hard to tell. It could also be triggered by the way we've set up the OF stack frame. It would be good to check if that makes sense.
-Nathan

On 10/14/14 09:53, Justin Hibbits wrote:
> Interesting. Perhaps, instead of using %r1, and relying purely on the
> stack we use yet another (non-volatile) register to hold the MSR.
> Once we reload the MSR we can get back the saved registers, because
> the stack will be valid again.
>
> Nathan, thoughts?
>
> - Justin
>
> On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard wrote:
>> Additional notes from additional experiments... (So far from one G5.)
>>
>> I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently).
>>
>> An interesting property resulted: every time %r1 had changed from having the before-value (stack pointer value) %r1 instead ended up with a value equal to what openfirmware put in %r3.
>>
>> And more then that: For builds with the same ofwstk position the %r3 value involved was fixed for the failures, for example when 0x30400=ofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as 0xd23450 for the failures. When 0x31400=ofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd24450 instead. Yep: offset by the same amount as ofwstk.
>>
>> And I got one example where the openfirmware %r1-value-change failure was instead much later in the boot, well after pmap_bootstrapped went true: It was just after the message lines...
>>
>> vgapci0: Boot video device ...
>> pcib1: ...
>>
>> with back trace (from OF_peer down):
>>
>> .OF_peer+0x8c
>> .cpcht_attach+0x884
>> .device_attach+0x3ac
>> .device_probe_and_attach+0x3c
>> .bus_generic_new_pass+0x12c
>> .bus_generic_new_pass+0x114
>> .bus_generic_new_pass+0x114 (yep: listed twice)
>> .bus_set_pass+0xc0
>> .root_bus_configure+0x14
>> .mi_startup+0x10c
>> btext+0xbc
>>
>> %r1 before: 0xc30400 ofwstk+0xfe0
>> %r1 after: 0xd23450
>> %r3 after: 0xd23450
>> FreeBSD msr to restore: 0x9000000000001032
>> ofmsr[0] to restore: 0x1000000000003030
>>
>> The same after-openfirmware %r1 and %r3 values that had been showing up for the before-copyright examples of ofwcall failures.
>>
>> And note that it again was a peer request.
All the ofwcall-tied boot-failures have been for peer requests as far as I remember.
>>
>> I later did some experiments where I had it report but not stop when the after-value was different from the before-value for %r1. When this happened for these types of tests it seem to be an isolated example: later calls normally have the stack pointer value still in %r1 after openfirmware returns. In more detail: At most one report was made for such a boot, the rest of the boot went fine. (Of course to get that far my hacked ofwcall code avoids using the after-openfirmware %r1 value to extract the 3 saved values to be restored from the bottom of ofwstk.)
>>
>>
>>
>> I was not successful at using "capture on" in DDB for this early-boot context. (It hangs things after the first report.) So I've been limited to one screen's report and only when I have it stop at the end of the report (so it does not scroll away). (No input to DDB available that early.) Otherwise the information just scrolls by rather quickly for reading any detail. Still it was useful to see that other reports were not produced after the first (when there was a first). (I can not claim multiple are impossible. It just appears at least infrequent.)
>>
>> I have not yet investigated making analogous powerpc/GENERIC code and builds.
>>
>> Nor have I dealt with having it report more detail about the peer requests that fail.
>>
>> Nor have I seen examples of what "not failing/%r1-unchanged" looks like overall.
>>
>> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next --or that difference peer requests in the sequence can be where the problem happens.
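The "offset by the same amount as ofwstk" observation quoted above can be checked with a few lines of arithmetic. This is a minimal Python sketch, not code from the thread. The record layout and helper names are hypothetical. The sketch also assumes that the prose values 0x30400 and 0x31400 abbreviate the 0xc30400 and 0xC31400 shown in the register reports, so the fuller values are used.

```python
# Register values copied from the failure reports quoted in this thread.
# One record per kernel build; both describe a failing ofwcall.
FAILURES = [
    {"r1_before": 0xC30400, "r1_after": 0xD23450, "r3_after": 0xD23450},
    {"r1_before": 0xC31400, "r1_after": 0xD24450, "r3_after": 0xD24450},
]

OFWSTK_OFFSET = 0xFE0  # both reports give %r1 before the call as ofwstk+0xfe0


def r1_clobbered(rec):
    """True when Open Firmware returned with a different stack pointer."""
    return rec["r1_after"] != rec["r1_before"]


def r1_equals_r3(rec):
    """True when the bad %r1 is the same value that came back in %r3."""
    return rec["r1_after"] == rec["r3_after"]


def distance_from_ofwstk(rec):
    """Distance of the bad %r1 from the start of ofwstk in that build."""
    return rec["r1_after"] - (rec["r1_before"] - OFWSTK_OFFSET)


first, second = FAILURES
stack_shift = second["r1_before"] - first["r1_before"]
value_shift = second["r1_after"] - first["r1_after"]

for rec in FAILURES:
    print(hex(rec["r1_before"]), r1_clobbered(rec), r1_equals_r3(rec),
          hex(distance_from_ofwstk(rec)))
print(hex(stack_shift), hex(value_shift))
```

With these numbers both shifts are 0x1000, so the bad value sits at a fixed distance from ofwstk in both builds. That fits a value derived from an address near the kernel's own data rather than a random one, although two builds are too few to prove it.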
From owner-freebsd-ppc@FreeBSD.ORG Wed Oct 15 06:30:34 2014
From: Mark Millard
Date: Tue, 14 Oct 2014 23:30:28 -0700
Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [%r3 corrupted too]
To: Nathan Whitehorn
Cc: Justin Hibbits, FreeBSD PowerPC ML
Message-Id: <0F85ACBD-F6D6-4ABA-B8FA-00C586A086DE@dsl-only.net>
In-Reply-To: <3D4A76B3-431A-4C94-8747-70369A8A1764@dsl-only.net>

I added the after-ofwcall %r1 and %r3 values to my ofwcall history buffer that I have ddb report when there is a problem.

This makes it apparent that %r3 has also been corrupted when %r1 has been.

I say that because the usual/normal %r3 value is 0 in what the code records and reports and I gather from the FreeBSD source code that the error indicator is -1. But all along I've been reporting %r3 values for the crashes that look more like 0xd18868 or other such. Never a 0 or -1 (0xfff...). And the %r3 crash values even move around when the ofwstk changes place from build to build.

(This "usual"/"error-check" mix suggests %r3 from openfirmware is a multi-bit representation of a Boolean value, with one's complemented alternative values and zero as one of the two bit patterns --when %r3 is not corrupted.)

I also got an example of a somewhat later than normal ofwcall failure: about 23 ofwcalls later than normal. It was not a peer request:

...
OF_finddevice+0x90
powermac_smp_get_bsp+0x20
platform_smp_get_bsp+0x78
cpu_mp_start+0x24
mp_startup+0x7c
mi_startup+0x10c
btext+0xbc

So pmap_bootstrapped had been true for a while by this point. Available memory had been displayed as of when this example stopped to report the %r1 change.
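As a minimal sketch of the 0 / -1 / anything-else distinction described above (not code from the actual ofwcall changes): the function name classify_r3 is hypothetical, and accepting both the 32-bit and the 64-bit all-ones pattern as "-1" is an assumption, since the message only shows "0xfff...".

```python
MASK64 = (1 << 64) - 1

# Assumption: -1 may show up either as a 32-bit or a 64-bit all-ones pattern.
ERROR_PATTERNS = {0xFFFFFFFF, MASK64}


def classify_r3(r3):
    """Classify a %r3 value observed after openfirmware returns."""
    r3 &= MASK64
    if r3 == 0:
        return "usual"      # the normal value per the message
    if r3 in ERROR_PATTERNS:
        return "error"      # the -1 error indicator
    return "suspect"        # neither 0 nor -1: possibly corrupted


for value in (0x0, -1, 0xd18868, 0xd23450):
    print(f"{value & MASK64:#x} -> {classify_r3(value)}")
```

Under these assumptions the crash values quoted in this thread (0xd18868, 0xd23450) land in the "suspect" bucket, matching the observation that they are never 0 or -1.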
===
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 3:18 PM, Mark Millard wrote:

For openfirmware: is %r3 on return any more than a failed vs. not flag with a particular failed-value? Is there any way to validate that %r3 values for non-failure look reasonable vs. not looking reasonable? (For all I know %r3 could also be corrupt.)

I do not have any documentation for the PowerMac G5 openfirmware API that is in use or the associated ABI as far as I remember. I do not know if it strictly followed Darwin's/Mac OS X's ABI on PowerMac G5's vs. if there was some conversion going back and forth (as there is for FreeBSD, at least for powerpc64). For openfirmware I derive properties from what I see in FreeBSD's code (which has to be more explicit than when a compiler's code generation happens to match at least large parts of an ABI directly).

As I vaguely remember, Apple did not use the TOC for Darwin's/Mac OS X's ABI but FreeBSD does. If true I do not know what other differences there might be (even ignoring the 32 bit vs. 64 bit issues for the kernels). But the point would be an existence proof of at least one difference. My understanding is that %r1 was as in FreeBSD.

I vaguely seem to remember that for Darwin/Mac OS X some register was volatile in leaf functions but non-volatile otherwise, or at least when nested functions were involved. And that brings to mind that the condition code sets in cr might have had a mix of volatile and non-volatile status despite being in one register? Did Darwin/Mac OS X have something special for register usage for Thread-Specific Storage? Position Independent Code? Indirect Calls? Frame Pointers?

I may have some Darwin/Mac OS X information around but I doubt that it is complete, especially for the 64-bit ABI or for privileged contexts. For the 32-bit ABI (non-privileged) I likely have the information about the above possible ABI properties.
I assume that openfirmware avoids the FPU and other such --but I do not know. But it is privileged code.

Are there any known sources of at least some of the information for the PowerMac G5 openfirmware ABI(s)? What are good references for the FreeBSD PowerPC ABI(s) (32 bit and 64 bit, privileged vs. not)?

[I cut off some of the older history.]

===
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 10:18 AM, Nathan Whitehorn wrote:

r1 *must be* preserved by the standard and for anything to work. It's being corrupted somehow (Mark's comment about r3 is illuminating), and if r1 is being corrupted, you can't rely on anything. I suspect it might be an exception handling issue since it's non-deterministic, but it's hard to tell. It could also be triggered by the way we've set up the OF stack frame. It would be good to check if that makes sense.
-Nathan

On 10/14/14 09:53, Justin Hibbits wrote:
> Interesting. Perhaps, instead of using %r1, and relying purely on the
> stack we use yet another (non-volatile) register to hold the MSR.
> Once we reload the MSR we can get back the saved registers, because
> the stack will be valid again.
>
> Nathan, thoughts?
>
> - Justin
>
> On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard wrote:
>> Additional notes from additional experiments... (So far from one G5.)
>>
>> I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently).
>>
>> An interesting property resulted: every time %r1 had changed from having the before-value (stack pointer value) %r1 instead ended up with a value equal to what openfirmware put in %r3.
>>
>> And more than that: For builds with the same ofwstk position the %r3 value involved was fixed for the failures, for example when 0x30400=ofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 ended up as 0xd23450 for the failures. When 0x31400=ofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd24450 instead. Yep: offset by the same amount as ofwstk.
>>
>> And I got one example where the openfirmware %r1-value-change failure was instead much later in the boot, well after pmap_bootstrapped went true: It was just after the message lines...
>>
>> vgapci0: Boot video device ...
>> pcib1: ...
>>
>> with back trace (from OF_peer down):
>>
>> .OF_peer+0x8c
>> .cpcht_attach+0x884
>> .device_attach+0x3ac
>> .device_probe_and_attach+0x3c
>> .bus_generic_new_pass+0x12c
>> .bus_generic_new_pass+0x114
>> .bus_generic_new_pass+0x114 (yep: listed twice)
>> .bus_set_pass+0xc0
>> .root_bus_configure+0x14
>> .mi_startup+0x10c
>> btext+0xbc
>>
>> %r1 before: 0xc30400 ofwstk+0xfe0
>> %r1 after: 0xd23450
>> %r3 after: 0xd23450
>> FreeBSD msr to restore: 0x9000000000001032
>> ofmsr[0] to restore: 0x1000000000003030
>>
>> The same after-openfirmware %r1 and %r3 values that had been showing up for the before-copyright examples of ofwcall failures.
>>
>> And note that it again was a peer request. All the ofwcall-tied boot-failures have been for peer requests as far as I remember.
>>
>> I later did some experiments where I had it report but not stop when the after-value was different from the before-value for %r1. When this happened for these types of tests it seemed to be an isolated example: later calls normally have the stack pointer value still in %r1 after openfirmware returns. In more detail: At most one report was made for such a boot, the rest of the boot went fine.
>> (Of course to get that far my hacked ofwcall code avoids using the after-openfirmware %r1 value to extract the 3 saved values to be restored from the bottom of ofwstk.)
>>
>> I was not successful at using "capture on" in DDB for this early-boot context. (It hangs things after the first report.) So I've been limited to one screen's report and only when I have it stop at the end of the report (so it does not scroll away). (No input to DDB available that early.) Otherwise the information just scrolls by rather quickly for reading any detail. Still it was useful to see that other reports were not produced after the first (when there was a first). (I can not claim multiple are impossible. It just appears at least infrequent.)
>>
>> I have not yet investigated making analogous powerpc/GENERIC code and builds.
>>
>> Nor have I dealt with having it report more detail about the peer requests that fail.
>>
>> Nor have I seen examples of what "not failing/%r1-unchanged" looks like overall.
>>
>> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next --or that different peer requests in the sequence can be where the problem happens.
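The "offset by the same amount as ofwstk" remark in the quoted notes above can be checked with a few lines of arithmetic. This is only a worked example using the four values reported there; the labels build_a and build_b are hypothetical names for the two kernel builds.

```python
# %r1 before the call (ofwstk+0xfe0) for two different kernel builds.
r1_before = {"build_a": 0x30400, "build_b": 0x31400}
# Corrupted %r1/%r3 value seen after openfirmware returned, per build.
r1_after = {"build_a": 0xd23450, "build_b": 0xd24450}

stack_delta = r1_before["build_b"] - r1_before["build_a"]
bad_delta = r1_after["build_b"] - r1_after["build_a"]

print(hex(stack_delta), hex(bad_delta))  # prints: 0x1000 0x1000
```

Both differences come out to 0x1000, which is what suggests the corrupted value is tied to the kernel's layout rather than being random.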
From owner-freebsd-ppc@FreeBSD.ORG Wed Oct 15 08:40:15 2014
From: Mark Millard
Date: Wed, 15 Oct 2014 01:40:09 -0700
Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [%r3 corrupted too]
To: Nathan Whitehorn
Cc: Justin Hibbits, FreeBSD PowerPC ML
In-Reply-To: <0F85ACBD-F6D6-4ABA-B8FA-00C586A086DE@dsl-only.net>

More information on the odd %r1 and %r3 value...

The current and recent kernels that I've built get 0xd23450 for the corrupted values in %r1 and %r3 after openfirmware returns.

So I decided to look up what that might be...

objdump -h /boot/kernel/kernel shows (.got: "global object table" or some such?)...

Sections:
Idx Name            Size      VMA               LMA               File off  Algn
...
 35 .got            0002f5c0  0000000000cfb248  0000000000cfb248  00bfb248  2**3
                    CONTENTS, ALLOC, LOAD, DATA
 36 .dynamic        000000d0  0000000000d2a808  0000000000d2a808  00c2a808  2**3
                    CONTENTS, ALLOC, LOAD, DATA
...

and objdump -s -j .got /boot/kernel/kernel shows...

 d23438 00000000 00bbfd48 00000000 00bbfd60  .......H.......`
 d23448 00000000 00bbfd90 00000000 00bbfdb0  ................
 d23458 00000000 00bbfdf0 00000000 00e17dd0  ..............}.

Then for 0xbbfdb0 from the above: objdump -h /boot/kernel/kernel shows...

  6 .rodata.str1.8  000834a8  0000000000b4ddf8  0000000000b4ddf8  00a4ddf8  2**3
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 set_sysinit_set 00002538  0000000000bd12a0  0000000000bd12a0  00ad12a0  2**3
                    CONTENTS, ALLOC, LOAD, READONLY, DATA

and objdump -s -j .rodata.str1.8 /boot/kernel/kernel shows...
 bbfda8 6f756e74 00000000 436f756e 74206f66  ount....Count of
 bbfdb8 2074696d 65732074 68726f74 746c696e   times throttlin
 bbfdc8 67206261 73656420 6f6e2072 65717565  g based on reque
 bbfdd8 73742073 70616365 20686173 206f6363  st space has occ
 bbfde8 75727265 64000000 25733a20 6d617374  urred...%s: mast

So 0xd23450 appears to possibly be an indirect reference to the string "Count of times throttling based on request space has occurred" or similar indirect content based on some offset from 0xd23450 indirectly getting to something else through the .got section. That string that I quoted is from /usr/src/sys/rpc/svc.c:

SVCPOOL*
svcpool_create(const char *name, struct sysctl_oid_list *sysctl_base)
{
...
		SYSCTL_ADD_INT(&pool->sp_sysctl, sysctl_base, OID_AUTO,
		    "request_space_throttle_count", CTLFLAG_RD,
		    &pool->sp_space_throttle_count, 0,
		    "Count of times throttling based on request space has occurred");
	}

	return pool;
}

(I have not done this lookup sequence across various FreeBSD updates and rebuilds that also get 0xd23450 in %r1 and %r3. Nor with FreeBSD builds that get some other corruption value. I do not know that the indirect lookup would have always gotten to that same string.)

===
Mark Millard
markmi at dsl-only.net
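The two-step lookup in this message (0xd23450 in .got, then the address stored there in .rodata.str1.8) can be replayed from the quoted objdump -s lines alone. The following is a minimal sketch, assuming big-endian 64-bit entries as on powerpc64; the helper names load, read_u64_be and read_cstring are made up for illustration and only the hex columns of the dumps are used.

```python
import struct

# Hex columns copied from the objdump -s output quoted in the message.
GOT_DUMP = """\
 d23438 00000000 00bbfd48 00000000 00bbfd60
 d23448 00000000 00bbfd90 00000000 00bbfdb0
 d23458 00000000 00bbfdf0 00000000 00e17dd0
"""
RODATA_DUMP = """\
 bbfda8 6f756e74 00000000 436f756e 74206f66
 bbfdb8 2074696d 65732074 68726f74 746c696e
 bbfdc8 67206261 73656420 6f6e2072 65717565
 bbfdd8 73742073 70616365 20686173 206f6363
 bbfde8 75727265 64000000 25733a20 6d617374
"""


def load(dump):
    """Return {address: byte} built from objdump -s style lines."""
    memory = {}
    for line in dump.splitlines():
        fields = line.split()
        base = int(fields[0], 16)
        data = bytes.fromhex("".join(fields[1:5]))
        for offset, byte in enumerate(data):
            memory[base + offset] = byte
    return memory


def read_u64_be(memory, addr):
    """Read one big-endian 64-bit value starting at addr."""
    raw = bytes(memory[addr + i] for i in range(8))
    return struct.unpack(">Q", raw)[0]


def read_cstring(memory, addr):
    """Read a NUL-terminated ASCII string starting at addr."""
    out = bytearray()
    while memory[addr] != 0:
        out.append(memory[addr])
        addr += 1
    return out.decode("ascii")


got = load(GOT_DUMP)
rodata = load(RODATA_DUMP)
target = read_u64_be(got, 0xd23450)
print(hex(target), repr(read_cstring(rodata, target)))
```

With the quoted dumps this resolves 0xd23450 to 0xbbfdb0 and then to the "Count of times throttling ..." string, the same chain followed by hand above.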
From owner-freebsd-ppc@FreeBSD.ORG Fri Oct 17 12:25:35 2014
From: Mark Millard
Date: Fri, 17 Oct 2014 05:25:25 -0700
Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [%r3 corrupted too]
To: Nathan Whitehorn
Cc: Justin Hibbits, FreeBSD PowerPC ML
Message-Id: <49920E63-CB4A-429C-AB3A-984075AE183D@dsl-only.net>

I noticed for the normal PowerMac G5 failing place that %r2 was 0xd23050 (generally in my recent builds: even when it works) and that the failing %r1 and %r3 were then 0xd23450. In other words: %r2+0x400. An accident or is that what the code is doing when it fails?

So I cleared the %r2 value in ofwcall (leaving the TOC value in the ofwstk place and using that to put the TOC back but having %r2=zero between).

Failure then produced 0x400 for %r1 and %r3: again %r2+0x400. And %r2 stayed at zero until the ofwcall code put the TOC value back.

In other words: the failing values are from adding/oring 0x400 to %r2, whatever its value.

Trying %r2=0x400 instead of zero resulted in %r1 and %r3 having 0x800: So %r2+0x400 instead of %r2 | 0x400.
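The reasoning above for preferring addition over OR can be spelled out with the three (%r2, corrupted value) pairs reported in this message. This is just an illustrative check; the variable names are not from the actual ofwcall code.

```python
OFFSET = 0x400

# (%r2 before the call, corrupted %r1/%r3 after) as reported above.
observations = [(0xd23050, 0xd23450), (0x0, 0x400), (0x400, 0x800)]

for r2, seen in observations:
    by_add = r2 + OFFSET
    by_or = r2 | OFFSET
    print(f"%r2={r2:#x}: add={by_add:#x} or={by_or:#x} seen={seen:#x}")

add_fits = all(r2 + OFFSET == seen for r2, seen in observations)
or_fits = all(r2 | OFFSET == seen for r2, seen in observations)
print("add fits all:", add_fits, "| or fits all:", or_fits)
```

The first two pairs cannot tell the operations apart because bit 0x400 is clear in both 0xd23050 and 0; only the %r2=0x400 experiment separates them (0x800 observed, where OR would give 0x400).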
After this I adjusted ofwcall to clear %r0, %r2, %r4-%r12 before the = openfirmware code executes and to leave them alone after. (Thus I = changed the register usage.) This allows (probably) seeing what other = volatile registers may have changed, especially when past different = value reports are considered/compared with the results for this context. = %r19-%r28 also set to zero. %r13-%r18,%r29 deliberately record things to = report on failure (breakpoint use). (Prior tests showed %r14-%r19 as = zero after so this tests if they are left alone vs. set to zero.) (The = %r13-%r18 values also are generally put to some normal use as well.) = %r30-%r31 are also put to various uses (not so simple to tell if = preserved). Results beyond %r1/%r3's corruption: %r4 gets the lr value used by = openfirmware in returning to ofwcall: %r4 no longer zero. Past activity = shows %r0 is actually set to zero if it is initially non-zero instead, = which not visible for how I did this test. The other registers that were = intended to be easily tested were preserved, including %r2. [This only tests/classifies the failing openfirmware call. So the normal = test here is of the first ofwcall once pmap_bootstrapped is true. It = happens to be a peer request. Other requests could be different in the = details for all I know. I've not hit any other ofwcall failure points so = far.] The above notes should apply even if the "Count of times ..." string is = an accident of %r2's normal value at the time of the failure (0xd23050) = --rather than a deliberate attempted addressing of the string's = reference in the .got. [.got: global offset table used, for example, for = position independent code access to things in memory.] =3D=3D=3D Mark Millard markmi@dsl-only.net On Oct 15, 2014, at 1:40 AM, Mark Millard wrote: More information on the odd %r1 and %r3 value... The current and recent kernels that I've built get 0xd23450 for the = corrupted values in %r1 and %r3 after openfirmware returns. 
So I decided to look up what that might be... objdump -h /boot/kernel/kernel shows (.got: "global object table" or = some such?) ... Sections: Idx Name Size VMA LMA File off = Algn ... 35 .got 0002f5c0 0000000000cfb248 0000000000cfb248 00bfb248 = 2**3 CONTENTS, ALLOC, LOAD, DATA 36 .dynamic 000000d0 0000000000d2a808 0000000000d2a808 00c2a808 = 2**3 CONTENTS, ALLOC, LOAD, DATA ... and objdump -s -j .got /boot/kernel/kernel shows... d23438 00000000 00bbfd48 00000000 00bbfd60 .......H.......` d23448 00000000 00bbfd90 00000000 00bbfdb0 ................ d23458 00000000 00bbfdf0 00000000 00e17dd0 ..............}. Then for 0xbbfdb0 from the above: objdump -h /boot/kernel/kernel = shows... 6 .rodata.str1.8 000834a8 0000000000b4ddf8 0000000000b4ddf8 00a4ddf8 = 2**3 CONTENTS, ALLOC, LOAD, READONLY, DATA 7 set_sysinit_set 00002538 0000000000bd12a0 0000000000bd12a0 = 00ad12a0 2**3 CONTENTS, ALLOC, LOAD, READONLY, DATA and objdump -s -j .rodata.str1.8 /boot/kernel/kernel shows... bbfda8 6f756e74 00000000 436f756e 74206f66 ount....Count of bbfdb8 2074696d 65732074 68726f74 746c696e times throttlin bbfdc8 67206261 73656420 6f6e2072 65717565 g based on reque bbfdd8 73742073 70616365 20686173 206f6363 st space has occ bbfde8 75727265 64000000 25733a20 6d617374 urred...%s: mast So 0xd23450 appears to possibly be a indirect reference to the string = "Count of times throttling based on request space has occurred" or = similar indirect content based on some offset from 0xd23450 indirectly = getting to something else through the .got section. That string that I = quoted is from /usr/src/sys/rpc/svc.c: SVCPOOL* svcpool_create(const char *name, struct sysctl_oid_list *sysctl_base) { ... 
SYSCTL_ADD_INT(&pool->sp_sysctl, sysctl_base, OID_AUTO, "request_space_throttle_count", CTLFLAG_RD, &pool->sp_space_throttle_count, 0, "Count of times throttling based on request space has = occurred"); } return pool; } (I have not done this lookup sequence across various FreeBSD updates and = rebuilds that also get 0xd23450 in %r1 and %r3. Nor with FreeBSD builds = that get some other corruption value. I do not know that the indirect = lookup would have always gotten to that same string.) =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 14, 2014, at 11:30 PM, Mark Millard = wrote: I added including after-ofwcall %r1 and %r3 values to my ofwcall history = buffer that I have ddb report when there is a problem. This makes it apparent that %r3 has also been corrupted when %r1 has = been. I say that because the usual/normal %r3 value is 0 in what the code = records and reports and I gather from the FreeBSD source code that the = error indicator is -1. But all along I've been reporting %r3 values for = the crashes that look more like 0xd18868 or other such. Never a 0 or -1 = (0xfff...). And the %r3 crash values even move around when the ofwstk = changes place from build to build. (This "usual"/"error-check" mix suggests %r3 from openfirmware is a = multi-bit representation of a Boolean value, with one's complemented = alternative values and zero as one of the two bit patterns --when %r3 is = not corrupted.) I also got an example of a somewhat later than normal ofwcall failure: = about 23 ofwcall's later than normal. It was not a peer request: ... OF_finddevice+0x90 powermac_smp_get_bsp+0x20 platform_smp_get_bsp+0x78 cpu_mp_start+0x24 mp_startup+0x7c mi_startup+0x10c btext_0xbc So pmap_bootstrapped had been true for a while by this point. Available = memory had been displayed as of when this example stopped to report the = %r1 change. 
=3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 14, 2014, at 3:18 PM, Mark Millard = wrote: For openfirmware: is %r3 on return any more then a failed vs. not flag = with a particular failed-value? Is there any way to validate that %r3 = values for non-failure look reasonable vs. not looking reasonable? (For = all I know %r3 could also be corrupt.) I do not have any documentation for the PowerMac G5 openfirmware API = that is in use or the associated ABI as far as I remember. I do not know = if it strictly followed Darwin's/Mac OS X's ABI on PowerMac G5's vs. if = there was some conversion going back and forth (as there is for FreeBSD, = at least for powerpc64). For openfirmware I derive properties from what = I see in FreeBSD's code (which has to be more explicit then when a = compiler's code generation happens to match at least large parts of an = ABI directly). As I vaguely-remember Apple did not use the TOC for Darwin's/Mac OS X's = ABI but FreeBSD does. If true I do not know what other differences that = there might be (even ignoring the 32 bit vs. 64 bit issues for the = kernels). But the point would be an existence proof of at least one = difference. My understanding is that %r1 was as in FreeBSD. I vaguely seem to remember that for Darwin/Mac OS X some register was = volatile in leaf functions but non-volatile otherwise, or at least when = nested functions were involved. And that brings to mind that the = condition code sets in cr might have had a mix of volatile and = non-volatile status despite being in one register? Did Darwin/Mac OS X = have something special for register usage for Thread-Specific Storage? = Position Independent Code? Indirect Calls? Frame Pointers? I may have some Darwin/Mac OS X information around but I doubt that it = is complete, especially for the 64-bit ABI or for privileged contexts. = For the 32-bit ABI (non-priviledged) I likely have the information about = the above possible ABI properties. 

I assume that openfirmware avoids the FPU and other such --but I do not know. But it is privileged code.

Are there any known sources of at least some of the information for the PowerMac G5 openfirmware ABI(s)? What are good references for the FreeBSD PowerPC ABI(s) (32 bit and 64 bit, privileged vs. not)?

[I cut off some of the older history.]

===
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 10:18 AM, Nathan Whitehorn wrote:

r1 *must be* preserved by the standard and for anything to work. It's being corrupted somehow (Mark's comment about r3 is illuminating), and if r1 is being corrupted, you can't rely on anything. I suspect it might be an exception handling issue since it's non-deterministic, but it's hard to tell. It could also be triggered by the way we've set up the OF stack frame. It would be good to check if that makes sense.
-Nathan

On 10/14/14 09:53, Justin Hibbits wrote:
> Interesting. Perhaps, instead of using %r1, and relying purely on the
> stack we use yet another (non-volatile) register to hold the MSR.
> Once we reload the MSR we can get back the saved registers, because
> the stack will be valid again.
>
> Nathan, thoughts?
>
> - Justin
>
> On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard wrote:
>> Additional notes from additional experiments... (So far from one G5.)
>>
>> I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently).
>>
>> An interesting property resulted: every time %r1 had changed from having the before-value (stack pointer value) %r1 instead ended up with a value equal to what openfirmware put in %r3.
>>
>> And more than that: For builds with the same ofwstk position the %r3 value involved was fixed for the failures. For example, when 0x30400=ofwstk+0xfe0 (%r1 before) was reported, %r3 and %r1 ended up as 0xd23450 for the failures. When 0x31400=ofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd24450 instead. Yep: offset by the same amount as ofwstk.
>>
>> And I got one example where the openfirmware %r1-value-change failure was instead much later in the boot, well after pmap_bootstrapped went true: It was just after the message lines...
>>
>> vgapci0: Boot video device ...
>> pcib1: ...
>>
>> with back trace (from OF_peer down):
>>
>> .OF_peer+0x8c
>> .cpcht_attach+0x884
>> .device_attach+0x3ac
>> .device_probe_and_attach+0x3c
>> .bus_generic_new_pass+0x12c
>> .bus_generic_new_pass+0x114
>> .bus_generic_new_pass+0x114 (yep: listed twice)
>> .bus_set_pass+0xc0
>> .root_bus_configure+0x14
>> .mi_startup+0x10c
>> btext+0xbc
>>
>> %r1 before: 0xc30400 ofwstk+0xfe0
>> %r1 after: 0xd23450
>> %r3 after: 0xd23450
>> FreeBSD msr to restore: 0x9000000000001032
>> ofmsr[0] to restore: 0x1000000000003030
>>
>> The same after-openfirmware %r1 and %r3 values that had been showing up for the before-copyright examples of ofwcall failures.
>>
>> And note that it again was a peer request. All the ofwcall-tied boot-failures have been for peer requests as far as I remember.
>>
>> I later did some experiments where I had it report but not stop when the after-value was different from the before-value for %r1. When this happened for these types of tests it seems to be an isolated example: later calls normally have the stack pointer value still in %r1 after openfirmware returns. In more detail: At most one report was made for such a boot, the rest of the boot went fine. (Of course to get that far my hacked ofwcall code avoids using the after-openfirmware %r1 value to extract the 3 saved values to be restored from the bottom of ofwstk.)
>>
>> I was not successful at using "capture on" in DDB for this early-boot context. (It hangs things after the first report.) So I've been limited to one screen's report and only when I have it stop at the end of the report (so it does not scroll away). (No input to DDB available that early.) Otherwise the information just scrolls by too quickly for reading any detail. Still it was useful to see that other reports were not produced after the first (when there was a first). (I can not claim multiple are impossible. It just appears at least infrequent.)
>>
>> I have not yet investigated making analogous powerpc/GENERIC code and builds.
>>
>> Nor have I dealt with having it report more detail about the peer requests that fail.
>>
>> Nor have I seen examples of what "not failing/%r1-unchanged" looks like overall.
>>
>> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next --or that different peer requests in the sequence can be where the problem happens.

From owner-freebsd-ppc@FreeBSD.ORG Sat Oct 18 09:36:12 2014
Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [%r3 corrupted too]
From: Mark Millard
Date: Sat, 18 Oct 2014 02:36:07 -0700
To: Nathan Whitehorn
Cc: Justin Hibbits, FreeBSD PowerPC ML
Message-Id: <0CEC8978-E208-4F57-8481-DD9C321EF673@dsl-only.net>
In-Reply-To: <49920E63-CB4A-429C-AB3A-984075AE183D@dsl-only.net>
References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net>
 <543B3828.8070806@freebsd.org>
 <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>
 <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net>
 <543D5ACD.20901@freebsd.org>
 <3D4A76B3-431A-4C94-8747-70369A8A1764@dsl-only.net>
 <0F85ACBD-F6D6-4ABA-B8FA-00C586A086DE@dsl-only.net>
 <49920E63-CB4A-429C-AB3A-984075AE183D@dsl-only.net>

I did an experiment recording interrupt numbers in with the ofwcall history that I'd hacked in and have been using. I did this addition to the history in powerpc_interrupt. I managed to see a data storage interrupt recorded in the history from a forced failure somewhat after the normal ofwcall failure place (as evidence that the mechanism was good enough for early boot time, which is all I needed or was after).

But when the %r1 != %r3 detection happens in ofwcall after openfirmware and the information was reported no interrupt was listed in the history. I suppose openfirmware could temporarily save, replace, and later restore some interrupt handlers. That would mean I'd not catch the interrupt activity for such times. But it does not appear that powerpc_interrupt is involved if any interrupts are involved in the ofwcall failures. It would need to be something that avoids powerpc_interrupt. Otherwise I should have seen the report ("!!") in the history that it listed.

===
Mark Millard
markmi at dsl-only.net

On Oct 17, 2014, at 5:25 AM, Mark Millard wrote:

I noticed for the normal PowerMac G5 failing place that %r2 was 0xd23050 (generally in my recent builds: even when it works) and that the failing %r1 and %r3 were then 0xd23450. In other words: %r2+0x400.
An accident or is that what the code is doing when it fails?

So I cleared the %r2 value in ofwcall (leaving the TOC value in the ofwstk place and using that to put the TOC back but having %r2=zero between).

Failure then produced 0x400 for %r1 and %r3: again %r2+0x400. And %r2 stayed at zero until the ofwcall code put the TOC value back.

In other words: the failing values are from adding/oring 0x400 to %r2, whatever its value.

Trying %r2=0x400 instead of zero resulted in %r1 and %r3 having 0x800: So %r2+0x400 instead of %r2 | 0x400.

After this I adjusted ofwcall to clear %r0, %r2, %r4-%r12 before the openfirmware code executes and to leave them alone after. (Thus I changed the register usage.) This allows (probably) seeing what other volatile registers may have changed, especially when past different value reports are considered/compared with the results for this context. %r19-%r28 are also set to zero. %r13-%r18,%r29 deliberately record things to report on failure (breakpoint use). (Prior tests showed %r14-%r19 as zero after so this tests if they are left alone vs. set to zero.) (The %r13-%r18 values also are generally put to some normal use as well.) %r30-%r31 are also put to various uses (not so simple to tell if preserved).

Results beyond %r1/%r3's corruption: %r4 gets the lr value used by openfirmware in returning to ofwcall: %r4 is no longer zero. Past activity shows %r0 is actually set to zero if it is initially non-zero instead, which is not visible for how I did this test. The other registers that were intended to be easily tested were preserved, including %r2.

[This only tests/classifies the failing openfirmware call. So the normal test here is of the first ofwcall once pmap_bootstrapped is true. It happens to be a peer request. Other requests could be different in the details for all I know. I've not hit any other ofwcall failure points so far.]

The above notes should apply even if the "Count of times ..." string is an accident of %r2's normal value at the time of the failure (0xd23050) --rather than a deliberate attempted addressing of the string's reference in the .got. [.got: global offset table used, for example, for position independent code access to things in memory.]

===
Mark Millard
markmi at dsl-only.net

More information on the odd %r1 and %r3 value...

The current and recent kernels that I've built get 0xd23450 for the corrupted values in %r1 and %r3 after openfirmware returns.

So I decided to look up what that might be...

objdump -h /boot/kernel/kernel shows (.got: "global object table" or some such?) ...

Sections:
Idx Name            Size      VMA               LMA               File off  Algn
...
 35 .got            0002f5c0  0000000000cfb248  0000000000cfb248  00bfb248  2**3
                    CONTENTS, ALLOC, LOAD, DATA
 36 .dynamic        000000d0  0000000000d2a808  0000000000d2a808  00c2a808  2**3
                    CONTENTS, ALLOC, LOAD, DATA
...

and objdump -s -j .got /boot/kernel/kernel shows...

 d23438 00000000 00bbfd48 00000000 00bbfd60  .......H.......`
 d23448 00000000 00bbfd90 00000000 00bbfdb0  ................
 d23458 00000000 00bbfdf0 00000000 00e17dd0  ..............}.

Then for 0xbbfdb0 from the above: objdump -h /boot/kernel/kernel shows...

  6 .rodata.str1.8  000834a8  0000000000b4ddf8  0000000000b4ddf8  00a4ddf8  2**3
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 set_sysinit_set 00002538  0000000000bd12a0  0000000000bd12a0  00ad12a0  2**3
                    CONTENTS, ALLOC, LOAD, READONLY, DATA

and objdump -s -j .rodata.str1.8 /boot/kernel/kernel shows...

 bbfda8 6f756e74 00000000 436f756e 74206f66  ount....Count of
 bbfdb8 2074696d 65732074 68726f74 746c696e   times throttlin
 bbfdc8 67206261 73656420 6f6e2072 65717565  g based on reque
 bbfdd8 73742073 70616365 20686173 206f6363  st space has occ
 bbfde8 75727265 64000000 25733a20 6d617374  urred...%s: mast

So 0xd23450 appears to possibly be an indirect reference to the string "Count of times throttling based on request space has occurred", or to similar content reached through the .got section from some offset relative to 0xd23450. That string that I quoted is from /usr/src/sys/rpc/svc.c:

SVCPOOL*
svcpool_create(const char *name, struct sysctl_oid_list *sysctl_base)
{
...
	SYSCTL_ADD_INT(&pool->sp_sysctl, sysctl_base, OID_AUTO,
	    "request_space_throttle_count", CTLFLAG_RD,
	    &pool->sp_space_throttle_count, 0,
	    "Count of times throttling based on request space has occurred");
	}
	return pool;
}

(I have not done this lookup sequence across the various FreeBSD updates and rebuilds that also get 0xd23450 in %r1 and %r3, nor with FreeBSD builds that get some other corruption value. I do not know that the indirect lookup would always have gotten to that same string.)

===
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 11:30 PM, Mark Millard wrote:

I added the after-ofwcall %r1 and %r3 values to the ofwcall history buffer that I have ddb report when there is a problem.

This makes it apparent that %r3 has also been corrupted whenever %r1 has been. I say that because the usual/normal %r3 value is 0 in what the code records and reports, and I gather from the FreeBSD source code that the error indicator is -1. But all along I've been reporting %r3 values for the crashes that look more like 0xd18868 or other such. Never a 0 or -1 (0xfff...). And the %r3 crash values even move around when the ofwstk changes place from build to build.