From owner-freebsd-ppc@freebsd.org Wed May 15 21:41:20 2019 Return-Path: Delivered-To: freebsd-ppc@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 48A8F15A3209 for ; Wed, 15 May 2019 21:41:20 +0000 (UTC) (envelope-from marklmi@yahoo.com) Received: from sonic316-55.consmr.mail.gq1.yahoo.com (sonic316-55.consmr.mail.gq1.yahoo.com [98.137.69.31]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 3091972C3B for ; Wed, 15 May 2019 21:41:19 +0000 (UTC) (envelope-from marklmi@yahoo.com) X-YMail-OSG: N82fH18VM1kFE8Ii5z60IOvkkhqOAyFN6yVs0RSqWATtPYghDxzGiTOhcFT_MwO ZI8PhFU2bVGEX7asNN63rgdkw9zWz_8F0MY4wldcs8dsAvPgoQnUzXQYm31OWynR2p07lsU0Dit6 POXjldq7uQsG7C92ahZkcRAIZpTvDbFu7oJz98XFpS_D2y7LXHv2RsRmudor22EO879_79VU3yFN hnvUrmnM12tmRTeCGvsR3Xv1xNE_z9w8lvebBZ2bTP91CUZqJnJUV.oaViEPbH7DmFOW3apswTF9 hkDN8LImO0Wkk2Y5_TrTH8buq2HIX.S_z3MHGkGrlugQZPcNzQsqGe013ocfJiSS715BB99e1LdR f2qVFxgMz35KR0LjJ7RVVfe8mxjliv8HKPJ2DKI_cLmhODRnIkMdPCui6xP.HivATk2ftcewoE2N oPC65Pd.E6dXBLi67ehmIqXfykcYZ1KvccGRuiMoIM5Jp9tDoRXdoePixCIcboi42tw1mT3AoZZE P7DlMMhbnDePUSVPdhpGWeTbZip04_C0vd7Ggy_GoaO76JOj_BQiTsMLyfa0cTsvy2WTjP0jOvM7 H32eEm0zUlWZGQVdC.8P7RyOWMHvsUgMZM4aaQAeEpTeG.xoiPdr4lG2_PDbB5A9qy4aCCCOhw2E .GYbo5uiyMydr2mnLQbgBCNvKa4pApOnp9u0TORFm_yVnZknwnyqJa4Ci3HNWOM0ZC9Vk2UsCqLl K6K1gWFvQSDDmxH5TzsFCBWfCs6wYeGyXzdsT_rNbh6hwTIDtnL9ippeOumR8mVh4kEPPxGLGS1g cYLmNkKyfBsKA9cuURC73wA4qVTCgaSFFiRHWs330NCyQdtF0HgcE5loDyeLtRMj6ZWseNDj.c7b vYbv4hRkJVskh7UdXsVQKKb5xCOzO3zXVMut21IHvi1Px6moofOFYn1FOH_lX89tXn7MuUF.yV6I P4iSMcFIBXRfZfLUaGtKAWoPZjgOtEL0fIzzCNnsZqap73DE22RbrEFCGyWxskik16OoedcSOitK rjRydrycE32TXRZV68MnpKtqJr80fFdU7jI3hF4AnYK5d4_tE_WLocSLCKfPI1YRUi.qPO3Sa2sb DENNyfjFQqwDy_KKdTzIbdfRzxF7zQ15IIivhIVgAv_uKHUpXi6xktdeXm0wKrWoIxdSUiYLJfqd I03JrQCkjUvLBAWxWBLZ14949A_NucMqyoPV_aoY- Received: from sonic.gate.mail.ne1.yahoo.com by sonic316.consmr.mail.gq1.yahoo.com with HTTP; Wed, 15 May 2019 21:41:12 +0000 Received: from c-76-115-7-162.hsd1.or.comcast.net (EHLO [192.168.1.103]) ([76.115.7.162]) by smtp413.mail.gq1.yahoo.com (Oath Hermes SMTP Server) with ESMTPA ID 74f20ef8519f6674c6b1307db8d5a50f; Wed, 15 May 2019 21:41:11 +0000 (UTC) From: Mark Millard Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.8\)) Subject: Re: An experiment in PowerMac G5 multi-socket/multi-core having better matching mftb() values Date: Wed, 15 May 2019 14:41:11 -0700 References: <48525401-7C75-4AE7-98D9-AD7CC7F53DE8@yahoo.com> To: Justin Hibbits , FreeBSD PowerPC ML In-Reply-To: <48525401-7C75-4AE7-98D9-AD7CC7F53DE8@yahoo.com> Message-Id: <34AF33E2-F115-47AB-B538-20E5506F2634@yahoo.com> X-Mailer: Apple Mail (2.3445.104.8) X-Rspamd-Queue-Id: 3091972C3B X-Spamd-Bar: ++ X-Spamd-Result: default: False [2.26 / 15.00]; RCVD_VIA_SMTP_AUTH(0.00)[]; R_SPF_ALLOW(-0.20)[+ptr:yahoo.com]; MV_CASE(0.50)[]; FREEMAIL_FROM(0.00)[yahoo.com]; RCVD_COUNT_THREE(0.00)[3]; TO_DN_ALL(0.00)[]; MX_GOOD(-0.01)[cached: mta6.am0.yahoodns.net]; DKIM_TRACE(0.00)[yahoo.com:+]; RCPT_COUNT_TWO(0.00)[2]; DMARC_POLICY_ALLOW(-0.50)[yahoo.com,reject]; FREEMAIL_TO(0.00)[gmail.com]; FROM_EQ_ENVFROM(0.00)[]; RCVD_TLS_LAST(0.00)[]; MIME_TRACE(0.00)[0:+]; FREEMAIL_ENVFROM(0.00)[yahoo.com]; ASN(0.00)[asn:36647, ipnet:98.137.64.0/21, country:US]; MID_RHS_MATCH_FROM(0.00)[]; DWL_DNSWL_NONE(0.00)[yahoo.com.dwl.dnswl.org : 127.0.5.0]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-0.62)[-0.624,0]; R_DKIM_ALLOW(-0.20)[yahoo.com:s=s2048]; FROM_HAS_DN(0.00)[]; NEURAL_SPAM_SHORT(0.98)[0.978,0]; MIME_GOOD(-0.10)[text/plain]; IP_SCORE(2.01)[ip: (8.30), ipnet: 98.137.64.0/21(1.02), asn: 36647(0.81), country: US(-0.06)]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_SPAM_LONG(0.40)[0.401,0]; RCVD_IN_DNSWL_NONE(0.00)[31.69.137.98.list.dnswl.org : 127.0.5.0] X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 15 May 2019 21:41:20 -0000 [Use on an actual 2-socket 7455-based G4 PowerMac has lead to an update.] On 2019-May-14, at 17:42, Mark Millard wrote: > [Switching to code not using 64-bit atomics.] >=20 > On 2019-May-13, at 03:23, Mark Millard wrote: >=20 >> I've been experimenting with a alternate >> technique of dealing with boot-time 970 family >> PowerMac G5 tbr value synchronization across >> sockets/cores. So far it has narrowed the >> range significantly. I've reverted my hack for >> tolerating the mismatches in order to see how >> it goes. >>=20 >> . . . >>=20 >>=20 >> # svnlite diff /usr/src/sys/powerpc/powermac/platform_powermac.c = /usr/src/sys/powerpc/powerpc/mp_machdep.c | more >> . . . >=20 > So far the experiment has gone well. >=20 > But the original code used 64-bit atomic types > and so was inappropriate for multi-socket > PowerMac G4's. >=20 > So I've been experimenting with an alternate > coding that is not powerpc64 specific. >=20 > I present the updated code below, still only > enabling the new technique for PowerMac's. >=20 > The example code does show my use of volatile > for the ap_pcpu pointer value and so would not > match the svn code in that respect: >=20 > /usr/src/sys/powerpc/aim/aim_machdep.c:extern void * volatile ap_pcpu; > /usr/src/sys/powerpc/aim/mp_cpudep.c:void * volatile ap_pcpu; > /usr/src/sys/powerpc/pseries/platform_chrp.c:extern void *ap_pcpu; > /usr/src/sys/powerpc/powermac/platform_powermac.c:extern void * = volatile ap_pcpu; >=20 > (booke has an example volatile for what is pointed to. > But I've not dealt with examples that I do not have to test > and so, thus far, I only changed the above for the issue.) >=20 > (Whitespace details may not survive such e-mail > based handling.) >=20 > Index: /usr/src/sys/powerpc/powermac/platform_powermac.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- /usr/src/sys/powerpc/powermac/platform_powermac.c (revision = 347549) > +++ /usr/src/sys/powerpc/powermac/platform_powermac.c (working copy) > @@ -55,7 +55,7 @@ >=20 > #include "platform_if.h" >=20 > -extern void *ap_pcpu; > +extern void * volatile ap_pcpu; >=20 > static int powermac_probe(platform_t); > static int powermac_attach(platform_t); > @@ -333,6 +333,9 @@ > return (powermac_smp_fill_cpuref(cpuref, bsp)); > } >=20 > +// platform_powermac.c is implicitly an AIM context: no explicit AIM = test. > +extern volatile int alternate_timebase_sync_style; // 0 indicates old = style; 1 indicates new style > + > static int > powermac_smp_start_cpu(platform_t plat, struct pcpu *pc) > { > @@ -367,6 +370,13 @@ >=20 > ap_pcpu =3D pc; >=20 > + // platform_powermac.c is implicitly an AIM context: no = explicit AIM test. > + // Part of: Attempt a better-than-historical approximately > + // equal timebase value for ap vs. bsp > + alternate_timebase_sync_style=3D 1; // So: new style for = PowerMacs > + > + powerpc_sync(); // for ap_pcpu and = alternate_timebase_sync_style > + > if (rstvec_virtbase =3D=3D NULL) > rstvec_virtbase =3D pmap_mapdev(0x80000000, PAGE_SIZE); >=20 >=20 > Index: /usr/src/sys/powerpc/powerpc/mp_machdep.c > . . . While things seem okay for the 32-bit powerpc build run on the 2 types of G5's that I sometimes have access to, use on a 2-socket 7455-based G4 PowerMac lead to improvements for 32-bit support . . . The G4 has a faster tbr increment rate than the G5s but has notably slower processors than the G5s. This leads to code activity showing up more in biasing timings. Controlling the timing in the code of access the 64-bit tbr value used as a basis for adjusting the tbr via mttb helps significantly for such a context. The result is definitely better than the historical technique, for the 3 types of multi-socket and/or multi-core PowerMacs that I've been able to test. (powerpc64 FreeBSD is not intended to have much observable change from the updated source code.) The replacement for sys/powerpc/powerpc/mp_machdep.c experiment is: Index: /usr/src/sys/powerpc/powerpc/mp_machdep.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- /usr/src/sys/powerpc/powerpc/mp_machdep.c (revision 347549) +++ /usr/src/sys/powerpc/powerpc/mp_machdep.c (working copy) @@ -70,6 +70,88 @@ static struct mtx ap_boot_mtx; struct pcb stoppcbs[MAXCPU]; =20 +#if defined(AIM) +// Part of: Attempt a better-than-historical approximately +// equal timebase value for ap vs. bsp + +volatile int alternate_timebase_sync_style=3D 0; // 0 indicates old = style; 1 indicates new style. +volatile uint64_t bsp_timebase_sample=3D 0u; + +volatile unsigned int from_bsp_status_flag=3D 0u; +// stages: 0u, 1u (bsp ready to start), 2u (bsp tbr value available to = ap) + +volatile unsigned int from_ap_status_flag=3D 0u; +// stages: 0u, 1u (ap ready for bsp tbr value to be found and sent) +#endif + +static __inline uint64_t +mftb_with_no_pointer_use(void) +{ +#ifdef __powerpc64__ + uint64_t tb; // not used for 32-bit powerpc + __asm __volatile ("mftb %0" : "=3Dr"(tb)); + return tb; +#else + uint32_t tbu; // not pointer into tb + uint32_t tbl; // not pointer into tb + + do { + tbu=3D mfspr(TBR_TBU); + tbl=3D mfspr(TBR_TBL); + } while (tbu !=3D mfspr(TBR_TBU)); + + // The construction of the unint64_t value does bias the mttb = some + // for the round-trip-start side of things. + // + // The pointers into tb technique would involve a pair of memory + // writes and a pair of memory reads instead, the writes being + // in the loop. + return ((uint64_t)tbu<<32) | tbl; +#endif +} + +static __inline uint64_t +mftb_plus_delta(volatile uint64_t* bsp_tbr, int64_t ap_midpoint) + // The return value is used in the mttb as the argument. +{ +#ifdef __powerpc64__ +uint64_t tb; // not used for 32-bit powerpc +__asm __volatile ("mftb %0" : "=3Dr"(tb)); + // The construction of the unint64_t value does bias the mttb = some: + // it assignes an earlier time than hoped for, given these later + // calculations. + return tb + ((int64_t)*bsp_tbr - ap_midpoint); +#else + // Establishes delta_for_approx_match_to_bsp_tbr_values such = that: + // = ap_midpoint+delta_for_approx_match_to_bsp_tbr_values=3D=3D*bsp_tbr + int64_t delta_for_approx_match_to_bsp_tbr_values; + uint32_t tbu; // not pointer into tb + uint32_t tbl; // not pointer into tb + + do { + // The below in-loop style is for avoiding the loop + // vs. ap_midpoint's calculation being reversed in + // the code generated: volatile is is being put to + // use here. + delta_for_approx_match_to_bsp_tbr_values=3D = (int64_t)*bsp_tbr-ap_midpoint; + + tbu=3D mfspr(TBR_TBU); + tbl=3D mfspr(TBR_TBL); + } while (tbu !=3D mfspr(TBR_TBU)); + + // The construction of the unint64_t value does bias the mttb = some: + // it assignes an earlier time than hoped for, given these later + // calculations. Easily observable on the example 7455 based = PowerMac + // G4. (Faster than G5 tbr increment rate but a slower = processor,) + // But the overall process is still an improvement. + // + // The pointers into tb technique would involve a pair of memory + // writes and a pair of memory reads instead, the writes being + // in the loop. The "+ . . ." would still be involved. + return ( ((uint64_t)tbu<<32) | tbl ) + = delta_for_approx_match_to_bsp_tbr_values; +#endif +} + void machdep_ap_bootstrap(void) { @@ -77,19 +159,75 @@ PCPU_SET(awake, 1); __asm __volatile("msync; isync"); =20 +#if defined(AIM) + powerpc_sync(); + isync(); + if (1=3D=3Dalternate_timebase_sync_style) + { + // Part of: Attempt a better-than-historical = approximately + // equal timebase value for ap vs. bsp + + // No claim to deal with overflow/wraparound of tbr, or = even + // of the upper bit being on. + + register_t oldmsr=3D intr_disable(); + + while (1u!=3Dfrom_bsp_status_flag) + ; // spin waiting for bsp to flag that its ready = to start. + + // Start to measure a round trip:: to the bsp and back. + + isync(); // Be sure below mftb() result is not from = earlier speculative execution. + uint64_t const start_round_trip_time_on_ap=3D = mftb_with_no_pointer_use(); + atomic_store_rel_int(&from_ap_status_flag, 1u); // bsp = waits for such before its mftb(). + + while (2u!=3Dfrom_bsp_status_flag) + ; // spin waiting for bsp's tbr value + + // Mid-point of ap round trip and the bsp timebase value = should be approximately equal + // when the tbr's are well matched, absent interruptions = on both sides. + + isync(); // Be sure below mftb() result is not from = earlier speculative execution. + uint64_t const end_round_trip_time_on_ap=3D = mftb_with_no_pointer_use(); + + int64_t const approx_round_trip_tbr_delta_on_ap + =3D (int64_t)end_round_trip_time_on_ap - = (int64_t)start_round_trip_time_on_ap; + int64_t const ap_midpoint_value + =3D (int64_t)start_round_trip_time_on_ap = + approx_round_trip_tbr_delta_on_ap/2; + + // The mftb_plus_delta use is for helping to the control = the code order relative + // to tbr access. Such issues are notable for the 7455 = based 2-socket PowerMacs, + // for example. Faster tbr increment rate than the G5's = but slower processors + // and such. Still, overall this definitely helps such = contexts compared to the + // historical style of timebase synchronization. + isync(); // Be sure below mftb() result is not from = earlier speculative execution. + = mttb(mftb_plus_delta(&bsp_timebase_sample,ap_midpoint_value)); + + atomic_store_rel_int(&from_bsp_status_flag, 0u); // Get = ready for next ap in bsp loop + atomic_store_rel_int(&from_ap_status_flag, 0u); // Flag = bsp that this ap is done + + mtmsr(oldmsr); + } +#endif + while (ap_letgo =3D=3D 0) nop_prio_vlow(); nop_prio_medium(); =20 - /* - * Set timebase as soon as possible to meet an implicit = rendezvous - * from cpu_mp_unleash(), which sets ap_letgo and then = immediately - * sets timebase. - * - * Note that this is instrinsically racy and is only relevant on - * platforms that do not support better mechanisms. - */ - platform_smp_timebase_sync(ap_timebase, 1); +#if defined(AIM) + if (0=3D=3Dalternate_timebase_sync_style) +#endif + { + /* + * Set timebase as soon as possible to meet an implicit = rendezvous + * from cpu_mp_unleash(), which sets ap_letgo and then = immediately + * sets timebase. + * + * Note that this is instrinsically racy and is only = relevant on + * platforms that do not support better mechanisms. + */ + platform_smp_timebase_sync(ap_timebase, 1); + } =20 /* Give platform code a chance to do anything else necessary */ platform_smp_ap_init(); @@ -260,6 +398,34 @@ pc->pc_cpuid, = (uintmax_t)pc->pc_hwref, pc->pc_awake); smp_cpus++; + +#if defined(AIM) + // Part of: Attempt a better-than-historical = approximately + // equal timebase value for ap vs. bsp + powerpc_sync(); + isync(); + if (1=3D=3Dalternate_timebase_sync_style) + { + register_t oldmsr=3D intr_disable(); + + = atomic_store_rel_int(&from_bsp_status_flag, 1u); // bsp ready to start. + + while (1u!=3Dfrom_ap_status_flag) + ; // spin waiting for ap to = flag: time to send a tbr. + + isync(); // Be sure below mftb() result = is not from earlier. + bsp_timebase_sample=3D = mftb_with_no_pointer_use(); + = atomic_store_rel_int(&from_bsp_status_flag, 2u); // bsp tbr available. + + // Most of the rest of the usage is in = machdep_ap_bootstrap, + // other than controling = alternate_timebase_sync_style value. + + while (0u!=3Dfrom_ap_status_flag) + ; // spin waiting for ap to be = done with the sample. + + mtmsr(oldmsr); + } +#endif } else CPU_SET(pc->pc_cpuid, &stopped_cpus); } @@ -266,14 +432,22 @@ =20 ap_awake =3D 1; =20 - /* Provide our current DEC and TB values for APs */ - ap_timebase =3D mftb() + 10; - __asm __volatile("msync; isync"); +#if defined(AIM) + if (0=3D=3Dalternate_timebase_sync_style) +#endif + { + /* Provide our current DEC and TB values for APs */ + ap_timebase =3D mftb() + 10; + __asm __volatile("msync; isync"); + } =09 /* Let APs continue */ atomic_store_rel_int(&ap_letgo, 1); =20 - platform_smp_timebase_sync(ap_timebase, 0); +#if defined(AIM) + if (0=3D=3Dalternate_timebase_sync_style) +#endif + platform_smp_timebase_sync(ap_timebase, 0); =20 while (ap_awake < smp_cpus) ; =3D=3D=3D Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)