Date:      Wed, 1 Feb 2017 14:16:47 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Conrad Meyer <cem@freebsd.org>
Cc:        Bruce Evans <brde@optusnet.com.au>,  src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Message-ID:  <20170201123838.X1974@besplex.bde.org>
In-Reply-To: <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com>
References:  <201701310326.v0V3QW30024375@repo.freebsd.org> <20170131153411.G1061@besplex.bde.org> <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com> <20170131175309.N1418@besplex.bde.org> <20170201005009.E2504@besplex.bde.org> <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com>


On Tue, 31 Jan 2017, Conrad Meyer wrote:

> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde@optusnet.com.au> wrote:
>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>> Unrolling (or not) may be helpful or harmful for entry and exit code.
>
> Helpful, per my earlier benchmarks.
>
>> I
>> think there should be no alignment on entry -- just assume the buffer is
>> aligned in the usual case, and only run 4% slower when it is misaligned.
>
> Please write such a patch and demonstrate the improvement.

It is easy to demonstrate.  I just put #if 0 around the early alignment
code.  The results seem too good to be true, so maybe I missed some
later dependency on alignment of the addresses:
- for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
   alignment and 1.02 seconds without alignment.
This actually makes sense: 128 bytes can be done with 16 8-byte unaligned
crc32q's.  The alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte.
7 more 3-cycle instructions, plus overhead, cost far more than letting
the CPU do read-combining.
- for 4096-byte buffers, the difference is insignificant (0.47 seconds for
   10g).
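
The instruction counts above can be checked with a small sketch.  This is
only bookkeeping, not the kernel code: the struct and function names are
made up for illustration, and it assumes the aligning variant uses 1-byte
steps for the head and tail and 8-byte steps for everything in between.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tally of CRC32 instructions issued for one buffer. */
struct crc_op_count {
	size_t byte_ops;	/* 1-byte (crc32b-style) steps */
	size_t quad_ops;	/* 8-byte (crc32q-style) steps */
};

/*
 * Aligning variant: single bytes until the pointer is 8-byte aligned,
 * then whole quadwords, then the leftover tail bytes.
 */
static struct crc_op_count
count_with_alignment(size_t misalign, size_t len)
{
	struct crc_op_count c = { 0, 0 };
	size_t head;

	assert(misalign < 8);
	head = (8 - misalign) % 8;
	if (head > len)
		head = len;
	c.byte_ops = head;
	len -= head;
	c.quad_ops = len / 8;
	c.byte_ops += len % 8;
	return (c);
}

/* Non-aligning variant: unaligned quadwords from the start, then the tail. */
static struct crc_op_count
count_without_alignment(size_t len)
{
	struct crc_op_count c;

	c.quad_ops = len / 8;
	c.byte_ops = len % 8;
	return (c);
}
```

For misalign = 3 and len = 128 this gives 15 quadword steps plus 5 + 3
byte steps with alignment, against 16 quadword steps without it, i.e. 7
extra instructions per buffer.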

>> I
>> don't understand the algorithm for joining crcs -- why doesn't it work
>> to reduce to 12 or 24 bytes in the main loop?
>
>It would, but I haven't implemented or tested that.  You're welcome to
>do so and demonstrate an improvement.  It does add more lookup table
>bloat, but perhaps we could just remove the 3x8k table -- I'm not sure
>it adds any benefit over the 3x256 table.

Good idea, but the big table is useful.  Ifdefing out the LONG case reduces
the speed for large buffers from ~0.35 seconds to ~0.43 seconds in the
setup below.  Ifdefing out the SHORT case only reduces to ~0.39 seconds.
I hoped that an even shorter SHORT case would work.  I think it now handles
768 bytes (3 * SHORT) in the inner loop.  That is 32 sets of 3 crc32q's,
and I would have thought that the update at the end would take about as
long as 1 iteration (3%), but it apparently takes 33%.
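
For reference, the join step being discussed can be sketched in portable C.
This is not the kernel implementation: it assumes SHORT is 256 bytes (from
768 = 3 * SHORT above), replaces crc32q with a bitwise loop, and replaces
the lookup tables with a naive "feed zero bytes" shift.  All names here are
hypothetical.  It only shows why three independently computed lane CRCs can
be merged into the CRC of the whole 768-byte block.

```c
#include <stddef.h>
#include <stdint.h>

#define	CRC32C_POLY	0x82F63B78u	/* reflected CRC32C polynomial */
#define	SHORT_BLOCK	256		/* assumed lane length */

/* Raw CRC32C update, one bit at a time (no initial or final inversion). */
static uint32_t
crc32c_bytes(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len-- > 0) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1)));
	}
	return (crc);
}

/* Advance a CRC state over nbytes zero bytes (stand-in for the tables). */
static uint32_t
crc32c_shift(uint32_t crc, size_t nbytes)
{
	while (nbytes-- > 0)
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1)));
	return (crc);
}

/* CRC of 3 * SHORT_BLOCK bytes computed as three lanes and then joined. */
static uint32_t
crc32c_3way(uint32_t crc, const unsigned char *p)
{
	uint32_t crc0 = crc32c_bytes(crc, p, SHORT_BLOCK);
	uint32_t crc1 = crc32c_bytes(0, p + SHORT_BLOCK, SHORT_BLOCK);
	uint32_t crc2 = crc32c_bytes(0, p + 2 * SHORT_BLOCK, SHORT_BLOCK);

	/* crc(A||B) = shift(crc(A), |B|) ^ crc(B), applied twice. */
	crc0 = crc32c_shift(crc0, SHORT_BLOCK) ^ crc1;
	crc0 = crc32c_shift(crc0, SHORT_BLOCK) ^ crc2;
	return (crc0);
}
```

In this sketch the three lanes run one after another; the point of the
real code is that the three crc32q dependency chains can overlap, so only
the two shift-and-xor joins are extra work per block.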

>> ...
>> Your benchmarks mainly give results for the <= 768 bytes where most of
>> the manual optimizations don't apply.
>
> 0x000400: asm:68 intrins:62 multitable:684  (ns per buf)
> 0x000800: asm:132 intrins:133  (ns per buf)
> 0x002000: asm:449 intrins:446  (ns per buf)
> 0x008000: asm:1501 intrins:1497  (ns per buf)
> 0x020000: asm:5618 intrins:5609  (ns per buf)
>
> (All routines are in a separate compilation unit with no full-program
> optimization, as they are in the kernel.)

These seem slow.  I modified my program to test the actual kernel code,
and get for 10gB on freefall's Xeon (main times in seconds):

0x000008: asm(rm):3.41 asm(r):3.07 intrins:6.01 gcc:3.74  (3S = 2.4ns/buf)
0x000010: asm(rm):2.05 asm(r):1.70 intrins:2.92 gcc:2.62  (2S = 3/2ns/buf)
0x000020: asm(rm):1.63 asm(r):1.58 intrins:1.62 gcc:1.61  (1.6S = 5.12ns/buf)
0x000040: asm(rm):1.07 asm(r):1.11 intrins:1.06 gcc:1.14  (1.1S = 7.04ns/buf)
0x000080: asm(rm):1.02 asm(r):1.04 intrins:1.03 gcc:1.04  (1.02S = 13.06ns/buf)
0x000100: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.08  (1.02S = 52.22ns/buf)
0x000200: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.02  (1.02S = 104.45ns/buf)
0x000400: asm(rm):0.58 asm(r):0.57 intrins:0.57 gcc:0.57  (.57S = 116.43ns/buf)
0x001000: asm(rm):0.62 asm(r):0.57 intrins:0.57 gcc:0.57  (.57S = 233.44ns/buf)
0x002000: asm(rm):0.48 asm(r):0.46 intrins:0.46 gcc:0.46  (.46S = 376.83ns/buf)
0x004000: asm(rm):0.49 asm(r):0.46 intrins:0.46 gcc:0.46  (.46S = 753.66ns/buf)
0x008000: asm(rm):0.49 asm(r):0.38 intrins:0.38 gcc:0.38  (.38S = 1245.18ns/buf)
0x010000: asm(rm):0.47 asm(r):0.38 intrins:0.36 gcc:0.38  (.36S = 2359.30ns/buf)
0x020000: asm(rm):0.43 asm(r):1.05 intrins:0.35 gcc:0.36  (.35S = 4587.52ns/buf)

asm(r) is a fix for clang's slowness with inline asms.  Just change the
constraint from "rm" to "r".  This takes an extra register, but no more
uops.
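
A minimal sketch of what the constraint change looks like, assuming a
wrapper of roughly this shape (the function names are hypothetical and this
is not the committed code).  On x86-64 with a GNU-style compiler it issues
crc32q with an "r" input constraint; elsewhere it falls back to a bitwise
reference so the example still builds.

```c
#include <stdint.h>

/* Bitwise reference: advance a raw CRC32C state over the low nbits of data. */
static uint32_t
crc32c_sw(uint32_t crc, uint64_t data, unsigned nbits)
{
	uint64_t x = (uint64_t)crc ^ data;

	for (unsigned i = 0; i < nbits; i++)
		x = (x >> 1) ^ ((x & 1) ? 0x82F63B78u : 0);
	return ((uint32_t)x);
}

/* One 8-byte CRC32C step. */
static uint32_t
crc32c_u64(uint32_t crc, uint64_t data)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	uint64_t x = crc;

	/*
	 * "r" forces data into a register.  "rm" would also let the
	 * compiler pick a memory operand, which is the case described
	 * above as slow with clang.
	 */
	__asm__("crc32q %1, %0" : "+r" (x) : "r" (data));
	return ((uint32_t)x);
#else
	return (crc32c_sw(crc, data, 64));
#endif
}
```

The hardware path needs a CPU with SSE4.2; the fallback is only there to
keep the sketch portable and to give something to compare against.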

This is for the aligned case with no hacks.

intrins does something bad for small buffers.  Probably just the branch over
the dead unrolling.  Twice 2.4ns/buf for 8-byte buffers is still very fast.
This is 16 cycles.  3 cycles to do 1 crc32q and the rest mainly for 1 function
call and too many branches.
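
The time conversions can be sketched as follows.  Two things here are
assumptions rather than statements from the measurements: that "10g" is
taken as 10^10 bytes (which reproduces figures such as 3S = 2.4ns/buf for
8-byte buffers), and that the clock is about 3.33 GHz, which is simply what
makes 2 * 2.4 ns come out at 16 cycles.  The helper names are hypothetical.

```c
/* Nanoseconds per buffer when total_bytes are processed in `seconds`. */
static double
ns_per_buf(double seconds, double buf_bytes, double total_bytes)
{
	return (seconds * 1e9 * buf_bytes / total_bytes);
}

/* Convert a duration in nanoseconds to cycles at an assumed clock in GHz. */
static double
ns_to_cycles(double ns, double assumed_ghz)
{
	return (ns * assumed_ghz);
}
```

With these assumptions, ns_per_buf(3.0, 8, 1e10) is 2.4 and
ns_to_cycles(4.8, 10.0 / 3.0) is 16, leaving about 13 cycles of call and
branch overhead around a 3-cycle crc32q.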

Bruce
From owner-svn-src-head@freebsd.org  Wed Feb  1 03:29:15 2017
Message-Id: <201702010329.v113TEPn037471@repo.freebsd.org>
From: Justin Hibbits <jhibbits@FreeBSD.org>
Date: Wed, 1 Feb 2017 03:29:14 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org,
 svn-src-head@freebsd.org
Subject: svn commit: r313036 - in head/sys/powerpc: booke include

Author: jhibbits
Date: Wed Feb  1 03:29:13 2017
New Revision: 313036
URL: https://svnweb.freebsd.org/changeset/base/313036

Log:
  Add Book-E Enhanced Debug (E.D) profile debug support
  
  Freescale added the E.D profile to e500mc and derivative cores.  From
  Freescale's EREF reference manual this is enabled by a bit in HID0 and should
  otherwise default to traditional debug.  However, none of the Freescale cores
  support that bit, and instead always use E.D.  This results in kernel panics
  using the standard debug on e500mc+ cores.
  
  Enhanced debug allows debugging of interrupts, including critical interrupts,
  as it uses a different set of save/restore registers (srr*).  At this time we
  don't use this ability, so instead share the core of the debug handler code
  between both handlers.
  
  MFC after:	3 weeks

Modified:
  head/sys/powerpc/booke/booke_machdep.c
  head/sys/powerpc/booke/trap_subr.S
  head/sys/powerpc/include/spr.h

Modified: head/sys/powerpc/booke/booke_machdep.c
==============================================================================
--- head/sys/powerpc/booke/booke_machdep.c	Wed Feb  1 02:42:45 2017	(r313035)
+++ head/sys/powerpc/booke/booke_machdep.c	Wed Feb  1 03:29:13 2017	(r313036)
@@ -187,6 +187,7 @@ extern void *int_watchdog;
 extern void *int_data_tlb_error;
 extern void *int_inst_tlb_error;
 extern void *int_debug;
+extern void *int_debug_ed;
 extern void *int_vec;
 extern void *int_vecast;
 #ifdef HWPMC_HOOKS
@@ -242,6 +243,7 @@ ivor_setup(void)
 	case FSL_E500mc:
 	case FSL_E5500:
 		SET_TRAP(SPR_IVOR7, int_fpu);
+		SET_TRAP(SPR_IVOR15, int_debug_ed);
 		break;
 	case FSL_E500v1:
 	case FSL_E500v2:

Modified: head/sys/powerpc/booke/trap_subr.S
==============================================================================
--- head/sys/powerpc/booke/trap_subr.S	Wed Feb  1 02:42:45 2017	(r313035)
+++ head/sys/powerpc/booke/trap_subr.S	Wed Feb  1 03:29:13 2017	(r313036)
@@ -794,6 +794,22 @@ interrupt_vector_top:
 INTERRUPT(int_debug)
 	STANDARD_CRIT_PROLOG(SPR_SPRG2, PC_BOOKE_CRITSAVE, SPR_CSRR0, SPR_CSRR1)
 	FRAME_SETUP(SPR_SPRG2, PC_BOOKE_CRITSAVE, EXC_DEBUG)
+	bl	int_debug_int
+	FRAME_LEAVE(SPR_CSRR0, SPR_CSRR1)
+	rfci
+
+INTERRUPT(int_debug_ed)
+	STANDARD_CRIT_PROLOG(SPR_SPRG2, PC_BOOKE_CRITSAVE, SPR_DSRR0, SPR_DSRR1)
+	FRAME_SETUP(SPR_SPRG2, PC_BOOKE_CRITSAVE, EXC_DEBUG)
+	bl	int_debug_int
+	FRAME_LEAVE(SPR_DSRR0, SPR_DSRR1)
+	rfdi
+	/* .long 0x4c00004e */
+
+/* Internal helper for debug interrupt handling. */
+/* Common code between e500v1/v2 and e500mc-based cores. */
+int_debug_int:
+	mflr	%r14
 	GET_CPUINFO(%r3)
 	lwz	%r3, (PC_BOOKE_CRITSAVE+CPUSAVE_SRR0)(%r3)
 	bl	0f
@@ -819,7 +835,8 @@ INTERRUPT(int_debug)
 	mtspr	SPR_SRR0, %r3
 	lwz	%r4, (PC_BOOKE_CRITSAVE+CPUSAVE_SRR1+8)(%r4);
 	mtspr	SPR_SRR1, %r4
-	b	9f
+	mtlr	%r14
+	blr
 1:
 	addi	%r3, %r1, 8
 	bl	CNAME(trap)
@@ -828,10 +845,6 @@ INTERRUPT(int_debug)
 	 * We actually need to return to the process with an rfi.
 	 */
 	b	trapexit
-9:
-	FRAME_LEAVE(SPR_CSRR0, SPR_CSRR1)
-	rfci
-
 
 /*****************************************************************************
  * Common trap code

Modified: head/sys/powerpc/include/spr.h
==============================================================================
--- head/sys/powerpc/include/spr.h	Wed Feb  1 02:42:45 2017	(r313035)
+++ head/sys/powerpc/include/spr.h	Wed Feb  1 03:29:13 2017	(r313036)
@@ -671,6 +671,8 @@
 #define	SPR_CSRR1		0x03b	/* ..8 59 Critical SRR1 */
 #define	SPR_MCSRR0		0x23a	/* ..8 570 Machine check SRR0 */
 #define	SPR_MCSRR1		0x23b	/* ..8 571 Machine check SRR1 */
+#define	SPR_DSRR0		0x23e	/* ..8 574 Debug SRR0<E.ED> */
+#define	SPR_DSRR1		0x23f	/* ..8 575 Debug SRR1<E.ED> */
 
 #define	SPR_MMUCR		0x3b2	/* 4.. MMU Control Register */
 #define	  MMUCR_SWOA		(0x80000000 >> 7)


