Date:      Tue, 31 Jan 2017 09:15:13 -0800
From:      Conrad Meyer <cem@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Message-ID:  <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com>
In-Reply-To: <20170201005009.E2504@besplex.bde.org>
References:  <201701310326.v0V3QW30024375@repo.freebsd.org> <20170131153411.G1061@besplex.bde.org> <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com> <20170131175309.N1418@besplex.bde.org> <20170201005009.E2504@besplex.bde.org>

On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde@optusnet.com.au> wrote:
> On Tue, 31 Jan 2017, Bruce Evans wrote:
> Unrolling (or not) may be helpful or harmful for entry and exit code.

Helpful, per my earlier benchmarks.

> I think there should be no alignment on entry -- just assume the buffer is
> aligned in the usual case, and only run 4% slower when it is misaligned.

Please write such a patch and demonstrate the improvement.

> The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
> bytes or up to 3 times that like simpler algorithms.  768 is quite
> large, and the exit code is quite slow.  It reduces 8 or 4 bytes at a
> time without any dependency reduction, and then 1 byte at a time.

Yes, this is the important loop to unroll for small inputs.  Somehow
with the unrolling, it is only ~19% slower than the by-3 algorithm on
my system -- not 66%.  Clang 3.9.1 unrolls both of these trailing
loops; here is the first:

   0x0000000000401b88 <+584>:   cmp    $0x38,%rbx
   0x0000000000401b8c <+588>:   jae    0x401b93 <sse42_crc32c+595>
   0x0000000000401b8e <+590>:   mov    %rsi,%rdx
   0x0000000000401b91 <+593>:   jmp    0x401be1 <sse42_crc32c+673>
   0x0000000000401b93 <+595>:   lea    -0x1(%rdi),%rbx
   0x0000000000401b97 <+599>:   sub    %rdx,%rbx
   0x0000000000401b9a <+602>:   mov    %rsi,%rdx
   0x0000000000401b9d <+605>:   nopl   (%rax)
   0x0000000000401ba0 <+608>:   crc32q (%rdx),%rax
   0x0000000000401ba6 <+614>:   crc32q 0x8(%rdx),%rax
   0x0000000000401bad <+621>:   crc32q 0x10(%rdx),%rax
   0x0000000000401bb4 <+628>:   crc32q 0x18(%rdx),%rax
   0x0000000000401bbb <+635>:   crc32q 0x20(%rdx),%rax
   0x0000000000401bc2 <+642>:   crc32q 0x28(%rdx),%rax
   0x0000000000401bc9 <+649>:   crc32q 0x30(%rdx),%rax
   0x0000000000401bd0 <+656>:   crc32q 0x38(%rdx),%rax
   0x0000000000401bd7 <+663>:   add    $0x40,%rdx
   0x0000000000401bdb <+667>:   add    $0x8,%rbx
   0x0000000000401bdf <+671>:   jne    0x401ba0 <sse42_crc32c+608>
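
For readers without the source at hand, the shape of the trailing loops discussed above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: soft_crc32c_u8(), soft_crc32c_u64() and crc32c_tail_sketch() are hypothetical names, and the two soft_* helpers are slow bitwise stand-ins for the byte-wide and quad-word crc32 instructions so that the sketch builds without SSE4.2. Any unrolling, as in the disassembly, is left to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical software stand-in for the byte-wide crc32 instruction
 * (CRC-32C, reflected polynomial 0x82F63B78).
 */
uint32_t
soft_crc32c_u8(uint32_t crc, uint8_t byte)
{
	crc ^= byte;
	for (int bit = 0; bit < 8; bit++)
		crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	return (crc);
}

/* Hypothetical stand-in for crc32q: folds 8 bytes, low byte first. */
uint32_t
soft_crc32c_u64(uint32_t crc, uint64_t word)
{
	for (int i = 0; i < 8; i++)
		crc = soft_crc32c_u8(crc, (uint8_t)(word >> (8 * i)));
	return (crc);
}

/* Trailing loops: 8 bytes at a time, then 1 byte at a time. */
uint32_t
crc32c_tail_sketch(uint32_t crc, const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	while (len >= 8) {
		uint64_t word = 0;

		/* Build a little-endian word so the sketch is portable. */
		for (int i = 0; i < 8; i++)
			word |= (uint64_t)buf[i] << (8 * i);
		crc = soft_crc32c_u64(crc, word);
		buf += 8;
		len -= 8;
	}
	while (len > 0) {
		crc = soft_crc32c_u8(crc, *buf++);
		len--;
	}
	return (crc);
}
```

Starting from 0xFFFFFFFF and inverting the result, the 9-byte ASCII string "123456789" takes one 8-byte step and one 1-byte step and yields the standard CRC-32C check value 0xE3069283.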


> I don't understand the algorithm for joining crcs -- why doesn't it work
> to reduce to 12 or 24 bytes in the main loop?

It would, but I haven't implemented or tested that.  You're welcome to
do so and demonstrate an improvement.  It does add more lookup table
bloat, but perhaps we could just remove the 3x8k table -- I'm not sure
it adds any benefit over the 3x256 table.
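
As background on the joining step: the CRC register update is linear over GF(2), so the CRC of a concatenation A||B can be formed from the running CRC of A, advanced over len(B) zero bytes, XORed with the CRC of B computed from a zero initial state. The sketch below only illustrates that identity. It advances the state by feeding zero bytes one at a time, whereas the committed code is assumed to get the same effect from its lookup tables (one table set per fixed block length, hence the table cost mentioned above). All function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C state update, with no initial or final inversion. */
uint32_t
crc32c_raw_sketch(uint32_t crc, const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return (crc);
}

/* Advance a CRC state as if len zero bytes had followed it. */
uint32_t
crc32c_shift_zeros_sketch(uint32_t crc, size_t len)
{
	const unsigned char zero = 0;

	while (len-- > 0)
		crc = crc32c_raw_sketch(crc, &zero, 1);
	return (crc);
}

/*
 * Join two lanes: crc_a is the running state after block A, crc_b0 is
 * the state after block B when started from 0, len_b is B's length.
 */
uint32_t
crc32c_join_sketch(uint32_t crc_a, uint32_t crc_b0, size_t len_b)
{
	return (crc32c_shift_zeros_sketch(crc_a, len_b) ^ crc_b0);
}
```

For example, splitting "123456789" into "1234" and "56789", running the first lane from 0xFFFFFFFF and the second from 0, and joining with len_b = 5 gives the same state as one serial pass over all 9 bytes.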

> Your benchmarks mainly give results for the <= 768 bytes where most of
> the manual optimizations don't apply.

0x000400: asm:68 intrins:62 multitable:684  (ns per buf)
0x000800: asm:132 intrins:133  (ns per buf)
0x002000: asm:449 intrins:446  (ns per buf)
0x008000: asm:1501 intrins:1497  (ns per buf)
0x020000: asm:5618 intrins:5609  (ns per buf)

(All routines are in a separate compilation unit with no full-program
optimization, as they are in the kernel.)

> Compiler optimizations are more
> likely to help there.  So I looked more closely at the last 2 loops.
> clang indeed only unrolls the last one,

Not in 3.9.1.

> only for the unreachable case
> with more than 8 bytes on amd64.

How is it unreachable?

Best,
Conrad
