Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 14 Jan 2009 00:33:31 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        d@delphij.net
Cc:        =?UTF-8?B?RGFnLUVybGluZyBTbcO4cmdyYXY=?= <des@des.no>, freebsd-arch@freebsd.org
Subject:   Re: RFC: MI strlen()
Message-ID:  <20090113232658.E31712@delplex.bde.org>
In-Reply-To: <496C3A80.5040007@delphij.net>
References:  <4966B5D4.7040709@delphij.net> <86zlhw5zsr.fsf@ds4.des.no> <496C3A80.5040007@delphij.net>

next in thread | previous in thread | raw e-mail | index | archive | help
  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--0-471949903-1231853611=:31712
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Mon, 12 Jan 2009, Xin LI wrote:

> Dag-Erling Sm=C3=B8rgrav wrote:
>> Xin LI <delphij@delphij.net> writes:
>>> Here is a new implementation of strlen() which employed the bitmask
>>> skill in order to achieve better performance on modern hardware.
>>
>> Why bother?  Do we have code that uses strlen() heavily in performance-
>> critical regions?  Does anybody?

Of course not.

Also, it takes something like -fno-builtin-strlen in CFLAGS for the
libc strlen() to ever be used.  On both amd64 and i386, gcc's builtin
strlen() uses an inline "rep scasb", which can be inferior to even the
simple C loop in the simple libc version since i386ish string instructions
tend to be inferior (e.g., on AthlonXP, "rep scas*" takes 2.5 cycles
per iteration, plus 15 cycles to start, possibly plus extra startup/finishu=
p
costs to use specific registers for it, while the simple C loop takes
1 cycle per iteration and can have startup costs much smaller than 15+
cycles (but the function call for the libc version costs about 15
cycles).  I think most strings are short, so sophisticated methods will
usually lose since they take longer to start up and have more branches.
However, the loss is usually unnoticable since it doesn't take long to
process a non-huge number of short strings.

The "optimized" asm i386 strlen() also uses "rep scasb", so it tends to be
inferior to the simple C version.  The "optimized" asm amd64 strlen()
intentionally doesn't exist, since the "rep scasb" version of it is
believed to be slower than the simple C version on all amd64 CPUs, and
more sophisticated versions would take more work to implement and might
end up being no better, and anyway the gcc builtin normally provides
the "rep scasb" "optimization, so even more work would be required to
actually usually use the sophisticated versions.

amd64 still has "optimized" asm versions of strcat(), strcmp() and
strcpy(), and actually-optimized asm version of the memcpy() family.
For memcpy(), the string instruction actually tends to be faster (1.33
cycles/iteration on AthlonXP, same on A64 IIRC, and faster on Phenom
IIRC), though it still wins mainly because it does 64 bits at a time
while the simple libc version only does 32 bits at a time (due to its
rotted code:

typedef int word;=09=09/* "word" used for optimal copy speed */

-- 32-bit words are of course far from optimal on 64-bit machines).

Another issue for the memcpy() family is gcc doesn't have a builtin for
everything.  IIRC, it doesn't have one for memcpy() since handling
overlapping copies makes the inline version too large.

> I agree that strlen() should never be used in performance-sensitive
> regions, but given we do not have assembly optimized versions of these
> functions for amd64 and almost everybody else has it, having a C version
> that gives similar performance is valuable.  Also, worldstone with -j9
> on 2x4core machine with both src/ and obj/ in tmpfs, seems to have
> small, but positive effect:
>
> Unpatched libc:
>     1400.97 real      4159.34 user      2901.08 sys
>     1396.73 real      4159.06 user      2906.16 sys
>     1380.27 real      4158.20 user      2803.22 sys
>
> Patched libc:
>     1363.29 real      4154.89 user      2749.94 sys
>     1373.96 real      4150.45 user      2830.46 sys
>     1368.62 real      4152.48 user      2838.52 sys
>
> (with 'make cleanworld' between builds)

I don't believe this improvement.  First, it is far too large to have been
caused by changing libc strlen().  Second, libc strlen() is not normally
used.

My buildworld times on a 1x2core machine with nfs-mounted src/ and ffs
obj/ src are much smaller:

       824.96 real      1294.35 user       187.09 sys

but this is for FreeBSD-~5.2.  FreeBSD and gcc have a combined bloat factor
of almost 4 since then :-(, and more cores don't help much, apparently
due to them always running into locks (the bloat factor for system time is
almost 15, and that is with nfs bloating my sys time by about a factor of
2).

Here are some old files for too-simple strlen() benchmarks:

%%%
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#=09=09"End of shell archive."
# Contents:  z z.c zi zq.S
# Wrapped by root@besplex.bde.org on Wed Jan 14 00:11:47 2009
PATH=3D/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'z' -a "${1}" !=3D "-c" ; then
   echo shar: Will not clobber existing file \"'z'\"
else
echo shar: Extracting \"'z'\" \(0 characters\)
sed "s/^X//" >'z' <<'END_OF_FILE'
END_OF_FILE
if test 0 -ne `wc -c <'z'`; then
     echo shar: \"'z'\" unpacked with wrong size!
fi
# end of 'z'
fi
if test -f 'z.c' -a "${1}" !=3D "-c" ; then
   echo shar: Will not clobber existing file \"'z.c'\"
else
echo shar: Extracting \"'z.c'\" \(418 characters\)
sed "s/^X//" >'z.c' <<'END_OF_FILE'
X#include <stdio.h>
X#include <stdlib.h>
X#include <string.h>
X
X#ifndef LEN
X#define=09LEN=095
X#endif
X
Xstatic char *foo[1000];
Xstatic volatile size_t s;
X
X#ifdef MISTRLEN
X#include "/usr/src/lib/libc/string/strlen.c"
X#endif
X
Xint
Xmain(void)
X{
X=09int i;
X
X=09for (i =3D 0; i < 10; i++) {
X=09=09foo[i] =3D malloc(LEN + 1);
X=09=09sprintf(foo[i], "%*.*s", LEN, LEN, "foo");
X=09}
X=09for (i =3D 0; i < 10000000; i++)
X=09=09s =3D strlen(foo[i % 10]);
X=09return (0);
X}
END_OF_FILE
if test 418 -ne `wc -c <'z.c'`; then
     echo shar: \"'z.c'\" unpacked with wrong size!
fi
# end of 'z.c'
fi
if test -f 'zi' -a "${1}" !=3D "-c" ; then
   echo shar: Will not clobber existing file \"'zi'\"
else
echo shar: Extracting \"'zi'\" \(209 characters\)
sed "s/^X//" >'zi' <<'END_OF_FILE'
Xfor len in 0 1 10 100 1000
Xdo
X=09echo "string length $len:"
X=09for alg in -fbuiltin -fno-builtin "-fno-builtin zq.S"
X=09do
X=09=09echo "algorithm $alg:"
X=09=09cc -o z z.c -O2 $alg -g -static -DLEN=3D$len
X=09=09time ./z
X=09done
Xdone
END_OF_FILE
if test 209 -ne `wc -c <'zi'`; then
     echo shar: \"'zi'\" unpacked with wrong size!
fi
# end of 'zi'
fi
if test -f 'zq.S' -a "${1}" !=3D "-c" ; then
   echo shar: Will not clobber existing file \"'zq.S'\"
else
echo shar: Extracting \"'zq.S'\" \(4304 characters\)
sed "s/^X//" >'zq.S' <<'END_OF_FILE'
X/*
X * Written by J.T. Conklin <jtc@acorntoolworks.com>
X * Public domain.
X */
X
X#include <machine/asm.h>
X__FBSDID("$FreeBSD$");
X
X#if 0
X=09RCSID("$NetBSD: strlen.S,v 1.3 2004/07/19 20:04:41 drochner Exp $")
X#endif
X
XENTRY(strlen)
X=09movq=09%rdi,%rax
X=09negq=09%rdi
X
X.Lalign:
X=09/* Consider unrolling loop? */
X=09testb=09$7,%al
X=09je=09.Lword_aligned
X=09cmpb=09$0,(%rax)
X=09jne=091f
X=09leaq=09(%rdi,%rax),%rax
X=09ret
X1:=09incq=09%rax
X=09jmp=09.Lalign
X
X=09/*
X=09 * There are many well known branch-free sequences which are used
X=09 * for determining whether a zero-byte is contained within a word.
X=09 * These sequences are generally much more efficent than loading
X=09 * and comparing each byte individually.
X=09 *
X=09 * The expression [1,2]:
X=09 *
X=09 * (1)  ~(((x & 0x7f....7f) + 0x7f....7f) | (x | 0x7f....7f))
X=09 *
X=09 * evaluates to a non-zero value if any of the bytes in the
X=09 * original word is zero.
X=09 *
X=09 * It also has the useful property that bytes in the result word
X=09 * that coorespond to non-zero bytes in the original word have
X=09 * the value 0x00, while bytes cooresponding to zero bytes have
X=09 * the value 0x80. This allows calculation of the first (and
X=09 * last) occurance of a zero byte within the word (useful for C's
X=09 * str* primitives) by counting the number of leading (or
X=09 * trailing) zeros and dividing the result by 8.  On machines
X=09 * without (or with slow) clz() / ctz() instructions, testing
X=09 * each byte in the result word for zero is necessary.
X=09 *
X=09 * This typically takes 4 instructions (5 on machines without
X=09 * "not-or") not including those needed to load the constant.
X=09 *
X=09 *
X=09 * The expression:
X=09 *
X=09 * (2)  ((x - 0x01....01) & ~x & 0x80....80)
X=09 *
X=09 * evaluates to a non-zero value if any of the bytes in the
X=09 * original word is zero.
X=09 *
X=09 * On little endian machines, the first byte in the result word
X=09 * that cooresponds to a zero byte in the original byte is 0x80,
X=09 * so clz() can be used as above.  On big endian machines, and
X=09 * little endian machines without (or with a slow) clz() insn,
X=09 * testing each byte in the original for zero is necessary
X=09 *
X=09 * This typically takes 3 instructions (4 on machines without
X=09 * "and with complement") not including those needed to load
X=09 * constants.
X=09 *
X=09 *
X=09 * The expression:
X=09 *
X=09 * (3)  ((x - 0x01....01) & 0x80....80)
X=09 *
X=09 * always evaluates to a non-zero value if any of the bytes in
X=09 * the original word is zero.  However, in rare cases, it also
X=09 * evaluates to a non-zero value when none of the bytes in the
X=09 * original word is zero.
X=09 *
X=09 * To account for possible false positives, each byte of the
X=09 * original word must be checked when the expression evaluates to
X=09 * a non-zero value.  However, because it is simpler than those
X=09 * presented above, code that uses it will be faster as long as
X=09 * the rate of false positives is low.
X=09 *
X=09 * This is likely, because the the false positive can only occur
X=09 * if the most siginificant bit of a byte within the word is set.
X=09 * The expression will never fail for typical 7-bit ASCII strings.
X=09 *
X=09 * This typically takes 2 instructions not including those needed
X=09 * to load constants.
X=09 *
X=09 *
X=09 * [1] Henry S. Warren Jr., "Hacker's Delight", Addison-Westley 2003
X=09 *
X=09 * [2] International Business Machines, "The PowerPC Compiler Writer's
X=09 *     Guide", Warthman Associates, 1996
X=09 */
X
X=09.align 4
X.Lword_aligned:
X=09movabsq=09$0x0101010101010101,%r8
X=09movabsq=09$0x8080808080808080,%r9
X.Lloop:
X=09movq=09(%rax),%rdx
X=09addq=09$8,%rax
X=09subq=09%r8,%rdx
X=09testq=09%r9,%rdx
X=09je=09.Lloop
X
X=09/*
X=09 * In rare cases, the above loop may exit prematurely. We must
X=09 * return to the loop if none of the bytes in the word equal 0.
X=09 */
X=09cmpb=09$0,-8(%rax)=09=09/* 1st byte =3D=3D 0? */
X=09je=09.Lsub8
X=09cmpb=09$0,-7(%rax)=09=09/* 2nd byte =3D=3D 0? */
X=09je=09.Lsub7
X=09cmpb=09$0,-6(%rax)=09=09/* 3rd byte =3D=3D 0? */
X=09je=09.Lsub6
X=09cmpb=09$0,-5(%rax)=09=09/* 4th byte =3D=3D 0? */
X=09je=09.Lsub5
X=09cmpb=09$0,-4(%rax)=09=09/* 5th byte =3D=3D 0? */
X=09je=09.Lsub4
X=09cmpb=09$0,-3(%rax)=09=09/* 6th byte =3D=3D 0? */
X=09je=09.Lsub3
X=09cmpb=09$0,-2(%rax)=09=09/* 7th byte =3D=3D 0? */
X=09je=09.Lsub2
X=09cmpb=09$0,-1(%rax)=09=09/* 8th byte =3D=3D 0? */
X=09jne=09.Lloop
X
X.Lsub1:
X=09leaq=09-1(%rdi,%rax),%rax
X=09ret
X.Lsub2:
X=09leaq=09-2(%rdi,%rax),%rax
X=09ret
X.Lsub3:
X=09leaq=09-3(%rdi,%rax),%rax
X=09ret
X.Lsub4:
X=09leaq=09-4(%rdi,%rax),%rax
X=09ret
X.Lsub5:
X=09leaq=09-5(%rdi,%rax),%rax
X=09ret
X.Lsub6:
X=09leaq=09-6(%rdi,%rax),%rax
X=09ret
X.Lsub7:
X=09leaq=09-7(%rdi,%rax),%rax
X=09ret
X.Lsub8:
X=09leaq=09-8(%rdi,%rax),%rax
X=09ret
END_OF_FILE
if test 4304 -ne `wc -c <'zq.S'`; then
     echo shar: \"'zq.S'\" unpacked with wrong size!
fi
# end of 'zq.S'
fi
echo shar: End of shell archive.
exit 0
%%%

Here
- z.c is a simple C program that calls strlen()
- zi is a shell script that tries various algrithms (sorry, no makefile)
- zq.S is the NetBSD amd64 asm version which uses the 0x8080 trick
   and not "rep scasb"
The algorithms are:
- -fbuiltin: try to use builtin strlen()
- -fno-builtin: use default libc strlen(), except if MISTRLEN is defined,
   use MI libc strlen().  Variation to use MISTRLEN is not supported by zi.
- -fno-builtin zq.S: use NetBSD amd64 strlen()

I forget the results (don't have an amd64 machine handy for re-testing).

Bruce
--0-471949903-1231853611=:31712--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20090113232658.E31712>