FreeBSD Mail Archives

Date:      Thu, 12 Jul 2007 20:37:33 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        "Sean C. Farley" <scf@freebsd.org>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: Assembly string functions in i386 libc
Message-ID:  <20070712191616.A4682@delplex.bde.org>
In-Reply-To: <20070711134721.D2385@thor.farley.org>
References:  <20070711134721.D2385@thor.farley.org>

On Wed, 11 Jul 2007, Sean C. Farley wrote:

> While looking at increasing the speed of strlen(), I noticed that on
> i386 platforms (PIII, P4 and Athlon XP) the performance is abysmal in
> libc compared to the version I was writing.  After more testing, I found
> it was only the assembly version that is really slow.  The C version is
> fairly quick.  Is there a need to continue to use the assembly versions
> of string functions on i386?  Does it mainly help slower systems such as
> those with i386 or i486 CPU's?

I think you are mistaken about the asm version being slow.  In my tests
(mainly on P4 and AXP), all versions are just mediocre, with the
asm version usually slightly faster.  It is hard to handle all cases
efficiently, so generic string routines tend to be mediocre (even if
they are bloated with special handling for all cases, the bloat tends
to make them mediocre, especially for the usual (?) case of short
strings).

Anyway, the asm version is almost never used on amd64 or i386.  gcc's
builtin strlen is used at optimization levels >= 1 unless you do something
to turn it off.  gcc generates essentially the same mediocre code as
the i386 asm version (repnz scasb, with not much other overhead except
for the usual overhead for string instructions).

Perhaps I misremember the asm version being faster.  All string
instructions except movs and stos are quite slow.  On AXP they take:

     rep movs            24 + 4/3*count
     rep stos            24 + 1*count
     rep scas            15 + 5/2*count
     others are worse

I think the C loop for strlen() takes N + 2*count on AXP.  On A64 in i386
mode it does take N+2.0*count, while the builtin and the library (asm
rep scasb) both take N+2.1*count.  The speed of the C version depends on
lots of pipelines which AXP and A64 have (not sure about P4).  The
critical loop is:

1:
 	incl	%eax
 	cmpb	$0,(%eax)
 	jne	1b

The cmpb in this takes 2 cycles, but the loop overhead takes no more
due to pipelining.  (On AXP and A64, all small loops take 2 cycles.
You can add a one or two independent integer instructions in the loop
without slowing it down, since up to 6 integer micro-ops can be executed
in 2 cycles (2 cycles times 3 integer pipelines), and the above takes
4 or 5.  memcmp() is more interesting than strlen() since it takes 5,
6 or 7.)

The above loop is far from optimal.  It could be unrolled, but this
would be just a pessimization on AXP and A64, since the loop overhead
is already null.  It could use a better algorithm involving comparing
4 bytes at a time on i386 and 8 bytes on amd64.  glibc (tege) implemented
such algorithms almost 20 years ago.  It is unclear if they are actually
better, since they have more overhead outside of the loop.  I suspect
that the C version seems to be much better in your benchmarh because
you only tested short strings.  For short strings, the setup and
finalization overhead (or more perhaps main memory overhead) dominates
and both i386 string instructions and sophisticated algorithms lose
because theor overheads are higher than for the simple C version.

As to slower systems: versions using the string instructions are much
better on older systems, because the old systems only have 1 slow
pipeline instead of 2 or 3 fast pipelines, and they have slow branches
and no branch cache; however they had string instructions only slightly
larger (or even smaller) parameters than the above.

libc/i386/string also has micro-optimizations for original i386's.
libc/amd64/string is better, mainly by being almost empty, but it is
still too muuch like libc/i386/string, having micro-optimizations for
original i386's.  E.g., it does things like using rep movsb in bcopy.S
to copy the last 0-7 bytes without checking if the count is nonzero.
This may have been a good optimization for original i386's, but it is
one that the axp and amd64 optimization manuals say not to do.  No one
notices because it is only a minor pessimization.  The kernel i386
bcopy has somehow always been optimized for AXP's and not for original
i386's.

> I have the results from my P4 (Id = 0xf24 Stepping = 4) system and the
> test program here[1].  strlen.tar.bz2 is the archive of it for anyone's
> testing.  In the strlen/results subdirectory, there are the results for
> strings of increasing lengths.

Sorry, I didn't look at this.  I just wrote a quick re-test and ran it
on an A64 in i386 mode.  In a previous test on P4, I tried to benchmark
mainly branch overhead because the loop overhead is uninteresting ("all
loops take 2 cycles").  I tried to randomize string lengths in a way
the would defeat branch caches in the same way as average usage.  The
results were inconclusive.  I don't know what average usage is, but
I suspect average usage is to rarely usage strlen() so its efficiency
doesn't matter %-).

Bruce

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20070712191616.A4682>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation