Date:      Thu, 12 Jul 2007 16:02:47 -0500 (CDT)
From:      "Sean C. Farley" <scf@FreeBSD.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-arch@FreeBSD.org
Subject:   Re: Assembly string functions in i386 libc
Message-ID:  <20070712142024.Q8789@thor.farley.org>
In-Reply-To: <20070712211245.M8625@besplex.bde.org>
References:  <20070711134721.D2385@thor.farley.org> <20070712191616.A4682@delplex.bde.org> <20070712211245.M8625@besplex.bde.org>

On Thu, 12 Jul 2007, Bruce Evans wrote:

> On Thu, 12 Jul 2007, Bruce Evans wrote:
>
>> On Wed, 11 Jul 2007, Sean C. Farley wrote:
>> 
>>> While looking at increasing the speed of strlen(), I noticed that on
>>> i386 platforms (PIII, P4 and Athlon XP) the performance is abysmal
>>> in libc compared to the version I was writing.  After more testing,
>>> I found it was only the assembly version that is really slow.  The C
>>> version is fairly quick.  Is there a need to continue to use the
>>> assembly versions of string functions on i386?  Does it mainly help
>>> slower systems such as those with i386 or i486 CPU's?
>> 
>> I think you are mistaken about the asm version being slow.  In my
>> tests ...
>
> Partly.
>
>>> I have the results from my P4 (Id = 0xf24 Stepping = 4) system and
>>> the test program here[1].  strlen.tar.bz2 is the archive of it for
>>> anyone's testing.  In the strlen/results subdirectory, there are the
>>> results for strings of increasing lengths.
>> 
>> Sorry, I didn't look at this.  I just wrote a quick re-test and ran
>> it
>
> Now I've looked at it.  I think it is not testing strlen() at all,
> except for the libc case, because __pure prevents more than 1 call to
> strlen().  (The existence of __pure is also a bug.  __pure was the
> FreeBSD spelling of the __const__ attribute in gcc-1.  It was removed
> when special support for gcc-1 was dropped, and should not have been
> recycled.)  __pure is a syntax error in the old version of FreeBSD
> that I tested on.  I first tried __pure2, which is the FreeBSD
> spelling of the __const__ attribute in gcc-2.  I think it is weaker
> than the __pure__ attribute in gcc-3.

From what I could find, strlen() should not have the __const__ (__pure2)
attribute, since it takes a pointer and reads the memory behind it, but
__pure__ (__pure) should work.  Are you saying that __pure meant
__const__ in gcc-1 but now means __pure__ for gcc-2.96 and above?  Is
that redefinition of __pure what you are calling a bug?
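
To make the distinction concrete, here is a minimal sketch using the raw
gcc attribute spellings rather than the FreeBSD macros.  The names
my_strlen, square and sum_lengths are made up for illustration and are
not from the test program.  With the pure attribute, the compiler is
allowed to call my_strlen() once and reuse the result inside the loop,
which is the effect Bruce describes:

```c
#include <assert.h>
#include <stddef.h>

/*
 * pure: the result depends only on the arguments and on memory the
 * function reads (here, the bytes the pointer refers to).  This is
 * the attribute that fits a strlen-like function.
 */
__attribute__((pure)) size_t
my_strlen(const char *s)
{
	const char *p = s;

	while (*p != '\0')
		p++;
	return ((size_t)(p - s));
}

/*
 * const: the result depends on the argument values alone; the
 * function must not read memory through a pointer argument.
 */
__attribute__((const)) int
square(int x)
{
	return (x * x);
}

/*
 * Because my_strlen() is declared pure and s is not modified in the
 * loop, an optimizing compiler may evaluate my_strlen(s) a single
 * time instead of once per iteration.  The result is the same either
 * way; only the number of calls differs.
 */
size_t
sum_lengths(const char *s, int iterations)
{
	size_t total = 0;
	int i;

	assert(iterations >= 0);
	for (i = 0; i < iterations; i++)
		total += my_strlen(s);
	return (total);
}
```

Whether the call is actually hoisted depends on the compiler version and
optimization level, so a benchmark written this way may or may not be
timing the function it names.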

> After removing __pure* and adding -static -g to CFLAGS, with
> gcc-3.3.3:
>
> On an old Celeron (400MHz) (all P2's probably behave like this):
>
> %%%
> libcstrlen:	time spent executing strlen(string) = 64:	7.786868
> basestrlen:	time spent executing strlen(string) = 64:	3.816736
> strlen:	time spent executing strlen(string) = 64:	3.364313
> strlen2:	time spent executing strlen(string) = 64:	2.662973
> %%%
>
> rep scasb is apparently very slow on P2's.
>
> On an A64 in i386 mode:
>
> %%%
> libcstrlen:	time spent executing strlen(string) = 64:	0.709657
> basestrlen:	time spent executing strlen(string) = 64:	0.691397
> strlen:	time spent executing strlen(string) = 64:	0.527339
> strlen2:	time spent executing strlen(string) = 64:	0.441090
> %%%
>
> Now rep scasb is only slightly slower than the simple C loop (since
> all small loops take 2 cycles on AXP and A64...).  strlen and strlen2
> are marginally faster since their loops do more.
>
> basestrlen is fastest for lengths <= 5 on the Celeron.
>
> basestrlen is fastest for lengths <= 9 on the A64.

I removed __pure from main.c and added -static -g.

Athlon XP 2100 (1.72 GHz):
libcstrlen:     time spent executing strlen(string) = 64:       0.994755
asmstrlen:      time spent executing strlen(string) = 64:       0.989012
basestrlen:     time spent executing strlen(string) = 64:       0.879722
strlen:         time spent executing strlen(string) = 64:       0.626727
strlen2:        time spent executing strlen(string) = 64:       0.587162

P4 1.6 GHz:
libcstrlen:     time spent executing strlen(string) = 64:       2.412558
asmstrlen:      time spent executing strlen(string) = 64:       2.413904
basestrlen:     time spent executing strlen(string) = 64:       1.049927
strlen:         time spent executing strlen(string) = 64:       0.543575
strlen2:        time spent executing strlen(string) = 64:       0.547015

PIII 450MHz:
libcstrlen:     time spent executing strlen(string) = 64:       6.976066
asmstrlen:      time spent executing strlen(string) = 64:       6.974106
basestrlen:     time spent executing strlen(string) = 64:       3.464854
strlen:         time spent executing strlen(string) = 64:       2.541872
strlen2:        time spent executing strlen(string) = 64:       2.339469

The Athlon XP did much better with the assembly version than either
Intel CPU for me.  For all three CPUs, using various string lengths from
1 to 256, the C versions always beat the assembly version, although the
assembly version came somewhat close to basestrlen for lengths of 9 to
32 bytes.
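
For anyone who wants to repeat this kind of measurement, below is a
rough sketch of a timing loop that does not rely on removing attributes.
It is not the code from strlen.tar.bz2; strlen_fn, simple_strlen and
time_strlen are hypothetical names.  The idea is to call through a
volatile function pointer so that the compiler has to perform a real
call on every iteration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef size_t (*strlen_fn)(const char *);

/* Hypothetical stand-in for a simple byte-at-a-time C loop. */
size_t
simple_strlen(const char *s)
{
	const char *p = s;

	while (*p != '\0')
		p++;
	return ((size_t)(p - s));
}

/*
 * Call fn(s) `iterations` times and return the CPU time used, in
 * seconds.  The sum of the returned lengths is stored in *total so
 * that the results are used and can be checked by the caller.
 */
double
time_strlen(strlen_fn fn, const char *s, long iterations, size_t *total)
{
	/*
	 * The pointer itself is volatile, so it is reloaded on every
	 * iteration and the calls cannot be merged or hoisted.
	 */
	strlen_fn volatile call = fn;
	size_t sum = 0;
	clock_t start, end;
	long i;

	assert(fn != NULL && s != NULL && total != NULL);
	start = clock();
	for (i = 0; i < iterations; i++)
		sum += call(s);
	end = clock();
	*total = sum;
	return ((double)(end - start) / CLOCKS_PER_SEC);
}
```

Passing strlen or simple_strlen as fn with a 64-byte string gives the
same shape of test as above.  An optimizing compiler may still rewrite
the body of a simple loop like simple_strlen(), so it is worth checking
the generated assembly before trusting the numbers.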

Even if this does not show that the assembly version should be replaced,
I find this performance testing interesting.  I learned something new.

Sean
-- 
scf@FreeBSD.org


