From owner-svn-src-stable@FreeBSD.ORG  Thu May  9 12:20:19 2013
Return-Path: <owner-svn-src-stable@FreeBSD.ORG>
Delivered-To: svn-src-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id D046A3C3;
 Thu,  9 May 2013 12:20:19 +0000 (UTC) (envelope-from jilles@stack.nl)
Received: from mx1.stack.nl (unknown [IPv6:2001:610:1108:5012::107])
 by mx1.freebsd.org (Postfix) with ESMTP id 6294F2FC;
 Thu,  9 May 2013 12:20:19 +0000 (UTC)
Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131])
 by mx1.stack.nl (Postfix) with ESMTP id 245B812013A;
 Thu,  9 May 2013 14:20:05 +0200 (CEST)
Received: by snail.stack.nl (Postfix, from userid 1677)
 id ECEC128493; Thu,  9 May 2013 14:20:01 +0200 (CEST)
Date: Thu, 9 May 2013 14:20:01 +0200
From: Jilles Tjoelker <jilles@stack.nl>
To: Sergey Kandaurov <pluknet@freebsd.org>
Subject: Re: svn commit: r250215 - stable/9/lib/libc/locale
Message-ID: <20130509122001.GB48322@stack.nl>
References: <201305031552.r43FqiPN024580@svn.freebsd.org>
 <5183E899.4000503@freebsd.org>
 <CAE-mSO+B_p_HCbKwSO-rJ+dforcPEfThmOxy+Ki_1e9zPn3q_w@mail.gmail.com>
 <20130503195540.GA52657@stack.nl>
 <CAE-mSOLT6EdaYQheNka++NPZRbUFM=kXv6i9k=uRiyQTy1JuuA@mail.gmail.com>
 <5184ED7E.3040703@freebsd.org>
 <CAE-mSO+JOTcfx1vDbiux8LpikZV0J1ti2HJ0ypCsotfeJ4qKzg@mail.gmail.com>
 <51851969.6020802@freebsd.org>
 <CAE-mSOKPjx-tF5gtXXtNUHYraPL-Rd1FPxq5ECw8Nbup=jahng@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAE-mSOKPjx-tF5gtXXtNUHYraPL-Rd1FPxq5ECw8Nbup=jahng@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: svn-src-stable@freebsd.org, svn-src-all@freebsd.org,
 src-committers@freebsd.org, Andrey Chernov <ache@freebsd.org>,
 svn-src-stable-9@freebsd.org
X-BeenThere: svn-src-stable@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: SVN commit messages for all the -stable branches of the src tree
 <svn-src-stable.freebsd.org>
X-List-Received-Date: Thu, 09 May 2013 12:20:19 -0000

On Sat, May 04, 2013 at 08:42:18PM +0400, Sergey Kandaurov wrote:
> On 4 May 2013 18:21, Andrey Chernov <ache@freebsd.org> wrote:
> > On 04.05.2013 16:03, Sergey Kandaurov wrote:
> >>> BTW, I haven't run tests or looked at the asm code to be sure, but it
> >>> seems property[0] == p[0] is unneeded because almost every compiler
> >>> tries to inline strcmp().

> >> Doesn't seem so (in-lining), see below.

> > Yes, the system's GNU cc doesn't inline strcmp() but does inline memcmp():
> >  repz
> >  cmpsb
> > I don't have clang nearby right now to test what it does.

> I've checked gcc46 and clang 3.2, and they behave similarly (poorly).
> It's worth noting that the inlined memcmp didn't help performance
> relative to strcmp().

> Note 2: it's surprising that only base gcc inlined memcmp. This explains
> the difference between base gcc and {gcc46, clang} in the table below.

> 1 - base gcc 4.2
> 2 - gcc46
> 3 - base clang 3.2

> a - if (property[0] == p[0] && strcmp(property, p) == 0)
> b - if (property[0] == p[0] && memcmp(property, p, *len2) == 0)
> c - if (memcmp(property, p, *len2) == 0)

I also tried gperf and it is faster and more consistent; however, gperf
generates a 256-byte table in addition to the expected hash table and I
consider that too big.

> Time spent on 2097152 wctype() calls for each wctype property

>            1a     2a     3a     1b     2b     3b     1c     2c     3c
> alnum      0.034  0.036  0.034  0.049  0.071  0.073  0.046  0.068  0.069
> alpha      0.045  0.049  0.046  0.111  0.156  0.158  0.107  0.153  0.154
> blank      0.037  0.041  0.038  0.053  0.075  0.079  0.153  0.224  0.223
> cntrl      0.039  0.044  0.042  0.058  0.078  0.081  0.206  0.300  0.301
> digit      0.039  0.044  0.043  0.059  0.080  0.085  0.259  0.378  0.378
> graph      0.043  0.049  0.050  0.061  0.082  0.087  0.313  0.455  0.455
> lower      0.044  0.049  0.051  0.062  0.085  0.090  0.365  0.532  0.533
> print      0.048  0.054  0.059  0.067  0.088  0.092  0.419  0.610  0.610
> punct      0.060  0.067  0.103  0.127  0.183  0.211  0.477  0.692  0.692
> space      0.053  0.059  0.067  0.072  0.092  0.097  0.525  0.764  0.765
> upper      0.054  0.059  0.068  0.074  0.094  0.100  0.578  0.841  0.842
> xdigit     0.060  0.066  0.077  0.079  0.099  0.106  0.635  0.922  0.985
> ideogram   0.068  0.074  0.084  0.087  0.089  0.094  0.695  0.986  0.985
> special    0.098  0.104  0.113  0.169  0.210  0.212  0.753  1.116  1.118
> phonogram  0.136  0.156  0.187  0.240  0.285  0.325  0.815  1.181  1.183
> rune       0.064  0.070  0.087  0.099  0.104  0.113  0.842  1.293  1.283
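
For reference, variant a from the quoted list can be sketched roughly as
follows. This is only an illustration: the props[] array and the name
lookup_a() are hypothetical, the property order is assumed to match the
rows of the table, and the real code may store the names differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical property list; order assumed to follow the table rows. */
static const char *const props[] = {
    "alnum", "alpha", "blank", "cntrl", "digit", "graph", "lower", "print",
    "punct", "space", "upper", "xdigit", "ideogram", "special", "phonogram",
    "rune"
};

/* Returns the index of property in props[], or -1 if it is not listed. */
int
lookup_a(const char *property)
{
    size_t i;

    for (i = 0; i < sizeof(props) / sizeof(props[0]); i++) {
        const char *p = props[i];

        /* Variant a: cheap first-byte test before the full strcmp(). */
        if (property[0] == p[0] && strcmp(property, p) == 0)
            return (int)i;
    }
    return -1;
}
```

With a linear scan like this, later entries pay for every comparison that
came before them, which fits the way the 1c/2c/3c columns grow down the
table once the first-byte test is dropped.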

I think newer compilers don't use REPZ CMPSB because they know it is
very slow. Intel's optimization manuals discourage all string instructions
except repeated MOVS and STOS. For example, agner.org's tables say REPZ
CMPSB takes 80+2n cycles on an Intel Sandy Bridge. That is slow enough
that a simple loop can take a few branch mispredictions and still come
out ahead. On other CPUs it may not be as slow, but it is generally still
slower than a simple loop (and you can do better than a simple loop).
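
By "simple loop" I mean something like the sketch below. The name
simple_memcmp() is made up for illustration; it follows the memcmp()
contract but is not taken from any libc.

```c
#include <stddef.h>

/* Byte-at-a-time comparison with the same contract as memcmp(). */
int
simple_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;

    while (n-- > 0) {
        /* Stop at the first differing byte and report its sign. */
        if (*pa != *pb)
            return *pa - *pb;
        pa++;
        pb++;
    }
    return 0;
}
```

There is no fixed startup cost here, so for inputs as short as the wctype
property names the loop has little to amortize.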

The only reason to use REPZ CMPSB is its small size: it is itself
smaller than a call and requires fewer instructions to save and restore
registers, set up parameters, etc.

The i386 and amd64 versions of memcmp() use REPZ CMPSL or CMPSQ followed
by REPZ CMPSB for the remainder. This cuts down on the 2n part, but the
80-cycle setup cost is paid twice.
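
As a back-of-the-envelope model of that trade-off, the sketch below
compares the two strategies for the amd64 case. It assumes, purely for
illustration, that each REPZ instruction costs 80 cycles plus 2 per
iteration (the Sandy Bridge figure quoted above applied to CMPSQ as
well); the function names are hypothetical and real timings will differ.

```c
#include <stddef.h>

/* REPZ CMPSB alone: 80 cycles of setup plus 2 per byte compared. */
unsigned long
cycles_cmpsb(size_t n)
{
    return 80 + 2 * (unsigned long)n;
}

/*
 * REPZ CMPSQ over the whole 8-byte words, then REPZ CMPSB over the
 * remaining bytes. Assumption: both instructions cost
 * 80 + 2 * iterations, so the setup cost is counted twice.
 */
unsigned long
cycles_cmpsq_then_cmpsb(size_t n)
{
    unsigned long words = (unsigned long)(n / 8);
    unsigned long rest = (unsigned long)(n % 8);

    return (80 + 2 * words) + (80 + 2 * rest);
}
```

Under this model a 5-byte comparison costs 90 cycles with REPZ CMPSB alone
and 170 with the two-instruction form, while a 64-byte comparison costs 208
and 176 respectively, so the split only pays off for longer inputs.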

I think architecture-specific memcmp() for i386 and amd64 can still be
beneficial because of the fast unaligned access offered by these CPUs,
which allows comparison of 4 or 8 bytes at a time. SSE2 allows
comparison of 16 bytes at a time but is somewhat harder: not all i386
CPUs support SSE2, unaligned access is slow on some older CPUs, and the
code must be written in assembly so that it uses only %xmm8-%xmm15 and
rtld does not trash function parameters (or rtld needs to use non-SSE2
code).
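
The 8-bytes-at-a-time idea can be sketched in portable C as below. This
is an illustration only, not a proposed libc implementation: the name
word_memcmp() is hypothetical, and it relies on the compiler turning the
fixed-size memcpy() calls into plain unaligned loads, which is typical
on amd64 but not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same contract as memcmp(), comparing 8 bytes per step where possible. */
int
word_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;

    while (n >= 8) {
        uint64_t wa, wb;

        /* memcpy() expresses an unaligned 8-byte load portably. */
        memcpy(&wa, pa, 8);
        memcpy(&wb, pb, 8);
        if (wa != wb)
            break;      /* the byte loop below finds the difference */
        pa += 8;
        pb += 8;
        n -= 8;
    }
    /* Remaining bytes, or the word that differed: one byte at a time. */
    while (n-- > 0) {
        if (*pa != *pb)
            return *pa - *pb;
        pa++;
        pb++;
    }
    return 0;
}
```

Falling back to the byte loop on a mismatch keeps the result independent
of byte order, at the cost of rescanning at most 8 bytes.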

-- 
Jilles Tjoelker