Date: Thu, 9 May 2013 14:20:01 +0200
From: Jilles Tjoelker
To: Sergey Kandaurov
Cc: svn-src-stable@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Andrey Chernov, svn-src-stable-9@freebsd.org
Subject: Re: svn commit: r250215 - stable/9/lib/libc/locale

On Sat, May 04, 2013 at 08:42:18PM +0400, Sergey Kandaurov wrote:
> On 4 May 2013 18:21, Andrey Chernov wrote:
> > On 04.05.2013 16:03, Sergey Kandaurov wrote:
> >>> BTW, I don't run tests and look in asm code for sure, but it seems
> >>> property[0] == p[0] is unneeded because almost every compiler tries to
> >>> inline strcmp().

> >> Doesn't seem so (in-lining), see below.

> > Yes, system's GNU cc don't inline strcmp() but inlines memcmp():
> > repz
> > cmpsb
> > I don't have clang nearby right now to test what it does.

> I've checked gcc46 and clang3.2, and they behave similarly (poor).
> It's worth to note that inlined memcmp didn't help with performance
> relative to strcmp().
> note2 - it's surprising that only base gcc inlined memcmp. This explains
> the difference between base gcc and {gcc46, clang} in the table below.

> 1 - base gcc 4.2
> 2 - gcc46
> 3 - base clang 3.2

> a - if (property[0] == p[0] && strcmp(property, p) == 0)
> b - if (property[0] == p[0] && memcmp(property, p, *len2) == 0)
> c - if (memcmp(property, p, *len2) == 0)

I also tried gperf and it is faster and more consistent; however, gperf
generates a 256-byte table in addition to the expected hash table and I
consider that too big.

> Time spend for 2097152 wctype() calls for each of wctype property
>            1a    2a    3a    1b    2b    3b    1c    2c    3c
> alnum     0.034 0.036 0.034 0.049 0.071 0.073 0.046 0.068 0.069
> alpha     0.045 0.049 0.046 0.111 0.156 0.158 0.107 0.153 0.154
> blank     0.037 0.041 0.038 0.053 0.075 0.079 0.153 0.224 0.223
> cntrl     0.039 0.044 0.042 0.058 0.078 0.081 0.206 0.300 0.301
> digit     0.039 0.044 0.043 0.059 0.080 0.085 0.259 0.378 0.378
> graph     0.043 0.049 0.050 0.061 0.082 0.087 0.313 0.455 0.455
> lower     0.044 0.049 0.051 0.062 0.085 0.090 0.365 0.532 0.533
> print     0.048 0.054 0.059 0.067 0.088 0.092 0.419 0.610 0.610
> punct     0.060 0.067 0.103 0.127 0.183 0.211 0.477 0.692 0.692
> space     0.053 0.059 0.067 0.072 0.092 0.097 0.525 0.764 0.765
> upper     0.054 0.059 0.068 0.074 0.094 0.100 0.578 0.841 0.842
> xdigit    0.060 0.066 0.077 0.079 0.099 0.106 0.635 0.922 0.985
> ideogram  0.068 0.074 0.084 0.087 0.089 0.094 0.695 0.986 0.985
> special   0.098 0.104 0.113 0.169 0.210 0.212 0.753 1.116 1.118
> phonogram 0.136 0.156 0.187 0.240 0.285 0.325 0.815 1.181 1.183
> rune      0.064 0.070 0.087 0.099 0.104 0.113 0.842 1.293 1.283

I think newer compilers don't use REPZ CMPSB because they know it is
very slow. Intel optimization manuals discourage all string instructions
except repeated MOVS and STOS. For example, agner.org's tables say REPZ
CMPSB takes 80+2n cycles on an Intel Sandy Bridge. That's so slow you
can take a few branch mispredictions from a simple loop and still come
out ahead. On other CPUs it may not be as slow but it is generally still
slower than a simple loop (and you can do better than a simple loop).

The only reason to use REPZ CMPSB is its small size: it is itself
smaller than a call and requires fewer instructions to save and restore
registers, set up parameters, etc.

The i386 and amd64 versions of memcmp() use REPZ CMPSL or CMPSQ followed
by REPZ CMPSB for the remainder. This cuts down on the 2n part but the
80-cycle setup time is paid twice.

I think architecture-specific memcmp() for i386 and amd64 can still be
beneficial because of the fast unaligned access offered by these CPUs,
which allows comparison of 4 or 8 bytes at a time. SSE2 allows
comparison of 16 bytes at a time but is somewhat harder: not all i386
CPUs support SSE2, unaligned access is slow on some older CPUs, and it
requires assembly so that it uses only %xmm8-%xmm15 and rtld does not
trash function parameters (or rtld needs to use non-SSE2 code).

-- 
Jilles Tjoelker
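
P.S. As a rough illustration of the "8 bytes at a time" idea above, here
is a minimal portable C sketch. It is not the libc code and has not been
benchmarked; the name memcmp_words is hypothetical. It assumes the
compiler turns a fixed-size memcpy() into a plain unaligned load, which
is the case that makes this attractive on i386 and amd64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical word-at-a-time comparison, for illustration only.
 * Returns <0, 0 or >0 like memcmp().
 */
static int
memcmp_words(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;
	uint64_t w1, w2;

	/* Compare 8 bytes per iteration; memcpy() expresses an unaligned load. */
	while (n >= sizeof(w1)) {
		memcpy(&w1, p1, sizeof(w1));
		memcpy(&w2, p2, sizeof(w2));
		if (w1 != w2)
			break;	/* the byte loop below locates the difference */
		p1 += sizeof(w1);
		p2 += sizeof(w2);
		n -= sizeof(w1);
	}

	/* Remaining tail, or the 8-byte word that differed. */
	for (; n > 0; p1++, p2++, n--) {
		if (*p1 != *p2)
			return (*p1 - *p2);
	}
	return (0);
}
```

Falling back to a byte loop on mismatch avoids any dependence on byte
order, at the cost of rescanning at most 8 bytes.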