Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 11 Feb 2015 02:37:05 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Jung-uk Kim <jkim@freebsd.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, John Baldwin <jhb@freebsd.org>
Subject:   Re: svn commit: r278474 - head/sys/sys
Message-ID:  <20150211014516.N1511@besplex.bde.org>
In-Reply-To: <54D92CE8.1030803@FreeBSD.org>
References:  <201502092103.t19L3OAn013792@svn.freebsd.org> <1698688.fEq0HxqPxg@ralph.baldwin.cx> <54D92CE8.1030803@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 9 Feb 2015, Jung-uk Kim wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> On 02/09/2015 16:08, John Baldwin wrote:
>> On Monday, February 09, 2015 09:03:24 PM John Baldwin wrote:
>>> Author: jhb Date: Mon Feb  9 21:03:23 2015 New Revision: 278474
>>> URL: https://svnweb.freebsd.org/changeset/base/278474
>>>
>>> Log: Use __builtin_popcnt() to implement a BIT_COUNT() operation
>>> for bitsets and use this to implement CPU_COUNT() to count the
>>> number of CPUs in a cpuset.
>>>
>>> MFC after:	2 weeks
>>
>> Yes, __builtin_popcnt() works with GCC 4.2.  It should also allow
>> the compiler to DTRT in userland uses of this if -msse4.2 is
>> enabled.
>
> Back in 2012, when I submitted a similar patch, bde noted
> __builtin_popcount*() cannot be used with GCC 4.2 for *kernel* because
> it emits a library call.

Also, the gcc library uses a fairly slow implementation (a 256-byte
lookup table), even in gcc-4.8.  On modern CPUs, this is probably
slower than the bit-magic algorithm in <sys/systm.h>.

In the usual case where the arch doesn't support popcount in hardware,
clang generates inline code that uses much the same bit-magic algorithm
as in <sys/systm.h>.  It also uses this for its libgcc.a.

gcc48 only has library support for 32-bit and 64-bit integers on 64-bit
arches.  clang also supports a "ti2" (128-bit?) integer type in its
libgcc.a.  I don't know how to test this.  (__int128 works for declaring
128-bit variables, but the only popcount builtins according to the
available documentation (strings clang) are *popcount(), *popcountl() and
*popcountll(), and ll is the same as l on amd64.

The implementation in <sys/systm.h> is a silly optimization.  It also
supports 16-bit integers but not 64-bit ones.  If optimizing popcount
was important, then it would be done in libc first and would support
64-bit integers.  It would use the builtin iff it is available and
doesn't reduce to a library call that is slower than the bit-magic
algorithm.  This is not easy, since compilers don't even tell you if
their builtins reduce to library calls, and whether they do depends
on the arch (*).  Older compilers also don't tell you if their
builtins exist.

Some builtins have a type-generic API, but unfortunately this not one.
This means that the builtin can barely do as well as the bit-magic inline
function for 16-bit integers since you cannot even write the name of
the builtin for this case (you have to promote to 32 bits; popcount on
this is probably faster than bit-magic, but bit-magic has more bits to
look at).  Similarly for 128-bit intergers.

(*) Since generic amd64 and i386 have no popcount instruction in hardware,
using builtin popcount rarely uses the hardware instruction (it takes
special -march to get it, and the resulting binaries don't run on generic
CPUs).  Thus using the builtin works worse than using the old inline
function in most cases.  Except, the old inline function is only
implemented in the kernel, and isn't implemented for 64-bit integers.

gcc-4.8 generates the hardware popcount if the arch supports it.  Only
its library popcounts are slower than clang's.  gcc-4.2 presumably
doesn't generate the hardware popcount, since it doesn't have a -march
for newer CPUs that have it.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20150211014516.N1511>