Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 21 Mar 2015 13:11:50 -0400
From:      John Baldwin <jhb@FreeBSD.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r280279 - head/sys/sys
Message-ID:  <550DA656.5060004@FreeBSD.org>
In-Reply-To: <20150321163536.GK2379@kib.kiev.ua>
References:  <201503201027.t2KAR6Ze053047@svn.freebsd.org> <20150320130216.GS2379@kib.kiev.ua> <550D9699.7070703@FreeBSD.org> <20150321163536.GK2379@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On 3/21/15 12:35 PM, Konstantin Belousov wrote:
> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>> Author: jhb
>>>> Date: Fri Mar 20 10:27:06 2015
>>>> New Revision: 280279
>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>
>>>> Log:
>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
>>>>   and create a "hidden" API that can be used in other system headers without
>>>>   adding namespace pollution.
>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
>>>>     to software implementations.
>>> Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
>>> on machines back to SandyBridge, but not on Nehalems.
>>> HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
>>> delayed until previous non-dependent instructions have executed.
>>>
>>> Jilles noted that gcc head and 4.9.2 already provides a workaround by
>>> xoring the dst register.  I have some patch for amd64 pmap, see the end
>>> of the message.
>>
>> No, I was not aware, but I think it's hard to fix this anywhere but the
>> compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
>> uses POPCOUNT for this function from ACPI-CA:
>>
>> static UINT8
>> AcpiRsCountSetBits (
>>     UINT16                  BitField)
>> {
>>     UINT8                   BitsSet;
>>
>>
>>     ACPI_FUNCTION_ENTRY ();
>>
>>
>>     for (BitsSet = 0; BitField; BitsSet++)
>>     {
>>         /* Zero the least significant bit that is set */
>>
>>         BitField &= (UINT16) (BitField - 1);
>>     }
>>
>>     return (BitsSet);
>> }
>>
>> (I ran into this accidentally because a kernel built on my system failed
>> to boot in older qemu because the kernel paniced with an illegal instruction
>> fault in this function.)
>>
>> There's no way we can preemptively locate every bit of C that clang might
>> decide to replace with popcount and fix them to employ a workaround.  I think
>> the only viable solution is to use "-mno-popcount" on all compilers that aren't
>> known to be fixed.
> Both you and Bruce and Jilles state this, but I am unable to understand the
> argument about the compiler.  Can somebody explain this to me, please ?

If you compile with -mpopcount (or a march that includes it like -march=corei7)
then the compiler will assume that popcount is available and will use it without
doing any runtime checks.  In the case of my sandybridge desktop, I have
CPUTYPE set to "corei7-avx" in /etc/make.conf which adds "-march=corei7-avx"
to CFLAGS, so the compiler assumes it can use POPCOUNT (as well as newer
SSE and AVX).  In this case the compiler recognized the pattern in the C
function above as doing a population count and replaced the entire function with
a POPCOUNT instruction even though the C code doesn't explicitly try to use it.

However, if we know that a compiler doesn't employ the workaround for this
errata, we can employ -mno-popcount so that using a custom CPUTYPE does not
result in use of POPCOUNT except for known-ok compilers.  Either that or
we use a more elaborate set of #if checks in <sys/types.h> that only uses
__builtin_popcount() for known-ok compilers (and possibly uses the
clear-destination workaround for other compilers if -mpopcount is enabled).

> Compiler cannot generate popcntq instructions unless it is directed to
> generate code not for amd64, but for some specific microacrchitecture.
> Any use of bitcount in generic kernel or userspace is going to be a
> SWAR. In other words, #ifdef __POPCNT__ is for non-default settings of
> the FreeBSD compilation environment.

Correct.

> And even then, compiler must issue cpuid to fetch the feature mask, which
> arguably compiler cannot do due to the serialization instruction cost.
> Compiler could generate the call to instruction in the init code, but
> this is impossible for freestanding environment, since compiler does not
> know where to put init code.

No, the compiler does not have to check cpuid.  If you've told it to compile
for a given CPU type, it has tables to hardcode the features known to be
present on those CPUs and just uses them.  If you use the wrong CPU type
you get illegal instruction faults. :)  (As I did in my qemu test case.)

>> In regards to whether or not this should be a dynamic test instead, it seems a
>> bit much to require a dynamic check for code that has to work in both userland
>> and the kernel, and I'm not sure if something like CPU_COUNT() would actually
>> benefit from that versus just always using the software solution instead.
> 
> I agree.

Hence why I opted to only use POPCOUNT if someone explicitly enables it via
-mpopcount or -march flags.  If you've told the compiler it's ok to use it
without any checks, then it will use POPCOUNT, otherwise it will use the
SWAR method.

-- 
John Baldwin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?550DA656.5060004>