From: Jung-uk Kim
Date: Mon, 13 Apr 2015 16:04:45 -0400
To: Alan Cox, John Baldwin, Bruce Evans
Cc: Konstantin Belousov, svn-src-head@freebsd.org, svn-src-all@freebsd.org,
    src-committers@freebsd.org
Subject: Re: svn commit: r280279 - head/sys/sys
Message-ID: <552C215D.8020107@FreeBSD.org>
In-Reply-To: <552BFEB2.8040407@rice.edu>

On 04/13/2015 13:36, Alan Cox wrote:
> On 03/30/2015 10:50, John Baldwin wrote:
>> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>>>>>> Author: jhb
>>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>>> New Revision: 280279
>>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>>
>>>>>>>> Log:
>>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints,
>>>>>>>>   and longs, and create a "hidden" API that can be used in other
>>>>>>>>   system headers without adding namespace pollution.
>>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>>     __builtin_popcount*() to implement __bitcount*(); otherwise,
>>>>>>>>     fall back to software implementations.
>>>>>>>
>>>>>>> Are you aware of the Haswell erratum HSD146?  I see the described
>>>>>>> behaviour on machines back to Sandy Bridge, but not on Nehalems.
>>>>>>>
>>>>>>>   HSD146.  POPCNT Instruction May Take Longer to Execute Than
>>>>>>>   Expected
>>>>>>>   Problem: POPCNT instruction execution with a 32- or 64-bit
>>>>>>>   operand may be delayed until previous non-dependent instructions
>>>>>>>   have executed.
>>>>>>>
>>>>>>> Jilles noted that gcc head and 4.9.2 already provide a workaround
>>>>>>> by xoring the dst register.  I have a patch for the amd64 pmap; see
>>>>>>> the end of the message.
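The workaround kib describes amounts to clearing the destination register
before the popcnt, so the instruction has no false dependency on that
register's previous value.  A minimal sketch of that form
(popcnt64_nodep() is a made-up name, not code from the tree; it assumes
the kernel's <sys/types.h> typedefs):

static inline u_long
popcnt64_nodep(uint64_t x)
{
        u_long result;

        /* The xor breaks popcnt's false dependency on its output reg. */
        __asm __volatile("xorl %k0, %k0; popcntq %1, %0"
            : "=&r" (result) : "rm" (x));
        return (result);
}

The "&" early clobber is required because the xor writes the output
operand before the input operand is read.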
>>>>>> No, I was not aware, but I think it's hard to fix this anywhere
>>>>>> but the compiler.  I set CPUTYPE in src.conf on my Ivy Bridge
>>>>>> desktop and clang uses POPCNT for this function from ACPI-CA:
>>>>>>
>>>>>> static UINT8
>>>>>> AcpiRsCountSetBits (
>>>>>>     UINT16                  BitField)
>>>>>> {
>>>>>>     UINT8                   BitsSet;
>>>>>>
>>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>>
>>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>>     {
>>>>>>         /* Zero the least significant bit that is set */
>>>>>>
>>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>>     }
>>>>>>
>>>>>>     return (BitsSet);
>>>>>> }
>>>>>>
>>>>>> (I ran into this accidentally because a kernel built on my system
>>>>>> failed to boot in older qemu: the kernel panicked with an illegal
>>>>>> instruction fault in this function.)
>>>
>>> Does it do the same for the similar home-made popcount in pmap?
>>
>> Yes:
>>
>> ffffffff807658d4: f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
>> ffffffff807658db: 80
>> ffffffff807658dc: 74 32                   je     ffffffff80765910
>> ffffffff807658de: 48 89 4d b8             mov    %rcx,-0x48(%rbp)
>> ffffffff807658e2: f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
>> ffffffff807658e8: 48 8b 50 20             mov    0x20(%rax),%rdx
>> ffffffff807658ec: 48 89 55 b0             mov    %rdx,-0x50(%rbp)
>> ffffffff807658f0: f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
>> ffffffff807658f6: 01 ca                   add    %ecx,%edx
>> ffffffff807658f8: 48 8b 48 28             mov    0x28(%rax),%rcx
>> ffffffff807658fc: 48 89 4d a8             mov    %rcx,-0x58(%rbp)
>> ffffffff80765900: f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
>> ffffffff80765906: eb 1b                   jmp    ffffffff80765923
>> ffffffff80765908: 0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
>> ffffffff8076590f: 00
>> ffffffff80765910: f3 48 0f b8 c9          popcnt %rcx,%rcx
>> ffffffff80765915: f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
>> ffffffff8076591b: 01 ca                   add    %ecx,%edx
>> ffffffff8076591d: f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
>> ffffffff80765923: 01 d1                   add    %edx,%ecx
>>
>> It also uses popcnt for this in blist_fill() and blist_meta_fill():
>>
>> 742         /* Count the number of blocks we're about to allocate */
>> 743         bitmap = scan->u.bmu_bitmap & mask;
>> 744         for (nblks = 0; bitmap != 0; nblks++)
>> 745                 bitmap &= bitmap - 1;
>>
>>> Always using the new API would lose the micro-optimizations given by
>>> the runtime decision for default CFLAGS (used by distributions for
>>> portability).  To keep them, it seems best to keep the inline asm but
>>> replace popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcnt
>>> can then be used to work around slowness in the software (that is
>>> actually hardware) case.
>>
>> I'm not sure bitcount64() is strictly better than the loop in this
>> case, even though it is O(1), given the claimed nature of the values
>> in the comment.
>
> I checked.  Even with zeroes being more common than ones, bitcount64()
> is faster than the simple loop.  Using bitcount64(), reserve_pv_entries()
> takes on average 4265 cycles during "buildworld" on my test machine; with
> the simple loop, it takes on average 4507 cycles.  Even though
> bitcount64() is a lot larger than the simple loop, we do the 3 bit count
> operations many times in a loop, so the extra i-cache misses are made up
> for by the repeated execution of the faster code.
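For reference, the portable bitcount64() being measured here is the usual
SWAR reduction.  A sketch of the standard form (FreeBSD's actual
__bitcount64() in <sys/types.h> may differ in detail):

#include <stdint.h>

static inline int
bitcount64_sw(uint64_t x)
{
        /* Sum adjacent bit pairs, then nibbles, then all bytes at once. */
        x = x - ((x >> 1) & 0x5555555555555555ULL);
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
        x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
        return ((x * 0x0101010101010101ULL) >> 56);
}

Its cost is a fixed handful of shifts, masks, and one multiply no matter
how many bits are set, which is consistent with it beating the
bit-clearing loop even on mostly-zero maps.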
> However, in the popcnt case, we are spilling the bit map to memory in
> order to popcnt it.  That's rather silly:
>
> 3570: 48 8b 48 18             mov    0x18(%rax),%rcx
> 3574: f6 04 25 00 00 00 00    testb  $0x80,0x0
> 357b: 80
> 357c: 74 42                   je     35c0
> 357e: 48 89 4d b8             mov    %rcx,-0x48(%rbp)
> 3582: 31 c9                   xor    %ecx,%ecx
> 3584: f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
> 358a: 48 8b 50 20             mov    0x20(%rax),%rdx
> 358e: 48 89 55 b0             mov    %rdx,-0x50(%rbp)
> 3592: 31 d2                   xor    %edx,%edx
> 3594: f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
> 359a: 01 ca                   add    %ecx,%edx
> 359c: 48 8b 48 28             mov    0x28(%rax),%rcx
> 35a0: 48 89 4d a8             mov    %rcx,-0x58(%rbp)
> 35a4: 31 c9                   xor    %ecx,%ecx
> 35a6: f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
> 35ac: 01 d1                   add    %edx,%ecx
> 35ae: e9 12 01 00 00          jmpq   36c5

Please try the attached patch.

Jung-uk Kim

[Attachment: pmap.diff]

Index: sys/amd64/amd64/pmap.c
===================================================================
--- sys/amd64/amd64/pmap.c      (revision 281496)
+++ sys/amd64/amd64/pmap.c      (working copy)
@@ -412,7 +412,7 @@ static caddr_t crashdumpmap;
 static void    free_pv_chunk(struct pv_chunk *pc);
 static void    free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
-static int     popcnt_pc_map_elem_pq(uint64_t elem);
+static int     popcnt_pc_map(uint64_t *pc_map);
 static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
 static void    reserve_pv_entries(pmap_t pmap, int needed,
                    struct rwlock **lockp);
@@ -2979,7 +2979,7 @@ retry:
 }
 
 /*
- * Returns the number of one bits within the given PV chunk map element.
+ * Returns the number of one bits within the given PV chunk map.
  *
  * The erratas for Intel processors state that "POPCNT Instruction May
  * Take Longer to Execute Than Expected".  It is believed that the
@@ -2994,12 +2994,21 @@ retry:
  *     5th Gen Core: BDM85
  */
 static int
-popcnt_pc_map_elem_pq(uint64_t elem)
+popcnt_pc_map(uint64_t *pc_map)
 {
-       u_long result;
+       u_long count, result;
+       int field;
 
-       __asm __volatile("xorl %k0,%k0;popcntq %1,%0"
-           : "=&r" (result) : "rm" (elem));
+       result = 0;
+       if ((cpu_feature2 & CPUID2_POPCNT) != 0)
+               for (field = 0; field < _NPCM; field++) {
+                       __asm __volatile("xorl %k0, %k0; popcntq %1, %0"
+                           : "=r" (count) : "m" (pc_map[field]));
+                       result += count;
+               }
+       else
+               for (field = 0; field < _NPCM; field++)
+                       result += bitcount64(pc_map[field]);
        return (result);
 }
 
@@ -3031,15 +3040,7 @@ reserve_pv_entries(pmap_t pmap, int needed, struct
 retry:
        avail = 0;
       	TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
-               if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
-                       free = bitcount64(pc->pc_map[0]);
-                       free += bitcount64(pc->pc_map[1]);
-                       free += bitcount64(pc->pc_map[2]);
-               } else {
-                       free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
-                       free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
-                       free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
-               }
+               free = popcnt_pc_map(pc->pc_map);
                if (free == 0)
                        break;
                avail += free;
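On the constraint change in the patch: the old popcnt_pc_map_elem_pq()
took its argument by value with an "rm" input, and clang satisfied that
constraint by spilling the register to the stack, which is the
store-then-popcnt pattern visible in the disassembly above.  The new
popcnt_pc_map() passes a pointer and uses a plain "m" input, so popcntq
reads each pc_map[] element from where it already sits in the PV chunk.
A minimal sketch of the memory-operand form (popcnt64_mem() is a
hypothetical standalone helper, not the patch itself):

static inline u_long
popcnt64_mem(const uint64_t *p)
{
        u_long result;

        /* The xor still breaks the false dependency; "m" avoids a spill. */
        __asm __volatile("xorl %k0, %k0; popcntq %1, %0"
            : "=&r" (result) : "m" (*p));
        return (result);
}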