From: Jung-uk Kim
Date: Mon, 13 Apr 2015 16:04:45 -0400
To: Alan Cox, John Baldwin, Bruce Evans
Cc: Konstantin Belousov, svn-src-head@freebsd.org, svn-src-all@freebsd.org,
    src-committers@freebsd.org
Subject: Re: svn commit: r280279 - head/sys/sys
Message-ID: <552C215D.8020107@FreeBSD.org>
In-Reply-To: <552BFEB2.8040407@rice.edu>

On 04/13/2015 13:36, Alan Cox wrote:
> On 03/30/2015 10:50, John Baldwin wrote:
>> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>>>>>> Author: jhb
>>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>>> New Revision: 280279
>>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>>
>>>>>>>> Log:
>>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints,
>>>>>>>>   and longs, and create a "hidden" API that can be used in other
>>>>>>>>   system headers without adding namespace pollution.
>>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>>     __builtin_popcount*() to implement __bitcount*(); otherwise,
>>>>>>>>     fall back to software implementations.
>>>>>>>
>>>>>>> Are you aware of the Haswell erratum HSD146?  I see the described
>>>>>>> behaviour on machines back to Sandy Bridge, but not on Nehalems.
>>>>>>>
>>>>>>>   HSD146.  POPCNT Instruction May Take Longer to Execute Than
>>>>>>>   Expected
>>>>>>>   Problem: POPCNT instruction execution with a 32- or 64-bit
>>>>>>>   operand may be delayed until previous non-dependent instructions
>>>>>>>   have executed.
>>>>>>>
>>>>>>> Jilles noted that gcc head and 4.9.2 already provide a workaround
>>>>>>> by xoring the dst register.  I have a patch for the amd64 pmap; see
>>>>>>> the end of the message.
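The workaround kib describes amounts to clearing the destination register
before the popcnt, so the instruction has no false dependency on that
register's previous value.  A minimal sketch of that form
(popcnt64_nodep() is a made-up name, not code from the tree; it assumes
the kernel's <sys/types.h> typedefs):

static inline u_long
popcnt64_nodep(uint64_t x)
{
        u_long result;

        /* The xor breaks popcnt's false dependency on its output reg. */
        __asm __volatile("xorl %k0, %k0; popcntq %1, %0"
            : "=&r" (result) : "rm" (x));
        return (result);
}

The "&" early clobber is required because the xor writes the output
operand before the input operand is read.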
>>>>>> No, I was not aware, but I think it's hard to fix this anywhere
>>>>>> but the compiler.  I set CPUTYPE in src.conf on my Ivy Bridge
>>>>>> desktop and clang uses POPCNT for this function from ACPI-CA:
>>>>>>
>>>>>> static UINT8
>>>>>> AcpiRsCountSetBits (
>>>>>>     UINT16                  BitField)
>>>>>> {
>>>>>>     UINT8                   BitsSet;
>>>>>>
>>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>>
>>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>>     {
>>>>>>         /* Zero the least significant bit that is set */
>>>>>>
>>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>>     }
>>>>>>
>>>>>>     return (BitsSet);
>>>>>> }
>>>>>>
>>>>>> (I ran into this accidentally because a kernel built on my system
>>>>>> failed to boot in older qemu: the kernel panicked with an illegal
>>>>>> instruction fault in this function.)
>>>
>>> Does it do the same for the similar home-made popcount in pmap?
>>
>> Yes:
>>
>> ffffffff807658d4: f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
>> ffffffff807658db: 80
>> ffffffff807658dc: 74 32                   je     ffffffff80765910
>> ffffffff807658de: 48 89 4d b8             mov    %rcx,-0x48(%rbp)
>> ffffffff807658e2: f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
>> ffffffff807658e8: 48 8b 50 20             mov    0x20(%rax),%rdx
>> ffffffff807658ec: 48 89 55 b0             mov    %rdx,-0x50(%rbp)
>> ffffffff807658f0: f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
>> ffffffff807658f6: 01 ca                   add    %ecx,%edx
>> ffffffff807658f8: 48 8b 48 28             mov    0x28(%rax),%rcx
>> ffffffff807658fc: 48 89 4d a8             mov    %rcx,-0x58(%rbp)
>> ffffffff80765900: f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
>> ffffffff80765906: eb 1b                   jmp    ffffffff80765923
>> ffffffff80765908: 0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
>> ffffffff8076590f: 00
>> ffffffff80765910: f3 48 0f b8 c9          popcnt %rcx,%rcx
>> ffffffff80765915: f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
>> ffffffff8076591b: 01 ca                   add    %ecx,%edx
>> ffffffff8076591d: f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
>> ffffffff80765923: 01 d1                   add    %edx,%ecx
>>
>> It also uses popcnt for this in blist_fill() and blist_meta_fill():
>>
>> 742         /* Count the number of blocks we're about to allocate */
>> 743         bitmap = scan->u.bmu_bitmap & mask;
>> 744         for (nblks = 0; bitmap != 0; nblks++)
>> 745                 bitmap &= bitmap - 1;
>>
>>> Always using the new API would lose the micro-optimizations given by
>>> the runtime decision for default CFLAGS (used by distributions for
>>> portability).  To keep them, it seems best to keep the inline asm but
>>> replace popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcnt
>>> can then be used to work around slowness in the software (that is
>>> actually hardware) case.
>>
>> I'm not sure bitcount64() is strictly better than the loop in this
>> case, even though it is O(1), given the claimed nature of the values
>> in the comment.
>
> I checked.  Even with zeroes being more common than ones, bitcount64()
> is faster than the simple loop.  Using bitcount64(), reserve_pv_entries()
> takes on average 4265 cycles during "buildworld" on my test machine; with
> the simple loop, it takes on average 4507 cycles.  Even though
> bitcount64() is a lot larger than the simple loop, we do the 3 bit count
> operations many times in a loop, so the extra i-cache misses are made up
> for by the repeated execution of the faster code.
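For reference, the portable bitcount64() being measured here is the usual
SWAR reduction.  A sketch of the standard form (FreeBSD's actual
__bitcount64() in <sys/types.h> may differ in detail):

#include <stdint.h>

static inline int
bitcount64_sw(uint64_t x)
{
        /* Sum adjacent bit pairs, then nibbles, then all bytes at once. */
        x = x - ((x >> 1) & 0x5555555555555555ULL);
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
        x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
        return ((x * 0x0101010101010101ULL) >> 56);
}

Its cost is a fixed handful of shifts, masks, and one multiply no matter
how many bits are set, which is consistent with it beating the
bit-clearing loop even on mostly-zero maps.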
> However, in the popcnt case, we are spilling the bit map to memory in
> order to popcnt it.  That's rather silly:
>
> 3570: 48 8b 48 18             mov    0x18(%rax),%rcx
> 3574: f6 04 25 00 00 00 00    testb  $0x80,0x0
> 357b: 80
> 357c: 74 42                   je     35c0
> 357e: 48 89 4d b8             mov    %rcx,-0x48(%rbp)
> 3582: 31 c9                   xor    %ecx,%ecx
> 3584: f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
> 358a: 48 8b 50 20             mov    0x20(%rax),%rdx
> 358e: 48 89 55 b0             mov    %rdx,-0x50(%rbp)
> 3592: 31 d2                   xor    %edx,%edx
> 3594: f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
> 359a: 01 ca                   add    %ecx,%edx
> 359c: 48 8b 48 28             mov    0x28(%rax),%rcx
> 35a0: 48 89 4d a8             mov    %rcx,-0x58(%rbp)
> 35a4: 31 c9                   xor    %ecx,%ecx
> 35a6: f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
> 35ac: 01 d1                   add    %edx,%ecx
> 35ae: e9 12 01 00 00          jmpq   36c5

Please try the attached patch.

Jung-uk Kim

[Attachment: pmap.diff]

Index: sys/amd64/amd64/pmap.c
===================================================================
--- sys/amd64/amd64/pmap.c      (revision 281496)
+++ sys/amd64/amd64/pmap.c      (working copy)
@@ -412,7 +412,7 @@ static caddr_t crashdumpmap;
 static void    free_pv_chunk(struct pv_chunk *pc);
 static void    free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
-static int     popcnt_pc_map_elem_pq(uint64_t elem);
+static int     popcnt_pc_map(uint64_t *pc_map);
 static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
 static void    reserve_pv_entries(pmap_t pmap, int needed,
                    struct rwlock **lockp);
@@ -2979,7 +2979,7 @@ retry:
 }
 
 /*
- * Returns the number of one bits within the given PV chunk map element.
+ * Returns the number of one bits within the given PV chunk map.
  *
  * The erratas for Intel processors state that "POPCNT Instruction May
  * Take Longer to Execute Than Expected".  It is believed that the
@@ -2994,12 +2994,21 @@ retry:
  *     5th Gen Core: BDM85
  */
 static int
-popcnt_pc_map_elem_pq(uint64_t elem)
+popcnt_pc_map(uint64_t *pc_map)
 {
-       u_long result;
+       u_long count, result;
+       int field;
 
-       __asm __volatile("xorl %k0,%k0;popcntq %1,%0"
-           : "=&r" (result) : "rm" (elem));
+       result = 0;
+       if ((cpu_feature2 & CPUID2_POPCNT) != 0)
+               for (field = 0; field < _NPCM; field++) {
+                       __asm __volatile("xorl %k0, %k0; popcntq %1, %0"
+                           : "=r" (count) : "m" (pc_map[field]));
+                       result += count;
+               }
+       else
+               for (field = 0; field < _NPCM; field++)
+                       result += bitcount64(pc_map[field]);
        return (result);
 }
 
@@ -3031,15 +3040,7 @@ reserve_pv_entries(pmap_t pmap, int needed, struct
 retry:
        avail = 0;
       	TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
-               if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
-                       free = bitcount64(pc->pc_map[0]);
-                       free += bitcount64(pc->pc_map[1]);
-                       free += bitcount64(pc->pc_map[2]);
-               } else {
-                       free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
-                       free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
-                       free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
-               }
+               free = popcnt_pc_map(pc->pc_map);
                if (free == 0)
                        break;
                avail += free;
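On the constraint change in the patch: the old popcnt_pc_map_elem_pq()
took its argument by value with an "rm" input, and clang satisfied that
constraint by spilling the register to the stack, which is the
store-then-popcnt pattern visible in the disassembly above.  The new
popcnt_pc_map() passes a pointer and uses a plain "m" input, so popcntq
reads each pc_map[] element from where it already sits in the PV chunk.
A minimal sketch of the memory-operand form (popcnt64_mem() is a
hypothetical standalone helper, not the patch itself):

static inline u_long
popcnt64_mem(const uint64_t *p)
{
        u_long result;

        /* The xor still breaks the false dependency; "m" avoids a spill. */
        __asm __volatile("xorl %k0, %k0; popcntq %1, %0"
            : "=&r" (result) : "m" (*p));
        return (result);
}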