Date:      Mon, 30 Mar 2015 20:24:34 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Bruce Evans <brde@optusnet.com.au>
Subject:   Re: svn commit: r280279 - head/sys/sys
Message-ID:  <20150330172434.GG2379@kib.kiev.ua>
In-Reply-To: <2526359.g5B2nXdKeQ@ralph.baldwin.cx>
References:  <201503201027.t2KAR6Ze053047@svn.freebsd.org> <20150322080015.O955@besplex.bde.org> <20150322093251.GY2379@kib.kiev.ua> <2526359.g5B2nXdKeQ@ralph.baldwin.cx>

On Mon, Mar 30, 2015 at 11:57:08AM -0400, John Baldwin wrote:
> On Sunday, March 22, 2015 11:32:51 AM Konstantin Belousov wrote:
> > On Sun, Mar 22, 2015 at 09:41:53AM +1100, Bruce Evans wrote:
> > > Always using new API would lose the micro-optimizations given by the runtime
> > > decision for default CFLAGS (used by distributions for portability).  To
> > > keep them, it seems best to keep the inline asm but replace
> > > popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
> > > be used to work around slowness in the software (that is actually
> > > hardware) case.
> > 
> > So anybody has to compile his own kernel to get the popcnt optimization?
> > We do care about trivial things that improve time.
> 
> That is not what Bruce said.  He suggested using bitcount64() for the fallback
> if the cpuid check fails.  He did not say to remove the runtime check to use
> popcnt if it is available:
> 
> "Always using [bitcount64] would lose the micro-optimization... [to] keep
> [it], it seems best to keep the inline asm but replace popcnt_pc_map_elem(elem)
> by [bitcount64(elem)]."
Ok, thank you for the clarification.

I updated the pmap patch; see the end of the message.
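
(For reference, the patch switches the non-POPCNT path to bitcount64(),
which is a branch-free software population count.  Roughly, such a
routine does a SWAR reduction like the sketch below; this is only an
illustration, not the actual sys/types.h code.)

#include <stdint.h>

/*
 * Illustrative SWAR population count for a 64-bit word, in the spirit
 * of bitcount64(); not the FreeBSD implementation itself.
 */
static inline unsigned int
swar_bitcount64(uint64_t x)
{

	x -= (x >> 1) & 0x5555555555555555ULL;		/* 2-bit sums */
	x = (x & 0x3333333333333333ULL) +
	    ((x >> 2) & 0x3333333333333333ULL);		/* 4-bit sums */
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;	/* per-byte sums */
	return ((x * 0x0101010101010101ULL) >> 56);	/* add the bytes */
}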
> 
> > BTW, I have the following WIP change, which popcnt xorl is a piece of.
> > It emulates the ifuncs with some preprocessing mess.  It is much better
> > than runtime patching, and is a prerequisite to properly support more
> things, like SMAP.  I did not publish it earlier, since I wanted to
> > convert TLB flush code to this.
> 
> This looks fine to me.  It seems to be manually converting certain symbols
> to use a dynamic lookup that must be explicitly resolved before first
> use?
I am not sure what you mean by a dynamic lookup, but possibly that is
what was meant.  I can emulate the ifuncs more faithfully, by requiring
a resolver function which is called on the first real invocation of the
function.  I did not see that as very useful, but it is definitely
doable.
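
Roughly, such a scheme would look like the sketch below: a function
pointer that initially points at a resolver, which picks the real
implementation on the first call.  The names are made up, and it
assumes the same kernel environment as pmap.c (cpu_feature2,
CPUID2_POPCNT, bitcount64()).

static int	popcnt_hw(uint64_t elem);
static int	popcnt_sw(uint64_t elem);
static int	popcnt_resolve(uint64_t elem);

static int	(*popcnt_impl)(uint64_t) = popcnt_resolve;

/* Hardware path: same xor + popcnt sequence as in the patch below. */
static int
popcnt_hw(uint64_t elem)
{
	u_long res;

	__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
	    : "=&r" (res) : "rm" (elem));
	return (res);
}

/* Software fallback. */
static int
popcnt_sw(uint64_t elem)
{

	return (bitcount64(elem));
}

/* Resolver: runs only once, on the first call through popcnt_impl. */
static int
popcnt_resolve(uint64_t elem)
{

	popcnt_impl = (cpu_feature2 & CPUID2_POPCNT) != 0 ?
	    popcnt_hw : popcnt_sw;
	return (popcnt_impl(elem));
}

Callers would then use (*popcnt_impl)(x) and never test cpu_feature2
themselves; the cost is an indirect call on every invocation, which the
preprocessor approach in the WIP change avoids.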

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 6a4077c..fcfba56 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -412,7 +416,7 @@ static caddr_t crashdumpmap;
 static void	free_pv_chunk(struct pv_chunk *pc);
 static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
-static int	popcnt_pc_map_elem(uint64_t elem);
+static int	popcnt_pc_map_elem_pq(uint64_t elem);
 static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
 static void	reserve_pv_entries(pmap_t pmap, int needed,
 		    struct rwlock **lockp);
@@ -2980,20 +3002,27 @@ retry:
 
 /*
  * Returns the number of one bits within the given PV chunk map element.
+ *
+ * The errata for Intel processors state that "POPCNT Instruction May
+ * Take Longer to Execute Than Expected".  It is believed that the
+ * issue is the spurious dependency on the destination register.
+ * Provide a hint to the register rename logic that the destination
+ * value is overwritten, by clearing it, as suggested in the
+ * optimization manual.  It should be cheap for unaffected processors
+ * as well.
+ *
+ * Reference numbers for the errata are:
+ * 4th Gen Core: HSD146
+ * 5th Gen Core: BDM85
  */
 static int
-popcnt_pc_map_elem(uint64_t elem)
+popcnt_pc_map_elem_pq(uint64_t elem)
 {
-	int count;
+	u_long result;
 
-	/*
-	 * This simple method of counting the one bits performs well because
-	 * the given element typically contains more zero bits than one bits.
-	 */
-	count = 0;
-	for (; elem != 0; elem &= elem - 1)
-		count++;
-	return (count);
+	__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
+	    : "=&r" (result) : "rm" (elem));
+	return (result);
 }
 
 /*
@@ -3025,13 +3054,13 @@ retry:
 	avail = 0;
 	TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
 		if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
-			free = popcnt_pc_map_elem(pc->pc_map[0]);
-			free += popcnt_pc_map_elem(pc->pc_map[1]);
-			free += popcnt_pc_map_elem(pc->pc_map[2]);
+			free = bitcount64(pc->pc_map[0]);
+			free += bitcount64(pc->pc_map[1]);
+			free += bitcount64(pc->pc_map[2]);
 		} else {
-			free = popcntq(pc->pc_map[0]);
-			free += popcntq(pc->pc_map[1]);
-			free += popcntq(pc->pc_map[2]);
+			free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
+			free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
+			free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
 		}
 		if (free == 0)
 			break;


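A note on the inline asm above: xorl %k0,%k0 touches only the low 32
bits, but on amd64 this zero-extends the whole register, and the
zeroing idiom is what lets the rename logic drop the false dependency
on the previous destination value.  The early clobber in "=&r" is
needed because the output is written before the input operand is
consumed, so the two must not share a register.  The pattern can be
tried in userland with a throwaway test like the one below (my own
example, assuming a POPCNT-capable CPU):

#include <stdint.h>
#include <stdio.h>

/* Same clear-then-popcnt pattern as the patch, in userland syntax. */
static inline uint64_t
popcnt64(uint64_t x)
{
	uint64_t res;

	__asm__ __volatile__("xorl %k0,%k0; popcntq %1,%0"
	    : "=&r" (res) : "rm" (x));
	return (res);
}

int
main(void)
{

	/* 0xff00ff00ff00ff00 has 32 bits set. */
	printf("%ju\n", (uintmax_t)popcnt64(0xff00ff00ff00ff00ULL));
	return (0);
}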
