From owner-freebsd-performance@FreeBSD.ORG  Fri Jun 14 02:05:11 2013
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 041AEEED
 for <freebsd-performance@freebsd.org>; Fri, 14 Jun 2013 02:05:11 +0000 (UTC)
 (envelope-from davidxu@freebsd.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 by mx1.freebsd.org (Postfix) with ESMTP id CD2E5183C;
 Fri, 14 Jun 2013 02:05:10 +0000 (UTC)
Received: from xyf.my.dom (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r5E258sL055350;
 Fri, 14 Jun 2013 02:05:09 GMT (envelope-from davidxu@freebsd.org)
Message-ID: <51BA7A78.7010904@freebsd.org>
Date: Fri, 14 Jun 2013 10:05:44 +0800
From: David Xu <davidxu@freebsd.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD i386;
 rv:17.0) Gecko/20130416 Thunderbird/17.0.5
MIME-Version: 1.0
To: Remy Nonnenmacher <remy.nonnenmacher@activnetworks.com>
Subject: Re: Scaling and performance issues with FreeBSD 9 (& 10) on 4 socket
 systems
References: <20130612225849.GA2858@dragon.NUXI.org>
 <op.wyl7oryz34t2sn@markf.office.supranet.net>
 <51B9B497.70800@activnetworks.com>
In-Reply-To: <51B9B497.70800@activnetworks.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Cc: "freebsd-performance@freebsd.org" <freebsd-performance@freebsd.org>
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-performance>, 
 <mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>, 
 <mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 14 Jun 2013 02:05:11 -0000

On 2013/06/13 20:01, Remy Nonnenmacher wrote:
>
> On 06/13/13 13:32, Mark Felder wrote:
>> On Wed, 12 Jun 2013 17:58:49 -0500, David O'Brien <obrien@freebsd.org>
>> wrote:
>>
>>> We found FreeBSD 8.4 to perform better than FreeBSD 9.1, and Linux
>>> considerably better than both on the same machine.
>>
>> http://svnweb.freebsd.org/base?view=revision&revision=241246
>>
>> The above link is likely why 8.4 is better than 9.1 on the same machine.
>>
>>> We've tried various things and haven't been able to explain why FreeBSD
>>> isn't scaling on the new hardware.  Nor why it performs so much worse
>>> than FreeBSD on the older "M2" machines.
>>
>> The CPUs between those machines are quite different. I'm sure we're
>> looking at different cache sizes, different behavior for the
>> hyperthreading, etc. I'm sure others would be greatly interested in you
>> providing the same benchmark results for a recent snapshot of HEAD as
>> well.
>
> We had the same problem on 4x12-core (AMD) machines. After investigating
> with hwpmc, it appeared that performance was being killed by a scheduler
> function that tries to find the "least used CPU" and unfortunately works
> on contended structures (i.e., lots of cores fighting to get work). We
> worked around it by allowing an artificially long queue of stuck
> processes (steal_thresh bumped to over 8) and by crafting CPU affinity.
>
> That was a year ago and is from memory. You could give it a try to see
> if it helps.
>
> Disregard this if a scheduler specialist contradicts it.
>
> Thanks.
>

AMD's cache hierarchy is very different from Intel's. AFAIK, before
Bulldozer, AMD's L3 was an exclusive cache. As of Bulldozer, AMD describes
the L3 cache as a “non-inclusive victim cache”, which is still different
from Intel's L3, which is inclusive.

"- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU."
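
The three-pass fallback quoted above can be sketched roughly as follows.
This is only an illustrative model, not the kernel code: the Cpu class, its
fields and pick_cpu() are hypothetical names, SMT group skipping is left
out, and "can run now" is assumed to mean that the thread's priority is
numerically lower (better) than that of whatever the CPU is running.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    llc_group: int    # CPUs with the same value share a last-level cache
    load: int         # runnable threads on this CPU; 0 means idle
    running_pri: int  # priority of the running thread (lower is better)


def pick_cpu(cpus, last_cpu_id, thread_pri):
    """Toy model of the quoted selection order (hypothetical, no SMT)."""
    last = next(c for c in cpus if c.cpu_id == last_cpu_id)

    # Pass 1: an idle CPU sharing the last-level cache with the previous CPU.
    for c in cpus:
        if c.llc_group == last.llc_group and c.load == 0:
            return c.cpu_id

    # Pass 2: the least loaded CPU on which the thread could run right away
    # (assumed here: its priority beats the currently running thread's).
    can_run_now = [c for c in cpus if thread_pri < c.running_pri]
    if can_run_now:
        return min(can_run_now, key=lambda c: c.load).cpu_id

    # Pass 3: simply the least loaded CPU anywhere.
    return min(cpus, key=lambda c: c.load).cpu_id
```

Note that pass 1 only looks at the last-level cache, so on a machine where
L3 is the LLC it treats every CPU behind that L3 as equally good.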

With an exclusive cache, the L3 holds second-hand (evicted) data, not hot
data. Migrating a thread to a CPU that shares only the L3 therefore has a
negative effect: the thread's hot data is lost.
I'd prefer to search for an idle CPU sharing L2 first, then L3.
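
A minimal sketch of that search order, assuming the caller already knows
which CPUs are idle and which cache groups each CPU belongs to. The function
name pick_idle_cpu() and the l2_group/l3_group maps are hypothetical and do
not correspond to any scheduler interface.

```python
def pick_idle_cpu(last_cpu, idle, l2_group, l3_group):
    """Return an idle CPU close to last_cpu in the cache hierarchy.

    idle      -- set of CPU ids that are currently idle
    l2_group  -- dict: CPU id -> id of the L2 cache it sits behind
    l3_group  -- dict: CPU id -> id of the L3 cache it sits behind

    CPUs sharing L2 with last_cpu are tried first, then CPUs sharing L3.
    Returns None when no idle CPU shares a cache with last_cpu, so the
    caller can fall back to a least-loaded search.
    """
    for group_of in (l2_group, l3_group):
        candidates = [c for c in sorted(idle)
                      if group_of[c] == group_of[last_cpu]]
        if candidates:
            return candidates[0]
    return None
```

For example, with a made-up topology where CPUs 0-1 share an L2 and CPUs
0-3 share an L3, a thread that last ran on CPU 0 would go to CPU 1 if it is
idle, and only otherwise to CPU 2 or 3.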