From owner-freebsd-stable@FreeBSD.ORG Mon Dec 12 18:50:14 2011
From: John Baldwin <jhb@freebsd.org>
To: freebsd-current@freebsd.org
Cc: Bruce Cran, "O. Hartmann", freebsd-stable@freebsd.org,
 freebsd-performance@freebsd.org, Steve Kargl
Date: Mon, 12 Dec 2011 13:50:11 -0500
Subject: Re: SCHED_ULE should not be the default
Message-Id: <201112121350.11784.jhb@freebsd.org>
In-Reply-To: <20111212170604.GA74044@troutmask.apl.washington.edu>
References: <4EE1EAFE.3070408@m5p.com> <4EE6295B.3020308@cran.org.uk>
 <20111212170604.GA74044@troutmask.apl.washington.edu>

On Monday, December 12, 2011 12:06:04 pm Steve Kargl wrote:
> On Mon, Dec 12, 2011 at 04:18:35PM +0000, Bruce Cran wrote:
> > On 12/12/2011 15:51, Steve Kargl wrote:
> > > This comes up every 9 months or so, and must be approaching FAQ
> > > status. In an HPC environment, I recommend 4BSD. Depending on the
> > > workload, ULE can cause a severe increase in turnaround time when
> > > doing already long computations. If you have an MPI application,
> > > simply launching ncpu+1 or more jobs can show the problem. PS:
> > > search the list archives for "kargl and ULE".
> >
> > Isn't this something that can be fixed by tuning ULE? For example,
> > for desktop applications kern.sched.preempt_thresh should be raised
> > from its default to 224. I'm wondering if the installer should ask
> > people what the typical use will be, and tune the scheduler
> > appropriately.
>
> Tuning kern.sched.preempt_thresh did not seem to help for
> my workload. My code is a classic master-slave OpenMPI
> application where the master runs on one node and all
> cpu-bound slaves are sent to a second node. If I send
> ncpu+1 jobs to the 2nd node with its ncpu cpus, then
> ncpu-1 jobs are assigned to the 1st ncpu-1 cpus. The
> last two jobs are assigned to the ncpu'th cpu, and
> these ping-pong on this cpu. AFAICT, it is a cpu
> affinity issue, where ULE is trying to keep each job
> associated with its initially assigned cpu.
>
> While one might suggest that starting ncpu+1 jobs
> is not prudent, my example is just that: an example
> showing that ULE has performance issues.
> So, I now can either start only ncpu jobs on each node
> in the cluster and send email to all the other users
> asking them not to use those nodes, or use 4BSD and not
> worry about loading issues.

This is a case where 4BSD's naive algorithm spreads the load out more
evenly because all the threads sit on a single, shared queue and each CPU
simply grabs the head of that queue when it finishes a timeslice.  ULE
always assigns a thread to a single CPU (even if it isn't pinned to one
via cpuset, etc.) and then tries to balance the load across cores later,
but I believe in this case its rebalancer won't really have anything to
do: no matter where it puts the N+1'th job, that job is going to be
sharing a CPU with another one.

-- 
John Baldwin
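P.S. In case a picture helps, below is a toy user-space sketch of the
difference described above.  It is emphatically not the real sched_4bsd.c
or sched_ule.c code: the CPU/job/tick counts are made up, and the
"per-CPU" half only models the initial CPU assignment, not the full ULE
balancer.  It just shows why, with ncpu+1 equal jobs, a shared queue gives
every job the same share while fixed per-CPU queues leave two jobs
splitting one CPU.

/*
 * Toy illustration (not kernel code) of two queueing models:
 *   shared queue:  every CPU pulls the next runnable job from one global
 *                  queue, so all NJOB jobs rotate across the CPUs.
 *   per-CPU queue: job j stays on CPU j % NCPU; the two jobs that land on
 *                  the same CPU alternate on it, and migrating one of them
 *                  to another (equally loaded) CPU would change nothing.
 */
#include <stdio.h>

#define NCPU  4          /* CPUs on the node (made-up number)   */
#define NJOB  (NCPU + 1) /* the ncpu+1 case from Steve's mail   */
#define TICKS 1000       /* scheduling slices to simulate       */

int main(void)
{
    int shared[NJOB] = {0};  /* ticks received, shared-queue model  */
    int percpu[NJOB] = {0};  /* ticks received, per-CPU-queue model */
    int turn[NCPU] = {0};
    int next = 0;

    for (int t = 0; t < TICKS; t++) {
        /* Shared queue: each CPU simply takes the next runnable job. */
        for (int cpu = 0; cpu < NCPU; cpu++) {
            shared[next]++;
            next = (next + 1) % NJOB;
        }
        /* Per-CPU queues: the jobs assigned to this CPU take turns on it. */
        for (int cpu = 0; cpu < NCPU; cpu++) {
            int assigned[NJOB], n = 0;
            for (int j = 0; j < NJOB; j++)
                if (j % NCPU == cpu)
                    assigned[n++] = j;
            percpu[assigned[turn[cpu]++ % n]]++;
        }
    }

    printf("job   shared-queue ticks   per-cpu-queue ticks\n");
    for (int j = 0; j < NJOB; j++)
        printf("%3d   %19d   %19d\n", j, shared[j], percpu[j]);
    return 0;
}

With 4 CPUs and 5 jobs it prints 800 ticks per job for the shared queue
versus 1000/1000/1000/500/500 for the per-CPU queues, which is exactly the
ping-pong Steve is seeing.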
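P.P.S. For anyone who wants to experiment with the tuning Bruce mentions:
kern.sched.preempt_thresh is an ordinary sysctl on a SCHED_ULE kernel, so
running "sysctl kern.sched.preempt_thresh=224" as root (224 being the
value from his mail) should be all it takes to try it on a live system,
and putting the same name=value line in /etc/sysctl.conf makes it survive
a reboot.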