Date: Wed, 21 Dec 2011 16:52:50 -0800
From: Steve Kargl
To: Attilio Rao
Cc: Andrey Chernov, George Mitchell, Doug Barton, freebsd-stable@freebsd.org
Subject: Re: SCHED_ULE should not be the default
Message-ID: <20111222005250.GA23115@troutmask.apl.washington.edu>
References: <4EE1EAFE.3070408@m5p.com> <20111215215554.GA87606@troutmask.apl.washington.edu>

On Fri, Dec 16, 2011 at 12:14:24PM +0100, Attilio Rao wrote:
> 2011/12/15 Steve Kargl :
> > On Thu, Dec 15, 2011 at 05:25:51PM +0100, Attilio Rao wrote:
> >>
> >> I basically went through all the e-mail you just sent and identified 4
> >> real report on which we could work on and summarizied in the attached
> >> Excel file.
> >> I'd like that George, Steve, Doug, Andrey and Mike possibly review the
> >> few datas there and add more, if they want, or make more important
> >> clarifications in particular about the Xorg presence (or rather not)
> >> in their workload.
> >
> > Your summary of my observations appears correct.
> >
> > I have grabbed an up-to-date /usr/src, built and
> > installed world, and built and installed a new
> > kernel on one of the nodes in my cluster.  It
> > has
> >
>
> It seems a perfect environment, just please make sure you made a
> debug-free userland (setting MALLOC_PRODUCTION in jemalloc basically).
>
> The first thing is, can you try reproducing your case? As far as I got
> it, for you it was enough to run N + small_amount of CPU-bound threads
> to show performance penalty, so I'd ask you to start with using dnetc
> or just your preferred cpu-bound workload and verify you can reproduce
> the issue.
> As it happens, please monitor the threads bouncing and CPU utilization
> via 'top' (you don't need to be 100% precise, jut to get an idea, and
> keep an eye on things like excessive threads migration, thread binding
> obsessity, low throughput on CPU).
> One note: if your workloads need to do I/O please use a tempfs or
> memory storage to do so, in order to reduce I/O effects at all.
> Also, verify this doesn't happen with 4BSD scheduler, just in case.
>
> Finally, if the problem is still in place, please recompile your
> kernel by adding:
> options KTR
> options KTR_ENTRIES=262144
> options KTR_COMPILE=(KTR_SCHED)
> options KTR_MASK=(KTR_SCHED)
>
> And reproduce the issue.
> When you are in the middle of the scheduling issue go with:
> # ktrdump -ctf > ktr-ule-problem-YOURNAME.out
>
> and send to the mailing list along with your dmesg and the
> informations on the CPU utilization you gathered by top(1).
>
> That should cover it all, but if you have further questions, please
> just go ahead.

Attilio,

I have placed several files at

http://troutmask.apl.washington.edu/~kargl/freebsd

dmesg.txt       --> dmesg for ULE kernel
summary         --> A summary that includes top(1) output of all runs.
sysctl.ule.txt  --> sysctl -a for the ULE kernel
ktr-ule-problem-kargl.out.gz

I performed a series of tests with both 4BSD and ULE kernels.  The 4BSD
and ULE kernels are identical except, of course, for the scheduler.
Both witness and invariants are disabled, and malloc has been compiled
without debugging.

Here's what I did.  On the master node in my cluster, I ran an OpenMPI
code that sends N jobs off to the node with the kernel of interest.
There is communication between the master and slaves to generate 16
independent chunks of data.  Note, there is no disk IO.  So, for
example, N=4 will start 4 essentially identical numerically intensive
jobs.  At the start of a run, the master node instructs each slave job
to create a chunk of data.  After the data is created, the slave sends
it back to the master and the master sends instructions to create the
next chunk of data.  This communication continues until the 16 chunks
have been assigned, computed, and returned to the master.

Here is a rough measurement of the problem with ULE and numerically
intensive loads.  This command is executed on the master

   time mpiexec -machinefile mf3 -np N sasmp sas.in

Since time is executed on the master, only the 'real' time is of
interest (the summary file includes user and sys times).  This command
is run at least 5 times for each N value and up to 10 times for some N
values with the ULE kernel.  The following table records the average
'real' time in seconds, and the number in (...) is the mean absolute
deviation.

#  N         ULE               4BSD
# -------------------------------------
#  4    223.27 (0.502)    221.76 (0.551)
#  5    404.35 (73.82)    270.68 (0.866)
#  6    627.56 (173.0)    247.23 (1.442)
#  7    475.53 (84.07)    285.78 (1.421)
#  8    429.45 (134.9)    223.64 (1.316)

These numbers to me demonstrate that ULE is not a good choice for
an HPC workload.
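
For anyone who wants to redo the arithmetic from the raw timings in the
summary file, here is a minimal Python sketch.  It assumes the figure in
(...) is the mean absolute deviation about the arithmetic mean; the post
does not say exactly how it was computed.  The names mean_abs_dev,
slowdown and TABLE are made up for this illustration, and the numbers in
TABLE are simply the averages from the table above.

```python
from statistics import mean


def mean_abs_dev(samples):
    """Mean absolute deviation about the arithmetic mean.

    Assumption: this is the statistic shown in (...) in the table;
    the original post does not spell out the formula.
    """
    m = mean(samples)
    return mean([abs(x - m) for x in samples])


# Average 'real' times in seconds, copied from the table above:
# N -> (ULE, 4BSD).  Hypothetical variable name.
TABLE = {
    4: (223.27, 221.76),
    5: (404.35, 270.68),
    6: (627.56, 247.23),
    7: (475.53, 285.78),
    8: (429.45, 223.64),
}


def slowdown(n):
    """Ratio of the ULE average to the 4BSD average for a given N."""
    ule, bsd = TABLE[n]
    return ule / bsd


if __name__ == "__main__":
    for n in sorted(TABLE):
        print(f"N={n}: ULE/4BSD = {slowdown(n):.2f}")
```

Feeding the per-run 'real' times for one N value to mean_abs_dev gives
the spread; slowdown shows the two kernels within about 1% of each other
at N=4, and ULE roughly 2.5 times slower at N=6.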

If you need more information, feel free to ask.  If you would
like access to the node, I can probably arrange that.  But, we
can discuss that off-line.

-- 
Steve