Date:      Wed, 21 Dec 2011 16:52:50 -0800
From:      Steve Kargl <sgk@troutmask.apl.washington.edu>
To:        Attilio Rao <attilio@freebsd.org>
Cc:        Andrey Chernov <ache@nagual.pp.ru>, George Mitchell <george+freebsd@m5p.com>, Doug Barton <dougb@freebsd.org>, freebsd-stable@freebsd.org
Subject:   Re: SCHED_ULE should not be the default
Message-ID:  <20111222005250.GA23115@troutmask.apl.washington.edu>
In-Reply-To: <CAJ-FndD0vFWUnRPxz6CTR5JBaEaY3gh9y7-Dy6Gds69_aRgfpg@mail.gmail.com>
References:  <4EE1EAFE.3070408@m5p.com> <CAJ-FndBSOS3hKYqmPnVkoMhPmowBBqy9-+eJJEMTdoVjdMTEdw@mail.gmail.com> <20111215215554.GA87606@troutmask.apl.washington.edu> <CAJ-FndD0vFWUnRPxz6CTR5JBaEaY3gh9y7-Dy6Gds69_aRgfpg@mail.gmail.com>

On Fri, Dec 16, 2011 at 12:14:24PM +0100, Attilio Rao wrote:
> 2011/12/15 Steve Kargl <sgk@troutmask.apl.washington.edu>:
> > On Thu, Dec 15, 2011 at 05:25:51PM +0100, Attilio Rao wrote:
> >>
> >> I basically went through all the e-mails you just sent, identified 4
> >> real reports we could work on, and summarized them in the attached
> >> Excel file.
> >> I'd like George, Steve, Doug, Andrey, and Mike to review the few
> >> data points there and add more, if they want, or make further
> >> clarifications, in particular about whether Xorg is present (or not)
> >> in their workload.
> >
> > Your summary of my observations appears correct.
> >
> > I have grabbed an up-to-date /usr/src, built and
> > installed world, and built and installed a new
> > kernel on one of the nodes in my cluster.  It
> > has
> >
> 
> It seems like a perfect environment; just please make sure you built a
> debug-free userland (basically, by setting MALLOC_PRODUCTION for jemalloc).
> 
> The first thing is, can you try reproducing your case? As I understood
> it, for you it was enough to run N + a few extra CPU-bound threads to
> show the performance penalty, so I'd ask you to start with dnetc or
> your preferred CPU-bound workload and verify that you can reproduce
> the issue.
> While it runs, please monitor the thread bouncing and CPU utilization
> via 'top' (you don't need to be 100% precise, just get an idea, and
> keep an eye on things like excessive thread migration, overly sticky
> thread binding, and low CPU throughput).
> One note: if your workload needs to do I/O, please use a tmpfs or
> other memory-backed storage for it, in order to minimize I/O effects.
> Also, verify this doesn't happen with the 4BSD scheduler, just in case.
> 
> Finally, if the problem is still in place, please recompile your
> kernel by adding:
> options KTR
> options KTR_ENTRIES=262144
> options KTR_COMPILE=(KTR_SCHED)
> options KTR_MASK=(KTR_SCHED)
> 
> And reproduce the issue.
> When you are in the middle of the scheduling issue, run:
> # ktrdump -ctf > ktr-ule-problem-YOURNAME.out
> 
> and send it to the mailing list along with your dmesg and the
> information on CPU utilization you gathered with top(1).
> 
> That should cover it all, but if you have further questions, please
> just go ahead.

Attilio,

I have placed several files at

http://troutmask.apl.washington.edu/~kargl/freebsd

dmesg.txt      --> dmesg for ULE kernel
summary        --> A summary that includes top(1) output of all runs.
sysctl.ule.txt --> sysctl -a for the ULE kernel
ktr-ule-problem-kargl.out.gz --> gzipped KTR dump captured while the problem was occurring

I performed a series of tests with both 4BSD and ULE kernels.
The 4BSD and ULE kernels are identical except, of course, for the
scheduler.  WITNESS and INVARIANTS are disabled in both, and malloc
has been compiled without debugging.

Here's what I did.  On the master node in my cluster, I ran an
OpenMPI code that sends N jobs off to the node with the kernel
of interest.  There is communication between the master and
slaves to generate 16 independent chunks of data.  Note that there
is no disk I/O.  So, for example, N=4 will start 4 essentially
identical, numerically intensive jobs.  At the start of a run,
the master node instructs each slave job to create a chunk of
data.  After the data is created, the slave sends it back to the
master and the master sends instructions to create the next chunk
of data.  This communication continues until the 16 chunks have
been assigned, computed, and returned to the master.  
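
The communication pattern is just the usual MPI master/worker loop.
Purely for illustration, here is a minimal sketch of it in C with MPI;
this is not the sasmp source, and the chunk length, message tags, and
the compute_chunk() stand-in below are made up:

/*
 * Sketch of the master/worker pattern described above (hypothetical;
 * run with at least 2 MPI ranks, e.g. mpiexec -np 5 ./sketch).
 */
#include <mpi.h>

#define NCHUNKS  16       /* 16 independent chunks of data */
#define CHUNKLEN 1024     /* placeholder chunk length */
#define TAG_WORK 1
#define TAG_DATA 2
#define TAG_STOP 3

/* Stand-in for the real CPU-bound computation; no disk I/O. */
static void
compute_chunk(int id, double *buf)
{
	for (int i = 0; i < CHUNKLEN; i++)
		buf[i] = id + i * 1.0e-6;
}

int
main(int argc, char **argv)
{
	double buf[CHUNKLEN];
	int rank, nproc;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &nproc);

	if (rank == 0) {	/* master */
		int next = 0, done = 0;
		MPI_Status st;

		/* Seed each slave with one chunk. */
		for (int w = 1; w < nproc && next < NCHUNKS; w++, next++)
			MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

		/* Collect a chunk, hand out the next one, repeat. */
		while (done < NCHUNKS) {
			MPI_Recv(buf, CHUNKLEN, MPI_DOUBLE, MPI_ANY_SOURCE,
			    TAG_DATA, MPI_COMM_WORLD, &st);
			done++;
			if (next < NCHUNKS) {
				MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE,
				    TAG_WORK, MPI_COMM_WORLD);
				next++;
			}
		}

		/* All 16 chunks are back; tell every slave to exit. */
		for (int w = 1; w < nproc; w++)
			MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
	} else {		/* slave */
		MPI_Status st;
		int id;

		for (;;) {
			MPI_Recv(&id, 1, MPI_INT, 0, MPI_ANY_TAG,
			    MPI_COMM_WORLD, &st);
			if (st.MPI_TAG == TAG_STOP)
				break;
			compute_chunk(id, buf);
			MPI_Send(buf, CHUNKLEN, MPI_DOUBLE, 0, TAG_DATA,
			    MPI_COMM_WORLD);
		}
	}

	MPI_Finalize();
	return (0);
}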

Here is a rough measurement of the problem with ULE under numerically
intensive loads.  This command is executed on the master:

time mpiexec -machinefile mf3 -np N sasmp sas.in

Since time(1) is executed on the master, only the 'real' time is of
interest (the summary file includes user and sys times).  The command
was run 5 times for each N value, and up to 10 times for some N values
with the ULE kernel.  The following table records the average 'real'
time in seconds; the number in (...) is the mean absolute deviation.
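
For clarity, the mean absolute deviation is just the average of
|t_i - mean| over the runs for a given N.  A trivial sketch in C
(the values in t[] are placeholders, not the measured times, which
are in the summary file):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* Placeholder 'real' times in seconds, NOT the actual runs. */
	double t[] = { 223.1, 223.8, 222.9, 223.5, 223.0 };
	int i, n = sizeof(t) / sizeof(t[0]);
	double mean = 0.0, mad = 0.0;

	for (i = 0; i < n; i++)
		mean += t[i];
	mean /= n;			/* average 'real' time */
	for (i = 0; i < n; i++)
		mad += fabs(t[i] - mean);
	mad /= n;			/* mean absolute deviation */
	printf("%.2f (%.3f)\n", mean, mad);
	return (0);
}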

#  N         ULE             4BSD
# -------------------------------------
#  4    223.27 (0.502)   221.76 (0.551)
#  5    404.35 (73.82)   270.68 (0.866)
#  6    627.56 (173.0)   247.23 (1.442)
#  7    475.53 (84.07)   285.78 (1.421)
#  8    429.45 (134.9)   223.64 (1.316)

To me, these numbers demonstrate that ULE is not a good choice
for an HPC workload.

If you need more information, feel free to ask.  If you would
like access to the node, I can probably arrange that.  But,
we can discuss that off-line.

-- 
Steve


