Date: Thu, 15 Dec 2011 21:46:27 +0200
From: Ivan Klymenko <fidaj@ukr.net>
To: Attilio Rao
Cc: "O. Hartmann", Current FreeBSD, freebsd-stable@freebsd.org, freebsd-performance@freebsd.org, Jeremy Chadwick
Subject: Re: SCHED_ULE should not be the default

On Thu, 15 Dec 2011 20:02:44 +0100, Attilio Rao wrote:

> 2011/12/15 Jeremy Chadwick:
> > On Thu, Dec 15, 2011 at 05:26:27PM +0100, Attilio Rao wrote:
> >> 2011/12/13 Jeremy Chadwick:
> >> > On Mon, Dec 12, 2011 at 02:47:57PM +0100, O. Hartmann wrote:
> >> >> > Not fully right, boinc defaults to run on idprio 31 so this isn't an issue. And yes, there are cases where SCHED_ULE shows much better performance than SCHED_4BSD. [...]
> >> >>
> >> >> Do we have any proof at hand for such cases where SCHED_ULE performs much better than SCHED_4BSD? Whenever the subject comes up, it is mentioned that SCHED_ULE has better performance on boxes with ncpu > 2. But in the end I see contradictory statements here. People complain about poor performance (especially in scientific environments), while others counter that this is not the case.
> >> >>
> >> >> Within our department, we developed a highly scalable code for planetary science purposes on imagery. It utilizes GPUs via OpenCL if present; otherwise it grabs as many cores as it can. By the end of this year I'll get a new desktop box based on Intel's new Sandy Bridge-E architecture with plenty of memory. If the colleague who developed the code is willing to perform some benchmarks on the same hardware platform, we'll benchmark both FreeBSD 9.0/10.0 and the most recent SUSE. For FreeBSD I also intend to look at performance with both of the available schedulers.
> >> >
> >> > This is in no way, shape or form the same kind of benchmark as what you're planning to do, but I thought I'd throw it out there for folks to take in as they see fit.
> >> >
> >> > I know folks were focused mainly on buildworld.
> >> >
> >> > I personally would find it interesting if someone with a higher-end system (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the same test (changing -jX to -j{numofcores} of course).
> >> >
> >> > --
> >> > | Jeremy Chadwick                            jdc at parodius.com |
> >> > | Parodius Networking                  http://www.parodius.com/ |
> >> > | UNIX Systems Administrator            Mountain View, CA, US   |
> >> > | Making life hard for others since 1977.          PGP 4BD6C0CB |
> >> >
> >> >
> >> > sched_ule
> >> > ===========
> >> > - time make -j2 buildworld
> >> >   1689.831u 229.328s 18:46.20 170.4% 6566+2051k 432+4264io 4565pf+0w
> >> > - time make -j2 buildkernel
> >> >   640.542u 87.737s 9:01.38 134.5% 6490+1920k 134+5968io 0pf+0w
> >> >
> >> >
> >> > sched_4bsd
> >> > ============
> >> > - time make -j2 buildworld
> >> >   1662.793u 206.908s 17:12.02 181.1% 6578+2054k 23750+4271io 6451pf+0w
> >> > - time make -j2 buildkernel
> >> >   638.717u 76.146s 8:34.90 138.8% 6530+1927k 6415+5903io 0pf+0w
> >> >
> >> >
> >> > software
> >> > ==========
> >> > * sched_ule test:  FreeBSD 8.2-STABLE, Thu Dec  1 04:37:29 PST 2011
> >> > * sched_4bsd test: FreeBSD 8.2-STABLE, Mon Dec 12 22:42:54 PST 2011
> >>
> >> Hi Jeremy,
> >> thanks for the time you spent on this.
> >>
> >> However, I wanted to ask/let you note 3 things:
> >> 1) Did you use two different code bases for the test? (one updated on December 1 and another one on December 12)
> >
> > No; src-all (/usr/src on this system) was not updated between December 1st and December 12th PST. I do believe I updated it today (15th PST). I can/will obviously hold off so that we have a consistent code base for comparing numbers between schedulers during buildworld and/or buildkernel.
> >
> >> 2) Please note that you should have repeated this test several times (basically until you get a standard deviation which is acceptable to ministat) and report the ministat output
> >
> > This is the first time I have heard of ministat(1). I'm pretty sure I see what it's for and how it applies to this situation, but boy, that man page could use some clarification (I have 3 people looking at this thing right now trying to figure out what means what in the graph :-) ). Anyway, graph or not, I see the point.
> >
> > Regarding multiple tests: yup, you're absolutely right, the only way to do it would be to run a sequence of tests repeatedly (probably 10 per scheduler). Reboots and rm -fr /usr/obj/* would be required after each test too, to guarantee empty kernel caches (of all types) consistently every time.
> >
> > What I posted was supposed to give people just a "general idea" of whether there was any gigantic difference between the two, and there really isn't. But, as others have stated (and you below), buildworld may not be an effective way to "benchmark" what we're trying to test.
> >
> > Hence me wondering exactly what would make for a good test. Example:
> >
> > 1. Run + background some program that "beats on things" (I really don't know what; creation/deletion of threads? CPU benchmark? bonnie++?), with output going to /dev/null.
> > 2. Run + background "time make -j2 buildworld" with output going to /dev/null.
> > 3. Record/save output from "time".
> > 4. rm -fr /usr/obj && shutdown -r now
> > 5. Repeat all steps ~10 times.
> > 6. Adjust the kernel configuration file to use the other scheduler.
> > 7. Repeat steps 1-5.
> >
> > What I'm trying to figure out is what #1 and #2 should be in the above example.
> >
> >> 3) The difference is less than 2%, which I suspect is statistically meaningless/the same
> >
> > Understood.
> >
> >> I'm not really even surprised ULE is not faster than 4BSD in this case, because buildworld/buildkernel tests are usually driven for the vast majority by I/O overhead rather than scheduler capacity. It would be more interesting to analyze how buildworld does while another type of workload is going on.
> >
> > Yup, agreed/understood, hence me trying to find out what would classify as a good stress test for all of this.
> >
> > I have a testbed system in my garage which I could set up to literally do all of this in a loop, meaning automate the entire above process and just let it go, writing stderr from time to a file (which wouldn't skew the results at all).
> >
> > Let me know what #1 and #2 above, re: "the workloads", should be and I'll be happy to set it up.
>
> My idea, in order to gather meaningful data for both ULE and 4BSD, would be to see how well they behave in the following situations:
> - 2 concurrent interactive workloads
> - 2 concurrent cpu-intensive workloads
> - mixed
>
> with the number of threads for each varying as: N/2, N, N + small_amount (1 or 2 or 3, etc.), N*2 (where N is the number of available CPUs), which automatically translates into:
>
> - 2 concurrent interactive and intensive (A and B workloads):
>   * A N/2 threads, B N/2 threads
>   * A N threads, B N/2 threads
>   * A N + small_amount threads, B N/2 threads
>   * A N*2 threads, B N/2 threads
>   * A N threads, B N threads
>   * A N + small_amount threads, B N threads
>   * A N*2 threads, B N threads
>   * A N + small_amount threads, B N + small_amount threads
>   * A N*2 threads, B N + small_amount threads
>   * A N*2 threads, B N*2 threads
>
> For the mixed case, instead, we should possibly try all 16 combinations, and it is likely the most interesting case, to be honest.
>
> About the workloads, we could use:
> interactive: buildworld and bonnie++ (I'm not totally sure whether bonnie++ lets you decide how many threads to run, but I'm sure we can replace it with something that really does)
> cpu-intensive: dnetc and SOMETHINGELSE (please propose something that can be set up very easily!)
> mixed case: buildworld and dnetc
>
> About the environment, I'd suggest the following things:
> - Try to boot with a maximum of 16 CPUs. I'm sure that past that point TLB shootdown overhead is going to be too overwhelming, make doesn't really scale well, and there could also be too much contention on vm_page_lock_queue for interactive threads.
> - Try to reduce the I/O effect by using tmpfs as storage for input and output data while running the benchmark.
> - Use 10.0 with both kernel and userland totally debug-free (please remember to set MALLOC_PRODUCTION for jemalloc), always at the same svn revision, with the only changes between runs being the scheduler switch and the number of threads.
>
> About the test itself, I'd suggest the following things:
> - After every test combination, please reboot the machine (for example, after you have tested the "A N/2 threads and B N/2 threads" case on sched_4bsd, reboot the machine before doing "A N threads and B N/2 threads").
> - For every test combination I suggest running the workloads 4 times, discarding the first run (but keeping the value!) and feeding the other three to ministat. Showing the "uncached" case against the average cached one will be much more informative than expected.
> - Expect ministat to report the result at 95% confidence (or beyond) for it to be valuable.
> - For every difference in performance we find, we should likely start to worry if it is 3% or bigger, and be very concerned from 5% upward.
>
> I think we already have some data showing ULE being broken in some cases (like George's and Steven's cases), but we really need to characterize it more, I think.
>
> Now, I understand this seems like a gigantic amount of work, but I think there are many people interested in working on this, and we may scatter these tests around to different testers to gather meaningful data.
>
> If it were me, I would start with the comparisons involving all the N and N + small_amount cases, which should be the most interesting ones.
>
> Do you have questions?
>
> Thanks,
> Attilio
>
>

Perhaps it makes sense to co-write a script to automate these actions, and place it in /usr/src/tools/sched/...
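
As a starting point, here is a rough, untested sketch of what one iteration of such a driver script might look like. It assumes it is launched once per boot (e.g. from /etc/rc.local), that the scheduler is still chosen at kernel build time, and the file names, workload invocations and run count are only illustrative placeholders, not a finished implementation:

    #!/bin/sh
    # sched_bench.sh -- sketch of one benchmark iteration (illustrative only).
    # The scheduler (SCHED_ULE or SCHED_4BSD) is still selected in the kernel
    # config; this script runs one workload combination per boot, records the
    # time(1) output, and reboots the box for the next iteration.

    RESULTS=/var/tmp/sched_bench            # where timings accumulate
    STATE=$RESULTS/iteration                # run counter, survives reboots
    RUNS=4                                  # 4 runs: discard the first, ministat the rest
    NCPU=$(sysctl -n hw.ncpu)
    SCHED=$(sysctl -n kern.sched.name)      # "ULE" or "4BSD" on the running kernel

    mkdir -p "$RESULTS"
    ITER=$( [ -f "$STATE" ] && cat "$STATE" || echo 0 )
    [ "$ITER" -ge "$RUNS" ] && exit 0       # all runs done for this kernel

    # Workload A: cpu-intensive load in the background (dnetc as proposed;
    # exact invocation and thread count are left to the tester).
    dnetc > /dev/null 2>&1 &
    APID=$!

    # Workload B: the build, timed; only time(1)'s summary line is kept.
    cd /usr/src && /usr/bin/time -a -o "$RESULTS/$SCHED-buildworld.txt" \
        make -j"$NCPU" buildworld > /dev/null 2>&1

    kill "$APID" 2> /dev/null
    chflags -R noschg /usr/obj && rm -rf /usr/obj/*    # next run starts cold

    echo $((ITER + 1)) > "$STATE"
    shutdown -r now

The collected time(1) lines would still have to be reduced to one number per line before ministat(1) can compare them, e.g. something like awk '{print $1}' on the per-scheduler files to pull out the "real" column, followed by ministat on the two resulting files.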