Date:      Wed, 23 May 2007 19:53:50 -0400
From:      Kris Kennaway <kris@obsecurity.org>
To:        Jeff Roberson <jroberson@chesapeake.net>
Cc:        arch@freebsd.org
Subject:   Re: sched_lock && thread_lock()
Message-ID:  <20070523235349.GA66762@xor.obsecurity.org>
In-Reply-To: <20070523155236.U9443@10.0.0.1>
References:  <20070520155103.K632@10.0.0.1> <20070523155236.U9443@10.0.0.1>



On Wed, May 23, 2007 at 03:56:35PM -0700, Jeff Roberson wrote:
> Resuming the original intent of this thread:
> 
> http://www.chesapeake.net/~jroberson/threadlock.diff
> 
> I have updated this patch to the most recent current.  I have included a
> scheduler called sched_smp.c that is a copy of ULE using per-cpu
> scheduler spinlocks.  There are also changes to be slightly more
> aggressive about updating the td_lock pointer when it has been blocked.

I have not yet found an application benchmark that really demonstrates
this (e.g. the SQL benchmark is now entirely bottlenecked by the
global select lock), but on a microbenchmark designed specifically to
test scheduler performance (sysbench --test=threads) this gives
dramatic results on an 8-core Opteron.

--
This test mode was written to benchmark scheduler performance, more
specifically the cases when a scheduler has a large number of threads
competing for some set of mutexes.

SysBench creates a specified number of threads and a specified number
of mutexes.  Each thread then runs requests that consist of locking a
mutex, yielding the CPU so the thread is placed back on the run queue
by the scheduler, then unlocking the mutex once the thread is
rescheduled.  For each request these actions are repeated in a loop,
so the more iterations are performed, the more concurrency is placed
on each mutex.
--

With the threadlock.diff changes and sched_smp there is a factor of
3.8 performance improvement compared to sched_ule.  4BSD actually
performs 30% better than ULE on this microbenchmark (it has been much
slower on all the application benchmarks I've done on this system),
but is still a factor of 2.8 slower than sched_smp.

Indeed, profiling confirms that with ULE and 4BSD the global
sched_lock is the only relevant lock, and is heavily contended.  This
contention largely goes away with the per-cpu scheduler locks in
sched_smp (but there is still some contention).

Profiling indicates there may be scope to as much as double the
performance of this benchmark through better load balancing and other
architectural changes (the system is still about 50% idle).

I am hoping to see some real application benchmark improvements on
sun4v when Kip gets it up and running again (should be soon), since
last time we looked the global sched_lock was a dominant effect there.

Kris

