Date:      Wed, 5 Jul 2006 09:48:16 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Peter Wemm <peter@wemm.org>
Cc:        Daniel Eischen <deischen@freebsd.org>, threads@freebsd.org, David Xu <davidxu@freebsd.org>, Julian Elischer <julian@elischer.org>, freebsd-threads@freebsd.org
Subject:   Re: Strawman proposal: making libthr default thread implementation?
Message-ID:  <20060705092048.P70011@fledge.watson.org>
In-Reply-To: <200607041819.05510.peter@wemm.org>
References:  <20060703101554.Q26325@fledge.watson.org> <200607042204.52572.davidxu@freebsd.org> <44AAC47F.2040508@elischer.org> <200607041819.05510.peter@wemm.org>


On Tue, 4 Jul 2006, Peter Wemm wrote:

> Because Linux was the most widely and massively deployed threading system 
> out there, people tended to write (or modify) their applications to work 
> best with those assumptions.  ie: keep pthread mutex blocking to an absolute 
> minimum, and not care about kernel blocking.
>
> However, with the SA/KSE model, our tradeoffs are different.  We implement 
> pthread mutex blocking more quickly (except for UTS bugs that can make it 
> far slower), but we make blocking in kernel context significantly higher 
> cost than the 1:1 case, probably as much as double the cost. For 
> applications that block in the kernel a lot instead of on mutexes, this is a 
> big source of pain.
>
> Since most of the applications that we're called to run are written with the 
> linux behavior in mind, when our performance is compared against linux we're 
> the ones that usually come off the worst.

The problem I've been running into is similar but different.  The reason for 
my asking about libthr being the default is that, in practice, our performance 
optimization advice for a host of threaded applications has been "Switch to 
libthr".  This causes quite a bit of complexity from a network stack 
optimization perspective, because the behavior of threading in threaded 
network/IPC applications changes enormously if the threading model is changed. 
As a result, the optimization strategies differ greatly.  To motivate this, 
let me give you an example.

Widely distributed MySQL benchmarks are basically kernel IPC benchmarks; on 
multi-processor systems, they primarily measure context switching, scheduling, 
network stack overhead, and network stack parallelism.  However, the locking 
hot spots differ significantly based on the threading model used.  There are 
two easily identified reasons for this:

- Libpthread "rate limits" threads entering the kernel in the run/running
   state, resulting in less contention on per-process sleep mutexes.

- Libthr has greater locality of behavior, in that the mapping of thread
   activities to kernel-visible threads is more direct and stable.

Consider the case of an application that makes frequent short accesses to file 
descriptors -- for example, by sending lots of short I/Os on a set of UNIX 
domain sockets from various worker threads, each performing transactions on 
behalf of a client via IPC.  This is, FYI, a widely deployed programming 
approach, and is not limited to MySQL.  The various user threads will be 
constantly looking up file descriptor numbers in the file descriptor array; 
often, the same thread will look up the same number several times (accept, 
i/o, i/o, i/o, ..., close).  This results in very high contention on the file 
descriptor array mutex, even though individual uses are short.
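
To make the access pattern concrete, here is a minimal sketch of the kind of 
worker loop I have in mind; it is illustrative only (none of this is MySQL or 
netrate code), but every accept(), read(), write(), and close() below 
translates a descriptor number through the shared per-process file descriptor 
table, so with many workers the table lock is touched constantly even though 
each individual hold is short.

#include <sys/socket.h>
#include <sys/un.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define NWORKERS        8

static int listen_fd;           /* UNIX domain listen socket shared by workers */

static void *
worker(void *arg)
{
        char buf[512];
        ssize_t n;
        int fd;

        (void)arg;
        for (;;) {
                /*
                 * Each system call below looks the descriptor up in the
                 * shared per-process file descriptor table.
                 */
                fd = accept(listen_fd, NULL, NULL);
                if (fd < 0)
                        continue;
                while ((n = read(fd, buf, sizeof(buf))) > 0)
                        (void)write(fd, buf, n);
                close(fd);
        }
        return (NULL);
}

int
main(void)
{
        struct sockaddr_un sun;
        pthread_t tid[NWORKERS];
        int i;

        listen_fd = socket(PF_LOCAL, SOCK_STREAM, 0);
        memset(&sun, 0, sizeof(sun));
        sun.sun_family = AF_LOCAL;
        strlcpy(sun.sun_path, "/tmp/demo.sock", sizeof(sun.sun_path));
        unlink(sun.sun_path);
        bind(listen_fd, (struct sockaddr *)&sun, sizeof(sun));
        listen(listen_fd, 128);

        for (i = 0; i < NWORKERS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        pthread_join(tid[0], NULL);     /* workers run until killed */
        return (0);
}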

In practice, libpthread sees somewhat lower contention: in the presence of 
adaptive mutexes, kernel threads spin rather than blocking, so libpthread does 
not push further threads in to contend on the lock.  However, one of the more 
interesting optimizations to explore involves "loaning" file descriptors to 
threads in order to take advantage of locality of reference: repeated access 
to the same fd by the owning thread becomes cheaper, but revocation of the 
loan for use by another thread becomes more expensive.  In libthr, we have 
lots of locality of reference, because user threads map 1:1 to kernel threads; 
in libpthread, this is not the case, as user threads float across kernel 
threads, and even if they are repeatedly mapped to the same kernel thread, 
blocking makes their execution on that kernel thread discontinuous.
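
To illustrate the shape of that tradeoff, here is a rough sketch of what a 
descriptor "loan" might look like; this is not proposed kernel code, and all 
of the names (fd_loan, loan_generation, and so on) are invented for the 
example.  The owning thread revalidates its cached translation with a single 
generation check, while anyone else who needs the descriptor must bump the 
generation -- the expensive revocation step -- before taking the normal 
locked lookup path.

#include <stdatomic.h>
#include <stddef.h>

struct file;                            /* opaque per-open-file object */

struct fd_loan {
        int              fl_fd;         /* descriptor number on loan */
        struct file     *fl_fp;         /* cached fd -> file translation */
        unsigned         fl_gen;        /* generation when the loan was taken */
};

static _Atomic unsigned loan_generation;        /* bumped on any revocation */

/*
 * Fast path: the owning thread re-uses its loan without touching the
 * file descriptor table lock.
 */
static struct file *
fd_lookup_loaned(struct fd_loan *loan, int fd)
{
        if (loan->fl_fp != NULL && loan->fl_fd == fd &&
            loan->fl_gen == atomic_load(&loan_generation))
                return (loan->fl_fp);
        return (NULL);          /* fall back to the locked table lookup */
}

/*
 * Slow path: another thread (or close()) invalidates all outstanding
 * loans before touching the descriptor.
 */
static void
fd_loan_revoke_all(void)
{
        atomic_fetch_add(&loan_generation, 1);
}

The point of the sketch is just the asymmetry: the common, locality-rich case 
is a couple of loads and compares, while the uncommon cross-thread case pays 
for a global invalidation.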

This makes things tricky for someone working on reducing contention in the 
kernel as the number of threads increases: do I optimize for libpthread, which 
offers little or no locality of reference with respect to mapping user thread 
behavior to kernel threads, or do I optimize for libthr, which offers high 
locality of reference?

Since our stock advice is to run libthr for high performance applications, the 
design choice should be clear: I should optimize for libthr.  However, in 
doing so, I would likely pessimize libpthread performance heavily: I would 
essentially guarantee that heuristics based on user thread locality fail with 
moderate frequency, since under libpthread the per-kernel-thread working set 
of kernel objects is significantly larger.

FWIW, you can quite clearly measure the difference in file descriptor array 
lock contention using the http/httpd micro-benchmarks in 
src/tools/tools/netrate.  If you run without threading, performance is better, 
in significant part because there is much less contention.  This is an 
interesting and apparently counter-intuitive observation: many people believe 
that the reduced context switching and greater cache locality of threaded 
applications always result in improved performance.  This is not true for a 
number of important workloads -- by operating with more shared data 
structures, contention on those shared data structures is increased, reducing 
performance.  Comparing the two threading models, you see markedly better 
libpthread performance under extremely high load involving many threads with 
small transactions, as libpthread provides heuristically better management of 
kernel load.  This advantage does not carry over to real-world application 
loads, however, which tend to use smaller worker thread pools with sequences 
of locality-rich transactions, which is why libthr performs better as the 
workload approaches real-world conditions.  This micro-benchmark makes for 
quite an interesting study piece, as you can easily vary the thread/proc 
model, the number of workers, and the transaction size, giving pretty clear 
performance curves to compare.
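
For reference, the thread/proc knob boils down to something like the following 
sketch (serve() here is a hypothetical transaction loop, not the actual 
netrate http/httpd code): forked workers each receive a private copy of the 
file descriptor table, so their descriptor lookups never contend with one 
another, while pthread workers share a single table and therefore a single 
set of per-process locks.

#include <pthread.h>
#include <unistd.h>

static int listen_fd;                   /* shared listen socket, set by caller */

static void
serve(int fd)
{
        /* Hypothetical accept/transaction loop would go here. */
        (void)fd;
        for (;;)
                pause();
}

static void *
thread_main(void *arg)
{
        (void)arg;
        serve(listen_fd);
        return (NULL);
}

void
start_workers(int fd, int nworkers, int use_threads)
{
        pthread_t tid;
        int i;

        listen_fd = fd;
        for (i = 0; i < nworkers; i++) {
                if (use_threads)
                        pthread_create(&tid, NULL, thread_main, NULL);
                else if (fork() == 0) {
                        /* Child: private copy of the fd table. */
                        serve(listen_fd);
                        _exit(0);
                }
        }
}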

Anyhow, my main point in raising this thread was actually oriented entirely on 
the initial observation, which is that in practice, we find ourselves telling 
people who care about performance to use libthr.  If our advice is always "use 
libthr instead of the default", that suggests we have a problem with the 
default.  Switching the default requires an informed decision: what do we 
lose, not just what do we gain.  Dan has now answered this question -- we lose 
support for a number of realtime scheduling primitives if we switch today 
without further work.

I think the discussion of the future of M:N support is also critical, though, 
as it has an immediate impact on kernel optimization strategies, especially as 
number of CPUs grows.  In case anyone failed to notice, it's now possible to 
buy hardware with 32 "threads" for <$10,000, and the future appears relatively 
clear -- parallelism isn't just for high-end servers, it now appears in 
off-the-shelf notebook hardware, and appears to be the way that vendors are 
going to continue to improve performance.  Having spent the last five years 
working on threading and SMP, we're well-placed to support this hardware, but 
it requires us to start consolidating our gains now, which means 
deciding what the baseline is for optimization when it comes to threaded 
applications.

Robert N M Watson
Computer Laboratory
University of Cambridge


