Date:      Wed, 14 Jun 2006 13:48:01 -0700
From:      "Kip Macy" <kip.macy@gmail.com>
To:        "Bruce Evans" <bde@zeta.org.au>
Cc:        Scott Long <scottl@samsco.org>, kmacy@freebsd.org, Paul Saab <ps@mu.org>, Robert Watson <rwatson@freebsd.org>, David Xu <davidxu@freebsd.org>, Kris Kennaway <kris@obsecurity.org>, freebsd-performance@freebsd.org, danial_thom@yahoo.com
Subject:   Re: Initial 6.1 questions
Message-ID:  <b1fa29170606141348j4ebb3140q7c4960758d5b9784@mail.gmail.com>
In-Reply-To: <20060614133024.E1753@epsplex.bde.org>
References:  <20060612195754.72452.qmail@web33306.mail.mud.yahoo.com> <20060612210723.K26068@fledge.watson.org> <20060612203248.GA72885@xor.obsecurity.org> <200606130715.52425.davidxu@freebsd.org> <20060613105930.N34121@fledge.watson.org> <b1fa29170606132015p654e2877s1ec1da6184ce672e@mail.gmail.com> <20060614133024.E1753@epsplex.bde.org>

Hi Bruce -
Thanks for the lengthy response. I should not have brought up
interrupt handling, as a) it's a tertiary concern for me at the moment,
b) everyone has an opinion on it, and c) I could cut off several fingers
and still count on one hand the number of people who understand why
it's bad that ithreads go through the scheduler in the default case
(having a pcpu_runq only helps affinity).

To make it easier for future respondents to stay on topic, let me
explain my situation. I have ported FreeBSD to Sun's new UltraSPARC
architecture, sun4v. The current implementation, the T1, has 6-8 cores
with 4 threads per core. Unlike HTT on x86, these machines actually
have ample memory bandwidth (~26 GB/s), so threading can actually be
useful. On my 32-cpu system, benchmarks like supersmack max out at 9
threads - i.e. one can't get the system below 70% idle. Across the
board, context switches on Solaris/T1 take 2x as long as they do on
Linux/T1. Because of lock contention, FreeBSD in turn takes between 10%
and 100% longer than Solaris to context switch.

I would like to be able to tout FreeBSD as a strong competitor on the
sun4v architecture. At the moment I can't. Perhaps this isn't the
right forum for discussing my concerns - a freebsd-scalability list
might be in order.

                            -Kip

On 6/13/06, Bruce Evans <bde@zeta.org.au> wrote:
> On Tue, 13 Jun 2006, Kip Macy wrote:
>
> > ...
> > Why do I say "non-interrupt blocking"? Currently we have roughly a
> > half dozen locking primitives. The two that I am familiar with are
> > blocking and spinning mutexes. The general policy is to use blocking
> > locks except where a lock is used in interrupts or the scheduler. It
> > seems to me that in the scheduler interrupts only actually need to be
> > blocked across cpu_switch. Spin locks obviously have to be used
> > because a thread cannot very well context switch while it's in the
> > middle of context switching - however, provided td_critnest > 0, there
> > is no reason that interrupts need to be blocked. Currently sched_lock
> > is acquired in cpu_hardclock and statclock - so it does need to block
> > interrupts. There is no reason that these two functions couldn't be
> > run in ast().
>
> These functions are called from "fast" interrupt handlers, so they
> cannot use sleep locks.  They also cannot be run in ast(), since ast()
> is only run on return to user mode and uses sleep locks a lot.  Gathering
> of some user-mode statistics could be deferred until return to user
> mode, but this wouldn't work for kernel-mode statistics, since threads
> that never leave the kernel never return to user mode, and large
> changes would be required for the user-mode statistics (algorithmic
> changes: various, mainly to keep kernel-mode separate; locking: ast()
> uses sched_lock, so without large changes you would just move the
> problem (there would be up to hz + stathz extra calls to ast() per
> second); the statistics fields are all locked by sched_lock, and
> although this would not be needed for access in ast() some locking
> would still be needed for many which are accessed from elsewhere).
>
> What they (and all fast interrupt handlers or even "fast" interrupt
> handlers) can do better is use spin locks != sched_lock (and for fast
> interrupt handlers, != mtx_lock_spin(any)).  This is not easy to do
> in general, and is especially difficult for clock interrupt handlers,
> because all accesses to data accessed by a fast interrupt handler must
> be locked by a common lock (especially outside of the handlers) and
> clock interrupt handlers access a lot of data.  Currently, clock
> interrupt handlers use sched_lock and depend on sched_lock being used
> too much so that most of the data accessed by clock interrupt handlers
> is locked automatically.  Even then, there are large gaps in the locking.
> E.g., hardclock() starts by calling tc_ticktock() which mostly uses
> very delicate time-domain locking but sometimes races with syscalls
> that use sleep locking, most frequently by calling ntp_update_second().
> Most of kern_ntptime.c is documented (in comments) as being required
> to run at splclock() or higher, but it is actually all locked only by
> Giant, so sched_lock'ing and other spinlocking for it is neither
> necessary nor sufficient, and calling it correctly from a "fast" interrupt
> handler is impossible.
>
> In my kernel, fast interrupt handlers (and associated non-handler code
> that shares data) are actually fast (== low-latency &&
> !(very-large-footprint || takes-very-long)).  This requires:
> - mtx_lock_spin() to not mask interrupts, since masking interrupts gives
>    !low-latency at least in the UP case.
> - fast interrupt handlers to not use sched_lock, since sched_lock gives
>    very-large-footprint.
> - fast interrupt handlers to not use only mtx_lock_spin(), since that no
>    longer masks them.  My implementation actually uses simple_locks plus
>    explicit per-cpu interrupt disabling (as in RELENG_4).  This also avoids
>    having to turn off features like WITNESS and KTR which don't honor the
>    rules for fast interrupt handlers.
> - fast interrupt handlers to not use normal scheduling (things like
>    swi_sched()), since that uses sched_lock and is generally very
>    inefficient.  My implementation uses a combination of timeouts
>    and a hack to metamorphose into a SWI handler.  The latter is a
>    very expensive operation and should be avoided.  swi_sched() encourages
>    this inefficiency except in the SWI_DELAY case.  The SWI_DELAY case
>    only takes 50-100 times as many instructions as corresponding
>    scheduling in RELENG_4.  SWI_DELAY seems to be unused except in
>    my drivers.  My implementation enforces non-use of normal scheduling
>    and some other invalid data accesses (e.g., to curthread) by unmapping
>    PCPU data in fast interrupt handlers.
> - clock interrupt handlers to not be fast interrupt handlers.  They
>    have far too large a footprint to be fast interrupt handlers.  Locking
>    them is hard enough when they are only "fast" interrupt handlers.
>    I made them normal interrupt handlers and don't support "fast" interrupt
>    handlers.
>
> I get very few benefits from this.  Normal interrupt handlers for
> clocks are inefficient.  They don't take very long, but switching to
> them is inefficient.  I get lower interrupt latency, but this is
> not very important now that CPUs are very fast compared with i/o
> for all devices that I have.  I get the possibility of simpler
> locking in clock interrupt handlers, but haven't simplified or fixed
> their locking.  I get enforced smallness and complexity for fast
> interrupt handlers since large ones would be too complicated and
> normal scheduling and locking cannot be used.
>
> Bruce
>
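The td_critnest point quoted above (interrupts need not be masked while a spin lock is held, as long as the context switch itself is deferred) can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: struct sim_thread and the sim_* functions are hypothetical names invented here, not kernel code, and the "interrupt" is an ordinary function call.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for per-thread state; NOT the kernel's struct thread. */
struct sim_thread {
    int  critnest;    /* nesting depth of critical sections */
    bool owepreempt;  /* a switch was requested while critnest > 0 */
    int  switches;    /* simulated context switches actually taken */
    int  intr_seen;   /* interrupts delivered, deferred or not */
};

/* Simulated context switch: just counts and clears the pending flag. */
static void sim_switch(struct sim_thread *td)
{
    td->owepreempt = false;
    td->switches++;
}

/* Entering a critical section only bumps the nesting count. */
void sim_critical_enter(struct sim_thread *td)
{
    td->critnest++;
}

/* Leaving the outermost critical section runs any deferred switch. */
void sim_critical_exit(struct sim_thread *td)
{
    assert(td->critnest > 0);
    if (--td->critnest == 0 && td->owepreempt)
        sim_switch(td);
}

/*
 * Simulated interrupt that wants to preempt the running thread.
 * It is always delivered; only the switch is postponed when the
 * thread is inside a critical section.
 */
void sim_interrupt(struct sim_thread *td)
{
    td->intr_seen++;
    if (td->critnest > 0)
        td->owepreempt = true;
    else
        sim_switch(td);
}
```

In this model an interrupt that arrives between sim_critical_enter() and sim_critical_exit() is still counted in intr_seen, but switches only increases once the nesting count drops back to zero. It says nothing about handlers that themselves need sched_lock, which is the case Bruce addresses.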

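Bruce's suggestion that handlers use a spin lock other than sched_lock, with every access to the shared data going through that one common lock, might look roughly like the following in portable C11. This is a hedged sketch, not his implementation: struct tick_stats, TICK_STATS_INIT and the ts_* functions are hypothetical, and userspace cannot model the per-cpu interrupt disabling that non-handler code would also need in a kernel.

```c
#include <assert.h>
#include <stdatomic.h>

/* Data touched by the handler, guarded by its own small lock. */
struct tick_stats {
    atomic_flag   lock;     /* private spin lock, not a global scheduler lock */
    unsigned long pending;  /* ticks recorded but not yet processed */
};

#define TICK_STATS_INIT { ATOMIC_FLAG_INIT, 0 }

static void ts_lock(struct tick_stats *ts)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&ts->lock, memory_order_acquire))
        ;
}

static void ts_unlock(struct tick_stats *ts)
{
    atomic_flag_clear_explicit(&ts->lock, memory_order_release);
}

/* Handler side: small footprint, only records that a tick happened. */
void ts_handler_tick(struct tick_stats *ts)
{
    ts_lock(ts);
    ts->pending++;
    ts_unlock(ts);
}

/*
 * Non-handler side: must take the same lock. In a real kernel this
 * path would also have to keep the handler from running on the local
 * CPU while the lock is held, or the handler could spin forever
 * against the code it interrupted.
 */
unsigned long ts_drain(struct tick_stats *ts)
{
    unsigned long n;

    ts_lock(ts);
    n = ts->pending;
    ts->pending = 0;
    ts_unlock(ts);
    return n;
}
```

The heavier processing is left to whoever calls ts_drain(), so the handler's locked region stays a single increment. The difficulty described in the quoted text is that clock handlers touch far more data than one counter, so finding one common lock for all of it is much harder than this sketch suggests.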