Date:      Thu, 20 Oct 2005 15:45:21 +1000 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Scott Long <scottl@samsco.org>
Cc:        cvs-src@FreeBSD.org, src-committers@FreeBSD.org, Andrew Gallatin <gallatin@cs.duke.edu>, cvs-all@FreeBSD.org, David Xu <davidxu@FreeBSD.org>
Subject:   Re: cvs commit: src/sys/amd64/amd64 cpu_switch.S machdep.c
Message-ID:  <20051020145234.H99720@delplex.bde.org>
In-Reply-To: <4355080C.302@samsco.org>
References:  <200510172310.j9HNAVPL013057@repoman.freebsd.org> <20051018094402.A29138@grasshopper.cs.duke.edu> <435501B9.4070401@samsco.org> <17237.1482.52148.283282@grasshopper.cs.duke.edu> <4355080C.302@samsco.org>

On Tue, 18 Oct 2005, Scott Long wrote:

[Excessive quoting retained since I want to comment on separate points.]

> Andrew Gallatin wrote:
>> Scott Long writes:
>>  > Andrew Gallatin wrote:
>>  > > David Xu [davidxu@FreeBSD.org] wrote:
>>  > >> davidxu     2005-10-17 23:10:31 UTC
>>  > >>
>>  > >>   FreeBSD src repository
>>  > >>
>>  > >>   Modified files:
>>  > >>     sys/amd64/amd64      cpu_switch.S machdep.c
>>  > >>   Log:
>>  > >>   Micro optimization for context switch. Eliminate code for saving gs.base
>>  > >>   and fs.base. We always update pcb.pcb_gsbase and pcb.pcb_fsbase
>>  > >>   when user wants to set them, in context switch routine, we only need to
>>  > >>   write them into registers, we never have to read them out from registers
>>  > >>   when thread is switched away. Since rdmsr is a serialization instruction,
>>  > >>   micro benchmark shows it is worthy to do.
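
For reference, the change amounts to something like the following
write-only switch path (my C sketch, not the actual cpu_switch.S
assembly; the pcb field names are from the log message, while the
wrmsr() helper and the MSR constants are assumptions):

#include <stdint.h>

#define MSR_FSBASE      0xc0000100u     /* architectural MSR numbers */
#define MSR_KGSBASE     0xc0000102u     /* user gs base while in kernel */

struct pcb {
        uint64_t pcb_fsbase;            /* kept current whenever the */
        uint64_t pcb_gsbase;            /* user sets a new base */
};

extern void wrmsr(uint32_t msr, uint64_t val);  /* kernel primitive */

/*
 * Switch-in path: write-only.  The old code also did serializing
 * rdmsr()s on switch-out to save the outgoing thread's bases; since
 * the pcb copies are always up to date, those reads can be dropped.
 */
static void
load_user_segbases(struct pcb *newpcb)
{
        wrmsr(MSR_FSBASE, newpcb->pcb_fsbase);
        wrmsr(MSR_KGSBASE, newpcb->pcb_gsbase);
}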

>>  > > Nice.  This reduces lmbench context switch latency by about
>>  > > 0.4us (7.2 -> 6.8us), and reduces TCP loopback latency by about
>>  > > 0.9us (36.1 -> 35.2) on my dual core 3800+

I wonder if this reduces the context switch latency from about 1.320
usec to 0.900 usec on my A64-3000.  The latency is only 0.520 usec in
i386 mode.  I use a TSC timecounter, of course.

The fastest loopback latency that I've seen is 5.638 usec under
Linux-2.2.9 on the same machine.  In Linux-2.6.10, it has regressed
to 17.1 usec.  In FreeBSD last year, it was 10.8 usec on the same
machine in i386 mode and 19.0 in amd64 mode.  So the A64 can almost
keep up with an AXP-1400 running a pre-SMPng version of FreeBSD where
it was 9.94 usec.

[... Nonsense by phk already snipped]

The timecounter is not used by schedulers, so the inefficiency of non-TSC
timecounters and its effect on context switching has nothing to do with
schedulers.  Schedulers use mainly tick counts, and intentionally don't
try hard to keep track of interrupt times because the fine-grained
timekeeping needed to keep track of interrupts would be too expensive.
It is still too expensive, but it is now done anyway (except for fast
interrupts), and it is still not used by schedulers.  The timestamps
taken by mi_switch() are used mainly by userland statistics utilities.
They are very useful for debugging and for otherwise understanding
system behaviour, but are sometimes too inefficient.
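
To make the distinction concrete, a sketch (the structures are
stand-ins, not the real scheduler or mi_switch() code; only ticks and
binuptime() are real kernel interfaces):

#include <stdint.h>

/* Minimal stand-ins for the real kernel structures. */
struct bintime { int64_t sec; uint64_t frac; };
struct thread { int td_lastrun; struct bintime td_runtime; };

extern int ticks;                           /* incremented HZ times/sec */
extern void binuptime(struct bintime *bt);  /* reads the timecounter */

/* Scheduler bookkeeping: one cheap integer read, no timecounter. */
static void
sched_note_run(struct thread *td)
{
        td->td_lastrun = ticks;
}

/*
 * mi_switch()-style timestamp: expensive with a PIO timecounter, and
 * consumed mainly by userland statistics (rusage, top, etc.).
 */
static void
switch_timestamp(struct thread *td)
{
        binuptime(&td->td_runtime);
}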

>>  > > It is a shame we can't find a way to use the TSC as a timecounter on
>>  > > SMP systems.  It seems that about 40% of the context switch time is
>>  > > spent just waiting for the PIO read of the ACPI-fast or i8254 to
>>  > > return.

It seems to be more like 95% in my case.

>>  > > Drew
>>
>>  > The TSC represents the clock rate of the CPU, and thus can vary wildly
>>  > when thermal and power management controls kick in, and there is no way
>>  > to know when it changes.  Because of this, I think that it's
>>  > practically useless on Pentium-Mobile and Pentium-M chips, among many
>>  > others.  There is also the issue of multiple CPUs having to keep their
>>  > TSC's somewhat in sync in order to get consistent counting in the
>>  > system.  The best that you can do is to periodically read a stable
>>  > counter and try to recalibrate, but then you'll likely start getting
>>  > wild operational variances.

I agree that it's too hard to sync the TSC on systems with power
management.  It would be easy enough to sync with the i8254 every HZ,
but even that would give extreme nonlinearities when the TSC frequency
jumps up or down.  Jumping up is the worst case.  E.g., if the TSC
frequency starts at 1GHz and HZ is 1000, we expect the TSC count to
increment by 10^6 in the next msec.  If the TSC frequency jumps up to
2GHz, then the TSC count will actually increment by 2*10^6.  I see
nothing better than recalibrating half way into the next msec (when
the TSC count reaches 10^6) and then wildly slewing the TSC clock to
keep the 10^6 increment in the count expected in the next half msec
from causing another half-msec error.
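
To put numbers on the failure mode (a toy sketch of the naive per-tick
recalibration described above; everything here is assumed, nothing is
proposed FreeBSD code):

#include <stdint.h>

#define HZ      1000                    /* clock interrupts per second */

static uint64_t tsc_freq = 1000000000;  /* last calibrated: 1GHz */
static uint64_t tsc_at_tick;            /* TSC value at the last tick */

/* Called from each clock interrupt with a freshly read TSC. */
static void
tsc_tick(uint64_t tsc_now)
{
        uint64_t expected = tsc_freq / HZ;       /* 10^6 at 1GHz */
        uint64_t actual = tsc_now - tsc_at_tick; /* 2*10^6 after a jump
                                                    to 2GHz */

        /*
         * Naive rescaling from the observed increment.  The msec in
         * which the frequency jumped has already been counted wrongly,
         * so the clock must then be slewed hard to absorb the
         * half-msec error without stepping backwards.
         */
        if (actual != expected)
                tsc_freq = actual * HZ;
        tsc_at_tick = tsc_now;
}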

>> As I pointed out in another thread, both linux and solaris do it.
>> Solaris seems to have a nice algorithm for keeping things in sync, and
>> accounting for the TSC getting cleared after suspend/resume etc.  At
>> my level of understanding, this argument is nothing more than "but
>> Mom, all the other kids are doing it".  I was just hoping that
>> somebody with real understanding could pick up on it.
>
> Steering multiple TSC's together isn't that hard and there are plenty of
> examples, as you point out.  Accounting for the changes due to thermal
> and power management (note that this isn't the same problem as suspend
> and resume) is what worries me.

Possibly the systems with power management don't matter here.  Power
management is currently only essential for portable machines, and the
portable machines won't have multi-Gb/s networks to keep up with and
might not have such strict real time requirements.

>>  >				 It's a shame that a PIO read is still so
>>  > expensive.  I'd hate to see just how bad your benchmark becomes when
>>  > ACPI-slow is used instead of ACPI-fast.
>> 
>> It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
>> switch is otherwise 4us, it adds up. i8254 is much worse on this
>> system (6.5us).

I don't know why your system is so slow.  I get ~50 nsec for TSC,
~1000 nsec for ACPI-fast, ~3000 nsec for ACPI-slow and ~4000 nsec for
i8254.  But PIO keeps getting slower even in absolute terms.  My
(nearly) newest system (nForce2) has an ISA PIO read time of 1133 nsec
for the i8254 registers, whereas my first PCI system (with an early
Intel chipset) has a read time of 703 nsec and a write time of 1180
nsec.  The nForce2 system also has a PCI PIO read time of 290 nsec for
the same PCI card that can be read in 125 nsec (overclocked) or 150
nsec (not overclocked) on a KT266A system.
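
For anyone who wants numbers on their own machine, a crude userland
measurement is easy (my sketch, not lmbench; select the counter to
test with the kern.timecounter.hardware sysctl between runs, and note
that syscall overhead is included, so this overstates the raw read
cost a little):

#include <stdio.h>
#include <time.h>

#define N 1000000

int
main(void)
{
        struct timespec t0, t1, ts;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < N; i++)
                clock_gettime(CLOCK_MONOTONIC, &ts);  /* one read each */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("~%.0f nsec per clock_gettime()\n",
            ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / N);
        return (0);
}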

>>  > I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
>>  > of an idea.  Having preemption in the kernel means that ithreads can run
>>  > right away instead of having to wait for a tick, and various fixes to
>>  > 4BSD in the past year have eliminated bugs that would make the CPU wait
>>  > for up to a tick to schedule a thread.  So all we're getting now is a
>>  > 10x increase in scheduler overhead, including reading the timecounters.
>> 
>> Yeah.  I moved mine back to hz=1000 when I noticed 4000 interrupts/sec
>> on an idle system.
>
> Do you mean 1000 or 100 here?  Anyways, the high clock interrupt rate is
> so that we can use the local apic clock to get the various system ticks
> that we have instead of continuing to fight motherboards that no longer
> hook up the 8259 in a sane way.  This is why 5.x doesn't work well on a
> number of new motherboards (nvidia ones especially) but 6.x works just
> fine.

[Drew actually meant 100.]

I use 100 and never downgraded to use 1000 except for testing how bad
it is.  The default number is now up to <number of CPUs> * 2 * HZ.
E.g., it is 4000 on sledge.freebsd.org (2 CPUs * 2 * 1000).  While
4000 interrupts/sec can be handled easily by any new machine, 4000 is
a disgustingly large number to use for clock interrupts.  Have a look
at vmstat -i output on almost any machine.  On most machines in the
freebsd cluster, the total number of interrupts is dominated by clock
interrupts even with HZ = 100.

The main use for a large HZ is to support low-quality hardware and
applications that need or want to poll very often.

Bruce


