Date:      Tue, 5 Jun 2012 06:51:00 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        John Baldwin <jhb@FreeBSD.org>
Cc:        Gianni <gianni@FreeBSD.org>, Alan Cox <alc@rice.edu>, Alexander Kabaev <kan@FreeBSD.org>, Attilio Rao <attilio@FreeBSD.org>, Konstantin Belousov <kib@FreeBSD.org>, freebsd-arch@FreeBSD.org, Konstantin Belousov <kostikbel@gmail.com>
Subject:   Re: Fwd: [RFC] Kernel shared variables
Message-ID:  <20120605054930.H3236@besplex.bde.org>
In-Reply-To: <201206041101.57486.jhb@freebsd.org>
References:  <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2%2BoYo%2BwwT4ipA@mail.gmail.com> <20120603051904.GG2358@deviant.kiev.zoral.com.ua> <20120603184315.T856@besplex.bde.org> <201206041101.57486.jhb@freebsd.org>

On Mon, 4 Jun 2012, John Baldwin wrote:

> On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote:
>> On Sun, 3 Jun 2012, Konstantin Belousov wrote:
>>> What are timehands offsets?  Do you mean things like leap seconds?
>>
>> Yes.  binuptime() is:
>>
>> % void
>> % binuptime(struct bintime *bt)
>> % {
>> % 	struct timehands *th;
>> % 	u_int gen;
>> %
>> % 	do {
>> % 		th = timehands;
>> % 		gen = th->th_generation;
>> % 		*bt = th->th_offset;
>> % 		bintime_addx(bt, th->th_scale * tc_delta(th));
>> % 	} while (gen == 0 || gen != th->th_generation);
>> % }
>>
>> Without the kernel providing th->th_offset, you have to do lots of ntp
>> handling for yourself (compatibly with the kernel) just to get an
>> accuracy of 1 second.  Leap seconds don't affect CLOCK_MONOTONIC, but
>> they do affect CLOCK_REALTIME which is the clock id used by
>> gettimeofday().  For the former, you only have to advance the offset
>> for yourself occasionally (compatibly with the kernel) and manage
>> (compatibly with the kernel, especially in the long term) ntp slewing
>> and other syscall/sysctl kernel activity that micro-adjusts th->th_scale.
>
> I think duplicating this logic in userland would just be wasteful.  I have

Sure.  I modestly proposed it.

> a private fast gettimeofday() at my current job and it works by exporting
> the current timehands structure (well, the equivalent) to userland.  The
> userland bits then fetch a copy of the details and do the same as bintime().

How do you keep this up to date, especially for leap seconds?

> (I move the math (bintime_addx() and the multiply) out of the loop, however.)

My version has a comment saying to do that, but I just noticed that
it wouldn't work so well -- the timehands fields would have to be
copied to local variables while under protection of the generation
count, so it would give messier code to optimize a case that occurs
_very_ rarely.
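
A minimal sketch of what moving the math out of the loop involves, assuming
simplified stand-ins for the kernel types.  The names sketch_binuptime(),
read_counter() and sketch_counter are hypothetical, the counter mask applied
by tc_delta() is omitted, and no memory barriers are shown (writes are
assumed to be ordered, as discussed below).  Every field that the final
calculation needs is copied to a local while the generation check protects
it, and only then is the multiply and add done:

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel structures (assumptions). */
struct bintime {
	int64_t  sec;
	uint64_t frac;		/* binary fraction of a second */
};

struct timehands {
	uint64_t          th_scale;
	uint32_t          th_offset_count;
	struct bintime    th_offset;
	volatile uint32_t th_generation;	/* 0 while being updated */
};

/* Add a 64-bit fraction to a bintime, carrying into the seconds. */
static inline void
bintime_addx(struct bintime *bt, uint64_t x)
{
	uint64_t old = bt->frac;

	bt->frac += x;
	if (old > bt->frac)
		bt->sec++;
}

/* Hypothetical hardware counter; a real reader would use e.g. the TSC. */
static uint32_t sketch_counter;

static uint32_t
read_counter(void)
{
	return (sketch_counter);
}

void
sketch_binuptime(const struct timehands *th, struct bintime *bt)
{
	struct bintime offset;
	uint64_t scale;
	uint32_t delta, gen;

	do {
		gen = th->th_generation;
		/* Copy everything needed later while "gen" protects it. */
		offset = th->th_offset;
		scale = th->th_scale;
		delta = read_counter() - th->th_offset_count;
	} while (gen == 0 || gen != th->th_generation);

	/* The math now runs once, on the private copies. */
	*bt = offset;
	bintime_addx(bt, scale * delta);
}
```

The three extra locals are the "messier code"; the saving is only realized
when the loop actually retries.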

>> timehands in a shared page is close to working.  th_generation protects
>> things in the same way as in the kernel, modulo assumptions that writes
>> are ordered.
>
> It would work fine.  And in fact, having multiple timehands is actually a
> bug, not a feature.  It lets you compute bogus timestamps if you get preempted
> at the wrong time and end up with time jumping around.  At Yahoo! we reduced
> the number of timehands structures down to 2 or some such, and I'm now of
> the opinion we should just have one and dispense with the entire array.

No, it is a feature.  The time should never jump around (backwards), but
it can easily jump forwards.  It makes little difference if preemption
occurs after the timehands have been read, or while reading them but in
such a way that the timehands become stale during preemption but not stale
enough for their generation to change so that you notice that they are
stale -- you get a stale timestamp either way (with staleness approximately
the preemption time).  Times read by different threads can easily have
different staleness according to which timehands they ended up using and
this may be quite different from which timehands they started using and
from which timehands is active after they return.  Perhaps this is what
you mean.  But again, this happens anyway when the preemption occurs after
the timehands have been read.

The main point of timehands was originally to give a copy of the time
that was stable for a time hopefully long enough for the timehands to be
read without them being clobbered by an update.  binuptime() was:

1.59         (phk      26-Mar-98): void
1.113        (phk      07-Feb-02): binuptime(struct bintime *bt)
1.113        (phk      07-Feb-02): {
1.113        (phk      07-Feb-02): 	struct timecounter *tc;
1.113        (phk      07-Feb-02): 
1.113        (phk      07-Feb-02): 	tc = timecounter;
1.113        (phk      07-Feb-02): 	*bt = tc->tc_offset;
1.113        (phk      07-Feb-02): 	bintime_addx(bt, tc->tc_scale * tco_delta(tc));
1.113        (phk      07-Feb-02): }

This has an obvious race if the thread running this is preempted for a long
time, so that the copy of the time is actually not stable for long enough.
This was fixed (except I think in some cases using ddb) by using the
generation count.

With the generation count, multiple timehands are probably unnecessary,
but they reduce locking bugs (no memory ordering for the generation count)
and give the optimization that binuptime() etc. don't have to spin
waiting for updates.  Now it is the thread doing the updates that gets
the most advantages from multiple timehands.  It doesn't have to worry
much about locking, or being preempted, or blocking for a long time, since
it knows that binuptime() etc. will keep using a previous generation
safely and not busy-wait for it, provided only that it doesn't block for
so long that the oldest previous generation becomes too old to
work.  2 timehands are probably enough for this, but 1 isn't.
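
The writer side of this can be sketched as follows.  This is a simplified
illustration loosely modelled on the kernel's update path, not the actual
code: the names sketch_hands, ring, current and sketch_windup() are
hypothetical, the ring size of 2 is an assumption taken from the estimate
above, and memory barriers are again omitted.  The writer only ever
scribbles on the slot that readers are not being pointed at, so readers
keep using the previous generation until the new one is published:

```c
#include <stdint.h>

#define SKETCH_NHANDS	2	/* assumption: 2 slots, as estimated above */

struct sketch_hands {
	uint64_t          scale;
	uint64_t          offset_frac;
	volatile uint32_t generation;	/* 0 means "being updated" */
};

/* Slot 0 starts out valid (generation 1); slot 1 is not yet filled in. */
static struct sketch_hands ring[SKETCH_NHANDS] = { { 0, 0, 1 } };
static struct sketch_hands *volatile current = &ring[0];
static unsigned next_slot = 1;

void
sketch_windup(uint64_t new_scale, uint64_t new_offset_frac)
{
	struct sketch_hands *th = &ring[next_slot];
	uint32_t ogen = th->generation;

	/* Invalidate the slot so that a stale reader of it will retry. */
	th->generation = 0;

	/* Fill in the new values; "current" still points at the old slot. */
	th->scale = new_scale;
	th->offset_frac = new_offset_frac;

	/* Bump the slot's generation, skipping the reserved value 0. */
	if (++ogen == 0)
		ogen = 1;
	th->generation = ogen;

	/* Publish: readers arriving from now on use the new slot. */
	current = th;
	next_slot = (next_slot + 1) % SKETCH_NHANDS;
}
```

With SKETCH_NHANDS set to 1 the slot being invalidated is the one readers
are using, so they would have to spin until the writer finishes.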

> For my userland case I only export a single timehands copy.

So readers block for a long time if the writer is updating and the
writer blocks?  Works best for UP :-).  Actually, there are problems
in the kernel even for UP.  Consider the writer doing an update and
being preempted by ddb, and ddb using binuptime(), though it shouldn't.
This is deadlock if there is only 1 timehands.  My version runs the
update as a normal interrupt handler so that it can be interrupted
by fast interrupt handlers.  This gives similar problems -- fast
interrupt handlers shouldn't call binuptime() either (this can
deadlock in the timecounter hardware function for at least the
i8254 timecounter), but they do and this is useful for things like
timestamps from serial hardware.  Multiple timehands at least limit
this problem.  Applications have similar problems (more like my
kernel version since applications can't get as exclusive access
as a fast interrupt handler can).

>>>> rdtsc is also very unportable, even on CPUs that have it.  But all other
>>>> x86 timecounter hardware is too slow if you want gettimeofday() to be fast
>>>> and as accurate as it is now.
>
> For all the hardware where people run mysql and similar software that calls
> gettimeofday() a lot, rdtsc() works just fine.

That wasn't the case until recently (except 10-15 years ago for UP with
no SMM).  Someone just fixed the rdtsc()-based time function in dtrace.  It
tries to add a per-cpu rdtsc() offset, but the offset was backwards.  It
takes P-state invariance and maybe more for the offset to be 0 and
not drift.

>>> !rdtsc hardware probably cannot be used at all due to the need to provide
>>> usermode access to device registers. The mere presence of rdtsc does not
>>> mean that usermode can indeed use it; that should be decided by the kernel
>>> based on the current in-kernel time source. If rdtsc is not usable, the
>>> corresponding data should not be exported, or the implementation should go
>>> directly into a syscall or whatever.
>
> Yes, the patches I have only work if the kernel uses the TSC as its main
> timecounter as well.

The detail I miss most is the TSC being available for use in userland
even if it is not the primary timecounter.  Maybe its quality is
enough for the application, or the application can fix it up using
per-cpu offsets.

Bruce


