Date:      Tue, 5 Jun 2012 09:26:36 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Gianni <gianni@freebsd.org>, Alan Cox <alc@rice.edu>, Alexander Kabaev <kan@freebsd.org>, Attilio Rao <attilio@freebsd.org>, Konstantin Belousov <kib@freebsd.org>, freebsd-arch@freebsd.org, Konstantin Belousov <kostikbel@gmail.com>
Subject:   Re: Fwd: [RFC] Kernel shared variables
Message-ID:  <20120605075448.B3655@besplex.bde.org>
In-Reply-To: <201206041730.05478.jhb@freebsd.org>
References:  <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com> <201206041101.57486.jhb@freebsd.org> <20120605054930.H3236@besplex.bde.org> <201206041730.05478.jhb@freebsd.org>

On Mon, 4 Jun 2012, John Baldwin wrote:

> On Monday, June 04, 2012 4:51:00 pm Bruce Evans wrote:
>> On Mon, 4 Jun 2012, John Baldwin wrote:
>>> ...
>>> a private fast gettimeofday() at my current job and it works by exporting
>>> the current timehands structure (well, the equivalent) to userland.  The
>>> userland bits then fetch a copy of the details and do the same as bintime().
>>
>> How do you keep this up to date, especially for leap seconds?
>
> I added a hack to tc_windup() where it updates the shared copy of the variables
> with the results of the tc_windup() call each time it is invoked.
>
>> My version has a comment saying to do that, but I just noticed that
>> it wouldn't work so well -- the timehands fields would have to be
>> copied to local variables while under protection of the generation
>> count, so it would give messier code to optimize a case that occurs
>> _very_ rarely.
>
> It's not that messy in my experience.

Just 3-4 lines.  With only 16 copies of them in kern_tc.c.  I doubt that
you provide all of these :-).  But full clock_gettime() support requires
more than half of these :-(.  Maybe even all of the FFCLOCK parts for
full compatibility :-(.
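The "3-4 lines" in question are the generation-count snapshot loop.  A
minimal userland sketch (field set simplified and names invented; the
real struct timehands in kern_tc.c has th_counter, th_adjustment,
th_boottime, etc., and the copies need read barriers around them):

```c
#include <stdint.h>

/*
 * Simplified stand-in for the kernel's struct timehands.
 */
struct timehands {
	uint64_t th_scale;		/* time units per counter tick */
	uint64_t th_offset;		/* time at th_offset_count */
	uint32_t th_offset_count;	/* counter value at last windup */
	volatile uint32_t th_generation; /* 0 while being updated */
};

/*
 * The "3-4 lines": copy the fields to locals under protection of the
 * generation count, retrying if tc_windup() raced with us.
 */
static uint64_t
th_read_time(const struct timehands *th, uint32_t counter_now)
{
	uint32_t gen, ocount;
	uint64_t scale, offset;

	do {
		gen = th->th_generation;
		/* read barrier needed here on real hardware */
		scale = th->th_scale;
		offset = th->th_offset;
		ocount = th->th_offset_count;
		/* ... and another one here */
	} while (gen == 0 || gen != th->th_generation);
	return (offset + scale * (uint32_t)(counter_now - ocount));
}
```

Every consumer of the timehands (binuptime(), bintime(), getmicrotime()
and friends) repeats some variant of this loop, hence the many copies.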

>>>> timehands in a shared pages is close to working.  th_generation protects
>>>> things in the same way as in the kernel, modulo assumptions that writes
>>>> are ordered.
>>>
>>> It would work fine.  And in fact, having multiple timehands is actually a
>>> bug, not a feature.  It lets you compute bogus timestamps if you get preempted
>>> at the wrong time and end up with time jumping around.  At Yahoo! we reduced
>>> the number of timehands structures down to 2 or some such, and I'm now of
>>> the opinion we should just have one and dispense with the entire array.
>>
>> No, it is a feature.  The time should never jump around (backwards), but
>> it can easily jump forwards.  It makes little difference if preemption
>> occurs after the timehands have been read, or while reading them but in
>> such a way that the timehands become stale during preemption but not stale
>> enough for their generation to change so that you notice that they are
>> stale -- you get a stale timestamp either way (with staleness approximately
>> the preemption time).  Times read by different threads can easily have
>> different staleness according to which timehands they ended up using and
>> this may be quite different from which timehands they started using and
>> from which timehands is active after they return.  Perhaps this is what
>> you mean.  But again, this happens anyway when the preemption occurs after
>> the timehands have been read.
>
> Time definitely jumped backwards at Yahoo!.  The problem case was when NTP
> was adjusting the time, so if you used a timehands structure that was a
> few generations old (stale), you could have a fairly large component that
> was (delta * scale).
> If the scale had slowed down in subsequent updates,
> then the computed time would jump out into the future.  On the next time
> update with a newer timehands, the effective base was less than the previous
> calculation thought it should have been, and the scale was smaller, so the
> end result if the TSC had not advanced very far was for the new time to be
> less than the previous time, and thus time jumped backwards.

Hmm, changing th_scale in tc_windup() indeed seems to be quite broken,
and reducing to 1 timehands might work around this.  tc_windup() captures
the time using the current scale, so any future reads on the new
timehands will be monotonic, but current and future reads on other
timehands may be too far ahead.  Current and future reads on the
new timehands are prevented by the generation count -- this is why
reducing to 1 timehands might work.  Reducing the number of timehands
to >= 2 only reduces the maximum non-monotonicity.
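A toy calculation (numbers invented, bases collapsed to the trivial
case) shows the backwards step: tc_windup() rebases the new timehands
at counter C0 using the OLD scale, then halves the scale.

```c
#include <stdint.h>

static const uint64_t old_scale = 10, new_scale = 5;
static const uint32_t c0 = 100;		/* counter value at windup */

/* Stale reader: still using the old timehands at counter "now". */
static uint64_t
stale_read(uint32_t now)
{
	return (old_scale * now);	/* old base is 0 at count 0 */
}

/* Later reader: using the new timehands at counter "now" > c0. */
static uint64_t
fresh_read(uint32_t now)
{
	uint64_t base = old_scale * c0;	/* captured at windup */

	return (base + new_scale * (now - c0));
}
```

A stale read at counter 110 gives 1100; a fresh read at the *later*
counter 112 gives 1000 + 5 * 12 = 1060, so time went backwards by 40
units even though the counter advanced.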

Someone should test this using adjtime(2).  I think you can use it to
slew much faster than ntpd will let you.

To fix this, just kill all the timehands by setting their generation
count to 0 iff changing the scale.  This preserves the optimization
except when the scale changes, unless ntp (kernel PLL, etc.) changes
it almost every time.  ntpd only changes things every few seconds or
minutes, and I hope the kernel doesn't need to change the frequency
often.
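A sketch of that fix (names, array size and the windup path are all
simplified; the real tc_windup() does much more): when the scale
changes, zero the generation of every slot but the new one, so a
reader mid-snapshot on a stale slot is forced to retry.

```c
#include <stdint.h>

#define	NTH	10		/* the kernel's timehands array has 10 slots */

struct th_slot {
	uint64_t th_scale;
	volatile uint32_t th_generation;	/* 0 = unusable, retry */
};

static struct th_slot th_array[NTH];
static int th_cur;		/* index of the current slot */
static uint32_t th_nextgen = 1;

/*
 * Windup sketch: advance to the next slot, and iff the scale changed,
 * kill every old slot's generation.
 */
static void
windup_scale(uint64_t new_scale)
{
	int i, next = (th_cur + 1) % NTH;

	th_array[next].th_generation = 0;	/* block readers of this slot */
	if (new_scale != th_array[th_cur].th_scale) {
		for (i = 0; i < NTH; i++)
			if (i != next)
				th_array[i].th_generation = 0;
	}
	th_array[next].th_scale = new_scale;
	if (++th_nextgen == 0)			/* never hand out 0 */
		th_nextgen = 1;
	th_array[next].th_generation = th_nextgen;
	th_cur = next;
}
```

When the scale is unchanged, old slots keep their generations and the
optimization survives; when it changes, every reader converges on the
new slot.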

BTW, the ntp part of the locking is quite broken.  You can see this
in kern_adjtime() where it uses Giant locking to try to protect the
non-atomic write to time_adjtime and other less critical variables.
kern_ntptime.c still says that almost everything must be locked by
splclock(), but that is null.  Actually, almost everything must be
locked by something that locks out tc_windup() or a little more.
Sched locking might have done it for hardclock() and tc_windup(), but
no mutex except Giant has ever been used in kern_ntptime.c.

There is also essentially null locking for pps calls from non-clock
fast interrupt handlers.  The worst case in a useful configuration
seems to be 1 CPU executing ntp_update_second() via a fast interrupt
handler, and another CPU executing hardpps() via another fast interrupt
handler.  A non-useful configuration might have more than 1 other CPU
executing hardpps().

>>> For my userland case I only export a single timehands copy.
>>
>> So readers block for a long time if the writer is updating and the
>> writer blocks?  Works best for UP :-).
>
> The update to the shared timehands structure does not take a long time,
> specifically for userland it does not require all of tc_windup()'s
> execution time, merely the time to update the values.
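For a single exported timehands, the writer side of that update is a
small seqlock-style sequence.  A sketch (layout and names hypothetical;
real code needs store barriers around each write to the generation):

```c
#include <stdint.h>

/* Hypothetical layout of the single timehands copy in the shared page. */
struct shared_th {
	volatile uint32_t gen;	/* 0 while the writer is mid-update */
	uint64_t scale;
	uint64_t offset;
	uint32_t offset_count;
};

/*
 * Called from tc_windup() after the kernel timehands are settled.
 * Readers that observe gen == 0, or a gen that changed across their
 * snapshot, must retry.
 */
static void
shared_th_update(struct shared_th *sh, uint64_t scale, uint64_t offset,
    uint32_t ocount)
{
	uint32_t gen = sh->gen;

	sh->gen = 0;			/* open the update window */
	/* store barrier */
	sh->scale = scale;
	sh->offset = offset;
	sh->offset_count = ocount;
	/* store barrier */
	if (++gen == 0)			/* 0 is reserved for "updating" */
		gen = 1;
	sh->gen = gen;			/* close the window */
}
```

The window is only as long as the three stores, which is the point
John is making; Bruce's objection below is that preemption can stretch
that window arbitrarily.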

But whenever it is preempted, it may take a long time.  You have no
control, short of privileged rtprio and nice, to prevent preemption.
OTOH, tc_windup() is run from a fast interrupt handler.  Almost nothing
can delay its invocation for long or preempt it (sched locking of it
used to delay it enough to cause significant lock contention).

Bruce


