Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 1 Mar 2012 11:33:46 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Luigi Rizzo <rizzo@iet.unipi.it>
Cc:        arch@FreeBSD.org
Subject:   Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX
Message-ID:  <20120301071145.O879@besplex.bde.org>
In-Reply-To: <20120229194042.GA10921@onelab2.iet.unipi.it>
References:  <20120229194042.GA10921@onelab2.iet.unipi.it>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 29 Feb 2012, Luigi Rizzo wrote:

> I have always been annoyed by the fact that FreeBSD rounds timeouts
> in select/usleep/poll in very conservative ways, so i decided to
> try how other systems behave in this respect. Attached is a simple
> program that you should be able to compile and run on various OS
> and see what happens.

Many are broken, indeed.

The simple program isn't attached.

> Here are the results (HZ=1000 on the system under test, and FreeBSD
> has the same behaviour since at least 4.11):
>
> 	        |    Actual timeout
>                |      select            | poll  | usleep|
> 	timeout | FBSD  | Linux | OSX    | FBSD  | FBSD  |
> 	usec    | 9.0   | Vbox  | 10.6   |  9.0  |  9.0  |
> 	--------+-------+-------+--------+-------+-------+
> 	    1      2000      99       6     0      2000

Try HZ = 20 (possible, at the user's option, even with an i8254 timer)
or lower (possible, at the user's option, with better timers).  FreeBSD
should then get timeouts of up to 2/HZ = 100000 us.  Applications must
deal with this range somehow (maybe by telling the user to configure
HZ better).  In ttcp, I found the timeouts unusable and resorted to
an option to busy-wait.  The rate-limiting timeouts in tools/netrate 
don't work at all for small HZ, and barely work for large HZ, since
timeouts are similarly unusable.

It is possible to easily improve this to a maximum of only 1/HZ = 50000
us, at some cost to efficiency, by waking up 1 tick early, checking
if the timeout has expired, and sleeping for another tick if it hasn't.

Waking up early is needed anyway for long timeouts, in case the timeouts
interrupts are running a little slower than their estimated frequency
-- being off by just 1 part per million will accumulate to an error
of 86400 us after 1 day, and FreeBSD cluser machines used to be off
by about 10%, giving an error of 2.4 hours for your appointment next
day.  The error will often be 100's of parts per million, giving an
error of 10's of seconds per day.  To handle the 10% error, timeouts
must wake up 10% early.

select(), poll(), and nanosleep() all check that the timeout expires
when they wake up, but they don't set it to be wake up early, and their
check for whether it has expired is even fuzzier than the timeout
granularity (it uses the broken-as-designed getnanouptime() API to get
the current time; timecounters are only updated every
$(sysctl kern.timecounter.tick) ticks, by default to limit their
update frequency to about 1000 Hz when HZ is configured to be much
larger than 1000.  This ensures an extra, unnecessary innaccuracy of
up to 1000 us whenever the timeout wakes up a little early according
the the retarded clock used to measure the current time.  I think
the error is fail-safe -- it may extend the timeout by as much as
1000 usec.

> 	   10      2000     109      15     0      2000
> 	   50      2000     149      66     0      2000
> 	  100      2000     196     133     0      2000
> 	  500      2000     597     617     0      2000

You must have synced with timer interrupts to get the above.  Timeouts
in the current FreeBSD implementation should average the actual timeout
rounded up to a multiple of 1/HZ seconds, plus 0.5/HZ seconds, and thus
average 1.5/HZ = 1500 us for short timeouts.

Someone apparently broke poll() on FreeBSD :-(.

Linux and OSX must be using busy-waiting or expensive timer
reprogramming for short timeouts to work.  Linux-2.6.10 has 3491
references to udelay().  This seems to correspond to FreeBSD' DELAY().
The linux nanosleep() code is too complicated for me to easily see
what it is doing for short timeouts, but I noticed that it isn't
missing clock_nanosleep() like FreeBSD does, and presumably has
a collaterally non-broken nanosleep().  (POSIX requires nanosleep()
to sleep in real time, to be bug for bug compatible with old sleep(),
but this is often not what is wanted.  So POSIX invented
clock_nanosleep() so as to be able to sleep on the monotonic clock
and on any other clock of interest.  FreeBSD doesn't know anything
about this, and only has nanosleep(), which sleeps on a wrong clock
(the monotonic one).

> 	 1000      2000    1103    1136    2000    2000
> 	 1001      3000    1103    1136    2000    3000 <---
> 	 1500      3000    1608    1631    2000    3000 <---
>         2000	   3000    2096    2127    3000    3000
> 	 2001	   4000                    3000    4000 <---
> 	 3001	   5000                    4000    5000 <---
>
> Note how the rounding (poll has the timeout in milliseconds) affects
> the actual timeouts when you are past multiples of 1/HZ.

Also, timeouts that are just before a multiple of 1/HZ may be turned
into 1/HZ + 1000 usec by the inacurracy of getnanouptime().  E.g., if
the requested timeout is 999, HZ is 1000, and tc_tick is 1, 999 should
be turned into 2 ticks (average 1500 usec).  Then, if the timeout for
the second tick is a little early according to the retarded clock,
the total timeout will be extended by another tick, to 3 ticks.  So
all timeouts may be extended by up to 2 ticks, instead of only ones
just larger than a multiple of 1/HZ being extended by the full 2 ticks.
This can be made much worse by setting tc_tick to a large value.

> I know that until we have some hi-res interrupt source there is no
> hope to have better than 1/HZ granularity. However we are doing
> much worse by adding up to 2 extra ticks. This makes apps less
> responsive than they could be, and gives us no way to
> "yield until the next tick".
>
> So what I would like to do is add a sysctl (disabled by
> default) that enables a better approximation of the desired delay.

It is possible to get a timeout every tick in userland using periodic
itimer's.  Maybe not for the initial timeout, but after that the timeouts
repeat with the specified period (rounded up to the next tick boundary,
but not up to the next + 1).

> I see in the kernel that all three syscalls loop around a blocking
> function (tsleep or seltdwait), and do check the "actual" elapsed
> time by calling getmicrouptime() or getnanouptime() around the
> sleeping function .  So the actual timeout passed to tsleep does
> not really matter (as long as it is greater than 0 ).
>
> The only concern is that getmicrouptime()/getnanouptime() are documented
> as "less precise, but faster to obtain". The question is how precise is
> "less precise": do we have some way to get an upper bound for the

It is tc_tick/HZ seconds (usually 1/HZ, but if you set HZ to be > 1000
to reduce this problem, then tc_tick will bite you unless you change it
down.  Both may be acceptable on a real-timeish system that wants short
timeouts at the cost of efficiency.  But I don't like timeouts.  When
they are used a lot, they are a form of busy waiting.  This is only
acceptable if you have CPU to burn).

> precision of the timers used in get*time(), so we can use that value
> in the equation instead of the extra 1/HZ that tvtohz() puts in
> after computing floor(timeout*HZ) ?

It's always worse than 1/HZ if you use these broken-as-designed APIs.

> For reference, below is the core of usleep and select/poll
> (from kern_time.c and sys_generic.c)
>
>    usleep:
> 	getnanouptime(now)

Should use nanotime().  This also fixes the clock id.  But check whether
this is really required by POSIX.  It means that if someone steps the
clock, then the sleep may be extended or truncated significantly.  You
also need to fix the timeout used here to sleep in real time, so that
it doesn't wake up late by an amount of (negative of the step).  Timeouts
also sleep in monotonic time.

Sleeping in monotonic time seems to be wrong in more cases than it is
correct.  Another broken area is suspend/resume.  Suspending for an
hour stops all timeouts by an hour.  Then on resume, they aren't
adusted, so they occur an hour later.  There is still code in
kern_timeout.c under APM_FIXUP_CALLTODO which was supposed to fix this
problem for apm, but it is so poorly maintained that it never even
compiled in any committed version, and there is no option for it.
Problems in this area were supposed to have been fixed by using
monotonic time more, but they seem to have actually been increased.

> 	end = now + timeout;

This is quite broken too:
- it doesn't arrange to wake up early.  Sutract 1 tick here for a quick
   fix for short timeouts.  Maybe 10% for timeouts longer than 10 ticks.
- it can overflow, giving undefined behaviour (in practice, just bad
   results later)
- itimerfix() is supposed to be used to prevent such overflows, but this
   is quite broken:
   - note that itimerfix() is quite different from timevalfix().  Although
     its name is spelled with an 'i' and a 'fix', itimerfix() is useful
     generally and doesn't fix anything except for bogusly adjusting
     fractional ticks.  It is also confusing because its name is spelled
     without a `val', since it was intended to be used only for itimers
     which used to use generic times (timevals) back before better
     representations of times existed.  What it mosly does is validity
     checking for timevals.  OTOH, timevalfix() does pure fixing (to
     handle carry after (possibly multiple) additions and subtractions).
   - timevalfix() used to limit tv_sec to 100 million seconds (else
     EINVAL), but someone broken it by removing this check
   - someone didn't update the man pages which document this limit.  It
     is still documented in at least:
     - setitimer(2).  This is its primary API
     - alarm(3)
     - ualarm(3).  This also bogusly documents what a microsecond is

     Grepping for 100000000 and 100.*million also shows many bad
     descriptions of the limit on tv_nsec for APIs that take a
     timespec arg.  This limit is described verbosely as "1000 million".
     Of course, "1 billion" cannot be used since it is ambiguous, and
     1000000000 should not be used because it is hard to see the number
     of zeros in it, but millions aren't naturally associated with the
     nano prefix.  I like to write such numbers in minimal floating
     point scientific notation (e.g., 1e9) or as powers of 10 (e.g.,
     10**9).  1e9 is better because it is shorter and doesn't need
     an ambiguous '^' operator or a less common Fortran '**' operator
     for exponentiation.  Not sure if this is best for man pages.

     select() and kqueue() used to have the same limit, but it was
     never documented for them.  Not sure about kqueue.

   - this limit isn't permitted by POSIX.
   - here we have timespecs.  itimespecfix() exists too (although
     timespecfix() doesn't).  itimespecfix() has the same semantics
     as itimerfix() (except of course it obviously acts on timespecs
     while itimerfix() unobviously acts on timevals).  itimespecfix()
     never had the limit on tv_sec.

     But timespecfix() has the same bogus rounding up of fractional
     ticks as timerfix().  This is only done for fractional ticks below 1.
     This should be unnecessary provided everywhere else is careful to
     round up and usually to add 1.  timespec*fix() never did the adding
     1 part.  I think they are just defending against sloppy conversions
     that produce 0 ticks from small but nonzero timeouts.  A timeout of
     0 ticks means to sleep forever.  But relevant higher levels also
     defend against this, by silently changing 0 ticks to 1 tick.
- back to bugs at the level of nanosleep().  It can't use itimerfix()
   since it deals with timespecs.  It should call something like
   itimespecfix() (except that should be named timespeccheck()...).
   But IIRC, itimespecfix() didn't exist when nanosleep() was implemented.
   Also, itimespecfix() has wrong semantics for use here..  Its bogus
   rounding up is exactly what you don't want.  I think it has other
   slightly mismatched semantics (perhaps a difference in error numbers),
   but the others are easy to fix up.  So nanosleep() rolled its own
   checking, and got it wrong.

The overflow seems easy to fix as a side affect of waking up early:
- for preposterosterously long timeouts, wake up after 100 million
   seconds or similar, instead of 10% early.  This delays the problem.
   100 million seconds is a little over 3 years, so it won't expire
   in practice, and no one would care if it did.  But this method
   (and the old limit) breaks down about 3 years before the time_t's
   roll over.  So in 2036, you need to limit the timeout to only 2
   years instead of 3, if you are still using 31-bit (sic) time_t's
   with a useless sign bit then.  And in 2105, you need to limit the
   timeout to only 2 years instead of 3, if you are still using 32-bit
   unsigned time_t's then.

   Be careful with overflow even with this fix.  Applications probing
   for kernel bugs will try using maximal tv_sec.  Since POSIX doesn't
   allow rejecting these like the old 100 million second limit did,
   we must start with a long sleep and retain the original preposterous
   timeout so that we can return it as the unslept time.  We have to
   be careful about overflow when adding the preposterous time to the
   current time.  Large time_t's don't do anything to limit this overflow,
   since the appllication can ask for the maximum (2**31-1 or 2**63-1
   in practice), and since the current time is surely >= 1, adding the
   current time to the preposterous time surely gives overflow.

   Example of a not-unreasonable POSIX application to probe for bugs in
   this area:

       set tv.tv_sec and tv.tv_nsec to the maximum possible ("infinity")
       arrange for a signal after 1 second
       nanosleep(&tv, &tv2);
       check that tv2 is about 1 second below the maximum possible

> 	for (;;) {
> 		getnanouptime(now);

Our original `now' and thus `end' are retarded by up to tc_tick/HZ.
This `now' is retarded too.  This complicates the analysis and
changes its results.

> 		delta = end - now;

So `delta' might not even be retarded.  If `end' is normal but `now'
is retarded, then `delta' is advanced.  This case is fail-safe but
not what you want (sleep again).  If `end' is retarded but `now' is
normal, then the retardation in `delta' is maximal (still less than
tc_tick/HZ).  This case is fail-unsafe (return up to tc_tick/HZ
early).  Other cases are in between these, with the retardations
partially or completely cancelling.  By making tc_tick large, the
fail-unsafe case can be made to more than overcome the safety margin
of about 1 1/HZ given by always adding 1.  I only just noticed this
detail.

> 		if (delta <= 0)
> 			break;
> 		tsleep(..., tvtohz(delta) )
> 	}
>
>    select/poll:
> 	itimerfix(timeout) // force at least 1/HZ

That's "bogusly force".  It doesn't add 1 or do anything if the timeout
is above 1/HZ, but both are done later.

> 	getmicrouptime(now)
> 	end = now + timeout;

Same retardation and overflow bugs.

> 	for (;;) {

Missing showing getmicrouptime(now) here?

> 		delta = end - now;
> 		seltdwait(..., tvtohz(delta) )

tvtohz() does round up and add 1.  Its interaction with the above is
unclear.  I think there is double rounding up in some cases.  For
example, even without the retardation, a delta that wants to be
precisely 1 tick (much more than it should be due to the first rounding
up) may be 1 us over 1 tick due to minor inaccuracies.  Then tvtohz()
will round up again.  nanosleep() has the same problem.  To handle this,
we may need to subtract 1 more from the result of tvtohz():
- always subtract 10%
- subtract 1 to compensate for tvtohz() always adding 1.  See the periodic
   itimer code for this fixup.  Periodic itimers need to be more careful
   for long timeouts too.
- subtract 1 more in case delta is a little too large.  Better look at
   delta and not always do this.
- if the resulting timeout is <= 0, change it to 1.

> 		getmicrouptime(now);
> 		if (some_fd_is_ready() || now >= end)
> 			break;
> 	}

Activity on the fd's is likely to give more wakeups than nanosleep()
gets, since the latter only gets woken up for its own timeout and signals.
For this and other reasons, large timeouts are probably smaller for
select() than for nanosleep().  And for poll(), you just can't ask for
a large timeout (the limit is (2**31-1) milliseconds ~= 24.8 days with
32-bit ints, as is the case on all supported arches).

It remains to explain why the above results show that poll() but not
select() is broken for small timeouts (they are turned into 0 us for
poll() and 2000 us for select()).  Well, the granularity for poll is 1
ms, so this looks like just an application bug, with timeouts of < 1 ms
being rounded down to 0 before the kernel sees them.

But how do the other OS's see it?  This might be due to them taking a
long time to handle null timeouts, and their times actually being
reported correctly.
I don't believe the times of 0 and 2000 us reported for FreeBSD.  You
can't do anything in 0 us, and 2000 us is too round a number.  These
round numbers might be due to using the broken as designed
CLOCK_MONOTONIC_FAST_N_BROKEN clock ids.  These are collateral with
getnanouptime() etc.  They are even more broken as designed, since
provided you have non-slow timecounter hardware, the time for
clock_gettime() is dominated by syscall overhead, so
CLOCK_MONOTONIC_FAST_N_BROKEN is only a few percent faster than
CLOCK_MONOTIC.  Typical numbers are:
- 12 (9?) cycles for the hardware part of a TSC timecounter on an old
   Athlon64.  ~250 cycles for the total syscall overhead for an old version
   of FreeBSD UP on almost any x86.  Possible savings from the "fast"
   method: about 5%.  More like 10% due to extra software overhead for the
   timecounter.
- 42 (?) cycles for the hardware part of a TSC timecounter on Phenom+
   (synchronization across CPUs makes it much slower).  Similarly for
   most modern multi-core CPUs.  Intel CPUs were much slower than 12
   cycles even for old single-core ones. ~350 cycles for the total
   syscall overhead for a current version of FreeBSD SMP on almost any
   x86.  Possible savings from the "fast" method: about 15%.  IIRC,
   SMP only costs 20-30 cycles, with the extra 100 being mainly from
   extra layers.
- 1000-2000 nsec (up to ~8000 cycles) for an ACPI-FAST timecounter.
   These are actually ACPI-SLOW.  Now the hardware overhead dominates.
   HPET is better, but has become common at the same time as the TSC
   became usable for SMP, so it is rarely useful as a timecounter.
- up to 5000 nsec for an i8254 timecounter on a modern CPU.  Getting
   slow with more briges between the CPU and the ISA bus.
- up to 30000 nsec for an i8254 timecounter on a 486.
OTOH, getnanouptime() takes about 12 cycles (very fuzzy estimate), so
it is 2-3 times faster than nanouptime() using the fastest TSC hardware,
and about 5-6 times faster than nanouptime() using slower TSC hardware.

IIRC, itimer code doesn't do these checks of the time after wakeups at
all.  Not sure what kqueue does.

I haven't really touched nanosleep(), but have some small fixes near
the tvtohz() call for select() and poll().

% Index: sys_generic.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/kern/sys_generic.c,v
% retrieving revision 1.131
% diff -u -2 -r1.131 sys_generic.c
% --- sys_generic.c	5 Apr 2004 21:03:35 -0000	1.131
% +++ sys_generic.c	13 Aug 2009 11:21:29 -0000
% @@ -806,9 +797,5 @@
%  		getmicrouptime(&rtv);
%  		timevaladd(&atv, &rtv);
% -	} else {
% -		atv.tv_sec = 0;
% -		atv.tv_usec = 0;
%  	}
% -	timo = 0;
%  	TAILQ_INIT(&td->td_selq);
%  	mtx_lock(&sellock);
% @@ -824,5 +811,7 @@
%  	if (error || td->td_retval[0])
%  		goto done;
% -	if (atv.tv_sec || atv.tv_usec) {
% +	if (tvp == NULL)
% +		timo = 0;
% +	else {
%  		getmicrouptime(&rtv);
%  		if (timevalcmp(&rtv, &atv, >=))

Unrelated cleanups of initialization.

% @@ -830,13 +819,10 @@
%  		ttv = atv;
%  		timevalsub(&ttv, &rtv);
% -		timo = ttv.tv_sec > 24 * 60 * 60 ?
% -		    24 * 60 * 60 * hz : tvtohz(&ttv);
% +		timo = tvtohz(&ttv);

The special case for timeouts of > 1 day defeats the careful overflow
handling in tvtohz().  It is supposed to be for avoiding overflow, but
tvtohz() avoids it already, while the above causes it whenever hz
is large but not preposterously so, so that 24 * 60 * 60 * hz overflows.
hz only needs to be 24586 for overflow.  25000 is almost reasonable
for excessive polling, and I have tested lapic timer interrupts though
not hz at 1MHz.

However, reduction of the timeout to a value that will wake up 10% early 
is the first step of fixing the bugs discussed above.  Reduction to
1 day accomplishes this for timeouts of >= 1.1 days provided it doesn't
overflow.

%  	}
% 
%  	/*
% -	 * An event of interest may occur while we do not hold
% -	 * sellock, so check TDF_SELECT and the number of
% -	 * collisions and rescan the file descriptors if
% -	 * necessary.
% +	 * An event of interest may have occurred while we did not hold
% +	 * sellock.  Check for this and rescan if necessary.
%  	 */
%  	mtx_lock_spin(&sched_lock);

Unrelated.

% @@ -985,4 +978,5 @@
%  	if (uap->timeout != INFTIM) {
%  		atv.tv_sec = uap->timeout / 1000;
% +		/* XXX wrong if timeout < 0. */
%  		atv.tv_usec = (uap->timeout % 1000) * 1000;

Since the '%' operator is broken for negative values in C, this gives
a negative tv_usec when the timeout is negative.

%  		if (itimerfix(&atv)) {

itimerfix() then returns EINVAL, and the syscall fails.  But a timeout
of < 0 should be equivalent to a timeout of 0, as it is for select()
and nanosleep().  This can be implemented either by fixing C, or
by fixing the espression, or by just changing negative timeouts to 0.

% @@ -992,9 +986,5 @@
%  		getmicrouptime(&rtv);
%  		timevaladd(&atv, &rtv);
% -	} else {
% -		atv.tv_sec = 0;
% -		atv.tv_usec = 0;
%  	}
% -	timo = 0;
%  	TAILQ_INIT(&td->td_selq);
%  	mtx_lock(&sellock);
% @@ -1006,9 +996,11 @@
%  	mtx_unlock(&sellock);
% 
% -	error = pollscan(td, (struct pollfd *)bits, nfds);
% +	error = pollscan(td, bits, nfds);
%  	mtx_lock(&sellock);
%  	if (error || td->td_retval[0])
%  		goto done;
% -	if (atv.tv_sec || atv.tv_usec) {
% +	if (uap->timeout == INFTIM)
% +		timo = 0;
% +	else {
%  		getmicrouptime(&rtv);
%  		if (timevalcmp(&rtv, &atv, >=))

Unrelated cleanups.

% @@ -1016,12 +1008,8 @@
%  		ttv = atv;
%  		timevalsub(&ttv, &rtv);
% -		timo = ttv.tv_sec > 24 * 60 * 60 ?
% -		    24 * 60 * 60 * hz : tvtohz(&ttv);
% +		timo = tvtohz(&ttv);

As for select().

%  	}
% -	/*
% -	 * An event of interest may occur while we do not hold
% -	 * sellock, so check TDF_SELECT and the number of collisions
% -	 * and rescan the file descriptors if necessary.
% -	 */
% +
% +	/* Rescan if necessary, as above. */

Don't repeat comments ad nauseum.  There used to be large grammar errors
in these comments.  -current may have cleaned them up differently.

%  	mtx_lock_spin(&sched_lock);
%  	if ((td->td_flags & TDF_SELECT) == 0 || nselcoll != ncoll) {

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120301071145.O879>