Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 7 Sep 2009 09:21:59 +0200
From:      Luigi Rizzo <rizzo@iet.unipi.it>
To:        Peter Wemm <peter@wemm.org>
Cc:        stable@freebsd.org
Subject:   Re: incorrect usleep/select delays with HZ > 2500
Message-ID:  <20090907072159.GA18906@onelab2.iet.unipi.it>
In-Reply-To: <e7db6d980909061736p4affc054k3fa5070214adc2f8@mail.gmail.com>
References:  <20090906155154.GA8283@onelab2.iet.unipi.it> <e7db6d980909061736p4affc054k3fa5070214adc2f8@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, Sep 06, 2009 at 05:36:29PM -0700, Peter Wemm wrote:
> On Sun, Sep 6, 2009 at 8:51 AM, Luigi Rizzo<rizzo@iet.unipi.it> wrote:
> > (this problem seems to affect both current and -stable,
> > so let's see if here i have better luck)
> >
> > I just noticed [Note 1,2] that when setting HZ > 2500 (even if it is
> > an exact divisor of the APIC/CPU clock) there is a significant
> > drift between the delays generated by usleep()/select() and those
> > computed by gettimeofday(). ?In other words, the error grows with
> > the amount of delay requested.
> >
> > To show the problem, try this function
> >
> > ? ? ? ?int f(int wait_time) { ?// wait_time in usec
> > ? ? ? ? ? ? ? ?struct timeval start, end;
> > ? ? ? ? ? ? ? ?gettimeofday(&start);
> > ? ? ? ? ? ? ? ?usleep(w); ? ? ?// or try select
> > ? ? ? ? ? ? ? ?gettimeofday(&end)
> > ? ? ? ? ? ? ? ?timersub(&end, &start, &x);
> > ? ? ? ? ? ? ? ?return = x.tv_usec + 1000000*x.tv_sec - wait_time;
> > ? ? ? ?}
> >
> > for various HZ (kern.hz=NNNN in /boot/loader.conf) and wait times.
> > Ideally, we would expect the timings to be in error by something
> > between 0 and 1 (or 2) ticks, irrespective of the value of wait_time.
> > In fact, this is what you see with HZ=1000, 2000 and 2500.
> > But larger values of HZ (e.g. 4000, 5000, 10k, 40k) create
> > a drift of 0.5% and above (i.e. with HZ=5000, a 1-second delay
> > lasts 1.0064s and a 10s delay lasts 10.062s; with HZ=10k the
> > error becomes 1% and at HZ=40k the error becomes even bigger.
> 
> Technically, it isn't even an error because the sleeps are defined as
> 'at least' the value specified.  If you're looking for real-time-OS
> level performance, you probably need to look at one.

I know a non RTOS has limitations on the guarantees it can give.
But with this in mind (e.g.  be happy with something that works as
expected just 'most of the times'), on this specific issue FreeBSD
could behave much better while still remaining technically correct.

> > 3. ?CAUSE an error in tvtohz(), reported long ago in
> > ? ? ? ?http://www.dragonflybsd.org/presentations/nanosleep/
> > ? ? ? ?which causes a systematic error of an extra tick in the computation
> > ? ? ? ?of the sleep times.
> > ? ?FIX: the above link also contains a proposed fix (which in fact
> > ? ? ? ?reverts a bug introduced in some old commit on FreeBSD)
> > ? ?PROBLEM: none that i see.
> 
> This change, as-is, is extremely dangerous.  tsleep/msleep() use a
> value of 0 meaning 'forever'.  Changing tvtohz() so that it can now
> return 0 for a non-zero tv is setting land-mines all over the place.
> There's something like 27 callers of tvtohz() in sys/kern alone, some
> of which are used to supply tsleep/msleep() timeouts.  Note that the
> dragonflybsd patch above only adds the 'returns 0' check to one single
> caller.  You either need to patch all callers of tvtohz() since you've
> change the semantics, or add a 'if (ticks == 0) ticks = 1' check (or
> checks) in the appropriate places inside tvtohz().
> 
> If you don't do it, then you end up with callers of random functions
> with very small timeouts instead finding themselves sleeping forever.

You are right, a proper fix for this third issue should be different
(if we want a fix at all -- I'd be almost satisfied by just
removing the drift).

The simplest option is perhaps to compute a custom value for
nanosleep, select and poll. This would remove the risk of side
effects to other parts of the system, and we could also use the
chance to compensate for the errors that arise when hz*tick !=
1000000 or when we know that hardclock does not run exactly every
'tick' (an integer) microseconds.

	cheers
	luigi



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20090907072159.GA18906>