Date:      Mon, 20 Jun 2011 22:03:43 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Peter Jeremy <peterjeremy@acm.org>
Cc:        svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Bruce Evans <brde@optusnet.com.au>
Subject:   Re: svn commit: r222866 - head/sys/x86/x86
Message-ID:  <20110620213851.D1479@besplex.bde.org>
In-Reply-To: <20110620090640.GA64900@server.vk2pj.dyndns.org>
References:  <201106081938.p58JcWuB044252@svn.freebsd.org> <20110609055112.P2870@besplex.bde.org> <201106081913.09272.jkim@FreeBSD.org> <20110618210815.W889@besplex.bde.org> <20110620090640.GA64900@server.vk2pj.dyndns.org>

On Mon, 20 Jun 2011, Peter Jeremy wrote:

> On 2011-Jun-18 22:05:06 +1000, Bruce Evans <brde@optusnet.com.au> wrote:
>> My clock measurement program (mostly an old program by Wollman) shows
>> the following histogram of times for a non-invariant TSC timecounter
>> on a 2GHz UP system:
>>
>> % min 273, max 265102, mean 273.998217, std 79.069534
>> % 1th: 273 (1727219 observations)
>> % 2th: 274 (265607 observations)
>> % 3th: 275 (6984 observations)
>> % 4th: 280 (11 observations)
>> % 5th: 290 (8 observations)
>>
>> The variance is small, and differences of a single nS can be seen clearly.
>
> Unfortunately, Intel broke this in their P-state invariant TSC
> implementation.  Rather than incrementing the TSC by one at the
> CPU core frequency, they increment by the core multiplier at the
> FSB frequency.  This gives a result like the following on my Atom
> N270:
> delta  samples
> 24    49637124
> 36    50312540
> 48       44658
> 60          77
>
> This makes it virtually impossible to measure short periods.
>
> Luckily, AMD seem to have gotten this right.

I tested a FreeBSD cluster machine in userland, since it doesn't have a
usable TSC timecounter (iterating $(sysctl kern.timecounter...) is too
slow).

%%%
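#include <sys/types.h>
#include <machine/cpufunc.h>	/* FreeBSD-specific: provides rdtsc() */
#include <stdio.h>

/* Only the low 32 bits of the 64-bit TSC are kept; enough for short deltas. */
static unsigned buf[17];
static volatile unsigned v;

int
main(void)
{
	int i;

	/* Take 17 back-to-back samples, then print the 16 deltas. */
	for (i = 0; i < 17; i++)
		buf[i] = rdtsc();
	for (i = 0; i < 16; i++)
		printf("%u\n", buf[i + 1] - buf[i]);
	/* Average cycles per loop iteration over 10^6 rdtsc calls. */
	buf[0] = rdtsc();
	for (i = 0; i < 1000000; i++)
		v = rdtsc();
	printf("%.1f\n", (v - buf[0]) / 1e6);
	return (0);
}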
%%%
Output:
77
63
63
70
63
63
63
70
63
63
70
63
63
63
70
63
65.2
%%%

It seems to always give a multiple of 7, so that might be the multiplier.
63 is also a lot, and limits the resolution to ~34 nS at 1.86GHz.

On an original Athlon64:
%%%
34
8
5
8
5
8
5
8
5
8
5
8
5
8
5
8
6.5
%%%

Phenom specs say 42 instead of ~6.5 IIRC.  Only slightly better than 63.
This is execution latency, but although rdtsc is non-serialized, there
is only 1 of it at least on old CPUs, so it can never deliver results
faster than its latency, on average.  The 5's in the above seem to be
lower than the latency, due to the 8's being delivered late.  I normally
write tests like the above in asm to get more control over the loop
overhead, but the above behaviour is interesting since it is what will
happen for normal unsynchronized use of rdtsc.

Bruce


