Date:      Tue, 24 Feb 2015 17:15:45 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        John-Mark Gurney <jmg@funkthat.com>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, Harrison Grundy <harrison.grundy@astrodoggroup.com>, freebsd-arch@freebsd.org
Subject:   Re: locks and kernel randomness...
Message-ID:  <20150224162133.J1477@besplex.bde.org>
In-Reply-To: <20150224042348.GA46794@funkthat.com>
References:  <20150224012026.GY46794@funkthat.com> <20150224015721.GT74514@kib.kiev.ua> <54EBDC1C.3060007@astrodoggroup.com> <20150224024250.GV74514@kib.kiev.ua> <20150224042348.GA46794@funkthat.com>

On Mon, 23 Feb 2015, John-Mark Gurney wrote:

> Konstantin Belousov wrote this message on Tue, Feb 24, 2015 at 04:42 +0200:
>> On Mon, Feb 23, 2015 at 06:04:12PM -0800, Harrison Grundy wrote:
>>>
>>> The patch attached to
>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197922 switches
>>> sched_balance to use get_cyclecount, which is also a suitable source
>>> of entropy for this purpose.
>>>
>>> It would also be possible to make the scheduler deterministic here,
>>> using cpuid or some such thing to make sure all CPUs don't fire the
>>> balancer at the same time.
>>
>> The patch in the PR is probably in the right direction, but might be too
>> simple, unless somebody dispels my misconception.  I remember seeing
>> claims that on very low-end embedded devices the get_cyclecount() method
>> may be non-functional, i.e., returning some constant, probably 0.  I
>> somehow associate the MIPS arch with this problem.

mips seems to use only a hardware cycle counter now.  This is obfuscated
by spelling the function name as mips_rd_ ## count in its implementation.
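
For illustration, a minimal sketch of the token-pasting pattern (the
macro name and register number here are approximations, not quotes
from the tree): the reader function only exists after the ## pasting,
so grepping the tree for "mips_rd_count" finds nothing.

#include <sys/cdefs.h>		/* __XSTRING() */
#include <sys/types.h>

#define	MIPS_RD32_COP0(n, r)				\
static __inline uint32_t				\
mips_rd_ ## n(void)					\
{							\
	uint32_t v;					\
							\
	__asm __volatile ("mfc0 %0, $" __XSTRING(r)	\
	    : "=r" (v));				\
	return (v);					\
}

MIPS_RD32_COP0(count, 9)	/* CP0 register 9 is the count register */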

The only archs with a very slow get_cyclecount() now seem to be i386
without a TSC and arm (in some cases).

> Well, the docs say:
>     The speed and the maximum value of each counter is CPU-dependent.  Some
>     CPUs (such as the Intel 80486) do not have such a register, so
>     get_cyclecount() on these platforms returns a (monotonic) combination of
>     numbers represented by the structure returned by binuptime(9).

The docs are wrong, of course.  Almost as bad as the design of this function.
arm now does:

X #ifdef _KERNEL
X #if __ARM_ARCH >= 6
X #include <machine/cpu-v6.h>
X #endif
X static __inline uint64_t
X get_cyclecount(void)
X {
X #if __ARM_ARCH >= 6
X 	return cp15_pmccntr_get();
X #else /* No performance counters, so use binuptime(9). This is slooooow */
X 	struct bintime bt;
X 
X 	binuptime(&bt);
X 	return ((uint64_t)bt.sec << 56 | bt.frac >> 8);
X #endif
X }
X #endif

Earlier versions returned the highly non-monotonic (bt.sec ^ bt.frac).
That was best for randomness, but get_cyclecount() was often abused
for timestamps, so arm now does the above.  The result is still
non-monotonic: it wraps every 256 seconds.  It discards the noise in
the lower 8 bits of bt.frac to make room for the predictable values
from bt.sec.  That noise is often 0, but may contain fairly
predictable values from ntpd adjustments.
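
A minimal userland sketch (hypothetical, just to show the arithmetic)
of the wrap: only the low 8 bits of bt.sec survive the << 56, so two
bintimes 256 seconds apart combine to the same value.

#include <stdint.h>
#include <stdio.h>

/* Userland stand-in for the kernel's struct bintime. */
struct bintime { int64_t sec; uint64_t frac; };

static uint64_t
combine(struct bintime bt)
{
	/* The arm formula: only sec % 256 survives the shift. */
	return ((uint64_t)bt.sec << 56 | bt.frac >> 8);
}

int
main(void)
{
	struct bintime a = { 1, 0 }, b = { 1 + 256, 0 };

	/* Prints the same value twice: the counter has wrapped. */
	printf("%jx %jx\n", (uintmax_t)combine(a), (uintmax_t)combine(b));
	return (0);
}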

i386 now uses cpu_ticks().  This partly defeats one of the excuses for
the existence of get_cyclecount().  Using binuptime() directly instead
of a get_cyclecount() micro-optimized to an inline rdtsc would have
cost a whole extra 10-20 cycles on modern x86.  Using cpu_ticks()
directly restores some of that cost.  In the best case, rdtsc is now
2 function calls away (1 of them indirect: a direct call to
cpu_ticks(), then an indirect call to rdtsc()).  The costs of these
things on (old) Athlon64 are:

     inline rdtsc: 13 cycles counting loop overhead
     add 1 direct function call: +5 cycles
     add another function call: direct +12 cycles; indirect +10 cycles

28 cycles for a function like cpu_ticks().  IIRC, binuptime() took
58 cycles on (older) AthlonXP.
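
A hedged sketch of the call chain being measured (the ticker and
function names here are stand-ins, not the kernel's; IIRC the real
hook is installed with set_cputicker()):

#include <stdint.h>

typedef uint64_t cpu_tick_f(void);

static uint64_t
rdtsc_ticks(void)
{
	uint64_t tsc;

	/* i386: rdtsc returns the TSC in edx:eax, which "=A" names. */
	__asm __volatile ("rdtsc" : "=A" (tsc));
	return (tsc);
}

static cpu_tick_f *cpu_ticker = rdtsc_ticks;

uint64_t
cpu_ticks_sketch(void)
{
	/* The indirect call: the extra ~10 cycles counted above. */
	return (cpu_ticker());
}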

Newer x86, and perhaps Intel rather than AMD, changes these times
significantly.  Intel's rdtsc always seemed to be slower than AMD's.
Synchronization for P-state invariance is expensive, so rdtsc on newer
x86 takes 40-70 cycles.  The extras for timecounters are then more in
the noise: something like 90-100 cycles total, with slightly more than
half of that for the hardware read.

When the TSC is not available, cpu_ticks() may use another cputicker,
but usually there is none (when there is one, it is usually just a TSC
by another name), so cpu_ticks() falls back to a timecounter.  It
doesn't do this in the safe way, using binuptime(), but returns the
hardware timecount with racy adjustments for monotonicity.  It still
has to read the hardware timecounter, and that is the slow part, so
this fallback is only used in the cases where it is too slow to use.
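
A hedged sketch of that fallback, modeled on kern_tc.c's
tc_cpu_ticks() (details from memory): the raw timecount is widened to
64 bits through unlocked static state, which is where the races live.

#include <sys/types.h>

/* Just the 2 timecounter fields the sketch needs; the real struct
 * timecounter in <sys/timetc.h> has more. */
struct timecounter {
	u_int	(*tc_get_timecount)(struct timecounter *);
	u_int	tc_counter_mask;
};

static uint64_t
tc_ticks_sketch(struct timecounter *tc)
{
	static uint64_t base;	/* unlocked: concurrent callers race */
	static u_int last;
	u_int u;

	u = tc->tc_get_timecount(tc) & tc->tc_counter_mask;
	if (u < last)		/* the hardware counter wrapped */
		base += (uint64_t)tc->tc_counter_mask + 1;
	last = u;
	return (base + u);
}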

Apart from its implementation bugs, cpu_ticks() is better than the
special adjustment for arm.

get_cyclecount() is abused for timestamps in de, ktr and sctp.  There
is no support for converting these "timestamps" to human-readable
values, even in the easy case where get_cyclecount() returns a
monotonic counter.  The network places should just use binuptime().
ktr shouldn't do that, and is probably broken in at least some of the
cases where it effectively does do it via a get_cyclecount() that
reads the hardware timecounter.
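
For the network cases, a sketch of the suggested replacement (the
helper name is hypothetical): take the timestamp with binuptime(9),
which is monotonic and trivially convertible to human-readable units,
here microseconds via the same scaling that bintime2timeval() uses.

#include <sys/param.h>
#include <sys/time.h>

static uint64_t
net_timestamp_us(void)
{
	struct bintime bt;

	binuptime(&bt);
	/* bt.frac is a 64-bit binary fraction of a second. */
	return ((uint64_t)bt.sec * 1000000 +
	    (((uint64_t)1000000 * (uint32_t)(bt.frac >> 32)) >> 32));
}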

Bruce


