Date: Tue, 24 Feb 2015 17:15:45 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: John-Mark Gurney <jmg@funkthat.com>
Cc: Konstantin Belousov <kostikbel@gmail.com>, Harrison Grundy <harrison.grundy@astrodoggroup.com>, freebsd-arch@freebsd.org
Subject: Re: locks and kernel randomness...
Message-ID: <20150224162133.J1477@besplex.bde.org>
In-Reply-To: <20150224042348.GA46794@funkthat.com>
References: <20150224012026.GY46794@funkthat.com> <20150224015721.GT74514@kib.kiev.ua> <54EBDC1C.3060007@astrodoggroup.com> <20150224024250.GV74514@kib.kiev.ua> <20150224042348.GA46794@funkthat.com>
On Mon, 23 Feb 2015, John-Mark Gurney wrote:

> Konstantin Belousov wrote this message on Tue, Feb 24, 2015 at 04:42 +0200:
>> On Mon, Feb 23, 2015 at 06:04:12PM -0800, Harrison Grundy wrote:
>>>
>>> The patch attached to
>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197922 switches
>>> sched_balance to use get_cyclecount, which is also a suitable source
>>> of entropy for this purpose.
>>>
>>> It would also be possible to make the scheduler deterministic here,
>>> using cpuid or some such thing to make sure all CPUs don't fire the
>>> balancer at the same time.
>>
>> The patch in the PR is probably in the right direction, but might be too
>> simple, unless somebody dispels my fallacy.  I remember seeing claims
>> that on very low-end embedded devices the get_cyclecount() method may
>> be non-functional, i.e., returning some constant, probably 0.  I somehow
>> associate the MIPS arch with this bias.

mips seems to use only a hardware cycle counter now.  This is obfuscated
by spelling the function name as mips_rd_ ## count in its implementation.
The only arches with a very slow get_cyclecount() now seem to be i386
without a TSC and arm (in some cases).

> Well, the docs say:
> The speed and the maximum value of each counter is CPU-dependent.  Some
> CPUs (such as the Intel 80486) do not have such a register, so
> get_cyclecount() on these platforms returns a (monotonic) combination of
> numbers represented by the structure returned by binuptime(9).

The docs are wrong, of course.  Almost as bad as the design of this
function.  arm now does:

X #ifdef _KERNEL
X #if __ARM_ARCH >= 6
X #include <machine/cpu-v6.h>
X #endif
X static __inline uint64_t
X get_cyclecount(void)
X {
X #if __ARM_ARCH >= 6
X 	return cp15_pmccntr_get();
X #else /* No performance counters, so use binuptime(9).  This is slooooow */
X 	struct bintime bt;
X
X 	binuptime(&bt);
X 	return ((uint64_t)bt.sec << 56 | bt.frac >> 8);
X #endif
X }
X #endif

Versions used to return the highly non-monotonic (bt.sec ^ bt.frac).
This was best for randomness, but get_cyclecount() was often abused for
timestamps, so arm now does the above.  It is still non-monotonic: it
wraps every 256 seconds.  It discards the noise in the lower 8 bits of
bt.frac to keep the predictable values in bt.sec.  The noise is often 0,
but may contain fairly predictable values from ntpd adjustments.

i386 now uses cpu_ticks().  This partly defeats one of the excuses for
the existence of get_cyclecount().  Using binuptime() directly instead
of get_cyclecount() micro-optimized to an inline rdtsc would have cost
a whole extra 10-20 cycles on modern x86.  Using cpu_ticks() directly
restores some of this cost.  In the best case, rdtsc is now 2 function
calls away (1 indirect: call cpu_ticks(), then an indirect call to
rdtsc()).  The costs of these things on (old) Athlon64 are:

	inline rdtsc:               13 cycles (counting loop overhead)
	add 1 direct function call: +5 cycles
	add another function call:  direct +12 cycles; indirect +10 cycles

That gives 28 cycles for a function like cpu_ticks().  IIRC, binuptime()
took 58 cycles on (older) AthlonXP.  Newer x86, and maybe Intel instead
of AMD, changes these times significantly.  Intel rdtsc always seemed to
be slower than AMD rdtsc.  Synchronization for P-state invariance is
expensive, so rdtsc on newer x86 takes 40-70 cycles.  The extras for
timecounters are more in the noise: something like 90-100 cycles total,
with slightly more than half for the hardware.

When the TSC is not available, cpu_ticks() may use another cputicker,
but usually there is none, since if there is a cputicker then it is a
TSC by another name, and cpu_ticks() then uses a timecounter.  It
doesn't do this in the safe way using binuptime(), but returns the
hardware timecount with racy adjustments for monotonicity.  It has to
read the hardware timecounter, and that is the slow part, so this
method is only used in cases where it is too slow to use.  Apart from
its implementation bugs, cpu_ticks() is better than the special
adjustment for arm.
get_cyclecount() is abused for timestamps in de, ktr and sctp.  There is
no support for converting these "timestamps" to human-readable values,
even in the easy case where get_cyclecount() returns a monotonic counter.
The network places should just use binuptime().  ktr shouldn't do that,
and is probably broken in at least some cases where it does do that via
get_cyclecount() using the hardware timecounter.

Bruce