From owner-svn-src-all@FreeBSD.ORG  Sat Jun 23 15:26:25 2012
Date: Sun, 24 Jun 2012 01:26:01 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20120623140556.GU2337@deviant.kiev.zoral.com.ua>
Message-ID: <20120624005418.W2417@besplex.bde.org>
References: <201206220713.q5M7DVH0063098@svn.freebsd.org>
	<20120622073455.GE69382@alchemy.franken.de>
	<20120622074817.GA2337@deviant.kiev.zoral.com.ua>
	<20120623131757.GB46065@alchemy.franken.de>
	<20120623140556.GU2337@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org,
	src-committers@FreeBSD.org, Marius Strobl <marius@alchemy.franken.de>
Subject: Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys
 include sys

On Sat, 23 Jun 2012, Konstantin Belousov wrote:

> On Sat, Jun 23, 2012 at 03:17:57PM +0200, Marius Strobl wrote:
>> On Fri, Jun 22, 2012 at 10:48:17AM +0300, Konstantin Belousov wrote:
>>> On Fri, Jun 22, 2012 at 09:34:56AM +0200, Marius Strobl wrote:
>>>> On Fri, Jun 22, 2012 at 07:13:31AM +0000, Konstantin Belousov wrote:
>>>>> Author: kib
>>>>> Date: Fri Jun 22 07:13:30 2012
>>>>> New Revision: 237434
>>>>> URL: http://svn.freebsd.org/changeset/base/237434
>>>>>
>>>>> Log:
>>>>>   Use struct vdso_timehands data to implement fast gettimeofday(2) and
>>>>>   clock_gettime(2) functions if supported. The speedup seen in
>>>>>   microbenchmarks is in range 4x-7x depending on the hardware.
>>>>>
>>>>>   Only amd64 and i386 architectures are supported. Libc uses rdtsc and
>>>>>   kernel data to calculate current time, if enabled by kernel.
>>>>
>>>> I don't know much about x86 CPUs but is my understanding correct
>>>> that TSCs are not synchronized in any way across CPUs, i.e.
>>>> reading it on different CPUs may result in time going backwards
>>>> etc., which is okay for this application though?
>>>
>>> Generally speaking, tsc state among different CPU after boot is not
>>> synchronized, you are right.
>>>
>>> Kernel has somewhat doubtful test which verifies whether the after-boot
>>> state of tsc looks good. If the test fails, TSC is not enabled by
>>> default as timecounter, and then usermode follows kernel policy and
>>> falls back to slow syscall. So we err on the safe side.
>>> I tested this on Core i7 2xxx, where the test (usually) passes.
>>
>> Okay, so for x86 the TSCs are not used as timecounters by either
>> the kernel or userland in the SMP case if they don't appear to
>> be synchronized, correct?
> Correct as for now.  But this is bug and not a feature. The tscs shall
> be synchronized, or skew tables calculated instead of refusing to use it.
>>>
>>> While you are there. do you have comments about sparc64 TICK counter ?
>>> On SMP, the counter of BSP is used by IPI. Is it unavoidable ?
>>
>> The TICK counters are per-core and not synchronized by the hardware.
>> We synchronize APs with the BSP on bring-up but they drift over time
>> and the initial synchronization might not be perfect in the first
>> place. At least in the past, drifting TICK counters caused all sorts
>> of issues and strange behavior in FreeBSD when used as timecounter
>> in the SMP case. If my understanding of the above is right, as is
>> this still rules them out as timecounters for userland.
>> Linux has some complex code (based on equivalent code origining in
>> their ia64 port) for constantly synchronizing the TICK counters.
>> In order to avoid that complexity and overhead, what I do in
>> FreeBSD in the SMP case is to (ab)use counters (either intended

Attempted synchronization of TSCs is left out on x86 for the same reason,
except that some half-baked synchronization crept in for a home-made time
function in dtrace (dtrace_gethrtime() on amd64 and i386).

>> for that purpose or bus cycle counters probably intended for
>> debugging the hardware during development) available in the
>> various host-to-foo bridges so it doesn't matter which CPU they
>> are read by. This works just fine except for pre-PCI-Express
>> based USIIIi machines, where the bus cycle counters are broken.
>> That's where the TICK counter is always read from the BSP
>> using an IPI in the SMP case. The latter is done as sched_bind(9)
>> isn't possible with td_critnest > 1 according to information
>> from jhb@ and mav@.

How can it work fine?  Buses are too slow.  On x86, ACPI-fast takes
700-1900 nsec on machines that I've tested (mostly pre-PCIe ones).
HPET seems to be only slightly faster (maybe 500 nsec).
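
Figures like these are easy to re-measure.  The following is only a minimal
sketch of such a measurement, not the code I used: avg_clock_read_nsec() is
a made-up name, and it times the whole clock_gettime(2) path from userland
(syscall or libc fast path plus the hardware read), not the bare hardware
access.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper: average cost in nsec of one clock_gettime(id)
 * call, measured over n back-to-back calls.
 */
double
avg_clock_read_nsec(clockid_t id, long n)
{
	struct timespec start, end, sink;
	long i;

	assert(n > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < n; i++)
		clock_gettime(id, &sink);	/* the call being timed */
	clock_gettime(CLOCK_MONOTONIC, &end);

	/* Elapsed nsec divided by the number of timed calls. */
	return (((double)(end.tv_sec - start.tv_sec) * 1e9 +
	    (double)(end.tv_nsec - start.tv_nsec)) / (double)n);
}
```

Calling avg_clock_read_nsec(CLOCK_MONOTONIC, 1000000) after switching
kern.timecounter.hardware between the available timecounters should show
the difference between the bus-based counters and the TSC.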

>> So apart from introducing code to constantly synchronize the
>> TICK counters, using the timecounters on the host busses also
>> seems to be the only viable solution for userland. The latter
>> should be doable but is long-winded as besides duplicating
>> portions of the corresponding device drivers in userland, it
>> probably also means to get some additional infrastructure
>> like being able to memory map registers for devices on the
>> nexus(4) level in place ...

There is little point in optimizations to avoid syscalls for slow hardware.
On x86, a syscall takes 100-400 nsec extra, so if the hardware takes
500-2000 nsec then reducing the total time by 100-400 nsec is not
very useful.
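
To put numbers on that, here is a small sketch of the arithmetic.  The
function name is made up for illustration, and the model simply assumes that
the total cost of a time read is the hardware access plus the syscall
overhead.

```c
#include <assert.h>

/*
 * Hypothetical helper: speedup obtained by removing the syscall overhead
 * from a time read whose hardware access costs hw_nsec.
 */
double
syscall_avoidance_speedup(double hw_nsec, double syscall_nsec)
{
	assert(hw_nsec > 0.0 && syscall_nsec >= 0.0);
	/* Time with the syscall divided by time without it. */
	return ((hw_nsec + syscall_nsec) / hw_nsec);
}
```

With the ranges above, the best case (500 nsec hardware, 400 nsec syscall)
is 900/500 = 1.8x and the worst case (2000 nsec hardware, 100 nsec syscall)
is 2100/2000 = 1.05x, well short of the 4x-7x reported for the TSC.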

> Understand. I do plan eventually to map HPET counters page into usermode
> on x86.

This should be left out too.

> Also, as I noted above, some code to synchronize per-package counters
> would be useful for x86, so it might be developed with multi-arch
> usage in mind.

It's only worth synchronizing fast timecounter hardware so that it can be
used in more cases.  It probably needs to be non-bus based to be fast.
That means the TSC on x86.
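
For reference, the skew-table idea mentioned above can be sketched roughly
as follows.  This is an illustration only, not existing kernel code: NCPU,
tsc_skew[], tsc_skew_record() and tsc_normalize() are all hypothetical, and
the sketch assumes that each CPU's offset from a reference CPU was sampled
once and stays constant, so it does nothing about counters that drift.

```c
#include <assert.h>
#include <stdint.h>

#define	NCPU	4		/* hypothetical number of CPUs */

/* Offset of each CPU's counter from the reference CPU's counter. */
int64_t tsc_skew[NCPU];

/* Record the skew from a pair of readings taken at the same instant. */
void
tsc_skew_record(int cpu, uint64_t ref_tsc, uint64_t cpu_tsc)
{
	assert(cpu >= 0 && cpu < NCPU);
	tsc_skew[cpu] = (int64_t)(cpu_tsc - ref_tsc);
}

/* Convert a raw reading taken on 'cpu' to the reference CPU's timeline. */
uint64_t
tsc_normalize(int cpu, uint64_t raw_tsc)
{
	assert(cpu >= 0 && cpu < NCPU);
	return (raw_tsc - (uint64_t)tsc_skew[cpu]);
}
```

The hard part is not this subtraction but taking the paired readings
accurately and keeping the table valid over time.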

The new timeout code to support tickless kernels looks like it will give
large pessimizations unless the timecounter is fast.  Instead of using
the tick counter (1 atomic increment on every clock tick) and some
getbinuptime() calls in places like select(), it uses the hardware
timecounter via binuptime() in most places (since without a tick counter
and without clock interrupts updating the timehands periodically, it takes
a hardware timecounter read to determine the time).  So callout_reset()
might start taking thousands of nsec per call, depending on how slow
the timecounter is.  The fix is probably to use a fuzzy time for long
timeouts and to discourage use of short timeouts and/or to turn them
into long or fuzzy timeouts so that they are not very useful.
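
A rough sketch of what "fuzzy time for long timeouts" could look like is
below.  Everything in it is hypothetical (struct uptime_source,
timeout_deadline(), the tolerance parameter and the fake hardware reader are
not from the callout code), and it assumes that some cheap cached time with
bounded staleness is still available, which a tickless kernel may not
guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the two ways to learn the current uptime. */
struct uptime_source {
	int64_t	cached_nsec;		/* last cheaply stored uptime */
	int64_t	tick_nsec;		/* maximum staleness of cached_nsec */
	int64_t	(*read_hw)(void);	/* precise, possibly slow read */
};

/*
 * Return the absolute deadline for a timeout of timeout_nsec.  The cached
 * time is used when its staleness is at most 1/tolerance of the timeout;
 * otherwise the hardware timecounter is read.
 */
int64_t
timeout_deadline(const struct uptime_source *src, int64_t timeout_nsec,
    int64_t tolerance)
{
	assert(timeout_nsec >= 0 && tolerance > 0);
	if (src->tick_nsec <= timeout_nsec / tolerance)
		return (src->cached_nsec + timeout_nsec);	/* fuzzy */
	return (src->read_hw() + timeout_nsec);		/* precise */
}

/* Stand-in for a slow timecounter; counts how often it is consulted. */
int hw_reads;

int64_t
fake_read_hw(void)
{
	hw_reads++;
	return (5000500000LL);
}
```

With a 1 msec tick and a tolerance of 100, a 1 second timeout uses the
cached time and never touches the hardware, while a 50 usec timeout still
pays for a hardware read.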

Bruce