From owner-svn-src-all@FreeBSD.ORG Wed May  7 03:47:48 2014
Date: Wed, 7 May 2014 13:47:31 +1000 (EST)
From: Bruce Evans
To: Alan Somers
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
    src-committers@freebsd.org
Subject: Re: svn commit: r265472 - head/bin/dd
In-Reply-To: <201405062206.s46M6dxW060155@svn.freebsd.org>
Message-ID: <20140507113345.B923@besplex.bde.org>

On Tue, 6 May 2014, Alan Somers wrote:

> Log:
>   dd(1) uses gettimeofday(2) to compute the throughput statistics.  However,
>   gettimeofday returns the system clock, which may jump forward or back,
>   especially if NTP is in use.  If the time jumps backwards, then dd will see
>   negative elapsed time, round it up to 1usec, and print an absurdly fast
>   transfer rate.
>
>   The solution is to use clock_gettime(2) with CLOCK_MONOTONIC_PRECISE as the
>   clock_id.  That clock advances steadily, regardless of changes to the system
>   clock.
>
>   Reviewed by:	delphij
>   MFC after:	3 days
>   Sponsored by:	Spectra Logic
> ...
> Modified: head/bin/dd/dd.c
> ==============================================================================
> --- head/bin/dd/dd.c	Tue May  6 22:04:50 2014	(r265471)
> +++ head/bin/dd/dd.c	Tue May  6 22:06:39 2014	(r265472)
> @@ -50,7 +50,6 @@ __FBSDID("$FreeBSD$");
>  #include
>  #include
>  #include
> -#include <sys/time.h>
>
>  #include
>  #include
> @@ -61,6 +60,8 @@ __FBSDID("$FreeBSD$");
>  #include
>  #include
>  #include
> +#include <sysexits.h>

Use of <sysexits.h> is a style bug.  It is not used in BSD or KNF code
like dd used to be.

> +#include <time.h>
>  #include
>
>  #include "dd.h"
> @@ -123,7 +124,7 @@ static void
>  setup(void)
>  {
>  	u_int cnt;
> -	struct timeval tv;
> +	struct timespec tv;
>
>  	if (in.name == NULL) {
>  		in.name = "stdin";
> @@ -240,8 +241,9 @@ setup(void)
>  		ctab = casetab;
>  	}
>
> -	(void)gettimeofday(&tv, NULL);

Not checking for errors, and especially voiding the result to say that
errors need not be checked for, was sloppy.

> -	st.start = tv.tv_sec + tv.tv_usec * 1e-6;
> +	if (clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv))
> +		err(EX_OSERR, "clock_gettime");

The existence of CLOCK_MONOTONIC_PRECISE is a design error.  It is the
same as the standard CLOCK_MONOTONIC.  Use of the former is just
gratuitously unportable.  It ensures a compile-time failure on OSes
(including old versions of FreeBSD) that don't have the
CLOCK_MONOTONIC_PRECISE mistake, and a runtime error on old versions of
FreeBSD that have clock_gettime() and CLOCK_MONOTONIC but not
CLOCK_MONOTONIC_PRECISE.  Apart from this error, clock_gettime() is as
likely to fail as gettimeofday(), or exit().
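All that portability needs is the standard clock id.  A minimal,
untested sketch (the helper function and its name are mine, not dd's):

#include <err.h>
#include <time.h>

static double
now_monotonic(void)
{
	struct timespec ts;

	/* The standard clock id works wherever clock_gettime() exists. */
	if (clock_gettime(CLOCK_MONOTONIC, &ts) == -1)
		err(1, "clock_gettime");
	return (ts.tv_sec + ts.tv_nsec * 1.0e-9);
}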
The existence of CLOCK_MONOTONIC_FAST is a larger design error.  It is
not fast, but just sloppy and broken, and not even very fast (the
sloppiness is the intentional part of its design).  Of course, there is
no need to use it to speed up the whole 2 clock_gettime() calls in dd.
It is almost useless for other applications too.  In non-broken
implementations, the speed of CLOCK_* tends to be dominated by the
syscall time even when the timer hardware is slow.  On freefall now,
the implementation is about half as good as possible, and takes between
32 and 35 nanoseconds for all clock ids, with the least accurate clock
id TIME_SECOND tying with CLOCK_REALTIME for slowest (only about 1
nanosecond slower) since it has the worst implementation.

(The implementation on freefall avoids syscalls provided the timer
hardware is the TSC.  The hardware timer and binuptime() are used for
all syscalls including TIME_SECOND.  TIME_SECOND is slowest because it
does extra work to zero tv_nsec after calculating the time with maximal
precision.  Doing it this way gives the correct time in seconds, but is
incompatible with the kernel (syscall and kernel seconds variable)
since the kernel calculations are sloppy.  The kernel time for seconds,
for the sloppy "FAST" clock ids, and for the corresponding kernel
timestamp functions lags the precise time by up to 1/HZ seconds.  This
difference can easily be seen from userland.

The hardware part of the 32-35 nanoseconds is 2-4 nanoseconds.  This is
in sloppy tests.  Reading the TSC is unserialized, and loops reading it
may be pipelined too well to time it accurately.  Also, the times read
from it are not completely accurate.  But clock_gettime() doesn't
serialize it either.  Serializing it would make it much slower (maybe
50 nanoseconds).  2-4 nanoseconds is almost as fast as on an AthlonXP
10 years ago!  rdtsc tends to be slower than that on Intel CPUs and on
modern CPUs generally; synchronization in hardware makes it slower on
modern CPUs.  ISTR it took about 65 cycles on a previous generation of
Intel CPUs.

32-35 nanoseconds for the libc implementation is actually less than
half as good as is possible if the hardware part really is only 2-4
nanoseconds.  There is a lot of overhead from layering and from working
around hardware bugs that usually don't exist.)

Bad implementations can be quite bad.  With an i8254 timecounter, it
takes about 5000 nanoseconds to read the hardware.  On a 486, it took
an additional 10000-20000 nanoseconds for overheads including a
syscall.  On a Pentium 1, it only took 1000-2000 nanoseconds for
overheads in old versions of FreeBSD (much more now).  With an
ACPI-"fast" timecounter, it takes 1000-2000 nanoseconds to read the
hardware.  HPET is only a few times faster.  This does dominate syscall
times on modern CPUs: syscall times are more like 100-200 nanoseconds
there, and the libc implementation reduces this to 30-60 nanoseconds.
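For comparison, the per-call cost is easy to estimate with a loop.  An
untested sketch, sloppy in the same way as the numbers above (nothing
is serialized, and the loop overhead is included in the result):

#include <stdio.h>
#include <time.h>

#define	NITER	10000000

int
main(void)
{
	struct timespec t0, t1, ts;
	double elapsed;
	int i;

	/* Time NITER back-to-back clock_gettime() calls. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NITER; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	elapsed = t1.tv_sec - t0.tv_sec +
	    (t1.tv_nsec - t0.tv_nsec) * 1.0e-9;
	printf("%.1f ns/call\n", elapsed * 1.0e9 / NITER);
	return (0);
}

Substituting CLOCK_MONOTONIC_FAST or CLOCK_REALTIME for the clock id
should show how little difference the "FAST" variants make when the
timecounter is the TSC.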
> +	st.start = tv.tv_sec + tv.tv_nsec * 1.0e-9;
>  }
>
>  static void
>
> Modified: head/bin/dd/misc.c
> ==============================================================================
> --- head/bin/dd/misc.c	Tue May  6 22:04:50 2014	(r265471)
> +++ head/bin/dd/misc.c	Tue May  6 22:06:39 2014	(r265472)
> @@ -40,14 +40,16 @@ static char sccsid[] = "@(#)misc.c	8.3 (
>  __FBSDID("$FreeBSD$");
>
>  #include <sys/types.h>
> -#include <sys/time.h>
>
> +#include <err.h>
>  #include
>  #include
>  #include
>  #include
>  #include
>  #include
> +#include <sysexits.h>
> +#include <time.h>
>  #include
>
>  #include "dd.h"
> @@ -56,16 +58,20 @@ __FBSDID("$FreeBSD$");
>  void
>  summary(void)
>  {
> -	struct timeval tv;
> -	double secs;
> +	struct timespec tv, tv_res;
> +	double secs, res;
>
>  	if (ddflags & C_NOINFO)
>  		return;
>
> -	(void)gettimeofday(&tv, NULL);
> -	secs = tv.tv_sec + tv.tv_usec * 1e-6 - st.start;
> -	if (secs < 1e-6)
> -		secs = 1e-6;

Bogus fixup.  secs had microseconds granularity except possibly for
tiny roundoff errors.  I think the roundoff errors cannot give a result
near but not identical to 1e-6 or 0 unless tv_sec has a garbage value.
The difference can be 0 in the unlikely event that the run takes less
than 1 microsecond, or negative when the time goes backwards.  Fixing
up 0 to 1e-6 is reasonable.  Fixing up negative differences to 1e-6
avoids division by 0 but gives garbage results.  I think this was only
intended to fix up 0, and fixing up negative differences is accidental.

> +	if (clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv))
> +		err(EX_OSERR, "clock_gettime");
> +	if (clock_getres(CLOCK_MONOTONIC_PRECISE, &tv_res))
> +		err(EX_OSERR, "clock_getres");

clock_getres() is almost useless, and is useless here.  It is broken as
designed: the precision may be less than 1 nanosecond, but 1 nanosecond
is the smallest positive representable value.  That is not a problem
here, since clock_gettime() also can't distinguish differences smaller
than 1 nanosecond.

> +	secs = tv.tv_sec + tv.tv_nsec * 1.0e-9 - st.start;
> +	res = tv_res.tv_sec + tv_res.tv_nsec * 1.0e-9;

IIRC, POSIX is confused about resolution vs precision, and
clock_getres() actually returns the resolution, so this variable has
the correct name.  I quote "precision" in the following to indicate
this confusion.

> +	if (secs < res)
> +		secs = res;

The fixup is now only reachable in 3 cases that can't happen:
- when the monotonic time goes backwards due to a kernel bug
- when the monotonic time doesn't increase, so that the difference is
  0.  Oops, this can happen for timecounters with very low "precision".
  You don't need to know the "precision" to check for this.
- when the monotonic time does increase, but by an amount smaller than
  the "precision".  This indicates that the "precision" is wrong.

In the 3rd case, the actual difference may as well be used.  It is
probably slightly wrong at worst (say 999 nanoseconds when the
"precision" is 1000 nanoseconds).  The timing is going to be very
inaccurate for short runs, so another 10% inaccuracy doesn't matter.
At worst, the difference might be 1 nanosecond when it should be
hundreds of nanoseconds.  (I don't see how this can happen without a
kernel bug.  Just reading the time can take thousands of nanoseconds.)
The result will be garbage, but it won't overflow.

In the 1st case, we don't know the correct fixup, and a tiny value from
clock_getres() is no better than any other value, except that it makes
the garbage in the result more obvious.  In the 2nd case, fixing up to
the "precision" may give a large overestimate.  The fixup might as well
be to a nominal value like 1 nanosecond or 1 second.  CLOCK_MONOTONIC
can't have a very low precision, and the timing for runs that don't
take as long as a large multiple of the precision is inaccurate anyway.
We could also just report the result as garbage in this case.
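Something like the following implements that nominal fixup without
consulting clock_getres() at all (an untested sketch; the helper is
illustrative and not in dd):

#include <err.h>
#include <time.h>

static double
elapsed_secs(double start)
{
	struct timespec ts;
	double secs;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) == -1)
		err(1, "clock_gettime");
	secs = ts.tv_sec + ts.tv_nsec * 1.0e-9 - start;
	/*
	 * secs can only be <= 0 if the clock did not advance (a very
	 * low resolution timecounter) or went backwards (a kernel
	 * bug).  Fix up to a nominal 1 nanosecond to avoid division
	 * by 0; any positive difference is used as-is.
	 */
	if (secs <= 0.0)
		secs = 1.0e-9;
	return (secs);
}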
>  	(void)fprintf(stderr,
>  	    "%ju+%ju records in\n%ju+%ju records out\n",
>  	    st.in_full, st.in_part, st.out_full, st.out_part);
> @@ -77,7 +83,7 @@ summary(void)
>  		    st.trunc, (st.trunc == 1) ? "block" : "blocks");
>  	if (!(ddflags & C_NOXFER)) {
>  		(void)fprintf(stderr,
> -		    "%ju bytes transferred in %.6f secs (%.0f bytes/sec)\n",
> +		    "%ju bytes transferred in %.9f secs (%.0f bytes/sec)\n",

Nanoseconds resolution is excessive here, and changes the output
format.  The only use of it is to debug cases where the output is
garbage due to the interval being about 1 nanosecond.  Printing
nanoseconds resolution is also inconsistent with the fussy "precision"
adjustment above.

>  		    st.bytes, secs, st.bytes / secs);
>  	}
>  	need_summary = 0;

In practice, we can depend on at least microseconds "precision", since
even gettimeofday() gave that as a precision: gettimeofday() was fixed
on at least x86 20-25 years ago so that it had almost the same
"precision" (resolution) as its granularity.

In ping(8), we still just use microseconds resolution, and even
non-monotonic gettimeofday().  It only had milliseconds resolution in
BSD 20-25 years ago, and in Linux much more recently.  dg fixed it to
use the microseconds resolution that became available when
gettimeofday() was fixed.  clock_gettime() didn't exist then.  I tried
converting it to use nanoseconds resolution and wasn't happy with the
results.  Formats like %.9f are needed to see the full resolution, but
networking latency is still too large, so resolution of more than 1
microsecond is rarely useful, and the extra digits from %.9f format
just make the values harder to read.

Bruce