Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 7 May 2014 12:10:31 -0600
From:      Alan Somers <asomers@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>, "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>, "src-committers@freebsd.org" <src-committers@freebsd.org>, Alan Somers <asomers@freebsd.org>
Subject:   Re: svn commit: r265472 - head/bin/dd
Message-ID:  <CAOtMX2h_%2B1G18Nv5JvDE0H7_TZ96p81JotOwhq1Jm-dOOeahPw@mail.gmail.com>
In-Reply-To: <20140507113345.B923@besplex.bde.org>
References:  <201405062206.s46M6dxW060155@svn.freebsd.org> <20140507113345.B923@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, May 6, 2014 at 9:47 PM, Bruce Evans <brde@optusnet.com.au> wrote:
> On Tue, 6 May 2014, Alan Somers wrote:
>
>> Log:
>>  dd(1) uses gettimeofday(2) to compute the throughput statistics.
>> However,
>>  gettimeofday returns the system clock, which may jump forward or back,
>>  especially if NTP is in use.  If the time jumps backwards, then dd will
>> see
>>  negative elapsed time, round it up to 1usec, and print an absurdly fast
>>  transfer rate.
>>
>>  The solution is to use clock_gettime(2) with CLOCK_MONOTONIC_PRECISE as
>> the
>>  clock_id.  That clock advances steadily, regardless of changes to the
>> system
>>  clock.
>>
>>  Reviewed by:   delphij
>>  MFC after:     3 days
>>  Sponsored by:  Spectra Logic
>> ...
>>
>> Modified: head/bin/dd/dd.c
>>
>> ==============================================================================
>> --- head/bin/dd/dd.c    Tue May  6 22:04:50 2014        (r265471)
>> +++ head/bin/dd/dd.c    Tue May  6 22:06:39 2014        (r265472)
>> @@ -50,7 +50,6 @@ __FBSDID("$FreeBSD$");
>> #include <sys/conf.h>
>> #include <sys/disklabel.h>
>> #include <sys/filio.h>
>> -#include <sys/time.h>
>>
>> #include <ctype.h>
>> #include <err.h>
>> @@ -61,6 +60,8 @@ __FBSDID("$FreeBSD$");
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> +#include <sysexits.h>
>
>
> Use of <sysexits.h> is a style bug.  It is not used in BSD or KNF code
> like dd used to be.

sysexits.h is recommended by the err(3) man page.  Is that
recommendation meant to apply selectively, or is it obsolete, or is
some sort of edit war being waged by man page authors?

>
>
>> +#include <time.h>
>> #include <unistd.h>
>>
>> #include "dd.h"
>> @@ -123,7 +124,7 @@ static void
>> setup(void)
>> {
>>         u_int cnt;
>> -       struct timeval tv;
>> +       struct timespec tv;
>>
>>         if (in.name == NULL) {
>>                 in.name = "stdin";
>> @@ -240,8 +241,9 @@ setup(void)
>>                 ctab = casetab;
>>         }
>>
>> -       (void)gettimeofday(&tv, NULL);
>
>
> Not checking for errors, and especially voiding the result to say that
> errors need not be checked for, was sloppy.
>
>
>> -       st.start = tv.tv_sec + tv.tv_usec * 1e-6;
>> +       if (clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv))
>> +               err(EX_OSERR, "clock_gettime");
>
>
> The existence of CLOCK_MONOTONIC_PRECISE is a design error.  It is the
> same as the standard CLOCK_MONOTONIC.  Use of the former is just
> gratuitously unportable.  It ensures a compile-time failure on OSes
> (including old versions of FreeBSD) that don't have the
> CLOCK_MONOTONIC_PRECISE mistake.  It gives a runtime error on old
> versions of FreeBSD that have clock_gettime() and CLOCK_MONOTONIC but
> not CLOCK_MONOTONIC_PRECISE.  Apart from this error, clock_gettime()
> is as likely to fail as gettimeofday(), or exit().
>
> The existence of CLOCK_MONOTONIC_FAST is a larger design error.  It
> is not fast, but just sloppy, broken and not very fast (sloppiness is
> the intentional part of its design).  Of course there is no need to
> use it to speed up the whole 2 clock_gettime() calls in dd.  It is
> almost useless for other applications too.  In non-broken implementations,
> the speed of CLOCK_* tends to be dominated by the syscall time even
> when the timer hardware is slow.  On freefall now, the implementation
> is about half as good as possible, and takes between 32 and 35 nanoseconds
> for all clock ids, with the least accurate clock id TIME_SECOND tieing
> with CLOCK_REALTIME for slowest (only about 1 nanosecond slower) since
> it has the worst implementation.
>   (The implementation on freefall avoids syscalls provided the timer
>   hardware is the TSC.  The hardware timer and binuptime() are used for
>   all syscalls including TIME_SECOND.  TIME_SECOND is slowest because
>   it does extra work to zero tv_sec after calculating it with maximal
>   precision.  Doing it this way gives the correct time in seconds, but
>   is incompatible with the kernel (syscall and kernel seconds variable)
>   since the kernel calculations are sloppy.  The kernel time for seconds
>   and for the sloppy "FAST" clock ids and for the corresponding kernel
>   timestamp functions lags the precise time by up to 1/HZ seconds.  This
>   difference can easily be seen from userland.
>
>   The hardware part of the 32-35 nanoseconds is 2-4 nanoseconds.  This is
>   in sloppy tests.  Reading the TSC is unserialized, and loops reading it
>   may be pipelined too well to time it accurately.  Also, the times read
>   from it are not completely accurate.  But clock_gettime() doesn't
> serialize
>   it either.  Serializing it would make it much slower (maybe 50
> nanoseconds).
>
>   2-4 nanoseconds is almost as fast as on AthlonXP 10 years ago!  rdtsc
>   tends to be slower than that on Intel CPUs and on modern CPUs.
>   Synchronization in hardware makes it slower on modern CPUs.  ISTR it
>   took about 65 cycles on a previous generation of Intel CPUs.
>
>   32-35 nanonseconds for the libc implementation is actually less than
>   half as good as it possible if the hardware part really is only 2-4
>   nanoseconds.  There is a lot of overhead from layering and from working
>   around hardware bugs that usually don't exist.
>   )
>
> Bad implementations can be quite bad.  With an i8254 timecounter, it takes
> about 5000 nanoseconds to read the hardware.  On a 486, it took an
> additional
> 10000-20000 nanoseconds for overheads including a syscall.  On a Pentium1,
> it only took 1000-2000 nanoseconds for overheads in old versions of FreeBSD
> (much more now).  With an ACPI-"fast" timecounter, it takes 1000-2000
> nanoseconds to read the hardware.  HPET is only a few times faster.  This
> does dominate syscall times on modern CPUs.  syscall times are more like
> 100-200 nanoseconds on modern CPUs.  The libc implementation reduces this
> to 30-60 nanoseconds.
>
>
>> +       st.start = tv.tv_sec + tv.tv_nsec * 1.0e-9;
>> }
>>
>> static void
>>
>> Modified: head/bin/dd/misc.c
>>
>> ==============================================================================
>> --- head/bin/dd/misc.c  Tue May  6 22:04:50 2014        (r265471)
>> +++ head/bin/dd/misc.c  Tue May  6 22:06:39 2014        (r265472)
>> @@ -40,14 +40,16 @@ static char sccsid[] = "@(#)misc.c  8.3 (
>> __FBSDID("$FreeBSD$");
>>
>> #include <sys/types.h>
>> -#include <sys/time.h>
>>
>> +#include <err.h>
>> #include <errno.h>
>> #include <inttypes.h>
>> #include <signal.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> +#include <sysexits.h>
>> +#include <time.h>
>> #include <unistd.h>
>>
>> #include "dd.h"
>> @@ -56,16 +58,20 @@ __FBSDID("$FreeBSD$");
>> void
>> summary(void)
>> {
>> -       struct timeval tv;
>> -       double secs;
>> +       struct timespec tv, tv_res;
>> +       double secs, res;
>>
>>         if (ddflags & C_NOINFO)
>>                 return;
>>
>> -       (void)gettimeofday(&tv, NULL);
>> -       secs = tv.tv_sec + tv.tv_usec * 1e-6 - st.start;
>> -       if (secs < 1e-6)
>> -               secs = 1e-6;
>
>
> Bogus fixup.  secs had microseconds granularity except possibly for
> for tiny roundoff errors.  I think the roundoff errors cannot give
> a result near but not identical to 1e-6 or 0 unless tv_sec has
> a garbage value.  The difference can be 0 in the unlikely event
> that the run takes less than 1 microsecond, or negative when the
> time goes backwards.  Fixing up 0 to 1e-6 is reasonable.  Fixing up
> negative differences to 1e-6 avoids division by 0 but gives garbage
> results.  I think this was only intended to fix up 0, and fixing up
> negative differences is accidental.
>
>
>> +       if (clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv))
>> +               err(EX_OSERR, "clock_gettime");
>> +       if (clock_getres(CLOCK_MONOTONIC_PRECISE, &tv_res))
>> +               err(EX_OSERR, "clock_getres");
>
>
> clock_getres() is almost useless, and is useless here.  It is broken
> as designed, since the precision may be less than 1 nanosecond but
> 1 nanosecond is the smallest positive representable value, but that
> is not a problem here since clock_gettime() also can't distinguish
> differences smaller than 1 nanosecond.

Since it's reporting the clock resolution and not precision, and since
clock_gettime() only reports with 1ns resolution, I don't think it's a
problem for clock_getres to report with 1ns resolution too.

>
>
>> +       secs = tv.tv_sec + tv.tv_nsec * 1.0e-9 - st.start;
>> +       res = tv_res.tv_sec + tv_res.tv_nsec * 1.0e-9;
>
>
> IIRC, POSIX is confused about resolution vs precision and clock_getres()
> actually returns the resolution, so this variable has the correct name.
> I quote "precision" in the following to indicate this confusion.
>
>
>> +       if (secs < res)
>> +               secs = res;
>
>
> The fixup is now only reachable in 3 cases that can't happen:
> - when the monotonic time goes backwards due to a kernel bug
> - when the monotonic time doesn't increase, so that the difference is 0.
>   Oops, this can happen for timecounters with very low "precision".
>   You don't need to know the "precision" to check for this.

On my Xeon E5504 systems, I can see adjacent calls to clock_gettime
return equal values when using one of the _FAST clocks.  It can't be
proven that this case will never happen with any other clock either,
so the program needs to handle it.

> - when the monotonic time does increase, but by an amount smaller than
>   the "precision".  This indicates that the "precision" is wrong.
>
> In the 3rd case, the actual difference may as well be used.  It is
> probably slightly wrong at worst (say 999 nanoseconds when the
> "precision" is 1000 nanoseconds).  The timing is going to be very
> inaccurate for short runs, so another 10% inaccuracy doesn't matter.
> At worst, the difference might be 1 nanosecond when it should be
> hundreds of nanoseconds.  (I don't see how this can happen without
> a kernel bug.  Just reading the time can take thousands of nanoseconds.)
> The result will be garbage, but it won't overflow.
>
> In the 1st case, we don't know the correct fixup, and a tiny value
> from clock_getprec() is not as good as anything, except to make the
> garbage in the result more obvious.
>
> In the second case, fixing up to the "precision" may give a large
> estimate.  The fixup might as well be to a nominal value like 1
> nanosecond or 1 second.  CLOCK_MONOTONIC can't have a very low
> precision, and the timing for runs that don't take as long as a
> large multiple of the precision is inaccurate.  We could also
> report the result as <indeterminate> in this case.

The second case is the one I'm most concerned about.  Assuming that
the precision is correct, clock_getres() seems like the best value for
the fixup.  Anything less than the reported precision would be
unnecessarily small and give unnecessarily inaccurate results.
Anything greater would make an implicit and unportable assumption
about the speed of the hardware.  Do you really think it's a problem
to fixup to clock_getres() ?

>
>
>>         (void)fprintf(stderr,
>>             "%ju+%ju records in\n%ju+%ju records out\n",
>>             st.in_full, st.in_part, st.out_full, st.out_part);
>> @@ -77,7 +83,7 @@ summary(void)
>>                      st.trunc, (st.trunc == 1) ? "block" : "blocks");
>>         if (!(ddflags & C_NOXFER)) {
>>                 (void)fprintf(stderr,
>> -                   "%ju bytes transferred in %.6f secs (%.0f
>> bytes/sec)\n",
>> +                   "%ju bytes transferred in %.9f secs (%.0f
>> bytes/sec)\n",
>
>
> nanoseconds resolution is excessive here, and changes the output format.
> The only use of it is to debug cases where the output is garbage due
> to the interval being about 1 nanosecond.  Printing nanoseconds resolution
> is also inconsistent with the fussy "precision" adjustment above.

The higher resolution printf doesn't conflict with the resolution
adjustment above.  Freefall actually reports 1ns resolution.  But I
can buy that it's not useful to the user.  Would you like me to change
it back to %.6 ?

>
>
>>                     st.bytes, secs, st.bytes / secs);
>>         }
>>         need_summary = 0;
>
>
> In practice, we can depend on at least microseconds "precision" since
> even gettimeofday() gave that as a precision and gettimeofday() was fixed
> on at least x86 20-25 years ago so that it had almost the same "precision"
> (resolution) as its granularity.
>
> In ping(8), we still just use microseconds resolution and even
> non-monotonic gettimeofday().  It only had milliseconds resolution in
> BSD 20-25 years ago and in Linux much more recently.  dg fixed it to
> use the microseconds resolution that became available when
> gettimeofday() was fixed.  clock_gettime() didn't exist then.  I tried
> converting it to use nanoseconds resolution and wasn't happy with the
> results.  Formats like %.9f are needed to see the full resolution,
> but networking latency is still too large so reasolution of more tha
> 1 microsecond is rarely useful, and the extra digits from %.9f format
> just make the value harder to read.

Even if nanosecond resolution isn't useful, monotonicity is.  Nobody
should be using a nonmonotonic clock just to measure durations.  I
started an audit of all of FreeBSD to look for other programs that use
gettimeofday to measure durations.  I haven't finished, but I've
already found a lot, including xz, ping, hastd, fetch, systat, powerd,
and others.  I don't have time to fix them, though.  Would you be
interested, or do you know anyone else who would?

-Alan

>
> Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAOtMX2h_%2B1G18Nv5JvDE0H7_TZ96p81JotOwhq1Jm-dOOeahPw>