Date:      Thu, 18 Dec 2014 09:03:37 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Erik Cederstrand <erik@cederstrand.dk>
Cc:        arch@freebsd.org
Subject:   Re: Change default VFS timestamp precision?
Message-ID:  <20141218075519.J1025@besplex.bde.org>
In-Reply-To: <1087E8D0-4B2F-4941-BDCE-3D50264D7FBB@cederstrand.dk>
References:  <201412161348.41219.jhb@freebsd.org> <1087E8D0-4B2F-4941-BDCE-3D50264D7FBB@cederstrand.dk>

On Wed, 17 Dec 2014, Erik Cederstrand wrote:

>> Den 16/12/2014 kl. 19.48 skrev John Baldwin <jhb@freebsd.org>:
>>
>> We still ship with vfs.timestamp_precision=0 by default meaning that VFS
>> timestamps have a granularity of one second.  It is not unusual on modern
>> systems for multiple updates to a file or directory to occur within a single
>> second (and thus share the same effective timestamp).  This can break things
>> that depend on timestamps to know when something has changed or is stale (such
>> as make(1) or NFS clients).
>
> Mistaking timestamps for uniqueness is really a design error of the consumer. Changing granularity to milliseconds will diminish the problem, but also create harder-to-debug problems when multiple updates do happen in the same millisecond. Is there no other way than timestamps to find out if a file has changed (apart from md5 which is too expensive)?

Millisecond granularity is not even supported (the sysctl description
just misspells microseconds as ms).

File timestamps could be made unique by fudging the nanoseconds part:
increment it by 1 on each access where the current time didn't change
since the last access.  Since nanosecond granularity is not supported
for writing, actually increment by 1000 nanoseconds on each access if
the current time changed by less than 1000 nanoseconds, and also round
to a multiple of 1000.  In the unlikely event that accesses are more
frequent than once per microsecond, let the timestamp clock run a bit
ahead of real time, but not too much.
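A minimal userland sketch of that fudging scheme (the names here are
invented for illustration; this is not FreeBSD's actual vfs_timestamp()
code, and a real version would need locking):

```c
#include <time.h>

/* Last timestamp handed out; a kernel version would need a lock. */
static struct timespec last_stamp;

static void
unique_timestamp(struct timespec *tsp, const struct timespec *now)
{
	*tsp = *now;
	tsp->tv_nsec -= tsp->tv_nsec % 1000;	/* round down to 1 us */

	/*
	 * If real time advanced by less than 1 us since the last stamp,
	 * run a bit ahead of real time instead of repeating a stamp.
	 */
	if (tsp->tv_sec < last_stamp.tv_sec ||
	    (tsp->tv_sec == last_stamp.tv_sec &&
	     tsp->tv_nsec <= last_stamp.tv_nsec)) {
		*tsp = last_stamp;
		tsp->tv_nsec += 1000;
		if (tsp->tv_nsec >= 1000000000L) {
			tsp->tv_nsec = 0;
			tsp->tv_sec++;
		}
	}
	last_stamp = *tsp;
}
```

Two accesses in the same microsecond then get stamps 1000 ns apart, with
the clock drifting ahead of real time only while accesses arrive faster
than once per microsecond.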

Oops.  I have tests that check that the timestamp clock runs coherently
with real time.  These fail now unless the timestamp clock is the same
as the real time clock (consistently rounded down to the granularity of
the clock used in the tests, which is the real time clock rounded down
to seconds).  I.e., they fail for vfs.timestamp_precision = 0 or 1, since
these give a clock that lags the real time clock incoherently with
the rounding.  Fudging the timestamp clock would break the tests in other
ways, especially if "a bit ahead of real time" means a few hundred
microseconds.  So perhaps don't increment by a full 1000 nanoseconds or
round to microseconds if that would advance the clock too fast.  This
fails when the accesses arrive faster than once per nanosecond, but
that will remain physically impossible for some time (but see below
about multiple CPUs and locking).

I think Solaris does something like this.  Perhaps there are traces of it
in zfs.

This would be useless for BSD make since it only uses seconds resolution.

Note that this is also almost useless if the file system does any caching
of timestamp updates, like most file systems do.  Timestamp updates normally
occur when a program stat()s the file.  So after write to file 1; write to
file 2; stat() of file 2; stat() of file 1, all the monotonically increasing
timestamp clock does is ensure that the stat()s give a perfectly wrong order.
The only way to ensure getting the right order is to do the stat()s in the
same order as the writes, but if the right order is known then it is
unnecessary to do the stat()s to determine the order.
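The caching problem above can be shown with a toy model (all names
invented here): a write only marks the file for update, and the
timestamp is materialized from a strictly increasing clock only when
stat() is finally called, so stat() order overrides write order.

```c
static long toy_clock;		/* strictly monotonic "timestamp clock" */

struct toy_file {
	int	pending;	/* IN_UPDATE-style deferred mark */
	long	mtime;
};

static void
toy_write(struct toy_file *fp)
{
	fp->pending = 1;	/* cached: no timestamp taken yet */
}

static void
toy_stat(struct toy_file *fp)
{
	if (fp->pending) {
		/* Stamped at stat() time, not at write() time. */
		fp->mtime = ++toy_clock;
		fp->pending = 0;
	}
}
```

Writing file 1 then file 2, but stat()ing file 2 first, gives file 2 the
smaller mtime: monotonicity of the clock guarantees the wrong order.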

If this were not useless, then it might be worth implementing.  Then it
would be difficult to make it efficient:
- the simplest implementation is to axe all the caching and use nanoseconds
   resolution.  This mostly gives strictly monotonically increasing timestamps,
   but not always.  Failing cases include:
   - someone steps the clock
   - the timecounter precision is much lower than nanoseconds
   - the system has multiple CPUs, and timestamps are made concurrently.
     The same timestamp may be made on different CPUs (most likely on
     different filesystems -- otherwise locking would serialize things).
     Then the write times probably aren't exactly the same, but you can't
     tell.  Even when the timestamps differ by 1 nanosecond, just making
     them has a fuzziness of more than 1 nanosecond so you still can't
     determine the exact order of the writes.
- otherwise, the cached timestamps must be serialized across CPUs and file
   systems.  E.g., in ffs where it currently marks the mtime for update
   using `ip->i_flag |= IN_UPDATE' (locked by the vnode interlock), it
   would also have to increment a generation counter or something.  The
   lock for this would have to be across all CPUs and file systems.  It
   would be a fine source of lock contention.  Timecounter update code
   has complications to avoid such contention.  Then when the update occurs,
   somehow combine the generation counter with real-time timestamps.  I
   think it works to do something like 'timestamp.tv_sec = old_time;
   timestamp.tv_nsec = generation_count', where old_time is the time for
   the current set of generations (start a new set of generations every
   second after updating all timestamps for the old generation).
- combine these methods: replace marking for update with updating as in
   the first method, and serialize this using fine lock contention as
   in the second method.  The lock gives an ordering on the update
   operations, and while it is held it is easy to fudge the timestamps
   to represent this ordering.
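The generation-counter idea in the second method could look roughly like
this (invented names again; the real ffs code only sets IN_UPDATE under
the vnode interlock, and the single global lock implied here is exactly
the contention source mentioned above):

```c
#include <time.h>

static time_t	gen_epoch;	/* start of the current set of generations */
static long	gen_count;	/* orderings handed out in this set */

/*
 * Called where ffs would do `ip->i_flag |= IN_UPDATE'; would need a
 * lock shared across all CPUs and file systems.  Returns the ordering
 * number for this update.
 */
static long
mark_update(time_t now)
{
	if (now != gen_epoch) {		/* start a new set every second */
		gen_epoch = now;
		gen_count = 0;
	}
	return (gen_count++);
}

/*
 * Called when the cached update is finally written out: encode the
 * ordering in tv_nsec against the second the generation set started.
 */
static void
make_timestamp(struct timespec *tsp, time_t epoch, long generation)
{
	tsp->tv_sec = epoch;
	tsp->tv_nsec = generation;	/* an ordering, not real nanoseconds */
}
```

This overflows tv_nsec after 10^9 updates in one second, and the
tv_nsec values are orderings rather than times, so comparisons work but
arithmetic on the timestamps does not.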

Bruce


