Date: Thu, 18 Dec 2014 09:03:37 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Erik Cederstrand <erik@cederstrand.dk>
Cc: arch@freebsd.org
Subject: Re: Change default VFS timestamp precision?
Message-ID: <20141218075519.J1025@besplex.bde.org>
In-Reply-To: <1087E8D0-4B2F-4941-BDCE-3D50264D7FBB@cederstrand.dk>
References: <201412161348.41219.jhb@freebsd.org> <1087E8D0-4B2F-4941-BDCE-3D50264D7FBB@cederstrand.dk>

On Wed, 17 Dec 2014, Erik Cederstrand wrote:

>> On 16/12/2014 at 19.48, John Baldwin <jhb@freebsd.org> wrote:
>>
>> We still ship with vfs.timestamp_precision=0 by default meaning that VFS
>> timestamps have a granularity of one second. It is not unusual on modern
>> systems for multiple updates to a file or directory to occur within a single
>> second (and thus share the same effective timestamp). This can break things
>> that depend on timestamps to know when something has changed or is stale (such
>> as make(1) or NFS clients).
>
> Mistaking timestamps for uniqueness is really a design error of the
> consumer. Changing granularity to milliseconds will diminish the problem,
> but also create harder-to-debug problems when multiple updates do happen in
> the same millisecond. Is there no other way than timestamps to find out if
> a file has changed (apart from md5 which is too expensive)?

Milliseconds granularity is not even supported (the sysctl description
just misspells microseconds as ms).

For unique file timestamps, the timestamps could be made unique by
fudging the nanoseconds part. Increment by 1 on each access if the
current time didn't change since the last access. Since nanoseconds
granularity is not supported for writing, actually increment by 1000
nanoseconds on each access if the current time changed by less than 1000
nanoseconds, and also round to a multiple of 1000. In the unlikely event
that the accesses are more frequent than once per microsecond, then let
the timestamp clock run a bit ahead of real time, but not too much.

Oops. I have tests that the timestamp clock runs coherently with real
time. These fail now unless the timestamp clock is the same as the real
time clock (consistently rounded down to the granularity of the clock
used in the tests, which is the real time clock rounded down to seconds).
I.e., they fail for vfs.timestamp_precision = 0 or 1, since these give a
clock that lags the real time clock incoherently with the rounding.
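
The fudging scheme described above (round to a multiple of 1000
nanoseconds, and step forward by 1000 nanoseconds whenever the clock has
not advanced past the last stamp) might be sketched in userland C roughly
as follows. This is only an illustration under stated assumptions: the
names unique_timestamp, last_stamp_ns and FUDGE_NSEC are hypothetical and
are not existing kernel interfaces, times are assumed to be at or after
the epoch, nothing here is locked, and the "but not too much" bound on
running ahead of real time is not implemented.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL
#define FUDGE_NSEC   1000LL     /* one microsecond, the writable granularity */

/* Hypothetical state: the last stamp handed out, in ns since the epoch. */
static int64_t last_stamp_ns;

/*
 * Return a timestamp that is a multiple of 1000 ns and strictly greater
 * than every timestamp returned before.  'now' is the real time clock.
 * Not thread-safe; a real version would need serialization.
 */
static struct timespec
unique_timestamp(const struct timespec *now)
{
    struct timespec ts;
    int64_t ns;

    assert(now->tv_nsec >= 0 && now->tv_nsec < NSEC_PER_SEC);

    ns = (int64_t)now->tv_sec * NSEC_PER_SEC + now->tv_nsec;
    ns -= ns % FUDGE_NSEC;              /* round down to a microsecond */

    /* Clock did not advance past the last stamp: run slightly ahead. */
    if (ns <= last_stamp_ns)
        ns = last_stamp_ns + FUDGE_NSEC;

    last_stamp_ns = ns;
    ts.tv_sec = (time_t)(ns / NSEC_PER_SEC);
    ts.tv_nsec = (long)(ns % NSEC_PER_SEC);
    return (ts);
}
```

Two calls with the same real time yield stamps 1000 nanoseconds apart,
which is exactly the drift away from real time that the tests mentioned
above would notice.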

Fudging the timestamp clock would break the tests in other ways,
especially if "a bit ahead of real time" means a few hundred
microseconds. So perhaps don't increment by a full 1000 nanoseconds or
round to microseconds if that would advance the clock too fast. This
fails when the accesses arrive faster than once per nanosecond, but that
will remain physically impossible for some time (but see below about
multiple CPUs and locking).

I think Solaris does something like this. Perhaps there are traces of it
in zfs.

This would be useless for BSD make since it only uses seconds resolution.

Note that this is also almost useless if the file system does any caching
of timestamp updates, like most file systems do. Timestamp updates
normally occur when a program stat()s the file. So after write to file 1;
write to file 2; stat() of file 2; stat() of file 1, all the
monotonically increasing timestamp clock does is ensure that the stat()s
give a perfectly wrong order. The only way to ensure getting the right
order is to do the stat()s in the same order as the writes, but if the
right order is known then it is unnecessary to do the stat()s to
determine the order.

If this were not useless, then it might be worth implementing. Then it
would be difficult to make it efficient:

- the simplest implementation is to axe all the caching and use
  nanoseconds resolution. This mostly gives strictly monotonically
  increasing timestamps, but not always. Failing cases include:
  - someone steps the clock
  - the timecounter precision is much lower than nanoseconds
  - the system has multiple CPUs, and timestamps are made concurrently.
    The same timestamp may be made on different CPUs (most likely on
    different filesystems -- otherwise locking would serialize things).
    Then the write times probably aren't exactly the same, but you can't
    tell.
    Even when the timestamps differ by 1 nanosecond, just making them
    has a fuzziness of more than 1 nanosecond so you still can't
    determine the exact order of the writes.
- otherwise, the cached timestamps must be serialized across CPUs and
  file systems. E.g., in ffs where it currently marks the mtime for
  update using `ip->i_flag |= IN_UPDATE' (locked by the vnode interlock),
  it would also have to increment a generation counter or something. The
  lock for this would have to be across all CPUs and file systems. It
  would be a fine source of lock contention. Timecounter update code has
  complications to avoid such contention. Then when the update occurs,
  somehow combine the generation counter with real-time timestamps. I
  think it works to do something like 'timestamp.tv_sec = old_time;
  timestamp.tv_nsec = generation_count', where old_time is the time for
  the current set of generations (start a new set of generations every
  second after updating all timestamps for the old generation).
- combine these methods: replace marking for update with updating as in
  the first method, and serialize this using fine lock contention as in
  the second method. The lock gives an ordering on the update operations,
  and while it is held it is easy to fudge the timestamps to represent
  this ordering.

Bruce
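
The generation-counter method in the second item above might look roughly
like this in userland C. It is a sketch under assumptions, not ffs code:
mark_for_update, stamp_from_generation, start_new_generation_set and the
gen_* variables are hypothetical names, a pthread mutex stands in for the
lock that would have to cover all CPUs and file systems, and every
pending generation is assumed to be applied before a new set starts.

```c
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical global state shared by all CPUs and file systems. */
static pthread_mutex_t gen_lock = PTHREAD_MUTEX_INITIALIZER;
static time_t gen_epoch;    /* "old_time": second owning the current set */
static long gen_count;      /* generations issued within gen_epoch */

/* Where ffs would set IN_UPDATE: record the global order of the change. */
static long
mark_for_update(void)
{
    long gen;

    pthread_mutex_lock(&gen_lock);  /* the contended global lock */
    gen = gen_count++;
    pthread_mutex_unlock(&gen_lock);
    return (gen);
}

/* When the deferred update is applied: build the stamp from the order. */
static struct timespec
stamp_from_generation(long gen)
{
    struct timespec ts;

    assert(gen >= 0 && gen < 1000000000L);  /* must fit in tv_nsec */
    pthread_mutex_lock(&gen_lock);
    ts.tv_sec = gen_epoch;
    pthread_mutex_unlock(&gen_lock);
    ts.tv_nsec = gen;
    return (ts);
}

/* Once per second, after all stamps of the old set have been applied. */
static void
start_new_generation_set(time_t now)
{
    pthread_mutex_lock(&gen_lock);
    gen_epoch = now;
    gen_count = 0;
    pthread_mutex_unlock(&gen_lock);
}
```

With this, marking file 1 and then file 2, but applying the stamps in the
opposite order (as in the stat() example earlier), still gives file 1 the
smaller tv_nsec, at the price of taking a global lock on every mark.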