Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 14 Apr 2009 20:08:51 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: changing semantics of the va_filerev (code review)
Message-ID:  <20090414180826.J53102@delplex.bde.org>
In-Reply-To: <Pine.GSO.4.63.0904131036310.27001@muncher.cs.uoguelph.ca>
References:  <Pine.GSO.4.63.0904121553050.3872@muncher.cs.uoguelph.ca> <20090413193936.A52183@delplex.bde.org> <Pine.GSO.4.63.0904131036310.27001@muncher.cs.uoguelph.ca>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 13 Apr 2009, Rick Macklem wrote:

> On Mon, 13 Apr 2009, Bruce Evans wrote:

>> va_gen/i_gen/di_gen/st_gen seems to be even more suitable for this
>> purpose, but it isn't actually a file generation number like its
>> comments say (it is normally set to a random value on file creation
>> then never changed) and it is exposed to userland (st_gen).
>> 
> i_gen is used by NFS to create T-stable (valid for a long time, including
> a long time after the file is removed) file handles. It is used by
> ffs_vptofh() to create the file handles for NFS that are recognized as
> representing removed files, even after an i-node gets reused such that
> the i-node number now represents another file.

Oops, I missed that since nfs's use of i_gen is indirect.  What does
nfs do for file systems that don't detect removed files, e.g., msdosfs.
vptofh and fhtovp routines seem to have too many differences.  E.g.,
file systems based on ffs return ESTALE for removed files, but
zfs_fhtovp() returns EINVAL.

I just noticed than the increment of i_gen was slightly broken for ffs
by a type mismatch in ffs2 (affects ffs1 too).  Originally, i_gen had
the same type as di_gen (int32_t).  Now i_gen has type int64_t but in
ffs1, di_gen of course still has type int32_t, and in ffs2, di_gen
still has type int32_t (apparently there was insufficient space to
expand it).  This makes the overflow check in ffs_alloc.c (++ip->i_gen
== 0) more broken than before.  Previously it only gave undefined
behaviour followed by a bogus check when overflow occurs for incrementing
from INT32_T_MAX.  Now it has no effect, since it takes 293 years of
incrementing at a rate of 1GHz to reach overflow at INT64_T_MAX.
Overflow now occurs on assignment to di_gen.

The result of this bug is almost the the same as removing the silly
part of the security code -- the re-randomization on overflow.  i_gen
may grow larger than UINT32_T_MAX, but usually refresh from the dinode
will keep it smaller.  When it starts near UINT32_T_MAX and grows
larger, the overflow on assignment and a subsequent refresh will make
it nearly 0.  Except, in 1 in every 2**32 cases, when the overflow makes
di_gen exactly 0, the subsequent refresh will randomize i_gen.

>> va_ctime should give what you want for all file systems, since it
>> should be increased whenever anything changes.  However, most file
> There are some places where IN_UPDATE gets set, but IN_CHANGE doesn't.

Are there?  This would be a bug.  I checked that ffs doesn't have this
bug.

> Since the Change attribute must change for every file modification, I
> feel safer incrementing it for both IN_UPDATE and IN_CHANGE. (It's 64bits,
> so it won't wrap around for a little while.)

It would be a large and obvious bug to modify the file data (IN_UPDATE)
without setting IN_CHANGE.

>> systems always set the nsec part to 0, so va_ctime doesn't track
>> all file changes.  This is a problem for things like make(1) too,
>> so if nsec timestamps aren't available or are take too long or are
>> not fine-grained enough, the nsec part should be abused as a generation
>> counter so that any change gives a strictly larger timestamp.  The
>> case where someone sets the clock backwards is broken but won't
>> happen often.
>> 
>> Many nonstandard file systems, e.g., msdosfs, have
>> no space for an on-disk ctime, so they fake va_ctime using an on-disk
>> mtime.  Since such file systems don't have many attributes, only a
>> few more cases are broken.
>> 
> Yep, that's why ctime/mtime aren't sufficient.
> If a read/write file system doesn't have support for it, all you
> can do is fake it and hope the client works ok. I suspect the Linux

They need to be fixed or faked well enough for make(1) too.

When the dinode has no space to spare, something can be done by keeping
state in the inode or vnode.  This won't work across reboots of course
(except by hashing a reboot counter into the generation counts or
timestamps) but might be enough for all short-term uses.  I'm not sure
how much is safe here.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20090414180826.J53102>