Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 1 Sep 2005 23:47:15 +1000 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Dmitry Pryanishnikov <dmitry@atlantis.dp.ua>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: kern/85503: panic: wrong dirclust using msdosfs in RELENG_6
Message-ID:  <20050901201602.X99455@delplex.bde.org>
In-Reply-To: <20050901113819.F95708@atlantis.atlantis.dp.ua>
References:  <20050901113819.F95708@atlantis.atlantis.dp.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, 1 Sep 2005, Dmitry Pryanishnikov wrote:

> During the hunting the bug kern/85503 (panic: wrong dirclust in msdosfs)
> I've tried to think about the solution, but it seems to be 
> architecture-related. The problem is: msdosfs uses pseudo-inodes (that is,
> the offset from the start of the partition to the start of directory entry
> in bytes) which must therefore have off_t bitness (at least 64 bits). I've
> found the primary error (lack of casts leaded to 32-bit result), but then
> we should transfer this 64-bit "inode" number to vfs_hash_get(). Oops,
> it also limited to u_int (32 bits on i386). Finally, I see that the
> primary shortcoming here: in sys/vnode.h we have
>
>        /*
>         * vfs_hash:  (mount + inode) -> vnode hash.
>         */
>        LIST_ENTRY(vnode)       v_hashlist;
>        u_int                   v_hash;
>
> I think it's feasible and useful to upgrade type of v_hash to at least off_t.
> In our days of large media we will face filesystems with more than 4 billions
> files (and thus at least 64-bit inode numbers) quite often. Am I right?

This is not needed yet.

Filesystems with more than 4G files are not supported yet, since ino_t
is 32 bits and is used in critical APIs (struct stat...).  Also,
d_fileno in struct dirent is 32 bits but not identical to ino_t (POSIX
with XSI extensions requires it to be identical).  Even when ino_t is
enlarged, backwards- and sideways-compat syscalls must do something
with unrepresentable inode numbers.  It's best for filesystems to map
large inode numbers to small ones so that unrepresentable inodes are
never seen by upper layers.

So all current file systems need to generate unique 32-bit inode
numbers.  This may be difficult, but once it is done I think the inode
number can be used as a key to pass to the hash functions.  (The key
is bogusly named "hash" in the hash function args and in v_hash above.)

For msdosfs, the inode number is essentially the byte offset divided by
the size of a directory entry.  The size is 32, so this breaks at a byte
offset of 128G instead of 4G.  Details:
- the inode number of the root dir is 1
- the inode number of a non-root dir is the (byte offset of the dir as a
   file) / 32
- the inode number of a non-dir is the (byte offset of the dir entry for
   the file) / 32.

For stat() and possibly for readdir() and hashget(), I think it would
work to use the cluster offset of the file for the fake inode number.
The cluster size is always much larger than 32, so this would work for
much larger file systems.

cd9660 has a related bug suite.  Its inode numbers are byte offsets
of directory entries, so they break at 4G as in msdosfs.  In addition,
cd9660 needs to support hard links, but a 1-1 mapping from directory
entries to hard links is perfectly wrong.  This bug was worked around
for stat() and readdir() in rev.1.77 of cd9660_vnops.c, by using the
block offset of the file for the fake inode number.  cd9660_ihashget()
remained broken for directories above 4G and for hard links (having
multiple active vnodes for the same file can't be good but might not
be too bad for a read-only file system).  However, rev.1.77 was
temporarily backed out in rev.1.98 because it broke cd9660_fhtovp()
and/or nfs readdirplus().  File handles in cd9660 contain an inode
number and cd9660_fhtovp() depends on this number being compatible
with the one set by readdir().  This part of the bug suite is smaller
in msdosfs -- there file handles are directory (cluster, offset) pairs
so enough info is available to derive whatever number vget() wants.
The log message for rev.1.77 says that cd9660 did things better in
FreeBSD-1 where it used directory (cluster, offset) pairs more.  It
seems to have never used these for file handles, so I now think that
it needs to use them in some places where it now abuses inodes to look
up directory entries, but that cd9660_fhtovp() is not one of these
places (or hard to fix).  cd9660_fhtovp() only needs to look up vnodes
and a unique inode number is enough for that; the (cluster, offset)
pair in msdosfs's file handles is just to match deget()'s interface.

Try always using the cluster offset of the file for the fake inode
number in msdosfs.

cd9660 is harder to fix since it has hard links and symlinks.  IIRC,
its "inode" numbers are needed only to look up directory entries for
symlinks.  Using the directory entry for the inode in cd9660_ihashget()
is bad for all types of links.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20050901201602.X99455>