From owner-freebsd-arch  Mon Mar 11 13:17:30 2002
Date: Mon, 11 Mar 2002 16:16:48 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.ORG>
X-Sender: robert@fledge.watson.org
To: Harti Brandt <brandt@fokus.gmd.de>
Cc: Garance A Drosihn <drosih@rpi.edu>,
	Poul-Henning Kamp <phk@critter.freebsd.dk>, arch@FreeBSD.ORG
Subject: Re: Increasing the size of dev_t and ino_t
In-Reply-To: <20020311172142.K1371-100000@beagle.fokus.gmd.de>
Message-ID: <Pine.NEB.3.96L.1020311160835.46602A-100000@fledge.watson.org>


On Mon, 11 Mar 2002, Harti Brandt wrote:

> I suppose the AFS volumes themselves have some kind of unique identifier,
> otherwise there would be no way to tell that you are mounting the same
> volume in different places, there wouldn't even be the notion of 'the
> same volume'. Given that, it should be simple to map between those AFS
> volume identifiers and st_dev's. How this mapping is done depends on the
> kind of the volume id. If you have 33,000 mounts in your system, adding a
> uint32_t to each of these mounts will not be your main problem.

AFS, Coda, and various other "global scale" filesystems rely on a much
larger unique identifier space than the traditional 64-bit (dev_t, ino_t)
pair.  Coda, for example, uses a 96-bit "Vice ID" which is per-realm.
That is partitioned into volume IDs and individual file IDs, which are
similar to "filesystems" and "inode numbers".  The problem arises
because our mount system doesn't scale to the level required for Coda or
AFS to function.  As such, Coda and AFS have their own light-weight
mounting scheme inside the filesystem implementation, so it appears to the
kernel as though it's a single huge filesystem, rather than a composite of
many filesystems.  In AFS, these mountpoints are stored in symlinks
identifying the realm and volume name of the target.

The complicating factor comes when you try to map the 96-bit ID (plus
realm) into the 32-bit inode number.  FreeBSD runs fine, but some
applications that assume the POSIX rule (equal device and inode numbers
mean the same file) behave poorly.  For example, GNU tar may find
collisions and assume the files are hard links when they are not.  Linux,
on the other hand, uses the inode numbers within the kernel, and may
panic if there is a collision.

The "uniqueness" aspect for these numbers is a serious scaling problem: 
global filesystems can and will name trillions of file system objects. 
Squeezing them into a single 32-bit number, or even a pair, simply doesn't
work.  Moving to a 64-bit inode number in FreeBSD would reduce the chances
of a collision dramatically, and probably enough that the risk would
become acceptable.
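
To put rough numbers on that, here is a back-of-the-envelope sketch.  It
assumes (my assumption, not a property of any particular filesystem) that
inode numbers are produced by a hash that spreads objects uniformly over
the identifier space, in which case the expected number of colliding
pairs among n objects is about n(n-1)/2 divided by the size of the space.
The function name is made up for illustration.

```c
#include <assert.h>

/*
 * Expected number of colliding pairs when n_objects identifiers are
 * drawn uniformly at random from a space of 2^id_bits values:
 *     n * (n - 1) / (2 * 2^id_bits)
 * Uniform hashing is an assumption; a real hash may do worse.
 */
static double
expected_colliding_pairs(double n_objects, unsigned id_bits)
{
	double space = 1.0;
	unsigned i;

	assert(id_bits <= 64);

	/* Build 2^id_bits as a double to avoid integer overflow at 64. */
	for (i = 0; i < id_bits; i++)
		space *= 2.0;

	return (n_objects * (n_objects - 1.0) / (2.0 * space));
}
```

Under that assumption, a client that touches a million objects should
expect on the order of a hundred colliding pairs with 32-bit inode
numbers, and far less than one with 64-bit inode numbers.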

A preferred solution approximates the POSIX conventions but allows for a
special call into the filesystem to check collision cases.  I actually
implemented this on FreeBSD at one point.  The filesystem implementation
attempts to maintain a unique inode number by hashing the vice ID.  For
applications maintaining tables, such as tar, a collision can be resolved
by calling samefile() or fsamefile(), which compare the vnode pointers, or
call into the individual filesystem to inquire using a VOP.  In this
manner, the efficiency gains are largely preserved, except that
identical values are a hint as opposed to a guarantee.
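
A minimal userland sketch of the hint-then-confirm idea follows.  It is
not the code I wrote for FreeBSD: the struct layout (three 32-bit words),
the deliberately weak XOR fold, and every function name here are
illustrative assumptions, and vice_id_equal() merely stands in for the
authoritative check that a samefile()-style call would ask the
filesystem to make.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of a 96-bit file identifier: three 32-bit words. */
struct vice_id {
	uint32_t volume;
	uint32_t vnode;
	uint32_t unique;
};

/*
 * Fold the 96-bit ID into a 32-bit inode number.  XOR is used only
 * because it makes collisions easy to demonstrate; it is not a
 * recommended hash.
 */
static uint32_t
vice_id_to_ino(const struct vice_id *id)
{
	return (id->volume ^ id->vnode ^ id->unique);
}

/* Stand-in for the authoritative, filesystem-level comparison. */
static bool
vice_id_equal(const struct vice_id *a, const struct vice_id *b)
{
	return (a->volume == b->volume && a->vnode == b->vnode &&
	    a->unique == b->unique);
}

/*
 * Treat equal inode numbers as a hint: different numbers prove the
 * objects differ, equal numbers only trigger the expensive check.
 */
static bool
is_same_object(const struct vice_id *a, const struct vice_id *b)
{
	assert(a && b);

	if (vice_id_to_ino(a) != vice_id_to_ino(b))
		return (false);		/* cheap path, no filesystem call */
	return (vice_id_equal(a, b));	/* hint matched, so confirm */
}
```

With this shape, the IDs {1, 2, 3} and {3, 2, 1} fold to the same inode
number but are still reported as different objects, which is the case
that trips up an application trusting the inode number alone.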

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

