Date: Mon, 11 Mar 2002 16:16:48 -0500 (EST)
From: Robert Watson
To: Harti Brandt
Cc: Garance A Drosihn, Poul-Henning Kamp, arch@FreeBSD.ORG
Subject: Re: Increasing the size of dev_t and ino_t
In-Reply-To: <20020311172142.K1371-100000@beagle.fokus.gmd.de>

On Mon, 11 Mar 2002, Harti Brandt wrote:

> I suppose the AFS volumes themself have some kind of unique identifier,
> otherwise there would be no way to tell that you are mounting the same
> volume in different places, there wouldn't even be the notion of 'the
> same volume'. Given that, it should be simple to map between those AFS
> volume identifiers and st_dev's. How this mapping is done depends on the
> kind of the volume id. If you have 33,000 mounts in you system, adding a
> uint32_t to each of these mounts will not be your main problem.

AFS, Coda, and various other "global scale" filesystems rely on a much
larger unique identifier space than the traditional 64-bit (dev_t, ino_t)
pair. Coda, for example, uses a 96-bit "Vice ID" which is per-realm. That
ID is partitioned into volume IDs and individual file IDs, which are
similar to "filesystems" and "inode numbers".

However, the problem occurs because our mount system doesn't scale to the
level required for Coda or AFS to function. As such, Coda and AFS have
their own light-weight mounting scheme inside the filesystem
implementation, so it appears to the kernel as though it's a single huge
filesystem, rather than a composite of many filesystems. In AFS, these
mountpoints are stored in symlinks identifying the realm and volume name
of the target.

The complicating factor comes when you try to map the 96-bit ID (plus
realm) into the 32-bit inode number. FreeBSD runs fine, but some
applications that assume the POSIX rule (equal device number and inode
number means the same file) behave poorly. For example, GNU tar may find
collisions and assume files are hard links when they are not. Linux, on
the other hand, uses the inode numbers within the kernel, and may panic
if there is a collision.

The "uniqueness" aspect for these numbers is a serious scaling problem:
global filesystems can and will name trillions of file system objects.
Squeezing them into a single 32-bit number, or even a pair, simply doesn't
work. Moving to a 64-bit inode number in FreeBSD would reduce the chances
of a collision dramatically, and probably enough that the risk would
become acceptable.

A preferred solution approximates the POSIX conventions but allows for a
special call into the filesystem to check collision cases. I actually
implemented this on FreeBSD at one point. The filesystem implementation
attempts to maintain a unique inode number by hashing the Vice ID. For
applications maintaining tables, such as tar, a collision can be resolved
by calling samefile() or fsamefile(), which compare the vnode pointers, or
call into the individual filesystem to inquire using a VOP. In this
manner, the efficiency gains are largely still present, except that
identical values are a hint as opposed to a guarantee.
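
To make the "hint, not guarantee" idea concrete, here is a rough userland
sketch in C. It is not the code I implemented and not Coda's actual
layout: the struct vice_id field names, the choice of FNV-1a as the hash,
and the helper names vice_to_ino32(), vice_same_object() and
is_hard_link() are all assumptions made up for illustration.
vice_same_object() merely stands in for the samefile()/VOP query
described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 96-bit identifier; the three field names are assumptions. */
struct vice_id {
    uint32_t volume;   /* plays the role of the "filesystem" */
    uint32_t vnode;    /* plays the role of the "inode number" */
    uint32_t unique;   /* distinguishes reuse of the same slot */
};

/* Mix one 32-bit field into an FNV-1a hash, one byte at a time. */
static uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;            /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Squeeze the 96-bit ID into a 32-bit inode number. Collisions are possible. */
uint32_t vice_to_ino32(const struct vice_id *id)
{
    uint32_t h = 2166136261u;      /* FNV-1a 32-bit offset basis */
    h = fnv1a_u32(h, id->volume);
    h = fnv1a_u32(h, id->vnode);
    h = fnv1a_u32(h, id->unique);
    return h;
}

/* Authoritative comparison; stands in for asking the filesystem itself. */
bool vice_same_object(const struct vice_id *a, const struct vice_id *b)
{
    return a->volume == b->volume &&
           a->vnode == b->vnode &&
           a->unique == b->unique;
}

/*
 * What a tar-like application would do: treat equal inode numbers as a
 * cheap hint, and only then pay for the authoritative check.
 */
bool is_hard_link(uint32_t ino_a, const struct vice_id *a,
                  uint32_t ino_b, const struct vice_id *b)
{
    if (ino_a != ino_b)
        return false;              /* different numbers: skip the expensive query */
    return vice_same_object(a, b); /* same number: confirm before trusting it */
}
```

The point of the split is that the common case (different inode numbers)
never leaves the application's table lookup, and the extra call is only
made when the numbers happen to match.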

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services