From owner-freebsd-arch@FreeBSD.ORG  Sat Sep  3 23:43:50 2005
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: freebsd-arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.org
Date: Sun, 4 Sep 2005 09:43:31 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@epsplex.bde.org
To: Poul-Henning Kamp <phk@haven.freebsd.dk>
In-Reply-To: <44604.1125782260@phk.freebsd.dk>
Message-ID: <20050904090740.L2820@epsplex.bde.org>
References: <44604.1125782260@phk.freebsd.dk>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Dmitry Pryanishnikov <dmitry@atlantis.dp.ua>, freebsd-arch@FreeBSD.org
Subject: Re: kern/85503: panic: wrong dirclust using msdosfs in RELENG_6 

On Sat, 3 Sep 2005, Poul-Henning Kamp wrote:

> In message <20050904065305.T2366@epsplex.bde.org>, Bruce Evans writes:
>> On Sat, 3 Sep 2005, Dmitry Pryanishnikov wrote:
>>
>>> On Sat, 3 Sep 2005, Dmitry Pryanishnikov wrote:
>>>>> I think I said that the inode number in msdosfs should be the cluster
>>>>> number of the first cluster in the file.  This would be broken by

>>> Oops, how about empty files? They don't have any allocated clusters,
>>> do they? So, alas, we can't go this route.
>>
>> Urk.  It also doesn't work for cd9660.  So the block number can be
>> used at most as a hint for getting a unique fake inode number, and
>> msdosfs file systems don't have to be much larger than 128GB to have
>> >= 4G files -- a 128+GB file system can consist of 128GB of directories
>> all containing empty files :-).
>
> Uhm, did none of you guys see my email about how this must be
> done correctly, the same way NFS does it correctly?

Yes.  I even mentioned it in my reply.

> To repeat:
>
> NFS has the same sort of problem: it has 16- or 32-*byte* filehandles
> that need to hash to 32-bit "inode numbers".
>
> If you look at the vfs_hash_get calls in sys/nfsclient you can see that
> it calculates a 32-bit hash but then provides an "nfs_vncmpf" function
> to do the actual comparison to resolve hash collisions.
>
> You need to do the same thing.

This doesn't handle the problem of getting unique inode numbers for
user APIs.  The vnode hashing is much easier because collisions are
permitted, the number of vnodes is relatively limited, and the final
hash number doesn't have to live longer than the vnode.
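
To make the difference concrete, here is a rough sketch of the
hash-plus-comparison scheme described above.  It is not the vfs_hash
or nfsclient code; fh_hash(), fh_cmp(), node_lookup(), node_insert(),
FH_MAX and NBUCKETS are names invented for the example, and FNV-1a is
only a stand-in for whatever 32-bit hash is really used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FH_MAX   32	/* hypothetical maximum file handle size in bytes */
#define NBUCKETS 64	/* hypothetical number of hash buckets */

struct node {
	struct node *next;		/* entries sharing a bucket */
	uint32_t hash;			/* 32-bit hash of the file handle */
	size_t fhlen;			/* length of the full key */
	unsigned char fh[FH_MAX];	/* the full key itself */
};

static struct node *buckets[NBUCKETS];

/* FNV-1a over the handle bytes; a stand-in 32-bit hash. */
static uint32_t
fh_hash(const unsigned char *fh, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= fh[i];
		h *= 16777619u;
	}
	return (h);
}

/* Comparison callback: returns 0 only if the node holds exactly this key. */
static int
fh_cmp(const struct node *n, const unsigned char *fh, size_t len)
{
	return (!(n->fhlen == len && memcmp(n->fh, fh, len) == 0));
}

/* The hash only narrows the search; fh_cmp() makes the final decision. */
static struct node *
node_lookup(uint32_t hash, const unsigned char *fh, size_t len)
{
	struct node *n;

	for (n = buckets[hash % NBUCKETS]; n != NULL; n = n->next)
		if (n->hash == hash && fh_cmp(n, fh, len) == 0)
			return (n);
	return (NULL);
}

/* Record the key in the caller's node and link it into its bucket. */
static void
node_insert(struct node *n, uint32_t hash, const unsigned char *fh, size_t len)
{
	assert(len <= FH_MAX);
	n->hash = hash;
	n->fhlen = len;
	memcpy(n->fh, fh, len);
	n->next = buckets[hash % NBUCKETS];
	buckets[hash % NBUCKETS] = n;
}
```

Two different handles inserted with the same hash value are both found
correctly, because the full key is kept in the in-memory node.  Nothing
like that is available for st_ino or d_fileno: the 32-bit number is all
that the application sees.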

> Making the hashes be 64-bit is pointless since no filesystems will
> have that many inodes and it still doesn't solve the problem properly.

Never? :-)

nfs file systems can probably have 2^33 inodes now, and nfsv3 handles
this poorly by blindly truncating to "long va_fileid" or "uint32_t
d_fileno".  The former actually works on 64-bit machines, but then
stat() blindly truncates to "ino_t st_ino".  nfsv4 may be better.  It
converts directly to 32 bits by ORing bits 32-63 into bits 0-31 and
only setting 32 bits in va_fileid or d_fileno (except cookies are
sometimes used for d_fileno).  This conversion looks more like a simple
hash than part of a protocol (I don't understand the protocol).
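
For illustration, the two conversions amount to something like the
following.  This is a sketch of the arithmetic only, not the nfs client
code; fileid_truncate(), fileid_fold_or() and fileid_examples() are
invented names.

```c
#include <assert.h>
#include <stdint.h>

/* Blind truncation: keep bits 0-31 and drop bits 32-63. */
static uint32_t
fileid_truncate(uint64_t fileid)
{
	return ((uint32_t)fileid);
}

/* Fold by ORing bits 32-63 into bits 0-31. */
static uint32_t
fileid_fold_or(uint64_t fileid)
{
	return ((uint32_t)(fileid >> 32) | (uint32_t)fileid);
}

/* A few hand-checked cases. */
static void
fileid_examples(void)
{
	/* Distinct 64-bit ids that truncate to the same 32-bit value. */
	assert(fileid_truncate(0x100000005ULL) == fileid_truncate(0x5ULL));

	/* Folding separates some of them: 0x2 | 0x5 == 0x7. */
	assert(fileid_fold_or(0x200000005ULL) == 0x7);

	/* But not all: 0x1 | 0x4 == 0x5, the same as the plain id 5. */
	assert(fileid_fold_or(0x100000004ULL) == fileid_fold_or(0x5ULL));
}
```

Either way, 64 bits are squeezed into 32, so distinct server ids can
still map to the same client-visible number.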

The d_fileno (readdir()) case shows that we shouldn't try very hard
to adjust the id passed by the server.  The server's id must be trusted
to be unique for d_fileno (for POSIXish file systems) since we would
have to read much more than directory entries to fix up non-uniqueness.
getattr() must produce the same fileids as readdir(), so nothing more
can be done even when we have full vnode info.

Bruce