From owner-freebsd-arch@FreeBSD.ORG Sat Sep 3 23:43:50 2005
Date: Sun, 4 Sep 2005 09:43:31 +1000 (EST)
From: Bruce Evans
To: Poul-Henning Kamp
Cc: Dmitry Pryanishnikov , freebsd-arch@FreeBSD.org
Subject: Re: kern/85503: panic: wrong dirclust using msdosfs in RELENG_6
In-Reply-To: <44604.1125782260@phk.freebsd.dk>
Message-ID: <20050904090740.L2820@epsplex.bde.org>
References: <44604.1125782260@phk.freebsd.dk>
List-Id: Discussion related to FreeBSD architecture

On Sat, 3 Sep 2005, Poul-Henning Kamp wrote:

> In message <20050904065305.T2366@epsplex.bde.org>, Bruce Evans writes:
>> On Sat, 3 Sep 2005, Dmitry Pryanishnikov wrote:
>>
>>> On Sat, 3 Sep 2005, Dmitry Pryanishnikov wrote:
>>>>> I think I said that the inode number in msdosfs should be the cluster
>>>>> number of the first cluster in the file.
>>>>> This would be broken by
>>> Ups, how about empty files? They haven't any allocated clusters, have
>>> they? So, alas, we can't go this route.
>>
>> Urk.  It also doesn't work for cd9660.  So the block number can be
>> used at most as a hint getting a unique fake inode number, and in
>> msdosfs file systems don't have to be much larger than 128GB to have
>> >= 4G files -- a 128+GB file system can consist of 128GB of directories
>> all containing empty files :-).
>
> Uhm, did none of you guys see my email about how this must be
> done correctly the same way NFS does it correctly ?

Yes.  I even mentioned it in my reply.

> To repeat:
>
> NFS has the same sort of problem, it has 16 or 32 *bytes* filehandles
> that need to hash to 32 bit "inode numbers".
>
> If you look at vfs_hash_get calls in sys/nfsclient you can see that
> it calculates a 32bit hash but then provides a "nfs_vncmpf" function
> to do the actual comparison to resolve hash collisions.
>
> You need to do the same thing.

This doesn't handle the problem of getting unique inode numbers for user
APIs.  The vnode hashing is much easier because collisions are permitted,
the number of vnodes is relatively limited, and the final hash number
doesn't have to live longer than the vnode.

> Making the hashes be 64bit is pointless since no filesystems will
> have that many inodes and it still doesn't solve the problem properly.

Never? :-)  nfs file systems can probably have 2^33 inodes now, and nfsv3
handles this poorly by blindly truncating to "long va_fileid" or
"uint32_t d_fileno".  The former actually works on 64-bit machines, but
then stat() blindly truncates to "ino_t st_ino".

nfsv4 may be better.  It converts directly to 32 bits by ORing bits 32-63
into bits 0-31 and only setting 32 bits in va_fileid or d_fileno (except
cookies are sometimes used for d_fileid).  This conversion looks more like
a simple hash than part of a protocol (I don't understand the protocol).

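To make the ORing concrete, here is a rough sketch in C.  It is not the
actual nfsv4 client code; fold_fileid() and fileid_same() are made-up
names, and the only thing taken from the discussion above is the idea of
ORing bits 32-63 into bits 0-31 and then needing a full comparison to
tell colliding ids apart.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): fold a 64-bit file id
 * into 32 bits by ORing the high half into the low half.
 */
static uint32_t
fold_fileid(uint64_t id)
{
	return ((uint32_t)(id >> 32) | (uint32_t)id);
}

/*
 * Hypothetical helper: decide whether two 64-bit ids name the same file.
 * The folded value is only a cheap first check; the full ids must still
 * be compared because distinct ids can fold to the same 32 bits.
 */
static int
fileid_same(uint64_t a, uint64_t b)
{
	if (fold_fileid(a) != fold_fileid(b))
		return (0);
	return (a == b);
}
```

For example, the ids 0x0000000100000002 and 0x0000000200000001 both fold
to 3, so the folded value alone cannot serve as a unique st_ino or
d_fileno; only a comparison of the full ids separates them, and that
comparison is available inside the kernel but not to a user of stat() or
readdir().
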
The d_fileno (readdir()) case shows that we shouldn't try very hard to
adjust the id passed by the server.  The server's id must be trusted to be
unique for d_fileno (for POSIXish file systems) since we would have to
read much more than directory entries to fix up non-uniqueness.  setattr()
must produce the same fileids as readdir(), so more can't be done when we
have full vnode info.

Bruce