From owner-freebsd-arch@FreeBSD.ORG  Fri Sep  2 11:39:24 2005
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: freebsd-arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D498016A41F
	for <freebsd-arch@FreeBSD.org>; Fri,  2 Sep 2005 11:39:24 +0000 (GMT)
	(envelope-from bde@zeta.org.au)
Received: from mailout2.pacific.net.au (mailout2.pacific.net.au [61.8.0.115])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 4064A43D46
	for <freebsd-arch@FreeBSD.org>; Fri,  2 Sep 2005 11:39:24 +0000 (GMT)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.0.87])
	by mailout2.pacific.net.au (8.13.4/8.13.4/Debian-3) with ESMTP id
	j82Bd8cw001167; Fri, 2 Sep 2005 21:39:08 +1000
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (8.13.4/8.13.4/Debian-3) with ESMTP id
	j82Bd7HH007500; Fri, 2 Sep 2005 21:39:07 +1000
Date: Fri, 2 Sep 2005 21:39:06 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Dmitry Pryanishnikov <dmitry@atlantis.dp.ua>
In-Reply-To: <20050901183311.D62325@atlantis.atlantis.dp.ua>
Message-ID: <20050902205456.S2885@delplex.bde.org>
References: <20050901183311.D62325@atlantis.atlantis.dp.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-arch@FreeBSD.org
Subject: Re: kern/85503: panic: wrong dirclust using msdosfs in RELENG_6
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Sep 2005 11:39:25 -0000

On Thu, 1 Sep 2005, Dmitry Pryanishnikov wrote:

>>> I think it's feasible and useful to upgrade type of v_hash to at least 
>>> off_t.
>> 
>> This is not needed yet.
>> 
>> Filesystems with more than 4G files are not supported yet, since ino_t
>> is 32 bits and is used in critical APIs (struct stat...).  Also,
>
> Sorry, I don't agree with you. The current situation is ugly: not only
> it forces us to play dirty tricks within filesystems in order to generate
> unique 32-bit inode numbers, but also it creates an artificial limit

If you want to fix this, first work on the much larger problems of enlarging
ino_t and changing the widely used ffs file system to support more than 4G
files.  Note that this was considered too hard to do for ffs2.

Tricks to map to the API's inode number space are unavoidable due to
the existence of compatibility APIs and belong in individual file
systems since they are too hard to do generally.  General code could
only hash from a larger v_hash type to a smaller compat_subsystem_ino_t
type and then somehow make the hash unique.  It is only necessary for
the result to be unique for files actually returned through the smaller
ino_t's since boot time (or since mount time for a poor implementation
that doesn't work as well as possible for at least nfs servers), but
even this seems to require storing up to
SMALLER_INO_T_MAX*sizeof(smaller_ino_t) bytes of history of recycled
vnodes.
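
To make the history cost concrete, here is a minimal sketch of such a
general mapping.  Everything in it (struct ino_map, small_ino_for(),
HISTORY_MAX) is hypothetical and not taken from any FreeBSD source; it
hands out small ids sequentially instead of hashing, and it remembers the
wide id for every small id it has ever returned, which is exactly the
history that has to be kept:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative bound only; a real table would need one slot per
 * possible value of the smaller inode type. */
#define HISTORY_MAX 1024

/* Hypothetical history of every wide id that was ever given a small id. */
struct ino_map {
	uint64_t wide_id[HISTORY_MAX];
	size_t used;
};

/*
 * Return a small inode number that is unique for wide_id.  The same
 * wide_id always gets the same small number back.  Returns 0 when the
 * history is full (0 is never handed out as a valid number here).
 */
static uint32_t
small_ino_for(struct ino_map *m, uint64_t wide_id)
{
	size_t i;

	/* Linear search keeps the sketch short; a real one would hash. */
	for (i = 0; i < m->used; i++)
		if (m->wide_id[i] == wide_id)
			return ((uint32_t)(i + 1));
	if (m->used == HISTORY_MAX)
		return (0);
	m->wide_id[m->used] = wide_id;
	return ((uint32_t)(++m->used));
}
```

Entries are never dropped from the table, since forgetting one would allow
an id that was already returned to be reused for a different file.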

> on maximum number of files for 32-bit architectures. E.g., on FreeBSD/ia64
> u_int is 64 bits, and thus it would be no problem for it's API to create and 
> handle more than 4G files/fs. But such a file system will be incompatible

Actually u_int is 32 bits for ia64, and the ino_t API/ABI is independent
of the size of u_int.  ino_t is uint32_t.

> with FreeBSD/i386! Isn't this ugly? u_int has nothing to do with storage
> size, while off_t has. It is clear that no media with maximum size of

Neither u_int nor off_t has anything to do with the correct storage
size here.  off_t is a signed integer type suitable for representing
offsets within files.  Since off_t is signed, it is unsuitable for
representing offsets within file systems.  It just happens to work
because it is 64 bits and an offset of 2^63-1 bytes is enough for
anyone ;-).  (Actually it is not even enough for offsets within files
since offsets in /dev/kmem are often > 2^63 on 64-bit systems.) ino_t
is closer to being the correct type.  The type of v_hash certainly doesn't
need to be larger than ino_t.  My main point is that although it could be
larger so that file systems can easily create a (unique) id from things
like (dirclust, diroffset) pairs, it is not useful for it to be larger
since file systems need to create an id for the inode number anyway.
(Creation in some file systems, e.g. ffs, is just copying the inode
number from the inode.)

> off_t will contain more than off_t files, while we can't guarantee this
> for u_int, which is bounded to CPU abilities. I think UNIX is about
> compatibility between different architectures, isn't it?

Unix is mostly about source-level compatibility.

>> So all current file systems need to generate unique 32-bit inode
>> numbers.  This may be difficult, but once it is done I think the inode
>                 ^^^^^^^^^^^^^^^^
>
>  ...and may be close-to-impossible. What if e.g. Microsoft invites say 
> FAT-2005 with variable-length directory entries? I'm not sure that for
> every third-party filesystem it would be possible to generate 32-bit
> pseudoinode. And it's very bad that we can't handle >4Gfiles/fs at all.

It already invented variable-length entries for long names in 1990-1995 :-).
But the sizes of the entries are multiples of 32.  This is required for
compatibility and won't change.

I think I said that the inode number in msdosfs should be the cluster
number of the first cluster in the file.  This would be broken by
variable-sized clusters (unlikely, and even less useful) or new file
types like symlinks (useful and not so unlikely -- FreeBSD could add
them as an extension).

>> For msdosfs, the inode number is essentially the byte offset divided by
>> the size of a directory entry.  The size is 32, so this breaks at a byte
>> offset of 128G instead of 4G.  Details:
>
> This is also imperfect: it creates a lot of pain and limitations for
>
> options         MSDOSFS_LARGE

So use the cluster number and only worry about the limit of 16TB for
4K-clusters, etc.
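
The limits follow from multiplying the 2^32 possible inode numbers by the
size of the unit being numbered.  A small sketch of that arithmetic, with
hypothetical helper names that are not msdosfs code:

```c
#include <stdint.h>

/* Size of a FAT directory entry in bytes, as stated above. */
#define DIRENT_SIZE 32u

/* Hypothetical: inode number derived from the byte offset of the
 * directory entry, i.e. offset / 32, truncated to 32 bits. */
static uint32_t
ino_from_offset(uint64_t byte_offset)
{
	return ((uint32_t)(byte_offset / DIRENT_SIZE));
}

/* Bytes that a 32-bit id can cover when each id names one unit of
 * unit_size bytes (a directory entry or a cluster). */
static uint64_t
addressable_bytes(uint64_t unit_size)
{
	return (((uint64_t)1 << 32) * unit_size);
}
```

addressable_bytes(32) is 2^37 bytes, the 128G limit for ids based on
directory entries, and addressable_bytes(4096) is 2^44 bytes, the 16TB
limit for ids based on 4K clusters.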

> So, while I understand complexity of such a transitions, but it's clear
> that for long-term solution ino_t should be upgraded to the size of off_t 
> everywhere. For short-term one... Well, msdosfs isn't the worst case.

Indeed.  The only important cases are ffs and some network file systems
that already support >= 4G files.

Bruce