Date:      Fri, 24 Jun 2011 07:06:59 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Kostik Belousov <kostikbel@gmail.com>
Cc:        freebsd-fs@FreeBSD.org, Garance A Drosehn <gad@FreeBSD.org>
Subject:   Re: [rfc] 64-bit inode numbers
Message-ID:  <20110624054322.V1086@besplex.bde.org>
In-Reply-To: <20110623081140.GQ48734@deviant.kiev.zoral.com.ua>
References:  <20101201091203.GA3933@tops> <20110104175558.GR3140@deviant.kiev.zoral.com.ua> <20110120124108.GA32866@tops.skynet.lt> <4E027897.8080700@FreeBSD.org> <20110623064333.GA2823@tops> <20110623081140.GQ48734@deviant.kiev.zoral.com.ua>

On Thu, 23 Jun 2011, Kostik Belousov wrote:

> On Thu, Jun 23, 2011 at 09:43:33AM +0300, Gleb Kurtsou wrote:
>> On (22/06/2011 19:19), Garance A Drosehn wrote:
>>> On 1/20/11 7:41 AM, Gleb Kurtsou wrote:
>>>> I've updated the patch. New version is available here:
>>>> https://github.com/downloads/glk/freebsd-ino64/freebsd-ino64-patch-2011-01-20.tgz
>>>>
>>>> Changelog:
>>>> * Add fts, ftw, nftw compat shims in libc
>>>> * Place libc compat shims in separate files, don't hack original
>>>>    implementations.
>>>> * Fix dump/restore
>>>> * Use ino_t in UFS code (suggested by Kirk McKusick)

Of course it must not use ino_t in the parts of ffs related to the on-disk
inode.  Your patch does this, but I wonder if it converts from the disk inode
to ino_t too early in some places.  C's type system is too weak to find
wrong conversions easily.  On an old system, I once used funky types like
double or a pointer for at least mode_t to find all the places that assumed
mode_t to be an int.  This helped find all the places that assumed it to
be an int of a particular size.
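
A minimal sketch of a related trick, not something taken from the patch:
wrapping the on-disk inode number in a one-member struct makes the compiler
reject any implicit conversion to or from the in-core type, so every
conversion has to go through a named helper that is easy to grep for.  The
names disk_ino_t, core_ino_t, disk_to_core() and core_to_disk() are
hypothetical, and the 32-bit/64-bit widths are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical on-disk inode number: 32 bits, wrapped in a struct so that
 * it cannot be assigned to or from a plain integer without a helper.
 */
typedef struct { uint32_t v; } disk_ino_t;

/* Hypothetical in-core inode number, standing in for a 64-bit ino_t. */
typedef uint64_t core_ino_t;

/* Widening from the disk type is always safe. */
static inline core_ino_t
disk_to_core(disk_ino_t d)
{
    return (core_ino_t)d.v;
}

/* Narrowing can lose bits, so report failure instead of truncating. */
static inline bool
core_to_disk(core_ino_t c, disk_ino_t *out)
{
    assert(out != NULL);
    if (c > UINT32_MAX)
        return false;
    out->v = (uint32_t)c;
    return true;
}
```

With these types, a statement such as `core_ino_t n = some_disk_ino;` fails
to compile, which is the point: the wrong conversions show up as errors
rather than as silent truncation.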

>>>> * Keep ufs_ino_t (32 bit) for boot2 not to increase size
>>>>
>>> Sorry for replying to an older message, but a reply made in a different
>>> thread reminded me about this project...
>>>
>>> Also, I may have asked this before.  In fact, I'm almost sure that I started
>>> a reply to this back in Jan/Feb, but my email client claims I never replied
>>> to this topic...
>>>
>>> Are you increasing only the size of ino_t, or could you also look at
>>> increasing the size of dev_t?   (just curious...)
>>
>> Sure. Incorporating as much of similar changes as possible is good.

Increasing the size of dev_t would be negatively good.  Even when the
minor number was meaningful and was abused to encode device control
sparsely, 4 billion devices is thousands of times as many as needed.
Without the sparse mapping, it is millions of times as many as needed.  Reducing
it back to 16 bits like it was in FreeBSD-1 would be good, but would
break portability.  Finding all the places that assume that it is 32
bits and changing them to uint32_t would be good.

ffs is already partly correct here (unlike for ino_t).  Its di_rdev
is di_db[0], and di_db is either ufs1_daddr_t (int32_t) or ufs2_daddr_t
(int64_t).  Thus the on-disk type is already independent of dev_t.
But this is only the start of being correct.  ffs does blind assignments
between va_rdev and dev_t's, and suffers overflows if the types are
different.  I hope the new ino_t code doesn't do blind assignments.
Since opening of device nodes on ffs file systems is no longer supported,
the device numbers in di_rdev are only used for compatibility:
- mknod() still works to create specified device numbers, provided they
   fit in a 32-bit dev_t (strictly, 32-bit ones don't fit since
   ufs1_daddr_t only has 31 value bits, but the overflow for blind
   assignment of the 32nd value bit is benign on all supported arches).
   So you can still back up your FreeBSD-4 /dev or maybe your Linux
   /dev on an ffs file system.
- mknod() is still abused by badsect(8) to encode bad sector numbers in
   di_rdev.  This may even still work for ffs1.  It is broken for ffs2 by
   the type mismatch, and the blind assignments result in the error not
   being detected (ffs2 has 64-bit sector numbers, and its di_rdev can
   encode these, but mknod() can only pass 32-bit device numbers).
   FreeBSD-1 had the same problem with 16-bit device numbers not being
   able to encode 32-bit sector numbers.  I hope I fixed badsect(8)
   enough to detect all cases where the blind assignment will fail.
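
To illustrate the difference between a blind assignment and a checked one,
here is a rough sketch.  It is not the ffs code: rdev32_t, rdev_from_disk()
and rdev_to_ufs1() are hypothetical names, with int64_t and int32_t standing
in for ufs2_daddr_t and ufs1_daddr_t as described above, and EOVERFLOW is
just one plausible choice of error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t rdev32_t;  /* hypothetical stand-in for a 32-bit dev_t */

/*
 * Convert a 64-bit on-disk field (ffs2-style di_rdev) to a 32-bit device
 * number.  Returns 0 on success, or EOVERFLOW if the value does not fit,
 * instead of silently dropping the high bits.
 */
static inline int
rdev_from_disk(int64_t disk, rdev32_t *out)
{
    assert(out != NULL);
    if (disk < 0 || disk > (int64_t)UINT32_MAX)
        return EOVERFLOW;
    *out = (rdev32_t)disk;
    return 0;
}

/*
 * Store a 32-bit device number in an ffs1-style field that has only
 * 31 value bits.  This sketch rejects values that need the 32nd bit.
 */
static inline int
rdev_to_ufs1(rdev32_t dev, int32_t *out)
{
    assert(out != NULL);
    if (dev > (rdev32_t)INT32_MAX)
        return EOVERFLOW;
    *out = (int32_t)dev;
    return 0;
}
```

The second helper is deliberately stricter than necessary: as noted in the
list above, letting the 32nd bit wrap is benign on the supported arches, so
real code might prefer to preserve the bit pattern rather than fail.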

>> I've added Kostik and Matthew to CC list, it's for them to decide.
>>
>> dev_t on other OSes:
>> 	NetBSD - uint64_t
>> 	DragonFly - uint32_t
>> 	Darwin - __int32_t
>> 	OpenSolaris - ulong_t
>> 	Linux - __u32
>>
>> Considering this I think 3rd party software is not ready for such
>> change.

Well, it should be ready, since the size depends on the O/S.  Suppose
a NetBSD system actually uses 64-bit device numbers.  FreeBSD cannot
support this now, so it should give an error for an attempt to back
up a NetBSD /dev, but the blind assignments may break this.  ulong_t
on Solaris might give the same problem on 64-bit machines, but I
guess ulong_t is actually an obfuscation of uint32_t.

>> Major/minor mapping to dev_t will get more complicated.
>>
>> And the most important question: what would you want it for? As far as I
> Indeed, this is the right question.
>
>> can see major/minor numbers are ignored nowadays, major is zero, minor
>> increases independently of device type:
> This is only because you have too little /dev nodes.

How can he have >= 4G /dev nodes to test this? :-)  Ah, I think I see:
for devfs, the major number is normally 0, and minor numbers don't encode
anything and are allocated sequentially and may differ across boots.
But there are only 24 minor number bits according to major/minor, so
the major must change from 0 to 1 on the 2**24 ~= 16 millionth device
or earlier (I think actually on the 2**8th = 256th device, due to the
encoding of major/minor being for compatibility with 16-bit dev_t).
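
The arithmetic can be checked with a small sketch.  It assumes the split
described above: the major number in bits 8-15 and the minor number in the
remaining 24 bits (the low 8 and the high 16).  The helper names
compat_major(), compat_minor() and compat_makedev() are hypothetical; the
real macros are in sys/types.h and should be consulted directly.

```c
#include <assert.h>
#include <stdint.h>

/* Major number: bits 8-15 of the device number (assumed layout). */
static inline unsigned
compat_major(uint32_t dev)
{
    return (dev >> 8) & 0xff;
}

/* Minor number: everything except bits 8-15, i.e. 24 bits in total. */
static inline uint32_t
compat_minor(uint32_t dev)
{
    return dev & 0xffff00ffu;
}

/* Recombine a major/minor pair under the same assumed layout. */
static inline uint32_t
compat_makedev(unsigned maj, uint32_t min)
{
    return ((uint32_t)(maj & 0xff) << 8) | (min & 0xffff00ffu);
}

/* Check the claims in the text for sequentially allocated numbers. */
static inline void
compat_selfcheck(void)
{
    assert(compat_major(255) == 0);       /* still major 0 */
    assert(compat_major(256) == 1);       /* major changes here */
    assert(compat_minor(256) == 0);
    assert(compat_major(0x10000) == 0);   /* high minor bits, major 0 */
    assert(compat_minor(0x10000) == 0x10000);
}
```

Under this layout a device number that simply counts up from 0 shows
major 1 as soon as it reaches 256, even though 24 bits are nominally
available for the minor number.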

> Look at the definitions of the major/minor in sys/types.h.

These are only for compatibility.  Even expanding dev_t would break this
compatibility.  The types of breakage are easier to see for reducing
dev_t back to 16 bits.  Then for devfs, the major number should change
from 0 to 1 on the 256th device, but nothing should break until the
65536th device; the major/minor split that is still displayed by ls(1)
is meaningless.  For non-devfs, things like backing up OtherOS's /dev
or even your own /dev to an ffs file system will break on the 65536th
device; anything depending on the encoding of minor numbers or the
major/minor split will break on the 256th minor, but I can't see how
anything in FreeBSD can reasonably depend (dynamically) on this encoding
or split -- the device number is just an index for an actual device,
and you can't do anything with it in a device node except copy the node.

Bruce


