Date:      Sun, 10 Mar 2002 16:15:55 -0500
From:      Garance A Drosihn <drosih@rpi.edu>
To:        Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc:        arch@FreeBSD.ORG
Subject:   Re: Increasing the size of dev_t and ino_t
Message-ID:  <p05101537b8b1707a3659@[128.113.24.47]>
In-Reply-To: <35384.1015748266@critter.freebsd.dk>
References:  <35384.1015748266@critter.freebsd.dk>

At 9:17 AM +0100 3/10/02, Poul-Henning Kamp wrote:
>In message Garance A Drosihn writes:
>>I don't see how this would work for OpenAFS.  By that I mean that
>>I do not know how the dev_t-pointer that you're talking about is
>>used when implementing something like OpenAFS or ARLA support.
>
>I have no idea what the problem would be, so you will have to tell
>me before I can answer you...

Well, this will be an answer from the user-land perspective.  It
is only an observation of the number of "devices" involved,
because I don't know the details of the underlying implementation.
So, pick up a few grains of salt, and let's try the following...

First, my starting assumption on the significance of the st_dev
value.  My take on that value is that if two files have the same
value for their device, then you could remove one of those files
and hardlink the other file to the name of the removed file.
Hardlinks cannot cross device boundaries, but if these two files
have the same value for st_dev then that hard link would not be
crossing a device boundary.  Or, another way to think of it is that
if two files have the same device-number, and if they both have an
st_nlink count of 1, then removing one of those files will result
in more space being available for the expansion of the other file.
(perhaps after a reboot, to eliminate the question of open file
descriptors keeping that first file around even though you have
unlinked it).  I do not know if the appropriate standards would
agree with me on these views, but they seem like a logical premise.
Otherwise, a st_dev value would have no special meaning at all.
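
A minimal sketch of that premise from userland, in Python.  It only
illustrates the assumption stated above (same st_dev means a hard
link is possible); it does not claim any standard guarantees this.
The helper names same_device, replace_with_hardlink and demo are
made up for this example.

```python
import os
import tempfile


def same_device(path_a, path_b):
    """Return True if both paths report the same st_dev value."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def replace_with_hardlink(keep, replace):
    """Remove `replace`, then hardlink `keep` under that name.

    Refuses up front when st_dev differs, because a hard link
    cannot cross a device boundary.
    """
    if not same_device(keep, replace):
        raise OSError("paths are on different devices")
    os.unlink(replace)
    os.link(keep, replace)


def demo():
    """Create two files in one directory and link one over the other."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a")
        b = os.path.join(d, "b")
        for p in (a, b):
            with open(p, "w") as f:
                f.write(p)
        replace_with_hardlink(a, b)
        st_a, st_b = os.stat(a), os.stat(b)
        # After linking, both names refer to one file: same
        # (st_dev, st_ino) pair, and a link count of 2.
        same = (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
        return same, st_a.st_nlink
```

Note that the check compares st_dev only.  Whether two names are the
same file is decided by the (st_dev, st_ino) pair, which is why the
width of both fields matters.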

In afs/openafs/arla, the "device" (in the above sense of the word)
is the AFS-volume.  Disk quotas are applied at the AFS-volume-level.
AFS also has the notion that the administrator can move a volume
around between disk-partitions, or even disk-servers, without the
user noticing.  So, for disk-balancing purposes (among other things),
it is ideal to have a lot of small-ish volumes instead of trying
to cram as much as possible in each volume.  There is also the
concept of "read-only" vs "read-write" volumes.  Every "read-only"
volume would have a matching "read-write" volume, but they would be
different devices as far as this st_dev value is concerned.  Each
read-write volume can also have a "backup volume", which is the
snapshot of that read-write volume as it was at the time of the
most recent backup (it is also read-only in nature).

Thus, an AFS-cell tends to have a lot of volumes.  In the AFS
cell at RPI, there are over 12,000 user accounts, each of which
has its own AFS-volume (for disk-quota purposes), and each of
which has an AFS-backup-volume.  That's 24,000 volumes just for
home directories, and I am sure we have well over 32K AFS-volumes
in the AFS-cell at RPI.  It's possible we have over 64K distinct
AFS-volumes in the cell, but I don't know how to come up with the
exact count for that.

When running AFS, the machine effectively mounts all AFS cells
that are defined in a file called 'CellServDB'.  On our public
unix machines, we define 163 different AFS-cells.  Most of those
AFS-cells are smaller than RPI's AFS-cell, but certainly all
the volumes in those cells add even more unique devices.

One way around all these devices would be to just create st_dev
numbers on the fly, as each volume is referenced, and cache that
value until the next reboot.  That is probably workable, but I
am a little uneasy about it because we (RPI) also had at least
one professor who liked to do a 'find' of EVERYTHING in RPI's
afs-cell, looking for any publicly-readable files, which he
then provided in a file listing for anyone who was curious.
What would happen to the machine he runs that 'find' command on?
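
To make the worry concrete, here is a rough Python sketch of the
"allocate on the fly and cache until reboot" idea.  Everything in
it is an assumption for illustration: the DevAllocator class, the
16-bit default width, and the cell name "example.edu" are invented
here and say nothing about how OpenAFS or ARLA really do it.

```python
class DevAllocator:
    """Hand out st_dev numbers lazily, one per (cell, volume) key."""

    def __init__(self, bits=16):
        self.limit = 1 << bits   # how many distinct values fit
        self.cache = {}          # (cell, volume_id) -> st_dev

    def lookup(self, cell, volume_id):
        """Return the cached st_dev for a volume, allocating if new."""
        key = (cell, volume_id)
        if key not in self.cache:
            if len(self.cache) >= self.limit:
                # Nothing is ever evicted before "reboot", so a full
                # table is a hard failure in this sketch.
                raise OverflowError("out of st_dev values")
            self.cache[key] = len(self.cache)
        return self.cache[key]


def walk_everything(alloc, cell, n_volumes):
    """Touch every volume once, as a cell-wide 'find' would."""
    for vol in range(n_volumes):
        alloc.lookup(cell, vol)
    return len(alloc.cache)
```

With a 16-bit table, walk_everything(DevAllocator(), "example.edu",
24000) fits, but a walk over 70,000 volumes raises OverflowError.
A cell with more than 64K volumes would hit that on one 'find'.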

So, let's drop back and say my initial premise is wrong.  Maybe
the st_dev value is just an arbitrary number with absolutely no
special meaning.  We then have the question of how to map all
of these AFS-volumes into st_dev values (where you might map
multiple AFS-volumes into a single st_dev value, just so you
have fewer unique st_dev values).  Some care would have to be
taken in how that mapping is done, just to be sure that two
files which are in different AFS-volumes are recognized as
different files even if they have the same value for st_ino.

I have not looked into the openAFS source code yet, but from
what I can see I would guess that what AFS uses for a volume
ID (what *it* uses to keep track of each volume) is a 32-bit
number.  I'm seeing volume-id values like 537,315,825, for
instance.
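
The mapping concern above can be sketched in a few lines of Python.
The folding scheme (keep only the low bits of the volume ID) and the
names fold_volume and file_identity are hypothetical, chosen only to
show how two volumes can end up with the same st_dev.  VOL_B is an
invented second volume, not one observed in any cell.

```python
def fold_volume(volume_id, bits=16):
    """Map a volume ID to a narrower st_dev by dropping high bits."""
    return volume_id & ((1 << bits) - 1)


def file_identity(volume_id, ino, bits=16):
    """The (st_dev, st_ino) pair a stat() caller would compare."""
    return (fold_volume(volume_id, bits), ino)


VOL_A = 537315825            # volume ID quoted in this message
VOL_B = VOL_A + (1 << 16)    # hypothetical volume, same low 16 bits
```

Under this scheme file_identity(VOL_A, 42) and file_identity(VOL_B,
42) compare equal, so two different files would look like one file
to anything that trusts the (st_dev, st_ino) pair.  With bits=32 the
two pairs stay distinct.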

At the same time that we have all these volumes, we can't assume
that all volumes will have fewer than (say) 32K distinct inodes
in them.  The AFS-volumes for users' home directories are pretty
small, but we (RPI) have other AFS-volumes which are hundreds
of megabytes, and which thus can contain a lot of files.  I
*think* we even have a few AFS-volumes which are gigabyte-sized,
but I know we try our best to discourage larger AFS-volumes.

So, my basic observation is that with UFS2 we're probably going
to want to increase the size of st_ino, but I would argue that
we can't do that by *shrinking* the size of st_dev.  I would
also argue that it makes more sense to increase the size of
st_dev at the same time we increase st_ino.

-- 
Garance Alistair Drosehn            =   gad@eclipse.acs.rpi.edu
Senior Systems Programmer           or  gad@freebsd.org
Rensselaer Polytechnic Institute    or  drosih@rpi.edu

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message



