Date:      Tue, 29 May 2001 16:32:46 +1000
From:      Peter Jeremy <peter.jeremy@alcatel.com.au>
To:        hackers@FreeBSD.ORG
Subject:   Re: technical comparison
Message-ID:  <20010529163246.H89950@gsmx07.alcatel.com.au>

On Sun, 27 May 2001 22:50:48 -0300 (BRST), Rik van Riel <riel@conectiva.com.br> wrote:
>On Sat, 26 May 2001, Peter Wemm wrote:
>> Which is more expensive?  Maintaining an on-disk hashed (or b+tree)
>> directory format for *everything* or maintaining a simple low-cost
>> format on disk with in-memory hashing for fast lookups?
>
>I bet that for modest directory sizes the cost of disk IO outweighs
>the added CPU usage by so much that you may as well take the trouble
>of using the more scalable directory format.

I'm not sure I follow this.  Reading sequentially is always going to
be much faster than reading randomly.  For a modest directory size,
you run the risk that randomly accessing fewer blocks will actually
take longer than just reading the entire directory sequentially.
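
To put rough numbers on it (illustrative figures only, not
measurements): with an average seek plus rotation of ~10ms and a disk
that streams at ~20MB/s, a 32KB directory (64 512-byte blocks) reads
sequentially in about one seek plus 1.6ms of transfer, call it 12ms.
An indexed lookup that touches just three of those blocks at random
costs around 30ms in seeks alone.  The directory has to get fairly
large before the indexed scheme wins on I/O.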

>> For the small directory case I suspect the FFS+namecache way is more
>> cost effective.  For the medium to large directory case (10,000 to
>> 100,000 entries), I suspect the FFS+namecache method isn't too shabby,
>> providing you are not starved for memory.  For the insanely large
>> cases - I dont want to think about :-).
>
>The ext2 fs, which uses roughly the same directory structure as
>UFS and has a name cache which isn't limited in size, seems to
>bog down at about 10,000 directory entries.

As has been pointed out earlier, hash schemes are sized around some
`maximum number of entries' parameter.  Once the number of entries
grows well beyond that point, the chains get long and lookups
degenerate to (typically) O(N).  It sounds like the Linux name cache
hash simply isn't sized to handle that many directory entries.
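
As a minimal sketch of why a fixed-size bucket array degrades (the
layout and NBUCKETS below are made up for illustration, not how the
Linux dcache actually works):

    #include <string.h>

    #define NBUCKETS 1024                   /* fixed at compile time */

    struct nc_entry {
            struct nc_entry *next;          /* chain within one bucket */
            unsigned long    ino;
            char             name[64];
    };

    static struct nc_entry *buckets[NBUCKETS];

    static unsigned
    nc_hash(const char *name)
    {
            unsigned h = 0;

            while (*name)
                    h = h * 31 + (unsigned char)*name++;
            return h % NBUCKETS;
    }

    /*
     * Average chain length is nentries/NBUCKETS, so with 100,000
     * entries each lookup walks ~100 nodes - O(N) once N >> NBUCKETS.
     */
    static struct nc_entry *
    nc_lookup(const char *name)
    {
            struct nc_entry *e;

            for (e = buckets[nc_hash(name)]; e != NULL; e = e->next)
                    if (strcmp(e->name, name) == 0)
                            return e;
            return NULL;
    }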

>Daniel Phillips is working on a hash extension to ext2; not a
>replacement of the directory format, but a way to tack a hashed
>index after the normal directory index.

I think a tree structure is better than a hash because there is no
inherent limit to its size (though the downside is O(log N) lookups
rather than close to constant time).  It may be possible to build a
tree structure around the UFS directory block structure in such a way
that it remains backward compatible[1].  Of course, correctly handling
the soft-updates write ordering for a tree re-balance is non-trivial.

One point that hasn't come out so far is that reading a UFS directory
is quite easy - hence boot2 can locate the loader or a kernel by name
within the root filesystem, rather than needing hard-wired block
numbers to load.  If the directory structure does change, we need to
ensure that it remains possible to parse it (possibly inefficiently)
in a fairly small amount of code.
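
As a rough illustration of how little code the linear format needs,
here is a sketch of a single-block lookup; the struct below is a
simplified stand-in for struct direct in dir.h, not the real header:

    #include <stdint.h>
    #include <string.h>

    /* Simplified stand-in for struct direct from ufs/ufs/dir.h. */
    struct direct_ish {
            uint32_t d_ino;                 /* inode number, 0 = unused */
            uint16_t d_reclen;              /* length of this record */
            uint8_t  d_type;
            uint8_t  d_namlen;
            char     d_name[];              /* name follows */
    };

    /*
     * Walk one directory block looking for `name'.  A boot loader
     * needs little more than this loop plus a block-read routine.
     */
    static uint32_t
    dirblk_lookup(const char *blk, size_t blksize, const char *name)
    {
            const struct direct_ish *d;
            size_t off = 0;

            while (off < blksize) {
                    d = (const struct direct_ish *)(blk + off);
                    if (d->d_reclen == 0)
                            break;                  /* corrupt block */
                    if (d->d_ino != 0 &&
                        d->d_namlen == strlen(name) &&
                        memcmp(d->d_name, name, d->d_namlen) == 0)
                            return d->d_ino;
                    off += d->d_reclen;
            }
            return 0;                               /* not found */
    }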

>It also has the advantage of being able to keep using the
>tried&tested fsck utilities.

Whatever is done, fsck would need to be enhanced to validate the
directory structure, otherwise you could wind up with files that
can't be found/deleted because they aren't where the hash/tree
algorithm expects them.

Suggestion for the "let's use the filesystem as a general purpose
relational database" crowd: a userland implementation of the existing
directory search scheme (ignoring name caching) would be trivial (see
/usr/include/ufs/ufs/dir.h and dir(5) for details).  Modify postmark
(or similar) to simulate the creation/deletion of files in a userland
`directory' structure and demonstrate an algorithm that is faster for
the massive directory case and doesn't pessimize small directories.
The effects of the name cache and data-file I/O can be ignored, since
the aim is just to compare directory algorithms.
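
Something along these lines (a toy harness only - it uses a flat
fixed-size record array rather than the real struct direct layout,
and the sizes and naming scheme are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /*
     * Toy userland `directory': one flat array of fixed-size entries
     * searched linearly, standing in for the on-disk format.
     */
    #define NAMELEN  28
    #define NFILES   50000                  /* arbitrary test size */

    struct toy_entry {
            unsigned long ino;              /* 0 = free slot */
            char          name[NAMELEN];
    };

    static struct toy_entry dir[NFILES];
    static int ndir;

    static void
    toy_create(const char *name, unsigned long ino)
    {
            strncpy(dir[ndir].name, name, NAMELEN - 1);
            dir[ndir].ino = ino;
            ndir++;
    }

    static unsigned long
    toy_lookup(const char *name)
    {
            int i;

            for (i = 0; i < ndir; i++)
                    if (dir[i].ino != 0 &&
                        strcmp(dir[i].name, name) == 0)
                            return dir[i].ino;
            return 0;
    }

    int
    main(void)
    {
            char name[NAMELEN];
            clock_t t0, t1;
            int i;

            for (i = 0; i < NFILES; i++) {
                    snprintf(name, sizeof(name), "file%06d", i);
                    toy_create(name, i + 1);
            }
            t0 = clock();
            for (i = 0; i < NFILES; i++) {
                    snprintf(name, sizeof(name), "file%06d",
                        rand() % NFILES);
                    toy_lookup(name);
            }
            t1 = clock();
            printf("%d lookups: %.2f s CPU\n", NFILES,
                (double)(t1 - t0) / CLOCKS_PER_SEC);
            return 0;
    }

Swap toy_create()/toy_lookup() for a hashed or tree-indexed
implementation over the same records and the timing difference is
exactly the comparison you want.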

[1] Keep entries within each block sorted.  Reserve space at the end
    of the block for left and right child branch pointers and other
    overheads, with the left branch being less than the first entry in
    the block and the right branch being greater than the last entry.
    The reserved space is counted in d_reclen of the last entry (which
    makes it backward compatible).  I haven't thought through the
    block splitting/merging algorithm so this may not work.
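
    A possible encoding of that reserved area (the field names and
    sizes here are guesses, not a worked-out design):

        #include <stdint.h>                     /* uint32_t, uint16_t */

        /*
         * Hypothetical trailer hidden in the slack at the end of each
         * directory block.  Old code never sees it because the space
         * is covered by d_reclen of the last entry; new code finds it
         * at a fixed offset from the end of the block.
         */
        struct dirblk_node {
                uint32_t db_left;       /* block # of left child, 0 = none  */
                uint32_t db_right;      /* block # of right child, 0 = none */
                uint16_t db_flags;      /* split/merge bookkeeping          */
                uint16_t db_magic;      /* marks a block as indexed         */
        };

    Since the entries within a block are kept sorted, lookup would
    compare the target name against the first and last entries in the
    block and follow db_left or db_right as appropriate.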

Peter
