Date:      Sun, 29 Oct 2000 19:48:32 -0600 (CST)
From:      Ryan Thompson <ryan@sasknow.com>
To:        Matt Dillon <dillon@earth.backplane.com>
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: Filesystem holes
Message-ID:  <Pine.BSF.4.21.0010291907320.27883-100000@ren.sasknow.com>
In-Reply-To: <200010300007.e9U07nY71868@earth.backplane.com>

Matt Dillon wrote to Ryan Thompson:

> :> :> storage is rather inefficient for our table of about 2,850,000 members
> :> :> (~2.1 GB total storage).  There are 64M possible hash values in our
> :> :> current implementation, and our record size is variable, but could be
> :> :> safely fixed at about 1.5KB... So, total storage if all values were used
> :> :> would be about 96GB.  (See where I'm going with this?)
> :...
> :
> :Remember, though, I'm not proposing a hash table.  I have been lucky
> :enough to design a "hash" function that will crunch down all possible
> :values into about 64M unique keys--so, no collisions.  I propose to use
> :this function with address calculation.  So, O(1) everything, for single
> :random lookups, which is (virtually) all this data structure must deal
> :with.  Sorting isn't necessary, and that could be accomplished in O(n)
> :anyway.
> 
>     Are you still talking about creating a 96GB file to manage 2.1GB worth
>     of data?  I gotta say, that sounds kinda nuts to me!  CPU cycles are
>     cheap; I/O and memory are not.

Hi, Matt!  Thanks for the replies.  I'll try and keep you interested ;-)

Hmm... Perhaps you're still missing my original point?  I'm talking about
a file with 96GB of addressable bytes (well, probably a bunch of files,
given logical filesize limitations, but let's say for simplicity's sake
that we have a single file).  Its actual "size" (in terms of allocated
blocks) will be only a bit larger than 2.1GB, directly proportional to
the used size of the dataset.  (Discrepancies only come into play when
record size != block size, but that can be worked around somewhat.)

In other words, ls -ls will report the "size" as some ridiculously large
number, but will show a much smaller block count.  So, assuming four
records are added to the file on block boundaries, the file will actually
only use four blocks... nowhere near 96GB!

In the UNIX filesystem (ya, I know.. just pick one :-), size of file !=
space allocated for file.  Thus, my original questions were centered
around filesystem holes.  I.e., non-allocated chunks in the middle of a
file.  When trying to READ from within a hole, the kernel just sends back
a buffer of zeros... which is enough to show that the record is not
initialized.  In fact, something like an "exists" function for a record
wouldn't need to touch the disk at all!  When writing to a hole, the
kernel simply allocates the necessary block(s).  Creation is really fast,
too: the empty set can be written to disk with touch(1), and this uses
far less memory than virtual initialization or in-memory structures ;-)

As an example, try 

fseek(f, 0x7FFFFFFE, SEEK_SET);	/* seek ~2GB (2^31 - 2) from the start */
fputc('\n', f);			/* one byte; everything before it is a hole */

Then analyze the reported filesize, as well as the reported block count
of the file.  You should see a 2GB "size", and maybe 8K in allocated
blocks (depends on how you ran newfs ;-).  This is behaviour that has
been around since the original UFS.


>     Take the BTree example again -- if you think about it, all internal nodes
>     in the tree will fit into memory trivially.  It is only the leaf nodes
>     that do not.  So in regards to actual disk accesses you still wind up
>     only doing *ONE* for the btree, even if it winds up being four or five
>     levels deep.

Right.  And, actually (without looking a bit more closely), I wouldn't be
surprised if you could replace the four-line address calculation I have
with your B+Tree structure and come up with the same result.  The only
differences would be a few hundred lines of code, much more testing, and
quite a few megs of RAM...  ;-)
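
For illustration, the calculation is of this general shape (a sketch
only; key() here is a stand-in, not my actual function):

#include <sys/types.h>

#define RECSIZE	1536	/* fixed record size, ~1.5KB */

/* Stand-in for my real function, which maps a member to a unique
 * value in [0, 64M) -- no collisions. */
extern u_int32_t key(const char *member);

off_t
record_offset(const char *member)
{
	/* No collisions, so this offset is the record's only home.
	 * 64M keys * 1536 bytes = the 96GB address space. */
	return ((off_t)key(member) * RECSIZE);
}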

What you referred to as "nuts", above, is just a logical way to provide a
huge address space for a set of data, without actually allocating blocks
in the filesystem for the entire address space until they are used.


>     The result is that you will be able to create an index to your data,
>     which is only 2.8 million records, in around 32 bytes per record or
>     89 Megabytes.  Not 96GB.  Since you can cache all the internal nodes
>     of the btree you are done.  The machine will do a much better job
>     caching an 89MB index than a 96GB index.
> 
>     Do not make the mistake of equating cpu to I/O.  The cpu required to
>     iterate through 4 levels in the btree (assuming 64 entries per node,
>     it only takes 4 to cover 16 million records) is *ZERO* ... as in,
>     probably less than a microsecond, whereas accessing the disk is going
>     to be milliseconds.

CPU time for what I'm proposing is even closer to zero than for a tree...
But, you're right, it doesn't make any real difference when compared to
disk I/O...  B-Trees are good for a lot of things.  Address calculation
can be really good, too, given a finite key set, and a way to represent
that finite key set without wasting space.


> > > A B+Tree will also scale with the size of the dataset being managed,
> > > so you do not have to preallocate or prereserve file space. 
>
> > So will address calculation + filesystem holes, and
> > sufficiently large filesizes :-)
> 
>     Uh, not with a 50:1 file size factor it won't.

See my above discussion on file size != allocated blocks.

And, address calculation implies that no index need be created or
maintained (in memory, or on disk).  Memory utilization is practically
nil, disk utilization is proportional to the actual data size, and all
record operations are done with one seek (O(1)).
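
To make that concrete, here's a minimal sketch of the record operations,
assuming the fixed 1536-byte record and the hypothetical record_offset()
sketched earlier.  (Reads *within* a hole come back as zeros from the
kernel; short reads past end-of-file are zero-filled by hand.)

#include <sys/types.h>
#include <string.h>
#include <unistd.h>

#define RECSIZE	1536

extern off_t record_offset(const char *member);	/* from the earlier sketch */

/* One seek, one read.  A hole reads back as zeros without touching
 * any data blocks on disk. */
int
record_read(int fd, const char *member, char *buf)
{
	ssize_t n;

	lseek(fd, record_offset(member), SEEK_SET);
	if ((n = read(fd, buf, RECSIZE)) < 0)
		return (0);
	memset(buf + n, 0, RECSIZE - n);	/* zero-fill past EOF */
	return (1);
}

/* "Exists" is just a read plus an all-zeros test. */
int
record_exists(int fd, const char *member)
{
	static const char zero[RECSIZE];	/* all zeros */
	char buf[RECSIZE];

	return (record_read(fd, member, buf) &&
	    memcmp(buf, zero, RECSIZE) != 0);
}

/* One seek, one write.  The kernel allocates block(s) only here. */
int
record_write(int fd, const char *member, const char *buf)
{
	lseek(fd, record_offset(member), SEEK_SET);
	return (write(fd, buf, RECSIZE) == RECSIZE);
}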

Perhaps performance could be improved by keeping a small memory cache, but
that is tangential to the actual data structure, and will increase
performance by percentages, not orders of magnitude.

> 
> 					-Matt
> 


- Ryan

-- 
  Ryan Thompson <ryan@sasknow.com>
  Network Administrator, Accounts
  Phone: +1 (306) 664-1161

  SaskNow Technologies     http://www.sasknow.com
  #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2


