Date:      Wed, 4 Aug 2004 07:59:30 -0400 (EDT)
From:      Daniel Ellard <ellard@eecs.harvard.edu>
To:        Kathy Quinlan <kat-free@kaqelectronics.dyndns.org>
Cc:        freebsd-hardware@freebsd.org
Subject:   Re: Big Problem
Message-ID:  <20040804074504.L22815@bowser.eecs.harvard.edu>
In-Reply-To: <4110C9C7.6080506@kaqelectronics.dyndns.org>
References:  <4110C9C7.6080506@kaqelectronics.dyndns.org>

On Wed, 4 Aug 2004, Kathy Quinlan wrote:

> First off, I am not a troll; this is a serious email.  I cannot go
> into too many fine points, as I am bound by an NDA.

That's too bad, because it will make this a little more complicated.
But nevertheless...

> The problem:
>
> I need to hold a text file in RAM; the text file in the foreseeable
> future could be up to 10TB in size.
>
> My Options:
>
> Design a computer (probably multiple AMD 64's) to handle 10TB of
> memory (+ a few extra GB of RAM for system overhead) and hold the
> file in one physical computer system.

If you can find/construct a mobo with sockets for 10TB of RAM...
That's 10,000 1 GB sticks.  That would be quite a design exercise.

> Build a server farm and have each server hold a portion, e.g. 4GB
> each (2,500 servers, plus a few extra for system overhead).
>
> The reason the file needs to be in RAM is that I need fast
> searches for patterns in the data (less than 1 second to pull out
> relevant chunks).

If what you're doing is searching for relatively small subsets of the
data (e.g., a particular record or a handful of records), then you
don't need to do this entirely in RAM.  If you use an appropriately
sized B-tree and cache the high levels of the tree, it only takes a
few I/Os to find any particular record.  It's even better if you can
spread the tree around a bit: split it among a bunch of hosts, each
caching the upper few GB of its subtree.  Arrange the data over lots
of thinly-allocated disks (you can get very good read performance from
disks if you aren't concerned about space efficiency, and if you've
got the budget to buy 10 TB of RAM and design a custom machine to put
it in, I'm guessing that buying a few racks of disks won't be an
issue).
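
Just to put numbers on that (strictly a back-of-the-envelope sketch;
the record size and fan-out below are assumptions for illustration,
not anything from your description):

    /*
     * Rough estimate of B-tree depth for 10 TB of data.
     * Assumed (not from the original problem): 1 KB records and
     * a fan-out of about 500 keys per internal page.
     */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
            double data_bytes = 10e12;      /* 10 TB */
            double record_bytes = 1024.0;   /* assumed record size */
            double fanout = 500.0;          /* assumed keys per page */
            double nrecords = data_bytes / record_bytes;

            /* height = ceil(log base fanout of nrecords) */
            double height = ceil(log(nrecords) / log(fanout));

            printf("%.0f records, tree height %.0f\n",
                nrecords, height);
            printf("cache the top %.0f levels and a lookup is ~1 I/O\n",
                height - 1);
            return (0);
    }

Compile it with "cc -o btdepth btdepth.c -lm" and it reports a height
of 4; cache the top three levels and every lookup is a single seek.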

If, on the other hand, you're looking for something interesting in
the data (i.e., you're not just searching for keys, but doing some
processing as you go), then the issue probably isn't RAM and I/O, but raw
processing power.  It takes a long time to scan through 10 TB of data,
whether that data is in RAM or on disk -- you'll never get it done in
a second.  In this case, heaps of processors are probably your only
hope.  Controlling them will be an interesting challenge.
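
To make that concrete (another sketch; the per-node bandwidth figure
is a rough assumption about commodity hardware, not a measurement):

    /*
     * How many machines does it take to scan 10 TB in one second?
     * Assumes each node can stream ~5 GB/s out of local RAM, which
     * is generous for commodity hardware.
     */
    #include <stdio.h>

    int
    main(void)
    {
            double data_bytes = 10e12;  /* 10 TB */
            double node_bw = 5e9;       /* assumed 5 GB/s per node */

            printf("~%.0f nodes just to touch every byte in 1 s\n",
                data_bytes / node_bw);
            return (0);
    }

That works out to around 2,000 nodes just to read the data through
once, before any actual pattern matching happens; that's the scale of
"heaps" I have in mind.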

Of course, the problem is probably somewhere in the middle.  Tell us
what you can without violating your NDA...

-Dan


