Date:      Wed, 29 May 2002 13:42:11 +0930
From:      Greg 'groggy' Lehey <grog@FreeBSD.org>
To:        Scott Long <Scott_Long@adaptec.com>
Cc:        Doug White <dwhite@resnet.uoregon.edu>, Cy Schubert - CITS Open Systems Group <Cy.Schubert@uumail.gov.bc.ca>, Kirk Strauser <kirk@strauser.com>, freebsd-stable@freebsd.org
Subject:   Re: Hardware RAID vs vinum
Message-ID:  <20020529134211.L82424@wantadilla.lemis.com>
In-Reply-To: <2C7CBDC6EA58D6119E4A00065B3A24CB04636E@btcexc01.btc.adaptec.com>
References:  <2C7CBDC6EA58D6119E4A00065B3A24CB04636E@btcexc01.btc.adaptec.com>

On Tuesday, 28 May 2002 at 18:59:26 -0600, Scott Long wrote:
>> On Tuesday, 28 May 2002 at 13:04:19 -0700, Doug White wrote:
>>> On Tue, 28 May 2002, Greg 'groggy' Lehey wrote:
>>>
>>>> This can only happen if there's something very wrong with the
>>>> software RAID-5, either the implementation or the configuration.
>>>> One thing that springs to mind is too small a stripe size.
>>>> There's a myth going around that this improves performance, when
>>>> in fact it seriously degrades it.
>>>
>>> Mind expounding on this topic?
>>
>> Sure, I have done so in the man page.
>
> I just read the man page, and...
>
> Your assertion in the man page that stripes of 2-4MB are ideal might
> warrant some discussion.  I'll assume that this is in the context of
> RAID 5, but a lot of this discussion applies to RAID 0 and 1 too.

In fact, I don't see anything in the following discussion that applies
to RAID-[01].  Feel free to correct me.

> 1.  Your calculations are correct only in the case of purely
> sequential reads of the stripes.

No, they're much more important with random access.  With sequential
reads, it doesn't make much difference one way or another.  As the
text states, the big issue is latency.

> I have not dug deep enough into vinum to see how this works, but I
> can only assume that it will issue as many bio requests as it needs
> to retrieve requested data, and do so in parallel.

Correct.  You can see more of the logic at
http://www.vinumvm.org/vinum/implementation.html.

> The underlying disk driver will most likely issue these requests to
> the drives in parallel also, using tagged queuing.  Even with IDE,
> some parallelism can be achieved since the disks will most likely be
> on different controllers

Correct.

> (side note, anyone who runs an array with more than one IDE disk per
> controller is crazy).

Correct, this is mentioned in the man page.

> Of course, the data will not arrive back at vinum completely in
> parallel, but most of the accumulated seek latency that you worry
> about will greatly diminish.  A larger stripe size here might help,
> since it will result in fewer interrupts being taken, and fewer runs
> through your bio_complete handler, but there are other factors to
> consider.

The interrupts are not the issue.  The real issue is that you're using
more disk bandwidth to service the request.  As the discussion states,

     Consider a typical news article or web page of 24 kB, which will probably
     be read in a single I/O.  Take disks with a transfer rate of 6 MB/s and
     an average positioning time of 8 ms, and a file system with 4 kB blocks.
     Since it's 24 kB, we don't have to worry about fragments, so the file
     will start on a 4 kB boundary.  The number of transfers required depends
     on where the block starts: it's (S + F - 1) / S, where S is the stripe
     size in file system blocks, and F is the file size in file system blocks.

     1.   Stripe size of 4 kB.  You'll have 6 transfers.  Total subsystem
          load: 48 ms latency, 2 ms transfer, 50 ms total.

     2.   Stripe size of 8 kB.  On average, you'll have 3.5 transfers.  Total
          subsystem load: 28 ms latency, 2 ms transfer, 30 ms total.

     3.   Stripe size of 16 kB.  On average, you'll have 2.25 transfers.
          Total subsystem load: 18 ms latency, 2 ms transfer, 20 ms total.

     4.   Stripe size of 256 kB.  On average, you'll have 1.08 transfers.
          Total subsystem load: 8.6 ms latency, 2 ms transfer, 10.6 ms total.

     5.   Stripe size of 4 MB.  On average, you'll have 1.005 transfers.
          Total subsystem load: 8.04 ms latency, 2 ms transfer, 10.04 ms
          total.
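
For anybody who wants to play with the numbers, the arithmetic above
reduces to a trivial C program.  It just evaluates the (S + F - 1) / S
formula under the same assumptions (4 kB file system blocks, 8 ms
average positioning time); it's an illustration, not code from Vinum.
The media transfer time for the 24 kB is the same whatever the stripe
size, so only the latency term is computed:

    #include <stdio.h>

    int
    main(void)
    {
        const double seek_ms = 8.0;         /* average positioning time */
        const double file_blocks = 6.0;     /* 24 kB in 4 kB blocks */
        const double stripe_kb[] = { 4, 8, 16, 256, 4096 };
        size_t i;

        for (i = 0; i < sizeof(stripe_kb) / sizeof(stripe_kb[0]); i++) {
            /* Stripe size in file system blocks. */
            double s = stripe_kb[i] / 4.0;
            double transfers = (s + file_blocks - 1) / s;

            printf("%7.0f kB stripe: %.3f transfers, %5.2f ms latency\n",
                stripe_kb[i], transfers, transfers * seek_ms);
        }
        return 0;
    }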

> 2.  You briefly concede that smaller stripes are beneficial to multiple
> access scenarios, but then brush this off as not important.

I don't brush it off as not important; I state that the gains are
insignificant compared to the losses.  That's the example above again.

> It is important, because not all RAID applications are single access
> video streamers.

You seem to have misunderstood the argument.  Single access streamers
are good examples where it might make sense to have small stripe
sizes (in order to maximize bandwidth).

> It is quite useful to put real files on an array, read and write to
> them in a random fashion, even access them simultaneously from
> multiple processes.  Database servers, and even file servers, are
> notorious for this.

Exactly.  And that's where you need large stripes.

> 3.  You focus only on read performance and totally ignore write
> performance.

There's no difference in this argument until you get to RAID-5.

> The reason that most hardware RAID controllers use smaller stripes,
> and why they try to write in full-stripe chunks, is to eliminate the
> read-modify-write penalty of the parity data.

Ah, you're talking about RAID-[45].  I've argued that one elsewhere,
but I need to put it into this document.

> Take for example a RAID 5 array, composed of 5 disks, and having an
> 8k stripe size.  Now imagine a 32k file on that array.

Well, shall we call it a 32 kB transfer?

> To read that file, you would need to issue 4 8k reads to the disk
> subsystem.

Assuming the block is aligned.  Otherwise you'd need to tidy up at the
edges.

> This would be done nearly in parallel, and the data would arrive
> rather quickly.

Modifying the figures above, we have:

   Stripe size of 8 kB.  On average, you'll have 4.5 transfers.
   Total subsystem load: 36 ms latency, 5.3 ms transfer, 41.3 ms total.

You're talking about aligned transfers, though, so we'll have exactly
4 transfers, reducing the time to 37.3 ms.
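
The same sketch again, this time for the 32 kB transfer over 8 kB
stripes, with the 6 MB/s media rate included (again only an
illustration, not Vinum code):

    #include <stdio.h>

    int
    main(void)
    {
        const double seek_ms = 8.0;         /* average positioning time */
        const double kb_per_ms = 6.0;       /* 6 MB/s media rate */
        const double stripe_blocks = 2.0;   /* 8 kB stripe in 4 kB blocks */
        const double xfer_blocks = 8.0;     /* 32 kB transfer in 4 kB blocks */
        const double media_ms = 32.0 / kb_per_ms;
        /* Unaligned average is (S + F - 1) / S; aligned is exactly F / S. */
        const double average = (stripe_blocks + xfer_blocks - 1) / stripe_blocks;
        const double aligned = xfer_blocks / stripe_blocks;

        printf("average: %.1f transfers, %.1f ms\n",
            average, average * seek_ms + media_ms);
        printf("aligned: %.1f transfers, %.1f ms\n",
            aligned, aligned * seek_ms + media_ms);
        return 0;
    }

That prints the 41.3 ms and 37.3 ms figures above.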

> Now write this file.  Since it covers all 4 data disks in the
> stripe, there is no need to read-modify-write the parity.  Just
> compute it, and write out the 5 8k chunks.

Correct, in this one case it's effectively the same as the read.  37.3
ms of disk time.

> The only penalty here is computing the parity, but that would have
> to be done no matter what (and this is why high end RAID cards with
> hardware parity accelerators are so cool).

In fact, the penalty of parity computations is vastly overrated.  On a
1 GHz machine, you'd go through the checksum loop 8,192 times at about
20 cycles per iteration, so we're looking at roughly 160 µs.  You only
need to do it once in this scenario, though, so it works out better
than the Vinum implementation, which needs to do it twice.  See the
reference to the web page for that.
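
For what it's worth, the loop in question is nothing more exotic than
an XOR over the data chunks; the word size and layout below are my
assumptions for illustration, not Vinum's actual code:

    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK_BYTES (32 * 1024)
    #define CHUNK_WORDS (CHUNK_BYTES / sizeof(uint32_t))    /* 8,192 */

    /*
     * RAID-5 parity is a plain XOR over the data chunks: a couple of
     * cycles per word, which is why the CPU cost is down in the noise
     * compared with millisecond-scale positioning times.
     */
    static void
    xor_parity(uint32_t *parity, const uint32_t *data)
    {
        size_t i;

        for (i = 0; i < CHUNK_WORDS; i++)
            parity[i] ^= data[i];
    }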

> Take the same file and array, but change the stripe size to 2MB.
> The 32k file size will clearly fit into just one chunk, 

Well, most of the time.  About 1.5% of the time it'll span a stripe
boundary.

> but you have to update the parity too.

Correct.

> So you need to issue a single read down to the disk to get the 32k
> of parity, wait for it to arrive,

In fact, it's worse than that.  You need two reads, one for the old
data and one for the old parity, and correspondingly two writes.

   Stripe size of 2 MB.  On average, you'll have 2.0303 transfers.
   Total subsystem load: 16.2 ms latency, 10.7 ms transfer, 26.9 ms
   total.

> compute the parity, and write out 2 32k chunks (one for the data,
> and one for the parity).

Another 26.9 ms, so we have here the average time of 53.8 ms for this
approach as opposed to 37.3 ms for your suggestion.  If all transfers
were aligned 32 kB transfers, your approach would be better.
Unfortunately, real life isn't like that.  Consider the more normal
case where the transfer is not aligned on the stripe.  In this case,
you'll have an average of (S + F - 1) / S or 1.984 transfers.  This
also means that you can't use the convenient full stripe write
scenario.  Instead, you have:

   Stripe size of 32 kB.  On average, you'll have 3.969 transfers (two
   lots of 1.984).  Total subsystem load: 31.75 ms latency, 10.7 ms
   transfer, 42.4 ms total.

You have to do this twice, once reading and once writing, so the total
time is 84.8 ms.

32 kB is 64 disk blocks (of 512 bytes), so once every 64 times you hit
lucky and can use the optimized approach.  The real average is thus
(84.8 * 63 + 37.3) / 64, or 84.1 ms average, compared to 53.8 ms for my
approach.
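
Once more as a calculator, so you can plug in your own numbers.  This
follows the simplified accounting above: 8 ms positioning, 6 MB/s
media rate, alignment reckoned in 512 byte disk blocks, and a
read-modify-write counted as a read pass plus a write pass over the
data and parity regions.  A sketch only, not Vinum code:

    #include <stdio.h>

    #define SEEK_MS     8.0
    #define KB_PER_MS   6.0     /* 6 MB/s media rate */
    #define BLOCK_KB    0.5     /* alignment granularity: 512 bytes */

    /* Average number of stripe-sized regions an unaligned transfer touches. */
    static double
    regions(double stripe_kb, double xfer_kb)
    {
        double s = stripe_kb / BLOCK_KB, f = xfer_kb / BLOCK_KB;

        return ((s + f - 1) / s);
    }

    /*
     * One read-modify-write: a read pass and a write pass, each
     * positioning over the data and parity regions and moving the data
     * plus an equal amount of parity.
     */
    static double
    rmw_ms(double stripe_kb, double xfer_kb)
    {
        double pass = 2 * regions(stripe_kb, xfer_kb) * SEEK_MS +
            2 * xfer_kb / KB_PER_MS;

        return (2 * pass);
    }

    int
    main(void)
    {
        double full = 4 * SEEK_MS + 32.0 / KB_PER_MS;   /* aligned full-stripe write */
        double small = rmw_ms(32.0, 32.0);              /* 32 kB stripe width */
        double large = rmw_ms(2048.0, 32.0);            /* 2 MB stripe */

        /* The transfer is stripe-aligned once in every 64 block positions. */
        printf("32 kB stripes: %.1f ms average (%.1f unaligned, %.1f aligned)\n",
            (63 * small + full) / 64, small, full);
        printf("2 MB stripes:  %.1f ms\n", large);
        return 0;
    }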

Note also that this is only on write.  We know that RAID-5 is poor on
writes, so it's generally used in read-mainly applications.  On read,
we have times of 41.3 ms for your approach and 13.5 ms for mine.

> It's only 3 transactions,

I see at least 4.

> but the second two have to wait on the first to complete.  There's
> really no way around this, either.

Sure.

> Since the FreeBSD block layer limits transactions to a max of 128k,
> it's not like you can issue huge 8MB writes that cover the entire
> stripe; the block layer will break them down into 128k parcels, and
> vinum will have to do the read-modify-write dance many, many times.

Sure.

> Again, you must weigh your needs.  If your need is to serve huge
> static files, and do little writing, a large stripe size might work
> very well.  On the other hand, if you need to read and write to
> multiple small-to-medium sized files, a small stripe size works much
> better.

No.

> Greg, the companies that make RAID hardware are not filled with a
> bunch of idiots.

Nor are the people who write free software.

I'm sure that there are good reasons for small stripe sizes.  I don't
believe that efficiency is one of them.  I think your problem is that
you're looking at single-request latency.  That's usually not an
issue, though it may impact copy performance.  As I said, though,
that's not usually where you would use RAID-5.  The real issue is not
so much latency as throughput.  I hope I've been able to make it more
understandable.

Greg
--
See complete headers for address and phone numbers
