Date:      Tue, 28 May 2002 18:59:26 -0600
From:      "Long, Scott" <Scott_Long@adaptec.com>
To:        "'Greg 'groggy' Lehey'" <grog@freebsd.org>, Doug White <dwhite@resnet.uoregon.edu>
Cc:        Cy Schubert - CITS Open Systems Group <Cy.Schubert@uumail.gov.bc.ca>, Kirk Strauser <kirk@strauser.com>, freebsd-stable@freebsd.org
Subject:   RE: Hardware RAID vs vinum
Message-ID:  <2C7CBDC6EA58D6119E4A00065B3A24CB04636E@btcexc01.btc.adaptec.com>

> On Tuesday, 28 May 2002 at 13:04:19 -0700, Doug White wrote:
> > On Tue, 28 May 2002, Greg 'groggy' Lehey wrote:
> >
> >> This can only happen if there's something very wrong with the
> >> software RAID-5, either the implementation or the configuration.
> >> One thing that springs to mind is too small a stripe size.  There's
> >> a myth going around that this improves performance, when in fact it
> >> seriously degrades it.
> >
> > Mind expounding on this topic?
> 
> Sure, I have done in the man page.

I just read the man page, and...

> 
> > The people at BayDel would like to disagree with you -- they ship a
> > RAID 3 box that has good performance.
> 
> RAID-3 doesn't have stripes.  But it's optimized towards single file
> access.

Your assertion in the man page that stripes of 2-4MB are ideal might
warrant some discussion.  I'll assume that this is in the context of
RAID 5, but a lot of this discussion applies to RAID 0 and 1 too.

1.  Your calculations are correct only in the case of purely sequential
reads of the stripes.  I have not dug deep enough into vinum to see how
this works, but I can only assume that it will issue as many bio requests 
as it needs to retrieve the requested data, and do so in parallel.  The
underlying disk driver will most likely issue these requests to the
drives in parallel also, using tagged queuing.  Even with IDE, some
parallelism can be achieved since the disks will most likely be on
different controllers (side note: anyone who runs an array with more
than one IDE disk per controller is crazy).  Of course, the data will
not arrive back at vinum completely in parallel, but most of the
accumulated seek latency that you worry about will largely disappear.
A larger stripe size here might help, since it will result in fewer
interrupts being taken and fewer runs through your bio_complete
handler, but there are other factors to consider.
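Just to make the read fan-out concrete, here is a quick userland
sketch in plain C.  It is not vinum code; NDISKS, STRIPESIZE and the
chunk-mapping math are my own assumptions about a generic striped
layout, but it shows how a single logical read breaks into per-disk
requests that the driver can then queue in parallel:

#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>

#define NDISKS		4		/* data disks in the plex */
#define STRIPESIZE	(8 * 1024)	/* bytes per stripe chunk */

int
main(void)
{
	off_t offset = 0;		/* logical byte offset of request */
	size_t resid = 32 * 1024;	/* bytes left to transfer */

	while (resid > 0) {
		off_t  stripe  = offset / STRIPESIZE;
		int    disk    = stripe % NDISKS;	/* which spindle */
		off_t  doffset = (stripe / NDISKS) * STRIPESIZE +
		    offset % STRIPESIZE;		/* offset on that disk */
		size_t len     = STRIPESIZE - offset % STRIPESIZE;

		if (len > resid)
			len = resid;
		/*
		 * In the real driver each of these would become a bio
		 * queued in parallel; here we just print what would be
		 * issued.
		 */
		printf("disk %d: read %zu bytes at disk offset %jd\n",
		    disk, len, (intmax_t)doffset);
		offset += len;
		resid -= len;
	}
	return (0);
}

For a 32k read over 8k stripes it prints one request per data disk,
which is exactly the four parallel reads in the example under point 3
below.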

2.  You briefly concede that smaller stripes are beneficial to
multiple-access scenarios, but then brush this off as not important.
It is important, because not all RAID applications are single-access
video streamers.  It is quite useful to put real files on an array,
read from and write to them in a random fashion, and even access them
simultaneously from multiple processes.  Database servers, and even
file servers, are notorious for this.

3.  You focus only on read performance and totally ignore write
performance.  The reason that most hardware RAID controllers use
smaller stripes, and why they try to write in full-stripe chunks, is
to eliminate the read-modify-write penalty of updating the parity data.

Take for example a RAID 5 array, composed of 5 disks, and having an
8k stripe size.  Now imagine a 32k file on that array.  To read that
file, you would need to issue 4 8k reads to the disk subsystem.  This
would be done nearly in parallel, and the data would arrive rather
quickly.  Now write this file.  Since it covers all 4 data disks in
the stripe, there is no need to read-modify-write the parity.  Just
compute it, and write out the 5 8k chunks.  The only penalty here is
computing the parity, but that would have to be done no matter what
(and this is why high end RAID cards with hardware parity accelerators
are so cool).  Take the same file and array, but change the
stripe size to 2MB.  The 32k file will clearly fit into just one
chunk, but you have to update the parity too.  So you need to read
the old 32k of data and the old 32k of parity, wait for them to
arrive, compute the new parity, and write out 2 32k chunks (one for
the data, and one for the parity).  It's only 4 transactions, but the
writes have to wait on the reads to complete.  There's really no
way around this, either.  Since the FreeBSD block layer limits 
transactions to a max of 128k, it's not like you can issue huge
8MB writes that cover the entire stripe; the block layer will 
break them down into 128k parcels, and vinum will have to do the
read-modify-write dance many, many times.
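To put rough numbers on the two cases, here is a small sketch of the
transaction counts.  The count_ios name and the logic are mine, not
pulled from vinum or any controller firmware, and it assumes the
simple read-old-data/read-old-parity scheme with the partial write
fitting inside a single chunk:

#include <stdio.h>

static void
count_ios(size_t write_size, size_t stripe_size, int ndata)
{
	size_t full_stripe = stripe_size * ndata;

	if (write_size % full_stripe == 0) {
		/* Full-stripe write: compute parity, write data + parity. */
		size_t stripes = write_size / full_stripe;
		printf("%zuk write, %zuk stripe: %zu writes, 0 reads\n",
		    write_size / 1024, stripe_size / 1024,
		    stripes * (ndata + 1));
	} else {
		/* Partial write: read old data and parity, write both back. */
		printf("%zuk write, %zuk stripe: 2 writes, 2 reads "
		    "(read-modify-write)\n",
		    write_size / 1024, stripe_size / 1024);
	}
}

int
main(void)
{
	count_ios(32 * 1024, 8 * 1024, 4);		/* 8k stripes, 5 disks */
	count_ios(32 * 1024, 2 * 1024 * 1024, 4);	/* 2MB stripes, 5 disks */
	return (0);
}

With the 8k stripe it reports 5 writes and no reads; with the 2MB
stripe it reports the 2-read, 2-write read-modify-write dance, and
that is before the block layer chops larger requests into 128k
parcels.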

Again, you must weigh your needs.  If your need is to serve huge
static files, and do little writing, a large stripe size might
work very well.  On the other hand, if you need to read from and write
to multiple small-to-medium-sized files, a small stripe size works
much better.

Greg, the companies that make RAID hardware are not filled with a
bunch of idiots.  

Scott

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message
