From: "Long, Scott"
To: "'Greg 'groggy' Lehey'", Doug White
Cc: Cy Schubert - CITS Open Systems Group, Kirk Strauser, freebsd-stable@freebsd.org
Subject: RE: Hardware RAID vs vinum
Date: Tue, 28 May 2002 18:59:26 -0600

> On Tuesday, 28 May 2002 at 13:04:19 -0700, Doug White wrote:
> > On Tue, 28 May 2002, Greg 'groggy' Lehey wrote:
> >
> >> This can only happen if there's something very wrong with the
> >> software RAID-5, either the implementation or the configuration.
> >> One thing that springs to mind is too small a stripe size.  There's
> >> a myth going around that this improves performance, when in fact it
> >> seriously degrades it.
> >
> > Mind expounding on this topic?
>
> Sure, I have done in the man page.

I just read the man page, and...

> > The people at BayDel would like to disagree with you -- they ship a
> > RAID 3 box that has good performance.
>
> RAID-3 doesn't have stripes.  But it's optimized towards single file
> access.

Your assertion in the man page that stripes of 2-4MB are ideal might
warrant some discussion.  I'll assume that this is in the context of
RAID 5, but a lot of this discussion applies to RAID 0 and 1 too.

1. Your calculations are correct only in the case of purely sequential
reads of the stripes.  I have not dug deep enough into vinum to see how
this works, but I can only assume that it will issue as many bio
requests as it needs to retrieve the requested data, and do so in
parallel.  The underlying disk driver will most likely issue these
requests to the drives in parallel as well, using tagged queuing.  Even
with IDE, some parallelism can be achieved, since the disks will most
likely be on different controllers (side note: anyone who runs an array
with more than one IDE disk per controller is crazy).  Of course, the
data will not arrive back at vinum completely in parallel, but most of
the accumulated seek latency that you worry about will be greatly
diminished.  A larger stripe size might help here, since it results in
fewer interrupts being taken and fewer runs through your bio_complete
handler, but there are other factors to consider.
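To make the mechanics concrete, here is a rough sketch of how a striped
plex might carve one logical request into per-disk chunks.  It is not
taken from the vinum source; it models plain RAID 0 striping with no
parity disk, and the offsets, stripe size, and disk count are just
example numbers.

#include <stdio.h>

/*
 * Illustrative only: map one logical I/O onto the member disks of a
 * striped set (RAID 0 layout, no parity).  Not vinum code.
 */
static void
map_request(long offset, long length, long stripe_size, int num_disks)
{
        while (length > 0) {
                long chunk = offset / stripe_size;      /* which stripe-size chunk */
                long in_chunk = offset % stripe_size;   /* offset within that chunk */
                int disk = (int)(chunk % num_disks);    /* member disk for this chunk */
                long disk_off = (chunk / num_disks) * stripe_size + in_chunk;
                long len = stripe_size - in_chunk;      /* stop at the chunk boundary */

                if (len > length)
                        len = length;
                /*
                 * In a real driver each chunk would become its own bio,
                 * queued to the drive so the transfers can overlap.
                 */
                printf("disk %d  offset %8ld  length %6ld\n",
                    disk, disk_off, len);
                offset += len;
                length -= len;
        }
}

int
main(void)
{
        printf("128k read, 8k stripes, 4 disks:\n");
        map_request(0, 128 * 1024, 8 * 1024, 4);
        printf("\n128k read, 2MB stripes, 4 disks:\n");
        map_request(0, 128 * 1024, 2L * 1024 * 1024, 4);
        return (0);
}

With the 8k stripe, a single 128k request becomes sixteen chunks spread
evenly across the four disks, so the transfers can overlap; with the
2MB stripe the same request is a single chunk on a single disk.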
2. You briefly concede that smaller stripes are beneficial in
multiple-access scenarios, but then brush this off as unimportant.  It
is important, because not all RAID applications are single-access video
streamers.  It is quite useful to put real files on an array, read and
write them in a random fashion, and even access them simultaneously
from multiple processes.  Database servers, and even file servers, are
notorious for this.

3. You focus only on read performance and totally ignore write
performance.  The reason that most hardware RAID controllers use
smaller stripes, and why they try to write in full-stripe chunks, is to
eliminate the read-modify-write penalty on the parity data.  Take for
example a RAID 5 array composed of 5 disks with an 8k stripe size, and
imagine a 32k file on that array.  To read that file, you would need to
issue four 8k reads to the disk subsystem.  This would be done nearly
in parallel, and the data would arrive rather quickly.  Now write this
file.  Since it covers all 4 data disks in the stripe, there is no need
to read-modify-write the parity.  Just compute it and write out the
five 8k chunks.  The only penalty here is computing the parity, but
that would have to be done no matter what (and this is why high-end
RAID cards with hardware parity accelerators are so cool).

Take the same file and array, but change the stripe size to 2MB.  The
32k file will clearly fit into just one chunk, but you have to update
the parity too.  So you need to read the old 32k of data and the old
32k of parity, wait for them to arrive, compute the new parity, and
write out two 32k chunks (one for the data and one for the parity).
It's only four transactions, but the two writes have to wait on the
reads to complete; the sketch at the end of this message runs the
numbers for both cases.  There's really no way around this, either.
Since the FreeBSD block layer limits transactions to a max of 128k,
it's not like you can issue huge 8MB writes that cover the entire
stripe; the block layer will break them down into 128k parcels, and
vinum will have to do the read-modify-write dance many, many times.

Again, you must weigh your needs.  If you need to serve huge static
files and do little writing, a large stripe size might work very well.
On the other hand, if you need to read and write multiple
small-to-medium sized files, a small stripe size works much better.

Greg, the companies that make RAID hardware are not filled with a bunch
of idiots.

Scott
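P.S.  If you want to run the numbers from point 3 yourself, here is a
throwaway sketch.  It is ordinary userland C, not vinum or controller
firmware, and it assumes the simple read-modify-write strategy (read
the old data and the old parity, XOR, then write the new data and new
parity); the function and parameter names are made up for the example.

#include <stdio.h>

/*
 * Illustrative only: count the disk transactions for one write to a
 * RAID 5 set.  Alignment is idealized, and partial-stripe writes are
 * assumed to use read-modify-write rather than reconstruction.
 */
static void
count_io(long write_size, long stripe_size, int num_disks)
{
        int data_disks = num_disks - 1;
        long full_stripe = (long)data_disks * stripe_size;
        long reads, writes;

        if (write_size % full_stripe == 0) {
                /* Full-stripe write: parity is computed in memory, no reads. */
                reads = 0;
                writes = (write_size / full_stripe) * num_disks;
        } else {
                /*
                 * Partial-stripe write: read-modify-write each data
                 * chunk touched, plus the parity chunk.
                 */
                long chunks = (write_size + stripe_size - 1) / stripe_size;
                reads = chunks + 1;     /* old data + old parity */
                writes = chunks + 1;    /* new data + new parity */
        }
        printf("%5ldk stripe: %ld reads, %ld writes%s\n",
            stripe_size / 1024, reads, writes,
            reads != 0 ? " (the writes must wait for the reads)" : "");
}

int
main(void)
{
        /* The 32k write from point 3, on a 5-disk RAID 5 set. */
        count_io(32 * 1024, 8 * 1024, 5);
        count_io(32 * 1024, 2L * 1024 * 1024, 5);
        return (0);
}

With the 8k stripe the 32k write is a full-stripe write: no reads and
five 8k writes that can all be issued at once.  With the 2MB stripe it
degenerates into two reads and two writes, and the writes cannot start
until the reads finish.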