Date: Thu, 6 Sep 2001 15:23:01 +0930
From: Greg Lehey <grog@FreeBSD.org>
To: Doug Hardie <bc979@lafn.org>
Cc: David Gilbert <dgilbert@velocet.ca>, Lawrence Farr <l.farr@epcdirect.co.uk>, 'Lawrence Farr' <lawrence@epcdirect.co.uk>, 'Chris BeHanna' <behanna@zbzoom.net>, 'FreeBSD-Stable' <stable@FreeBSD.ORG>
Subject: Re: [stable] Re: RAID5
Message-ID: <20010906152301.J24413@wantadilla.lemis.com>
In-Reply-To: <f0433010ab7bc414d46c7@[10.0.1.100]>; from bc979@lafn.org on Wed, Sep 05, 2001 at 01:57:50PM -0700
References: <002c01c135e4$69c924d0$c80aa8c0@lfarr> <f04330116b7bbe195e610@[10.0.1.100]> <15254.16593.350305.548246@trooper.velocet.net> <f0433010ab7bc414d46c7@[10.0.1.100]>

On Wednesday,  5 September 2001 at 13:57:50 -0700, Doug Hardie wrote:
> At 11:12 -0400 9/5/01, David Gilbert wrote:
>> Well... FreeBSD doesn't use a 'fast write' disk (although this is an
>> interesting idea), but writing a single block of RAID-5 data requires
>> a read of the previous data, a read of the parity block then a write
>> of the data and a write of the parity block --- 4 I/O operations.
>
> It is the distributing of the data among the disks that is required
> for write that makes it slower than read.

No, on RAID-5 the issue is more complicated.  You first need to
calculate parity.  There are two basic approaches:

1.  Aim for whole-stripe writes.  That way, you can calculate the
    parity from the data you have.

2.  First read (or cache) the old contents of the parity block.  Use
    them to calculate the new parity, then write back.

Consider the two alternatives with a, say, 5 disk plex (set).  The
only way to get situation (1) is to use small blocks.  UFS transfers
tend to be in the order of 6 kB, though they can be as high as 60 kB
(and yes, they have nothing much to do with the file system block
size).  So you go for small transfers, say 1.5 kB (because it fits in
with this example).  You perform five writes, and that's all.  Because
the writes go to different disks, they can go in parallel.  Total time
is about the same as for a normal write.

There's obviously the problem here that you can't rely on having
ideally sized blocks.  That's OK, though; you'll get enough for this
approach to look attractive.

In the case of (2), by contrast, you need to read the entire contents
of the data you're changing, and then write it out again.  Twice the
number of transfers.  Half the speed?

If you use the same block sizes, yes, that's half the speed.  But the
whole argument is flawed.  You can't look at the elapsed time for a
single transfer.  Look at the time you're keeping the disks busy,
times the number of disks.
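
For reference, the two parity approaches above can be sketched in a few lines of Python.  This is only an illustration of the usual XOR parity arithmetic, not code from Vinum or any other RAID-5 implementation.  The function names, the 8-byte blocks and the 4-data-plus-1-parity layout are assumptions made for the example.  Note that the read-modify-write path also needs the old data block, as David's quoted text describes.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def full_stripe_parity(data_blocks):
    """Approach (1): all data blocks are in hand, so parity needs no reads."""
    return xor_blocks(*data_blocks)

def rmw_parity(old_parity, old_data, new_data):
    """Approach (2): derive the new parity from the old parity and old data."""
    return xor_blocks(old_parity, old_data, new_data)

# Hypothetical stripe: four data blocks plus one parity block (5 disk plex).
stripe = [bytes([value] * 8) for value in (1, 2, 3, 4)]
parity = full_stripe_parity(stripe)

# Overwrite one block and update the parity without touching the others.
new_block = bytes([9] * 8)
updated_parity = rmw_parity(parity, stripe[2], new_block)
stripe[2] = new_block

# Both approaches must agree on the resulting parity.
assert updated_parity == full_stripe_parity(stripe)
```

The final assertion checks that updating the parity incrementally gives the same block as recomputing it from the whole stripe.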

In version (1) you're positioning (5.9 ms) and transferring (0.1 ms)
five disks.  Total time 30 ms.  In example (2) it would take 60
ms--*if* you use this stripe size.

Now increase the stripe size to 512 kB.  Presto!  In all probability,
your 6 kB transfer will go to 1 disk only.  You still need 4
transfers, but that's only 24 ms, while 30 ms is the theoretical
minimum for version (1).

This shows the real issue: far too many people measure performance in
a completely different environment from practice.  The result is
frequently meaningless.

Greg
--
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message
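
The disk-busy arithmetic in the message can be reproduced with a short sketch.  The 5.9 ms positioning and 0.1 ms transfer times are the figures given above.  The count of ten I/O operations for the small-stripe read-modify-write case is an assumption inferred from "twice the number of transfers" and the 60 ms total.  The names are made up for the example, and the model deliberately ignores any change in transfer time with block size, as the message does.

```python
POSITION_MS = 5.9   # positioning time per I/O, from the message
TRANSFER_MS = 0.1   # transfer time per I/O, from the message

def busy_ms(io_count: int) -> float:
    """Total time disks are kept busy, summed over all I/O operations."""
    return io_count * (POSITION_MS + TRANSFER_MS)

# (1) Whole-stripe write on a 5 disk plex: one write per disk.
whole_stripe = busy_ms(5)

# (2) Read-modify-write with the same small stripe size:
#     assumed to be twice the transfers of case (1).
rmw_small_stripe = busy_ms(10)

# (2) Read-modify-write with a 512 kB stripe: the 6 kB transfer touches
#     one data disk, so read data, read parity, write data, write parity.
rmw_large_stripe = busy_ms(4)

for label, total in [
    ("whole-stripe write", whole_stripe),
    ("read-modify-write, small stripes", rmw_small_stripe),
    ("read-modify-write, 512 kB stripes", rmw_large_stripe),
]:
    print(f"{label}: {total:.0f} ms of disk time")
```

Measured as total disk time rather than elapsed time for one request, the large-stripe read-modify-write comes out below the whole-stripe write, which is the point the message makes.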