Date:      Thu, 13 Dec 2001 03:06:14 +0100
From:      Bernd Walter <ticso@cicely8.cicely.de>
To:        Greg Lehey <grog@FreeBSD.org>
Cc:        Matthew Dillon <dillon@apollo.backplane.com>, Wilko Bulte <wkb@freebie.xs4all.nl>, Mike Smith <msmith@FreeBSD.org>, Terry Lambert <tlambert2@mindspring.com>, Joerg Wunsch <joerg_wunsch@uriah.heep.sax.de>, freebsd-current@FreeBSD.org
Subject:   Re: Vinum write performance (was: RAID performance (was: cvs commit: src/sys/kern subr_diskmbr.c))
Message-ID:  <20011213030613.A18679@cicely8.cicely.de>
In-Reply-To: <20011213105413.G76019@monorchid.lemis.com>
References:  <200112101754.fBAHsRV01202@mass.dis.org> <200112101813.fBAIDKo47460@apollo.backplane.com> <20011210192251.A65380@freebie.xs4all.nl> <200112101830.fBAIU4w47648@apollo.backplane.com> <20011211110633.M63585@monorchid.lemis.com> <20011211031120.G11774@cicely8.cicely.de> <20011212162205.I82733@monorchid.lemis.com> <20011212125337.D15654@cicely8.cicely.de> <20011213105413.G76019@monorchid.lemis.com>

On Thu, Dec 13, 2001 at 10:54:13AM +1030, Greg Lehey wrote:
> On Wednesday, 12 December 2001 at 12:53:37 +0100, Bernd Walter wrote:
> > On Wed, Dec 12, 2001 at 04:22:05PM +1030, Greg Lehey wrote:
> >> On Tuesday, 11 December 2001 at  3:11:21 +0100, Bernd Walter wrote:
> >> 2.  Cache the parity blocks.  This is an optimization which I think
> >>     would be very valuable, but which Vinum doesn't currently perform.
> >
> > I thought of connecting the parity to the wait lock.
> > If there's a waiter for the same parity data it's not dropped.
> > This way we don't waste memory but still have an effect.
> 
> That's a possibility, though it doesn't directly address parity block
> caching.  The problem is that by the time you find another lock,
> you've already performed part of the parity calculation, and probably
> part of the I/O transfer.  But it's an interesting consideration.

I know it doesn't do the best possible job, but it's easy to implement.
More complex handling for better results can still be added later.
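
A rough sketch of what I mean (made-up names, not actual Vinum code):
on unlock, hand the parity buffer over to a waiter instead of freeing
it unconditionally.

#include <sys/types.h>
#include <stdlib.h>

struct stripe_lock {
        off_t    stripe;        /* stripe this lock covers */
        int      waiters;       /* requests sleeping on this lock */
        void    *parity_buf;    /* cached parity data, or NULL */
};

/*
 * Release the lock.  If another request is already waiting for the
 * same stripe, keep the parity buffer attached to the lock so the
 * waiter can reuse it; otherwise drop it as we do now.
 */
static void
stripe_unlock(struct stripe_lock *lk, void *parity)
{
        if (lk->waiters > 0)
                lk->parity_buf = parity;        /* hand over, don't drop */
        else {
                free(parity);                   /* nobody interested */
                lk->parity_buf = NULL;
        }
        /* wake up the waiters here (wakeup() in the kernel) */
}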

> >>> If we had a fine grained locking which only locks the accessed sectors
> >>> in the parity we would be able to have more than a single ascending
> >>> write transaction onto a single drive.
> >>
> >> Hmm.  This is something I hadn't thought about.  Note that sequential
> >> writes to a RAID-5 volume don't go to sequential addresses on the
> >> spindles; they will work up to the end of the stripe on one spindle,
> >> then start on the next spindle at the start of the stripe.  You can do
> >> that as long as the address ranges in the parity block don't overlap,
> >> but the larger the stripe, the greater the likelihood of overlap
> >> becomes.  This might also explain the following observed behaviour:
> >>
> >> 1.  RAID-5 writes slow down when the stripe size gets > 256 kB or so.
> >>     I don't know if this happens on all disks, but I've seen it often
> >>     enough.
> >
> > I would guess it happens when the stripe size is bigger than the
> > preread cache the drives use.
> > That would mean we have less chance of getting the parity data out
> > of the drive cache.
> 
> Yes, this was one of the possibilities we considered.  

It should be measured and compared again after I change the locking.
The picture will look different then and may point to other causes,
because we will have a different load characteristic on the drives.
Currently, if we have two writes in each of two stripes, all initiated
before the first finishes, the drive has to seek between the two
stripes, because the second write to the same stripe has to wait.
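
To illustrate the fine-grained locking I mentioned (hypothetical
sketch, names invented): two writes hitting the same parity block
would only have to serialize if the byte ranges they touch actually
overlap.

#include <sys/types.h>
#include <stddef.h>

/* One locked byte range within a parity block (illustrative only). */
struct parity_range {
        off_t                    start; /* inclusive */
        off_t                    end;   /* exclusive */
        struct parity_range     *next;
};

/*
 * Return 1 if [start, end) overlaps a range that is already locked,
 * i.e. the new write has to wait; 0 means it can proceed even though
 * it hits the same parity block.
 */
static int
parity_range_busy(const struct parity_range *locked, off_t start, off_t end)
{
        for (; locked != NULL; locked = locked->next)
                if (start < locked->end && locked->start < end)
                        return (1);
        return (0);
}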

> >> Note that there's another possible optimization here: delay the writes
> >> by a certain period of time and coalesce them if possible.  I haven't
> >> finished thinking about the implications.
> >
> > That's exactly what UFS clustering and softupdates do.
> > If it doesn't fit modern drives anymore, it should be tuned there.
> 
> This doesn't have too much to do with modern drives; it's just as
> applicable to 70s drives.

One of softupdates' jobs is to eliminate redundant writes and to do
async writes without losing consistency of the on-media structure.
This also means we have a better chance that data is written in big
chunks.
In general the wire speed of data to the drive increases with every new
bus generation, but large parts of the overhead are usually kept for
compatibility with older drives.
I agree that the parity-based RAID situation depends more on principle
than on the age of the drive.
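
The coalescing Greg mentions could, in principle, look something like
this (made-up structures, only to show the idea): merge a delayed
write with a directly following one, so the drive sees one larger
transfer.

#include <sys/types.h>
#include <stddef.h>

/* A delayed write that has not been issued to the drive yet. */
struct pending_write {
        off_t   offset;         /* byte offset on the volume */
        size_t  length;         /* number of bytes */
};

/*
 * Try to merge b into a.  Returns 1 if b directly follows a, so the
 * two can later be issued as one larger transfer.
 */
static int
coalesce(struct pending_write *a, const struct pending_write *b)
{
        if (a->offset + (off_t)a->length == b->offset) {
                a->length += b->length;
                return (1);
        }
        return (0);
}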

> > Whenever a write hits a driver there is a waiter for it.
> > Either a softdep, a memory freeing, or an application doing a sync
> > transfer.
> > I'm almost sure delaying writes will harm performance in upper layers.
> 
> I'm not so sure.  Full stripe writes, where needed, are *much* faster
> than partial stripe writes.

Hardware RAID usually comes with NVRAM and can cache write data without
delaying the acknowledgement to the initiator.
That option is not available to software RAID.
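
To put rough numbers on the full stripe point (assuming the usual
read-modify-write scheme for small writes; Vinum may handle some cases
differently): a partial write costs four I/Os for one block of
payload, while a full-stripe write over n data disks costs n+1 I/Os
for n blocks of payload.

#include <stdio.h>

/*
 * I/O operations needed per write request (rough model, ignoring
 * degraded mode and reconstruct-write optimizations).
 */
static unsigned
raid5_ios(unsigned data_disks, int full_stripe)
{
        if (full_stripe)
                return (data_disks + 1); /* write all data + new parity */
        /* read old data, read old parity, write data, write parity */
        return (4);
}

int
main(void)
{
        printf("partial: %u I/Os per block, full stripe (4+1): %u I/Os for 4 blocks\n",
            raid5_ios(4, 0), raid5_ios(4, 1));
        return (0);
}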

-- 
B.Walter              COSMO-Project         http://www.cosmo-project.de
ticso@cicely.de         Usergroup           info@cosmo-project.de





