Date:      Wed, 12 Dec 2001 12:53:37 +0100
From:      Bernd Walter <ticso@cicely8.cicely.de>
To:        Greg Lehey <grog@FreeBSD.ORG>
Cc:        Matthew Dillon <dillon@apollo.backplane.com>, Wilko Bulte <wkb@freebie.xs4all.nl>, Mike Smith <msmith@FreeBSD.ORG>, Terry Lambert <tlambert2@mindspring.com>, Joerg Wunsch <joerg_wunsch@uriah.heep.sax.de>, freebsd-current@FreeBSD.ORG
Subject:   Re: Vinum write performance (was: RAID performance (was: cvs commit: src/sys/kern subr_diskmbr.c))
Message-ID:  <20011212125337.D15654@cicely8.cicely.de>
In-Reply-To: <20011212162205.I82733@monorchid.lemis.com>
References:  <200112101754.fBAHsRV01202@mass.dis.org> <200112101813.fBAIDKo47460@apollo.backplane.com> <20011210192251.A65380@freebie.xs4all.nl> <200112101830.fBAIU4w47648@apollo.backplane.com> <20011211110633.M63585@monorchid.lemis.com> <20011211031120.G11774@cicely8.cicely.de> <20011212162205.I82733@monorchid.lemis.com>

On Wed, Dec 12, 2001 at 04:22:05PM +1030, Greg Lehey wrote:
> On Tuesday, 11 December 2001 at  3:11:21 +0100, Bernd Walter wrote:
> > striped:
> > If you have 512-byte stripes and 2 disks, a 64k access gets split
> > into two 32k transactions, one per disk.
> 
> Only if your software optimizes the transfers.  There are reasons why
> it should not.  Without optimization, you get 128 individual
> transfers.

If the software does not, we end up with 128 transactions anyway, which
is not very good because of the per-transaction overhead.
UFS does a more or less good job of coalescing them.
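
Roughly, the arithmetic looks like this (a throwaway sketch with
made-up names and numbers, not anything from vinum or UFS):

    /*
     * Sketch only: how a 64k request maps onto 2 disks with 512-byte
     * stripes.  Without coalescing every stripe unit is its own
     * transfer (128 of them); coalesced, the per-disk pieces collapse
     * into one 32k transfer per disk.
     */
    #include <stdio.h>

    int
    main(void)
    {
            const int stripe_size = 512;        /* bytes per stripe unit */
            const int ndisks = 2;
            const int request = 64 * 1024;      /* the 64k access */

            int units = request / stripe_size;  /* 128 stripe units */
            int per_disk = units / ndisks;      /* 64 units per disk */

            printf("uncoalesced: %d transfers of %d bytes\n",
                units, stripe_size);
            printf("coalesced:   %d transfers of %d bytes\n",
                ndisks, per_disk * stripe_size);
            return (0);
    }

The data moved is the same either way; only the number of commands the
drives see changes.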

> > Linear speed could be about twice the speed of a single drive.  But
> > this is more theoretical today than real.  The average transaction
> > size per disk decreases with a growing number of spindles and you
> > get more transaction overhead.  Also, the voice coil technology used
> > in drives for many years now adds a random amount of time to the
> > access time, which invalidates some of the spindle sync potential.
> > Plus it may break some benefits of precaching mechanisms in drives.
> > I'm almost sure there is no real performance gain with modern drives.
> 
> The real problem with this scenario is that you're missing a couple of
> points:
> 
> 1.  Typically it's not the latency that matters.  If you have to wait
>     a few ms longer, that's not important.  What's interesting is the
>     case of a heavily loaded system, where the throughput is much more
>     important than the latency.

Agreed - especially since most writes are async and we don't wait for
them.

> 2.  Throughput is the data transferred per unit time.  There's active
>     transfer time, nowadays in the order of 500 µs, and positioning
>     time, in the order of 6 ms.  Clearly the fewer positioning
>     operations, the better.  This means that you should want to put
>     most transfers on a single spindle, not a single stripe.  To do
>     this, you need big stripes.

In the general case, yes.
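
To put rough numbers on that (just a sketch; the 64k chunk size and the
factor of 8 are arbitrary examples, not measurements):

    /*
     * Rough numbers only, using the figures above: ~6 ms positioning,
     * ~0.5 ms active transfer per chunk.  The point is how much a
     * saved positioning operation buys in throughput.
     */
    #include <stdio.h>

    int
    main(void)
    {
            const double seek_ms = 6.0;
            const double xfer_ms = 0.5;             /* per 64k chunk */
            const double chunk_kb = 64.0;

            double t1 = seek_ms + xfer_ms;          /* seek every chunk */
            double t8 = seek_ms + 8 * xfer_ms;      /* seek every 8 chunks */

            printf("1 chunk per seek:  %.1f kB/ms\n", chunk_kb / t1);
            printf("8 chunks per seek: %.1f kB/ms\n", 8 * chunk_kb / t8);
            return (0);
    }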

> > raid5:
> > For a write you have two read transactions and two writes.
> 
> This is the way Vinum does it.  There are other possibilities:
> 
> 1.  Always do full-stripe writes.  Then you don't need to read the old
>     contents.

Which isn't that good with the big stripes we usually want.
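
For reference, the two-reads-two-writes count comes from the usual
RAID-5 small write: new parity = old parity XOR old data XOR new data.
A self-contained sketch, not vinum's code:

    /*
     * Sketch of the RAID-5 small write behind the two-reads-two-writes
     * count.  Not vinum code; all the actual I/O is elided.
     */
    #include <stdio.h>
    #include <stdint.h>

    /* new parity = old parity ^ old data ^ new data */
    static void
    update_parity(uint8_t *parity, const uint8_t *old_data,
        const uint8_t *new_data, size_t len)
    {
            for (size_t i = 0; i < len; i++)
                    parity[i] ^= old_data[i] ^ new_data[i];
    }

    int
    main(void)
    {
            uint8_t od[4] = { 1, 2, 3, 4 };     /* old data   (read #1) */
            uint8_t p[4]  = { 9, 9, 9, 9 };     /* old parity (read #2) */
            uint8_t nd[4] = { 5, 6, 7, 8 };     /* new data */

            update_parity(p, od, nd, sizeof(p));
            /* then: write new data (write #1) and new parity (write #2) */
            printf("new parity: %u %u %u %u\n", p[0], p[1], p[2], p[3]);
            return (0);
    }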

> 2.  Cache the parity blocks.  This is an optimization which I think
>     would be very valuable, but which Vinum doesn't currently perform.

I thought of tying the parity to the wait lock:
if there is a waiter for the same parity data, it isn't dropped.
This way we don't waste memory but still get an effect.
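
Something like this, in invented pseudo-C (none of these names exist in
vinum; buffer handling and the actual wakeup are elided):

    /*
     * Invented names, not vinum code: when the stripe lock is released,
     * keep the parity buffer if another writer is already queued on the
     * same stripe, so it can skip the parity read; otherwise drop it so
     * no memory is wasted.
     */
    #include <stdlib.h>

    struct buf {                            /* stand-in for a buffer */
            char *data;
    };

    struct stripe_lock {
            int              waiters;       /* writers queued on this stripe */
            struct buf      *parity_buf;    /* cached parity, or NULL */
    };

    static void
    release_buf(struct buf *bp)
    {
            if (bp != NULL) {
                    free(bp->data);
                    free(bp);
            }
    }

    static void
    stripe_unlock(struct stripe_lock *sl)
    {
            if (sl->waiters > 0)
                    return;                 /* a waiter will reuse the parity */
            release_buf(sl->parity_buf);    /* nobody needs it: drop it */
            sl->parity_buf = NULL;
    }

    int
    main(void)
    {
            struct stripe_lock sl = { 1, NULL };

            stripe_unlock(&sl);             /* waiter queued: parity kept */
            sl.waiters = 0;
            stripe_unlock(&sl);             /* no waiter: parity dropped */
            return (0);
    }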

> > There are easier things to raise performance.
> > Ever wondered why people claim vinum's RAID-5 writes are slow?
> > The answer is astonishingly simple:
> > Vinum does stripe-based locking, while ufs tries to lay out data
> > mostly in ascending sectors.
> > What happens here is that the first write has to wait for two reads
> > and two writes.
> > If we have an ascending write, it has to wait for the first write to
> > finish, because the stripe is still locked.
> > The first is unlocked after both physical writes are on disk.
> > Now we start our two reads, which are (thanks to the drive's
> > precache) most likely in the drive's cache - then we write.
> >
> > The problem here is that the physical writes get serialized and the
> > drive has to wait a complete rotation between each.
> 
> Not if the data is in the drive cache.

This example was for writing.
Reads get precached by the drive and have a very good chance of being
in the cache.
It doesn't matter on IDE disks: if the write cache is enabled, the
write gets acked from the cache and not the media, and if the write
cache is disabled, writes get serialized anyway.

> > If we had a fine grained locking which only locks the accessed sectors
> > in the parity we would be able to have more than a single ascending
> > write transaction onto a single drive.
> 
> Hmm.  This is something I hadn't thought about.  Note that sequential
> writes to a RAID-5 volume don't go to sequential addresses on the
> spindles; they will work up to the end of the stripe on one spindle,
> then start on the next spindle at the start of the stripe.  You can do
> that as long as the address ranges in the parity block don't overlap,
> but the larger the stripe, the greater the likelihood of this.  This
> might also explain the following observed behaviour:
> 
> 1.  RAID-5 writes slow down when the stripe size gets > 256 kB or so.
>     I don't know if this happens on all disks, but I've seen it often
>     enough.

I would guess it happens when the stripe size is bigger than the
preread cache the drive uses.
That would mean we have less chance of getting the parity data out of
the drive cache.
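
The fine-grained locking idea boils down to an overlap test on sector
ranges within the parity - something like this sketch (invented names,
not a patch):

    /*
     * Invented names: the overlap test behind locking only the accessed
     * sectors of the parity instead of the whole stripe.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct parity_range {
            uint64_t start;                 /* first parity sector covered */
            uint64_t len;                   /* number of sectors */
    };

    static bool
    ranges_overlap(const struct parity_range *a, const struct parity_range *b)
    {
            return (a->start < b->start + b->len &&
                b->start < a->start + a->len);
    }

    int
    main(void)
    {
            struct parity_range held = { 0, 8 };    /* sectors 0-7 locked */
            struct parity_range next = { 8, 8 };    /* ascending write: 8-15 */

            /* Non-overlapping, so the second write need not wait. */
            printf("must wait: %s\n",
                ranges_overlap(&held, &next) ? "yes" : "no");
            return (0);
    }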

> 2.  rawio write performance is better than ufs write performance.
>     rawio does "truly" random transfers, where ufs is a mixture.

The current problem is to increase linear write performance.
I don't see how rawio would benefit from that, but ufs will.

> Do you feel like changing the locking code?  It shouldn't be that much
> work, and I'd be interested to see how much performance difference it
> makes.

I've put it on my todo list.

> Note that there's another possible optimization here: delay the writes
> by a certain period of time and coalesce them if possible.  I haven't
> finished thinking about the implications.

That's exactly what ufs clustering and softupdates do.
If it doesn't fit modern drives anymore, it should be tuned there.

Whenever a write hits the driver there is a waiter for it:
either a softdep, a memory-freeing operation, or an application doing a
sync transfer.
I'm almost sure delaying writes will harm performance in the upper
layers.

-- 
B.Walter              COSMO-Project         http://www.cosmo-project.de
ticso@cicely.de         Usergroup           info@cosmo-project.de

