Date:      Wed, 23 Jul 1997 23:21:48 -0700 (PDT)
From:      Simon Shapiro <Shimon@i-Connect.Net>
To:        "Gary Palmer" <gpalmer@FreeBSD.ORG>
Cc:        freebsd-scsi@FreeBSD.ORG
Subject:   Re: New DPT Driver
Message-ID:  <XFMail.970723232148.Shimon@i-Connect.Net>
In-Reply-To: <6771.869689036@orion.webspan.net>


Hi "Gary Palmer";  On 23-Jul-97 you wrote: 

...

> > REQUEST:
> > 
> > Please try and build several different RAID configurations and try to
> > see which is the best solution for news servers.  I have enough
> > questions on this matter I think I will start selling tickets :-)
> 
> The most common scenario I've heard of for large news installations,
> where drive numbers are not a problem and you can get as many drives
> as you ask for, is RAID0+1, i.e. take 2 identical stripes and
> mirror them.

Sounds interesting.  It gives me some motivation to adopt DPT's
in-kernel RAID-0 option, to allow you to stripe across controllers.
You can stripe across busses today.
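For the curious, here is a minimal sketch of the arithmetic behind
striping (my own illustration, not DPT firmware or driver code): a logical
block number is mapped round-robin onto a member (bus or controller) and an
offset within it.  The stripe size and member count are made-up values.

/*
 * Illustration only: RAID-0 address mapping.  Stripe size and member
 * count are assumed, not DPT defaults.
 */
#include <stdio.h>

#define STRIPE_BLOCKS 16   /* blocks per stripe unit (assumed) */
#define N_MEMBERS     4    /* busses/controllers striped across (assumed) */

struct stripe_loc {
	unsigned member;       /* which member device */
	unsigned long block;   /* block offset on that member */
};

static struct stripe_loc
raid0_map(unsigned long logical_block)
{
	unsigned long unit   = logical_block / STRIPE_BLOCKS; /* stripe unit # */
	unsigned long within = logical_block % STRIPE_BLOCKS; /* offset in unit */
	struct stripe_loc loc;

	loc.member = unit % N_MEMBERS;                    /* round-robin member */
	loc.block  = (unit / N_MEMBERS) * STRIPE_BLOCKS + within;
	return loc;
}

int
main(void)
{
	unsigned long lb;

	for (lb = 0; lb < 64; lb += 7) {
		struct stripe_loc loc = raid0_map(lb);
		printf("logical %3lu -> member %u, block %lu\n",
		    lb, loc.member, loc.block);
	}
	return 0;
}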

> Of course, the RAID format you choose is only one part of a very large
> equation ... some people swear by 1gig drives, and lots of them, to
> get the needed performance. Others swear by 4gig drives. Then you get
> into the whole narrow/wide/ultrawide argument, and how many drives are
> optimal to put on a single scsi chain, how many chains you need, etc.

This is why my request came up in the first place.  For RDBMS work,
actually, 300MB/spindle is about right (try to get a hot-performing 300MB
drive, though :-)

An interesting analysis I saw some time ago claims that the performance
gap between processors and disks grows at 50%/year.  The gap between
processors and DRAM grows at about the same rate.  Based on this logic,
more, smaller drives are better.  There are I/O logic overhead, cache
sizes, etc. to be considered too.

> Then there is the sticky issue of what stripe size to use. You, in the
> past, have recommended small stripe sizes for use in the DPT
> controller. I am not sure if that is because of hardware/software
> limitations or an over-generalisation. The growing consensus is that
> larger stripe sizes are better for news; sizes on the order of 32MB
> (yes, that's 32 megabytes) are not unusual. Can DPT support that?

32MB per stripe?  That means that the current DPT can cache only 2
stripes.  Consider the following too (this is a friendly discussion, not
an accusation session, right? :-)

* Most news servers use the Unix file system.  Right?
* Last I saw, ALL F/S I/O was done in 4KB chunks.
* A typical news posting is on the order of 8KB or less (the pornography
  and stolen software postings exempted).
* As news ages and expires and new postings come in, the linearity of the
  F/S will decrease, not improve; i.e., randomness will increase.

* A typical disk drive will take, on a single I/O (I am not sure the SCSI
  standard will even allow that.  Justin?), about 5-6 seconds to read/write
  a stripe of this size.  This means that every cache miss will take 5-6
  seconds to complete.

* We ran some very carefully timed tests (on Slowlaris, Linux and FreeBSD;
  NOT on the DPT - we did not have a driver yet) to confirm my findings
  from another job.  The number of I/O operations per second is almost
  independent of block size, SCSI bus width and SCSI bus rate up to about
  8KB blocks.  With blocks larger than 8KB, the data transfer time begins
  to be visible (see the back-of-envelope numbers after this list).

  [ In another job, we had access to a very nice ($38,000 or so nice) SCSI
    analyzer.  The ``paper accounting'' of SCSI bus cycles was very
    precisely confirmed on the CRT tube ]

  The reason for that is that most bus negotiations happen in slow/narrow
  mode.  Now FCAL is another story.
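To make the two claims above concrete, here is a back-of-envelope
calculation.  The sustained transfer rate and per-command overhead are
assumed, era-typical numbers, not DPT measurements: a 32MB stripe at about
6MB/s takes over 5 seconds to move, and I/Os per second stay nearly flat up
to ~8KB blocks because the fixed per-command overhead dominates the
transfer time.

/*
 * Back-of-envelope only; the 6MB/s and 12ms figures are assumptions.
 */
#include <stdio.h>

int
main(void)
{
	double stripe_mb   = 32.0;   /* proposed stripe size, MB */
	double disk_mb_s   = 6.0;    /* assumed sustained media rate, MB/s */
	double overhead_ms = 12.0;   /* assumed seek + rotation + command, ms */
	double block_kb;

	printf("32MB stripe at %.0f MB/s sustained: %.1f seconds per miss\n",
	    stripe_mb, stripe_mb / disk_mb_s);

	/* Service time = fixed per-command overhead + transfer time. */
	for (block_kb = 1.0; block_kb <= 64.0; block_kb *= 2.0) {
		double xfer_ms    = block_kb / 1024.0 / disk_mb_s * 1000.0;
		double service_ms = overhead_ms + xfer_ms;
		printf("%5.0fKB block: %5.2f ms transfer, %6.1f IO/s\n",
		    block_kb, xfer_ms, 1000.0 / service_ms);
	}
	return 0;
}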

The above will lead you to the conclusion that smaller is better.  The
reality of course says otherwise.  For many purposes, the number is around
32-128MB stripes.  Remember that a good SCSI HBA driver (or firmware) will
elevator-sort the I/O and collapse adjacent requests into single or linked
requests.
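By "elevator sort and collapse adjacent requests" I mean something like the
sketch below.  It is an illustration of the idea, not the actual DPT driver
or firmware queueing code: pending requests are kept sorted by starting
block, and a new request that touches an existing one's range is merged
into it.

/*
 * Illustration of elevator ordering plus adjacent-request coalescing.
 */
#include <stdio.h>

#define MAXQ 32

struct req {
	unsigned long block;   /* starting block */
	unsigned long count;   /* number of blocks */
};

static struct req queue[MAXQ];
static int nreq;

static void
enqueue(unsigned long block, unsigned long count)
{
	int i;

	/* Coalesce with a request that ends where we start, or vice versa. */
	for (i = 0; i < nreq; i++) {
		if (queue[i].block + queue[i].count == block) {
			queue[i].count += count;
			return;
		}
		if (block + count == queue[i].block) {
			queue[i].block = block;
			queue[i].count += count;
			return;
		}
	}

	/* Otherwise insert in ascending block order (the "elevator" order). */
	for (i = nreq; i > 0 && queue[i - 1].block > block; i--)
		queue[i] = queue[i - 1];
	queue[i].block = block;
	queue[i].count = count;
	nreq++;
}

int
main(void)
{
	int i;

	enqueue(100, 8);
	enqueue(300, 8);
	enqueue(108, 8);   /* adjacent to the first: collapses into one request */
	enqueue(50, 8);

	for (i = 0; i < nreq; i++)
		printf("req: block %lu, count %lu\n",
		    queue[i].block, queue[i].count);
	return 0;
}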

The only way I can see 32MB stripes being even usable is in setting them
up as CCD stripes of this size.  Of course, then you really do all your
I/O in... 4096 bytes.

If you want all these performance theories to go down the drain fast,
consider RDBMS (or any true database work) and start thinking about
ZERO LOSS I/O, where a return from a WRITE operation means ``I guarantee
that the data is on disk, the disk will not lose the data, my WRITE does
not collide with your WRITE, etc., or I will sell my in-laws into slavery
and give you the money''.
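From the host side, that contract looks roughly like the sketch below.
The file name and payload are made up for illustration; whether the
controller's write-back cache actually honours the flush is a separate
firmware/battery-backup question.  The point is that the caller does not
acknowledge the transaction until the data has been forced to stable
storage.

/*
 * Host-side sketch of a "zero loss" write: write, then fsync, then ack.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char buf[] = "committed transaction record\n";
	int fd;

	fd = open("redo.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
		perror("write");
		return 1;
	}

	/* Do not acknowledge the transaction until the data is on disk. */
	if (fsync(fd) < 0) {
		perror("fsync");
		return 1;
	}

	printf("write acknowledged only after fsync returned\n");
	close(fd);
	return 0;
}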

In 25 years of RDBMS design, I have never encountered a database block
size larger than 8K or an atomic I/O operation larger than 64K.  If I
remember correctly, the Berkeley F/S research indicated block sizes around
8KB to be optimal for file systems, but I am not sure of the validity of
that data anymore.

Last point to ponder: the bus bandwidth on a P6 memory bus is about
532MB/sec.  The theoretical limit for PCI is 132.  Practical?  I have never
seen anything useful going faster than 75MB/sec, with peak demo stuff at
100-110MB/sec.  This gives you (on most motherboards) a 4:1 ratio, with a
PCI-to-SCSI ratio of 2:1 theoretical and 4-10:1 practical.
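Worked out explicitly, using only the bus figures quoted above:

/*
 * The memory-bus vs. PCI numbers from the paragraph above, as ratios.
 */
#include <stdio.h>

int
main(void)
{
	double mem_bus  = 532.0;  /* P6 memory bus, MB/s */
	double pci_theo = 132.0;  /* PCI theoretical, MB/s */
	double pci_prac = 75.0;   /* PCI, best sustained I have seen, MB/s */
	double pci_peak = 110.0;  /* PCI, peak demo numbers, MB/s */

	printf("memory : PCI (theoretical) = %.1f : 1\n", mem_bus / pci_theo);
	printf("memory : PCI (practical)   = %.1f : 1\n", mem_bus / pci_prac);
	printf("memory : PCI (peak demo)   = %.1f : 1\n", mem_bus / pci_peak);
	return 0;
}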

Simon


