From owner-freebsd-scsi Fri Jun 19 12:56:52 1998
Return-Path:
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id MAA22772
          for freebsd-scsi-outgoing; Fri, 19 Jun 1998 12:56:52 -0700 (PDT)
          (envelope-from owner-freebsd-scsi@FreeBSD.ORG)
Received: from nomis.simon-shapiro.org ([209.86.126.163])
          by hub.freebsd.org (8.8.8/8.8.8) with SMTP id MAA22417
          for ; Fri, 19 Jun 1998 12:55:01 -0700 (PDT)
          (envelope-from shimon@nomis.Simon-Shapiro.ORG)
Received: (qmail 2328 invoked by uid 1000); 19 Jun 1998 19:48:37 -0000
Message-ID:
X-Mailer: XFMail 1.3 [p0] on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To:
Date: Fri, 19 Jun 1998 15:48:37 -0400 (EDT)
Reply-To: shimon@simon-shapiro.org
Organization: The Simon Shapiro Foundation
From: Simon Shapiro
To: Chris Parry , freebsd-questions@FreeBSD.ORG, freebsd-SCSI@FreeBSD.ORG
Subject: RE: DPT support binaries - How to Setup
Sender: owner-freebsd-scsi@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Chris,

I hope you do not mind me forwarding this to freebsd-questions...

On 19-Jun-98 Chris Parry wrote:
>> Since I moved, my web server has been down. DPT drivers for FreeBSD are
>> an integral part of FreeBSD now. If you want to see the contents of my
>> ftp server, use ftp://simon-shapiro.org/crash
>>
>> There is no dptmgr for FreeBSD yet. In the works.
>
> Excellent. Would you know how people are currently setting up RAID on
> DPT's in FreeBSD? I'm assuming something like having a dos partition,
> and doing the config there, and then just mounting the volume as sd0?

I have been asked this question many times, and have also seen much
misinformation in this regard, so here is a brief review.

The DPT controller creates and manages RAID arrays in a manner totally
transparent to the (ANY) operating system. Say you have 45 disk drives and
you attach them to one DPT controller (I have several ``customers'' who do
that; you need a DPT PM3334UDW, seven disk shelves, and a very large UPS).
Then you boot DOS from a floppy, take DPT install floppy number one (it
comes with the controller), put it in the floppy drive, type ``dptmgr/fw0''
and press the return key. After a short while, a Windows-like application
starts. You do not need Windows, DOS, or anything else installed on the
machine. Just boot DOS 6.22 or later (I use IBM PC-DOS 7.0) from the
floppy drive.

We will get to the review in a minute, but here are the steps to create a
very complex RAID subsystem, for ANY operating system, FreeBSD included.

For brevity, I will use the notation cXbYtZ to refer to disk drives. The
DPT controllers (the PM3334 series) can have up to three SCSI busses
attached to the same controller. BTW, the correct name for a SCSI
controller is HBA, as in Host Bus Adapter. Let's say we have two
controllers. The first controller has one disk connected to its first
channel, and the disk is set up (via its jumpers) to be target 1. The
second controller has one disk connected to its third channel, set up to
be target ID 15. In this example, I will call the first disk c0b0t1. The
second disk I will call c1b2t15. OK?

Now, back to our monster system. Step by step:

* Hook up everything. If you are not using DPT or DEC StorageWorks disk
  shelves, make SURE you buy the MOST expensive cables you can find.
  Ultra SCSI runs a 16-bit bus at 20MHz, using electrical signals similar
  to those of the ISA/PCI busses. You do not expect your PC cards to work
  over 30 feet of sloppy, cheap cable. Right? Do not expect SCSI to be any
  different. If you need long cables, or more than 5-6 drives per very
  short cable, get differential controllers and good shelves. Half the
  people who contact me with ``DPT problems'' have cheap cables and/or
  disk shelves. The other half are using Quantum Atlas drives with
  old/bad/wrong firmware. About 1 in a hundred actually hits a deficiency
  in the driver or the firmware. This is not bragging; these are facts.
  I am happy to help each and every one, but your system is yours, not
  mine, and you need to set it up correctly.
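  As a point of reference on why cabling matters so much (my own
  back-of-the-envelope arithmetic, not a DPT figure): a 16-bit bus clocked
  at 20MHz can move at most

      2 bytes x 20,000,000 transfers/Sec = 40MB/Sec

  so the bus is already working hard, and any noise a cheap cable injects
  turns into retries or corrupted transfers.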
* As indicated above, boot DOS, insert DPT floppy 1, and start the dptmgr
  with the option fw0. I will tell you later why.

* If this is the very first time these disk drives are connected to the
  DPT controller, DPTMGR will invite you to choose an O/S. Say Linux, or
  Other. Do NOT say BSDi. This is important.

* In this example, we will assume a pair of DPT PM3334UDW (Ultra-wide,
  differential, with 3 busses) HBAs and 84 identical 4GB drives. We will
  assume that the drives are installed in DPT shelves, 7 drives per shelf.
  Each channel on each controller has 2 shelves attached to it; 2 shelves
  contain 14 drives, for a total of 6 shelves per controller, or 42 drives
  per controller. The shelves have to be configured for the starting
  TARGET ID of the shelf. One shelf has targets 0-6, the second shelf has
  targets 8-15. The DPT HBA has target ID 7 on all three busses.

* We want to arrange the drives in this manner (a quick tally follows
  below):

  1 RAID-1 array to be used for booting the system, most installed file
    systems, etc. This gives us full redundancy, and yet READs faster
    than a single disk and WRITEs almost as fast. 4GB will be enough, so
    this will consume exactly two drives.

  1 RAID-0 array to be used for swap, /tmp, /var/tmp, /usr/obj, etc.
    RAID-0 is very fast, but if any disk in the array fails, the whole
    array loses its data. For the indicated use, this is acceptable to us
    (remember, this is just an example). We do not need an awful lot of
    space, but we need a speed of at least 15MB/Sec, so we will use 6
    drives here.

  1 Huge RAID-0 array to contain news articles. Again, we do not care if
    we lose the articles. This array needs to be big and as fast as
    possible. We will use 33 disk drives here.

  1 Huge RAID-5 array to contain our E-mail. We need reliability and
    capacity, so we will use 33 drives here. In reality, RAID-5 arrays
    are not so effective at this size, but this is just an exaggerated
    example.

  1 Very large RAID-5 array to contain our CVS tree. Since reliability
    and performance are important, we will use 8 drives here.

  If you add up the drives, you will see that we have two drives
  unassigned here. Hold on...

* Before we go on to configure/create any RAID arrays, here is a bit of
  madness: the order in which the BIOS finds adapters (controllers) on
  the PCI bus, and the order in which the SAME BIOS boots, and/or FreeBSD
  scans the PCI bus, can be reversed on some motherboards. What that
  means is that what we refer to as c0bYtZ in the context of DPTMGR may
  actually be c1bYtZ as far as Unix is concerned. In ALL my systems
  things are reversed: what the DPT calls HBA 0 is actually HBA 1 for
  Unix.

* Make sure the DPT sees all your devices, without errors. If you use DPT
  disk shelves and see a cabling error, unplug, correct, and use
  File->Read System Configuration to force the DPTMGR to re-scan the
  busses. No need to reboot.
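Before we assign roles to the drives, here is the tally of the plan above
(the usable capacities are my own arithmetic for illustration, assuming
4GB drives and the usual one-drive parity overhead for RAID-5; they are
not DPT figures):

    RAID-1 boot            2 drives   ~4GB usable
    RAID-0 swap/tmp        6 drives   ~24GB usable
    RAID-0 news           33 drives   ~132GB usable
    RAID-5 mail           33 drives   ~128GB usable (32 x 4GB)
    RAID-5 CVS             8 drives   ~28GB usable  (7 x 4GB)
    Unassigned (for now)   2 drives   (more on these below)
    ---------------------------------
    Total                 84 drives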
* The next step is to define the role of each drive in the system.
  Drives can be themselves, part of a RAID array, or Hot Spares. Drives
  that are Hot Spares or part of a RAID array are invisible TO THE O/S.
  Again: there is no way for the O/S to see a drive that is part of a
  RAID array, or a hot spare.

* In defining RAID arrays, DPTMGR asks you for a stripe size. Unless you
  have a specific reason to override the default stripe size, leave it
  alone. Chances are the DPT people who write the firmware know their
  hardware, SCSI, and RAID theory better than you do.

* Using the DPTMGR utility, we create a RAID-1 array using c1b0t0 and
  c1b1t0. Use File->Set System Parameters to save the configuration and
  have the DPT start building the array. When you are done defining the
  array, its icons will have black flags. While it builds, the array icon
  will be blue, and the drives' icons will be white.

* While the array builds, double-click on it, click on the Name button,
  and type in an appropriate name, for example ``Mad_Boot-1'', to remind
  yourself that this is the Mad system, the Boot ``disk'' (more on that
  later), and that it is a RAID-1. Choose File->Set System Parameters to
  save the new name.

* Double-click on c1b2t0 and click on Make Hot Spare. This will make the
  drive invisible to Unix, but will allow the DPT to automatically
  replace any defective drive with the hot spare. We will talk about that
  some more later.

* Start creating a RAID-0 array. Add devices to this array in this order:
  c1b0t1, c1b1t1, c1b2t1, c1b0t2, c1b1t2, c1b2t2, c1b0t3, c1b1t3,
  c1b2t3... The idea is to specify the drives alternating between busses.
  This gives the DPT the opportunity to load-share the busses.
  Performance gains are impressive. When you are done, File->Set System
  Parameters. Do not forget to change the array name to ``Mad-News-0''.

* Do the same with the rest of the arrays on c0. Remember to designate a
  hot spare, to alternate the drives between busses as you add them to an
  array, and to File->Set System Parameters.

* The theory says that you could now shut your system down and install
  Unix on it. Not so fast: while the arrays are building, they are NOT
  available to you. Current firmware (7M0) will show the arrays to the
  O/S with a size of zero or one sector. FreeBSD will crash (panic) in
  response to that. Leave the system alone until it is all done.
  Handling failures on arrays that are already built is totally
  different. See below.

* If you follow my example, when you re-boot the system, BIOS, DOS,
  Windows, Linux, FreeBSD will all see only FIVE disk drives. What
  happened to the other 79 drives?!!! Listen carefully: every RAID array
  appears to the O/S as ONE DISK DRIVE. Hot spares are TOTALLY INVISIBLE
  to the O/S. Since we defined 5 RAID arrays and two hot spares (one per
  HBA; Hot Spares cannot cross HBA lines), all the system gets to see is
  FIVE DRIVES. If you look at the drive model, it will show the array
  name you chose when setting up. The revision level for the ``drive'' is
  DPTxYz, where xYz is the firmware version. Currently, there is no way
  to get to the underlying drives in FreeBSD.

Operation:

Once the arrays have completed building (the blue flags will be gone, and
if you double-click on the array icon, its Status field will say
``Optimal''), go install whatever O/S, using whatever method you choose.
Please beware that some versions of FreeBSD barf on disks with a capacity
of 20GB or larger, so they may barf on filesystems this huge. This seems
to be an ``attribute'' of sysinstall, not of the standard fdisk,
disklabel, and newfs.
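If sysinstall chokes on the huge arrays, you can create those filesystems
by hand after installing onto the boot array. A minimal sketch, assuming
2.2-era FreeBSD, that the news array shows up as sd1, and that you give it
a single ``e'' partition (the device name and partition letter are just
for this example):

    # the 33-drive news array appears to FreeBSD as one big SCSI disk (sd1)
    disklabel -r -w sd1 auto    # write a default label on the whole "drive"
    disklabel -e sd1            # edit the label; carve out an 'e' partition
    newfs /dev/rsd1e            # build a filesystem on it
    mkdir -p /news
    mount /dev/sd1e /news       # and mount it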
You may choose to only install and configure the boot disk, as this one
appears to the O/S as a simple 4GB disk (if you used 4GB drives to define
it).

Failures:

What does the DPT do in case of a disk failure? First, the FreeBSD O/S and
the DPT driver in the O/S have no clue about the discussion below, except
as explicitly indicated.

General: If you use DPT shelves, a failed drive (unless it is totally
dead) will turn its fault light on. The disk canister has a Fault light
that the DPT can turn on and off. In addition, the DPT controller will
beep. The beeping pattern will actually tell you which disk on which bus
has failed. If you use DPT (or DEC) shelves, simply pull out the bad drive
and plug in a new drive.

RAID-0: If any disk in a RAID-0 array fails, the whole array is flagged by
the DPT as ``Dead''. Any I/O to the array will immediately fail. If you
boot DOS/DPTMGR, the array will have a black flag on it. Your only option
is to delete the array and create a new one. Any data on the array will be
lost. Horrible? Not necessarily. If you use RAID-0 arrays for data you can
live without, you will be fine. With drives touting 800,000 hours MTBF, an
8-disk RAID-0 array will have an MTBF of about 5 years.

RAID-1/5: If a drive fails, the RAID array will go into degraded mode
(yellow flag in dptmgr). If you have a Hot Spare connected to the DPT, it
will automatically make the hot spare a new member of the degraded array
and start rebuilding the array onto that drive (the one that was the Hot
Spare). If you replace the dead drive with a good one, the newly inserted
drive will be recognized by the DPT and be made a new Hot Spare. The new
Hot Spare will not be available until the rebuilding array has completed
its re-build.

Important: The array remains available for I/O while in degraded or
reconstructing mode. This has been verified more than once, and actually
works. However:

* RAID-5 in degraded mode is very slow.

* Array rebuild in RAID-5 is done by reading all the good drives,
  computing the missing data, and writing it to the new/replacement
  drive. This is not exactly fast. It is downright SLOW.

* RAID-1 arrays rebuild by copying the entire good disk onto the
  replacement disk. This sucks all the bandwidth off the SCSI bus.

* If there is another error while the arrays are in degraded/rebuild
  mode, you are hosed. There is no redundant data left, and you may lose
  your data. If possible and/or practical, do not WRITE to a degraded
  array. Back it up instead.

Common Failure:

I cannot overstate how common this failure scenario is: one uses cheap
shelves (disk cabinets), cheap cables, and marginal drives, and the
following happens. A certain drive hangs, or goes on the fritz, or the
SCSI bus stalls. The DPT makes a valiant effort to reset the bus, but at
least one drive is out cold. The DPT raises the alarm and drafts a Hot
Spare into service. It then starts rebuilding the array. Since rebuilding
the array is very I/O intensive, another drive goes on the fritz. Now the
array is DEAD as far as the DPT is concerned.

At this point the operator notices the problem, shuts the system down,
reboots into the DPTMGR, and comes out saying ``Nothing is WRONG!'' The
operator sees red flags on the drives, if lucky runs the DPT OPTIMAL.EXE
utility (NOT available from me!), runs DOS-based diagnostics, and starts
his/her news server again. Within minutes/hours/days, the whole scenario
repeats.

What happened? Under DOS (dptmgr is no exception), I/O is polled and
sedately slow. Failures are rare and recovery almost certain.
Under FreeBSD, a news server can peak at well over 1,000 disk I/Os per
second. This is when marginal systems break. This is the most frustrating
scenario for me:

* The problem is NOT an O/S problem. FreeBSD simply pushes as much I/O
  into the driver as it can. It knows nothing of RAID arrays; the RAID
  array is simply a disk to it.

* It is not a driver problem; the driver does not know a RAID array from
  a cucumber. It simply receives SCSI commands and passes them along. It
  never looks inside to even see what command is passing through.

* The DPT firmware is not at fault either; it simply pushes commands to
  the drives according to the ANSI spec.

So, who is at fault? Typically, the user who buys a $3,000.00 disk
controller, attaches it to $20,000 worth of disk drives, and connects it
all with a $5.00 cable. In some cases, the user elects to buy the
cheapest drives they can find, so as to reduce the $20,000 cost in disks
to maybe $15,000. Some of these drives simply cannot do I/O correctly, or
cannot do it correctly and quickly. Sometimes the problem is a
combination of marginal interconnect and marginal drives.

What to Do?

If you have mission critical data storage that you want to be reliable
and fast:

* Get a DPT controller.

* Get the ECC memory from DPT (yes, bitch them out on the price, and say
  that Simon said the price is absurdly high).

* Get the disk shelves from DPT (or get DEC StorageWorks).

* Get the DISKS from DPT. Make sure they supply you with the ECC-ready
  disks. You will pay about $100.00 per drive extra, but will get the
  carrier for free, so the total cost is just about the same.

* Get the cables from DPT, DEC, or Amphenol.

* SCSI cables are precision instruments. Keep the hammer away from them.

* Use only the version of firmware I recommend for the DPT. It is
  currently 7Li, not the newer 7M0.

I have done the above, and on that day my I/O errors disappeared. I run a
total of over 60 disk drives on DPTs, some in very stressed environments,
most in mission critical environments. Any failure we have had to date is
a direct result of violating these rules.

What is the ECC option?

The ECC option comprises special ECC SIMMs for the DPT cache, proper
cabling, and proper disk shelves (cabinets, enclosures). Using this
option, the DPT guarantees that the data recorded to a device, or read
from a device, goes through a complete ECC data path. Any small errors in
the data are transparently corrected. Large errors are detected and
alarmed.

How does it work?

* When data arrives from the host into the controller, it is put into the
  cache memory, and a 16-byte ECC is computed on every 512-byte
  ``sector''.

* The disk drive is formatted to 528 bytes/sector, instead of the normal
  512. Please note that not every disk can do that. Sometimes it is
  simply a matter of having the proper firmware on the disk; sometimes
  the disk has to be different.

* When the DPT writes the sector to disk, it writes the entire 528 bytes.
  When it READs the sector, it reads the entire 528 bytes, performs the
  ECC check/correction, and puts the data in the cache memory.
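To put a number on the ECC overhead (again, just my arithmetic for
illustration): each 512-byte host sector gains 16 bytes of ECC, so

    512 + 16 = 528 bytes per formatted sector, and 16/528 is roughly 3%

of the raw disk capacity spent on the end-to-end check.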
All disk drives have either CRC or ECC. What's the big deal? Disk drives
use ECC (or some such) to make sure that what came into their possession
was recorded correctly on the disk, or that the recorded data is read
back correctly. The disk drive still has no clue whether the data it
receives from the initiator (HBA) is correct, and when a disk sends data
to the HBA, it does not know what arrives at the host. Yes, the SCSI bus
supports parity. But parity cannot correct errors, and will totally miss
an even number of bad bits. Yes, that happens easily with bad cables.

So, what is it good for? Aside from peace of mind, there is one important
use for this: hot-plug drives. Let me explain: the SCSI bus (the ribbon
cable genre, not FC-AL) was never designed to sustain hot insertion. If
you add or remove a device on the bus, you will invariably glitch it.
These glitches will end up doing one of two things:

a. Corrupt some handshake signal. This is typically easily detected by
   the devices and corrected via re-try.

b. Corrupt some data which is in transfer. This is the more common case.
   If the corruption goes undetected, something will go wrong much later
   and will typically be blamed on software.

The ECC option goes hand in hand with the cabinet/disk-canister. The
cabinet/canister combination uses special circuitry to make sure that
minimal disruption appears on the bus; the ECC option complements it by
making sure that the entire data path, from the CPU to the disk and back,
is monitored for quality and, in most cases, automatically repaired.

What other wonders does a DPT controller perform?

If you expect it to actually write code for you, sorry. It does not. But
if you use the correct enclosure, it will tell you about P/S failures,
fan failures, overheating, and internal failures. A near-future release
of the driver will even allow you to get these alarms into user space.
You can then write a shell/perl/whatever script/program to take some
action when a failure occurs in your disk system.

What performance can I expect out of a DPT system?

It depends. Caching systems are slower in sequential access than
individual disks. Things like Bonnie will not show what the DPT can
really do. RAID-1 is slightly slower than a single disk in WRITE and can
be slightly faster in READ operations. RAID-5 is considerably slower in
WRITE and slightly faster in READ. RAID-0 is fast but very fragile.

The main difference is in the disk subsystem's reaction to increasing
load and its handling of failures. A single disk is a single point of
failure. A DPT, correctly configured, will present ``perfect/non-stop''
disks to the O/S. In terms of load handling, the DPT controller has the
lowest load per operation in a FreeBSD system. In terms of interrupts per
operation, number of disks per PCI slot, size of file systems, number of
CPU cycles per logical operation, and handling of heavy disk loads, it is
probably the best option available to FreeBSD today.

In terms of RAW I/O, doing random read/write to large disk partitions,
these are the numbers:

RAID-0: Approximately 1,930 disk operations per second. About 18-21
        MB/Sec.

RAID-1: About 6.5 MB/Sec WRITE, about 8-14 MB/Sec READ; the wide range
        stems from an increasing cache hit ratio.

RAID-5: About 5.5 MB/Sec WRITE, 8.5 MB/Sec READ.

These are optimal numbers, derived from large arrays and a PM3334UDW with
64MB of ECC cache. To achieve these numbers you have to have at least 500
processes reading and writing continuously to random areas on the disk.
This translates to a Load Average of 150-980, depending on the array
type, etc. From daily use, I'd say that until your disk I/O reaches 300
I/O ops/sec, you will not feel the load at all.

I hope the above answers some of the most common questions about the DPT
controller and its interaction with FreeBSD. If you have some more, let
me know.

Simon

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message