Date:      Tue, 3 Mar 1998 19:13:18 +0100 (MET)
From:      Wilko Bulte <wilko@yedi.iaf.nl>
To:        grog@lemis.com (Greg Lehey)
Cc:        sbabkin@dcn.att.com, tlambert@primenet.com, shimon@simon-shapiro.org, jdn@acp.qiv.com, blkirk@float.eli.net, hackers@FreeBSD.ORG
Subject:   Re: SCSI Bus redundancy...
Message-ID:  <199803031813.TAA01249@yedi.iaf.nl>
In-Reply-To: <19980303191755.14264@freebie.lemis.com> from Greg Lehey at "Mar 3, 98 07:17:55 pm"

As Greg Lehey wrote...
> On Mon,  2 March 1998 at 23:57:44 +0100, Wilko Bulte wrote:
> > As Greg Lehey wrote...
> >> On Mon,  2 March 1998 at 14:23:50 -0500, sbabkin@dcn.att.com wrote:
> >>>> ----------
> >>>> From: 	Terry Lambert[SMTP:tlambert@primenet.com]
> >>>>
> >>>>>>> I think Julian's SLICE code has something in that direction.  DPT
> >>>>>>> supports INCREASING the size of a RAID-5 array by adding drives.
> >>>>>>
> >>>>>> How can that work?
> >>>>>
> >>>>> Something like
> >>>>> 	- read N RAID blocks from K disks
> >>>>> 	- compute the new checksum for K+1 disks and write the data
> >>>>>         back as a smaller number of RAID blocks, each one bigger
> >>>>>         by a factor of (K+1)/K
> >>>>>       - add the freed space as empty blocks at the end of the RAID
> >>>>
> >>>> You would have to remember to grab the blocks to be relocated with
> >>>> the same O(n) randomness as their allocation.  8-).
> >>>>
> >>> Huh? Probably I've missed something about RAIDs. I thought
> >>> that, for example, RAID block 0 consists of blocks 0 of all
> >>> the physical disks. And so on. And I thought that RAID itself
> >>> does not allocate any blocks; the upper level like a filesystem or
> >>> volume manager does that, RAID just does checksumming. Am I wrong again?
> >>
> >> That's not the point.  OK, we were talking about RAID 5 here, which
> >> also has parity blocks, but the point is that if you add another disk,
> >> you're effectively adding another block every n blocks in the file
> >> system address space.  It requires some non-trivial data movement to
> >> rearrange all the data (more specifically, except for the first n (n =
> >> old number of drives) blocks, you must move *everything*, and you must
> >> recalculate parity for every stripe).
> >>
> >> My question ("How can that work?") was based on the misassumption that
> >> this would be too much work to be justifiable.
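
To make the data movement concrete: with one block per stripe unit and
a simple rotating-parity layout, the block-to-disk mapping might look
like the sketch below (untested, and the layout is made up for
illustration). Going from K to K+1 disks changes the location of nearly
every block, which is why nearly everything has to move:

    #include <stdio.h>

    struct loc { int disk; long offset; };

    /* where volume block b lives in a RAID 5 set of ndisks drives,
     * one block per stripe unit, parity rotating across the disks */
    static struct loc
    raid5_map(long b, int ndisks)
    {
        struct loc l;
        long stripe = b / (ndisks - 1);  /* ndisks-1 data blocks/stripe */
        int pdisk = stripe % ndisks;     /* the parity disk rotates */
        int idx = b % (ndisks - 1);      /* data slot within the stripe */

        l.offset = stripe;
        l.disk = idx >= pdisk ? idx + 1 : idx;  /* skip the parity disk */
        return l;
    }

    int
    main(void)
    {
        long b;

        for (b = 0; b < 12; b++) {
            struct loc o = raid5_map(b, 4);  /* old: 4 disks */
            struct loc n = raid5_map(b, 5);  /* new: 5 disks */
            printf("block %2ld: disk %d off %ld -> disk %d off %ld\n",
                b, o.disk, o.offset, n.disk, n.offset);
        }
        return 0;
    }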
> >
> > And apart from the work involved to get it implemented: how long would it
> > take a RAIDset to get re-organised/enlarged?  Reason #1 for doing things
> > like this is that you don't want downtime.  And I don't want to think about
> > some hardware failure (say a disk) halfway through this process.  That would
> > really result in a dis[k]array ;-)
> 
> Obviously there are a number of problems.  But in fact it's not as
> difficult as it sounds.  There's a problem with RAID 5 anyway if
> there's, say, a power failure during a write.  After bringing it back
> up again, you can recognize that there's a parity error, but where?

This is called the 'write hole' in the literature. The trick is to
use battery-backed cache not only for RAID5 (write) performance
reasons, but also to keep the data until data AND parity have safely
landed on the disks.

The same problem exists for mirror sets, BTW. And don't enable the write
caches *on the disks themselves* unless you feel suicidal ;-)
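
To see the hole itself: a RAID5 small write is a read-modify-write
that touches two disks (data, then parity), and nothing makes the pair
atomic. A toy demonstration with one byte per "disk" (illustrative
only, nothing like real driver code):

    #include <stdio.h>

    int
    main(void)
    {
        /* three data "disks" plus their parity, one byte each */
        unsigned char d0 = 0x11, d1 = 0x22, d2 = 0x33;
        unsigned char parity = d0 ^ d1 ^ d2;

        /* small write: new data for d1 means two separate disk writes */
        unsigned char new_d1 = 0x55;
        unsigned char new_parity = parity ^ d1 ^ new_d1;

        d1 = new_d1;        /* write #1 (data) completes...        */
        (void)new_parity;   /* ...power fails before write #2      */
                            /* (parity) ever reaches the disk      */

        /* later d0 dies; reconstruction uses the stale parity: */
        printf("real d0 = %02x, rebuilt d0 = %02x\n",
            d0, parity ^ d1 ^ d2);
        return 0;
    }

This prints 11 versus 66: the reconstructed block is garbage, and
nothing in the array tells you which stripe member to distrust.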

> The question of reorganizing isn't as critical: run an asynchronous
> process which updates the array a stripe at a time.  In addition to
> the data, let it write a magic number in the first sector
> following the updated stripe.  If the array does go down during the
> update, a recovery run can find this magic number and know where
> to restart the reorganization.  Not ideal, but better than nothing.

Throw hardware at it... Maybe you could also use a Prestoserve memory
card of some sort for it.
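
In outline the checkpoint scheme could look like this (hypothetical
helper names, untested; a real version also has to read ahead, since a
wider stripe consumes data from more than one old stripe):

    #define REORG_MAGIC 0x52454f52UL    /* made-up marker value */

    struct reorg_marker {
        unsigned long magic;            /* REORG_MAGIC */
        unsigned long stripe;           /* last stripe fully rewritten */
    };

    /* stand-ins for the real driver I/O paths */
    extern void read_stripe(unsigned long s, int ndisks, void *buf);
    extern void write_stripe(unsigned long s, int ndisks, const void *buf);
    extern void write_marker(const struct reorg_marker *m);

    void
    restripe(unsigned long nstripes, int k, void *buf)
    {
        struct reorg_marker m;
        unsigned long s;

        m.magic = REORG_MAGIC;
        for (s = 0; s < nstripes; s++) {
            read_stripe(s, k, buf);       /* old k-disk layout */
            write_stripe(s, k + 1, buf);  /* new layout, fresh parity */
            m.stripe = s;
            write_marker(&m);             /* must hit disk before s+1 */
        }
    }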

> Vinum offers another alternative: attach a second plex with the same
> data, maybe only a few megabytes at a time.  During the time this area
> of the volume is being updated, the plex supplies a backup in case of
> failure.  When the region is left, the plex is detached and reattached
> at the next point in the array.  If anything goes down, the correct
> data will be in the auxiliary plex.

Hmm. Sounds reasonable. 
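
In outline, with made-up names for the plex operations (a sketch of
the idea, not actual vinum code):

    /* stand-ins for the real plex operations */
    extern void plex_attach(unsigned long start, unsigned long len);
    extern void plex_detach(void);
    extern void reorg_region(unsigned long start, unsigned long len);

    void
    reorg_volume(unsigned long volsize, unsigned long window)
    {
        unsigned long off, len;

        for (off = 0; off < volsize; off += window) {
            len = volsize - off < window ? volsize - off : window;
            plex_attach(off, len);    /* copy region into backup plex */
            reorg_region(off, len);   /* crash here: good copy in plex */
            plex_detach();            /* region is consistent again */
        }
    }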

Wilko
_     ______________________________________________________________________
 |   / o / /  _  Bulte email: wilko @ yedi.iaf.nl http://www.tcja.nl/~wilko
 |/|/ / / /( (_) Arnhem, The Netherlands - Do, or do not. There is no 'try'
---------------  Support your local daemons: run [Free,Net,Open]BSD Unix  --
