From owner-freebsd-hackers  Wed Mar  4 14:49:30 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id OAA10996
          for freebsd-hackers-outgoing; Wed, 4 Mar 1998 14:49:30 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from att.com (cagw2.att.com [192.128.52.90])
          by hub.freebsd.org (8.8.8/8.8.8) with SMTP id OAA10972
          for <hackers@freebsd.org>; Wed, 4 Mar 1998 14:49:14 -0800 (PST)
          (envelope-from sbabkin@dcn.att.com)
From: sbabkin@dcn.att.com
Received: by cagw2.att.com; Wed Mar  4 17:00 EST 1998
Received: from dcn71.dcn.att.com (dcn71.dcn.att.com [135.44.192.112])
	by caig2.att.att.com (AT&T/GW-1.0) with ESMTP id RAA01608
	for <hackers@freebsd.org>; Wed, 4 Mar 1998 17:04:34 -0500 (EST)
Received: by dcn71.dcn.att.com with Internet Mail Service (5.0.1458.49)
	id <G11GP65Q>; Wed, 4 Mar 1998 17:06:51 -0500
Message-ID: <C50B6FBA632FD111AF0F0000C0AD71EE4132D6@dcn71.dcn.att.com>
To: shimon@simon-shapiro.org
Cc: wilko@yedi.iaf.nl, tlambert@primenet.com, jdn@acp.qiv.com,
        blkirk@float.eli.net, hackers@FreeBSD.ORG, grog@lemis.com,
        karl@mcs.net
Subject: RE: SCSI Bus redundancy...
Date: Wed, 4 Mar 1998 17:06:48 -0500
X-Priority: 3
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.0.1458.49)
Content-Type: text/plain;
	charset="iso-8859-1"
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> ----------
> From: 	Simon Shapiro[SMTP:shimon@simon-shapiro.org]
> 
> >> I wrote a white paper at Oracle some years ago, claiming that
> >> databases over a certain size simply cannot be backed up.  I became
> >> very UN-popular very quickly.  In your moderate setup, you already
> >> see the proof of correctness.
> >> 
> > IMHO they CAN be backed up, as long as you have enough spare
> > equipment.  At my previous job at a bank, where we were paranoid
> > about backup and downtime, I think I found a scalable way of doing
> > so.  We used it on a relatively small database (~15G), but I can't
> > see why it cannot be scaled.  First, forget about exports.  Copy
> > the database files and archived logs.  In addition to the
> > production instance, have two more instances.  One gets archived
> > logs copied and rolled forward immediately.  The other gets archived
> > logs copied immediately, but rolled forward only after they have
> > aged.  Copy this third instance to tape from time to time.  Copy
> > archived logs to tape as fast as they are produced.
> 
> Yes.  This scheme works, but you are not backing up the database, nor
> is it scalable.  Operating on a database (from a backup point of
> view) makes arbitrary changes to the files.  If you back them up, you
> will have an inconsistent view of the data.
> 
But the database (I mean Oracle) produces archived logs that can
be used to roll forward an outdated copy.

> Problem number 2:  If your system's storage I/O is utilized at higher
> than 50%, you cannot dump the files at all.
> 
The third copy is outdated by definition. So if you stop it
rolling forward for half a day or a whole day and make a full
tape backup from it, that causes no problem at all. Yes,
the backup will be outdated, but because all the archived logs are
also written to tape as they are created, you can
later restore them together with this full backup and apply them
on top of it.
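
To make the restore path concrete, here is a minimal sketch in Python.
It is a toy model and not Oracle's recovery code: the database is a
plain dict, an archived log is a list of (seq, key, value) records, and
the names restore, tape_backup and logs_on_tape are made up for the
illustration. It only shows that an outdated full backup plus every
later log, applied in order, gives the current state.

```python
def restore(full_backup, backup_seq, archived_logs):
    """Rebuild a current image from an outdated full backup plus logs.

    full_backup   -- dict snapshot of the data as of log sequence backup_seq
    archived_logs -- iterable of (seq, key, value); value None means delete
    """
    db = dict(full_backup)
    # Apply only the logs newer than the backup, strictly in sequence order.
    for seq, key, value in sorted(archived_logs, key=lambda rec: rec[0]):
        if seq <= backup_seq:
            continue  # already contained in the full backup
        if value is None:
            db.pop(key, None)
        else:
            db[key] = value
    return db


# The third instance stopped rolling forward at sequence 2 and went to tape.
tape_backup = {"acct1": 100, "acct2": 50}
logs_on_tape = [
    (1, "acct1", 100),
    (2, "acct2", 50),
    (3, "acct1", 70),    # produced after the full backup was taken
    (4, "acct3", 30),
    (5, "acct2", None),  # row removed
]
current = restore(tape_backup, 2, logs_on_tape)
print(current)
```

The age of the full backup affects only how many logs have to be
replayed, not whether the result is correct.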

> > If the production instance crashes, use the second one. If someone
> > removed a table more recently than the age of the third
> > instance, start that instance and get the table from it. If the
> > removal was noticed too late, there will be a big PITA with restoring
> > from tapes. 
> 
> What you describe here is application-level mirroring.  It works after a
> 
Yes, with the difference that the second copy may be located
on a machine in another building, connected by something like FDDI,
so you are protected against things like a fire in the computer room.
And it is not quite a mirror: the copies are out of sync by something
like 10 minutes all the time. Of course, the primary system should have
hardware mirroring and the like (in my case it did not, for
political reasons; personally I would prefer mirroring, even
instead of this scheme), so you can lose these
10 minutes of operational data only if the primary system is
significantly destroyed.

> fashion, but in case the two databases go out of sync, you have no way
> of proving which side is correct.  Also, it is not a deterministic
> system; 
> 
They cannot go far out of sync if everything is working. One of
them is the master: it generates the database archive logs during
operation, and these logs get applied to the secondary database.
The two are always out of sync by the time necessary to
generate, transfer and apply these logs, but it can't get worse than that.
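
A small simulation of why the lag stays bounded, again as a sketch in
Python under loud assumptions: time is counted in ticks, the master
archives exactly one log per tick, and the Standby class and the delay
values are invented for the example. One standby applies a log one tick
after it arrives, the other only after the log has aged five ticks.

```python
from collections import deque


class Standby:
    """A copy that receives every archived log but applies it only
    once the log is at least `delay` ticks old."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = deque()  # (arrival_time, seq) waiting to be applied
        self.applied_seq = 0

    def receive(self, now, seq):
        self.pending.append((now, seq))

    def roll_forward(self, now):
        while self.pending and now - self.pending[0][0] >= self.delay:
            _, self.applied_seq = self.pending.popleft()


def simulate(ticks, hot_delay=1, aged_delay=5):
    """Return the worst lag (in logs) seen on the hot and the aged standby."""
    hot, aged = Standby(hot_delay), Standby(aged_delay)
    max_hot_lag = max_aged_lag = 0
    for now in range(1, ticks + 1):
        seq = now  # assumption: the master archives one log per tick
        for standby in (hot, aged):
            standby.receive(now, seq)
            standby.roll_forward(now)
        max_hot_lag = max(max_hot_lag, seq - hot.applied_seq)
        max_aged_lag = max(max_aged_lag, seq - aged.applied_seq)
    return max_hot_lag, max_aged_lag


print(simulate(100))
print(simulate(10000))
```

In this model the worst lag equals the configured delay no matter how
long the simulation runs, as long as every log keeps arriving.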

> You cannot really commit the master until the slave committed.  This
> gets nasty in a hurry.  One database with one mirror may work.
> Twenty of them?
> 
It does not try to sync. It is just an auxiliary backup system. If
your primary system goes completely down, you can start
the secondary system as primary in 10 minutes. Yes, you will lose
something like the last 0...10 minutes of operation. But you will
still be able to provide service.

> > Do an offline (better, but with downtime) or online backup if you
> > reset logs. This can be done fast if the I/O subsystem has enough
> > throughput to copy all the disks of the database to backup disks in
> > parallel, and if the disks can be remapped between machines
> > easily. For 4G disks this will take no more than 1 hour.
> 
> There are databases which cannot go offline.  Banks have the unique
> position where they hold the customer's money behind a locked door :-)
> An ISP's Radius database cannot shut down.  A telephone company
> authentication server cannot shut down.  A web server should not shut
> down.  A mail server can shut down.  A DNS server cannot shut down.
> You may disagree with some of these classifications, but some of them
> cannot be shut down, and actually cannot get out of sync either.
> 
Agreed. For these cases Oracle has a facility called online backup.
You tell the RDBMS that you are going to do a backup, and after that
you copy the database files (the RDBMS is still running; only performance
is degraded due to competition for the disks). Later you can apply
the archived logs to this image and get a working database.
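
Why a copy taken under a running database is still usable can be shown
with a toy model in Python. This is a sketch and not a description of
Oracle internals: the datafile is a list of blocks, and I assume that
each log record carries the whole new content of a block, so applying
a record twice is harmless. The names online_backup and recover are
invented here.

```python
def online_backup(datafile, writes_during_copy):
    """Copy `datafile` block by block while the database keeps writing.

    datafile           -- list of block contents, mutated like a live file
    writes_during_copy -- {copy_step: (block_no, new_value)}: a write that
                          lands just before block `copy_step` is copied
    Returns the fuzzy image and the log records produced during the copy.
    """
    image, log = [], []
    for step in range(len(datafile)):
        if step in writes_during_copy:
            block_no, value = writes_during_copy[step]
            datafile[block_no] = value     # the live database changes ...
            log.append((block_no, value))  # ... and logs the change
        image.append(datafile[step])       # the backup copies one block
    return image, log


def recover(image, log):
    """Replay the log in order over the fuzzy image."""
    restored = list(image)
    for block_no, value in log:
        restored[block_no] = value
    return restored


live = ["a0", "b0", "c0", "d0"]
# Block 0 changes after it was copied; block 3 changes before it is copied.
image, log = online_backup(live, {1: (0, "a1"), 2: (3, "d1")})
print(image)                # a mix of old and new blocks
print(recover(image, log))  # matches the live datafile again
```

The image alone matches no state the database ever had; only the image
together with the logs written during the copy is a backup.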

> > Nope. Databases must have dedicated filesystems. And as long as
> > no files are created or removed in these filesystems
> > and no blocks are added to or removed from any files in them
> > (in other words, no change of metadata, which is normal for
> > databases), there is no chance that you will lose your database.
> > I know that not everyone follows this rule (it looks like no one
> > at AT&T does) but this is their personal problem and
> > not the problem of Unix.
> 
> I was hoping you would say that :-)
> You are talking theory.  I am talking practice.  I have demonstrated
> cases (many times) where you boot a system, mount everything, crash
> it, and upon re-boot, the filesystem is severely corrupt.
> 
Yes, I have seen it too. But don't forget that booting changes
files: at least logs, utmp/wtmp, pipes, etc. If you just
mount a filesystem and don't touch it after that, it
cannot get corrupted.

> Besides, a living database will change things on disk.  There are no
> Unix semantics to pre-allocate blocks to a file.  Some of you may
> remember the old Oracle ccf utility.  It did exactly that.  Therefore,
> you may add a block to file A, which shares a superblock sector with
> file B, have the system crash three days later, and then fsck will
> decide that file A belongs in lost+found, or, less commonly, rearrange
> it a bit.  If you never saw it, you simply did not look long enough.
> 
I don't know about ccf; I never saw it. But when I create the database,
all the blocks are allocated during creation and later the
file sizes never change (and no, the files don't have gaps inside).
And as far as I know, if I write to a block of a file that is
already allocated, the data will go to that block and it will never
be reallocated by the filesystem. So no blocks are
allocated or deallocated during normal operation and the filesystem
cannot get corrupted.
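
What I mean by "allocated during the creation" can be sketched in
Python like this. It is an illustration only: the 8192-byte block size
and the function names are assumptions of mine, and the claim that an
overwrite stays in the same disk block holds for a traditional
in-place filesystem, which I assume here. The file is written out in
full once, and afterwards blocks are only overwritten, never appended.

```python
import os
import tempfile

BLOCK = 8192  # hypothetical database block size


def preallocate(path, nblocks):
    """Create the file at full size by writing every block (no holes)."""
    zero = bytes(BLOCK)
    with open(path, "wb") as f:
        for _ in range(nblocks):
            f.write(zero)
        f.flush()
        os.fsync(f.fileno())


def overwrite_block(path, block_no, payload):
    """Overwrite one existing block in place; never extend the file."""
    if len(payload) != BLOCK:
        raise ValueError("payload must be exactly one block")
    if (block_no + 1) * BLOCK > os.stat(path).st_size:
        raise ValueError("block lies outside the preallocated file")
    with open(path, "r+b") as f:
        f.seek(block_no * BLOCK)
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        preallocate(path, 16)
        size_before = os.stat(path).st_size
        overwrite_block(path, 3, b"\x01" * BLOCK)
        size_after = os.stat(path).st_size
        with open(path, "rb") as f:
            f.seek(3 * BLOCK)
            data = f.read(BLOCK)
        return size_before, size_after, data
    finally:
        os.remove(path)


size_before, size_after, data = demo()
print(size_before, size_after)
```

The file size is the same before and after the overwrite, so there is
no size change for the filesystem to record for this file.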

> Please do not misunderstand me;  I like Unix, I love FreeBSD, but
> perfect for all occasions neither one is.
> 
I do not :-) After all, you can use logical volumes: they are
as convenient as files (if you do have an LVM) but have
less overhead.

-SB

