Date:      Wed, 04 Mar 1998 15:58:30 -0800 (PST)
From:      Simon Shapiro <shimon@simon-shapiro.org>
To:        sbabkin@dcn.att.com
Cc:        wilko@yedi.iaf.nl, tlambert@primenet.com, jdn@acp.qiv.com, blkirk@float.eli.net, hackers@FreeBSD.ORG, grog@lemis.com, karl@mcs.net
Subject:   RE: SCSI Bus redundancy...
Message-ID:  <XFMail.980304155830.shimon@simon-shapiro.org>
In-Reply-To: <C50B6FBA632FD111AF0F0000C0AD71EE4132D6@dcn71.dcn.att.com>


On 04-Mar-98 sbabkin@dcn.att.com wrote:
 
...

>> What you describe here is application-level mirroring.  It works after

> Yes, with the difference that the second copy may be located
> on a machine in another building, connected by something like FDDI,
> so you are protected against things like a fire in the computer
> room. And it is not quite a mirror; the two copies are out of sync
> by something like 10 minutes all the time. Of course, the primary
> system should still have all the hardware mirroring and the like
> (in my exact case it was not done, for political reasons; personally
> I would prefer mirroring, even instead of this scheme), so you can
> lose those 10 minutes of operational data only if the primary
> system is significantly destroyed.

I have a better solution, which was implemented here at my work (and
which I will re-implement if my employer and the guy who wrote it do
not contribute it).  This is NOT my original idea; it is so old that I
do not remember who did it first:

You start with two identical databases.  You modify the (Postgres)
libpq, or (Oracle) SQL*Net, interface to intercept all data-modifying
SQL statements (you do not care about SELECT and such).  You cache
those until you see a COMMIT.  If you see a ROLLBACK, you discard the
whole cache.  As you capture the SQL statements, you stamp each with a
high-precision timestamp (it does not have to be accurate, but it has
to be precise).  When you see a COMMIT, you ship the whole batch to a
remote machine.  The remote machine can simply log these, or apply
them against a reference database.  If you just log them, you sort
them by timestamp before you apply them.  The quality of the resultant
database is surprisingly good, especially for an OLTP system.  The
advantages are obvious.
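
To make this concrete, here is a minimal sketch of the client-side
half, assuming you wrap libpq's PQexec().  ship_to_remote() is a
hypothetical transport (a TCP socket to the standby machine would do),
and the statement classification is deliberately crude; a real
interceptor would parse properly and also skip BEGIN, SET, and such:

    #include <libpq-fe.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <strings.h>
    #include <sys/time.h>

    #define MAX_CACHED 1024

    static struct { struct timeval ts; char *sql; } cache[MAX_CACHED];
    static int ncached;

    /* hypothetical transport to the remote logger/applier */
    extern void ship_to_remote(struct timeval ts, const char *sql);

    static int is_modifying(const char *sql)
    {
        /* crude: anything that is not a SELECT is assumed to modify */
        return strncasecmp(sql, "SELECT", 6) != 0;
    }

    PGresult *logged_exec(PGconn *conn, const char *sql)
    {
        if (strncasecmp(sql, "COMMIT", 6) == 0) {
            /* transaction is good: ship the cache, oldest first */
            for (int i = 0; i < ncached; i++) {
                ship_to_remote(cache[i].ts, cache[i].sql);
                free(cache[i].sql);
            }
            ncached = 0;
        } else if (strncasecmp(sql, "ROLLBACK", 8) == 0) {
            /* transaction is dead: discard the whole cache */
            while (ncached > 0)
                free(cache[--ncached].sql);
        } else if (is_modifying(sql)) {
            if (ncached == MAX_CACHED) {
                fprintf(stderr, "cache full; statement not replicated\n");
            } else {
                /* precise (not necessarily accurate) timestamp */
                gettimeofday(&cache[ncached].ts, NULL);
                cache[ncached].sql = strdup(sql);
                ncached++;
            }
        }
        return PQexec(conn, sql);  /* statement still executes locally */
    }

On the remote side, sorting the shipped statements by that timestamp
(compare tv_sec, then tv_usec) before applying them approximates the
original execution order across client connections.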
 
...

> They cannot go far out of sync if everything is working. One of
> them is the master; it generates the database archive logs during
> operation, and these logs get applied to the secondary database.
> They are out of sync all the time, by the time necessary to
> generate, transfer, and apply these logs, but it cannot get worse
> than that.

This will only work if you can switch the database clients to the
alternative system.  Otherwise you will have a long interruption of
service, something your employer does not routinely tolerate.
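
The switch itself need not be exotic.  A minimal sketch, assuming
libpq and hypothetical host names; a real client must also cope with
transactions that were in flight when the primary died:

    #include <libpq-fe.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Try the primary first, then the standby.  Host names are
     * hypothetical. */
    PGconn *connect_with_failover(void)
    {
        const char *targets[] = {
            "host=db-primary dbname=prod",   /* hypothetical */
            "host=db-standby dbname=prod",   /* hypothetical */
        };

        for (int i = 0; i < 2; i++) {
            PGconn *conn = PQconnectdb(targets[i]);
            if (PQstatus(conn) == CONNECTION_OK)
                return conn;
            fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
        }
        return NULL;  /* this is where the long interruption begins */
    }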

I normally classify these schemes as part of a disaster recovery plan,
not routine operation.  In my terminology, backup is part of routine
operation.  Truly hot databases cannot be routinely backed up, nor
restored, without unacceptable disruption of service.  Your scheme,
which is good for disaster recovery, is not acceptable for a hot,
non-stop operation unless modified as indicated above.

 ...

> It does not try to sync. It is just an auxiliary backup system. If
> your primary system goes completely down, you can start the
> secondary system as the primary in 10 minutes. Yes, you will lose
> something like the last 0...10 minutes of operation. But you will
> still be able to provide service.

Last time AT&T lost service for 10 minutes, it ended up on TV.
Besides, if you promise 1 minute and demonstrate 10 minutes, in a real
disaster it will be 4 hours.  What is the revenue loss, per 5ESS
switch, for a 10-minute loss of service?  What is the contractual
obligation for downtime?  One of the readers of this list (works for
Sprint, I think) reminded me of the 5 minutes/year figure, or
something on that scale.
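
For scale: 99.999% availability works out to 365.25 * 24 * 60 *
0.00001, about 5.26 minutes of allowed downtime per year, so a single
10-minute outage blows nearly two years' budget.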

 ...

> Agreed. For these cases Oracle has an option named online backup.
> You tell the RDBMS that you are going to do a backup, and then you
> copy the database files (the RDBMS is still running; only
> performance is degraded due to competition for the disks). Later
> you can apply the archived logs to this image and get a working
> database.

I know of that option.  I also had to listen to a Telco customer who
detailed, in public, how this feature took 18 hours to bring the
database back up after a software-induced crash, using exactly this
mechanism.  It sounds good in a brochure; it is not worth a damn for
non-stop operations in real life.

Besides, Oracle does not support a FreeBSD port, costs a yearly salary
per copy, and does not provide source yet.

 ...

> Yes, I have had it too. But don't forget that booting changes
> files: at least logs, utmp/wtmp, pipes, etc. If you just mount some
> filesystem and don't touch it after that, it cannot get corrupted.

If you say so.  Are you willing to bet your salary, career, or life on
that statement?  In FreeBSD, I have lost /usr/src twice, and
/usr/local three times, in the last three to six months.  (Each of
these is on a separate F/S, of course.  None of them is modifiable by
the boot process, other than the clean-unmount bit, but I lost them
all the same.)  Once it was attributed to a bug/glitch in the
fdisk/disklabel/partitions/slices logic; the other times, I have no
clue.

I did not bitch about it, as it is under -current, which has no
warranty, etc.  I have had similar losses under other OSs and
versions.

Finally, even if you were totally right (which I do not think you
are), no technical executive will allow a critical database on a Unix
filesystem.  Databases get corrupted all the time, on and off Unix
filesystems.  But allowing mission-critical databases on Unix
filesystems is profane to these people, reality notwithstanding.

 ...

> Don't know about ccf, never saw it. But if I create the database,
> all the blocks are allocated during creation, and the file sizes
> never change afterwards (and no, they don't have gaps inside).
> And as far as I know, if I write to a block in a file that is
> already allocated, the data will go to that block; it will never be
> reallocated by the filesystem. So you do not have any blocks
> allocated or deallocated during normal operation, and the
> filesystem cannot get corrupted.

The reason you see these nice solid files is that ccf (which used to
be a stand-alone utility, up to Oracle version 4.1.3) is now part of
the program which creates the database.  It still does the same exact
thing: it goes and writes every byte in the Unix file.  If you have a
Unix filesystem with three or four files that were totally
pre-written, and nothing else, and you then go through the gyrations
the Oracle OSD does to circumvent the caching and such of the
filesystem, you are in effect on a raw device.  The only difference is
that you are executing 19,438 lines of ufs code, plus who knows how
many lines of VFS, FFS, whatever, in addition to the code required to
drive the device itself.
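
A ccf-style prewrite amounts to the sketch below (the block size is
illustrative, and prewrite_file() is my name for it; the assumption,
per the above, is that writing every block once forces the filesystem
to allocate them all, so later writes land in place):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Pre-write every block of a new file so that all filesystem
     * allocation happens now, not during database operation. */
    int prewrite_file(const char *path, off_t size)
    {
        char block[8192];
        int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

        if (fd < 0)
            return -1;
        memset(block, 0, sizeof(block));
        for (off_t done = 0; done < size; done += (off_t)sizeof(block)) {
            if (write(fd, block, sizeof(block)) != (ssize_t)sizeof(block)) {
                close(fd);
                return -1;
            }
        }
        fsync(fd);           /* force the blocks to disk, not just cache */
        return close(fd);
    }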

How that can be faster, or more reliable, than not running that code
at all, I do not grasp.  Remember, not executing logic is more
reliable and faster than executing it; the content of that logic is
immaterial.


Simon




