From owner-freebsd-arch  Thu Nov  2 15: 6:47 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (Postfix) with ESMTP id 261A237B4E5
	for <arch@FreeBSD.ORG>; Thu,  2 Nov 2000 15:06:40 -0800 (PST)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.9.3/8.9.3) id QAA05141;
	Thu, 2 Nov 2000 16:03:22 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209)
 via SMTP by smtp04.primenet.com, id smtpdAAAvraaSj; Thu Nov  2 16:03:05 2000
Received: (from tlambert@localhost)
	by usr09.primenet.com (8.8.5/8.8.5) id QAA20908;
	Thu, 2 Nov 2000 16:06:16 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <200011022306.QAA20908@usr09.primenet.com>
Subject: Re: Like to commit my diskprep
To: mbendiks@eunet.no (Marius Bendiksen)
Date: Thu, 2 Nov 2000 23:06:16 +0000 (GMT)
Cc: dillon@earth.backplane.com (Matt Dillon),
	rjesup@wgate.com (Randell Jesup), arch@FreeBSD.ORG
In-Reply-To: <Pine.BSF.4.05.10011022216250.13255-100000@login-1.eunet.no> from "Marius Bendiksen" at Nov 02, 2000 10:19:09 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> FFS is woefully inadequate at handling databases, due to the block
> indirection, but e.g. Oracle will allow you to run directly on top
> of a device.

Wrong.  The problem is in not exporting a transaction interface
to user space, which means that the databases have to resort to
multistage commit tactics, which can seriously damage performance.

Even worse, there is no such thing as a "region sync" (at least
in FreeBSD: other OSs have it), so you have to sync out all data
via a linear traversal of the dirty block list, using fsync() to
do the job, which further degrades performance.
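
A minimal sketch of what that looks like from user space, assuming a
hypothetical fixed-size record file (RECORD_SIZE, commit_record() and
read_record() are illustrative names, not any real database's API).
The point is only that fsync() takes a descriptor and nothing else, so
a one-record commit cannot name the region it cares about:

```python
import os
import tempfile

RECORD_SIZE = 64  # hypothetical fixed record size, for illustration only


def commit_record(fd, index, payload):
    """Write one record, then make it durable with fsync()."""
    data = payload.ljust(RECORD_SIZE, b"\0")[:RECORD_SIZE]
    os.pwrite(fd, data, index * RECORD_SIZE)
    # fsync() takes no offset or length: every dirty block of the file is
    # pushed out, including blocks dirtied by unrelated transactions.
    os.fsync(fd)


def read_record(fd, index):
    """Read one record back and strip the zero padding."""
    return os.pread(fd, RECORD_SIZE, index * RECORD_SIZE).rstrip(b"\0")


def demo():
    """Commit two unrelated records; each commit flushes the whole file."""
    fd, path = tempfile.mkstemp()
    try:
        commit_record(fd, 0, b"txn-a")
        commit_record(fd, 3, b"txn-b")
        return read_record(fd, 0), read_record(fd, 3), os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling demo() commits two records and reads them back; nothing in the
interface lets the second commit avoid paying for the first one's
blocks if they are still dirty.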

This is greatly exacerbated by the fact that you really only want
to commit one transaction at a time, not all transactions, and
each transaction is going to do the same thing, not knowing about
the others (this is where the block indirection hurts you, but it's
not because block indirection is bad, it's because what has to
be done on top of the FS semantics to get transaction guarantees
interacts badly with it when the API is deficient).

All of this is further exacerbated out of all proportion by the
fact that this means that concurrent transactions can not proceed
concurrently across an fsync(2), since by its nature, you can't
fsync(2) unless all data to be fsync(2)'ed represents completed
transactions.  So you must implement phase concurrency, to ensure
that the second phase of a two phase commit isn't started with
the first phase of a two phase commit outstanding, or you must
serialize operations through a turnstile algorithm.

Either way, all of this turns fsync(2) into a phase stall barrier
(at best) or a full on stalling barrier (at worst).
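
A rough sketch of the turnstile option, assuming a single process-wide
lock (the names turnstile, commit() and run() are hypothetical).  Each
writer has to hold the lock across both its write and the fsync(), so
the sync becomes the point where every other writer stalls:

```python
import os
import tempfile
import threading

turnstile = threading.Lock()  # hypothetical: one commit in flight at a time


def commit(fd, record, order):
    """Append one record and sync it while holding the turnstile."""
    with turnstile:
        os.write(fd, record)
        # Nobody else may dirty the file until this returns, otherwise
        # the fsync() could push out a half-finished transaction.
        os.fsync(fd)
        order.append(record)


def run(records):
    """Commit each record from its own thread; return commit order and file."""
    fd, path = tempfile.mkstemp()
    order = []
    try:
        threads = [
            threading.Thread(target=commit, args=(fd, r, order))
            for r in records
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        with open(path, "rb") as f:
            on_disk = f.read()
        return order, on_disk
    finally:
        os.close(fd)
        os.unlink(path)
```

The threads are concurrent in name only: the file ends up holding the
records in exactly the order the turnstile admitted them, one full
fsync() apart.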

For an O(2) taste of this O(6) (O(4)*O(2)) problem, compare
normal write operations between an FS mounted with soft updates,
and the same FS mounted synchronous.

And now you see the database problem.

It should be obvious that transactions could be implemented as
two node single edges, in the context of soft updates, and the
resulting transaction interface exported to user space, and
used by a database application, to turn it back into an O(1)
problem: which is what a database vendor does when they use a
raw disk.
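
To make "two node single edge" concrete, here is a toy model, not the
soft updates code and not a proposed kernel interface: Node,
make_transaction() and flush() are hypothetical names.  Each
transaction is a data node and a commit node joined by one edge, and
the flusher only has to respect that edge:

```python
class Node:
    """One pending write; depends_on is the single edge it must wait for."""

    def __init__(self, name):
        self.name = name
        self.depends_on = None
        self.written = False


def make_transaction(name):
    """Build a two-node transaction: the commit waits on the data."""
    data = Node(name + ":data")
    commit = Node(name + ":commit")
    commit.depends_on = data
    return data, commit


def flush(nodes):
    """Write nodes in any order that respects the edges; return that order."""
    order = []
    pending = list(nodes)
    while pending:
        progressed = False
        for node in list(pending):
            dep = node.depends_on
            if dep is None or dep.written:
                node.written = True
                order.append(node.name)
                pending.remove(node)
                progressed = True
        if not progressed:
            raise ValueError("dependency cycle or missing node")
    return order
```

Two transactions handed to flush() in any order come out with each
data node ahead of its own commit node, and neither transaction waits
on the other: there is no global barrier anywhere in the loop.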

In other words, the FFS _is_ a database, with some important
semantics not being exported for use by database software which
may be layered on top of it.  This is also a good place to see
that implementing soft updates as a graph of precomputed node
relationships was probably not the wisest move, since mount
time computation of the relationships (accompanied by node-node
edge [dependency] resolving code), would have let you implement
a transaction layer as a stacking layer, not to mention that
it would let you apply soft updates to other layered FSs, even
statically configured stacks (like, say, EXT2FS).

NB: obviously O(1) is the base order, since dependent transactions
will increase the natural order by the number of dependants, but
for an N of 3, would you rather have an O(1)*O(3)=O(3) event, or
an O(2)*O(4)*O(3)=O(9) event... and that with no concurrency of
other independent operations hitting the database at the same time?

For 1024 records, that's 1024**3 (10**9) or 1024**9 (10**27),
for the worst case.
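
For anyone who wants to check the two powers, a few lines are enough
(magnitude() is just a helper name for this note; it returns the exact
value and its power of ten):

```python
def magnitude(base, exponent):
    """Return base**exponent and the power of ten of its leading digit."""
    value = base ** exponent
    return value, len(str(value)) - 1


small = magnitude(1024, 3)   # the O(3) case
large = magnitude(1024, 9)   # the O(9) case
```

Here small[1] is 9 and large[1] is 27, so the gap between the two
cases is eighteen orders of magnitude.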


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

