From: Terry Lambert
Message-Id: <200011022306.QAA20908@usr09.primenet.com>
Subject: Re: Like to commit my diskprep
To: mbendiks@eunet.no (Marius Bendiksen)
Date: Thu, 2 Nov 2000 23:06:16 +0000 (GMT)
Cc: dillon@earth.backplane.com (Matt Dillon), rjesup@wgate.com (Randell Jesup), arch@FreeBSD.ORG
In-Reply-To: from "Marius Bendiksen" at Nov 02, 2000 10:19:09 PM

> FFS is woefully inadequate at handling databases, due to the block
> indirection, but e.g. Oracle will allow you to run directly on top
> of a device.

Wrong. The problem is in not exporting a transaction interface to user space, which means that the databases have to resort to multistage commit tactics, which can seriously damage performance.

Even worse, there is no such thing as a "region sync" (at least in FreeBSD; other OSs have it), so you have to sync out all data via a linear traversal of the dirty block list, using fsync(2) to do the job, which further degrades performance.
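
To make the multistage commit concrete, here is a rough sketch of what a database layered on a plain file ends up doing when fsync(2) is the only durability primitive it has. The journal format, the file names, and the commit_record() and demo() helpers are all invented for illustration; the only real calls are Python's os.open, os.write, os.lseek, os.read, and os.fsync. Each stage has to reach the disk before the next one may start, so a single logical update costs three whole-file syncs.

```python
import os
import tempfile


def commit_record(data_fd, journal_fd, offset, payload):
    """Hypothetical three-stage commit; returns how many fsync() calls it made."""
    syncs = 0

    # Stage 1: log the intent so recovery knows an update was in flight.
    os.write(journal_fd, b"BEGIN %d %d\n" % (offset, len(payload)))
    os.fsync(journal_fd)  # flushes the whole journal file, not just this entry
    syncs += 1

    # Stage 2: write the record itself.
    os.lseek(data_fd, offset, os.SEEK_SET)
    os.write(data_fd, payload)
    os.fsync(data_fd)  # flushes every dirty block of the file, not one region
    syncs += 1

    # Stage 3: mark the transaction as committed.
    os.write(journal_fd, b"COMMIT %d\n" % offset)
    os.fsync(journal_fd)
    syncs += 1
    return syncs


def demo():
    """Commit one record into scratch files and read it back."""
    with tempfile.TemporaryDirectory() as d:
        data_fd = os.open(os.path.join(d, "table.db"),
                          os.O_RDWR | os.O_CREAT, 0o600)
        journal_fd = os.open(os.path.join(d, "journal.log"),
                             os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        try:
            syncs = commit_record(data_fd, journal_fd, 0, b"hello")
            os.lseek(data_fd, 0, os.SEEK_SET)
            stored = os.read(data_fd, 5)
        finally:
            os.close(data_fd)
            os.close(journal_fd)
    return syncs, stored
```

Calling demo() commits one five-byte record and returns the sync count together with the bytes read back. A region sync or an exported transaction interface would let stages like these be expressed as ordering constraints instead of three full flushes.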

This is greatly exacerbated by the fact that you really only want to commit one transaction at a time, not all transactions, and each transaction is going to do the same thing, not knowing about the others. (This is where the block indirection hurts you, but it's not because block indirection is bad; it's because what has to be done on top of the FS semantics to get transaction guarantees interacts badly with it if the API is deficient.)

All of this is further exacerbated out of all proportion by the fact that concurrent transactions can not proceed concurrently across an fsync(2), since by its nature, you can't fsync(2) unless all data to be fsync(2)'ed represents completed transactions. So you must implement phase concurrency, to ensure that the second phase of a two phase commit isn't started with the first phase of a two phase commit outstanding, or you must serialize operations through a turnstile algorithm. Either way, all of this turns fsync(2) into a phase stall barrier (at best) or a full on stalling barrier (at worst).

For an O(2) taste of this O(6) (O(4)*O(2)) problem, compare normal write operations between an FS mounted with soft updates, and the same FS mounted synchronous. And now you see the database problem.

It should be obvious that transactions could be implemented as two node single edges, in the context of soft updates, and the resulting transaction interface exported to user space, and used by a database application, to turn it back into an O(1) problem: which is what a database vendor does when they use a raw disk.

In other words, the FFS _is_ a database, with some important semantics not being exported for use by database software which may be layered on top of it.
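
The turnstile serialization mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the Turnstile class, its counters, and the run() helper are hypothetical names, and a real database would gate much more than a single write. The point it shows is that once fsync(2) is the only barrier available, at most one transaction can sit between its write and its sync, so otherwise independent transactions queue up behind each other.

```python
import os
import tempfile
import threading


class Turnstile:
    """Hypothetical gate: one transaction at a time from write to fsync()."""

    def __init__(self, fd):
        self.fd = fd
        self.lock = threading.Lock()
        self.inside = 0        # transactions currently past the gate
        self.max_inside = 0    # highest value 'inside' ever reached

    def commit(self, record):
        # Everyone else stalls here until the current holder has synced.
        with self.lock:
            self.inside += 1
            self.max_inside = max(self.max_inside, self.inside)
            os.write(self.fd, record)
            os.fsync(self.fd)  # the stalling barrier described in the text
            self.inside -= 1


def run(n_threads=8):
    """Push n_threads independent transactions through one turnstile."""
    with tempfile.TemporaryDirectory() as d:
        fd = os.open(os.path.join(d, "table.db"),
                     os.O_RDWR | os.O_CREAT | os.O_APPEND, 0o600)
        try:
            gate = Turnstile(fd)
            threads = [
                threading.Thread(target=gate.commit, args=(b"txn%02d\n" % i,))
                for i in range(n_threads)
            ]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
            os.lseek(fd, 0, os.SEEK_SET)
            lines = os.read(fd, 4096).splitlines()
        finally:
            os.close(fd)
    return gate.max_inside, sorted(lines)
```

Running run() starts eight threads that never touch each other's records, yet the gate never admits more than one of them at a time. That lost concurrency is the cost the turnstile approach pays for correctness.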

This is also a good place to see that implementing soft updates as a graph of precomputed node relationships was probably not the wisest move, since mount time computation of the relationships (accompanied by node-node edge [dependency] resolving code) would have let you implement a transaction layer as a stacking layer, not to mention that it would let you apply soft updates to other layered FSs, even statically configured stacks (like, say, EXT2FS).

NB: obviously O(1) is the base order, since dependent transactions will increase the natural order by the number of dependents, but for an N of 3, would you rather have an O(1)*O(3)=O(3) event, or an O(2)*O(4)*O(3)=O(9) event... and that with no concurrency of other independent operations hitting the database at the same time? For 1024 records, that's 1024**3 (about 10**9) versus 1024**9 (about 1.2*(10**27)), for the worst case.

Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present or previous employers.