Date: Sun, 04 Aug 2002 23:41:17 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Lamont Granquist <lamont@scriptkiddie.org>
Cc: "Justin T. Gibbs" <gibbs@scsiguy.com>,
	Zhihui Zhang <zzhang@cs.binghamton.edu>, freebsd-hackers@FreeBSD.ORG,
	freebsd-scsi@FreeBSD.ORG
Subject: Re: transaction ordering in SCSI subsystem

Lamont Granquist wrote:
> So what exactly gets ordered and how do things get tagged?
> 
> I tried following this in the code from VOP_STRATEGY and never quite
> figured it out.  Basically when you do a write are you just tagging the
> data writes along with the metadata writes and then sequencing them so
> that they have to complete in a given order?  And can operations with
> different tags be mixed around randomly?
> 
> Also, how does the feedback from the SCSI controller that the write
> completed get used by the O/S

Requests are issued to CAM.  CAM issues requests with tags to
SCSI controller.  SCSI controller issues commands to target on
SCSI bus using a tag.  Target completes command, issues "completed"
on tag.  SCSI controller writes status to memory for request struct.
SCSI controller issues interrupt.  ISR in SCSI driver runs, and
notes completed request.  ISR notifies CAM.

Operations on tags may be concurrently outstanding.  There are a
limited number of concurrent operations permitted to be outstanding,
as dictated by the number of tags supported by a physical disk
drive.
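A toy model of the tag limit may make this concrete.  This is only
a sketch: ToyTarget, run() and the random completion order are
made up for illustration, and none of it is CAM code.  It shows
requests being issued in order until the tags run out, and the
target completing them in whatever order it likes.

```python
import random


class ToyTarget:
    """Toy model of a disk that accepts up to max_tags outstanding commands."""

    def __init__(self, max_tags, seed=0):
        self.max_tags = max_tags
        self.outstanding = {}                    # tag -> request
        self.free_tags = list(range(max_tags))
        self.rng = random.Random(seed)           # stands in for the drive's own scheduling

    def issue(self, request):
        """Attach a free tag to the request; return None if every tag is busy."""
        if not self.free_tags:
            return None
        tag = self.free_tags.pop()
        self.outstanding[tag] = request
        return tag

    def complete_one(self):
        """The target finishes any one outstanding command, in an order of its choosing."""
        tag = self.rng.choice(sorted(self.outstanding))
        request = self.outstanding.pop(tag)
        self.free_tags.append(tag)
        return tag, request


def run(requests, max_tags):
    """Issue requests in submission order; return (completion order, peak outstanding)."""
    target = ToyTarget(max_tags)
    pending = list(requests)
    completed = []
    peak = 0
    while pending or target.outstanding:
        # Fill every free tag, preserving submission order.
        while pending and target.issue(pending[0]) is not None:
            pending.pop(0)
        peak = max(peak, len(target.outstanding))
        completed.append(target.complete_one()[1])
    return completed, peak
```

With ten requests and four tags, no more than four are ever
outstanding at once, and the completion order generally differs
from the submission order.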

Operations which can occur concurrently are requested concurrently;
the order in which they complete does not matter.

Operations which can *not* occur concurrently are requested only
serially.  This serialization is called a "stall barrier": the
next operation is not attempted until the previous operation has
been committed to stable storage.
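As a rough illustration of a stall barrier, here is a sketch using
a thread pool.  The function name issue_with_barrier is invented,
and "committed to stable storage" is modeled simply as the
callable returning; a real FS waits on I/O completion instead.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def issue_with_barrier(independent_ops, dependent_op, workers=4):
    """Run independent_ops concurrently, stall until all finish, then run dependent_op."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(op) for op in independent_ops]
        wait(futures)                  # the stall barrier: nothing else is issued yet
        for f in futures:
            f.result()                 # surface any failure before proceeding
        # Only now is the operation that depends on the others attempted.
        return pool.submit(dependent_op).result()
```

The independent operations may finish in any order; the dependent
one never starts before all of them are done.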

Operations at the CAM layer are proxied transactions; as Justin
stated, operations queued to CAM are guaranteed to be queued to
the underlying physical device in the same order.

The FS is responsible for introducing stall barriers, as necessary,
to ensure metadata integrity.  If the FS guarantees user data
integrity as well, then it must introduce stall barriers for that,
as well.

The minimal requirement for end-to-end data integrity is for the
operating system to guarantee metadata integrity -- transactional
idempotence of operations in order to guarantee atomicity -- and
the application to provide user data integrity through proper use
of metadata operation ordering in order to implement user data
transactioning.  Usually, this includes explicit data synchronization
to disk using fsync(2) calls, if user data integrity is required.
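One common way for an application to do this is to write a new
copy of the file, fsync it, and only then rename it over the old
one.  The sketch below assumes a POSIX system; the function name
durable_replace and the ".tmp" suffix are made up for the example,
and whether the result survives a crash still depends on the FS
and on the drive honestly reporting completion.

```python
import os


def durable_replace(path, data):
    """Replace the file at path with data, syncing user data before the rename."""
    tmp = path + ".tmp"                      # hypothetical temporary name
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                          # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                         # user data is synced before the rename is requested
    finally:
        os.close(fd)
    # The rename is a metadata operation; readers see either the old or the new file.
    os.rename(tmp, path)
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)                     # ask for the directory update to be synced as well
    finally:
        os.close(dir_fd)
```

The ordering is the point: the fsync(2) on the data comes before
the metadata operation that makes the data visible under its
final name.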

In most cases, user data integrity is implied; if, on the other
hand, you have separate files for data record indexing and data
record storage, you must provide for explicit synchronization,
because you are implying application metadata within user data
regions of files, in order to provide services on top of the OS
platform, which the OS platform itself does not provide.
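For the index-plus-data case, the explicit synchronization might
look like the sketch below.  The 12-byte index entry layout
(offset, length) and the function names are assumptions made for
the example, not any particular database's format.  The record is
synced before the index entry that points at it is written, so the
index should never refer to data that was not yet written.

```python
import os
import struct

ENTRY = struct.Struct("<QI")    # hypothetical index entry: 8-byte offset, 4-byte length


def append_record(data_path, index_path, payload):
    """Append payload to the data file, sync it, then publish it in the index."""
    with open(data_path, "ab") as data:
        offset = data.seek(0, os.SEEK_END)
        data.write(payload)
        data.flush()
        os.fsync(data.fileno())     # barrier: the record is synced before it is indexed
    with open(index_path, "ab") as index:
        index.write(ENTRY.pack(offset, len(payload)))
        index.flush()
        os.fsync(index.fileno())
    return offset


def read_record(data_path, index_path, n):
    """Look up the n-th index entry and return the record it points at."""
    with open(index_path, "rb") as index:
        index.seek(n * ENTRY.size)
        offset, length = ENTRY.unpack(index.read(ENTRY.size))
    with open(data_path, "rb") as data:
        data.seek(offset)
        return data.read(length)
```

A crash between the two fsync calls leaves an unindexed record at
the end of the data file, which is wasted space rather than
corruption.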

There are several ways for an FS to ensure metadata integrity.

The easiest to implement is synchronous metadata operations.  This
implies a stall barrier after each metadata operation, prohibiting
subsequent metadata operations until the single outstanding
operation permitted by the FS is committed to stable storage.  In
this way, metadata operation ordering is assured.

The second easiest to implement is ordered metadata operations.
This is accomplished by dividing metadata operations into sets of
"dependent" and "independent" operations.  Operations which are
"independent" are permitted to occur concurrently.  Operations
which are "dependent" imply a stall barrier.  This method is
formally called "Delayed Order Writes", or "DOW".  There are two
USL patents on this (both assigned to Novell).  For this reason,
if you want to sell your FS in the U.S., you will not use this
approach.
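The dependent/independent split can be pictured with a toy
scheduler.  This is only an illustration of the partitioning idea
described above, not the patented algorithm; the function name and
the operation names in the usage note are invented.

```python
def schedule_with_barriers(ops):
    """Group operations into batches separated by stall barriers.

    ops is a list of (name, set_of_names_it_depends_on) in submission order.
    Operations inside one batch may be outstanding concurrently.
    """
    batches = [[]]
    in_flight = set()
    for name, deps in ops:
        if deps & in_flight:        # depends on something not yet committed
            batches.append([])      # stall barrier: drain the current batch first
            in_flight = set()
        batches[-1].append(name)
        in_flight.add(name)
    return batches
```

For example, given alloc_inode, write_data, add_dirent (which
depends on alloc_inode) and update_atime, the first two go out
together, a barrier follows, and the last two go out together.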

The third method is much more difficult to implement, since it
requires an understanding of graph theory.  It's called "soft
updates" (sometimes it's called "soft dependencies") and was
invented by Gregory Ganger and Yale Patt.  Operations are
registered in dependency order into a graph, and stall barriers
are only introduced on non-commutative edge traversals.  This ends
up introducing far fewer stall barriers, overall.  In addition,
operations which roll forward then backward (e.g. access timestamp
updates on intermediate object files which are deleted as part of
a compilation process) are never committed to disk; thus only
permanent changes end up committed, so long as the operations occur
within the update clock time window.  If an operation occurs that
requires a stall barrier, then a stall barrier is introduced.
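A very rough sketch of the graph idea follows.  PendingUpdates and
its methods are invented for illustration and bear no resemblance
to the real soft updates code; it needs Python 3.9 or later for
graphlib.  Updates are held in memory with their dependencies, an
update that is rolled back before the flush never reaches disk,
and the flush writes the survivors in dependency order.

```python
from graphlib import TopologicalSorter


class PendingUpdates:
    """Toy dependency tracker: updates are held in memory until flush()."""

    def __init__(self):
        self.deps = {}      # update -> set of updates that must reach disk first

    def add(self, update, after=()):
        self.deps[update] = set(after)

    def cancel(self, update):
        """Roll back an unflushed update, along with anything that depended on it."""
        doomed = {update}
        changed = True
        while changed:
            changed = False
            for node, before in self.deps.items():
                if node not in doomed and before & doomed:
                    doomed.add(node)
                    changed = True
        for node in doomed:
            self.deps.pop(node, None)

    def flush(self):
        """Return a write order that respects every recorded dependency, then clear."""
        order = list(TopologicalSorter(self.deps).static_order())
        self.deps.clear()
        return order
```

If a temporary file's inode and directory entry are both
registered and then cancelled inside the window, flush() returns
only the updates for files that still exist.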

While it's technically possible to export a transactioning interface
to user space programs for all three of these approaches, in practice
it is difficult to implement properly.  The easiest approach is to
simply extend the graph edge in the soft updates case.  This has the
additional benefit, in a stacking vnode architecture, of avoiding the
normally introduced stall barriers that occur between VFS layers,
unless there are real dependencies (i.e. the VFS/VFS boundary will
normally introduce an artificial stall barrier).  For this to be done
in FreeBSD would require generalizing the soft updates dependency
graph relationship code, to permit registration of node/node edge
dependency resolvers (which are explicit in the current soft updates
implementation).

So the answer to your question is that metadata writes and data
writes are treated separately, and you must write code in your
application to deal with user data, rather than relying on the OS to
do it for you.  For more information on how to deal with this, take
a 300 level database class at your local university and/or do a
search on the phrase "two stage commit".
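For the impatient, the protocol that search turns up (usually
written "two-phase commit") has roughly the shape below.  The
Participant class and its in-memory state are stand-ins invented
for the example; a real participant would make its pending change
durable, e.g. with fsync(2), before voting yes.

```python
class Participant:
    """Stand-in for one resource taking part in a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # A real participant would write and sync its intent record here.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Phase 1: collect votes.  Phase 2: commit only if every vote was yes."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Either every participant ends up committed or every participant
ends up aborted; that is the property an index file and a data
file need between them.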


> (and the corollary being how does IDE write
> caching lying about completion affect the O/S and the data integrity)?

If a drive lies about having committed data to stable storage (it
doesn't matter if it's an IDE drive or a SCSI drive, but IDE drives
tend to be chronic liars), then it causes the SCSI controller to
lie to CAM.  When the SCSI controller lies to CAM, then it causes
CAM to lie to the VFS.  When CAM lies to the VFS, the VFS lies
about metadata integrity guarantees, and lies about user data having
been committed to stable storage before the fsync(2) call returns.
After which the kernel lies to the application program, and the
application program lies to the human running it.

Moral: do not buy hardware which lies to you, unless you want to
have your software lie to you.

-- Terry
