From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 23:55:22 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id AE2441065672
	for <arch@freebsd.org>; Mon, 31 Mar 2008 23:55:22 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 831758FC15
	for <arch@freebsd.org>; Mon, 31 Mar 2008 23:55:22 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VNt7Jf030309;
	Mon, 31 Mar 2008 16:55:07 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VNt7ZY030308;
	Mon, 31 Mar 2008 16:55:07 -0700 (PDT)
Date: Mon, 31 Mar 2008 16:55:07 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803312355.m2VNt7ZY030308@apollo.backplane.com>
To: Bakul Shah <bakul@bitblocks.com>
References: <20080331231829.26D865B50@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	Alfred Perlstein <alfred@freebsd.org>, qpadla@gmail.com,
	arch@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 23:55:22 -0000


:Let us take the mtron MSD-SATA3025-032 device for example. It
:has a capacity of 32GB + can do 16,000 random & 78,000
:sequential reads per second (of 512 byte blocks). If you
:write a root block every megabyte, you have 2^15 potential
:root blocks.  Locating the latest one will require 16 random
:reads + a scan of at most 1MB; which translates to about
:26ms.  Not too bad since this cost is incurred only on
:the first mount or reboot.

    Yah, I think for NAND filesystems crash recovery is actually the easiest
    issue to deal with since all you need to do, pretty much, is rewind
    your append pointer a bit.  Not only can you rewind the filesystem
    to an older state, but you can also reconstruct a great deal of the
    unsynced in-memory data by doing a limited reverse scan.  It just does
    not take very long to do that and it can be done automatically by the
    filesystem at mount time.

    For example, if you write some file data and the new file block is
    flushed to flash but the related meta-data changes for the pointers to
    the now relocated data block have not yet been flushed to flash, on
    crash recovery it is possible to note this condition (the old physical
    block number would be stored in the aux data area of the new one) and
    regenerate the missing meta-data changes instead of being forced to
    back-out the write.

    Being able to do this has fairly substantial consequences because it
    means the fsync() only needs to flush some of the modified pages, not
    all of them, and that the remaining modified pages could in fact
    remain unflushed in the system cache and STILL be recoverable after
    a crash because their modification was mearly a side effect of the
    operation that *was* flushed to flash.  Once you are able to do that,
    you can also simply decide not to synchronously flush that meta data
    at all and thus allow multiple changes to accumulate in system memory
    before flushing to flash.

    There is one issue here and that is the transactional nature of most
    filesystem operations.  For example, if you append to a file with
    a write() you are not only writing new data to the backing store, you
    are also updating the on-disk inode's st_size field.  Those are two
    widely separated pieces of information which must be transactionally
    bound --- all or nothing.  In this regard the crash recovery code needs
    to understand that they are bound and either recover the whole
    transaction or none of it. 

    Once you get to that point you also have to worry about
    interdependancies between transactions... a really sticky issue
    that is the reason for a huge chunk of softupdate's complexity in
    UFS.  Basically you can wind up with interdependant transactions
    which must be properly recovered.  An example would be doing a write()
    and then doing another write().  The second write() cannot be recovered
    unless the first can also be recovered.  Separate transactions, but with
    a dependancy.

    Such interdependancies can become arbitrarily complex the longer
    you leave meta-data changes unflushed.  The question ultimately becomes
    whether the recovery code can deal with the complexity or not.  If not
    you may be forced to flush rather then create the interdependancy.

    HAMMER has precisely this issue with it's UNDO sequencing.  The complexity
    is in the algorith, but not the (fairly small) amount of time it takes
    to actually perform the recovery operation.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>