From owner-freebsd-arch@FreeBSD.ORG  Sun Mar 30 19:15:55 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C8479106564A
	for <arch@freebsd.org>; Sun, 30 Mar 2008 19:15:55 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from outI.internet-mail-service.net (outi.internet-mail-service.net
	[216.240.47.232])
	by mx1.freebsd.org (Postfix) with ESMTP id A6D348FC19
	for <arch@freebsd.org>; Sun, 30 Mar 2008 19:15:55 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160)
	by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP;
	Sun, 30 Mar 2008 17:47:33 -0700
Received: from julian-mac.elischer.org (localhost [127.0.0.1])
	by idiom.com (Postfix) with ESMTP id 52F7E2D6B2B;
	Sun, 30 Mar 2008 12:15:53 -0700 (PDT)
Message-ID: <47EFE6EA.4000804@elischer.org>
Date: Sun, 30 Mar 2008 12:15:54 -0700
From: Julian Elischer <julian@elischer.org>
User-Agent: Thunderbird 2.0.0.12 (Macintosh/20080213)
MIME-Version: 1.0
To: Kirk McKusick <mckusick@mckusick.com>
References: <200803292353.m2TNrCOW094875@chez.mckusick.com>
In-Reply-To: <200803292353.m2TNrCOW094875@chez.mckusick.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Mar 2008 19:15:55 -0000

Kirk McKusick wrote:
> You should try running your experiment using ZFS. Because it is a
> non-overwriting filesystem, it might work better with flash.

Trouble is, the amount of RAM it needs might make it unsuitable for
embedded systems.

> 
> 	Kirk McKusick


From owner-freebsd-arch@FreeBSD.ORG  Sun Mar 30 20:16:56 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D0B9E106564A
	for <arch@freebsd.org>; Sun, 30 Mar 2008 20:16:56 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id A0FC58FC25
	for <arch@freebsd.org>; Sun, 30 Mar 2008 20:16:56 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2UKGuqg015128;
	Sun, 30 Mar 2008 13:16:56 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2UKGuZA015127;
	Sun, 30 Mar 2008 13:16:56 -0700 (PDT)
Date: Sun, 30 Mar 2008 13:16:56 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803302016.m2UKGuZA015127@apollo.backplane.com>
To: Kirk McKusick <mckusick@mckusick.com>, arch@freebsd.org
References: <200803292353.m2TNrCOW094875@chez.mckusick.com>
Cc: 
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Mar 2008 20:16:57 -0000

:You should try running your experiment using ZFS. Because it is a
:non-overwriting filesystem, it might work better with flash.
:
:	Kirk McKusick

   I'm assuming ZFS still has to update indices and indirect blocks, though,
   which are the primary source of random updates in all filesystems.

   The right way to deal with flash is *NOT* to require that the filesystem
   be smart about flash storage, but instead to implement an intermediate
   storage layer which linearizes the writes to flash and removes all
   random erases from the critical path.  This also causes erasures to
   be evenly spread out on the flash unit and *GREATLY* extends the life
   of the flash device (to the point where you can just treat it as a disk
   and not have to worry about wearing out cells).

   I wrote precisely that 20 years ago for the flash filesystem I built
   for our telemetry RTUs.  Of course, 20 years ago flash devices were
   much smaller, only 1-4MB per chip.  But the concept is sound and with
   proper design can be implemented for much larger devices.

   Basically the general idea is as follows:  Break the flash into three
   pieces:  Two sector translation tables and one bulk storage area.
   Whenever a modification is made that involves transitioning bits
   from 0->1 (1->0 doesn't need an erase cycle) instead of erasing the
   flash sector all you do is allocate a new flash sector, append
   an entry to the translation table, and write the data out to the 
   new flash sector.  The logical block is now renumbered.  You cache
   (some or all of) the translation table in-memory for fast access.
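
A minimal sketch of the 0->1 rule described above, assuming NOR-style
flash where erased bits read as 1 and programming can only clear bits.
The function name needs_erase is hypothetical and not taken from any
real flash driver:

```python
def needs_erase(old: bytes, new: bytes) -> bool:
    """Return True if writing `new` over `old` needs any bit to go 0 -> 1."""
    if len(old) != len(new):
        raise ValueError("old and new must be the same length")
    # A bit that is set in `new` but clear in `old` cannot be programmed
    # in place; it can only come back via an erase cycle.
    return any(n & ~o & 0xFF for o, n in zip(old, new))


# Clearing bits (1 -> 0) can be done in place.
print(needs_erase(b"\xff", b"\x0f"))
# Setting a cleared bit (0 -> 1) cannot.
print(needs_erase(b"\x0f", b"\xff"))
```

In the scheme above, a write for which this returns True would go to a
freshly allocated sector instead of triggering an erase.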

   * Appends to the translation table only involve 1->0 transitions.  You
     don't even have to zero-out the old translation but can use it for
     crash recovery purposes.  Thus no erasures are needed until the table
     becomes full.

   * Any non-trivial overwrites append a new sector, again involving only
     1->0 transitions and requiring no erasures.

   * When the translation table becomes full you repack it into the second
     translation table (which then becomes the primary table), and erase
     the previous table.  You ping-pong the tables (that's why there are
     two).

   * Bulk space can be allocated linearly until the flash becomes full,
     then erased/repacked (you also switch to the alternate translation
     table when doing the repacking of the bulk space).  This can be a
     little tricky but as long as you leave one erase-sector's worth of
     space available you can always repack the flash without any
     possibility of losing data.

     This latter operation is the most expensive but once some space is freed
     up it is possible to pack simultaneously with running new ops, or to
     repack continuously as a background operation when space is tight, as
     long as you don't get twisted up with a full translation table.
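
The points above can be modelled roughly as follows. This is a toy
sketch under stated assumptions, not the original RTU implementation:
the class and attribute names are hypothetical, flash is simulated with
Python lists, and repacking of the bulk area is left out (the sketch
simply raises when the bulk area fills up).

```python
class TranslationLayer:
    """Toy model: append-only translation tables plus a linear bulk area."""

    def __init__(self, bulk_sectors: int, table_entries: int):
        self.bulk = [None] * bulk_sectors   # None models an erased sector
        self.next_free = 0                  # linear allocation cursor
        self.tables = [[], []]              # two tables, used ping-pong
        self.active = 0                     # index of the primary table
        self.table_entries = table_entries  # capacity of each table
        self.map = {}                       # in-memory cache: logical -> physical
        self.table_erases = 0               # erases done outside the write path

    def write(self, logical: int, data: bytes) -> None:
        if self.next_free >= len(self.bulk):
            raise RuntimeError("bulk area full; a bulk repack would be needed")
        if len(self.tables[self.active]) >= self.table_entries:
            self._repack_table()
            if len(self.tables[self.active]) >= self.table_entries:
                raise RuntimeError("translation table too small for live data")
        phys = self.next_free
        self.next_free += 1
        self.bulk[phys] = data                            # program a fresh sector
        self.tables[self.active].append((logical, phys))  # append-only entry
        self.map[logical] = phys                          # old sector is now stale

    def read(self, logical: int) -> bytes:
        return self.bulk[self.map[logical]]

    def _repack_table(self) -> None:
        spare = 1 - self.active
        # Write only the live mappings into the spare table first...
        self.tables[spare] = sorted(self.map.items())
        # ...then erase the old table and swap roles.
        self.tables[self.active] = []
        self.table_erases += 1
        self.active = spare


ftl = TranslationLayer(bulk_sectors=8, table_entries=3)
ftl.write(0, b"a")
ftl.write(1, b"b")
ftl.write(0, b"c")   # overwrite: new sector, no erase
ftl.write(1, b"d")   # table was full: repack into the second table
print(ftl.read(0), ftl.read(1), ftl.table_erases)
```

Note that the overwritten data stays in its old sector until a bulk
repack reclaims it; no bulk sector is erased on the write path.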

    The only 'hard' bit about this design is you need to come up with a
    translation table topology that works for large flash devices.  My
    flash filesystem of long ago just used a linear array and cached the
    whole thing with a hash table in memory, so it didn't require a
    sophisticated topology on-flash.  But for a large flash device you
    probably need something a bit more sophisticated that still does not
    involve erase cycles in the critical path.  The critical point, however,
    is that the on-flash translation table does NOT need to be optimal
    because you can mirror or cache elements of it in-memory.
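
As an illustration of the linear-array-plus-cache idea, here is a small
sketch of rebuilding the in-memory table by replaying the on-flash
entries in order. The (logical, physical) entry format and both function
names are assumptions made for the example, not details from the
original filesystem.

```python
def rebuild_cache(table_entries):
    """Replay an append-only table; the last entry for a logical sector wins."""
    cache = {}
    for logical, physical in table_entries:
        cache[logical] = physical
    return cache


def stale_sectors(table_entries, cache):
    """Physical sectors that were written once but are no longer referenced."""
    written = {physical for _, physical in table_entries}
    return sorted(written - set(cache.values()))


log = [(0, 0), (1, 1), (0, 2)]   # logical 0 was rewritten into physical 2
cache = rebuild_cache(log)
print(cache, stale_sectors(log, cache))
```

Because superseded entries are never zeroed out, the same replay serves
for crash recovery: whatever made it into the table before the crash is
what the cache reflects.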

    In any case, that's really the only acceptable way to do a flash
    filesystem and still be able to guarantee proper wear characteristics
    for the flash cells.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Sun Mar 30 20:31:26 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9CAFE1065672
	for <arch@freebsd.org>; Sun, 30 Mar 2008 20:31:26 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 52B878FC20
	for <arch@freebsd.org>; Sun, 30 Mar 2008 20:31:26 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (unknown [192.168.61.3])
	by phk.freebsd.dk (Postfix) with ESMTP id DF50B17104;
	Sun, 30 Mar 2008 20:31:24 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2UKVNjh012859;
	Sun, 30 Mar 2008 20:31:24 GMT (envelope-from phk@critter.freebsd.dk)
To: Matthew Dillon <dillon@apollo.backplane.com>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Sun, 30 Mar 2008 13:16:56 MST."
	<200803302016.m2UKGuZA015127@apollo.backplane.com> 
Date: Sun, 30 Mar 2008 20:31:23 +0000
Message-ID: <12858.1206909083@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: Kirk McKusick <mckusick@mckusick.com>, arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Mar 2008 20:31:26 -0000

In message <200803302016.m2UKGuZA015127@apollo.backplane.com>,
Matthew Dillon writes:

>   The right way to deal with flash is *NOT* to require that the filesystem
>   be smart about flash storage, but instead to implement an intermediate
>   storage layer which linearizes the writes to flash and removes all
>   random erases from the critical path.

Your description of a simplified version of what is commonly called
a "Flash Adaptation Layer" is a very good example of why there is
a clear difference between "camera grade" flash devices, like most
CF cards, and the new generation of "SSD" devices, like the M-Tron
disk now in my laptop.

The camera-grade Flash devices get lousy random write performance
because they implement in essence what you describe, only in a more
complete fashion where they have error correction, both on the data
and on the bitmaps.

The newer generation of SSD devices do things much smarter than
that, which is why their random write performance is much better
than camera-grade devices.

See my earlier emails for references to how to do the really smart
thing.


-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Sun Mar 30 21:00:16 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C36931065672
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:00:16 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 79F9C8FC1F
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:00:16 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2UL0FmF015655;
	Sun, 30 Mar 2008 14:00:15 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2UL0FTd015654;
	Sun, 30 Mar 2008 14:00:15 -0700 (PDT)
Date: Sun, 30 Mar 2008 14:00:15 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803302100.m2UL0FTd015654@apollo.backplane.com>
To: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
References: <12858.1206909083@critter.freebsd.dk>
Cc: Kirk McKusick <mckusick@mckusick.com>, arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Mar 2008 21:00:16 -0000


:Your description of a simplified version of what is commonly called
:a "Flash Adaptation Layer" is a very good example of why there is
:a clear difference between "camera grade" flash devices, like most
:CF cards, and the new generation of "SSD" devices, like the M-Tron
:disk now in my laptop.
:
:The camera-grade Flash devices get lousy random write performance
:because they implement in essence what you describe, only in a more
:complete fashion where they have error correction, both on the data
:and on the bitmaps.
:
:The newer generation of SSD devices do things much smarter than
:that, which is why their random write performance is much better
:than camera-grade devices.
:
:See my earlier emails for references to how to do the really smart
:thing.
:
:-- 
:Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20

    Er, why don't you explain it again, because I can't find the reference.

    You can only write to flash so fast.  What I described is a fairly
    maximal implementation.  The only way to make things faster is to
    add some dime-cap-backed static ram as a front-end cache and to gang
    writes to multiple flash chips (which is fairly standard).  A
    dime-cap-backed static ram will retain the cache for upwards of a month.
    If you go Li-battery-backed static ram then cache retention is
    around 5 years.

    Most 'camera grade' devices are one or two physical chips.  Write
    performance, particularly when writing out large linear files, tends
    to be limited by the fact that there aren't very many flash chips
    and so you have no ability to gang writes in parallel.

    Any sort of SSD device is typically going to have anywhere from four
    to 'many' physical flash devices on board.  Write performance to such
    devices will be an order of magnitude faster, really only limited by
    design choices on how the flash devices are ganged.  A 'wide data' bus
    is the most convenient way to gang writes.  There are also current
    limitations which limit how many physical chips you can write to in
    parallel, though modern flash devices have much lower write current
    requirements than older ones, and if it is packaged as a SATA drive then
    it has tons of current capability simply by having access to a power
    connector capable of delivering the currents required by normal hard
    drives.  CF and other small-format flash devices do not have NEARLY
    the same current delivery capabilities.
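
A back-of-the-envelope sketch of why ganging helps, assuming pages are
striped round-robin across chips and that pages on different chips
program fully in parallel. The 200-microsecond program time and the chip
counts are made-up illustration values, and bus bandwidth and current
limits are ignored.

```python
def stripe(logical_page: int, chips: int):
    """Map a logical page to (chip index, page index on that chip)."""
    return logical_page % chips, logical_page // chips


def write_time_us(pages: int, chips: int, program_us: int) -> int:
    """Time to program `pages` pages when `chips` chips work in parallel."""
    rounds = -(-pages // chips)   # ceiling division: one page per chip per round
    return rounds * program_us


print(stripe(5, 4))                 # logical page 5 on a 4-chip device
print(write_time_us(64, 1, 200))    # single chip
print(write_time_us(64, 8, 200))    # eight chips ganged
```

Under these assumptions the eight-chip case finishes the same linear
write in one eighth of the time, which is the effect being described.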

    In any case, there is nothing magical about any of this.  You still need
    to spread the data out on the physical flash devices to avoid wearing out
    cells.  Perceived improvements in performance are entirely due to having
    a front-end non-volatile ram cache and ganging writes in parallel.

						-Matt


From owner-freebsd-arch@FreeBSD.ORG  Sun Mar 30 21:06:58 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C371E106564A
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:06:58 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 78C448FC1C
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:06:58 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (unknown [192.168.61.3])
	by phk.freebsd.dk (Postfix) with ESMTP id 38A8417104;
	Sun, 30 Mar 2008 21:06:57 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2UL6vYu013180;
	Sun, 30 Mar 2008 21:06:57 GMT (envelope-from phk@critter.freebsd.dk)
To: Matthew Dillon <dillon@apollo.backplane.com>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Sun, 30 Mar 2008 14:00:15 MST."
	<200803302100.m2UL0FTd015654@apollo.backplane.com> 
Date: Sun, 30 Mar 2008 21:06:57 +0000
Message-ID: <13179.1206911217@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: Kirk McKusick <mckusick@mckusick.com>, arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Mar 2008 21:06:58 -0000

In message <200803302100.m2UL0FTd015654@apollo.backplane.com>,
Matthew Dillon writes:

>    Er, why don't you explain it again, because I can't find the reference.

You'll find it if you search for it.

And no, I really don't want to discuss it any further with you.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Sun Mar 30 21:09:29 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5869A1065673
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:09:29 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 122BB8FC17
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:09:29 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2UL9S9H015763;
	Sun, 30 Mar 2008 14:09:28 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2UL9SV1015762;
	Sun, 30 Mar 2008 14:09:28 -0700 (PDT)
Date: Sun, 30 Mar 2008 14:09:28 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803302109.m2UL9SV1015762@apollo.backplane.com>
To: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
References: <13179.1206911217@critter.freebsd.dk>
Cc: Kirk McKusick <mckusick@mckusick.com>, arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Mar 2008 21:09:29 -0000


:
:In message <200803302100.m2UL0FTd015654@apollo.backplane.com>,
:Matthew Dillon writes:
:
:>    Er, why don't you explain it again, because I can't find the reference.
:
:You'll find it if you search for it.
:
:And no, I really don't want to discuss it any further with you.
:
:-- 
:Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20

    Well, no skin off my nose.  I will say that I am not at all impressed
    with your idiotic answer, though.

					-Matt


From owner-freebsd-arch@FreeBSD.ORG  Sun Mar 30 21:11:07 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BC54E1065677
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:11:07 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 71A098FC1A
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:11:07 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (unknown [192.168.61.3])
	by phk.freebsd.dk (Postfix) with ESMTP id 892F417107;
	Sun, 30 Mar 2008 21:11:06 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2ULB6fM013325;
	Sun, 30 Mar 2008 21:11:06 GMT (envelope-from phk@critter.freebsd.dk)
To: Matthew Dillon <dillon@apollo.backplane.com>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Sun, 30 Mar 2008 14:09:28 MST."
	<200803302109.m2UL9SV1015762@apollo.backplane.com> 
Date: Sun, 30 Mar 2008 21:11:06 +0000
Message-ID: <13324.1206911466@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: Kirk McKusick <mckusick@mckusick.com>, arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Mar 2008 21:11:07 -0000

In message <200803302109.m2UL9SV1015762@apollo.backplane.com>,
Matthew Dillon writes:
>
>:
>:In message <200803302100.m2UL0FTd015654@apollo.backplane.com>,
>:Matthew Dillon writes:
>:
>:>    Er, why don't you explain it again, because I can't find the reference.
>:
>:You'll find it if you search for it.
>:
>:And no, I really don't want to discuss it any further with you.
>:
>:-- 
>:Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
>
>    Well, no skin off my nose.  I will say that I am not at all impressed
>    with your idiotic answer, though.

... and that's why.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Sun Mar 30 21:15:01 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D669C1065678
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:15:01 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 900F98FC31
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:15:01 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2ULExU2015829;
	Sun, 30 Mar 2008 14:15:01 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2ULExWt015828;
	Sun, 30 Mar 2008 14:14:59 -0700 (PDT)
Date: Sun, 30 Mar 2008 14:14:59 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803302114.m2ULExWt015828@apollo.backplane.com>
To: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
References: <13324.1206911466@critter.freebsd.dk>
Cc: Kirk McKusick <mckusick@mckusick.com>, arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Mar 2008 21:15:01 -0000

:>    Well, no skin off my nose.  I will say that I am not at all impressed
:>    with your idiotic answer, though.
:
:... and that's why.
:
:-- 
:Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20

    You really like making declarations by fiat.  I actually explain
    the reason, in depth.  If you are unable or unwilling to have a technical
    conversation and insist on simply putting down one-liners with nothing
    to back them up, then that's your problem, not mine.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Sun Mar 30 21:42:43 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B3967106564A
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:42:43 +0000 (UTC)
	(envelope-from chris@arnold.se)
Received: from mailstore.infotropic.com (mailstore.infotropic.com
	[213.136.34.3]) by mx1.freebsd.org (Postfix) with ESMTP id E9B4E8FC18
	for <arch@freebsd.org>; Sun, 30 Mar 2008 21:42:42 +0000 (UTC)
	(envelope-from chris@arnold.se)
Received: (qmail 96681 invoked by uid 89); 30 Mar 2008 21:15:58 -0000
Received: by simscan 1.2.0 ppid: 96676, pid: 96678, t: 0.1362s
	scanners: attach: 1.2.0 clamav: 0.90/m:42
Received: from unknown (HELO ?192.168.123.123?) (chris@arnold.se@212.71.168.45)
	by mailstore.infotropic.com with ESMTPA; 30 Mar 2008 21:15:57 -0000
Date: Sun, 30 Mar 2008 23:15:57 +0200 (CEST)
From: Christopher Arnold <chris@arnold.se>
X-X-Sender: chris@localhost
To: arch@freebsd.org
Message-ID: <20080330231544.A96475@localhost>
X-message-flag: =?ISO-8859-1?Q?Outlook_isn=B4t_compliant_with_current_standards?=
	=?ISO-8859-1?Q?_please_install_another_mail_client!?=
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: 
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Mar 2008 21:42:43 -0000

On Sun, 30 Mar 2008, Poul-Henning Kamp wrote:

> In message <200803302100.m2UL0FTd015654@apollo.backplane.com>,
> Matthew Dillon writes:
>
>>    Er, why don't you explain it again, because I can't find the reference.
> 
> You'll find it if you search for it.
> 
I believe phk means that googling for "Flash Adaptation Layer" turns up some
results.

> And no, I really don't want to discuss it any further with you.
> 
But please continue the discussion for the sake of the silent majority; there
are loads of us out here who are interested in flash fs development.

Also, I had the impression that newer flash-based hard drives had internal logic
to spread out writes evenly over the disk and to remap worn-out blocks, and that
these algorithms increased the MTBF to at least the MTBF of spinning
disks. Or have I misread something?


 	/Chris

--
http://www.arnold.se/
http://www.infotropic.com/

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 00:10:34 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id ECC08106566B
	for <arch@freebsd.org>; Mon, 31 Mar 2008 00:10:33 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id BDF358FC22
	for <arch@freebsd.org>; Mon, 31 Mar 2008 00:10:33 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2V0ALA3017187;
	Sun, 30 Mar 2008 17:10:21 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2V0ALRp017186;
	Sun, 30 Mar 2008 17:10:21 -0700 (PDT)
Date: Sun, 30 Mar 2008 17:10:21 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803310010.m2V0ALRp017186@apollo.backplane.com>
To: Christopher Arnold <chris@arnold.se>
References: <20080330231544.A96475@localhost>
Cc: arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 00:10:34 -0000

:I believe phk means that googling for "Flash Adaptation Layer" turns up some
:results.
:
:> And no, I really don't want to discuss it any further with you.
:> 
:But please continue the discussion for the sake of the silent majority; there
:are loads of us out here who are interested in flash fs development.
:
:Also, I had the impression that newer flash-based hard drives had internal logic
:to spread out writes evenly over the disk and to remap worn-out blocks, and that
:these algorithms increased the MTBF to at least the MTBF of spinning
:disks. Or have I misread something?
:
:
: 	/Chris

    I found some of it, though I dunno if it's what he was specifically
    referencing.  The slide show was interesting though there were a
    number of factual errors, but I didn't really see anything in-depth
    about 'Flash Adaptation Layer'.  It seems to be a fairly generically
    coined term for something that is far from generic in actual
    implementation.

    The idea of remapping flash sectors could be considered a poor-man's
    way of dealing with wear issues in that remapping tends to be fairly
    limited... for example, you might use a fixed-sized table and once the
    table fills up the device is toast.  Remapping doesn't actually prevent
    the uneven wear from occurring, it just gives you a fixed factor of
    additional runway.  If remapping gets complex enough to work with an
    arbitrary number of dead sectors then it is effectively a 'Flash
    Adaptation Layer'.  Limited remapping (e.g. using a fixed-sized table)
    is really easy to code up.
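
    To make the limitation concrete, here is a minimal sketch of limited
    remapping with a fixed-sized table.  The class name, the layout of the
    spare pool and the sizes are all made up for illustration; it is not
    how any particular controller does it.

```python
class FixedRemapTable:
    """Toy bad-sector remapper with a fixed number of spare sectors."""

    def __init__(self, data_sectors, spare_sectors):
        self.data_sectors = data_sectors
        # Assumption of this sketch: spares sit just past the data area.
        self.free_spares = list(range(data_sectors, data_sectors + spare_sectors))
        self.table = {}  # worn logical sector -> spare physical sector

    def physical(self, logical):
        """Translate a logical sector to the physical sector backing it."""
        if not 0 <= logical < self.data_sectors:
            raise IndexError("no such sector")
        return self.table.get(logical, logical)

    def retire(self, logical):
        """Remap a worn-out sector; returns False once the spares are gone."""
        if not self.free_spares:
            return False  # out of runway: the device is toast
        self.table[logical] = self.free_spares.pop(0)
        return True
```

    Note that a hot sector keeps taking every write after it is remapped,
    so the spare wears out just as fast as the original did.  The table
    only buys as many extra lifetimes as there are spares.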

    But there are some huge differences between the two.  Really huge
    differences.  Detecting a worn cell requires generating a CRC and
    correcting it requires generating an ECC code.  Neither CRCs nor
    ECCs are perfect and actually depending on them to handle situations
    that happen *normally* during the device's life-span is bad business.

    A proper sector translation mechanism guarantees even wear of all
    the cells.  You don't *GET* CRC errors under normal operation of
    the device.  You still want to have a CRC to detect the situation, and
    perhaps even a small ECC to try to correct it, but these exist to
    handle manufacturing defects (which can limit the life of individual
    cells) rather than to handle wear issues unrelated to manufacturing
    defects, which is what a limited remapping mechanic does.  A wear issue
    can cause many cells to die (see later on w/ regards to data retention)
    whereas a manufacturing defect tends to result in single bit errors.

    As for indestructibility, in the short term flash storage is
    more resilient than disk storage especially considering that there are
    no moving parts, but flash cells will degrade over time whether you
    write to them or not, depending on temperature.

    Look at any flash part, bring up the technical specifications and 
    there will be an entry for 'data retention' time.  Usually it's around
    10 years at 20 C.  If it is hotter the data is retained for a shorter
    period of time, if it is colder the data is retained for a longer
    period of time.  Retention is different from cell wear.  What retention
    means is that if you have a flash device, you need to rewrite the
    cells periodically.  You can't just read the cell like a DRAM refresh,
    but you don't have to go through an erase cycle either; you only have
    to rewrite the cell.  You need to do that at least once every 5 years
    to be safe, or you risk losing the data.  Rewriting the cell does add
    wear to it so you don't want to rewrite it too often.  I have
    personally seen flash devices
    lose data... I'm trying to remember how many years it was but I think
    it was on the order of 15 years in one unit out of 30 that was subject
    to fairly hot temperatures in the summer.

    A flash unit must therefore run a scrubber to really be reliable.  It is
    absolutely required if you use a remapping algorithm, and a bit less so
    if you use a proper storage layer which generates even wear.  The real
    difference between the two comes down to shelf life (when you aren't
    scrubbing anything), since worn cells will die a lot more quickly than
    unworn cells.

    A scrubber in this case must validate the CRC and there is usually a
    way to tell the device to operate at a different detection threshold in
    order to detect a failing cell *before* it actually fails (write-verify
    usually does this when writing but you also want to do this when
    scrubbing, if you want to do it right).  The idea is for the scrubber
    to detect bit errors *before* the data becomes unrecoverable and,
    in fact, before the data even needs to be ECC'd.  You should not have
    to actually use ECC correction under normal operation of the device over
    its entire life span.

    If you have a wear situation where multiple cells are failing and you
    do not scan the data in the flash often enough (using write-verify
    thresholds, NOT normal operations thresholds) to detect the failing
    cells, and/or you do not have a verification voltage capability to
    detect failing cells before they fail (for example you take a worn
    device offline and store it on a shelf somewhere), then you risk
    detecting the failed cells too late at a point where there are too many
    failed cells to correct.  This is of particular concern for very large
    flash storage.

    One side-effect of having a proper storage layer is that the scrubber is
    typically built in to it.  Just the mechanic of write-appending and
    having to repack the storage usually cycles the storage in a time frame
    less than 10 years.  You can scrub either way, though; it isn't hard to
    do and doesn't require remapping the cell unless it has failed.  Just
    re-writing the same data resets the energy levels.
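
    A scrubber pass along those lines might look roughly like the sketch
    below.  The Page class and its 'marginal' flag are hypothetical; the
    flag stands in for whatever a read at the stricter write-verify
    threshold would report on real hardware, and zlib.crc32 stands in for
    the device's CRC.

```python
import zlib

class Page:
    def __init__(self, data):
        self.data = bytes(data)
        self.crc = zlib.crc32(self.data)
        # Hypothetical: set when a read at the stricter threshold looks weak.
        self.marginal = False

def scrub(pages):
    """Walk every page; refresh weak ones, report ones that already fail CRC."""
    refreshed, failed = [], []
    for index, page in enumerate(pages):
        if zlib.crc32(page.data) != page.crc:
            failed.append(index)          # too late: needs ECC, or is lost
        elif page.marginal:
            page.data = bytes(page.data)  # rewrite the same data to restore charge
            page.marginal = False
            refreshed.append(index)
    return refreshed, failed
```

    The point of the two lists is that a page should show up in
    'refreshed' long before it can ever show up in 'failed'.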

    A flash is still more reliable than a hard drive in the short-term.
    However, disk media tends to retain magnetic orientation longer than
    a flash cell (longer than 10 years)... well, I'm not sure about the
    absolute latest technology but that was certainly the case 10 years
    ago.  Disk media has similar thermal erasure issues so, really, both
    types of media have a limited data retention span.  Recovering data
    from an aging flash chip is a lot harder, though, because you have to
    remove the flash packaging and even shave the chip (yes, it can be done,
    there have been numerous cases where supposedly secure execute-only
    flash and E^2prom could be read out by shaving the chip, though I dunno
    if it has been done with recent super-high-density flashes).  With
    disk media you can generally recover thermally erased bits using very
    expensive equipment with very sensitive detectors.  If the data is
    important, and you are willing to pay for it, you can recover it off
    a HD.

    Typically the only difference between 'consumer' and 'industrial' flash
    is how they sort the chips coming out of the plant.  It is possible to 
    detect weak cells and sort the chips accordingly (thus consumer chips
    have fewer rewrite cycles), though frankly in most cases a consumer
    chip will be almost as good as an industrial one.  If you run a proper
    sector translation layer which generates even wear and you have the
    ability to use the write-verify mechanism in your scrubbing code, it
    doesn't really matter which grade you use.

						-Matt


From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 01:36:04 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2C428106566B
	for <arch@freebsd.org>; Mon, 31 Mar 2008 01:36:04 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id F41B28FC17
	for <arch@freebsd.org>; Mon, 31 Mar 2008 01:36:03 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2V1ZqdA018355;
	Sun, 30 Mar 2008 18:35:52 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2V1ZpiN018354;
	Sun, 30 Mar 2008 18:35:51 -0700 (PDT)
Date: Sun, 30 Mar 2008 18:35:51 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803310135.m2V1ZpiN018354@apollo.backplane.com>
To: Christopher Arnold <chris@arnold.se>, arch@freebsd.org
References: <20080330231544.A96475@localhost> 
Cc: 
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 01:36:04 -0000

     I just finished reading up on the latest NAND stuff, so I am going
     to add an addendum.

     There was one factual error in my last posting having to do with
     byte rewrites.  I'm not sure this applies to all manufacturers but
     one spec sheet I looked at specifically limited non-erase rewriting
     to two consecutive page-write sequences.  After that you have to perform
     an erase before you can write (and rewrite once) again.

**** I'd be interested in knowing if any chip vendors support multiple
**** consecutive page-write sequences without erase cycles in between
**** (i.e. allowing 1->0 transitions like you can do with NOR).

     It looks like most vendors provide SECTOR_SIZE + 64 bytes of auxiliary
     information.  The auxiliary information is where you typically store
     the CRC and ECC (they can be the same thing but it's a good idea to
     implement them separately).   I was surprised that the vendors
     specified only a 2 bit detect / 1 bit correction code, which is actually
     the simplest Hamming code you can have.

     Describing this type of Hamming code in a paragraph is actually pretty
     easy.  You can think of it as a code which identifies which bit in a
     block is in error and needs to be 'flipped' (aka the '1' bit correction).
     For example, if you are ECC'ing 8192 bytes you have 65536 bits
     which means the hamming code needs to be able to encode a 16 bit
     correction address, hence it requires 16 bits of storage for the
     correction, plus another (typically) log2(16) = 4 bits of storage for
     the detection, plus 1 more bit (you have to include the storage
     taken up by the ECC code itself).  So ECC on 65536 bits requires 21 bits.
     I'm doing that from memory so don't quote me, we used those sorts of 
     ECC in radio modem protocols 20 years ago.

     The actual construction of the correction address is a bit more
     complex but that is the basics of how a 2 bit detect / 1 bit correct
     Hamming code works.
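
     For anyone who wants to play with the 'correction address' idea,
     here is a small sketch of the standard textbook single-error
     Hamming construction.  It is not taken from any vendor spec and the
     function names are made up.  Check bits sit at the power-of-two
     positions, and the XOR of the positions of all set bits (the
     syndrome) is zero for a clean codeword and equals the position of
     the flipped bit otherwise.

```python
def check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r

def syndrome(codeword):
    """XOR of the 1-based positions of all set bits; 0 means no error seen."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s

def encode(data):
    """Place data in non-power-of-two positions, then fix up the check bits."""
    r = check_bits(len(data))
    n = len(data) + r
    code = [0] * (n + 1)          # index 0 unused so positions are 1-based
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):       # not a power of two: a data position
            code[pos] = next(bits)
    s = syndrome(code[1:])
    for i in range(r):
        if s & (1 << i):
            code[1 << i] = 1      # set check bits so the syndrome becomes 0
    return code[1:]
```

     By this textbook bound, check_bits(65536) comes out to 17, and one
     extra overall parity bit for the 2 bit detect makes 18.  That is a
     little lower than the from-memory count given above, so treat both
     numbers with appropriate suspicion and check them against the
     actual part.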

     The vendor bit error handling recommendation is to relocate the page
     and then erase the original rather than to rewrite the page, so the
     scrubbing code can't just rewrite the same page when it finds an error.
     You still have to scrub, though, or you risk accumulating too many
     errors to correct.  write-verify is typically automatic in the chips
     but the two I checked do not seem to have a variable threshold for
     read operations for early detection of leaking bits.  Older chips had
     separate power supplies for the programming power but newer ones
     incorporate internal charge pumps so it may not be doable, which
     would be too bad.

     Life span and shelf life information is correct.  My assumption there
     is that the manufacturers are specifying the shelf life for leakage in the
     worst case write versus verify cycle (the verify is internal to the chip,
     the external entity just does a write and reads the verification status
     after it finishes).  If there is no way to do a read at a lower
     sensitivity level there is really no way to locate failing bits before
     they actually fail.  That doesn't seem right so I may be missing
     something in the spec.

     With regards to averaging out the wear by not erase cycling the same
     page over and over again, my read from the chip specs is that you
     basically have no choice on the matter... you MUST average the wear out,
     period end of story.  This also precludes using a simple sector
     remapping algorithm, particularly if the re-writes between erase
     cycles for a page are limited.

     The reason you MUST average the wear out is that the vendors do not
     appear to be guaranteeing even 100K erase cycles.

     I've read flash chip specs a billion times... when you read between
     the lines what the vendor is saying, basically, is that the shelf life
     of a stored bit is only guaranteed to be 10 years if you don't rewrite
     the cell more than X number of times.  So while it may be possible to
     write more than X number of times, you risk serious data degradation
     ('shelf life') if you do, even if the write does not fail.  This
     is the only guarantee they make, and it is based on the damage the cell
     takes when you erase/write to it which increases leakage which reduces
     shelf life.

     They do NOT guarantee that you can actually do X erase cycles, they
     simply say that the chip will tell you if an erase cycle fails, and that
     it can fail ANY TIME... the very first erase cycle you do on a
     particular page can fail.

     The ONLY thing the vendors guarantee is that the FIRST page on the device
     can go through a certain number of erase cycles, like 1000 or 10,000.
     No other page on the device has any sort of guarantee.

     This is very important.  This means you MUST average the wear out,
     period, whether it is consumer OR industrial grade.
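
     The simplest form of averaging the wear out is to always hand out
     the least-erased free block.  The sketch below shows only that one
     policy; the class and method names are invented for illustration,
     and a real layer would also have to move long-lived data off
     little-used blocks, which this does not attempt.

```python
class WearLeveler:
    """Toy allocator that always picks the least-erased free block."""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks
        self.free = set(range(num_blocks))
        self.bad = set()

    def allocate(self):
        """Hand out the free block with the fewest erase cycles so far."""
        if not self.free:
            raise RuntimeError("no free erase blocks")
        block = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(block)
        return block

    def release(self, block, erase_ok=True):
        """Erase a block and return it to the pool, or retire it on failure."""
        if not erase_ok:
            self.bad.add(block)  # an erase can fail at any time, even the first
            return
        self.erase_counts[block] += 1
        self.free.add(block)
```

     The erase_ok argument is there because, as noted above, the chip
     only promises to tell you when an erase fails, not that it won't.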

						-Matt


From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 02:13:36 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.ORG
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 908E41065674
	for <arch@FreeBSD.ORG>; Mon, 31 Mar 2008 02:13:36 +0000 (UTC)
	(envelope-from das@FreeBSD.ORG)
Received: from zim.MIT.EDU (ZIM.MIT.EDU [18.95.3.101])
	by mx1.freebsd.org (Postfix) with ESMTP id 252798FC16
	for <arch@FreeBSD.ORG>; Mon, 31 Mar 2008 02:13:35 +0000 (UTC)
	(envelope-from das@FreeBSD.ORG)
Received: from zim.MIT.EDU (localhost [127.0.0.1])
	by zim.MIT.EDU (8.14.2/8.14.2) with ESMTP id m2V2F4m4001804;
	Sun, 30 Mar 2008 22:15:04 -0400 (EDT) (envelope-from das@FreeBSD.ORG)
Received: (from das@localhost)
	by zim.MIT.EDU (8.14.2/8.14.2/Submit) id m2V2F4ju001803;
	Sun, 30 Mar 2008 22:15:04 -0400 (EDT) (envelope-from das@FreeBSD.ORG)
Date: Sun, 30 Mar 2008 22:15:04 -0400
From: David Schultz <das@FreeBSD.ORG>
To: Matthew Dillon <dillon@apollo.backplane.com>
Message-ID: <20080331021504.GA1465@zim.MIT.EDU>
Mail-Followup-To: Matthew Dillon <dillon@apollo.backplane.com>,
	Christopher Arnold <chris@arnold.se>, arch@FreeBSD.ORG
References: <20080330231544.A96475@localhost>
	<200803310010.m2V0ALRp017186@apollo.backplane.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200803310010.m2V0ALRp017186@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@FreeBSD.ORG
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 02:13:36 -0000

On Sun, Mar 30, 2008, Matthew Dillon wrote:
>     The idea of remapping flash sectors could be considered a poor-man's
>     way of dealing with wear issues in that remapping tends to be fairly
>     limited... for example, you might use a fixed-sized table and once the
>     table fills up the device is toast.  Remapping doesn't actually prevent
>     the uneven wear from occuring, it just gives you a fixed factor of
>     additional runway.
[...]
>     A flash unit must therefore run a scrubber to really be reliable.  It is
>     absolutely required if you use a remapping algorithm, and a bit less so
>     if you use a proper storage layer which generates even wear.

Yes, this is essentially what modern NAND flash devices do. I
suggest that you read this article before you write any more
essays about it:

       http://www.cs.tau.ac.il/~stoledo/Pubs/flash-survey.pdf

Now if you think about issues such as sector mapping updates,
writes smaller than the mapping granularity, and running the
cleaner on fragmented erase units, you'll quickly see why random
writes perform so poorly.

You're right that you need additional algorithms to avoid uneven
wear; remapping merely facilitates that even when the write access
pattern is decidedly uneven. The article discusses several approaches.
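
As a rough illustration of the small-write problem, the sketch below
counts how many bytes have to be programmed when a host write is
smaller than, or misaligned with, the mapping unit. The function names
and the example sizes are assumptions for illustration, and the model
ignores mapping-table updates and cleaner traffic, both of which make
the real cost higher.

```python
def physical_bytes_written(offset, length, unit):
    """Bytes programmed for one host write when whole mapping units are rewritten."""
    if length <= 0:
        return 0
    first = offset // unit
    last = (offset + length - 1) // unit
    return (last - first + 1) * unit

def write_amplification(offset, length, unit):
    """Ratio of bytes programmed to bytes the host asked to write."""
    return physical_bytes_written(offset, length, unit) / length
```

With a 4096-byte mapping unit, an aligned 512-byte write rewrites one
whole unit, and a 200-byte write that straddles a unit boundary
rewrites two.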

Several people have proposed flash-aware filesystems, also
described in the article, to obviate the need for this sort of
remapping layer. Confusingly, one of them is called FFS, for
"Flash File System". Most of them resemble log-structured
filesystems like LFS and ZFS, but often with additional
considerations such as wear leveling.

Your earlier characterization of ZFS wasn't quite right, by the
way; ZFS arranges data and metadata in a tree of blocks, and even
the indirect blocks, except for the top-level block, are
copy-on-write. Unfortunately I can't find a good paper on it at
the moment.

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 11:06:57 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@hub.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8F85810656A6
	for <freebsd-arch@hub.freebsd.org>;
	Mon, 31 Mar 2008 11:06:57 +0000 (UTC)
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 666178FC12
	for <freebsd-arch@hub.freebsd.org>;
	Mon, 31 Mar 2008 11:06:57 +0000 (UTC)
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id m2VB6vpi038848
	for <freebsd-arch@FreeBSD.org>; Mon, 31 Mar 2008 11:06:57 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.14.2/8.14.1/Submit) id m2VB6uKW038844
	for freebsd-arch@FreeBSD.org; Mon, 31 Mar 2008 11:06:56 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 31 Mar 2008 11:06:56 GMT
Message-Id: <200803311106.m2VB6uKW038844@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to
	owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster <bugmaster@FreeBSD.org>
To: freebsd-arch@FreeBSD.org
Cc: 
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 11:06:57 -0000

Current FreeBSD problem reports
Critical problems
Serious problems
Non-critical problems

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.


From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 18:05:25 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 00FDC1065679
	for <arch@freebsd.org>; Mon, 31 Mar 2008 18:05:25 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id E09298FC14
	for <arch@freebsd.org>; Mon, 31 Mar 2008 18:05:24 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 8A94940A2A6;
	Mon, 31 Mar 2008 10:36:01 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 31 Mar 2008 10:36:09 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
In-Reply-To: <200803310135.m2V1ZpiN018354@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics 
Thread-Index: AciSz5+GfnfZSxuDTqmryEuFc5lwBgAgsVwg
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>,
	"Christopher Arnold" <chris@arnold.se>, <arch@freebsd.org>
Cc: 
Subject: RE: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 18:05:25 -0000

I came late to this discussion, so pardon me if I'm repeating stuff
that's already been discussed.

You can guess a lot from vendor specs, but NAND flash requires
experience before you understand the nuances; especially since the
vendors tend not to document most of what you need to know to get good
performance and reliability from a flash device.

There are, basically, two approaches to using NAND devices. What PHK
calls "flash adaptation layer" or, sometimes, "flash translation layer"
is widely used in devices that are meant to be seen as removable ms-dos
file system devices, such as almost every USB NAND based flash device on
the market. It is also used in at least two commercial flash file
systems intended for embedded flash. It is also an approach available to
the Linux MTD layer, although not used by any of the Linux filesystems.
This approach works well enough for specific usage patterns and you will
find several successful embedded devices on the CE marketplace that use
it.

The second approach is to have a 'flash aware filesystem', which
understands the write/read/erase properties of NAND flash parts. There
are three variants on this approach that I'm aware of. The first takes a
'traditional' filesystem like FFS and, in effect, adds a flash
translation layer.  The second takes a log-like file system and adapts
its GC to NAND. The third approach is to write a file system specific to
NAND devices from scratch. PalmOS Garnet's NAND file system is an
example of the first. The modified version of LFS that Mike Chen and I
did for PalmOS Cobalt is an example of the second. The MTD based file
system jffs2 is an example of the third, and a cautionary tale for those
who would write their own.

In addition to the various points Matt Dillon has figured out from
reading specs, there are several features of NAND parts that I haven't
seen mentioned here that play a fairly important role in designing file
systems around them. These include, but are probably not limited to:

1) Large page versus small page NAND
2) Broken or poorly performing hardware, especially ECC generation and
write verification
3) Adjacent write effect

Some interesting properties to take into account when designing a NAND
file system:

1) No block can be assumed good, which means you have to scan the device
to find your metadata starting point at boot time.

2) Small page NAND has less 'spare' available in the spare region than
large page NAND, which means that you can do optimizations for large
page NAND that you can't for small.

3) Write-back caching of writes makes NAND parts less reliable.
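
The boot-time scan in point 1 could be sketched roughly as follows.
The spare-area layout here (a bad marker, a magic tag, and a sequence
number) is purely hypothetical, as are the names; real file systems
each define their own.

```python
from typing import NamedTuple, Optional, Sequence

MAGIC = 0x4D455441  # hypothetical tag marking a metadata block

class Spare(NamedTuple):
    bad: bool      # factory or runtime bad-block marker
    magic: int     # MAGIC if the block holds file system metadata
    sequence: int  # bumped on every metadata write

def find_metadata_root(spares: Sequence[Spare]) -> Optional[int]:
    """Return the index of the newest good metadata block, or None."""
    newest = None
    for index, spare in enumerate(spares):
        if spare.bad or spare.magic != MAGIC:
            continue  # no block can be assumed good, so skip and keep scanning
        if newest is None or spare.sequence > spares[newest].sequence:
            newest = index
    return newest
```

Because no fixed block can be trusted to hold the starting point, the
scan has to touch the spare area of every block, which is one reason
mount time grows with device size.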

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 18:48:55 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6F3CA106567C
	for <freebsd-arch@freebsd.org>; Mon, 31 Mar 2008 18:48:55 +0000 (UTC)
	(envelope-from qpadla@gmail.com)
Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.184])
	by mx1.freebsd.org (Postfix) with ESMTP id 793228FC22
	for <freebsd-arch@freebsd.org>; Mon, 31 Mar 2008 18:48:53 +0000 (UTC)
	(envelope-from qpadla@gmail.com)
Received: by nf-out-0910.google.com with SMTP id b2so847240nfb.33
	for <freebsd-arch@freebsd.org>; Mon, 31 Mar 2008 11:48:48 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=beta;
	h=domainkey-signature:received:received:from:reply-to:to:subject:date:user-agent:cc:references:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:message-id;
	bh=fYuXiG8PASQrrvp1e4aPEK0LeTgJiB5x/UpFL5GGeZk=;
	b=JELOKyuPOnJtTETL8fnFZknqqtREY+xgmmGOMVD4Tv8G8grvVEc7H13fWsVdvCE2VvmKNpXsZCimHRQz7ZHkYeSBbZCwnwG7sQ8MMBUpNsQ1zUHghDFu+HrZwPuSHe9YUZ7miJ7dQuEIUm5pOWLpA4eixbZ6ZeJGMlMnjNxotrs=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta;
	h=from:reply-to:to:subject:date:user-agent:cc:references:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:message-id;
	b=sCAhc/LqZ6+A116H/ioYLETLNsFd5WuCQ+WDnaGDO01UEGbOXtgjNfumzYXoF8A5wYnwliceDXLG7UsXk8lPZYj4seFstn0vv9QSwMhkwzFDKe7NDoAjGaa+6KAQqbQDgicmfC5KVMAODzSIoEgoBb01Sv6c5BBtq3BAU6x2Zsg=
Received: by 10.78.182.17 with SMTP id e17mr22674199huf.57.1206987879793;
	Mon, 31 Mar 2008 11:24:39 -0700 (PDT)
Received: from atlas ( [89.162.141.1])
	by mx.google.com with ESMTPS id d23sm865337nfh.12.2008.03.31.11.24.37
	(version=TLSv1/SSLv3 cipher=OTHER);
	Mon, 31 Mar 2008 11:24:38 -0700 (PDT)
From: Nikolay Pavlov <qpadla@gmail.com>
To: freebsd-arch@freebsd.org
Date: Mon, 31 Mar 2008 21:25:28 +0300
User-Agent: KMail/1.9.6 (enterprise 0.20070907.709405)
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
In-Reply-To: <B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200803312125.29325.qpadla@gmail.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org,
	Martin Fouts <mfouts@danger.com>
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: qpadla@gmail.com
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 18:48:55 -0000

On Monday 31 March 2008 20:36:09 Martin Fouts wrote:
> The MTD based file
> system jffs2 is an example of the third, and a cautionary tale for those
> who would write their own.

Interested parties may find this information useful:
http://kerneltrap.org/Linux/UBI_File_System
It covers a new flash file system developed by Nokia engineers.

-- 
======================================================================  
- Best regards, Nikolay Pavlov. <<<-----------------------------------    
======================================================================  


From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 18:53:15 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 736C1106566B
	for <arch@freebsd.org>; Mon, 31 Mar 2008 18:53:15 +0000 (UTC)
	(envelope-from qpadla@gmail.com)
Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.175])
	by mx1.freebsd.org (Postfix) with ESMTP id ED22E8FC12
	for <arch@freebsd.org>; Mon, 31 Mar 2008 18:53:14 +0000 (UTC)
	(envelope-from qpadla@gmail.com)
Received: by ug-out-1314.google.com with SMTP id y2so565371uge.37
	for <arch@freebsd.org>; Mon, 31 Mar 2008 11:53:13 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:received:received:from:reply-to:to:subject:date:user-agent:cc:references:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:message-id;
	bh=fYuXiG8PASQrrvp1e4aPEK0LeTgJiB5x/UpFL5GGeZk=;
	b=lKd8Ky/BGXc68IDxOvoRHEyajPsode2do/7TRLXMMPR+bIde85AhwhW8BvV5G+TP97+1ZTrWwZDkaO5e9Ds4kn1DkZrUWnPNa5SmFZ+T5Bk0Z/+3+A0ec1pplkydF0tLnD3jP2EorUqgrRhUoim6WpEPdTA7dSwlYfsG3sZMuU0=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=from:reply-to:to:subject:date:user-agent:cc:references:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:message-id;
	b=ha19YBnPRnl1iQREITLpmNeeZwam2i5MnCdtWQfPJTWoJo8zkPyif2OXsbRsdjoc2TrvuAwnuXajwQ7dh1IEqVsI/mxPKYQEjPai1Ya1PIFUAqj2VpB4nKt0emSSrhPvziGlvv4SZ0hrG0bkv2s7HsKavBAZ9NsUta+RXyXqZD8=
Received: by 10.78.182.17 with SMTP id e17mr22674199huf.57.1206987879793;
	Mon, 31 Mar 2008 11:24:39 -0700 (PDT)
Received: from atlas ( [89.162.141.1])
	by mx.google.com with ESMTPS id d23sm865337nfh.12.2008.03.31.11.24.37
	(version=TLSv1/SSLv3 cipher=OTHER);
	Mon, 31 Mar 2008 11:24:38 -0700 (PDT)
From: Nikolay Pavlov <qpadla@gmail.com>
To: freebsd-arch@freebsd.org
Date: Mon, 31 Mar 2008 21:25:28 +0300
User-Agent: KMail/1.9.6 (enterprise 0.20070907.709405)
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
In-Reply-To: <B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200803312125.29325.qpadla@gmail.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org,
	Martin Fouts <mfouts@danger.com>
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: qpadla@gmail.com
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 18:53:15 -0000

On Monday 31 March 2008 20:36:09 Martin Fouts wrote:
> The MTD based file
> system jffs2 is an example of the third, and a cautionary tale for those
> who would write their own.

Interested parties may find this information useful:
http://kerneltrap.org/Linux/UBI_File_System
It covers a new flash file system developed by Nokia engineers.

-- 
======================================================================  
- Best regards, Nikolay Pavlov. <<<-----------------------------------    
======================================================================  


From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 19:15:41 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 09714106566C
	for <arch@freebsd.org>; Mon, 31 Mar 2008 19:15:41 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id D05288FC17
	for <arch@freebsd.org>; Mon, 31 Mar 2008 19:15:40 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VJFSqj027594;
	Mon, 31 Mar 2008 12:15:28 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VJFSoR027593;
	Mon, 31 Mar 2008 12:15:28 -0700 (PDT)
Date: Mon, 31 Mar 2008 12:15:28 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803311915.m2VJFSoR027593@apollo.backplane.com>
To: qpadla@gmail.com
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org,
	Martin Fouts <mfouts@danger.com>, freebsd-arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 19:15:41 -0000

    This is all very good information.  I was unaware of the adjacent write
    effect, but it makes sense considering the cell size.  Hard drives have
    a similar effect (it's one of the limiting factors for density).

    Hamming codes (ECC codes) are very fragile beasts.  While they are in the
    same family as a CRC, it is a really bad idea to try to use the ECC code
    as your CRC, which is why I recommended against it in my previous posting.
    A two-bit-detect/one-bit-correct code is utterly trivial to implement
    (both generating it and using it)... I've done such codes on 8-bit CPUs.
    Their fragility can be surprising to anyone who has never worked with
    them.
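
    As a rough illustration of that fragility, here is a minimal Python
    sketch of an extended Hamming (8,4) code, i.e. one-bit-correct /
    two-bit-detect over a 4-bit value.  The word size and the names
    encode() and decode() are assumptions picked for brevity, not what
    any particular NAND part does.  Note how a three-bit error is quietly
    "corrected" into the wrong data, which is the case a separate CRC is
    there to catch.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SEC-DED) word."""
    bits = [0] * 8                          # index 0: overall parity; 1..7: Hamming positions
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) & 1             # makes the parity of all 8 bits even
    return sum(b << i for i, b in enumerate(bits))


def decode(word):
    """Return (nibble, status); nibble is None when the word is uncorrectable."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    odd = sum(bits) & 1                     # 1 if an odd number of bits flipped
    if syndrome == 0 and not odd:
        status = "ok"
    elif odd:
        bits[syndrome] ^= 1                 # assumes ONE flip; syndrome 0 is the parity bit
        status = "corrected"
    else:
        return None, "double-error"         # even number of flips, nonzero syndrome
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status


word = encode(0b1010)
print(decode(word))                 # (10, 'ok')
print(decode(word ^ 0b00001000))    # one flip:    (10, 'corrected')
print(decode(word ^ 0b00101000))    # two flips:   (None, 'double-error')
print(decode(word ^ 0b01101000))    # three flips: (13, 'corrected'), wrong data
```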

    I've written numerous filesystems, including a NOR flash filesystem
    (whose characteristics are somewhat different due to the availability of
    byte-write).  In my opinion, designing a filesystem *specifically* for
    NAND flash is a mistake because the technology is rapidly evolving and
    such a filesystem would wind up being obsolete in fairly short order.
    For example, the simple addition of some front-end non-volatile cache,
    such as a dime-cap-backed static RAM, would have a very serious effect
    on any such filesystem design.  It is far, far better to design the
    filesystem around generally desired characteristics, such as good
    write locality of reference (though, again, indices still have to be
    updated and those usually do not have good locality of reference).

    DragonFly's HAMMER has pretty good write-locality of reference but still
    does random updates for B-Tree indices and things like the mtime and 
    atime fields.  It also uses numerous blockmaps that could make direct use
    of a flash sector-mapping translation layer (1).  It might be adaptable.

    (1) A flash sector-mapping translation layer gives a filesystem the
    ability to use 'named block numbers'.  For example, the NOR filesystem
    I did used 32-bit named block numbers regardless of the size of the
    flash (which was typically only 2MB).  The filesystem topology was
    actually encoded into the block number itself.  In other words, the
    filesystem is not bound to a linear range of block numbers; it is
    simply bound

    What does this mean?  This means that what you really want to do is not
    necessarily write a filesystem that is explicitly designed for NAND
    operation, but instead write a filesystem that is explicitly designed
    to run on top of an abstracted topology (such as one where you can have
    named block numbers), and which generally has the desired features for
    locality of reference.  Such a filesystem would not become obsolete
    anywhere near as quickly as a NAND-specific filesystem would, and
    rebuilding an abstracted topology (whose underlying code would become
    obsolete as the technology changes) is a whole lot easier than
    redesigning a filesystem.

    I am quite partial to the named-block concept; I really think it's the
    best way to go for flash filesystem design.  The flash already has to
    have a sector-translation mechanism, so making the jump to a full-blown
    named-block model is only a small additional step.
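
    A minimal sketch of what a named-block layer could look like, in
    Python for brevity.  Everything here is an assumption made for
    illustration: the NamedBlockStore class, the block_name() helper, and
    in particular the 16/16 bit split of the name.  The point is only
    that the filesystem addresses blocks by name while the layer below
    decides where each copy physically lands.

```python
class NamedBlockStore:
    """Toy translation layer mapping 32-bit block names to physical pages."""

    def __init__(self, num_pages):
        self.pages = [None] * num_pages   # None marks an erased page
        self.where = {}                   # block name -> physical page index
        self.next_free = 0                # pages are consumed in append order

    def write(self, name, data):
        if not 0 <= name < 2**32:
            raise ValueError("block name must fit in 32 bits")
        if self.next_free == len(self.pages):
            raise RuntimeError("no erased pages left (reclaim is not modelled)")
        phys = self.next_free
        self.next_free += 1
        self.pages[phys] = (name, data)   # a page is never rewritten in place
        self.where[name] = phys           # any older copy is now stale
        return phys

    def read(self, name):
        return self.pages[self.where[name]][1]


def block_name(inode, index):
    """Hypothetical layout: high 16 bits = inode number, low 16 bits = block index."""
    if not (0 <= inode < 2**16 and 0 <= index < 2**16):
        raise ValueError("inode and index must each fit in 16 bits")
    return (inode << 16) | index


store = NamedBlockStore(num_pages=8)
name = block_name(inode=3, index=0)
store.write(name, b"first")
store.write(name, b"second")      # same name, new physical page
print(hex(name), store.where[name], store.read(name))   # 0x30000 1 b'second'
```

    The name space (2^32) is far larger than the 8 physical pages, so
    the names form a sparse key space rather than a linear array.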

						-Matt


From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 19:51:45 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 941BD106567D
	for <arch@freebsd.org>; Mon, 31 Mar 2008 19:51:45 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 604DD8FC44
	for <arch@freebsd.org>; Mon, 31 Mar 2008 19:51:45 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 4B44E414A67;
	Mon, 31 Mar 2008 12:51:27 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 31 Mar 2008 12:51:35 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
In-Reply-To: <200803311915.m2VJFSoR027593@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciTY6BS0ooCSuYbSOeNPD7xBxIXYwAAn46g
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>,
	<qpadla@gmail.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 19:51:45 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
>
>     Hamming codes (ECC codes) are very fragile beasts.  While
> they are in the same family as a CRC it is a really bad idea to
> try to use the ECC code as your CRC which is why I recommended
> against it in my previous posting.

True, but when you're working with a part that does ECC in HW, you're
stuck with the ECC it does.

>     I've written numerous filesystems, including a NOR flash
> filesystem (whose characteristics are somewhat different due to the
> availability of byte-write).  In my opinion, designing a filesystem
> *specifically* for NAND flash is a mistake because the technology is
> rapidly evolving and such a filesystem would wind up being obsolete in
> fairly short order.

Well, those of us who are shipping devices with flash parts in them have
a somewhat different view on that, which is why I've worked on three
NAND-specific file systems in the last four years. Two of those are in
use in shipping devices, and are expected to be in use for five or more
years.


>     For example, the simple addition of some front-end non-volatile
> cache, such as a dime-cap-backed static RAM, would have a very serious
> effect on any such filesystem design.

Yes.  However, since cost pressures in the phone market make such a
change very unlikely, it's not one we take into consideration.

> It is far, far better to design the filesystem around generally desired
> characteristics, such as good write locality of reference (though,
> again, indices still have to be updated and those usually do not have
> good locality of reference).

You've talked yourself into pretty much the same mistake that led to
jffs2, which turned out to be a terrible idea.

>     DragonFly's HAMMER has pretty good write-locality of
> reference but still does random updates for B-Tree indices and things
> like the mtime and atime fields.  It also uses numerous blockmaps that
> could make direct use of a flash sector-mapping translation layer (1).
> It might be adaptable.

You are pretty much describing the data structures that have made jffs2
such a poor performer.

>
>     (1) A flash sector-mapping translation layer gives a
> filesystem the ability to use 'named block numbers'.  For example, the
> NOR filesystem I did used 32-bit named block numbers regardless of the
> size of the flash (which was typically only 2MB).  The filesystem
> topology was actually encoded into the block number itself.  In other
> words, the filesystem is not bound to a linear range of block numbers;
> it is simply bound

Works OK for NOR. Has interesting problems, mainly with maintaining the
block number map reliably in storage, when attempted on NAND.

>     What does this mean?  This means that what you really
> want to do is not necessarily write a filesystem that is explicitly
> designed for NAND operation, but instead write a filesystem that is
> explicitly designed to run on top of an abstracted topology (such as
> one where you can have named block numbers), and which generally has
> the desired features for locality of reference.  Such a filesystem
> would not become obsolete anywhere near as quickly as a NAND-specific
> filesystem would, and rebuilding an abstracted topology (whose
> underlying code would become obsolete as the technology changes) is a
> whole lot easier than redesigning a filesystem.

There's really only one topology that's efficient for a NAND device, and
that's to do log-like writing coupled with garbage collection.
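
A minimal sketch of log-like writing coupled with garbage collection, in
Python for brevity. The LogFlash class, its greedy victim choice, and the
tiny geometry are all assumptions made for illustration; this is not the
design of jffs2 or of any commercial system mentioned in this thread.
Pages are only ever appended at the log head, and space comes back only
by erasing a whole block after its live pages have been copied forward.

```python
class LogFlash:
    """Toy NAND: pages are only appended, whole blocks are erased."""

    def __init__(self, num_blocks=4, pages_per_block=4):
        if num_blocks < 3:
            raise ValueError("need at least 3 blocks (one is kept in reserve)")
        self.ppb = pages_per_block
        self.blocks = [[None] * pages_per_block for _ in range(num_blocks)]
        self.live = {}                           # key -> (block, page) of newest copy
        self.free = list(range(1, num_blocks))   # erased blocks
        self.cur, self.off = 0, 0                # log head
        self.erases = 0

    def _append(self, key, data):
        if self.off == self.ppb:                 # head block full: open an erased one
            self.cur, self.off = self.free.pop(0), 0
        self.blocks[self.cur][self.off] = (key, data)
        self.live[key] = (self.cur, self.off)
        self.off += 1

    def _live_pages(self, b):
        return [(p, page) for p, page in enumerate(self.blocks[b])
                if page is not None and self.live.get(page[0]) == (b, p)]

    def _collect(self):
        # Greedy policy: reclaim the full block holding the fewest live pages.
        candidates = [b for b in range(len(self.blocks))
                      if b != self.cur and b not in self.free]
        victim = min(candidates, key=lambda b: len(self._live_pages(b)))
        survivors = self._live_pages(victim)
        if len(survivors) == self.ppb:
            raise RuntimeError("no reclaimable block")
        for _, (key, data) in survivors:         # copy live data to the log head
            self._append(key, data)
        self.blocks[victim] = [None] * self.ppb  # erase the whole block
        self.free.append(victim)
        self.erases += 1

    def write(self, key, data):
        # Collect before the last erased block would be consumed by new data.
        if self.off == self.ppb and len(self.free) <= 1:
            self._collect()
        self._append(key, data)

    def read(self, key):
        b, p = self.live[key]
        return self.blocks[b][p][1]


flash = LogFlash(num_blocks=4, pages_per_block=4)
for i in range(100):
    flash.write(i % 3, i)          # three keys overwritten again and again
print(flash.read(0), flash.read(1), flash.read(2), flash.erases > 0)
```

The in-RAM dictionary stands in for the name translation information;
keeping that information in NAND across power cycles is not modelled here.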


> I am quite partial to the named-block concept; I really
> think it's the best way to go for flash filesystem design.  The flash
> already has to have a sector-translation mechanism, so making the jump
> to a full-blown named-block model is only a small additional step.

The devil in the details of your naming scheme turns out to be managing
the name translation information within the NAND storage itself. This is
the source of significant performance problems in jffs2, for example,
and of a huge amount of code complexity in the commercial system I
work with.

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 20:06:30 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6E6E21065671;
	Mon, 31 Mar 2008 20:06:30 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 318988FC18;
	Mon, 31 Mar 2008 20:06:30 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VK6ANu028134;
	Mon, 31 Mar 2008 13:06:10 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VK6Aom028133;
	Mon, 31 Mar 2008 13:06:10 -0700 (PDT)
Date: Mon, 31 Mar 2008 13:06:10 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803312006.m2VK6Aom028133@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 20:06:30 -0000


:You've talked yourself into pretty much the same mistake that led to
:jffs2, which turned out to be a terrible idea.

    I'm not familiar with jffs2 but a blockmap abstraction in and of itself
    just doesn't have the terrible characteristics you are implying.
    The implementations might have been bad but the concept is quite sound.

    Here's a question.  OK, so the best write performance is to essentially
    append to the NAND device.  That's fairly obvious, though as long as you
    are able to fully complete a page it doesn't really matter where the
    data goes.  So the main issue is being able to complete a page (since
    you can't rewrite it).
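
    A minimal sketch of the "complete a page" constraint, in Python for
    brevity.  The PageBuffer class and the page size are assumptions for
    illustration only: small writes are collected in RAM and only whole
    pages are ever handed to the device.

```python
class PageBuffer:
    """Collect small writes in RAM and emit only complete pages."""

    def __init__(self, page_size=2048):
        self.page_size = page_size
        self.pending = bytearray()   # bytes not yet committed to flash
        self.programmed = []         # stand-in for pages written to the device

    def write(self, data):
        self.pending += data
        while len(self.pending) >= self.page_size:
            self.programmed.append(bytes(self.pending[:self.page_size]))
            del self.pending[:self.page_size]


buf = PageBuffer(page_size=8)
buf.write(b"abc")
buf.write(b"defghij")
print(buf.programmed, bytes(buf.pending))   # [b'abcdefgh'] b'ij'
```

    Whatever is still pending lives only in RAM, so in this sketch a
    power loss would drop the partial page.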

    But how do you index that information?  You can't simply append the
    information to the NAND unless you also have a way to access it.  So
    does the filesystem have to scan the NAND (or significant portions of it)
    in order to build an index of the filesystem topology in system memory?

    No matter what you do you have to index the information *SOMEWHERE*.
    That somewhere is either going to be in-NAND or in-memory or some
    combination of the two.  If it is entirely in-memory you have to scan
    the auxiliary information in nearly the entire NAND array to build
    your index.  If it is entirely in-NAND you have a significant updating
    problem.
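
    A minimal sketch of the entirely-in-memory case, in Python for
    brevity.  It assumes, purely for illustration, that the auxiliary
    information of each page carries a block name and a sequence number;
    rebuild_index() and that tag layout are hypothetical.  The mount-time
    cost is one pass over every page in the array.

```python
def rebuild_index(pages):
    """Scan every page; return {name: physical index of the newest copy}.

    pages is a list where None is an erased page and anything else is a
    (name, seq, data) tuple, with seq increasing on every write.
    """
    index = {}
    newest = {}                       # name -> highest sequence number seen
    for phys, page in enumerate(pages):
        if page is None:
            continue
        name, seq, _data = page
        if name not in newest or seq > newest[name]:
            newest[name] = seq
            index[name] = phys
    return index


nand = [("a", 1, b"old"), ("b", 2, b"x"), None, ("a", 3, b"new")]
print(rebuild_index(nand))            # {'a': 3, 'b': 1}
```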

    A named-block model, done right, can serve as the index.  That is, it
    is exactly the same problem just viewed from a different angle.
    A named-block model does not necessarily imply that the indexing
    topology has to be stored entirely in-NAND, it does not imply any sort
    of linear array, and it does not imply any random-updating requirement.

    I don't know what the jffs2 folks did but you shouldn't take their
    performance failure as an indication that the general concept is
    incorrect.

						-Matt


From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 20:08:25 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B99011065670
	for <freebsd-arch@freebsd.org>; Mon, 31 Mar 2008 20:08:25 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id A2B988FC16
	for <freebsd-arch@freebsd.org>; Mon, 31 Mar 2008 20:08:25 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 4B44E414A67;
	Mon, 31 Mar 2008 12:51:27 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 31 Mar 2008 12:51:35 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
In-Reply-To: <200803311915.m2VJFSoR027593@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciTY6BS0ooCSuYbSOeNPD7xBxIXYwAAn46g
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>,
	<qpadla@gmail.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 20:08:25 -0000

=20

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]=20
>=20
>     Hamming codes (ECC codes) are very fragile beasts.  While=20
> they are in the same family as a CRC it is a really bad idea to=20
> try to use the ECC code as your CRC which is why I recommended=20
> against it in my previous posting.

True, but when you're working with a part that does ECC in HW, you're
stuck with the ECC it does.

>     I've written numerous filesystems, including a NOR flash=20
> filesystem (whos characteristics are somewhat different due to the=20
> availability of byte-write).  In my opinion, designing a filesystem=20
> *specifically* for NAND flash is a mistake because the technology is
rapidly=20
> evolving and such a filesystem would wind up being obsolete in fairly=20
> short order.

Well, those of us who are shipping devices with flash parts in them have
a somewhat different view on that, which is why I've worked on three
NAND specific file systems in the last four years. Two of those are in
use in shipping devices, and are expected to be in use for five or more
years.


>     For example, the simple addition of some front-end non-volatile
cache,
> such as a dime-cap-backed static ram, would have a very serious effect
> on any such filesystem design.

Yes.  However since the phone market makes such a change very unlikely,
because of cost pressures, it's not one we take into consideration.

> It is far far better to design the filesystem around generally desired
> characteristics, such as good write locality of reference (though,
again, indices still=20
> have to be updated and those usually do not have good locality of
reference).

You've talked yourself into pretty much the same mistake that led to
jffs2, which turned out to be a terrible idea.

>     DragonFly's HAMMER has pretty good write-locality of=20
> reference but still does random updates for B-Tree indices and things
like=20
> the mtime and atime fields.  It also uses numerous blockmaps that
could=20
> make direct use of a flash sector-mapping translation layer (1).  It=20
> might be adaptable.

You are pretty much describing the data structures that have made jffs2
such a poor performer.

>=20
>     (1) A flash sector-mapping translation layer gives a=20
> filesystem the ability to use 'named block numbers'.  For example, the

> NOR filesystem I did used 32 bit named block numbers regardless of the

> size of the flash (which was typically only 2MB).  The filesystem
topology was
> actually encoded into the block number it self.  In other=20
> words, the filesystem is not bound to a linear range of block numbers
it is
> simply bound

Works OK for NOR. Has interesting problems, mainly with maintaining the
block number map reliably in storage, when attempted on NAND.

>     What does this mean?  This means that what you really
> want to do is not necessarily write a filesystem that is explicitly
> designed for NAND operation, but instead write a filesystem that is
> explicitly designed to run on top of an abstracted topology (such as one
> where you can have named block numbers), and which generally has the
> desired features for locality of reference.  Such a filesystem would not
> become obsolete anywhere near as quickly as a nand-specific filesystem
> would and rebuilding an abstracted topology (whose underlying code
> would become obsolete as the technology changes) is a whole lot easier
> than redesigning a filesystem.

There's really only one topology that's efficient for a NAND device, and
that's to do log-like writing coupled with garbage collection.
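
A minimal sketch of what "log-like writing coupled with garbage
collection" can look like, as a toy model only. The class name, the
4x4 geometry, the single spare block, and the "fewest live pages" victim
policy are all assumptions made for illustration; none of them is taken
from the file systems mentioned in this thread.

```python
class LogFlash:
    """Toy NAND: pages are only appended; space is reclaimed a whole block at a time."""

    def __init__(self, num_blocks=4, pages_per_block=4):
        # Assumes num_blocks >= 2 so that one block can always act as the spare.
        self.pages_per_block = pages_per_block
        self.blocks = [[] for _ in range(num_blocks)]   # each page is (key, data)
        self.live = {}      # key -> (block, page) of the newest copy
        self.head = 0       # erase block currently being appended to
        self.erases = 0

    def _put(self, key, data):
        self.blocks[self.head].append((key, data))
        self.live[key] = (self.head, len(self.blocks[self.head]) - 1)

    def _live_pages(self, blk):
        return [(k, d) for page, (k, d) in enumerate(self.blocks[blk])
                if self.live.get(k) == (blk, page)]

    def _advance_head(self):
        free = [i for i, b in enumerate(self.blocks) if not b]
        if len(free) > 1:
            self.head = free[0]
            return
        # Only the spare block is left: collect the block with the fewest live pages.
        spare = free[0]
        victim = min((i for i in range(len(self.blocks)) if i != spare),
                     key=lambda i: len(self._live_pages(i)))
        survivors = self._live_pages(victim)
        if len(survivors) == self.pages_per_block:
            raise RuntimeError("flash full: nothing to reclaim")
        self.head = spare
        for key, data in survivors:     # copy still-valid pages forward
            self._put(key, data)
        self.blocks[victim] = []        # erase; the victim becomes the new spare
        self.erases += 1

    def write(self, key, data):
        if len(self.blocks[self.head]) == self.pages_per_block:
            self._advance_head()
        self._put(key, data)

    def read(self, key):
        blk, page = self.live[key]
        return self.blocks[blk][page][1]
```

Rewriting the same three keys forty times never overwrites a page in
place; stale copies simply pile up until a whole block can be erased.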


> I am quite partial to the named-block concept, I really
> think it's the best way to go for flash filesystem design.  The flash
> already has to have a sector-translation mechanism, making the jump to a
> full blown named-block model is only a small additional step.

The devil in the details of your naming scheme turns out to be managing
the name translation information within the NAND storage itself. This is
the source of significant performance problems in jffs2, for example,
and of a huge amount of code complexity in the commercial system I
work with.

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 21:34:30 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5C849106567A;
	Mon, 31 Mar 2008 21:34:30 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 45B8F8FC21;
	Mon, 31 Mar 2008 21:34:30 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 62F3440A445;
	Mon, 31 Mar 2008 14:34:21 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 31 Mar 2008 14:34:29 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
In-Reply-To: <200803312006.m2VK6Aom028133@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciTaslf/zlwdF8FTa+bHVN44JtuagAC86dQ
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 21:34:30 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Monday, March 31, 2008 1:06 PM
> To: Martin Fouts
> Cc: qpadla@gmail.com; freebsd-arch@freebsd.org; Christopher
> Arnold; arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics
>
>
>     But how do you index that information?  You can't simply
> append the information to the NAND unless you also have a way to
> access it.  So does the filesystem have to scan the NAND (or significant
> portions of it) in order to build an index of the filesystem topology in
> system memory?
>
>     No matter what you do you have to index the information
> *SOMEWHERE*.

And NAND devices have a *SOMEWHERE* that makes them different than other
persistent storage devices in ways that make them interesting to do file
systems for.

It's not _that_ you have to scan the NAND, by the way, it's _when_ you
scan the NAND that has the major impact on performance.

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:20:08 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EB06F1065673;
	Mon, 31 Mar 2008 22:20:08 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id CA0508FC14;
	Mon, 31 Mar 2008 22:20:08 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VMJlOU029241;
	Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VMJlkT029240;
	Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
Date: Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803312219.m2VMJlkT029240@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:20:09 -0000

:True, but when you're working with a part that does ECC in HW, you're
:stuck with the ECC it does.
    
    Well, I suppose if you can't get access to the original data *OR* the
    HW ECC code (to undo the broken correction) you would wind up with an
    uncorrectable double error (a failed ECC correction always makes things
    worse rather than better, particularly with a 1 bit correct / 2 bit detect
    Hamming code).  If you DO have access to the original data or the
    HW ECC code you can undo the failed correction and then ignore the
    hardware ECC and do your own in the auxiliary storage.

    I can see why people would want the hardware to do the data validation,
    it's a performance issue in many respects, but if the hardware is only
    doing a simple ECC and doesn't do a separate CRC it will do more harm
    than good and simply cannot be depended upon for anything.
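
    The failure mode described above can be shown with a toy code.  This is
    only a sketch, assuming an extended Hamming (8,4) code (1 bit correct /
    2 bit detect) over a single nibble; real NAND parts use larger codes
    over whole sectors, and the function names here are hypothetical.  The
    separate check uses zlib.crc32 from the Python standard library.

```python
import zlib

DATA_POS = (3, 5, 6, 7)          # codeword positions holding data bits 0..3

def encode(nibble):
    bits = [0] * 8               # bits[0] is the overall parity bit
    for shift, pos in enumerate(DATA_POS):
        bits[pos] = (nibble >> shift) & 1
    syndrome = 0
    for pos in DATA_POS:
        if bits[pos]:
            syndrome ^= pos
    for pos in (1, 2, 4):        # Hamming parity bits cancel the syndrome
        bits[pos] = 1 if syndrome & pos else 0
    bits[0] = sum(bits[1:]) & 1
    return bits

def decode(bits):
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    parity_ok = sum(bits) % 2 == 0
    if syndrome and parity_ok:
        status = "uncorrectable"         # looks like a double error
    elif syndrome:
        bits[syndrome] ^= 1              # looks like a single error: "fix" it
        status = "corrected"
    elif not parity_ok:
        bits[0] ^= 1                     # the overall parity bit itself flipped
        status = "corrected"
    else:
        status = "ok"
    nibble = sum(bits[pos] << shift for shift, pos in enumerate(DATA_POS))
    return nibble, status

def flip(bits, *positions):
    out = list(bits)
    for pos in positions:
        out[pos] ^= 1
    return out

def crc_ok(nibble, stored_crc):
    return zlib.crc32(bytes([nibble])) == stored_crc

word = encode(0b1010)
stored_crc = zlib.crc32(bytes([0b1010]))
# Three flipped bits, only one of them (position 5) in the data.
value, status = decode(flip(word, 1, 2, 5))
# The decoder reports "corrected" yet returns 0b1100: its "fix" damaged a
# second data bit.  Only the separate CRC notices that the data is wrong.
```

    In this instance the triple error is indistinguishable from a single
    error at position 6, so the decoder flips a good bit and reports
    success.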

:...
: (some stuff I reordered later on below so I can get this out of the way)
:...
:>     For example, the simple addition of some front-end non-volatile cache,
:> such as a dime-cap-backed static ram, would have a very serious effect
:> on any such filesystem design.
:
:Yes.  However since the phone market makes such a change very unlikely,
:because of cost pressures, it's not one we take into consideration.

    For flash storage systems competitive with hard drive storage, 
    that is any flash storage device of significant size (e.g. 64GB-1TB or
    more), the incremental cost of adding non-volatile front-end cache ram
    is going to be in the $1-$3 range.  If that vendor can then advertise
    a performance advantage over their competitors they can easily price
    the drive to match the incremental cost.  Very easily.  This is what
    happened to the HD market.

    The economics that drive front-end cache implementations for flash SSD
    are going to be the same economics that drive front-end cache
    implementations for hard drives.  All hard drives these days have at
    least 8MB of sector cache (and many have 32MB or more), so the writing
    is on the wall.  Any filesystem you design which does not take into
    account a front-end cache is going to be obsolete in probably less than
    2 years (if not already).

    For the phone market?  You mean small flash storage devices?  Performance
    is almost irrelevant there and you certainly do not need a front-end
    cache, or much of one.  A sector translation model for a small flash
    storage device (< 2G) is utterly trivial to implement but so is the
    log-append model.  There is going to be a huge scaling difference
    between the two when you get into large amounts of storage.

:Well, those of us who are shipping devices with flash parts in them have
:a somewhat different view on that, which is why I've worked on three
:NAND specific file systems in the last four years. Two of those are in
:use in shipping devices, and are expected to be in use for five or more
:years.

    Three in four years?  Is that an illustration of my point with regard
    to flash filesystem design?  Ok, that was a joke :-)

    But I don't think we can count small flash storage systems.  Both models
    devolve into trivialities when you are managing small amounts of
    flash storage.

:...
:reference).
:
:You've talked yourself into pretty much the same mistake that led to
:jffs2, which turned out to be a terrible idea.
:
:>     DragonFly's HAMMER has pretty good write-locality of
:> reference but still does random updates for B-Tree indices and things like
:> the mtime and atime fields.  It also uses numerous blockmaps that could
:> make direct use of a flash sector-mapping translation layer (1).  It
:> might be adaptable.
:
:You are pretty much describing the data structures that have made jffs2
:such a poor performer.
:...
:Works OK for NOR. Has interesting problems, mainly with maintaining the
:block number map reliably in storage, when attempted on NAND.
:...
:The devil in the details of your naming scheme turns out to be managing
:the name translation information within the NAND storage itself. This is
:the source of significant performance problems in jffs2, for example,
:and of a huge amount of code complexity in the commercial system I
:work with.

    Again, I am not familiar with jffs2 but you are painting with a very
    broad brush.  That is more than likely an issue specifically with the
    jffs2 design and not the concept of using named blocks in general.

    I understand where you are coming from.  Regardless of the model you
    use you have to index the data somehow.

    What you are advocating is a filesystem which uses an absolute sector
    referencing scheme.  Any change made to the filesystem requires a new
    page to essentially be appended to the flash storage.  In order to
    properly index the information and maintain the filesystem topology
    you also have to recopy *ALL* pages containing references to the
    updated absolute sector in order to repoint them to the new
    absolute sector.  The root of your filesystem winds up being the last
    page appended, in simple terms.

    While some modifications to this scheme are possible, it's pretty much
    the way you have to do things if you use that model.

    I really understand that model, and it has the advantage of simplicity
    but it also has some severe disadvantages when used as a general
    purpose filesystem (versus an embedded filesystem), not the least of
    which is that a single update can result in a chain reaction that
    requires considerably more write bandwidth, considerably more garbage
    collection, and some extra (but probably minor) wear of the flash.
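
    The chain reaction can be sketched with a toy append-only log in which
    every interior page refers to its child by absolute page number.  The
    single-child chain, the dict page format, and the helper names are
    assumptions made purely for illustration.

```python
class AbsoluteLog:
    """Append-only page store; a page's list index is its absolute page number."""

    def __init__(self):
        self.pages = []

    def append(self, payload):
        self.pages.append(payload)
        return len(self.pages) - 1

def build(log, depth, data):
    """Write a leaf plus `depth` interior pages pointing down to it; return the root."""
    page = log.append({"data": data})
    for _ in range(depth):
        page = log.append({"child": page})
    return page

def read(log, root):
    page = root
    while "child" in log.pages[page]:
        page = log.pages[page]["child"]
    return log.pages[page]["data"]

def update(log, root, data):
    """Replace the leaf under `root`; return (new_root, pages_written)."""
    ancestors = 0
    page = root
    while "child" in log.pages[page]:
        ancestors += 1
        page = log.pages[page]["child"]
    before = len(log.pages)
    new = log.append({"data": data})
    for _ in range(ancestors):          # every ancestor must be re-pointed
        new = log.append({"child": new})
    return new, len(log.pages) - before

log = AbsoluteLog()
root = build(log, 3, b"v1")
new_root, written = update(log, root, b"v2")
# One changed leaf costs four appended pages here, and the new root is the
# last page appended.
```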

    In contrast, if a filesystem is referencing named blocks and you have
    to move a block (either due to an error or a modification of that
    block through normal filesystem activity), NO changes need to be made
    to those elements of the filesystem that pointed to the block that got
    moved.  All you have to do is append the new block that is renaming the
    old one, which includes the name (aka 64 bit quantity) in its auxiliary
    data area, and cache the change in the translation in system memory
    until you decide to flush out the named block index (which I will
    describe a bit later on)... that's non critical information, by the
    way, and does not have to be written synchronously in order to be crash
    recoverable.  Write bandwidth is greatly reduced, particularly
    because when using a named block you only have to flush the actual
    modified page to the flash and nothing else other than a topological
    rollup record (which I will describe a bit later on).
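
    A minimal sketch of the named-block write path described above,
    assuming a toy append-only flash in which each page carries its name
    alongside the data (standing in for the auxiliary area).  The class and
    attribute names are hypothetical.

```python
class NamedBlockFlash:
    def __init__(self):
        self.pages = []     # append-only: (name, data); the name models the aux area
        self.index = {}     # in-memory translation: name -> absolute page number
        self.unflushed = 0  # translations changed since the last index rollup

    def write(self, name, data):
        self.pages.append((name, data))
        self.index[name] = len(self.pages) - 1   # cached only, nothing else rewritten
        self.unflushed += 1

    def read(self, name):
        return self.pages[self.index[name]][1]

flash = NamedBlockFlash()
flash.write(0x10, b"directory entry -> block 0x20")   # refers to 0x20 by name
flash.write(0x20, b"v1")
pages_before = len(flash.pages)
flash.write(0x20, b"v2")                              # block 0x20 moves
```

    The page that refers to block 0x20 is not rewritten when 0x20 moves;
    only the in-memory translation changes.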

    This works particularly well with a filesystem designed to use named
    blocks because there are *NO* indirect blockmaps to reference data or
    inodes in the filesystem.  An absolute-sector-based filesystem has
    blockmaps, e.g. to locate a block in a file.  In a named-block
    filesystem the blockmap *IS* the named block.  That is, the 'name' of
    the named block is effectively the inode number and file block number
    combined into one 64 bit (or larger) key.
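
    As an illustration of such a key, here is a sketch that packs an inode
    number and a file block number into one 64 bit name.  The 32/32 split
    and the function names are assumptions; the text above only says the
    two are combined into a 64 bit (or larger) key.

```python
INODE_BITS = 32     # assumed split, for illustration only
BLOCK_BITS = 32

def make_name(inode, file_block):
    if not 0 <= inode < (1 << INODE_BITS):
        raise ValueError("inode out of range")
    if not 0 <= file_block < (1 << BLOCK_BITS):
        raise ValueError("file block out of range")
    return (inode << BLOCK_BITS) | file_block

def split_name(name):
    return name >> BLOCK_BITS, name & ((1 << BLOCK_BITS) - 1)

# Appending block 3 to inode 7 just means writing a page under this name.
name = make_name(7, 3)
```

    With the inode in the high bits, all blocks of one file sort next to
    each other in any index ordered by name.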

    Let me be clear about this distinction.  In a filesystem that references
    absolute sectors an append to a file requires (typically) updating a
    blockmap of absolute sectors which in turn requires the blockmap block
    to be rewritten, along with any reference to it and so on and so forth
    up the chain.   In a named-block filesystem appending to a file simply
    means writing out a new named block.  The filesystem itself has NO
    concept of a blockmap... the blockmap is built into the sector translation
    layer.

    In other words, a filesystem using the named-block model is not any
    more complex than a filesystem using an absolute sector numbering model,
    and a filesystem using the named-block model is far easier from the
    perspective of caching changes in system memory without requiring a
    sync to flash for crash recovery.  That is a huge deal.

    Now is there some work involved with making the named block translations
    efficient?  Yes, there is some... but it is really not much more complex
    than the work involved in an absolute-sector-based filesystem which
    must index files, directories, and so forth within the filesystem
    itself.

    In particular, when using a named-block model you still have to
    occasionally flush out the translation topology to the flash media.
    Since this topology references physical block numbers it, in fact, uses
    exactly the SAME mechanism that the absolute filesystem model uses
    to maintain its topology.  In other words, it is no more complex than
    the absolute filesystem model.

    The big difference is that the translation topology does not have
    to be written synchronously and the frequency of the rollup writes
    is based ENTIRELY on how much system memory you are willing to 
    dedicate to caching topological changes.  E.G. if you dedicate, say, 100KB
    of system memory you can store the topology for, say, 3200 filesystem
    updates (using 32 byte structures) before you have to 'flush' it to
    the flash.  A filesystem based on absolute blocks pretty much has to
    cache the related (modified) blocks in memory which are far larger, and
    thus must flush them to storage far more frequently.  But translations
    are tiny little records...  tens of gigabytes worth of updates can be
    cached in a small amount of system memory.  The translation topology
    does NOT have to be synced to disk on fsync() because all the
    information can be recovered when the filesystem is mounted after
    a reboot.  That is critical.
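
    The arithmetic and the mount-time recovery can be sketched as follows.
    The 32 byte record layout (name, physical page, sequence number,
    reserved) is hypothetical, as are the function names; the only figures
    taken from the text are 100KB, 32 bytes, and 3200 updates.

```python
import struct

# Hypothetical 32 byte translation record: name, physical page, sequence, reserved.
RECORD = struct.Struct("<QQQQ")
assert RECORD.size == 32

def cache_capacity(cache_bytes):
    """How many translation records fit in the given amount of system memory."""
    return cache_bytes // RECORD.size

def recover_index(rollup, pages, rollup_end):
    """Rebuild the name -> physical page map at mount time.

    rollup      translations as of the last flushed rollup
    pages       the flash contents as (name, data) tuples, in write order
    rollup_end  first physical page not covered by the rollup
    """
    index = dict(rollup)
    for phys in range(rollup_end, len(pages)):
        name, _data = pages[phys]   # the name is read back from the aux area
        index[name] = phys          # later pages supersede earlier ones
    return index

capacity = cache_capacity(100 * 1024)   # 102400 // 32
```

    Because every page written after the rollup carries its own name,
    losing the cached translations in a crash costs only a scan of the
    pages appended since the last rollup.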

    Going back to the absolute filesystem model, such a filesystem does have
    the advantage of locality of reference (that is, not so much seek-wise
    which is irrelevant for flash but more from the point of view of being
    able to chain to the desired information).  A filesystem using the
    named-block model must lookup the block, typically in a global index,
    which means it must maintain a cache of pointers into its translation
    tree.  This is a little more expensive when looking up inodes but,
    actually, I use a very similar scheme in HAMMER (which has a global
    B-Tree for everything) and the caching required is so simple that it just
    becomes a NOP.  Just storing a cached absolute sector number in the
    in-memory inode structure for use as a starting point when looking up
    elements related to that inode winds up being no less efficient than
    embedding a blockmap in an inode as you see in a more typical
    filesystem.
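
    A rough sketch of the cached-starting-point idea, using a sorted array
    in place of a B-Tree and a slot number in place of a cached sector
    number.  This is an analogy under those assumptions, not HAMMER's
    actual data structure; the class and method names are hypothetical.
    It uses bisect.bisect_left from the Python standard library.

```python
import bisect

class GlobalIndex:
    """One sorted index for everything, keyed by (inode, file_block)."""

    def __init__(self, mapping):
        self.keys = sorted(mapping)
        self.vals = [mapping[k] for k in self.keys]

    def first_slot(self, inode):
        """Slot of the inode's first entry; cache this in the in-memory inode."""
        return bisect.bisect_left(self.keys, (inode, 0))

    def lookup(self, inode, block, hint=0):
        """Find (inode, block), searching only from the cached slot onward."""
        i = bisect.bisect_left(self.keys, (inode, block), lo=hint)
        if i < len(self.keys) and self.keys[i] == (inode, block):
            return self.vals[i]
        raise KeyError((inode, block))

index = GlobalIndex({(1, 0): 10, (1, 1): 11, (2, 0): 20, (2, 5): 25})
hint = index.first_slot(2)          # computed once, kept with inode 2
sector = index.lookup(2, 5, hint)   # later lookups start at the hint
```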


    In any case, I hope this clarifies the issues.  I really do understand
    where you are coming from, the simplicity of chaining the physical
    topology cannot be denied, and I like the elegance, but I hope I've
    shown that it is not actually simplifying the overall design much
    over a named block scheme, and that there are some fairly severe
    issues that can crop up that are complete non-issues when using a named
    block scheme.

    Long winded, I know.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:20:08 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EB06F1065673;
	Mon, 31 Mar 2008 22:20:08 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id CA0508FC14;
	Mon, 31 Mar 2008 22:20:08 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VMJlOU029241;
	Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VMJlkT029240;
	Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
Date: Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803312219.m2VMJlkT029240@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:20:09 -0000

:True, but when you're working with a part that does ECC in HW, you're
:stuck with the ECC it does.
    
    Well, I suppose if you can't get access to the original data *OR* the
    HW ECC code (to undo the broken correction) you would wind up with an
    uncorrectable double error (a failed ECC correction always makes things
    worse rather than better, particularly with a 1 bit correct / 2 bit detect
    Hamming code).  If you DO have access to the original data or the
    HW ECC code you can undo the failed correction and then ignore the
    hardware ECC and do your own in the auxiliary storage.

    I can see why people would want the hardware to do the data validation,
    it's a performance issue in many respects, but if the hardware is only
    doing a simple ECC and doesn't do a separate CRC it will do more harm
    than good and simply cannot be depended upon for anything.

:...
: (some stuff I reordered later on below so I can get this out of the way)
:...
:>     For example, the simple addition of some front-end non-volatile cache,
:> such as a dime-cap-backed static ram, would have a very serious effect
:> on any such filesystem design.
:
:Yes.  However since the phone market makes such a change very unlikely,
:because of cost pressures, it's not one we take into consideration.

    For flash storage systems competitive with hard drive storage, 
    that is any flash storage device of significant size (e.g. 64GB-1TB or
    more), the incremental cost of adding non-volatile front-end cache ram
    is going to be in the $1-$3 range.  If that vendor can then advertise
    a performance advantage over their competitors they can easily price
    the drive to match the incremental cost.  Very easily.  This is what
    happened to the HD market.

    The economics that drive front-end cache implementations for flash SSD
    are going to be the same economics that drive front-end cache
    implementations for hard drives.  All hard drives these days have at
    least 8MB of sector cache (and many have 32MB or more), so the writing
    is on the wall.  Any filesystem you design which does not take into
    account a front-end cache is going to be obsolete in probably less than
    2 years (if not already).

    For the phone market?  You mean small flash storage devices?  Performance
    is almost irrelevant there and you certainly do not need a front-end
    cache, or much of one.  A sector translation model for a small flash
    storage device (< 2G) is utterly trivial to implement but so is the
    log-append model.  There is going to be a huge scaling difference
    between the two when you get into large amounts of storage.

:Well, those of us who are shipping devices with flash parts in them have
:a somewhat different view on that, which is why I've worked on three
:NAND specific file systems in the last four years. Two of those are in
:use in shipping devices, and are expected to be in use for five or more
:years.

    Three in four years?  Is that an illustration of my point with regard
    to flash filesystem design?  Ok, that was a joke :-)

    But I don't think we can count small flash storage systems.  Both models
    devolve into trivialities when you are managing small amounts of
    flash storage.

:...
:reference).
:
:You've talked yourself into pretty much the same mistake that led to
:jffs2, which turned out to be a terrible idea.
:
:>     DragonFly's HAMMER has pretty good write-locality of
:> reference but still does random updates for B-Tree indices and things like
:> the mtime and atime fields.  It also uses numerous blockmaps that could
:> make direct use of a flash sector-mapping translation layer (1).  It
:> might be adaptable.
:
:You are pretty much describing the data structures that have made jffs2
:such a poor performer.
:...
:Works OK for NOR. Has interesting problems, mainly with maintaining the
:block number map reliably in storage, when attempted on NAND.
:...
:The devil in the details of your naming scheme turns out to be managing
:the name translation information within the NAND storage itself. This is
:the source of significant performance problems in jffs2, for example,
:and of a huge amount of code complexity in the commercial system I
:work with.

    Again, I am not familiar with jffs2 but you are painting with a very
    broad brush.  That is more than likely an issue specifically with the
    jffs2 design and not the concept of using named blocks in general.

    I understand where you are coming from.  Regardless of the model you
    use you have to index the data somehow.

    What you are advocating is a filesystem which uses an absolute sector
    referencing scheme.  Any change made to the filesystem requires a new
    page to essentially be appended to the flash storage.  In order to
    properly index the information and maintain the filesystem topology
    you also have to recopy *ALL* pages containing references to the
    updated absolute sector in order to repoint them to the new
    absolute sector.  The root of your filesystem winds up being the last
    page appended, in simple terms.

    While some modifications to this scheme are possible, it's pretty much
    the way you have to do things if you use that model.

    I really understand that model, and it has the advantage of simplicity
    but it also has some severe disadvantages when used as a general
    purpose filesystem (versus an embedded filesystem), not the least of
    which is that a single update can result in a chain reaction that
    requires considerably more write bandwidth, considerably more garbage
    collection, and some extra (but probably minor) wear of the flash.

    In contrast, if a filesystem is referencing named blocks and you have
    to move a block (either due to an error or a modification of that
    block through normal filesystem activity), NO changes need to be made
    to those elements of the filesystem that pointed to the block that got
    moved.  All you have to do is append the new block that is renaming the
    old one, which includes the name (aka 64 bit quantity) in its auxiliary
    data area, and cache the change in the translation in system memory
    until you decide to flush out the named block index (which I will
    describe a bit later on)... that's non critical information, by the
    way, and does not have to be written synchronously in order to be crash
    recoverable.  Write bandwidth is greatly reduced, particularly
    because when using a named block you only have to flush the actual
    modified page to the flash and nothing else other than a topological
    rollup record (which I will describe a bit later on).

    This works particularly well with a filesystem designed to use named
    blocks because there are *NO* indirect blockmaps to reference data or
    inodes in the filesystem.  An absolute-sector-based filesystem has
    blockmaps, e.g. to locate a block in a file.  In a named-block
    filesystem the blockmap *IS* the named block.  That is, the 'name' of
    the named block is effectively the inode number and file block number
    combined into one 64 bit (or larger) key.

    Let me be clear about this distinction.  In a filesystem that references
    absolute sectors an append to a file requires (typically) updating a
    blockmap of absolute sectors which in turn requires the blockmap block
    to be rewritten, along with any reference to it and so on and so forth
    up the chain.   In a named-block filesystem appending to a file simply
    means writing out a new named block.  The filesystem itself has NO
    concept of a blockmap... the blockmap is built into the sector translation
    layer.

    In other words, a filesystem using the named-block model is not any
    more complex than a filesystem using an absolute sector numbering model,
    and a filesystem using the named-block model is far easier from the
    perspective of caching changes in system memory without requiring a
    sync to flash for crash recovery.  That is a huge deal.

    Now is there some work involved with making the named block translations
    efficient?  Yes, there is some... but it is really not much more complex
    than the work involved in an absolute-sector-based filesystem which
    must index files, directories, and so forth within the filesystem
    itself.

    In particular, when using a named-block model you still have to
    occasionally flush out the translation topology to the flash media.
    Since this topology references physical block numbers it, in fact, uses
    exactly the SAME mechanism that the absolute filesystem model used
    to maintain its topology.  In other words, no more complex than the
    absolute filesystem model.

    The big difference is that the translation topology does not have
    to be written synchronously and the frequency of the rollup writes
    is based ENTIRELY on how much system memory you are willing to 
    dedicate to caching topological changes.  E.G. if you dedicate, say, 100KB
    of system memory you can store the topology for, say, 3200 filesystem
    updates (using 32 byte structures) before you have to 'flush' it to
    the flash.  A filesystem based on absolute blocks pretty much has to
    cache the related (modified) blocks in memory which are far larger, and
    thus must flush them to storage far more frequently.  But translations
    are tiny little records...  tens of gigabytes worth of updates can be
    cached in a small amount of system memory.  The translation topology
    does NOT have to be synced to disk on fsync() because all the
    information can be recovered when the filesystem is mounted after
    a reboot.  That is critical.
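
    The arithmetic behind those numbers, as a small sketch.  The 32 byte
    record size comes from the paragraph above; the 16KB filesystem
    block size used in the second calculation is purely an assumption
    chosen for illustration:

```python
RECORD_BYTES = 32  # size of one cached translation record (from the text)


def records_that_fit(cache_bytes, record_bytes=RECORD_BYTES):
    """How many translation records fit in a cache of the given size."""
    return cache_bytes // record_bytes


def cache_needed(update_bytes, block_bytes, record_bytes=RECORD_BYTES):
    """Cache needed to hold one record per updated block."""
    blocks = -(-update_bytes // block_bytes)  # ceiling division
    return blocks * record_bytes


print(records_that_fit(100 * 1024))                    # 3200
print(cache_needed(10 * 2**30, 16 * 1024) // 2**20)    # 20 (MB per 10 GB)
```

    So under that assumed block size, roughly 20MB of system memory
    covers the translations for 10 gigabytes of updates.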

    Going back to the absolute filesystem model, such a filesystem does have
    the advantage of locality of reference (that is, not so much seek-wise
    which is irrelevant for flash but more from the point of view of being
    able to chain to the desired information).  A filesystem using the
    named-block model must look up the block, typically in a global index,
    which means it must maintain a cache of pointers into its translation
    tree.  This is a little more expensive when looking up inodes but,
    actually, I use a very similar scheme in HAMMER (which has a global
    B-Tree for everything) and the caching required is so simple that it just
    becomes a NOP.  Just storing a cached absolute sector number in the
    in-memory inode structure for use as a starting point when looking up
    elements related to that inode winds up being no less efficient than
    embedding a blockmap in an inode as you see in a more typical
    filesystem.


    In any case, I hope this clarifies the issues.  I really do understand
    where you are coming from, the simplicity of chaining the physical
    topology cannot be denied, and I like the elegance, but I hope I've
    shown that it is not actually simplifying the overall design much
    over a named block scheme, and that there are some fairly severe
    issues that can crop up that are complete non-issues when using a named
    block scheme.

    Long winded, I know.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:21:57 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D2B29106566B
	for <arch@freebsd.org>; Mon, 31 Mar 2008 22:21:57 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (ns1.bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id C3C888FC19
	for <arch@freebsd.org>; Mon, 31 Mar 2008 22:21:56 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id C976C5B50;
	Mon, 31 Mar 2008 15:21:54 -0700 (PDT)
To: Matthew Dillon <dillon@apollo.backplane.com>
In-reply-to: Your message of "Mon, 31 Mar 2008 13:06:10 PDT."
	<200803312006.m2VK6Aom028133@apollo.backplane.com> 
Date: Mon, 31 Mar 2008 15:21:54 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20080331222154.C976C5B50@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org, Martin Fouts <mfouts@danger.com>
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:21:57 -0000

On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon <dillon@apollo.backplane.com>  wrote:
>     But how do you index that information?  You can't simply append the
>     information to the NAND unless you also have a way to access it.  So
>     does the filesystem have to scan the NAND (or significant portions of it)
>     in order to build an index of the filesystem topology in system memory?

One possible way:

I'd design the system so that each update ends with the write
of a root block[1]. I'd also write root blocks at fixed
locations to find them easily without having to scan the
whole disk. Given this, on reboot use binary search to locate
the latest root block at a fixed location. There may be
further updates so scan forward until you locate the most
up-to-date root block and once you have that, you are home
free!  Everything before that root block will be consistent
with it.
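
A rough sketch of that recovery step, under simplifying assumptions
that are not part of the proposal itself: the device is modelled as a
Python list, erased slots are None, the log has not wrapped around
(so written slots form a prefix), and a root block is forced at every
STRIDE-th slot.  The names STRIDE and find_latest_root are
hypothetical:

```python
STRIDE = 4  # hypothetical: a root block is always written at every 4th slot


def find_latest_root(device):
    """Return the index of the newest root block, or None if none exists.

    Each slot is None (erased), ("data", payload) or ("root", seq).
    """
    fixed = range(0, len(device), STRIDE)   # the fixed root locations
    lo, hi = 0, len(fixed)
    # Binary search for the first erased fixed location.
    # Invariant: fixed[:lo] are written, fixed[hi:] are erased.
    while lo < hi:
        mid = (lo + hi) // 2
        if device[fixed[mid]] is not None:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return None                          # nothing has been written yet
    start = fixed[lo - 1]
    latest = start
    # Scan forward for root blocks written after the last fixed one.
    for i in range(start, min(start + STRIDE, len(device))):
        block = device[i]
        if block is None:
            break
        if block[0] == "root":
            latest = i
    return latest
```

Anything written after the returned index belongs to an update that
never reached its root block and can be ignored on mount.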

Even if the system crashes in the middle of a compacting GC,
the design should be able to recover all data.

What I am not sure about is whether one can do incremental
GC. A stop-and-copy GC is always possible but I don't like
the idea of long pauses.

[1]
The root block contains block # of the earliest valid block,
a sequence number (that will not roll over in device's
lifetime), block #s for various structures such as the root
of inodes, superblock, freelist if any, etc.

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:23:42 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 753B71065673;
	Mon, 31 Mar 2008 22:23:42 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 3B5268FC13;
	Mon, 31 Mar 2008 22:23:41 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (unknown [192.168.61.3])
	by phk.freebsd.dk (Postfix) with ESMTP id DCFB017104;
	Mon, 31 Mar 2008 22:23:39 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2VMNbjD026081;
	Mon, 31 Mar 2008 22:23:38 GMT (envelope-from phk@critter.freebsd.dk)
To: Bakul Shah <bakul@bitblocks.com>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Mon, 31 Mar 2008 15:21:54 MST."
	<20080331222154.C976C5B50@mail.bitblocks.com> 
Date: Mon, 31 Mar 2008 22:23:37 +0000
Message-ID: <26080.1207002217@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:23:42 -0000

In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes:
>On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon <dillon@apollo.backplane.com>  wrote:
>>     But how do you index that information?  You can't simply append the
>>     information to the NAND unless you also have a way to access it.  So
>>     does the filesystem have to scan the NAND (or significant portions of it)
>>     in order to build an index of the filesystem topology in system memory?
>
>One possible way:
>
>I'd design the system so that each update ends with the write
>of a root block[1]. 

This is sort of the approach Margo Seltzer used for her (Kludge-)LFS;
it has many drawbacks, in particular when it comes to recovery.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:29:39 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 308DF1065673;
	Mon, 31 Mar 2008 22:29:39 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.freebsd.org (Postfix) with ESMTP id 22D468FC28;
	Mon, 31 Mar 2008 22:29:38 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id A779A1A4D8D; Mon, 31 Mar 2008 15:29:38 -0700 (PDT)
Date: Mon, 31 Mar 2008 15:29:38 -0700
From: Alfred Perlstein <alfred@freebsd.org>
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Message-ID: <20080331222938.GS95731@elvis.mu.org>
References: <20080331222154.C976C5B50@mail.bitblocks.com>
	<26080.1207002217@critter.freebsd.dk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <26080.1207002217@critter.freebsd.dk>
User-Agent: Mutt/1.4.2.3i
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:29:39 -0000

* Poul-Henning Kamp <phk@phk.freebsd.dk> [080331 15:24] wrote:
> In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes:
> >On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon <dillon@apollo.backplane.com>  wrote:
> >>     But how do you index that information?  You can't simply append the
> >>     information to the NAND unless you also have a way to access it.  So
> >>     does the filesystem have to scan the NAND (or significant portions of it)
> >>     in order to build an index of the filesystem topology in system memory?
> >
> >One possible way:
> >
> >I'd design the system so that each update ends with the write
> >of a root block[1]. 
> 
> This is sort of the approach Margo Seltzer used for her (Kludge-)LFS
> it has many drawbacks, in particular when it comes to recovery.

Can you explain why?

I could see it being a problem because recovering the filesystem's
most recent change might require significant scanning?

-- 
- Alfred Perlstein

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:38:47 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C6F401065671;
	Mon, 31 Mar 2008 22:38:47 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (mail.bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id AD7E08FC19;
	Mon, 31 Mar 2008 22:38:47 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id CFD975BAE;
	Mon, 31 Mar 2008 15:38:46 -0700 (PDT)
To: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-reply-to: Your message of "Mon, 31 Mar 2008 22:23:37 -0000."
	<26080.1207002217@critter.freebsd.dk> 
Date: Mon, 31 Mar 2008 15:38:46 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20080331223846.CFD975BAE@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:38:47 -0000

On Mon, 31 Mar 2008 22:23:37 -0000 "Poul-Henning Kamp" <phk@phk.freebsd.dk>  wrote:
> In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes:
> >On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon <dillon@apollo.backplane.com>  wrote:
> >>     But how do you index that information?  You can't simply append the
> >>     information to the NAND unless you also have a way to access it.  So
> >>     does the filesystem have to scan the NAND (or significant portions of it)
> >>     in order to build an index of the filesystem topology in system memory?
> >
> >One possible way:
> >
> >I'd design the system so that each update ends with the write
> >of a root block[1]. 
> 
> This is sort of the approach Margo Seltzer used for her (Kludge-)LFS
> it has many drawbacks, in particular when it comes to recovery.

[Poul, use positive encouragement and you'd inspire a lot more
people!]

Note that in effect this is exactly what zfs does. Update of
any block implies finding a new place for the updated copy,
which means the block pointing to it must be also updated,
which means a new place for it etc. etc.
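
A toy sketch of that chain of updates, to make the "etc. etc." concrete.
This is not how zfs is implemented; the append-only list, the tuple
representation of pointer blocks and the name cow_update are all
assumptions for illustration:

```python
def cow_update(store, root, path, value):
    """Copy-on-write update in an append-only block store.

    store: list of blocks; an interior block is a tuple of child indices,
           a leaf is any other value.
    root:  index of the current root block.
    path:  child slots to follow from the root down to the leaf.
    Returns the index of the new root, which is always written last.
    """
    if not path:
        store.append(value)                  # new copy of the leaf
        return len(store) - 1
    block = store[root]
    slot = path[0]
    child = cow_update(store, block[slot], path[1:], value)
    # The parent must point at the new child, so it gets a new copy too.
    store.append(block[:slot] + (child,) + block[slot + 1:])
    return len(store) - 1
```

The old root and everything it points to stay untouched, so a crash
before the new root is written leaves the previous state intact.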

But hey, I spent just a few minutes sketching out the idea so
it is possible I missed a whole bunch of things!  If I was
actually implementing this (which I am tempted to...) I'd
certainly want to know what others did.

One thing I forgot to add: I'd let the lower level handle bad
block forwarding and wear levelling (like on the m-tron
device).

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:54:38 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DA727106568B;
	Mon, 31 Mar 2008 22:54:37 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id BCBD58FC1A;
	Mon, 31 Mar 2008 22:54:37 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VMsQqB029550;
	Mon, 31 Mar 2008 15:54:26 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VMsPqZ029549;
	Mon, 31 Mar 2008 15:54:25 -0700 (PDT)
Date: Mon, 31 Mar 2008 15:54:25 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803312254.m2VMsPqZ029549@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:54:38 -0000

:>
:>     No matter what you do you have to index the information
:> *SOMEWHERE*.
:
:And NAND devices have a *SOMEWHERE* that makes them different than other
:persistent storage devices in ways that make them interesting to do file
:systems for.
:
:It's not _that_ you have to scan the NAND, by the way, it's _when_ you
:scan the NAND that has the major impact on performance.

   I know where you are coming from there.  The flash filesystem I did for
   our telemetry product (20 years ago, which is still in operation today)
   uses a named-block translation scheme but simply builds the topology out
   in main memory when the filesystem is mounted.  These are small flash
   devices, two 1 MByte NOR chips if I remember right.  It just scans
   the translation table which is just a linear array and builds the
   topology in ram, which takes maybe a few milliseconds to do on boot
   and after that, zero cost.
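
   A sketch of what such a mount-time scan could look like.  The actual
   on-flash format of that product is not described here, so the
   8 byte (name, sector) entry layout and the name build_topology are
   assumptions; the only hardware fact relied on is that erased NOR
   flash reads back as all ones:

```python
import struct

ENTRY = struct.Struct("<II")       # hypothetical entry: (name, sector)
ERASED = b"\xff" * ENTRY.size      # erased NOR flash reads as 0xFF bytes


def build_topology(table_image):
    """Scan a linear translation table and build name -> sector in RAM."""
    topology = {}
    last = len(table_image) - ENTRY.size
    for off in range(0, last + 1, ENTRY.size):
        raw = table_image[off:off + ENTRY.size]
        if raw == ERASED:
            break                  # first erased slot ends the used region
        name, sector = ENTRY.unpack(raw)
        topology[name] = sector    # later entries supersede earlier ones
    return topology
```

   The scan is a single linear pass, which is why it is cheap on a
   small device and why it does not scale to a large one.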

   Of course, that was for a small flash device so I could get away with it.
   And it was NOR so the translation table was all in one place and could
   be trivially scanned and updated.

   I have a similar issue in HAMMER.  HAMMER is designed as a multi-terabyte
   filesystem.  HAMMER isn't a flash filesystem but it effectively uses a
   naming mechanic to locate inodes and data, so the problem is similar.
   I was really worried about this mechanic as compared to, say, UFS, where
   the absolute location of the on-disk inode can be directly calculated
   from the inode number.

   HAMMER has to look the inode number up in the global B-Tree.  Even though
   it's a 15-way B-Tree (meaning it is fairly shallow and has good locality of
   reference in the buffer cache), I was really worried about performance
   so I implemented a B-Tree pointer cache in the in-memory inode
   structure.  So, e.g. if you look up a filename in a directory the directory
   inode cached a pointer into the B-Tree 'near' the directory inode element,
   and another for the most recent inode number it looked up.  These cached
   pointers then served as a heuristic starting point for the B-Tree
   lookup to locate the file in the directory and the inode number.
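
   The idea of starting a search from a cached position can be sketched
   like this.  It is not HAMMER's code: a flat sorted list stands in
   for the B-Tree, an integer index stands in for the cached node
   pointer, and the name lookup_with_hint is hypothetical:

```python
import bisect


def lookup_with_hint(keys, key, hint):
    """Find key in the sorted list keys, searching outward from hint.

    Returns the index of key, or -1 if it is absent.  The cost depends
    on the distance between hint and the answer, not on len(keys).
    """
    n = len(keys)
    if n == 0:
        return -1
    hint = min(max(hint, 0), n - 1)
    step = 1
    if keys[hint] <= key:
        # Gallop to the right until we pass the key or run off the end.
        lo, hi = hint, hint + 1
        while hi < n and keys[hi] <= key:
            lo = hi
            step *= 2
            hi = lo + step
        hi = min(hi, n)
    else:
        # Gallop to the left until we reach a key that is not too large.
        hi, lo = hint, hint - 1
        while lo >= 0 and keys[lo] > key:
            hi = lo
            step *= 2
            lo = hi - step
        lo = max(lo, 0)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < hi and keys[i] == key:
        return i
    return -1
```

   When the hint is close to the target, as it tends to be for
   consecutive lookups related to one inode, only a handful of
   comparisons are needed.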

   Well, to shorten the description... the overhead of having to do the
   lookup turned out to not matter at all with the cache in place.  Even
   better, since an inode's data blocks (and other information) are also
   indexed in the global B-Tree, the same cache also served for accesses
   into the file or directory itself.  Whatever overhead might have been
   added from having to look up the inode was completely covered by the
   savings of not having to run through a multi-level blockmap like FFS
   does.  In many cases a B-Tree search is so well localized that it doesn't
   even have to leave the node.

   (p.s. when I talk about localization here, I'm talking about in-memory
   disk cache, not seek localization).

   In any case, this is why I just don't worry about named-block translations.
   If one had a filesystem-layer blockmap AND named-block translations it
   could be pretty nasty due to the updating requirements.  But if the
   filesystem revolves entirely around named-block translations and did
   not implement any blockmaps of its own, the only thing that happens is
   that some overheads that used to be in one part of the filesystem are
   now in another part instead, resulting in a net zero.

   HAMMER actually does implement a couple of blockmaps in addition to
   its global B-Tree, but in the case of HAMMER the blockmap is mapping
   *huge* physical translations... 8MB per block.  They aren't named blocks
   like the B-Tree, but instead a virtualized address space designed to
   localize records, B-Tree nodes, large data blocks, and small data blocks.
   It's a different sort of blockmap than what one typically hears described
   for a filesystem... really more of an allocation space.  I do this for
   several implementation reasons most specifically because HAMMER is
   designed for a hard disk and seek locality of reference is important,
   but also so information can be relocated in 8MB chunks to be able to
   add and remove physical storage.  If I were reimplementing HAMMER as
   a flash filesystem (which I am NOT doing), I would probably do away
   with the blockmap layer entirely since seek locality of reference is
   not needed for a flash filesystem, and the global B-Tree would serve
   directly as the named-block topology.  Kinda cool to think about.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:54:38 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DA727106568B;
	Mon, 31 Mar 2008 22:54:37 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id BCBD58FC1A;
	Mon, 31 Mar 2008 22:54:37 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VMsQqB029550;
	Mon, 31 Mar 2008 15:54:26 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VMsPqZ029549;
	Mon, 31 Mar 2008 15:54:25 -0700 (PDT)
Date: Mon, 31 Mar 2008 15:54:25 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803312254.m2VMsPqZ029549@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:54:38 -0000

:>
:>     No matter what you do you have to index the information
:> *SOMEWHERE*.
:
:And NAND devices have a *SOMEWHERE* that makes them different than other
:persistent storage devices in ways that make them interesting to do file
:systems for.
:
:It's not _that_ you have to scan the NAND, by the way, it's _when_ you
:scan the NAND that has the major impact on performance.

   I know where you are coming from there.  The flash filesystem I did for
   our telemetry product (20 years ago, which is still in operation today)
   uses a named-block translation scheme but simply builds the topology out
   in main memory when the filesystem is mounted.  These are small flash
   devices, two 1 MByte NOR chips if I remember right.  It just scans
   the translation table, which is just a linear array, and builds the
   topology in RAM, which takes maybe a few milliseconds to do on boot
   and after that, zero cost.

   Of course, that was for a small flash device so I could get away with it.
   And it was NOR so the translation table was all in one place and could
   be trivially scanned and updated.
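
A minimal sketch of the mount-time scan described above. The entry layout is an assumption made for illustration: each table slot is taken to hold a logical block name and a sequence number (neither field is specified in the message), and the newest entry for a name wins.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    name: int   # logical block name (hypothetical field)
    seq: int    # write sequence number, higher is newer (hypothetical field)


def build_topology(table):
    """Scan the linear translation table once and map each name to its newest slot."""
    newest = {}                          # name -> (seq, physical slot index)
    for phys, entry in enumerate(table):
        if entry is None:                # erased or never-written slot
            continue
        best = newest.get(entry.name)
        if best is None or entry.seq > best[0]:
            newest[entry.name] = (entry.seq, phys)
    return {name: phys for name, (_seq, phys) in newest.items()}


# Name 7 was written twice; the later copy (slot 3) supersedes slot 0.
table = [Entry(7, 1), None, Entry(3, 2), Entry(7, 5), None]
topology = build_topology(table)         # {7: 3, 3: 2}
```

The whole cost is one linear pass at mount, after which every lookup is a dictionary access in memory.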

   I have a similar issue in HAMMER.  HAMMER is designed as a multi-terabyte
   filesystem.  HAMMER isn't a flash filesystem but it effectively uses a
   naming mechanic to locate inodes and data, so the problem is similar.
   I was really worried about this mechanic as compared to, say, UFS, where
   the absolute location of the on-disk inode can be directly calculated
   from the inode number.
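
To illustrate what "directly calculated" means here, a sketch under a deliberately simplified layout. The parameter names and values are hypothetical and this is not the real FFS macro set; it only shows that the address is pure arithmetic on the inode number, with no index lookup.

```python
def inode_location(ino, inodes_per_group, group_bytes, inode_table_offset, inode_bytes):
    """Byte address of an on-disk inode in a simplified cylinder-group layout."""
    group, index = divmod(ino, inodes_per_group)
    return group * group_bytes + inode_table_offset + index * inode_bytes


# Hypothetical geometry: 2048 inodes per group, 1,000,000-byte groups,
# inode table 4096 bytes into each group, 256-byte inodes.
addr = inode_location(5000, 2048, 1_000_000, 4096, 256)   # 2235520
```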

   HAMMER has to look the inode number up in the global B-Tree.  Even though
   it's a 15-way B-Tree (meaning it is fairly shallow and has good locality
   of reference in the buffer cache), I was really worried about performance
   so I implemented a B-Tree pointer cache in the in-memory inode
   structure.  So, e.g. if you look up a filename in a directory the directory
   inode caches a pointer into the B-Tree 'near' the directory inode element,
   and another for the most recent inode number it looked up.  These cached
   pointers then serve as a heuristic starting point for the B-Tree
   lookup to locate the file in the directory and the inode number.
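
A rough sketch of the pointer-cache idea, not HAMMER's actual code. It assumes a toy index whose leaves are fixed-size runs of sorted keys and a single cached leaf number; the class and field names are invented for illustration. A lookup that lands in the cached leaf never goes back to the root.

```python
import bisect


class TinyIndex:
    """Sorted keys grouped into leaf nodes, plus a one-entry leaf cache."""

    def __init__(self, keys, fanout=15):
        keys = sorted(keys)
        self.leaves = [keys[i:i + fanout] for i in range(0, len(keys), fanout)]
        self.first_keys = [leaf[0] for leaf in self.leaves]  # stands in for interior nodes
        self.cached_leaf = None     # hypothetical cached pointer held by the inode
        self.full_searches = 0      # lookups that had to start from the top

    def lookup(self, key):
        if not self.leaves:
            return False
        leaf_no = self.cached_leaf
        if leaf_no is None or not (self.leaves[leaf_no][0] <= key <= self.leaves[leaf_no][-1]):
            # Cache miss: search from the top, then remember where we ended up.
            self.full_searches += 1
            leaf_no = max(bisect.bisect_right(self.first_keys, key) - 1, 0)
        self.cached_leaf = leaf_no
        leaf = self.leaves[leaf_no]
        pos = bisect.bisect_left(leaf, key)
        return pos < len(leaf) and leaf[pos] == key


idx = TinyIndex(range(0, 300, 2))           # even keys 0..298 in ten leaves
hits = [idx.lookup(10), idx.lookup(12), idx.lookup(13)]   # [True, True, False]
searches_so_far = idx.full_searches          # 1: only the first lookup left the leaf
far = idx.lookup(100)                        # True, but costs a second full search
```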

   Well, to shorten the description... the overhead of having to do the
   lookup turned out to not matter at all with the cache in place.  Even
   better, since an inode's data blocks (and other information) are also
   indexed in the global B-Tree, the same cache also served for accesses
   into the file or directory itself.  Whatever overhead might have been
   added from having to look up the inode was completely covered by the
   savings of not having to run through a multi-level blockmap like FFS
   does.  In many cases a B-Tree search is so well localized that it doesn't
   even have to leave the node.

   (p.s. when I talk about localization here, I'm talking about in-memory
   disk cache, not seek localization).

   In any case, this is why I just don't worry about named-block translations.
   If one had a filesystem-layer blockmap AND named-block translations it
   could be pretty nasty due to the updating requirements.  But if the
   filesystem revolves entirely around named-block translations and did
   not implement any blockmaps of its own, the only thing that happens is
   that some overheads that used to be in one part of the filesystem are
   now in another part instead, resulting in a net zero.

   HAMMER actually does implement a couple of blockmaps in addition to
   its global B-Tree, but in the case of HAMMER the blockmap is mapping
   *huge* physical translations... 8MB per block.  They aren't named blocks
   like the B-Tree, but instead a virtualized address space designed to
   localize records, B-Tree nodes, large data blocks, and small data blocks.
   It's a different sort of blockmap than what one typically hears described
   for a filesystem... really more of an allocation space.  I do this for
   several implementation reasons, most specifically because HAMMER is
   designed for a hard disk and seek locality of reference is important,
   but also so information can be relocated in 8MB chunks to be able to
   add and remove physical storage.  If I were reimplementing HAMMER as
   a flash filesystem (which I am NOT doing), I would probably do away
   with the blockmap layer entirely since seek locality of reference is
   not needed for a flash filesystem, and the global B-Tree would serve
   directly as the named-block topology.  Kinda cool to think about.
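
A sketch of the coarse blockmap translation described above, assuming a flat list with one physical base address per 8MB chunk. This is an illustration of the idea, not HAMMER's on-disk format; the names are invented.

```python
BIG_BLOCK = 8 * 2**20      # 8MB translation unit, as in the text


def translate(blockmap, vaddr):
    """Map a virtual offset to a physical one through a per-8MB-chunk table."""
    index, offset = divmod(vaddr, BIG_BLOCK)
    return blockmap[index] + offset


blockmap = [5 * BIG_BLOCK, 0 * BIG_BLOCK, 9 * BIG_BLOCK]
vaddr = 1 * BIG_BLOCK + 4096
before = translate(blockmap, vaddr)     # 4096: chunk 1 lives at physical 0

# Relocating a chunk (e.g. to drain a disk) is a single table update;
# every virtual address that refers into the chunk stays valid.
blockmap[1] = 12 * BIG_BLOCK
after = translate(blockmap, vaddr)      # 12 * BIG_BLOCK + 4096
```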

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:54:48 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5BDBB1065702
	for <freebsd-arch@freebsd.org>; Mon, 31 Mar 2008 22:54:48 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id 43B748FC12
	for <freebsd-arch@freebsd.org>; Mon, 31 Mar 2008 22:54:48 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id C976C5B50;
	Mon, 31 Mar 2008 15:21:54 -0700 (PDT)
To: Matthew Dillon <dillon@apollo.backplane.com>
In-reply-to: Your message of "Mon, 31 Mar 2008 13:06:10 PDT."
	<200803312006.m2VK6Aom028133@apollo.backplane.com> 
Date: Mon, 31 Mar 2008 15:21:54 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20080331222154.C976C5B50@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org, Martin Fouts <mfouts@danger.com>
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:54:48 -0000

On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon <dillon@apollo.backplane.com>  wrote:
>     But how do you index that information?  You can't simply append the
>     information to the NAND unless you also have a way to access it.  So
>     does the filesystem have to scan the NAND (or significant portions of it)
>     in order to build an index of the filesystem topology in system memory?

One possible way:

I'd design the system so that each update ends with the write
of a root block[1]. I'd also write root blocks at fixed
locations to find them easily without having to scan the
whole disk. Given this, on reboot use binary search to locate
the latest root block at a fixed location. There may be
further updates so scan forward until you locate the most
up-to-date root block and once you have that, you are home
free!  Everything before that root block will be consistent
with it.

Even if the system crashes in the middle of a compacting GC,
the design should be able to recover all data.

What I am not sure about is whether one can do incremental
GC. A stop-and-copy GC is always possible but I don't like
the idea of long pauses.

[1]
The root block contains block # of the earliest valid block,
a sequence number (that will not roll over in device's
lifetime), block #s for various structures such as the root
of inodes, superblock, freelist if any, etc.
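
A minimal sketch of the boot-time search outlined above, under simplifying assumptions that are not in the message: the log has not wrapped around, unwritten blocks read back as None, and every block is a (kind, sequence) tuple. A real design would also have to handle wraparound and the "earliest valid block" field from the footnote.

```python
def find_latest_root(blocks, stride):
    """Return the index of the newest root block, or None on a blank device."""
    slots = range(0, len(blocks), stride)    # fixed root-block locations
    lo, hi = 0, len(slots)                   # slots[:lo] are written, slots[hi:] are blank
    while lo < hi:
        mid = (lo + hi) // 2
        if blocks[slots[mid]] is not None:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return None
    latest = slots[lo - 1]
    # Later updates also end in a root block; scan forward to pick up the newest.
    for i in range(latest + 1, min(latest + stride, len(blocks))):
        if blocks[i] is None:
            break
        if blocks[i][0] == "root" and blocks[i][1] > blocks[latest][1]:
            latest = i
    return latest


device = [("root", 1), ("data", 0), ("root", 2), ("data", 0),
          ("root", 3), ("data", 0), ("root", 4), None,
          None, None, None, None]
newest = find_latest_root(device, stride=4)   # 6
```

The binary search touches only the fixed slots (0, 4, 8 here); the forward scan is bounded by the spacing between them.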

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 23:06:29 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 892F4106566C
	for <arch@freebsd.org>; Mon, 31 Mar 2008 23:06:29 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 6D0C48FC25
	for <arch@freebsd.org>; Mon, 31 Mar 2008 23:06:29 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VN6Smg029759;
	Mon, 31 Mar 2008 16:06:28 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VN6SRa029758;
	Mon, 31 Mar 2008 16:06:28 -0700 (PDT)
Date: Mon, 31 Mar 2008 16:06:28 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803312306.m2VN6SRa029758@apollo.backplane.com>
To: Bakul Shah <bakul@bitblocks.com>
References: <20080331223846.CFD975BAE@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	qpadla@gmail.com, arch@freebsd.org,
	Poul-Henning Kamp <phk@phk.freebsd.dk>, freebsd-arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 23:06:29 -0000

:[Poul, use positive encouragement and you'd inspire a lot more
:people!]
:
:Note that in effect this is exactly what zfs does. Update of
:any block implies finding a new place for the updated copy,
:which means the block pointing to it must be also updated,
:which means a new place for it etc. etc.
:
:But hey, I spent just a few minutes sketching out the idea so
:it is possible I missed a whole bunch of things!  If I was
:actually implementing this (which I am tempted to...) I'd
:certainly want to know what others did.
:
:One thing I forgot to add: I'd let the lower level handle bad
:block forwarding and wear levelling (like on the m-tron
:device).

    This is my understanding of what ZFS does too, and I considered it
    when I was designing HAMMER.  I ultimately decided not to go that
    route because I was worried it would destroy seek-locality-of-reference
    on-disk (i.e. read/access performance).  Seek locality of reference
    is of course very important for a disk-based filesystem but not so
    important for a flash-based filesystem.

    The one hard part I have left to do in HAMMER is the UNDO meta-data log.
    Or, more precisely, the recover-on-mount code for the UNDO meta-data log.
    Everything else is done and working.  I knew it would be the hardest part
    of the filesystem when I ultimately decided not to go ZFS's route.

    The UNDO log is basically one seek-write per fsync or whenever the
    filesystem is flushed (every 30 seconds on BSDs)... not too bad,
    particularly because it stores only meta-data changes and not
    data-changes.  Ultimately I think I can make it worthwhile by including
    data elements for small seek/write/fsync sequences in the UNDO record
    and just syncing it, which would be awesome for database applications.
    I have no immediate plans to do that right now, though.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 23:18:30 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 127CF106566C;
	Mon, 31 Mar 2008 23:18:30 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id C4A498FC27;
	Mon, 31 Mar 2008 23:18:29 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id 26D865B50;
	Mon, 31 Mar 2008 16:18:29 -0700 (PDT)
To: Alfred Perlstein <alfred@freebsd.org>
In-reply-to: Your message of "Mon, 31 Mar 2008 15:29:38 PDT."
	<20080331222938.GS95731@elvis.mu.org> 
Date: Mon, 31 Mar 2008 16:18:29 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20080331231829.26D865B50@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org,
	Poul-Henning Kamp <phk@phk.freebsd.dk>, qpadla@gmail.com,
	Martin Fouts <mfouts@danger.com>
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 23:18:30 -0000

On Mon, 31 Mar 2008 15:29:38 PDT Alfred Perlstein <alfred@freebsd.org>  wrote:
> * Poul-Henning Kamp <phk@phk.freebsd.dk> [080331 15:24] wrote:
> > In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes:
> > >On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon <dillon@apollo.backplane.com>  wrote:
> > >>     But how do you index that information?  You can't simply append the
> > >>     information to the NAND unless you also have a way to access it.  So
> > >>     does the filesystem have to scan the NAND (or significant portions of it)
> > >>     in order to build an index of the filesystem topology in system memory?
> > >
> > >One possible way:
> > >
> > >I'd design the system so that each update ends with the write
> > >of a root block[1]. 
> > 
> > This is sort of the approach Margo Seltzer used for her (Kludge-)LFS
> > it has many drawbacks, in particular when it comes to recovery.
> 
> Can you explain why?
> 
> I could see it being a problem because recovering the filesystem's
> most recent change might require significant scanning?

Let us take the mtron MSD-SATA3025-032 device for example. It
has a capacity of 32GB + can do 16,000 random & 78,000
sequential reads per second (of 512 byte blocks). If you
write a root block every megabyte, you have 2^15 potential
root blocks.  Locating the latest one will require 16 random
reads + a scan of at most 1MB; which translates to about
26ms.  Not too bad since this cost is incurred only on
the first mount or reboot.
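
The estimate above can be worked through as follows. This is a back-of-the-envelope calculation using only the figures quoted in the message, with 32GB taken as 32 * 2^30 bytes (an assumption).

```python
capacity_bytes = 32 * 2**30
root_spacing_bytes = 2**20            # one fixed root block per megabyte
sector_bytes = 512
random_reads_per_s = 16_000
sequential_reads_per_s = 78_000

root_slots = capacity_bytes // root_spacing_bytes          # 32768 == 2**15
probes = root_slots.bit_length()                           # 16 binary-search probes
probe_ms = probes / random_reads_per_s * 1000              # 1.0 ms
scan_sectors = root_spacing_bytes // sector_bytes          # 2048 sectors in 1MB
scan_ms = scan_sectors / sequential_reads_per_s * 1000     # about 26.3 ms
total_ms = probe_ms + scan_ms                              # about 27.3 ms
```

The forward scan dominates; the binary search itself contributes roughly one millisecond.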

From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 23:55:22 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id AE2441065672
	for <arch@freebsd.org>; Mon, 31 Mar 2008 23:55:22 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 831758FC15
	for <arch@freebsd.org>; Mon, 31 Mar 2008 23:55:22 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VNt7Jf030309;
	Mon, 31 Mar 2008 16:55:07 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VNt7ZY030308;
	Mon, 31 Mar 2008 16:55:07 -0700 (PDT)
Date: Mon, 31 Mar 2008 16:55:07 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803312355.m2VNt7ZY030308@apollo.backplane.com>
To: Bakul Shah <bakul@bitblocks.com>
References: <20080331231829.26D865B50@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	Alfred Perlstein <alfred@freebsd.org>, qpadla@gmail.com,
	arch@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 23:55:22 -0000


:Let us take the mtron MSD-SATA3025-032 device for example. It
:has a capacity of 32GB + can do 16,000 random & 78,000
:sequential reads per second (of 512 byte blocks). If you
:write a root block every megabyte, you have 2^15 potential
:root blocks.  Locating the latest one will require 16 random
:reads + a scan of at most 1MB; which translates to about
:26ms.  Not too bad since this cost is incurred only on
:the first mount or reboot.

    Yah, I think for NAND filesystems crash recovery is actually the easiest
    issue to deal with since all you need to do, pretty much, is rewind
    your append pointer a bit.  Not only can you rewind the filesystem
    to an older state, but you can also reconstruct a great deal of the
    unsynced in-memory data by doing a limited reverse scan.  It just does
    not take very long to do that and it can be done automatically by the
    filesystem at mount time.

    For example, if you write some file data and the new file block is
    flushed to flash but the related meta-data changes for the pointers to
    the now relocated data block have not yet been flushed to flash, on
    crash recovery it is possible to note this condition (the old physical
    block number would be stored in the aux data area of the new one) and
    regenerate the missing meta-data changes instead of being forced to
    back-out the write.
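
A sketch of that roll-forward, with an invented page format: each data page is assumed to carry an 'owner' (file id and block offset) and an 'old_phys' field in its aux area, and a 'meta_flush' page marks the point up to which the on-flash block map is current. None of these names come from the message.

```python
def recover_block_map(pages, block_map):
    """Regenerate pointer updates that never reached flash.

    pages:     append-ordered list of dicts describing flash pages
    block_map: owner -> physical page, as of the last metadata flush
    """
    pending = []
    for phys in range(len(pages) - 1, -1, -1):     # limited reverse scan
        page = pages[phys]
        if page["kind"] == "meta_flush":           # everything older is already mapped
            break
        if page["kind"] == "data":
            pending.append((phys, page))
    recovered = dict(block_map)
    for phys, page in reversed(pending):           # replay oldest first
        # Only accept a page that replaces what the map currently points at.
        if recovered.get(page["owner"]) == page["old_phys"]:
            recovered[page["owner"]] = phys
    return recovered


pages = [
    {"kind": "data", "owner": ("f", 0), "old_phys": None},   # page 0
    {"kind": "meta_flush"},                                   # page 1
    {"kind": "data", "owner": ("f", 0), "old_phys": 0},      # page 2, rewrites page 0
    {"kind": "data", "owner": ("f", 0), "old_phys": 2},      # page 3, rewrites page 2
    {"kind": "data", "owner": ("f", 1), "old_phys": None},   # page 4, new block
]
recovered = recover_block_map(pages, {("f", 0): 0})
# {("f", 0): 3, ("f", 1): 4}
```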

    Being able to do this has fairly substantial consequences because it
    means that fsync() only needs to flush some of the modified pages, not
    all of them, and that the remaining modified pages could in fact
    remain unflushed in the system cache and STILL be recoverable after
    a crash because their modification was merely a side effect of the
    operation that *was* flushed to flash.  Once you are able to do that,
    you can also simply decide not to synchronously flush that meta data
    at all and thus allow multiple changes to accumulate in system memory
    before flushing to flash.

    There is one issue here and that is the transactional nature of most
    filesystem operations.  For example, if you append to a file with
    a write() you are not only writing new data to the backing store, you
    are also updating the on-disk inode's st_size field.  Those are two
    widely separated pieces of information which must be transactionally
    bound --- all or nothing.  In this regard the crash recovery code needs
    to understand that they are bound and either recover the whole
    transaction or none of it. 

    Once you get to that point you also have to worry about
    interdependencies between transactions... a really sticky issue
    that is the reason for a huge chunk of softupdate's complexity in
    UFS.  Basically you can wind up with interdependent transactions
    which must be properly recovered.  An example would be doing a write()
    and then doing another write().  The second write() cannot be recovered
    unless the first can also be recovered.  Separate transactions, but with
    a dependency.
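
The dependency rule can be sketched as below, assuming the recovery code already knows, for each transaction in issue order, whether all of its pieces reached flash and which earlier transactions it depends on. The tuple layout is hypothetical.

```python
def recoverable(log):
    """log: list of (txid, complete_on_flash, depends_on) in issue order."""
    good = set()
    for txid, complete, depends_on in log:
        # All-or-nothing per transaction, and only if every dependency survived too.
        if complete and all(dep in good for dep in depends_on):
            good.add(txid)
    return good


log = [
    ("w1", True, []),
    ("w2", False, ["w1"]),    # torn write: data landed, st_size update did not
    ("w3", True, ["w2"]),     # complete, but depends on the torn write
    ("w4", True, ["w1"]),
]
survivors = recoverable(log)   # {"w1", "w4"}
```

Note that w3 is discarded even though it is complete on flash, which is the chained case the paragraph above warns about.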

    Such interdependencies can become arbitrarily complex the longer
    you leave meta-data changes unflushed.  The question ultimately becomes
    whether the recovery code can deal with the complexity or not.  If not
    you may be forced to flush rather than create the interdependency.

    HAMMER has precisely this issue with its UNDO sequencing.  The complexity
    is in the algorithm, not in the (fairly small) amount of time it takes
    to actually perform the recovery operation.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 01:03:34 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.ORG
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 29EAF106564A
	for <arch@FreeBSD.ORG>; Tue,  1 Apr 2008 01:03:34 +0000 (UTC)
	(envelope-from das@FreeBSD.ORG)
Received: from zim.MIT.EDU (ZIM.MIT.EDU [18.95.3.101])
	by mx1.freebsd.org (Postfix) with ESMTP id DAF938FC12
	for <arch@FreeBSD.ORG>; Tue,  1 Apr 2008 01:03:33 +0000 (UTC)
	(envelope-from das@FreeBSD.ORG)
Received: from zim.MIT.EDU (localhost [127.0.0.1])
	by zim.MIT.EDU (8.14.2/8.14.2) with ESMTP id m31159Zv007875;
	Mon, 31 Mar 2008 21:05:09 -0400 (EDT) (envelope-from das@FreeBSD.ORG)
Received: (from das@localhost)
	by zim.MIT.EDU (8.14.2/8.14.2/Submit) id m31158QH007874;
	Mon, 31 Mar 2008 21:05:08 -0400 (EDT) (envelope-from das@FreeBSD.ORG)
Date: Mon, 31 Mar 2008 21:05:08 -0400
From: David Schultz <das@FreeBSD.ORG>
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Message-ID: <20080401010508.GA7708@zim.MIT.EDU>
Mail-Followup-To: Poul-Henning Kamp <phk@phk.freebsd.dk>,
	Bakul Shah <bakul@bitblocks.com>,
	Christopher Arnold <chris@arnold.se>,
	Martin Fouts <mfouts@danger.com>, arch@FreeBSD.ORG,
	qpadla@gmail.com, freebsd-arch@FreeBSD.ORG
References: <20080331222154.C976C5B50@mail.bitblocks.com>
	<26080.1207002217@critter.freebsd.dk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <26080.1207002217@critter.freebsd.dk>
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	arch@FreeBSD.ORG, qpadla@gmail.com, freebsd-arch@FreeBSD.ORG
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 01:03:34 -0000

On Mon, Mar 31, 2008, Poul-Henning Kamp wrote:
> In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes:
> >On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon <dillon@apollo.backplane.com>  wrote:
> >>     But how do you index that information?  You can't simply append the
> >>     information to the NAND unless you also have a way to access it.  So
> >>     does the filesystem have to scan the NAND (or significant portions of it)
> >>     in order to build an index of the filesystem topology in system memory?
> >
> >One possible way:
> >
> >I'd design the system so that each update ends with the write
> >of a root block[1]. 

This is exactly what ZFS does (except that it wasn't designed for
flash, so the primary copy of the root block is always stored at a
well-known location.) Countless other systems dating back to the
use of shadow paging in System R use the same technique, including
WAFL and several flash file systems.

> This is sort of the approach Margo Seltzer used for her (Kludge-)LFS
> it has many drawbacks, in particular when it comes to recovery.

Generally not. Recovery is trivial, especially compared to other
techniques such as journalling. You simply find the root block,
and it has pointers to a consistent snapshot of the system.  The
main limitation is that making updates durable immediately (i.e.,
fsync()) is inefficient, since all the dirty indirect blocks up to
the root need to be flushed to disk. ZFS addresses this by writing
updates that must be synchronous to a logical redo log, which does
introduce complications for recovery.
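
A toy illustration of why a durable update is expensive under this scheme. It assumes a block store modelled as a dict and interior blocks modelled as dicts of child pointers; this is a sketch of shadow paging in general, not of ZFS or WAFL internals.

```python
def cow_update(store, root, path, value):
    """Write `value` at `path` below `root` without overwriting any block.

    store: block number -> block; interior blocks map child name -> block number.
    Returns the block number of the new root.
    """
    def alloc(block):
        blkno = len(store)           # append-only allocation
        store[blkno] = block
        return blkno

    if not path:
        return alloc(value)          # new copy of the leaf
    node = dict(store[root])         # shadow copy of the interior block
    node[path[0]] = cow_update(store, node[path[0]], path[1:], value)
    return alloc(node)


store = {0: "old data", 1: "other", 2: {"a": 0, "b": 1}, 3: {"dir": 2}}
old_root = 3
new_root = cow_update(store, old_root, ["dir", "a"], "new data")   # 6
blocks_written = len(store) - 4                                      # 3
```

One leaf change costs three block writes here (leaf, directory, root), and the old root at block 3 still describes a consistent snapshot, which is what makes recovery a matter of finding the right root.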

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 01:13:07 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8CE9B1065672;
	Tue,  1 Apr 2008 01:13:07 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id 22F4E8FC12;
	Tue,  1 Apr 2008 01:13:07 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id 2A4875B41;
	Mon, 31 Mar 2008 18:13:06 -0700 (PDT)
To: Matthew Dillon <dillon@apollo.backplane.com>
In-reply-to: Your message of "Mon, 31 Mar 2008 16:55:07 PDT."
	<200803312355.m2VNt7ZY030308@apollo.backplane.com> 
Date: Mon, 31 Mar 2008 18:13:05 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20080401011306.2A4875B41@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	Alfred Perlstein <alfred@freebsd.org>, qpadla@gmail.com,
	arch@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 01:13:07 -0000

On Mon, 31 Mar 2008 16:55:07 PDT Matthew Dillon <dillon@apollo.backplane.com>  wrote:
>     There is one issue here and that is the transactional nature of most
>     filesystem operations.  For example, if you append to a file with
>     a write() you are not only writing new data to the backing store, you
>     are also updating the on-disk inode's st_size field.  Those are two
>     widely separated pieces of information which must be transactionally
>     bound --- all or nothing.  In this regard the crash recovery code needs
>     to understand that they are bound and either recover the whole
>     transaction or none of it. 
> 
>     Once you get to that point you also have to worry about
>     interdependencies between transactions... a really sticky issue
>     that is the reason for a huge chunk of softupdates' complexity in
>     UFS.  Basically you can wind up with interdependent transactions
>     which must be properly recovered.  An example would be doing a write()
>     and then doing another write().  The second write() cannot be recovered
>     unless the first can also be recovered.  Separate transactions, but with
>     a dependency.

My instinct is to not combine transactions.  That is, every
data write results in a sequence: {data, [indirect blocks],
inode, ..., root block}.  Until the root block is written to
the disk this is not a "committed" transaction and can be
thrown away.  In a Log-FS we always append on write; we never
overwrite any data/metadata so this is easy and the FS state
remains consistent.  FFS overwrites blocks so all this gets
far more complicated.  Sort of like the difference between
reasoning about functional programs & imperative programs!

Now, it may be possible to define certain rules that allow
one to combine transactions.  For instance,

    write1(block n), write2(block n) == write2(block n)
    write(block n of file f1), delete file f1 == delete file f1

etc. That is, as long as write1 & associated metadata writes
are not flushed to the disk, and a later write (write2) comes
along, the earlier write (write1) can be thrown away.  [But I
have no idea if this is worth doing or even doable!]
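
A rough sketch of the two rewrite rules above, in Python.  It assumes the
pending operations are held in memory, oldest first, and that none of
them has been flushed yet; the tuple encoding and the name coalesce are
made up for illustration, and whether this is worth doing in a real file
system is exactly the open question.

```python
def coalesce(ops):
    """Drop pending operations made redundant by later ones.

    ops is a list, oldest first, of ("write", file, block) or
    ("delete", file) tuples that have not reached the disk yet.
    """
    kept = []
    deleted = set()    # files removed by a later operation
    rewritten = set()  # (file, block) pairs overwritten by a later write
    for op in reversed(ops):  # walk newest to oldest
        if op[0] == "delete":
            deleted.add(op[1])
            kept.append(op)
            continue
        _, name, block = op
        if name in deleted or (name, block) in rewritten:
            continue  # a later operation makes this write unobservable
        rewritten.add((name, block))
        kept.append(op)
    kept.reverse()  # restore oldest-first order
    return kept
```

For example, two writes to block 7 of f1 followed by a write to and then
a delete of f2 reduce to one write to f1 and the delete of f2.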

This is reminiscent of the bottom up rewrite system (BURS)
used in some code generators (such as lcc's). The idea is the
same here: replace a sequence of operations with an
equivalent but lower cost sequence.

>     Such interdependencies can become arbitrarily complex the longer
>     you leave meta-data changes unflushed.  The question ultimately becomes
>     whether the recovery code can deal with the complexity or not.  If not
>     you may be forced to flush rather than create the interdependency.
>
>     HAMMER has precisely this issue with its UNDO sequencing.  The complexity
>     is in the algorithm, but not the (fairly small) amount of time it takes
>     to actually perform the recovery operation.

I don't understand the complexity. Basically your log should
allow you to define a functional programming abstraction --
where you never overwrite any data/metadata for any active
transactions and so reasoning becomes easier. [But maybe we
should take any HAMMER discussion offline.]

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 04:53:48 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E4DB7106564A;
	Tue,  1 Apr 2008 04:53:48 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id C97878FC24;
	Tue,  1 Apr 2008 04:53:48 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 5F47F41310C;
	Mon, 31 Mar 2008 21:53:39 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 31 Mar 2008 21:53:47 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
In-Reply-To: <200803312254.m2VMsPqZ029549@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciTgjIRUdfyD0S9R1C3o5kg62MZ6QAMdO2Q
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 04:53:49 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Monday, March 31, 2008 3:54 PM
> To: Martin Fouts
> Cc: qpadla@gmail.com; freebsd-arch@freebsd.org; Christopher
> Arnold; arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics
>
> If I were reimplementing HAMMER as a flash filesystem
> (which I am NOT doing), I would probably do away
> with the blockmap layer entirely since seek locality of
> reference is not needed for a flash filesystem, and the global
> B-Tree would serve directly as the named-block topology.

Which would lead you almost directly to the sort of performance problems
that jffs2 has.

Until you've done it, you'll be surprised at the cost of maintaining
b-trees in NAND.


From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 04:55:21 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8E7AF10656CD;
	Tue,  1 Apr 2008 04:55:21 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 7617F8FC30;
	Tue,  1 Apr 2008 04:55:21 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 6F771409D23;
	Mon, 31 Mar 2008 21:55:07 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 31 Mar 2008 21:55:15 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D0E@EXCHANGE.danger.com>
In-Reply-To: <20080331223846.CFD975BAE@mail.bitblocks.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics 
Thread-Index: AciTf/3rlDZiIGCpSl2L4XsPLGVpdwANHLxA
References: Your message of "Mon,
	31 Mar 2008 22:23:37 -0000." <26080.1207002217@critter.freebsd.dk>
	<20080331223846.CFD975BAE@mail.bitblocks.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Bakul Shah" <bakul@bitblocks.com>,
	"Poul-Henning Kamp" <phk@phk.freebsd.dk>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 04:55:21 -0000

> -----Original Message-----
> From: Bakul Shah [mailto:bakul@bitblocks.com]
> Sent: Monday, March 31, 2008 3:39 PM
> To: Poul-Henning Kamp
> Cc: Matthew Dillon; Christopher Arnold; arch@freebsd.org;
> qpadla@gmail.com; freebsd-arch@freebsd.org; Martin Fouts
> Subject: Re: Flash disks and FFS layout heuristics

> One thing I forgot to add: I'd let the lower level handle bad
> block forwarding and wear levelling (like on the m-tron device).
>

One of the difficulties of doing things this way comes from the
complexity of dealing with garbage collection when you want to reuse an
erase unit.
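
To make the cost concrete, here is a toy model in Python.  It assumes an
erase unit is a list of pages in which None marks a stale page, and it
uses a greedy "fewest live pages" victim policy; pick_victim and reclaim
are hypothetical names, and neither the model nor the policy comes from
this thread or from any particular device.

```python
def pick_victim(units):
    """Index of the erase unit with the fewest live pages (greedy choice)."""
    return min(range(len(units)),
               key=lambda i: sum(page is not None for page in units[i]))


def reclaim(unit, spare):
    """Copy live pages from unit into spare, then erase unit.

    Returns the number of pages that had to be copied, i.e. the extra
    writes paid before the erase unit can be reused.
    """
    live = [page for page in unit if page is not None]
    spare.extend(live)               # relocation writes
    unit[:] = [None] * len(unit)     # the erase itself
    return len(live)
```

Every live page in the victim costs one extra write, and whatever layer
owns the mapping must learn the new location of each relocated page.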


From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 05:30:00 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 80DA91065674;
	Tue,  1 Apr 2008 05:30:00 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 669398FC2D;
	Tue,  1 Apr 2008 05:30:00 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id F11E7414D25;
	Mon, 31 Mar 2008 22:27:41 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 31 Mar 2008 22:27:50 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
In-Reply-To: <200803312219.m2VMJlkT029240@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciTfnsz5JkHfGFbRHmgUX7Qxg+vbgANl4/w
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 05:30:00 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Monday, March 31, 2008 3:20 PM
>
> For flash storage systems competitive with hard drive storage,

In embedded systems, it's RAM that flash storage competes with, not hard
drive storage.

SSD is a completely different engineering problem.

> For the phone market?  You mean small flash storage
> devices?  Performance is almost irrelevant there

Actually, we're very performance sensitive in this area, and getting
more so as audio and video demands grow.

> Three in five years?  Is that an illustration of my point
> with regards to flash filesystem design?  Ok, that was a joke :-)
>

It's illustrative of my changing career. Three different file systems
for three different products. ;)

> But I don't think we can count small flash storage systems.  Both
> models devolve into trivialities when you are managing small amounts
> of flash storage.

I don't know who your "we" is, but *my* "we" counts small flash storage
systems as rather critical.

And the 'trivialities' aren't so trivial when you have to maintain
reliability in the face of easily removable batteries.

> Again, I am not familiar with jffs2 but you are painting
> with a very broad brush that is more than likely an issue specifically
> with the jffs2 design and not the concept of using named blocks in
> general.

That's the assumption that led from jffs1 to jffs2. It's an incorrect
assumption.

> What you are advocating is a filesystem which uses an
> absolute sector referencing scheme.

I haven't actually advocated anything. Merely pointed out problems.  But
no, the scheme that we're currently using doesn't use the sort of
absolute sector referencing scheme you're suggesting below.

> Any change made to the filesystem requires a new
> page to essentially be appended to the flash storage.  In order to
> properly index the information and maintain the
> filesystem topology you also have to recopy *ALL* pages containing
> references to the updated absolute sector in order to repoint them
> to the new absolute sector.

Sorry, no.  Doesn't work like that at all. This is, after all, computer
science, and indirection is your friend.
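
As a generic illustration of that point (not the scheme referred to
above, which is not described in this message), the Python sketch below
assumes pages refer to each other by logical number and that a single
table maps logical numbers to physical locations.  PageMap and its
methods are hypothetical names, and how the table itself is made
persistent is deliberately left out.

```python
class PageMap:
    """Append-only page store with one level of indirection."""

    def __init__(self):
        self.flash = []   # physical pages, only ever appended to
        self.table = {}   # logical page number -> index into self.flash

    def write(self, logical, data):
        # An update appends a new physical page and repoints one table
        # entry; pages that mention `logical` are left untouched.
        self.flash.append(data)
        self.table[logical] = len(self.flash) - 1

    def read(self, logical):
        return self.flash[self.table[logical]]
```

Rewriting logical page 1 after a directory page has recorded "page 1"
appends one physical page and changes one table entry; the directory
page is not recopied.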

> I really understand that model, and it has the advantage

I'm sure you do. It's not the one we're using though.

> I really do understand where you are coming from, the
> simplicity of chaining the physical topology cannot be denied,
> and I like the elegance, but I hope I've
> shown that it is not actually simplifying the overall design much
> over a named block scheme, and that there are some fairly severe
> issues that can crop up that are complete non-issues when
> using a named block scheme.

All you've really shown is that the difference between theory and
practice, as usual, remains larger in practice than in theory.

You have made it painfully clear that you are immersed in large scale
file systems, an area I left behind a decade ago when I abandoned my
work on CUE at HP Labs. It is a fascinating and difficult area, and I
heartily approve of experimentation in it. It also has almost no
engineering tradeoffs in common with persistent storage for battery
powered devices.

In summary, then: NAND devices are critical to CE products, especially
so-called convergent devices, in which there is no hard disk and
persistent storage takes the form of an embedded NAND device and zero or
more removable NAND devices.  Power issues are critical and performance
is becoming more so as the devices become more complex. Reliability of
the file systems on these devices is also critical.  The usual technique
for optimizing disk performance (throw more RAM at it in order to
cache) is unavailable, the usual hardware reasons for optimization (seek
and rotational latency) are not present, and the peculiarities of NAND,
most notably the size of the erase unit compared to the size of the
write unit, the existence of the spare area, and the much higher bit
error rates than either disk or RAM experience, coupled with those
requirements, lead to a need for NAND-specific file systems on such
devices.

Experience has shown that brute force approaches based on flash
translation layers work, but are inefficient and overly complex.
Attempts to use generalized NOR file systems in NAND tend to have
significant performance problems because of the cost of maintaining the
embedded data structures, such as b-trees, that replaced the more
straightforward data structures of earlier, more linear file system
designs.

Experience has also shown that the file system needs to expose
transaction semantics to the application, and that leaving bad block
handling to a translation layer (even a block naming scheme) leads to
performance problems consequent to garbage collection, which is
inevitable in devices that have such large erase units.


From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 05:30:00 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 80DA91065674;
	Tue,  1 Apr 2008 05:30:00 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 669398FC2D;
	Tue,  1 Apr 2008 05:30:00 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id F11E7414D25;
	Mon, 31 Mar 2008 22:27:41 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 31 Mar 2008 22:27:50 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
In-Reply-To: <200803312219.m2VMJlkT029240@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciTfnsz5JkHfGFbRHmgUX7Qxg+vbgANl4/w
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 05:30:00 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Monday, March 31, 2008 3:20 PM
>
> For flash storage systems competitive with hard drive storage,

In embedded systems, it's RAM that flash storage competes with, not hard
drive storage.

SSD is a completely different engineering problem.

> For the phone market?  You mean small flash storage
> devices?  Performance is almost irrelevant there

Actually, we're very performance sensitive in this area, and getting
more so as audio and video demands grow.

> Three in five years?  Is that an illustration of my point
> with regards to flash filesystem design?  Ok, that was a joke :-)
>

It's illustrative of my changing career. Three different file systems
for three different products. ;)

> But I don't think we can count small flash storage systems.  Both
> models devolve into trivialities when you are managing small amounts
> of flash storage.

I don't know who your "we" is, but *my* "we" counts small flash storage
systems as rather critical.

And the 'trivialities' aren't so trivial when you have to maintain
reliability in the face of easily removable batteries.

> Again, I am not familiar with jffs2 but you are painting=20
> a very broad brush that is more then likely an issue specifically
> with the jffs2 design and not the concept of using named blocks in
> general.

That's the assumption that led from jffs1 to jffs2. It's an incorrect
assumption.

> What you are advocating is a filesystem which uses an=20
> absolute sector referencing scheme.

I haven't actually advocated anything. Merely pointed out problems.  But
no, the scheme that we're currently using doesn't use the sort of
absolute sector referencing scheme you're suggesting below.

> Any change made to the filesystem requires a new
> page to essentially be appended to the flash storage.  In order to
> properly index the information and maintain the=20
> filesystem topology  you also have to recopy *ALL* pages containing=20
> references to the updated absolute sector in order to repoint them=20
> to the new absolute sector.

Sorry, no.  Doesn't work like that at all. This is, after all, computer
science, and indirection is your friend.
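
To make the indirection point concrete, here is a minimal sketch of a
generic logical-to-physical page map. It is not the file system being
discussed in this thread (which is not described here); the class name
IndirectStore, its methods, and the ids used are all hypothetical. The
assumption is simply that referrers hold logical ids rather than
physical locations, so an update appends one page and changes one map
entry.

```python
# Illustrative only: a generic logical-to-physical page map, not the
# file system discussed in this thread. All names are hypothetical.

class IndirectStore:
    """Append-only page store with one level of indirection."""

    def __init__(self):
        self.pages = []   # physical pages, written once, in append order
        self.where = {}   # logical id -> index of the current physical page

    def write(self, logical_id, data):
        """Append a new physical page and repoint the logical id at it."""
        self.pages.append(data)
        self.where[logical_id] = len(self.pages) - 1

    def read(self, logical_id):
        """Resolve the logical id through the map and return the page."""
        return self.pages[self.where[logical_id]]


store = IndirectStore()
store.write("file-data", b"version 1")
# The directory entry refers to the data by logical id, not physical index.
store.write("dir-entry", "file-data")

# Updating the data costs one append and one map update.
store.write("file-data", b"version 2")

# The referrer was never rewritten, yet it resolves to the new contents.
print(store.read(store.read("dir-entry")))
```

The map itself still has to be stored and recovered after power loss,
and that is where real designs differ; this sketch keeps it in memory
only.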

> I really understand that model, and it has the advantage=20

I'm sure you do. It's not the one we're using though.

> I really do understand where you are coming from, the=20
> simplicity of chaining the physical topology cannot be denied,
> and I like the elegance, but I hope I've
> shown that it is not actually simplifying the overall design much
> over a named block scheme, and that there are some fairly severe
> issues that can crop up that are complete non-issues when=20
> using a named block scheme.

All you've really shown is that the difference between theory and
practice, as usual, remains larger in practice than in theory.

You have made it painfully clear that you are immersed in large scale
file systems, an area I left behind a decade ago when I abandoned my
work on CUE at HP Labs. It is a fascinating and difficult area, and I
heartily approve of experimentation in it. It also has almost no
engineering tradeoffs in common with persistent storage for battery
powered devices.

In summary, then: NAND devices are critical to CE products, especially
so-called convergent devices, in which there is no hard disk and
persistent storage takes the form of an embedded NAND device and zero or
more removable NAND devices.  Power issues are critical, and performance
is becoming more so as the devices become more complex. Reliability of
the file systems on these devices is also critical.  The usual technique
for optimizing disk performance (throw more RAM at it in order to
cache) is unavailable, and the usual hardware reasons for optimization
(seek and rotational latency) are not present.  The peculiarities of
NAND, most notably the size of the erase unit compared to the size of
the write unit, the existence of the spare area, and bit error rates
much higher than either disk or RAM experience, coupled with those
requirements, lead to a need for NAND-specific file systems on such
devices.
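
The erase-unit versus write-unit mismatch can be illustrated with a toy
model of a single erase block. This is a sketch under stated
assumptions: the figure of 64 pages per block is illustrative rather
than taken from any particular part, and EraseBlock and
rewrite_in_place are hypothetical names. It models only the constraint
that a programmed page cannot be reprogrammed until its whole block is
erased; the spare area and bit errors are left out.

```python
# Toy model of one NAND erase block. Sizes are illustrative assumptions.

PAGES_PER_BLOCK = 64   # assumed: write unit is a page, erase unit is a block


class EraseBlock:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK   # None means erased
        self.erase_count = 0

    def program(self, index, data):
        """Program one page; it cannot be reprogrammed until an erase."""
        if self.pages[index] is not None:
            raise ValueError("page already programmed; erase the block first")
        self.pages[index] = data

    def erase(self):
        """Erase works on the whole block, never on a single page."""
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1


def rewrite_in_place(block, index, data):
    """Naive in-place update: save, erase, reprogram every live page.

    Returns the number of pages that had to be programmed.
    """
    saved = list(block.pages)   # held in RAM only while the block is erased
    saved[index] = data
    block.erase()
    programmed = 0
    for i, page in enumerate(saved):
        if page is not None:
            block.program(i, page)
            programmed += 1
    return programmed


blk = EraseBlock()
for i in range(PAGES_PER_BLOCK):
    blk.program(i, b"old")
print(rewrite_in_place(blk, 3, b"new"), blk.erase_count)   # prints "64 1"
```

In this model, changing one page of a full block costs one erase and 64
page programs, and the saved copy exists only in RAM between the erase
and the reprogramming, which is exactly the window a pulled battery
hits.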

Experience has shown that brute force approaches based on flash
translation layers work, but are inefficient and overly complex.
Attempts to use generalized NOR file systems in NAND tend to have
significant performance problems because of the cost of maintaining the
embedded data structures, such as b-trees, that replaced the more
straightforward data structures of earlier, more linear file system
designs.

Experience has also shown that the file system needs to expose
transaction semantics to the application, and that leaving bad block
handling to a translation layer (even a block naming scheme) leads to
performance problems consequent to garbage collection, which is
inevitable in devices that have such large erase units.
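
A rough way to see the garbage collection cost is the standard
simplified model in which every reclaimed erase block still holds some
live pages that must be copied out before the erase. The function names
and the default of 64 pages per block below are assumptions for
illustration, not measurements from any device or file system mentioned
in this thread.

```python
def reclaim_cost(live_pages, pages_per_block=64):
    """Return (pages_copied, pages_freed) for reclaiming one erase block."""
    if not 0 <= live_pages <= pages_per_block:
        raise ValueError("live_pages must be between 0 and pages_per_block")
    return live_pages, pages_per_block - live_pages


def write_amplification(live_pages, pages_per_block=64):
    """Flash pages programmed per page of new data.

    Assumes every reclaimed block holds `live_pages` valid pages, so each
    erase yields (pages_per_block - live_pages) pages of usable space at
    the price of copying `live_pages` pages elsewhere first.
    """
    copied, freed = reclaim_cost(live_pages, pages_per_block)
    if freed == 0:
        raise ValueError("a fully live block reclaims no space")
    return (copied + freed) / freed


print(write_amplification(0))    # prints 1.0: empty blocks cost nothing extra
print(write_amplification(48))   # prints 4.0: three copies per new page
```

Under this model the cost is pages_per_block / pages_freed, so it grows
without bound as reclaimed blocks approach fully live.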


From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 07:45:44 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C45ED1065676
	for <arch@freebsd.org>; Tue,  1 Apr 2008 07:45:44 +0000 (UTC)
	(envelope-from ed@hoeg.nl)
Received: from palm.hoeg.nl (mx0.hoeg.nl [IPv6:2001:610:652::211])
	by mx1.freebsd.org (Postfix) with ESMTP id 883448FC21
	for <arch@freebsd.org>; Tue,  1 Apr 2008 07:45:44 +0000 (UTC)
	(envelope-from ed@hoeg.nl)
Received: by palm.hoeg.nl (Postfix, from userid 1000)
	id 8393E1CC30; Tue,  1 Apr 2008 09:45:43 +0200 (CEST)
Date: Tue, 1 Apr 2008 09:45:43 +0200
From: Ed Schouten <ed@80386.nl>
To: Martin Fouts <mfouts@danger.com>
Message-ID: <20080401074543.GK51074@hoeg.nl>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="np6E2rbShIadjoVu"
Content-Disposition: inline
In-Reply-To: <B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
User-Agent: Mutt/1.5.17 (2007-11-01)
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 07:45:44 -0000


--np6E2rbShIadjoVu
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

* Martin Fouts <mfouts@danger.com> wrote:
> The MTD based file system jffs2 is an example of the third, and a
> cautionary tale for those who would write their own.

I recall there is also a newer MTD-based file system called LogFS:

	http://logfs.org/

--=20
 Ed Schouten <ed@80386.nl>
 WWW: http://g-rave.nl/

--np6E2rbShIadjoVu
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (FreeBSD)

iEYEARECAAYFAkfx6CcACgkQ52SDGA2eCwUlrQCff2XreQyrIcjzv0F9852VCHkf
o1cAnjWRUYuPv2m1wjg2meh6lX16m1RN
=ovEd
-----END PGP SIGNATURE-----

--np6E2rbShIadjoVu--

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 07:56:15 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E0B751065671
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 07:56:15 +0000 (UTC)
	(envelope-from wangyi6854@gmail.com)
Received: from ti-out-0910.google.com (ti-out-0910.google.com [209.85.142.190])
	by mx1.freebsd.org (Postfix) with ESMTP id 704248FC1B
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 07:56:15 +0000 (UTC)
	(envelope-from wangyi6854@gmail.com)
Received: by ti-out-0910.google.com with SMTP id j2so619668tid.3
	for <freebsd-arch@freebsd.org>; Tue, 01 Apr 2008 00:56:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=beta;
	h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	bh=DJW7dMd74MIdDXgD33AMUdw6IjftB2ZRGg2PHvDAybU=;
	b=cesp+97OYGZm/XliwUrwCb+63W46AyJ7u5S3q/7BHClJYSv1RicWH5BNFtgrUwD0wbBClWFje/ElgyKB8IMRH7b+wiSrhxPLfH5X7Vw53xP4Mr7R+RUT6p97toVDMfDFP7r/0OeD56kYSa+iw0rrX4z4VFbFKBfYplse7hG50Ic=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta;
	h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	b=bNwpShVcS6Kzw8Ku2mqPB7kVBvAzYhTzILfCz12WbSqSFyf2DsjxYMaWHvLCWJJfM2aPAnjKNKSY6FTHgVAOdghyrAp2r8xbfylwuR8wv3KNaQRBXQ2e2bYH6w1xdwBY/JkDDfT3+rKpLZL6M2jBtSEoytrqai04O7cGisgVol4=
Received: by 10.110.31.11 with SMTP id e11mr3341169tie.56.1207034859245;
	Tue, 01 Apr 2008 00:27:39 -0700 (PDT)
Received: by 10.110.10.14 with HTTP; Tue, 1 Apr 2008 00:27:39 -0700 (PDT)
Message-ID: <5ea5cca50804010027k51b59658mb28a481c516e84b0@mail.gmail.com>
Date: Tue, 1 Apr 2008 15:27:39 +0800
From: "Yi Wang" <wangyi6854@gmail.com>
To: "Attilio Rao" <attilio@freebsd.org>
In-Reply-To: <3bbf2fe10802061700p253e68b8s704deb3e5e4ad086@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <3bbf2fe10802061700p253e68b8s704deb3e5e4ad086@mail.gmail.com>
Cc: Yar Tikhiy <yar@freebsd.org>, Doug Barton <dougb@freebsd.org>,
	Jeff Roberson <jeff@freebsd.org>, freebsd-fs@freebsd.org,
	Scot Hetzel <swhetzel@gmail.com>, freebsd-arch@freebsd.org
Subject: Re: [RFC] Remove NTFS kernel support
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 07:56:16 -0000

On 2/7/08, Attilio Rao <attilio@freebsd.org> wrote:
> As exposed by several users, NTFS seems to be broken even before first
>  VFS commits happeing around the end of December. Those commits exposed
>  some problems about NTFS which are currently under investigation.
>  Ultimately, This filesystem is also unmaintained at the moment.
>
>  Speaking with jeff, we agreed on what can be a possible compromise:
>  remove the kernel support for NTFS and maybe take care of the FUSE
>  implementation.
>  What I now propose is a small survey which can shade a light on us
>  about what do you think about this idea and its implications:
>  - Do you use NTFS?

Yes. I have a dual-boot machine.

>  - Are you interested in maintaining it?

No. I'm not familiar with kernel/fs programming.

>  - Do you know a good reason to not use FUSE ntfs implementation? What

Yes. Listening to music and watching video on NTFS disks stalls
frequently when using ntfs-3g.

>  the kernel counter part adds?

I've no idea.

>  - Do you think axing the kernel support a good idea?

For servers, Yes. For desktops, NO!

>
>  Thanks,
>  Attilio
>
>
>
>  --
>  Peace can only be achieved by understanding - A. Einstein
>  _______________________________________________
>  freebsd-fs@freebsd.org mailing list
>  http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>  To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>


-- 
Regards,
Wang Yi

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 17:33:31 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1469C106566B;
	Tue,  1 Apr 2008 17:33:31 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id D624D8FC15;
	Tue,  1 Apr 2008 17:33:30 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31HXFaJ039652;
	Tue, 1 Apr 2008 10:33:15 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31HXF6e039649;
	Tue, 1 Apr 2008 10:33:15 -0700 (PDT)
Date: Tue, 1 Apr 2008 10:33:15 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804011733.m31HXF6e039649@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 17:33:31 -0000

:> with the blockmap layer entirely since seek locality of=20
:> reference is not needed for a flash filesystem, and the global
:> B-Tree would serve directly as the named-block topology.
:
:Which would lead you almost directly to the sort of performance problems
:that jffs2 has.
:
:Until you've done it, you'll be surprised at the cost of maintaining
:b-trees in NAND.

    Well, I'm not advocating a B-Tree storage model for indexing in NAND.
    That would be kinda nasty.  What I've done is simply describe a mechanism
    whereby a filesystem topology is able to make use of an abstraction to
    the point of being able to do away with what would normally have to be
    implemented by the filesystem itself.  It doesn't have to be a B-Tree.

    You keep mentioning jffs2 and you keep mentioning 'the sort of
    performance problems that jffs2 has'... ok, but you aren't actually
    saying what they are with any specificity.  Just saying that a blockmap
    or a named-block model is bad is wholly insufficient... it's way too
    broad a brush that ignores the literally thousands of ways such
    entities can be implemented.  I've described numerous ways such entities
    can work, particularly if one is manipulating large blocks.  If you
    want to address those please feel free but holding up jffs2 as a poster
    child of fail for an entire class of storage modeling is stupid.

    Please also remember, since you appear to have forgotten, that
    topologies can be implemented in both ram and storage together and
    are NOT necessarily ram intensive.  This is going to be particularly
    true for any application reading or writing large files, such as an
    audio application, and is even more particularly true when dealing
    with fairly large files in fairly small amounts of storage.  Synthesis
    is a major design component for small scale filesystems.

    I can't comment on your filesystem specifically, but you are welcome
    to describe it in more detail.

    I've been doing embedded work for over 20 years now, everything from single
    chip microcomputers with 256 bytes of ram to little ARM chipsets running
    linux.  I still have all that goddamn machine code burned into my brain,
    in fact, like a lost cousin.  Please do not make the inference that I
    somehow do not understand the issues involved.  I know precisely what
    the issues are and I will only repeat that for small scale devices,
    particularly recording and playback devices, the filesystem design
    devolves into trivialities that are easily cached, even if you don't
    have a lot of ram.  Large linear files are extremely well suited for
    synthetic topologies and ridiculously easy to manage the performance
    characteristics of.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 17:48:16 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 080171065676
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 17:48:16 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id D871F8FC13
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 17:48:15 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31HmE7w039801;
	Tue, 1 Apr 2008 10:48:14 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31HmE1h039800;
	Tue, 1 Apr 2008 10:48:14 -0700 (PDT)
Date: Tue, 1 Apr 2008 10:48:14 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804011748.m31HmE1h039800@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
Cc: freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 17:48:16 -0000

:
:> -----Original Message-----
:> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]=20
:> Sent: Monday, March 31, 2008 3:20 PM
:>=20
:> For flash storage systems competitive with hard drive storage,=20
:
:In embedded systems, it's RAM that flash storage competes with, not hard
:
:drive storage.
:
:SSD is a completely different engineering problem.

    You know, I think I've asked this already and you don't have to answer
    it if you don't want to, but exactly how large a flash device are you
    working with in your embedded project(s)?

					-Matt


From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 17:56:16 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B755E1065672;
	Tue,  1 Apr 2008 17:56:16 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 987C68FC2D;
	Tue,  1 Apr 2008 17:56:16 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 595724020D7;
	Tue,  1 Apr 2008 10:56:06 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Tue, 1 Apr 2008 10:56:14 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>
In-Reply-To: <200804011733.m31HXF6e039649@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciUHoB86TsNjwfyS+ixG2NdJB9SywAACYKg
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
	<200804011733.m31HXF6e039649@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 17:56:16 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]=20
> Sent: Tuesday, April 01, 2008 10:33 AM
> To: Martin Fouts

> Well, I'm not advocating a B-Tree storage model for=20
> indexing in NAND. That would be kinda nasty.  What I've done=20
> is simply describe a mechanism whereby a filesystem topology=20
> is able to make use of an abstraction to
> the point of being able to do away with what would=20
> normally have to be implemented by the filesystem itself.  It=20
> doesn't have to be a B-Tree.
>=20

It has to be a data structure with certain properties, most notably
what's required to maintain consistency. It might in theory be possible
to invent such a data structure that doesn't trip over NAND performance
issues. In practice, it has not turned out to be so. I welcome your
demonstration of such a design.

> You keep mentioning jffs2 and you keep mentioning 'the sort of
> performance problems that jffs2 has'... ok, but you=20
> aren't actually saying what they are with any specificity.

There's plenty of information on jffs2's performance problems available.

> Just saying  that a blockmap or a named-block model is bad=20
> is wholely insufficient...=20

Saying that it's good, and then describing an implementation that's
known in practice to be bad is much less sufficient.


> it's way too broad a brush that ignores the literally
> thousands of ways such entities can be implemented.  I've
> described numerous ways such entities can work, particularly
> if one is manipulating large blocks.

And I've pointed out that your idea of 'large' is too large to be of
value in CE devices.

> If you want to address those please feel free but
> holding up jffs2 as a poster child of fail for an=20
> entire class of storage modeling is stupid.

Indeed it would be.  It's good that I haven't done so.

The only times I've brought jffs2 up is when you've described approaches
that are jffs2-like, and I've pointed out that those specific approaches
have failed in jffs2.


> Please also remember, since you've appeared to have=20
> forgotten, that topologies can be implemented in both ram
> and storage together and are NOT necessarily ram intensive.

No, Matt, I haven't "forgotten".  It's a trivial statement. At runtime
*all* topologies have in-ram and on-storage components.

> I've doing embedded work for over 20 years now,=20

But, by your own earlier admission, you have no experience with NAND in
such systems.  It is a common mistake to extrapolate from NOR flash to
inappropriate assumptions about NAND flash.

> Large linear files are extremely well suited for
> synthetic topologies and ridiculously easy to manage the=20
> performance characteristics of.

"large linear files" are fairly rare on the ground in convergent
devices. What you say may well be true for a simple MP3 player, but
that's not what we're talking about here.

You've done the same thing in this email that you did in your earlier
comparison. You've found a trivial subset of the problem and then made
the generalization that solving that subset shows that the solution to
the problem is trivial.

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 17:56:16 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B755E1065672;
	Tue,  1 Apr 2008 17:56:16 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 987C68FC2D;
	Tue,  1 Apr 2008 17:56:16 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 595724020D7;
	Tue,  1 Apr 2008 10:56:06 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Tue, 1 Apr 2008 10:56:14 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>
In-Reply-To: <200804011733.m31HXF6e039649@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciUHoB86TsNjwfyS+ixG2NdJB9SywAACYKg
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
	<200804011733.m31HXF6e039649@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 17:56:16 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Tuesday, April 01, 2008 10:33 AM
> To: Martin Fouts

> Well, I'm not advocating a B-Tree storage model for
> indexing in NAND. That would be kinda nasty.  What I've done
> is simply describe a mechanism whereby a filesystem topology
> is able to make use of an abstraction to
> the point of being able to do away with what would
> normally have to be implemented by the filesystem itself.  It
> doesn't have to be a B-Tree.
>

It has to be a data structure with certain properties, most notably
what's required to maintain consistency. It might in theory be possible
to invent such a data structure that doesn't trip over NAND performance
issues. In practice, it has not turned out to be so. I welcome your
demonstration of such a design.

> You keep mentioning jffs2 and you keep mentioning 'the sort of
> performance problems that jffs2 has'... ok, but you
> aren't actually saying what they are with any specificity.

There's plenty of information on jffs2's performance problems available.

> Just saying  that a blockmap or a named-block model is bad
> is wholly insufficient...

Saying that it's good, and then describing an implementation that's
known in practice to be bad is much less sufficient.


> it's way too broad a brush that ignores the literally
> thousands of ways such entities can be implemented.  I've
> described numerous ways such entities can work, particularly
> if one is manipulating large blocks.

And I've pointed out that your idea of 'large' is too large to be of
value in CE devices.

> If you want to address those please feel free but
> holding up jffs2 as a poster child of fail for an
> entire class of storage modeling is stupid.

Indeed it would be.  It's good that I haven't done so.

The only times I've brought jffs2 up are when you've described approaches
that are jffs2-like, and I've pointed out that those specific approaches
have failed in jffs2.


> Please also remember, since you've appeared to have
> forgotten, that topologies can be implemented in both ram
> and storage together and are NOT necessarily ram intensive.

No, Matt, I haven't "forgotten".  It's a trivial statement. At runtime
*all* topologies have in-ram and on-storage components.

> I've been doing embedded work for over 20 years now,

But, by your own earlier admission, you have no experience with NAND in
such systems.  It is a common mistake to extrapolate from NOR flash to
inappropriate assumptions about NAND flash.

> Large linear files are extremely well suited for
> synthetic topologies and ridiculously easy to manage the
> performance characteristics of.

"large linear files" are fairly rare on the ground in convergent
devices. What you say may well be true for a simple MP3 player, but
that's not what we're talking about here.

You've done the same thing in this email that you did in your earlier
comparison. You've found a trivial subset of the problem and then made
the generalization that solving that subset shows that the solution to
the problem is trivial.

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 18:06:35 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DDC28106564A
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 18:06:35 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id C324E8FC22
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 18:06:35 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 88086403750;
	Tue,  1 Apr 2008 11:06:26 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Tue, 1 Apr 2008 11:06:35 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D19@EXCHANGE.danger.com>
In-Reply-To: <200804011748.m31HmE1h039800@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciUIJASYk2JPFFoQSiKOw+tRD4dxgAAkzWQ
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
	<200804011748.m31HmE1h039800@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 18:06:36 -0000

If you've asked, I've missed the question.

We tend to size ram and embedded NAND the same. The latest numbers I can
discuss are several years old and were 64MB/64MB. Engineering *always*
wants more of each, but the BOM rules.

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Tuesday, April 01, 2008 10:48 AM
> To: Martin Fouts
> Cc: freebsd-arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics
>
> :
> :> -----Original Message-----
> :> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> :> Sent: Monday, March 31, 2008 3:20 PM
> :>
> :> For flash storage systems competitive with hard drive storage,
> :
> :In embedded systems, it's RAM that flash storage competes with, not hard
> :drive storage.
> :
> :SSD is a completely different engineering problem.
>
>     You know, I think I've asked this already and you don't have to answer
>     it if you don't want to, but exactly how large a flash device are you
>     working with in your embedded project(s)?
>
> 					-Matt
>
>

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 18:07:58 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B7E111065671
	for <arch@freebsd.org>; Tue,  1 Apr 2008 18:07:58 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 994DC8FC1A
	for <arch@freebsd.org>; Tue,  1 Apr 2008 18:07:58 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31I7iFF039980;
	Tue, 1 Apr 2008 11:07:44 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31I7g8I039974;
	Tue, 1 Apr 2008 11:07:42 -0700 (PDT)
Date: Tue, 1 Apr 2008 11:07:42 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804011807.m31I7g8I039974@apollo.backplane.com>
To: Bakul Shah <bakul@bitblocks.com>
References: <20080401011306.2A4875B41@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	Alfred Perlstein <alfred@freebsd.org>, qpadla@gmail.com,
	arch@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 18:07:58 -0000

:My instinct is to not combine transactions.  That is, every
:data write results in a sequence: {data, [indirect blocks],
:inode, ..., root block}.  Until the root block is written to
:the disk this is not a "commited" transaction and can be
:thrown away.  In a Log-FS we always append on write; we never
:overwrite any data/metadata so this is easy and the FS state
:remains consistent.  FFS overwrites blocks so all this gets
:far more complicated.  Sort of like the difference between
:reasoning about functional programs & imperative programs!
:
:Now, it may be possible to define certain rules that allows
:one to combine transactions.  For instance,
:
:    write1(block n), write2(block n) == write2(block n)
:    write(block n of file f1), delete file f1 == delete file f1
:
:etc. That is, as long as write1 & associated metadata writes
:are not flushed to the disk, and a later write (write2) comes
:along, the earlier write (write1) can be thrown away.  [But I
:have no idea if this is worth doing or even doable!]

    This is a somewhat different problem, one that is actually fairly
    easy to solve in larger systems because operating systems tend to
    want to cache everything.  So really what is going on is that
    your operations (until you fsync()) are being cached in system memory
    and are not immediately committed to the underlying storage.  Because
    of that, overwrites and deletions can simply destroy the related
    cache entities in system memory and never touch the disk.
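
    The two rewrite rules quoted above fall out of a write-back cache
    almost for free.  A toy sketch follows; the WriteCache class and its
    method names are invented for illustration and are not taken from
    any real kernel, and flush() merely stands in for fsync():

```python
class WriteCache:
    """Toy write-back cache: dirty blocks live in memory until flush()."""

    def __init__(self):
        self.dirty = {}          # (file, block_no) -> data not yet on disk
        self.disk_writes = []    # what actually reached "storage"

    def write(self, file, block_no, data):
        # A later write to the same block simply replaces the cached copy.
        self.dirty[(file, block_no)] = data

    def delete(self, file):
        # Deleting a file discards its unflushed blocks outright.
        for key in [k for k in self.dirty if k[0] == file]:
            del self.dirty[key]

    def flush(self):
        # Stand-in for fsync(): only surviving dirty blocks hit storage.
        for key in sorted(self.dirty):
            self.disk_writes.append((key, self.dirty[key]))
        self.dirty.clear()


cache = WriteCache()
cache.write("f1", 0, b"old")
cache.write("f1", 0, b"new")     # write1(n), write2(n) == write2(n)
cache.write("f2", 3, b"temp")
cache.delete("f2")               # write(f), delete(f) == delete(f)
cache.flush()
print(cache.disk_writes)
```

    Only the final contents of f1's block reach storage; the overwritten
    data and the deleted file's block never leave memory.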

    Ultimately you have to flush something to disk, and that is where
    the transactional atomicity and ordering issues start popping up.

:This is reminiscent of the bottom up rewrite system (BURS)
:used in some code generators (such as lcc's). The idea is the
:same here: replace a sequence of operations with an
:equivalent but lower cost sequence.

    What it comes down to is how expensive do you want your fsync()
    to be?  You can always commit everything down to the root block
    and your recovery code can always throw everything away until it
    finds a good root block, and avoid the whole issue, but if you do
    things that way then fsync() becomes an extremely expensive call to
    make.  Certain applications, primarily database applications, really
    depend on having an efficient fsync().
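
    The brute-force variant can be sketched in a few lines.  This is a
    minimal model under stated assumptions: the log is an in-memory list,
    each record carries a CRC32 of its payload, and the record layout
    (kind/payload/crc) is hypothetical rather than any real on-disk format:

```python
import zlib


def make_record(kind, payload):
    # kind is "data" or "root"; the CRC lets recovery detect torn writes.
    return {"kind": kind, "payload": payload, "crc": zlib.crc32(payload)}


def is_valid(rec):
    return zlib.crc32(rec["payload"]) == rec["crc"]


def recover(log):
    """Return the prefix of the log ending at the last good root block."""
    for i in range(len(log) - 1, -1, -1):
        rec = log[i]
        if rec["kind"] == "root" and is_valid(rec):
            return log[: i + 1]
    return []   # no committed state at all


log = [
    make_record("data", b"block A"),
    make_record("root", b"root v1"),
    make_record("data", b"block B"),
    make_record("root", b"root v2"),
    make_record("data", b"block C"),   # written, but no root follows it
]
log[3]["payload"] = b"root v"          # simulate a torn write of root v2

recovered = recover(log)
print(len(recovered), recovered[-1]["payload"])
```

    Everything after the last intact root is thrown away, which is why
    each fsync() in this scheme has to push a complete chain down to a
    new root block before it can return.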

    Brute force is always simpler, but not necessarily always desirable.

:...
:>     is in the algorith, but not the (fairly small) amount of time it takes
:>     to actually perform the recovery operation.
:
:I don't understand the complexity. Basically your log should
:allow you to define a functional programming abstraction --
:where you never overwrite any data/metadata for any active
:transactions and so reasoning becomes easier. [But may be we
:should take any hammer discussion offline]

     The complexity is there because a filesystem is actually a multi-layer
     entity.  One has a storage topology which must be indexed in some manner,
     but one also has the implementation on top of that storage topology
     which has its own consistency requirements.

     For example, UFS stores inodes in specific places and has bitmaps for
     allocating data blocks and blockmaps to access data from its inodes.

     But UFS also has to maintain the link count for a file, the st_size
     field in the inode, the directory entry in the directory, and so forth.
     Certain operations require multiple filesystem entities to be adjusted
     as one atomic operation.  For example removing a file requires the
     link count in the inode to be decremented and the entry in the directory
     to be removed.

     Undo logs are very good at describing the low level entity, allowing you
     to undo changes in time order, but undo logs need additional logic to
     recognize groups of transactions which must be recovered or thrown away
     as a single atomic entity, or which depend on each other.

     One reason why it's a big issue is that portions of those transactions
     can be committed to disk out of order.  The recovery code has to
     recognize that dependent pieces are not present even if other
     micro-transactions have been fully committed.
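
     A rough sketch of that grouping logic, assuming each log record is
     tagged with a transaction id, a piece index and the total number of
     pieces in its group (a made-up record format, used here only to show
     the idea; dependencies between different groups are not modelled):

```python
from collections import defaultdict


def complete_transactions(records):
    """Return the ids of transactions whose every piece reached storage."""
    seen = defaultdict(set)
    expected = {}
    for txid, piece, total, _op in records:
        seen[txid].add(piece)
        expected[txid] = total
    return {t for t, pieces in seen.items() if len(pieces) == expected[t]}


# Records as found on disk after a crash; they landed out of order.
on_disk = [
    (7, 1, 2, "unlink: remove directory entry 'a'"),
    (8, 0, 2, "unlink: decrement link count of inode 12"),
    (7, 0, 2, "unlink: decrement link count of inode 11"),
    # piece 1 of transaction 8 was never flushed before the crash
]

keep = complete_transactions(on_disk)
undo = [r for r in on_disk if r[0] not in keep]
print(keep, undo)
```

     Transaction 7 is kept because both halves of the unlink are present;
     the lone piece of transaction 8 has to be undone even though that
     piece, taken by itself, was fully committed.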

     Taking UFS as an example:  UFS's fsck can clean up link counts and
     directory entries, but has no concept of lost file data so you can
     wind up with an inode specifying a 100K file which after recovery
     is actually full of zeros (or garbage) instead of the 100K of data
     that was written to it.  That is an example of a low level recovery
     operation that is unable to take into account high level dependencies.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 20:10:24 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 706F91065670
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 20:10:24 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 4F6728FC3E
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 20:10:24 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31KAMJV041012;
	Tue, 1 Apr 2008 13:10:22 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31KAMpu041011;
	Tue, 1 Apr 2008 13:10:22 -0700 (PDT)
Date: Tue, 1 Apr 2008 13:10:22 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804012010.m31KAMpu041011@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
	<200804011748.m31HmE1h039800@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D19@EXCHANGE.danger.com>
Cc: freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 20:10:24 -0000


:If you've asked, I've missed the question.
:
:We tend to size ram and embedded NAND the same. The latest numbers I can
:discuss are several years old and were 64MB/64MB. Engineering *always*
:wants more of each, but the BOM rules.

    64MB is tiny.  None of the problems with any of the approaches we've
    discussed even exist with devices that small in an embedded system.
    You barely need to even implement a filesystem topology, let alone
    anything sophisticated.

    To be clear, because I really don't understand how you can possibly
    argue that the named-block storage layer is bad in a device that small... 
    the only sophistication a named-block storage model needs is when it must
    create a forward lookup on-flash to complement the reverse lookup you
    get from the auxiliary storage.  Given that you can trivially cache
    many translations in memory, not to mention do a boot-up scan of a flash
    that small, the only performance impact would be writing out a portion
    of the forward translation topology every N (N > 1000) or so page writes 
    (however many translations can be conveniently cached in system memory).
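
    The boot-up scan can be pictured like this.  It is only a sketch: it
    assumes each page's auxiliary (spare) area holds a (name, sequence)
    tag, which is one plausible encoding of the reverse lookup and not
    something specified in this thread, and the function name is invented:

```python
def build_forward_map(oob_tags):
    """Rebuild name -> physical page from per-page reverse-lookup tags.

    oob_tags[phys] is the (name, seq) tag read from a page's spare area,
    or None if the page is erased.
    """
    newest = {}   # name -> (seq, phys) of the newest copy seen so far
    for phys, tag in enumerate(oob_tags):
        if tag is None:
            continue
        name, seq = tag
        if name not in newest or seq > newest[name][0]:
            newest[name] = (seq, phys)
    return {name: phys for name, (_seq, phys) in newest.items()}


# Five physical pages; "inode:5" was rewritten once, so two copies exist.
oob_tags = [("inode:5", 1), ("data:9", 1), None, ("inode:5", 2), None]
fmap = build_forward_map(oob_tags)
print(fmap)
```

    The stale copy of "inode:5" in page 0 loses to the newer one in
    page 3, so the forward map exists purely in memory and nothing has
    to be read back from an on-flash index.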

    In a small flash device the embedded application will determine whether
    you even need such a table... frankly, unless it's a general purpose 
    computing device like an iPhone you probably wouldn't need an on-flash
    forward lookup, you could simply size the blocks to guarantee 99% flash
    utilization versus the number of files you expect to have to maintain
    (so, for example, the named block size could be 512K if you expected
    to have to maintain 100 files on a 64MB device).  This doesn't mean the
    filesystem would have to use a 512K block size, that would only be the
    case if the filesystem were flash-unaware.

    It's seriously a non-issue.  You are making too many assumptions about
    how named blocks would be used, particularly if the filesystem is
    flash-aware.  Named blocks do not have to be 'sectors' or 'filesystem
    blocks' or anything of the sort.  They can be much larger.. they can
    easily be multiples of a flash page though you don't want to make them
    too large because a failed page also fails the whole named-block covering
    that page.  They can be erase units (probably the best fit).

    This leaves the filesystem layer (and here we are talking about a
    flash-aware filesystem), with a hellofalot more implementation
    flexibility.

			FORWARD LOOKUP ON-FLASH TOPOLOGY

    There are many choices available for the forward lookup topology,
    assuming you need one.  Here again we are describing the need to have
    one (or at least one that would be considered sophisticated) only for
    larger flash devices -- really true solid state storage.

    We aren't talking about having to write out tiny little updates to
    B-Tree elements... that's stupid.  Only an idiot would do that.

    Because you can cache new lookups in system memory and because you do
    NOT have to flush the forward lookup topology to flash for crash
    recovery purposes, the sole limiting factor for the efficiency of the
    forward lookup flush to flash is the amount of system memory you are
    willing to set aside to cache new translations.  Since translations are
    fairly small structures we are probably talking not dozens, not hundreds,
    but probably at least a thousand translations before any sort of flush
    would be needed.

    Let's be clear here.  That's ONE page write every THOUSAND page writes
    worth of overhead.  There are no write performance issues.
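
    For a feel of the numbers, here is a back-of-the-envelope simulation.
    The 8K page and the 8-byte translation entry are assumed values picked
    for the example, and the model counts only the page that holds the
    flushed translations, ignoring any interior index pages a real
    topology would also have to rewrite:

```python
PAGE_SIZE = 8 * 1024                          # assumed flash page size, bytes
ENTRY_SIZE = 8                                # assumed size of one translation
ENTRIES_PER_PAGE = PAGE_SIZE // ENTRY_SIZE    # translations per flushed page


def simulate(data_page_writes):
    """Count map-page flushes if each data write dirties one translation."""
    dirty = 0
    map_page_writes = 0
    for _ in range(data_page_writes):
        dirty += 1
        if dirty == ENTRIES_PER_PAGE:         # cache holds a full page's worth
            map_page_writes += 1
            dirty = 0
    return map_page_writes


data_writes = 1_000_000
map_writes = simulate(data_writes)
overhead = map_writes / data_writes
print(ENTRIES_PER_PAGE, map_writes, f"{overhead:.4%}")
```

    Under these assumptions the extra writes come to just under 0.1% of
    the data writes, in line with the one-in-a-thousand figure above.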

    The actual on-flash topology for the forward lookup?  With such a large
    rollup cache available it almost doesn't matter, but let's say we wanted
    to limit forward lookups to 3 levels.

    Let's take a 2G flash device with 8K pages, just to be nice about it.
    That's 262144 named blocks.  Not a very large number, eh?  Why you 
    could almost fit that in system memory (and maybe you can!) and obviate
    the need for an on-flash forward lookup topology at all.

    But let's say you COULDN'T fit that in system memory.  Hmm. 3 levels,
    262144 entries maximum (less in real life).  That would be a 3-layer
    radix tree with a radix of 64.  The top layer would almost certainly
    be cacheable in system memory (and maybe even two layers) so we are
    talking one or two page reads from flash to do the lookup and the
    update mechanic, being a radix tree, would be to sync the bits of the
    radix tree that were modified by the new translations all the way up
    to the root.
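
    The arithmetic and the lookup path can be checked with a small model.
    Nodes are plain dicts here purely for brevity; an on-flash layout
    would use fixed 64-slot arrays, and the function names are made up
    for the example:

```python
RADIX_BITS = 6                        # radix 64 = 2**6
LEVELS = 3
FLASH_BYTES = 2 * 1024**3             # 2G device
PAGE_BYTES = 8 * 1024                 # 8K pages
NAMED_BLOCKS = FLASH_BYTES // PAGE_BYTES


def radix_path(block_no):
    """Split a block number into one slot index per level, root first."""
    mask = (1 << RADIX_BITS) - 1
    return [(block_no >> (RADIX_BITS * lvl)) & mask
            for lvl in range(LEVELS - 1, -1, -1)]


def insert(root, block_no, phys):
    *inner, leaf = radix_path(block_no)
    node = root
    for slot in inner:
        node = node.setdefault(slot, {})   # create interior nodes on demand
    node[leaf] = phys


def lookup(root, block_no):
    node = root
    for slot in radix_path(block_no):
        node = node.get(slot)
        if node is None:
            return None                    # no translation recorded
    return node


root = {}
insert(root, 4097, 77)                     # named block 4097 -> phys page 77
print(NAMED_BLOCKS, radix_path(4097), lookup(root, 4097), lookup(root, 4098))
```

    Three 6-bit slot indices cover exactly 64**3 = 262144 named blocks,
    so every lookup touches at most three nodes, and with the upper one
    or two levels held in memory only the remainder costs a flash read.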

    Clearly given the number of 'dirty' translations that would need to be
    synchronized, you could easily fill a flash page and then simply retire
    the synced translations from system memory, and repeat as often as
    necessary to maintain the dirty ratio in the cache in system memory
    at appropriate levels.  You can also clearly accumulate enough dirty
    translations for the sync to be worthwhile... that is, be guaranteed
    to fill a whole page.  You do NOT have to sync for recovery purposes
    so it becomes an issue that is solely related to the system cache
    and nothing else.

    I'll add something else with regard to radix trees using large radices...
    you can usually cache just about the whole damn thing except the leaf
    level in system memory.  Think about that for a moment and in particular
    think about how it greatly reduces the number of actual flash reads
    needed to perform the lookup.

    I'll add something else with regards to on-storage radix trees.  You can
    also adjust the layout so the BOTTOM few levels of the radix tree,
    relative to some leaf, reside in the same page.

    So now we've reduced a random uncached translation lookup to, at worst,
    ONE flash page read operation that ALSO guarantees us locality of
    reference for nearby file blocks (and hence has no performance issues
    for streaming reads either).

    --

    Now, if you want to argue that this model would have serious performance
    penalties please go ahead, I'm all ears.

						-Matt


From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 20:15:12 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5108F106564A;
	Tue,  1 Apr 2008 20:15:12 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 29A028FC3D;
	Tue,  1 Apr 2008 20:15:12 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31KEv0e041050;
	Tue, 1 Apr 2008 13:14:58 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31KEvTJ041049;
	Tue, 1 Apr 2008 13:14:57 -0700 (PDT)
Date: Tue, 1 Apr 2008 13:14:57 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804012014.m31KEvTJ041049@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
	<200804011733.m31HXF6e039649@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 20:15:12 -0000


:You've done the same thing in this email that you did in your earlier
:comparison. You've found a trivial subset of the problem and then made
:the generalization that solving that subset shows that the solution to
:the problem is trivial.

    You know as well as I do that embedded projects are ALWAYS a trivial
    subset of something.  Until you get to the level of sophistication of
    an iPhone.  You only need to solve the subset of the problem that
    the embedded project covers.

    Most general problem sets become trivialized when used in degenerate
    environments.  This is not a description of a trivialized solution to the
    problem set being generalized up, it is a description of the generalized
    solution to the problem set being applied to a degenerate application
    which trivializes many aspects of the general solution.

    My interest is in large scale systems, OF COURSE I'm approaching the
    problem from the point of view of large scale systems and not small
    scale systems.  Don't be silly.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 20:20:20 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id F1E1F1065685
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 20:20:20 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id D6F3F8FC27
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 20:20:20 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 5A5E8402FDF;
	Tue,  1 Apr 2008 13:20:11 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Tue, 1 Apr 2008 13:20:19 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D1D@EXCHANGE.danger.com>
In-Reply-To: <200804012010.m31KAMpu041011@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciUNGuMC9ZvsF/0TUeT8Q90DJXWjAAAFO/g
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
	<200804011748.m31HmE1h039800@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D19@EXCHANGE.danger.com>
	<200804012010.m31KAMpu041011@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 20:20:21 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Tuesday, April 01, 2008 1:10 PM
> To: Martin Fouts
> Cc: freebsd-arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics
>
>     64MB is tiny.  None of the problems with any of the
> approaches we've discussed even exist with devices that small in an
> embedded system.

It is fairly clear that you're not familiar with NAND devices on
embedded systems, as you've just said that well known problems do not
exist.

> To be clear, because I really don't understand how you
> can possibly argue that the named-block storage layer is bad in a
> device that small...

Yes, your lack of understanding is very apparent.

> It's seriously a non-issue.  You are making too many
> assumptions about how named blocks would be used, particularly
> if the filesystem is flash-aware.

Now you're moving your goal posts. You came into this suggesting that
the file system not be flash-aware. If I make the file system flash
aware, then many of the problems become manageable.  That *was* my
starting thesis, after all.

> Now, if you want to argue that this model would have
> serious performance penalties please go ahead,
> I'm all ears.

Feel free to implement it and see for yourself.

The only point I had wished to make is that you get performance wins out
of making the file system flash aware. Now that you've agreed to that,
feel free to experiment with any of a number of ways of making it flash
aware.

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 20:32:42 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 868EE106564A;
	Tue,  1 Apr 2008 20:32:42 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 688538FC15;
	Tue,  1 Apr 2008 20:32:42 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 282AB403D0A;
	Tue,  1 Apr 2008 13:32:33 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Tue, 1 Apr 2008 13:32:41 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D1E@EXCHANGE.danger.com>
In-Reply-To: <200804012014.m31KEvTJ041049@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciUNRcYybraZFx2R9SkG1318plQVgAAL2Xg
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
	<200804011733.m31HXF6e039649@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>
	<200804012014.m31KEvTJ041049@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 20:32:42 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Tuesday, April 01, 2008 1:15 PM
> To: Martin Fouts
> Cc: qpadla@gmail.com; freebsd-arch@freebsd.org; Christopher
> Arnold; arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics

> You know as well as I do that embedded projects are
> ALWAYS a trivial subset of something.

No, I don't know that. It is hard to "know" something that is not true.

> Until you get to the level of sophistication of
> an iPhone.

Although Apple is getting much hype about the sophistication of the
iPhone, we've been shipping convergent devices of that complexity for
some time now. Apple have better industrial design, but they're not
doing anything, other than the touch screen, that we haven't already
done.

You are now *starting* to understand the level of complexity of CE
embedded devices.

> My interest is in large scale systems, OF COURSE I'm
> approaching the problem from the point of view
> of large scale systems and not small
> scale systems.  Don't be silly.

Actually, Matt, it's you, by trying to solve a complex embedded systems
problem as if it were a 'degenerate' large scale systems problem, who
are "being silly."  You keep handing me crowbars when I need a scalpel.

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 22:11:52 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4BAFB106567B
	for <arch@freebsd.org>; Tue,  1 Apr 2008 22:11:52 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 08C908FC23
	for <arch@freebsd.org>; Tue,  1 Apr 2008 22:11:51 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.2/8.14.1) with ESMTP id m31M94wI093631;
	Tue, 1 Apr 2008 16:09:04 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Tue, 01 Apr 2008 16:09:52 -0600 (MDT)
Message-Id: <20080401.160952.1678772361.imp@bsdimp.com>
To: jroberson@chesapeake.net
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <20080326230322.H72156@desktop>
References: <20080327.013229.1649766744.imp@bsdimp.com>
	<20080326230322.H72156@desktop>
X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org
Subject: Re: AsiaBSDCon DEVSUMMIT patch
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 22:11:52 -0000

In message: <20080326230322.H72156@desktop>
            Jeff Roberson <jroberson@chesapeake.net> writes:
: 
: On Thu, 27 Mar 2008, M. Warner Losh wrote:
: 
: > Greetings,
: >
: > We've been talking about the situation with suspend/resume in the
: > tree.  Here's a quick hack to allow one to suspend/resume an
: > individual device.  This may or may not work too well, but it is
: > offered up for testing and criticism.
: >
: > 	http://people.freebsd.org/~imp/devctl.diff
: >
: > devctl -s ath 0		suspend ath0
: > devctl -r ath 0		resume ath0
: 
: Hey Warner,
: 
: This is a great idea.  Would it be possible to provide a little more 
: background about what the expected failure/success modes are?  If we had 
: some easy to follow steps we could ask for testers on current@ and create 
: a wiki with a list of known working/broken hardware.  That'd be a great 
: step towards having widespread suspend/resume support.

There are two areas of testing/use here.

The first is to run it like so:

devctl -s ath 0 && sleep 10 && devctl -r ath 0

E.g., suspend and resume an individual device, or even a tree of devices.
At least one bug has been found with this technique (it is actually a
rediscovery of an older bug, but I digress).  You'd want the kernel not
to panic, and you'd want the device to work as well after the resume as
it did before.

The second is to use it to make sure that a device remains sane after
being suspended for a long time.  This can have power savings
potential too, but that's a secondary effect at this time.
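
The suspend/sleep/resume test described above could be scripted roughly
as follows.  This is only a sketch: the function name
suspend_resume_cycle and its run/sleep parameters are made up for
illustration, and the "devctl -s" / "devctl -r" flags are simply the
ones shown in this message for the patched devctl, not a documented
interface.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def _run_checked(argv: Sequence[str]) -> None:
    # Raises CalledProcessError on a non-zero exit, which mirrors the
    # "&&" chaining in the shell one-liner: no resume after a failed suspend.
    subprocess.run(list(argv), check=True)


def suspend_resume_cycle(
    driver: str,
    unit: int,
    delay_seconds: float = 10.0,
    run: Callable[[Sequence[str]], None] = _run_checked,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Suspend one device, wait, then resume it; return the commands issued."""
    # Flags as given in the message; hypothetical outside the patched tree.
    suspend = ["devctl", "-s", driver, str(unit)]
    resume = ["devctl", "-r", driver, str(unit)]
    run(suspend)
    sleep(delay_seconds)
    run(resume)
    return [suspend, resume]
```

Calling suspend_resume_cycle("ath", 0) reproduces the one-liner; a much
larger delay_seconds covers the long-suspend case.  Checking that the
device still works afterwards is left to the tester.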

Warner

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 22:26:06 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 490FA106566C
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 22:26:06 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 279328FC21
	for <freebsd-arch@freebsd.org>; Tue,  1 Apr 2008 22:26:06 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31MQ4Di042174;
	Tue, 1 Apr 2008 15:26:04 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31MQ42O042173;
	Tue, 1 Apr 2008 15:26:04 -0700 (PDT)
Date: Tue, 1 Apr 2008 15:26:04 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804012226.m31MQ42O042173@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
	<200804011748.m31HmE1h039800@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D19@EXCHANGE.danger.com>
	<200804012010.m31KAMpu041011@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D1D@EXCHANGE.danger.com>
Cc: freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 22:26:06 -0000


:>     64MB is tiny.  None of the problems with any of the
:> approaches we've discussed even exist with devices that small in an
:> embedded system.
:
:It is fairly clear that you're not familiar with NAND devices on
:embedded systems, as you've just said that well known problems do not
:exist.
:
:> To be clear, because I really don't understand how you
:> can possibly argue that the named-block storage layer is bad in a
:> device that small...
:
:Yes, your lack of understanding is very apparent.

    What complete bullshit.  If you want to argue technical merits, be
    my guest.  So far you haven't made one single technical point in
    any of your postings.  You've posted about your experience with NAND
    flash in embedded systems, very clearly with SMALL flash devices and
    simple filesystems, and that's fine, it's similar to my flash filesystem
    experience (which, yes, was primarily on NOR devices but, no, that
    doesn't magically make you an expert on NAND and me an idiot about it). 
    Considering I've pretty much spent my entire life working with hardware,
    that is about as ridiculous an assertion as you could make, but clearly
    you believe it.

    But then you generalized to the entire market and that's not fine.

    Real filesystems are far more sophisticated than what you will ever see
    in the embedded flash product, and consequently real filesystems tend
    to be broken down into more abstract terms so the higher layers can
    actually implement the filesystem functions without it taking 10 man
    years of programming.  My interest is squarely with real filesystems
    targeted to mass storage, these days.

    I didn't start out smearing people, but if you are going to start
    acting like an asshole then I have no problem ratcheting it up
    to your level.

:> It's seriously a non-issue.  You are making too many
:> assumptions about how named blocks would be used, particularly
:> if the filesystem is flash-aware.
:
:Now you're moving your goal posts. You came into this suggesting that
:the file system not be flash-aware. If I make the file system flash
:aware, then many of the problems become manageable.  That *was* my
:starting thesis, after all.

    More bullshit.  My first posting was not addressing performance issues,
    it was specifically addressing FFS and ZFS and the (bad) idea of making
    them more flash aware.  ZFS on a 2G flash device?  What the hell would
    be the point of that?  We're talking about two completely different
    things.

    It used a NOR flash translation table by way of example.  I sure as hell
    would never say that a flash-unaware filesystem would perform better
    than a flash-aware one.  Duh!

    You have seriously misread the meaning behind that posting and you
    clearly didn't read any of the other postings.  I suggest you go back
    and READ THE POSTINGS and maybe you'll start to understand the issues
    being addressed.

    Since you don't understand my position, let me lay it out for you
    in simple terms:

    * There's no point trying to adapt a flash-unaware filesystem to
      become flash-aware.  It is a complete waste of time.  You might as
      well write a new filesystem.  If you want to use a flash-unaware
      filesystem you use a translation layer, eat any performance issues,
      and be done with it.  MAYBE spend a few days optimizing the one
      or two critical paths you want to eke a little more performance out
      of.

      This has nothing to do with having to use translation tables and
      everything to do with the fact that the existence and use of those
      REQUIRED translation tables are not integrated into the flash-unaware
      filesystem, so inefficiencies are compounded rather than reduced.
      It's like jamming a square peg into a round hole.

    * Just because flash-unaware filesystems HAVE TO use a translation layer
      doesn't mean that a translation layer is bad for a flash-aware
      filesystem.

    * A named-block translation layer can be an extremely valuable abstraction
      for use in filesystem designs which directly integrate its features
      (that is, the filesystem NAMES the block instead of ALLOCATES the
      block).

      There is absolutely NOTHING inherently bad about the model from a 
      performance point of view, particularly if your storage media requires
      relocation (as NAND does).  The key point is that a named-block layer
      takes over the functionality of all the indirect pointers that would
      normally have to be manipulated by higher layers in the filesystem.
      If you can integrate that into the physical storage requirements then
      you kill two birds with one stone and get major performance benefits
      from doing so.
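
    To make the named-block idea concrete, here is a minimal sketch in
    Python.  Everything in it is hypothetical (the NamedBlockStore class,
    its method names, and the append-only page model are assumptions made
    for illustration; nothing here comes from an existing implementation).
    The filesystem hands the layer a NAME, such as (inode, logical block),
    and the layer alone decides which physical page holds the data:

```python
from typing import Dict, Hashable, List, Optional, Set


class NamedBlockStore:
    """Toy named-block layer over an append-only model of flash pages."""

    def __init__(self, num_pages: int) -> None:
        # Physical pages; None means the page is still erased.
        self.pages: List[Optional[bytes]] = [None] * num_pages
        # The only "pointer" structure: block name -> physical page number.
        self.index: Dict[Hashable, int] = {}
        # Pages holding superseded data, waiting for garbage collection.
        self.stale: Set[int] = set()
        self.next_free = 0

    def write(self, name: Hashable, data: bytes) -> int:
        """Store data under name, always relocating to a fresh page."""
        if self.next_free >= len(self.pages):
            raise RuntimeError("no erased pages left; a real layer would "
                               "garbage-collect stale pages here")
        page = self.next_free
        self.next_free += 1
        self.pages[page] = data
        old = self.index.get(name)
        if old is not None:
            self.stale.add(old)   # the old copy is now garbage
        self.index[name] = page   # the name itself never changes
        return page

    def read(self, name: Hashable) -> bytes:
        """Return the current contents of the named block."""
        data = self.pages[self.index[name]]
        assert data is not None
        return data
```

    Rewriting the block named (7, 0) moves it to a new physical page, yet
    the caller keeps using the same name and has no indirect pointer to
    update.  The sketch leaves out erase blocks, wear leveling and
    persistence of the index.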

    You are welcome to debate the points, but you'll get burned if you try
    to take some sort of moral high-ground stand based on a few piddly flash
    filesystems written over the course of a few years.  Coding at that
    level is fun and interesting but ultimately not very difficult.


:Feel free to implement it and see for yourself.
:
:The only point I had wished to make is that you get performance wins out
:of making the file system flash aware. Now that you've agreed to that,
:feel free to experiment with any of a number of ways of making it flash
:aware.

     Right now my work is with HAMMER.  It's fun to theorize how I could
     make HAMMER into a flash-aware filesystem but I have no intention of
     actually doing so any time soon, or ever.

     Frankly, if I wanted to write a ground-up flash filesystem I could,
     it would not be difficult... certainly not more difficult than HAMMER
     and HAMMER is probably the most sophisticated filesystem that exists
     in the open source world today.  But I have no desire to do that at
     this juncture and the lack of desire certainly does not invalidate my
     comments on the matter.

     It's kinda like saying a person has no right to comment about how to
     cut European apples if their focus in life is cutting American ones.
     NAND is different from NOR but the differences can be explained pretty
     much in two paragraphs and most of the same concepts apply.  You can't
     byte-write, you have auxiliary information, you need to add a little ECC,
     and scrub.  It isn't rocket science.
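
     As a rough illustration of those four points, here is a toy sketch
     in Python.  The names (PAGE_SIZE, ecc_syndrome, program_page,
     scrub_page) are invented for this example, and the reading of
     "auxiliary information" as a per-page spare area holding the ECC is
     an assumption.  The code is a simplified single-bit-correcting
     scheme, not the Hamming or BCH codes real NAND controllers use, and
     it assumes the stored ECC value itself is intact:

```python
from typing import Tuple

PAGE_SIZE = 512  # bytes per page; a hypothetical small-page NAND geometry


def ecc_syndrome(data: bytes) -> int:
    """XOR together the 1-based positions of every set bit in data."""
    acc = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if byte & (1 << bit):
                acc ^= byte_index * 8 + bit + 1
    return acc


def program_page(data: bytes) -> Tuple[bytes, int]:
    """Whole-page write: return the page plus the ECC for its spare area."""
    if len(data) != PAGE_SIZE:
        raise ValueError("NAND is programmed a full page at a time")
    return data, ecc_syndrome(data)


def scrub_page(data: bytes, stored_ecc: int) -> Tuple[bytes, bool]:
    """Re-check a page; return (corrected data, True if a bit was fixed)."""
    diff = ecc_syndrome(data) ^ stored_ecc
    if diff == 0:
        return data, False
    position = diff - 1  # 0-based index of the single flipped bit
    if position >= len(data) * 8:
        raise ValueError("uncorrectable: more than one bit is wrong")
    fixed = bytearray(data)
    fixed[position // 8] ^= 1 << (position % 8)
    return bytes(fixed), True
```

     A scrubber would walk the pages periodically, call scrub_page on
     each one, and rewrite any page that comes back corrected to a fresh
     location.  Errors of two or more bits can be miscorrected by this
     toy code, which is one reason real devices use stronger codes.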

     I am a very technical person.  If you are going to argue merit, then
     you damn well better say WHY something doesn't work, in detail,
     instead of simply stating that some random other entity couldn't
     make it work at some point in the past so therefore it is bad.  If you do
     not know the WHY, precisely, then good $#%$#%$#% luck designing
     anything that's actually sophisticated.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 23:26:11 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 335B8106566B;
	Tue,  1 Apr 2008 23:26:11 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id E53E38FC1C;
	Tue,  1 Apr 2008 23:26:10 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31NPxYl042552;
	Tue, 1 Apr 2008 16:25:59 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31NPwM1042551;
	Tue, 1 Apr 2008 16:25:58 -0700 (PDT)
Date: Tue, 1 Apr 2008 16:25:58 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804012325.m31NPwM1042551@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
	<200804011733.m31HXF6e039649@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>
	<200804012014.m31KEvTJ041049@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D1E@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 23:26:11 -0000

:Although Apple is getting much hype about the sophistication of the
:iPhone, we've been shipping convergent devices of that complexity for
:some time now. Apple have better industrial design, but they're not
:doing anything, other than the touch screen, that we haven't already
:done.
:
:You are now *starting* to understand the level of complexity of CE
:embedded devices.

    How condescending you are.  Just remember, you started this fracas.
    I can't believe it, you actually think you know more about embedded
    design than I do!  What a laugh.

    I don't know a thing about you, and you clearly don't know a thing about
    me.  Here's a hint:  When you don't know you shouldn't assume.

:Actually, Matt, it's you, by trying to solve a complex embedded systems
:problem as if it were a 'degenerate' large scale systems problem, who
:are "being silly."  You keep handing me crowbars when I need a scalpel.

    Oooh. complex.... biiig word.  What bullshit.  You think these problems
    are complex?  Embedded systems these days are nearly complete
    single-chip microcomputers running hacked up but nearly complete
    operating systems containing 95% off-the-shelf software, much of it
    open source, and much of it provided to the developer on a shiny platter,
    with a fully operational SDK and HDK and FPGA logic around the core cpu.
    All in one chip.  These days 'embedded' means you are sporting a
    completely functional linux operating system in a two chip solution
    with virtually no external parts required beyond those needed for the
    connectors.  And it's all now written in C or C++ or whatever the hell
    language you want to write it in.

    It's crazy easy to do embedded development work these days.  No more
    difficult than writing software on a full blown PC.

    I'm sorry, but if that is your idea of complex then it's roughly
    equivalent to my idea of ridiculously easy.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 23:26:11 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 335B8106566B;
	Tue,  1 Apr 2008 23:26:11 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id E53E38FC1C;
	Tue,  1 Apr 2008 23:26:10 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31NPxYl042552;
	Tue, 1 Apr 2008 16:25:59 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31NPwM1042551;
	Tue, 1 Apr 2008 16:25:58 -0700 (PDT)
Date: Tue, 1 Apr 2008 16:25:58 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804012325.m31NPwM1042551@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
	<200804011733.m31HXF6e039649@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>
	<200804012014.m31KEvTJ041049@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D1E@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 23:26:11 -0000

:Although Apple is getting much hype about the sophistication of the
:iPhone, we've been shipping convergent devices of that complexity for
:some time now. Apple have better industrial design, but they're not
:doing anything, other than the touch screen, that we haven't already
:done.
:
:You are now *starting* to understand the level of complexity of CE
:embedded devices.

    How condescending you are.  Just remember, you started this frackas.
    I can't believe it, you actually think you know more about embedded 
    design then I do!  What a laugh.

    I don't know a thing about you, and you clearly don't know a thing about
    me.  Here's a hint:  When you don't know you shouldn't assume.

:Actually, Matt, it's you, by trying to solve a complex embedded systems
:problem as if it were a 'degenerate' large scale systems problem, who
:are "being silly."  You keep handing me crowbars when I need a scalpel.

    Oooh. complex.... biiig word.  What bullshit.  You think these problems
    are complex?  Embedded systems these days are nearly complete
    single-chip microcomputers running hacked up but nearly complete
    operating systems containing 95% off-the-shelf software, much of it
    open source, and much of it provided to the developer on a shiny platter,
    with a fully operational SDK and HDK and FPGA logic around the core cpu.
    All in one chip.  These days 'embedded' means you are sporting a
    completely functional linux operating system in a two chip solution
    with virtually no external parts required beyond those needed for the
    connectors.  And it's all now written in C or C++ or whatever the hell
    language you want to write it in.

    It's crazy easy to do embedded development work these days.  No more
    difficult than writing software on a full blown PC.

    I'm sorry, but if that is your idea of complex then it's roughly
    equivalent to my idea of ridiculously easy.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 00:04:03 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DC0101065672
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 00:04:03 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id BFEEA8FC1F
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 00:04:03 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 41B01404C87;
	Tue,  1 Apr 2008 17:03:53 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Date: Tue, 1 Apr 2008 17:04:01 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D21@EXCHANGE.danger.com>
In-Reply-To: <200804012226.m31MQ42O042173@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciUR2Bdnxq/kqqoSkSjUVdQHavnoAAAaoGw
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
	<200804011748.m31HmE1h039800@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D19@EXCHANGE.danger.com>
	<200804012010.m31KAMpu041011@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D1D@EXCHANGE.danger.com>
	<200804012226.m31MQ42O042173@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 00:04:04 -0000


> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Tuesday, April 01, 2008 3:26 PM
> To: Martin Fouts
> Cc: freebsd-arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics
>
>
> What complete bullshit.  If you want to argue technical
> merits, be my guest.  So far you haven't made one single
> technical point in any of your postings.

My, my. Mr Dillon likes to be rude to people and tell them they are
'stupid' and 'silly', but when he makes naïve comments about systems
he doesn't understand and gets called on it, suddenly it's "complete
bullshit."

I can see why PHK broke off trying to educate you.

> You've posted about your experience with NAND flash in embedded systems,
> very clearly with SMALL flash devices

Actually, you're jumping to conclusions, again, Matt.  I mentioned what
size devices we used 3 years ago. I haven't spoken at all about the size
of devices I have experience with.

> and simple filesystems, and that's fine, it's similar to my
> flash filesystem

Actually, I mentioned some file systems which are not 'simple' by any
reasonable metric.  You're the one who keeps trying to impose 'simple'.

>     experience (which, yes, was primarily on NOR devices but, no, that
>     doesn't magically make you an expert on NAND and me an
>     idiot about it).

You're not an 'idiot' about NAND; your knowledge is merely limited to
reading specs, and as a consequence you're extrapolating beyond that
knowledge when you try to apply your theory to NAND, and experience has
shown that your extrapolations don't hold up.


> Considering I've pretty much spent my entire life working with hardware
> that is about as ridiculous an assertion as you could make, but clearly
> you believe it.

You need to stop being defensive in technical discussions; stop imposing
your presumptions on other people's problems; and stop thinking that
anyone cares enough about you to make any assertions about your
background.

I have *not* made any assertions, other than that you've made comments
about NAND which betray your lack of experience with it.  You don't have
the experience, and your comments about 'trivial' problems and
'nonexistent' problems clearly show that.  You don't need to take my
word for it. You merely have to check on the state of the art in NAND
file systems for CE products.

Oh, you should also stop putting words in my mouth.  You're wrong again.
I've never thought you were an idiot and I don't think so now.  You're
rude, arrogant, judgmental, and sure of your own skills beyond
your actual ability, but you're no idiot.

>     But then you generalized to the entire market and that's not fine.

I've not made any such generalizations, Matt. You're projecting again.
*You* are the one who made the generalization that all embedded problems
were trivial.

The only thing I speak with authority about in this discussion is
convergent CE devices, and then I speak only of the ones I've worked on
and what experience with them has been.

>
> Real filesystems are far more sophisticated than what you
> will ever see in the embedded flash product,

Now there's a hasty generalization that betrays your attitude problem.
"real" file systems?  First NAND filesystems are 'trivial'.  Then it's
'degenerate'. Now it's not 'real.'

You'll never understand a problem that you dismiss without
investigating.

> My interest is squarely with real filesystems targeted
> to mass storage, these days.

Yes. I pointed that out. I also pointed out that as a consequence you're
trying to apply approaches that don't work in CE devices.

>
> I didn't start out smearing people, but if you are going to start
> acting like an asshole then I have no problem ratcheting it up
> to your level.

Dillon, after all these years, I would have thought you'd gotten past
that blind spot.  You don't call people 'silly' and 'stupid' and the
work they're doing "trivial" and "degenerate" *unless* you're acting
like an asshole. As long as I've known you, you've liked starting
pissing contests and then blaming the other party.  PHK was wise to have
begged off when you started down that path, but I had some time on my
hands and thought others would benefit from a technical discussion. If
you want a pissing match, I suggest alt.flames, where I'm sure they'll
happily accommodate you.

>     Since you don't understand my position, let me lay it out for you
>     in simple terms:
>
>     * There's no point trying to adapt a flash-unaware filesystem to
>       become flash-aware.  It is a complete waste of time.

"waste of time" is a value judgment that you don't have the background
to make for anyone but yourself. The marketplace, which supports at
least two such filesystems, disagrees with your judgment.

> You might as well write a new filesystem.
> If you want to use a flash-unaware
> filesystem you use a translation layer, eat any
> performance issues, and be done with it.

Congratulations. Welcome to FATFS on usb sticks.

>     * Just because flash-unaware filesystems HAVE TO use a translation layer
>       doesn't mean that a translation layer is bad for a flash-aware
>       filesystem.

That is correct.  The FTL approach is suitable for certain types of
flash file systems, as I pointed out some number of emails back.  It is
not suitable for all.


>     * A named-block translation layer can be an extremely valuable abstraction
>       for use in filesystem designs which directly integrate its features
>       (that is, the filesystem NAMES the block instead of ALLOCATES the
>       block).

'can be' makes for a pretty weak precondition, so sure, it 'can be'.

>
>       There is absolutely NOTHING inherently bad about the model from a
>       performance point of view, particularly if your storage media requires
>       relocation (as NAND does).

Either 'relocation' doesn't mean what you think it means, or NAND
doesn't require it.

>       The key point is that a named-block layer
>       takes over the functionality of all the indirect pointers that would
>       normally have to be manipulated by higher layers in the filesystem.

Yes. This is what the FTL people do, except the granularity of their
named-block is the write unit. It has performance issues.


>       If you can integrate that into the physical storage requirements then
>       you kill two birds with one stone and get major performance benefits
>       from doing so.
>

That's a big if. It has in practice turned out to be unattainable.  I
await your demonstration to the contrary.


>     You are welcome to debate the points, but you'll get burned if you try
>     to take some sort of moral highground stand based on a few piddly flash
>     filesystems written over the course of a few years.  Coding at that
>     level is fun and interesting but ultimately not very difficult.

'burned'. 'moral highground'. 'piddly'. 'not very difficult'.  That's a
hell of a blindspot to your own behavior that you've got there, Matt.

> Right now my work is with HAMMER.  It's fun to theorize
> how I could make HAMMER into a flash-aware filesystem but
> I have no intention of actually doing so any time soon, or ever.
>

I didn't think so.

> Frankly, if I wanted to write a ground-up flash
> filesystem I could, it would not be difficult...

Of course not. People write file systems in undergraduate OS classes.

> But I have no desire to do that at
> this juncture and the lack of desire certainly does not
> invalidate my comments on the matter.

What 'invalidates' your comments is that others have tried what you've
outlined, in the way that you've outlined it, and it has failed.  That,
coupled with your naïve claims about embedded systems not being
complex and your several mistaken claims about where the problems are or
aren't in such systems, simply highlights that, as you say, you're having
fun speculating, but, as I say, your speculation would take you down
trodden paths to well known conclusions.

> NAND is different from NOR but the differences can be
> explained pretty much in two paragraphs and most of the same concepts
> apply.

The interesting aspects lie in the differences.

> It isn't rocket science.

I've done rocket science for a living. It's not that hard, and I've
always found that statement silly.

>
> I am a very technical person.  If you are going to argue
> merit, then you damn well better say WHY something doesn't
> work, in detail, instead of simply stating that some
> random other entity couldn't make it work at some point in
> the past so therefore it is bad.

You're a 'very technical person' with a very judgmental attitude and a
tendency to use emotionally loaded language that you later disclaim.
But no, I don't have to say WHY it doesn't work in detail, provided
someone else has already said so. I merely have to point out the
existence of the refutation.


> If you do not know the WHY, precisely, then good $#%$#%$#%
> luck designing anything that's actually sophisticated.
>

"sophisticated", which I suppose is a synonym for "complex", is an
interesting metric for a "very technical person" to apply.

But actually, it's pretty easy to design sophisticated systems when you
don't understand the underlying issues. In practice it's more common to
make systems more sophisticated in the face of uncertainty, not less.


From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 00:36:41 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CF6161065678;
	Wed,  2 Apr 2008 00:36:41 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id B1D568FC1B;
	Wed,  2 Apr 2008 00:36:41 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 0BFF4402FE5;
	Tue,  1 Apr 2008 17:36:32 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Tue, 1 Apr 2008 17:36:40 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D22@EXCHANGE.danger.com>
In-Reply-To: <200804012325.m31NPwM1042551@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciUT8TcTf9Q5HzrRyu3uwNh8yZ7yAABWEug
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
	<200804011733.m31HXF6e039649@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>
	<200804012014.m31KEvTJ041049@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D1E@EXCHANGE.danger.com>
	<200804012325.m31NPwM1042551@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 00:36:42 -0000


> I can't believe it, you actually think you know more
> about embedded design than I do!  What a laugh.
>
> I don't know a thing about you, and you clearly don't
> know a thing about me.  Here's a hint:  When you don't
> know you shouldn't assume.

So what part of "you think you know" is *not* an assumption?

> You think these problems are complex?

Yes. I do it. That's what makes them fun.

> Embedded systems these days are nearly complete
> single-chip microcomputers running hacked up but nearly complete
> operating systems containing 95% off-the-shelf software,
> much of it open source, and much of it provided to the developer on
> a shiny platter, with a fully operational SDK and HDK and FPGA logic
> around the core cpu.

It amazes me that you can claim to be so knowledgeable about embedded
systems and then give such a glaringly wrong description of the ones I
work on. Our current shipping product has *no* off-the-shelf software,
beyond a few small libraries for image encoding, out of several million
lines of code.  There's no 'fully operational SDK', beyond a gcc
crosscompiler that we've debugged ourselves. The SOC has no FPGA.

> All in one chip.  These days 'embedded' means you are sporting a
> completely functional linux operating system in a two
> chip solution

It's not a single chip or even two chips. It doesn't run linux.  Keep
guessing wrong, Matt.

> with virtually no external parts required beyond those
> needed for the connectors.

There are a lot more parts than connectors in the BOM.  Wrong again.

> And it's all now written in C or C++ or
> whatever the hell language you want to write it in.

Well, "whatever the hell language" gets you off on a technicality there,
Matt.

> It's crazy easy to do embedded development work these
> days.  No more difficult than writing software on a full blown PC.

There is a class of such development. Pity it's not the class I'm
working in.

>     I'm sorry, but if that is your idea of complex then it's roughly
>     equivalent to my idea of ridiculously easy.

No, Matt, it's not my idea of complex.

I see that you're more in need of your advice about not assuming than I
am.


From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 00:47:58 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1543B106566B
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 00:47:58 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id E78628FC24
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 00:47:57 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m320luwe043381;
	Tue, 1 Apr 2008 17:47:56 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m320lun8043380;
	Tue, 1 Apr 2008 17:47:56 -0700 (PDT)
Date: Tue, 1 Apr 2008 17:47:56 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804020047.m320lun8043380@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
	<200804011748.m31HmE1h039800@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D19@EXCHANGE.danger.com>
	<200804012010.m31KAMpu041011@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D1D@EXCHANGE.danger.com>
	<200804012226.m31MQ42O042173@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D21@EXCHANGE.danger.com>
Cc: freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 00:47:58 -0000


:My, my. Mr Dillon likes to be rude to people and tell them they are
:'stupid' and 'silly', but when he makes naïve comments about systems
:he doesn't understand and gets called on it, suddenly it's "complete
:bullshit."
:
:I can see why PHK broke off trying to educate you.

    I really have no love for people who are so disrespectful to their
    peers.  A few of the people unfortunately associated with a project I
    had an interest in fit that category, some more than others. 
    Not too many, only two (well, three if you count yourself).  On the
    bright side my list is very limited.   

    I do not believe that you are any more qualified than you think I am.
    Clearly it is an issue for you and just as clearly you are unwilling
    to engage in any sort of technical conversation about the matter.  I
    really have no idea why.  If you decide you want to have a technical
    conversation, where you actually post meaningful information useful not
    only to me but to everyone reading this thread, instead of vague, broad,
    uninteresting references, then please go ahead and do so.  If you
    think those vague bits of information you post, condescending and
    secretive as if they were something so secret and special nobody needs
    to know the details... if you think those actually contribute to the
    conversation, then you are deluded.

    If it is important to you, then perhaps you should consider that the
    characteristics of NAND flash are only a small part of the equation.
    The characteristics are not this mystical scary beast that nobody
    understands, they are very well defined and fairly limited in scope,
    and thus can be discussed, theorized, implemented, and tested.  None
    of these processes are absolute.  Hell, filesystem design is just as
    important and I dare say that the only person on this list with more
    experience than I have in filesystem design is Kirk.

    I'm a technical theorist, a dreamer, and an implementer.  Theory
    always comes before function, always.  I don't know what your problem
    is and I really don't care, but it absolutely does not and never has
    required direct experience to have a technical conversation.  If that
    were true nobody would ever invent anything, try anything, or make
    any progress.

    So, yes, there is a great deal of value to having a technical
    conversation that mixes theory and actual direct experience.  Very
    few people have the breadth of direct experience required to be able
    to comment definitively on something.  Not a single person on this
    list, not myself, not you, not Poul... nobody has anywhere near the
    level of experience required to come to any sort of conclusion with
    regards to the material we are discussing.  All we can do is experiment,
    theorize, and have a technical conversation about the merits of one
    thing or another.

    So, again, if you have something to contribute to our technical
    conversation, perhaps some direct experience you've had trying to
    actually implement one of these 'failed' schemes???, I'm all ears.
    If not, then I recommend you stop posting.

						-Matt


From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 01:03:31 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 08EED106564A;
	Wed,  2 Apr 2008 01:03:31 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id B98BA8FC21;
	Wed,  2 Apr 2008 01:03:30 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m3213Jhl043507;
	Tue, 1 Apr 2008 18:03:19 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m3213JEt043506;
	Tue, 1 Apr 2008 18:03:19 -0700 (PDT)
Date: Tue, 1 Apr 2008 18:03:19 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200804020103.m3213JEt043506@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
	<200804011733.m31HXF6e039649@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>
	<200804012014.m31KEvTJ041049@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D1E@EXCHANGE.danger.com>
	<200804012325.m31NPwM1042551@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D22@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 01:03:31 -0000


:> chip solution
:
:It's not a single chip or even two chips. It doesn't run linux.  Keep
:guessing wrong, Matt.

    I'm not guessing at all.  I don't really give a damn about your embedded
    project, or your constant innuendos about what it does or does not do.

    If you decide you want to talk about it, that's up to you.  Personally
    speaking, I love talking about the projects I've done.  I love talking
    about the cool technical details and the hard problems that had to be
    solved.

    I'm talking about the embedded world in general and how it functions
    these days.  What made you think I was talking about YOUR particular
    project?  I have no information... getting anything from you is like
    pulling teeth, you are wholly unwilling to part with a single meaningful
    detail and yet you expect to have a technical conversation by referencing
    it?  Give me a break.

    Again, if you want to have an actual conversation, then the ball is in
    your court.  You clearly believe that I am not qualified to have that
    conversation... well, put your money where your mouth is then.  If
    you think my reasoning is so bad, then say something meaningful that
    directly addresses it, in technical terms.  Hell, you can even quote
    papers rather than produce your own thoughts if you think it is relevant.

    The devil is in the details.  That's what technical conversations are
    for.  If I went by your logic I would have never written Diablo,
    or dmail, or a database, or numerous filesystems, or HAMMER, or gotten
    involved with OSs (people kept saying they were harder than micro OSs.
    Oops, I guess they weren't after all!).  Sheesh.

						-Matt

From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 02:09:09 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3F512106568B
	for <arch@freebsd.org>; Wed,  2 Apr 2008 02:09:09 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from outF.internet-mail-service.net (outf.internet-mail-service.net
	[216.240.47.229])
	by mx1.freebsd.org (Postfix) with ESMTP id 208358FC24
	for <arch@freebsd.org>; Wed,  2 Apr 2008 02:09:09 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160)
	by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP;
	Tue, 01 Apr 2008 19:10:06 -0700
Received: from julian-mac.elischer.org (localhost [127.0.0.1])
	by idiom.com (Postfix) with ESMTP id 6E55E2D600F;
	Tue,  1 Apr 2008 19:09:06 -0700 (PDT)
Message-ID: <47F2EAC4.1050206@elischer.org>
Date: Tue, 01 Apr 2008 19:09:08 -0700
From: Julian Elischer <julian@elischer.org>
User-Agent: Thunderbird 2.0.0.12 (Macintosh/20080213)
MIME-Version: 1.0
To: Martin Fouts <mfouts@danger.com>
References: <20080330231544.A96475@localhost>	<200803310135.m2V1ZpiN018354@apollo.backplane.com>	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>	<200803312125.29325.qpadla@gmail.com>	<200803311915.m2VJFSoR027593@apollo.backplane.com>	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>	<200803312006.m2VK6Aom028133@apollo.backplane.com>	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>	<200803312254.m2VMsPqZ029549@apollo.backplane.com>	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>	<200804011733.m31HXF6e039649@apollo.backplane.com>	<B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>	<200804012014.m31KEvTJ041049@apollo.backplane.com>	<B95CEC1093787C4DB3655EF330984818051D1E@EXCHANGE.danger.com>	<200804012325.m31NPwM1042551@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D22@EXCHANGE.danger.com>
In-Reply-To: <B95CEC1093787C4DB3655EF330984818051D22@EXCHANGE.danger.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 02:09:09 -0000

DING DING DING!

Will the contestants please go to their respective corners and calm down.
Both of you are viewing what the other has said in light of your own 
current viewpoints instead of theirs, and it's not reflecting well on 
either of you.

Can we call this to an end?  Maybe you two can discuss it some time 
over a beer with a whiteboard.  It was fun and interesting at the 
start, but it's gone too far.

STOPPIT!! .... NOT ONE more post.... leave it as it is..
(gee being a parent does have its uses...)


From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 03:05:25 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EC1A81065672;
	Wed,  2 Apr 2008 03:05:25 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id CEA078FC16;
	Wed,  2 Apr 2008 03:05:25 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id 40AD9403C21;
	Tue,  1 Apr 2008 20:05:16 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Tue, 1 Apr 2008 20:05:25 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D26@EXCHANGE.danger.com>
In-Reply-To: <200804020103.m3213JEt043506@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciUXV2iGs/PJPG5TaipOq5gR1lMmAAC4MkA
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312006.m2VK6Aom028133@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0A@EXCHANGE.danger.com>
	<200803312254.m2VMsPqZ029549@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D0D@EXCHANGE.danger.com>
	<200804011733.m31HXF6e039649@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D17@EXCHANGE.danger.com>
	<200804012014.m31KEvTJ041049@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D1E@EXCHANGE.danger.com>
	<200804012325.m31NPwM1042551@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D22@EXCHANGE.danger.com>
	<200804020103.m3213JEt043506@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: <freebsd-arch@freebsd.org>,
	<arch@freebsd.org>
Cc: 
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 03:05:26 -0000


To summarize, so that it's all in one place:

1) NAND flash is sufficiently different from both NOR flash and
rotational media that filesystem design optimizations aimed at either
NOR or rotational media tend to be inefficient on NAND, and NAND offers
opportunities for optimizations not present on either. It also presents
challenges that don't exist for NOR or rotational media.  In particular,
seek and rotational latency are not present, but bit error rate is high,
the size of the erase unit is large compared to the size of the write
unit, and the presence of extra storage in the spare area makes
optimizations possible that are not available in the other media, with
the caveat that small page NAND devices cannot take advantage of the
same degree of optimization as large page NAND devices.

2) It is *possible* to use a flash translation layer to hide the
complexity of flash from a filesystem implementation, and commercial
file systems exist which do this, most notably the FATFS implementation
used on most NAND based USB devices, on the M-Systems parts, and
commercially from Datalight.

3) It is not possible on consumer electronics "convergent" devices to
take advantage of the usual techniques available for performance
improvement through caching that are available on systems with relatively
large amounts of NAND. A CE device with an included NAND part does not
optimize in the same way as an SSD using NAND parts.

4) Power management on battery powered devices makes for different
optimization trade-offs than on wall-powered devices. Most notably, it
is often desirable to turn off power to RAM when the system is inactive,
which has a design impact on robustness and performance.

5) The reduction in BOM and the increase in performance due to
customized filesystem design has proven the usefulness of NAND-aware
filesystems, at least in the commercial marketplace.

6) There are good reasons for exposing transactional semantics to the
users of NAND file systems, having to do with robustness.

7) These are the well known approaches, with different strengths and
weaknesses, to NAND-aware file systems:
   A) File system completely unaware of NAND, FTL takes care of the
differences. This is used in USB devices, and has the advantage of being
able to support those devices as if they were FATFS devices without
changes to the host filesystem software. It has the disadvantage of
performance and robustness penalties due to the filesystem making
excessive writes to what it believes are fixed location datablocks.
   B) File systems aware of NAND, with an FTL. Datalight's RelianceFS
and FFX products combine to provide this sort of approach. The advantage
is that they tend to be much more robust than systems without the
knowledge and even have higher performance. The disadvantage is the
complexity of the translation layer, and the interfaces between it and
the filesystem layer and the device layer.
   C) File systems that manage the NAND directly without an FTL. These
fall into two camps:
      i) filesystems that treat NAND like NOR using a flash adaptation
layer. JFFS and JFFS2, combined with MTD are the canonical examples.
      ii) filesystems that optimize for NAND properties. YAFFS2 direct
is the canonical example.

Because NAND provides no guaranteed good block, the performance issues
with it come down to sensitivity to the scan time needed to find state.

JFFS2 failed in this area because of the nature of its embedded b-tree
data structures, which are expensive to maintain robustly, difficult to
garbage collect, and prone to needing frequent scanning and rewriting.
It is conjectured that any filesystem which embeds a block renaming
scheme into NAND will suffer the same fate. I for one would be
interested in seeing a refutation of that conjecture, but there are now
four different projects which have attempted to do so with no luck that
I'm aware of. The issue is one of locality in the b-tree versus
robustness. Sufficiently frequent updates of the structure to NAND to
meet robustness requirements tend to put a great deal of write pressure
on the device, as well as frequent garbage collection.

At PalmSource, Mike Chen and I took the NetBSD version of LFS and
modified it sufficiently to produce a working log-structured file system
that was used in the unshipped PalmOS Cobalt product. The conversion was
relatively easy, taking somewhat less than 1.5 man years, and the
resulting filesystem benchmarked favorably against other commercial
products, but never saw field trial, so robustness is indeterminate. A
key to the modification was reducing the amount of state that had to be
read during mount scan to a single block per erase unit and being very
careful about block selection for garbage collection.

Charles Manning had already taken that approach one step further, in
yaffs2, when he was able to reduce the amount of information needing
scanning to a single spare area per erase unit, greatly reducing the
mount scan time.
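
As a rough illustration of the mount-scan idea described above, here is
a minimal sketch in C. It is not the actual YAFFS2 or modified-LFS
on-flash format; the struct layout, the field names (seq, bad), and the
function mount_scan are all assumptions made up for this example. It
only shows the shape of a scan that touches one spare-area record per
erase unit instead of reading the data pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEQ_ERASED 0xFFFFFFFFu  /* erased NAND reads back as all ones */

/* Hypothetical summary record kept in one spare area per erase unit. */
struct spare_tag {
    uint32_t seq;  /* allocation sequence number, SEQ_ERASED if unused */
    uint8_t  bad;  /* nonzero if the erase unit is marked bad */
};

/* What the scan learns about the device (illustrative only). */
struct mount_state {
    int      newest_unit;  /* unit with the highest sequence, -1 if none */
    uint32_t newest_seq;
    int      free_units;
    int      bad_units;
    size_t   spare_reads;  /* number of spare areas read during the scan */
};

/*
 * Scan the device by reading exactly one spare-area record per erase
 * unit.  Bad units are skipped, erased units are counted as free, and
 * the unit with the highest sequence number is taken as the log head.
 */
void mount_scan(const struct spare_tag *tags, size_t n,
                struct mount_state *ms)
{
    assert(tags != NULL && ms != NULL);

    ms->newest_unit = -1;
    ms->newest_seq = 0;
    ms->free_units = 0;
    ms->bad_units = 0;
    ms->spare_reads = 0;

    for (size_t i = 0; i < n; i++) {
        ms->spare_reads++;
        if (tags[i].bad) {
            ms->bad_units++;
            continue;
        }
        if (tags[i].seq == SEQ_ERASED) {
            ms->free_units++;
            continue;
        }
        if (ms->newest_unit < 0 || tags[i].seq > ms->newest_seq) {
            ms->newest_unit = (int)i;
            ms->newest_seq = tags[i].seq;
        }
    }
}
```

In this sketch the cost of the scan grows with the number of erase
units rather than with the number of pages, which is the property the
paragraph above attributes to reducing the scanned state.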

Both the modified LFS and YAFFS2 take advantage of other properties of
the NAND to reduce metadata write frequency and both relax timestamp
semantics to do so. YAFFS2 goes further than we did by providing a
checkpoint facility which is used to further speed mount time and
reconstruction. Both take advantage of spare area writing to determine
write transaction completion.


From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 09:10:47 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id ABD0C106566B
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 09:10:47 +0000 (UTC)
	(envelope-from avg@icyb.net.ua)
Received: from hosted.kievnet.com (hosted.kievnet.com [193.138.144.10])
	by mx1.freebsd.org (Postfix) with ESMTP id 694368FC1A
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 09:10:47 +0000 (UTC)
	(envelope-from avg@icyb.net.ua)
Received: from localhost ([127.0.0.1] helo=edge.pp.kiev.ua)
	by hosted.kievnet.com with esmtpa (Exim 4.62)
	(envelope-from <avg@icyb.net.ua>) id 1JgybC-0004fk-41
	for freebsd-arch@freebsd.org; Wed, 02 Apr 2008 11:45:38 +0300
Message-ID: <47F347B1.2020509@icyb.net.ua>
Date: Wed, 02 Apr 2008 11:45:37 +0300
From: Andriy Gapon <avg@icyb.net.ua>
User-Agent: Thunderbird 2.0.0.12 (X11/20080320)
MIME-Version: 1.0
To: freebsd-arch@freebsd.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: kobj method signature/prototype checking/enforcement
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 09:10:47 -0000


As you are most probably aware, there is currently no
checking/enforcement of the signatures of functions implementing kobj
methods. Internally all function pointers are stored as pointers to 'int
f(void)', and they are cast back and forth as needed.
So, for example, if you set a function 'char * g(void **)' as the
device_probe method, the compiler will compile everything just fine;
only at run-time will you get into trouble because of the
mismatching arguments.

I propose to defend against this problem using the following macro for
KOBJMETHOD:
#define KOBJMETHOD(NAME, FUNC) \
{ &NAME##_desc, (kobjop_t) (FUNC != (NAME##_t *)NULL ? FUNC : NULL) }

Here is the idea behind it:
1. the comparison is a NOP: the result of the whole expression is always
the same as (kobjop_t)FUNC
2. the expression is evaluated at compile time, so it doesn't create any
run-time overhead or binary differences
3. the purpose of the expression is to make use of the GCC feature that
warns about comparing "distinct pointer types"
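
The idea above can be sketched as a small stand-alone C file. This is only
an illustration under stated assumptions: the struct layouts, the
demo_probe method, its demo_probe_t typedef and the good_probe/bad_probe
functions are hypothetical stand-ins, not the real kobj definitions. Only
the KOBJMETHOD macro body is taken from the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Generic method pointer: every method is stored as 'int f(void)'. */
typedef int (*kobjop_t)(void);

/* Hypothetical, simplified stand-ins for the kobj structures. */
struct kobjop_desc {
	const char *name;
};

struct kobj_method {
	struct kobjop_desc *desc;
	kobjop_t func;
};

/* Hypothetical method "demo_probe": its expected signature and descriptor. */
typedef int demo_probe_t(void *dev);
static struct kobjop_desc demo_probe_desc = { "demo_probe" };

/* The macro as proposed: the comparison exists only for its type check. */
#define KOBJMETHOD(NAME, FUNC) \
	{ &NAME##_desc, (kobjop_t) (FUNC != (NAME##_t *)NULL ? FUNC : NULL) }

/* Matches demo_probe_t, so the comparison is between identical types. */
static int
good_probe(void *dev)
{
	(void)dev;
	return (0);
}

/*
 * A mismatching function such as
 *	static char *bad_probe(void **p);
 * used as KOBJMETHOD(demo_probe, bad_probe) would compare
 * 'char *(*)(void **)' with 'int (*)(void *)', which is where GCC is
 * expected to warn about distinct pointer types.
 */

/* Build one method entry and call it back through the expected type. */
static int
call_demo_probe(void)
{
	struct kobj_method m = KOBJMETHOD(demo_probe, good_probe);

	assert(m.desc == &demo_probe_desc);
	return (((demo_probe_t *)m.func)(NULL));
}
```

The entry is built as a local variable here to keep the sketch simple. In
the kernel the macro is used inside static method tables, which relies on
the compiler folding the comparison at compile time as described in point 2.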

I tested this change with 6.3-RELEASE sources. It revealed a number of
signature mismatches in different places. Obviously all of them are
quite harmless; otherwise they would already have been discovered the
hard way (by the people bitten).

Here's a general overview of issues discovered:
1. integer parameters differing in signedness (totally harmless, I think)
2. using void return type instead of int, usually for device_shutdown
method (not sure about this one)
3. using int return type instead of specific size integer return type,
typically for sound channel interface methods
4. 'char *' parameter instead of 'const char *' parameter (potentially
can result in future problems)
5. significantly different signatures for several "dummy" methods that
do not actually use any of the parameters and simply print a message or
panic.

While the above issues are quite harmless, I still think that adding
such checking code is a good thing. It will help with new code
development and with general code quality and maintenance.

Unfortunately I don't have my FreeBSD development environment quite set
up (yet) for large-scale development, so at this point I cannot provide
a patch for HEAD that would fix all the build breakages (on all the
platforms) that the proposed change would cause (when -Werror is
in effect).

-- 
Andriy Gapon

From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 18:17:46 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 371D71065786
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 18:17:46 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.freebsd.org (Postfix) with ESMTP id 22EF78FC24
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 18:17:46 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from zion.baldwin.cx (66-23-211-162.clients.speedfactory.net
	[66.23.211.162]) by elvis.mu.org (Postfix) with ESMTP id CFE8F1A4D80;
	Wed,  2 Apr 2008 11:17:45 -0700 (PDT)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org
Date: Wed, 2 Apr 2008 13:09:54 -0400
User-Agent: KMail/1.9.7
References: <10004.1205307334@critter.freebsd.dk>
	<20080312152744.I29518@fledge.watson.org>
	<20080328202602.N72156@desktop>
In-Reply-To: <20080328202602.N72156@desktop>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200804021309.54956.jhb@freebsd.org>
Cc: 
Subject: Re: timeout/callout small step forward
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 18:17:46 -0000

On Saturday 29 March 2008 03:04:17 am Jeff Roberson wrote:
> http://people.freebsd.org/~jeff/callout.diff
>
> This patch takes the current callout implementation and makes it per-cpu.
> It also hides callout details from the rest of the kernel by making the
> callwheel structure private to kern_timeout.c among other things.

Looks good.  The kern_intr.c diff has a small bug (forgot to remove the return 
(intr_event_create(...)) from swi_add()).  A few style suggestions would be 
to always leave a blank line before a comment (I think I saw this in 
kern_calloutwheel_init()?) and usually there isn't a blank line before a 
SYSINIT().  Maybe make the panic messages when creating softclock threads 
more specific, but that's very minor.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 19:11:49 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4CBF11065670
	for <freebsd-arch@FreeBSD.org>; Wed,  2 Apr 2008 19:11:49 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 08DE28FC15
	for <freebsd-arch@FreeBSD.org>; Wed,  2 Apr 2008 19:11:48 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.2/8.14.1) with ESMTP id m32J9TZT015462;
	Wed, 2 Apr 2008 13:09:29 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Wed, 02 Apr 2008 13:10:19 -0600 (MDT)
Message-Id: <20080402.131019.-705186138.imp@bsdimp.com>
To: dillon@apollo.backplane.com
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <200804012226.m31MQ42O042173@apollo.backplane.com>
References: <200804012010.m31KAMpu041011@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D1D@EXCHANGE.danger.com>
	<200804012226.m31MQ42O042173@apollo.backplane.com>
X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@FreeBSD.org, mfouts@danger.com
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 19:11:49 -0000

In message: <200804012226.m31MQ42O042173@apollo.backplane.com>
            Matthew Dillon <dillon@apollo.backplane.com> writes:
: 
: :>     64MB is tiny.  None of the problems with any of the
: :> approaches we've discussed even exist with devices that small in an
: :> embedded system.
: :
: :It is fairly clear that you're not familiar with NAND devices on
: :embedded systems, as you've just said that well known problems do not
: :exist.
: :
: :> To be clear, because I really don't understand how you
: :> can possibly argue that the named-block storage layer is bad in a
: :> device that small...
: :
: :Yes, your lack of understanding is very apparent.
: 
:     What complete bullshit.  If you want to argue technical merits, be
:     my guest.  So far you haven't made one single technical point in
:     any of your postings.  You've posted about your experience with NAND


AHEM!  Matt, you will keep a civil tongue, or you will be asked to
leave the list.  This goes for everybody else too.

Warner

From owner-freebsd-arch@FreeBSD.ORG  Wed Apr  2 19:21:02 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3947E1065674
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 19:21:02 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id C9EC38FC38
	for <freebsd-arch@freebsd.org>; Wed,  2 Apr 2008 19:21:01 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.2/8.14.1) with ESMTP id m32JGiYO015549;
	Wed, 2 Apr 2008 13:16:45 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Wed, 02 Apr 2008 13:17:34 -0600 (MDT)
Message-Id: <20080402.131734.255331081.imp@bsdimp.com>
To: avg@icyb.net.ua
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <47F347B1.2020509@icyb.net.ua>
References: <47F347B1.2020509@icyb.net.ua>
X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@freebsd.org
Subject: Re: kobj method signature/prototype checking/enforcement
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Apr 2008 19:21:02 -0000

In message: <47F347B1.2020509@icyb.net.ua>
            Andriy Gapon <avg@icyb.net.ua> writes:
: I propose to defend against this problem using the following macro for
: KOBJMETHOD:
: #define KOBJMETHOD(NAME, FUNC) \
: { &NAME##_desc, (kobjop_t) (FUNC != (NAME##_t *)NULL ? FUNC : NULL) }
...
: Here's a general overview of issues discovered:
: 1. integer parameters differing in signedness (totally harmless, I think)
: 2. using void return type instead of int, usually for device_shutdown
: method (not sure about this one)
: 3. using int return type instead of specific size integer return type,
: typically for sound channel interface methods
: 4. 'char *' parameter instead of 'const char *' parameter (potentially
: can result in future problems)
: 5. significantly different signatures for several "dummy" methods that
: do not actually use any of the parameters and simply print a message or
: panic.
: 
: While the above issues are quite harmless, I still think that adding
: such a checking code is a good thing. It will help with new code
: development and it will help general code quality and maintenance.
: 
: Unfortunately I don't have my FreeBSD development environment quite set
: up (yet) for large scale development, so at this point I can not provide
: a patch for HEAD that would fix all the build breakages (on all the
: platforms) that would be caused by the proposed change (when -Werror is
: in effect).

Yes!  I think I like this approach, and would like to see it fleshed
out more.

Warner

From owner-freebsd-arch@FreeBSD.ORG  Thu Apr  3 05:51:13 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2A2B3106566C;
	Thu,  3 Apr 2008 05:51:13 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com
	[216.240.101.25])
	by mx1.freebsd.org (Postfix) with ESMTP id E64E88FC2A;
	Thu,  3 Apr 2008 05:51:12 +0000 (UTC)
	(envelope-from jroberson@chesapeake.net)
Received: from [10.0.1.199] (cpe-24-94-72-120.hawaii.res.rr.com [24.94.72.120])
	(authenticated bits=0)
	by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id
	m335p6fA098719; Thu, 3 Apr 2008 01:51:09 -0400 (EDT)
	(envelope-from jroberson@chesapeake.net)
Date: Wed, 2 Apr 2008 19:51:32 -1000 (HST)
From: Jeff Roberson <jroberson@chesapeake.net>
X-X-Sender: jroberson@desktop
To: John Baldwin <jhb@freebsd.org>
In-Reply-To: <200804021309.54956.jhb@freebsd.org>
Message-ID: <20080402195001.O949@desktop>
References: <10004.1205307334@critter.freebsd.dk>
	<20080312152744.I29518@fledge.watson.org>
	<20080328202602.N72156@desktop> <200804021309.54956.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-arch@freebsd.org
Subject: Re: timeout/callout small step forward
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 03 Apr 2008 05:51:13 -0000

On Wed, 2 Apr 2008, John Baldwin wrote:

> On Saturday 29 March 2008 03:04:17 am Jeff Roberson wrote:
>> http://people.freebsd.org/~jeff/callout.diff
>>
>> This patch takes the current callout implementation and makes it per-cpu.
>> It also hides callout details from the rest of the kernel by making the
>> callwheel structure private to kern_timeout.c among other things.
>
> Looks good.  The kern_intr.c diff has a small bug (forgot to remove the return
> (intr_event_create(...)) from swi_add()).  A few style suggestions would be

Ah thanks.  I had fixed this in a tree but didn't update the patch.  Now 
it's in current.  I'll check that in.

> to always leave a blank line before a comment (I think I saw this in
> kern_calloutwheel_init()?) and usually there isn't a blank line before a
> SYSINIT().  Maybe make the panic messages when creating softclock threads
> more specific, but that's very minor.

Ok, I think kern_timeout.c could use some reformatting and refactoring 
as well but I didn't want to tie this commit to that.  Some of those 
functions get too deep and should be broken off into simpler routines.

Thanks,
Jeff

>
> -- 
> John Baldwin
>
