From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 23:55:22 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AE2441065672 for ; Mon, 31 Mar 2008 23:55:22 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 831758FC15 for ; Mon, 31 Mar 2008 23:55:22 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VNt7Jf030309; Mon, 31 Mar 2008 16:55:07 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VNt7ZY030308; Mon, 31 Mar 2008 16:55:07 -0700 (PDT) Date: Mon, 31 Mar 2008 16:55:07 -0700 (PDT) From: Matthew Dillon Message-Id: <200803312355.m2VNt7ZY030308@apollo.backplane.com> To: Bakul Shah References: <20080331231829.26D865B50@mail.bitblocks.com> Cc: Christopher Arnold , Martin Fouts , Alfred Perlstein , qpadla@gmail.com, arch@freebsd.org, Poul-Henning Kamp Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 23:55:22 -0000 :Let us take the mtron MSD-SATA3025-032 device for example. It :has a capacity of 32GB + can do 16,000 random & 78,000 :sequential reads per second (of 512 byte blocks). If you :write a root block every megabyte, you have 2^15 potential :root blocks. Locating the latest one will require 16 random :reads + a scan of at most 1MB; which translates to about :26ms. Not too bad since this cost is incurred only on :the first mount or reboot. Yah, I think for NAND filesystems crash recovery is actually the easiest issue to deal with since all you need to do, pretty much, is rewind your append pointer a bit. Not only can you rewind the filesystem to an older state, but you can also reconstruct a great deal of the unsynced in-memory data by doing a limited reverse scan. It just does not take very long to do that and it can be done automatically by the filesystem at mount time. For example, if you write some file data and the new file block is flushed to flash but the related meta-data changes for the pointers to the now relocated data block have not yet been flushed to flash, on crash recovery it is possible to note this condition (the old physical block number would be stored in the aux data area of the new one) and regenerate the missing meta-data changes instead of being forced to back-out the write. Being able to do this has fairly substantial consequences because it means the fsync() only needs to flush some of the modified pages, not all of them, and that the remaining modified pages could in fact remain unflushed in the system cache and STILL be recoverable after a crash because their modification was mearly a side effect of the operation that *was* flushed to flash. Once you are able to do that, you can also simply decide not to synchronously flush that meta data at all and thus allow multiple changes to accumulate in system memory before flushing to flash. There is one issue here and that is the transactional nature of most filesystem operations. For example, if you append to a file with a write() you are not only writing new data to the backing store, you are also updating the on-disk inode's st_size field. Those are two widely separated pieces of information which must be transactionally bound --- all or nothing. In this regard the crash recovery code needs to understand that they are bound and either recover the whole transaction or none of it. Once you get to that point you also have to worry about interdependancies between transactions... a really sticky issue that is the reason for a huge chunk of softupdate's complexity in UFS. Basically you can wind up with interdependant transactions which must be properly recovered. An example would be doing a write() and then doing another write(). The second write() cannot be recovered unless the first can also be recovered. Separate transactions, but with a dependancy. Such interdependancies can become arbitrarily complex the longer you leave meta-data changes unflushed. The question ultimately becomes whether the recovery code can deal with the complexity or not. If not you may be forced to flush rather then create the interdependancy. HAMMER has precisely this issue with it's UNDO sequencing. The complexity is in the algorith, but not the (fairly small) amount of time it takes to actually perform the recovery operation. -Matt Matthew Dillon