From: Matthew Dillon <dillon@apollo.backplane.com>
Date: Tue, 1 Apr 2008 13:10:22 -0700 (PDT)
To: "Martin Fouts"
Cc: freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics

:If you've asked, I've missed the question.
:
:We tend to size ram and embedded NAND the same. The latest numbers I can
:discuss are several years old and were 64mb/64mb. Engineering *always*
:wants more of each, but the BOM rules.

64MB is tiny. None of the problems with any of the approaches we've discussed even exist with devices that small in an embedded system. You barely need to implement a filesystem topology at all, let alone anything sophisticated.

To be clear, because I really don't understand how you can possibly argue that the named-block storage layer is bad in a device that small... the only sophistication a named-block storage model needs is when it must create a forward lookup on-flash to complement the reverse lookup you get from the auxiliary storage. Given that you can trivially cache many translations in memory, not to mention do a boot-up scan of a flash that small, the only performance impact would be writing out a portion of the forward translation topology every N (N > 1000) or so page writes (however many translations can be conveniently cached in system memory).

In a small flash device the embedded application will determine whether you even need such a table... frankly, unless it's a general purpose computing device like an iPhone, you probably wouldn't need an on-flash forward lookup at all. You could simply size the named blocks to guarantee 99% flash utilization versus the number of files you expect to have to maintain (so, for example, the named block size could be 512K if you expected to maintain 100 files on a 64MB device). This doesn't mean the filesystem would have to use a 512K block size; that would only be the case if the filesystem were flash-unaware. It's seriously a non-issue.

You are making too many assumptions about how named blocks would be used, particularly if the filesystem is flash-aware.
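To make the boot-up-scan point concrete, here is a minimal sketch of rebuilding the entire forward map in RAM from the reverse lookup stored in each page's auxiliary (spare/OOB) area. Everything in it is illustrative: flash_read_oob(), the 64MB/2K-page geometry, and the one-named-block-per-page simplification are assumptions, not any real driver's API.

/*
 * Minimal sketch of the boot-up scan, under assumed parameters: a
 * 64MB device with 2K pages and, for simplicity, one named block per
 * flash page.  All names here are hypothetical.
 */
#include <stdint.h>

#define PAGE_COUNT      32768           /* 64MB of 2K pages */
#define NBLK_INVALID    0xffffffffU

struct oob {
        uint32_t nblk;                  /* reverse lookup: named block id */
        uint32_t seq;                   /* write sequence, newest wins */
};

/* Read the auxiliary (spare/OOB) area of one physical page. */
extern void flash_read_oob(uint32_t fpage, struct oob *oob);

static uint32_t fwd_map[PAGE_COUNT];    /* nblk -> physical page, in RAM */
static uint32_t fwd_seq[PAGE_COUNT];

static void
boot_scan(void)
{
        struct oob oob;
        uint32_t fpage;

        for (fpage = 0; fpage < PAGE_COUNT; fpage++)
                fwd_map[fpage] = NBLK_INVALID;

        /*
         * One linear pass over the spare areas rebuilds the whole
         * forward map in RAM, so no on-flash forward lookup topology
         * is needed on a device this small.
         */
        for (fpage = 0; fpage < PAGE_COUNT; fpage++) {
                flash_read_oob(fpage, &oob);
                if (oob.nblk >= PAGE_COUNT)
                        continue;       /* erased page reads as all 0xFF */
                if (fwd_map[oob.nblk] == NBLK_INVALID ||
                    oob.seq > fwd_seq[oob.nblk]) {
                        fwd_map[oob.nblk] = fpage;
                        fwd_seq[oob.nblk] = oob.seq;
                }
        }
}

On a part this small the scan is a one-time boot cost and the map lives entirely in RAM, which is why nothing more sophisticated is needed.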
Named blocks do not have to be 'sectors' or 'filesystem blocks' or anything of the sort. They can be much larger... they can easily be multiples of a flash page, though you don't want to make them too large because a failed page also fails the whole named block covering that page. They can be erase units (probably the best fit). This leaves the filesystem layer (and here we are talking about a flash-aware filesystem) with a hell of a lot more implementation flexibility.

FORWARD LOOKUP ON-FLASH TOPOLOGY

There are many choices available for the forward lookup topology, assuming you need one. Here again we are describing the need to have one (or at least one that would be considered sophisticated) only for larger flash devices -- really true solid state storage. We aren't talking about having to write out tiny little updates to B-Tree elements... that's stupid. Only an idiot would do that.

Because you can cache new lookups in system memory, and because you do NOT have to flush the forward lookup topology to flash for crash recovery purposes, the sole limiting factor for the efficiency of the forward lookup flush to flash is the amount of system memory you are willing to set aside to cache new translations. Since translations are fairly small structures we are probably talking not dozens, not hundreds, but at least a thousand translations before any sort of flush would be needed.

Let's be clear here. That's ONE page write of overhead for every THOUSAND page writes. There are no write performance issues.

The actual on-flash topology for the forward lookup? With such a large rollup cache available it almost doesn't matter, but let's say we wanted to limit forward lookups to 3 levels. Let's take a 2G flash device with 8K pages, just to be nice about it. That's 262144 named blocks. Not a very large number, eh? Why, you could almost fit that in system memory (and maybe you can!) and obviate the need for an on-flash forward lookup topology at all.

But let's say you COULDN'T fit that in system memory. Hmm. 3 levels, 262144 entries maximum (less in real life). That would be a 3-layer radix tree with a radix of 64. The top layer would almost certainly be cacheable in system memory (and maybe even two layers), so we are talking one or two page reads from flash to do the lookup. The update mechanic, being a radix tree, would be to sync the pieces of the radix tree that were modified by the new translations all the way up to the root.

Clearly, given the number of 'dirty' translations that would need to be synchronized, you could easily fill a flash page and then simply retire the synced translations from system memory, repeating as often as necessary to keep the dirty ratio of the in-memory cache at appropriate levels. You can also clearly accumulate enough dirty translations for the sync to be worthwhile... that is, be guaranteed to fill a whole page. You do NOT have to sync for recovery purposes, so it becomes an issue that is solely related to the system cache and nothing else.

I'll add something else with regard to radix trees using large radii... you can usually cache just about the whole damn thing except the leaf level in system memory. Think about that for a moment, and in particular think about how greatly it reduces the number of actual flash reads needed to perform the lookup.

I'll add something else with regard to on-storage radix trees. You can also adjust the layout so the BOTTOM few levels of the radix tree, relative to some leaf, reside in the same page. Two quick sketches below make this concrete.
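First, the rollup cache. This is a minimal sketch assuming a hypothetical 8-byte translation record and the 8K pages from the example above: an 8K page holds 1024 records, which is where the one-metadata-write-per-thousand-data-writes figure comes from. The names are illustrative.

/*
 * Minimal sketch of the rollup cache: accumulate new translations in
 * RAM and sync only when the batch fills a whole flash page.
 */
#include <stddef.h>
#include <stdint.h>

#define FLASH_PAGE_SIZE 8192

struct xlat_entry {
        uint32_t nblk;                  /* named block number */
        uint32_t fpage;                 /* physical flash page */
};

/* 8192 / 8 = 1024 translations per page. */
#define XLAT_PER_PAGE   (FLASH_PAGE_SIZE / sizeof(struct xlat_entry))

struct xlat_cache {
        struct xlat_entry dirty[XLAT_PER_PAGE];
        size_t            ndirty;
};

/*
 * Record a new translation.  Returns 1 when the batch would fill a
 * whole flash page; only then does the caller write it into the
 * on-flash forward topology and reset ndirty.  Nothing forces an
 * earlier flush -- the reverse map in the auxiliary area can always
 * rebuild these translations after a crash.
 */
static int
xlat_insert(struct xlat_cache *xc, uint32_t nblk, uint32_t fpage)
{
        xc->dirty[xc->ndirty].nblk = nblk;
        xc->dirty[xc->ndirty].fpage = fpage;
        xc->ndirty++;
        return (xc->ndirty == XLAT_PER_PAGE);
}

Since recovery never depends on this flush, the threshold could just as easily be tuned to memory pressure instead of page size.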
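Second, the lookup path for the 3-level, radix-64 tree (64 * 64 * 64 = 262144 leaf slots, matching the 2G/8K-page example). Again the names, including leafloc and flash_read_page(), are illustrative. The top two levels, collapsed here into one 4096-entry array, cost about 16K of RAM to pin, leaving a single page read for the leaf level.

/*
 * Sketch of a forward lookup where only the leaf level of the
 * radix-64 tree lives purely on flash.
 */
#include <stdint.h>

#define RADIX           64
#define RADIX_MASK      (RADIX - 1)

/* Flash location of each leaf node; top two levels, RAM-resident. */
extern uint32_t leafloc[RADIX * RADIX];

/* Read the flash page holding a leaf node (driver-specific). */
extern const uint32_t *flash_read_page(uint32_t fpage);

static uint32_t
nblk_lookup(uint32_t nblk)
{
        const uint32_t *leaf;

        /* Walking the top two levels touches only RAM... */
        leaf = flash_read_page(leafloc[nblk >> 6]);

        /*
         * ...and the single leaf page read also pulls in the
         * translations for the 63 neighboring named blocks, which is
         * where the streaming-read locality comes from.
         */
        return (leaf[nblk & RADIX_MASK]);
}

Updates dirty the RAM-resident levels all the way up to the root and go back out to flash in the page-sized batches sketched above.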
So now we've reduced a random uncached translation lookup to, at worst, ONE flash page read operation... one that ALSO guarantees us locality of reference for nearby file blocks (and hence poses no performance issues for streaming reads either).

Now, if you want to argue that this model would have serious performance penalties, please go ahead. I'm all ears.

                                        -Matt