From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 19:15:55 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C8479106564A for ; Sun, 30 Mar 2008 19:15:55 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outI.internet-mail-service.net (outi.internet-mail-service.net [216.240.47.232]) by mx1.freebsd.org (Postfix) with ESMTP id A6D348FC19 for ; Sun, 30 Mar 2008 19:15:55 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP; Sun, 30 Mar 2008 17:47:33 -0700 Received: from julian-mac.elischer.org (localhost [127.0.0.1]) by idiom.com (Postfix) with ESMTP id 52F7E2D6B2B; Sun, 30 Mar 2008 12:15:53 -0700 (PDT) Message-ID: <47EFE6EA.4000804@elischer.org> Date: Sun, 30 Mar 2008 12:15:54 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.12 (Macintosh/20080213) MIME-Version: 1.0 To: Kirk McKusick References: <200803292353.m2TNrCOW094875@chez.mckusick.com> In-Reply-To: <200803292353.m2TNrCOW094875@chez.mckusick.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org, Poul-Henning Kamp Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 19:15:55 -0000 Kirk McKusick wrote: > You should try running your experiment using ZFS. Because it is a > non-overwriting filesystem, it might work better with flash. trouble is the amount of ram it needs might be unsuitable for embedded systems. 
> > Kirk McKusick > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 20:16:56 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D0B9E106564A for ; Sun, 30 Mar 2008 20:16:56 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id A0FC58FC25 for ; Sun, 30 Mar 2008 20:16:56 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2UKGuqg015128; Sun, 30 Mar 2008 13:16:56 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2UKGuZA015127; Sun, 30 Mar 2008 13:16:56 -0700 (PDT) Date: Sun, 30 Mar 2008 13:16:56 -0700 (PDT) From: Matthew Dillon Message-Id: <200803302016.m2UKGuZA015127@apollo.backplane.com> To: Kirk McKusick , arch@freebsd.org References: <200803292353.m2TNrCOW094875@chez.mckusick.com> Cc: Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 20:16:57 -0000 :You should try running your experiment using ZFS. Because it is a :non-overwriting filesystem, it might work better with flash. : : Kirk McKusick I'm assuming ZFS still has to update indices and indirect blocks, though, which is the primary source for random updates in all filesystems. 
    The right way to deal with flash is *NOT* to require that the filesystem
    be smart about flash storage, but instead to implement an intermediate
    storage layer which linearizes the writes to flash and removes all
    random erases from the critical path.  This also causes erasures to be
    evenly spread out on the flash unit and *GREATLY* extends the life of
    the flash device (to the point where you can just treat it as a disk
    and not have to worry about wearing out cells).

    I wrote precisely that 20 years ago for the flash filesystem I built
    for our telemetry RTUs.  Of course, 20 years ago flash devices were
    much smaller, only 1-4MB per chip.  But the concept is sound and with
    proper design can be implemented for much larger devices.

    Basically the general idea is as follows:

    Break the flash into three pieces:  Two sector translation tables and
    one bulk storage area.  Whenever a modification is made that involves
    transitioning bits from 0->1 (1->0 doesn't need an erase cycle), instead
    of erasing the flash sector all you do is allocate a new flash sector,
    append an entry to the translation table, and write the data out to
    the new flash sector.  The logical block is now renumbered.  You cache
    (some or all of) the translation table in-memory for fast access.

    * Appends to the translation table only involve 1->0 transitions.  You
      don't even have to zero-out the old translation but can use it for
      crash recovery purposes.  Thus no erasures are needed until the table
      becomes full.

    * Any non-trivial overwrites append a new sector, again involving only
      1->0 transitions and requiring no erasures.

    * When the translation table becomes full you repack it into the second
      translation table (which then becomes the primary table), and erase
      the previous table.  You ping-pong the tables (that's why there are
      two).

    * Bulk space can be allocated linearly until the flash becomes full,
      then erased/repacked (you also switch to the alternate translation
      table when doing the repacking of the bulk space).

      This can be a little tricky but as long as you leave one
      erase-sector's worth of space available you can always repack the
      flash without any possibility of losing data.

      This latter operation is the most expensive but once some space is
      freed up it is possible to repack simultaneously with running new
      ops, or to repack continuously as a background operation when space
      is tight, as long as you don't get twisted up with a full translation
      table.

    The only 'hard' bit about this design is you need to come up with a
    translation table topology that works for large flash devices.  My flash
    filesystem of long ago just used a linear array and cached the whole
    thing with a hash table in memory, so it didn't require a sophisticated
    topology on-flash.  But for a large flash device you probably need
    something a bit more sophisticated that still does not involve erase
    cycles in the critical path.  The critical point, however, is that the
    on-flash translation table does NOT need to be optimal because you can
    mirror or cache elements of it in-memory.

    In any case, that's really the only acceptable way to do a flash
    filesystem and still be able to guarantee proper wear characteristics
    for the flash cells.
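
    [Illustration, not part of the original message.]  The scheme described
    above can be modelled in a few lines.  What follows is a minimal sketch
    under loud assumptions: the class name TranslationLayer, its methods and
    its counters are hypothetical, flash is modelled as Python lists rather
    than as bits that only go 1->0 without an erase, and the bulk repack
    copies live data through RAM instead of working one erase-sector at a
    time with a spare sector as a real device would have to.  It only shows
    the bookkeeping: append-only translation entries, linear allocation of
    bulk space, the two ping-ponged tables, and the in-memory cache.

```python
class TranslationLayer:
    """Toy model of an append-only sector translation layer (hypothetical)."""

    def __init__(self, n_physical, table_capacity):
        self.bulk = [None] * n_physical   # bulk storage area, None = erased
        self.tables = [[], []]            # the two on-flash translation tables
        self.active = 0                   # index of the primary table
        self.table_capacity = table_capacity
        self.next_free = 0                # linear allocation cursor in bulk
        self.cache = {}                   # in-memory map: logical -> physical
        self.table_erases = 0             # erase counters, for illustration
        self.bulk_erases = 0

    def read(self, logical):
        phys = self.cache.get(logical)
        return None if phys is None else self.bulk[phys]

    def write(self, logical, data):
        # Erases only happen here, when something is full, never per write.
        if self.next_free == len(self.bulk):
            self._repack_bulk()
        if len(self.tables[self.active]) == self.table_capacity:
            self._switch_table(sorted(self.cache.items()))
        phys = self.next_free             # allocate a fresh sector linearly
        self.next_free += 1
        self.bulk[phys] = data
        self.tables[self.active].append((logical, phys))  # append-only entry
        self.cache[logical] = phys        # the logical block is renumbered

    def _switch_table(self, live_entries):
        # Repack live entries into the standby table, then erase the old one.
        if len(live_entries) >= self.table_capacity:
            raise RuntimeError("translation table too small for the live data")
        standby = 1 - self.active
        self.tables[standby] = list(live_entries)
        self.tables[self.active] = []
        self.table_erases += 1
        self.active = standby

    def _repack_bulk(self):
        # Simplification: stage live data in RAM, erase everything, rewrite.
        live = [(logical, self.bulk[phys])
                for logical, phys in sorted(self.cache.items())]
        if len(live) >= len(self.bulk):
            raise RuntimeError("flash is full of live data")
        self.bulk = [None] * len(self.bulk)
        self.bulk_erases += 1
        for phys, (logical, data) in enumerate(live):
            self.bulk[phys] = data
            self.cache[logical] = phys
        self.next_free = len(live)
        # Bulk repacking also switches to the alternate translation table.
        self._switch_table(sorted(self.cache.items()))
```

    Overwriting the same few logical blocks over and over in this model
    walks the allocation cursor across the whole bulk area before anything
    is erased, which is where the even wear comes from.  On crash recovery
    a real implementation would rebuild the cache by replaying the active
    table; that step is left out here.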
-Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 20:31:26 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9CAFE1065672 for ; Sun, 30 Mar 2008 20:31:26 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 52B878FC20 for ; Sun, 30 Mar 2008 20:31:26 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (unknown [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id DF50B17104; Sun, 30 Mar 2008 20:31:24 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2UKVNjh012859; Sun, 30 Mar 2008 20:31:24 GMT (envelope-from phk@critter.freebsd.dk) To: Matthew Dillon From: "Poul-Henning Kamp" In-Reply-To: Your message of "Sun, 30 Mar 2008 13:16:56 MST." <200803302016.m2UKGuZA015127@apollo.backplane.com> Date: Sun, 30 Mar 2008 20:31:23 +0000 Message-ID: <12858.1206909083@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Kirk McKusick , arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 20:31:26 -0000 In message <200803302016.m2UKGuZA015127@apollo.backplane.com>, Matthew Dillon w rites: > The right way to deal with flash is *NOT* to require that the filesystem > be smart about flash storage, but instead to implement an intermediate > storage layer which linearizes the writes to flash and removes all > random erases from the critical path. 
Your description of a simplified version of what is commonly called
a "Flash Adaptation Layer" is a very good example of why there is
a clear difference between "camera grade" flash devices, like most
CF cards, and the new generation of "SSD" devices, like the M-Tron
disk now in my laptop.

The camera-grade flash devices get lousy random write performance
because they implement in essence what you describe, only in a more
complete fashion where they have error correction, both on the data
and on the bitmaps.

The newer generation of SSD devices do things much smarter than
that, which is why their random write performance is much better
than camera-grade devices.

See my earlier emails for references to how to do the really smart
thing.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 21:00:16 2008
Return-Path:
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C36931065672 for ; Sun, 30 Mar 2008 21:00:16 +0000 (UTC) (envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 79F9C8FC1F for ; Sun, 30 Mar 2008 21:00:16 +0000 (UTC) (envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2UL0FmF015655; Sun, 30 Mar 2008 14:00:15 -0700 (PDT)
Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2UL0FTd015654; Sun, 30 Mar 2008 14:00:15 -0700 (PDT)
Date: Sun, 30 Mar 2008 14:00:15 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200803302100.m2UL0FTd015654@apollo.backplane.com>
To: "Poul-Henning Kamp"
References: <12858.1206909083@critter.freebsd.dk>
Cc:
Kirk McKusick , arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 21:00:16 -0000 :Your description of a simplified version of what is commonly called :a "Flash Adaptation Layer", is a very good example of why there is :a clear difference between "camera grade" flash devices, like most :CF cards, and the new generation of "SSD" devices, like the M-Tron :disk now in my laptop. : :The Camera grade Flash devices get lousy random write performance :because they implement in essense what you describe, only in a more :complete fashion where they have error correction, both the data :and on the bitmaps. : :The newer generation of SSD devices do things much smarter than :that, which is why their random write performance is much better :than camera-grade devices. : :See my earlier emails for references to how to do the really smart :thing. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 Er, why don't you explain it again, because I can't find the reference. You can only write to flash so fast. What I described is a fairly maximal implementation. The only way to make things faster is to add some dime-cap-backed static ram as a front-end cache and to gang writes to multiple flash chips (which is fairly standard). A dime-cap-backed static ram will retain the cache for upwards of a month. If you go LI-battery backed static ram then cache retention is around 5-years. Most 'camera grade' devices are one or two physical chips. Write performance, particularly when writing out large linear files, tends to be limited by the fact that there aren't very many flash chips and so you have no ability to gang writes in parallel. 
    Any sort of SSD device is typically going to have anywhere from four to
    'many' physical flash devices on board.  Write performance to such
    devices will be an order of magnitude faster, really only limited by
    design choices on how the flash devices are ganged.  A 'wide data' bus
    is the most convenient way to gang writes.

    There are also current limitations which limit how many physical chips
    you can write to in parallel, though modern flash devices have much
    lower write current requirements than older ones, and if it is packaged
    as a SATA drive then it has tons of current capability simply by having
    access to a power connector capable of delivering the currents required
    by normal hard drives.  CF and other small-format flash devices do not
    have NEARLY the same current delivery capabilities.

    In any case, there is nothing magical about any of this.  You still
    need to spread the data out on the physical flash devices to avoid
    wearing out cells.  Perceived improvements in performance are entirely
    due to having a front-end non-volatile ram cache and ganging writes in
    parallel.
-Matt From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 21:06:58 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C371E106564A for ; Sun, 30 Mar 2008 21:06:58 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 78C448FC1C for ; Sun, 30 Mar 2008 21:06:58 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (unknown [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 38A8417104; Sun, 30 Mar 2008 21:06:57 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2UL6vYu013180; Sun, 30 Mar 2008 21:06:57 GMT (envelope-from phk@critter.freebsd.dk) To: Matthew Dillon From: "Poul-Henning Kamp" In-Reply-To: Your message of "Sun, 30 Mar 2008 14:00:15 MST." <200803302100.m2UL0FTd015654@apollo.backplane.com> Date: Sun, 30 Mar 2008 21:06:57 +0000 Message-ID: <13179.1206911217@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Kirk McKusick , arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 21:06:58 -0000 In message <200803302100.m2UL0FTd015654@apollo.backplane.com>, Matthew Dillon w rites: > Er, why don't you explain it again, because I can't find the reference. You'll find it if you search for it. And no, I really don't want to discuss it any further with you. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 21:09:29 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5869A1065673 for ; Sun, 30 Mar 2008 21:09:29 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 122BB8FC17 for ; Sun, 30 Mar 2008 21:09:29 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2UL9S9H015763; Sun, 30 Mar 2008 14:09:28 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2UL9SV1015762; Sun, 30 Mar 2008 14:09:28 -0700 (PDT) Date: Sun, 30 Mar 2008 14:09:28 -0700 (PDT) From: Matthew Dillon Message-Id: <200803302109.m2UL9SV1015762@apollo.backplane.com> To: "Poul-Henning Kamp" References: <13179.1206911217@critter.freebsd.dk> Cc: Kirk McKusick , arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 21:09:29 -0000 : :In message <200803302100.m2UL0FTd015654@apollo.backplane.com>, Matthew Dillon w :rites: : :> Er, why don't you explain it again, because I can't find the reference. : :You'll find it if you search for it. : :And no, I really don't want to discuss it any further with you. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 Well, no skin off my nose. I will say that I am not at all impressed with your idiotic answer, though. 
-Matt From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 21:11:07 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BC54E1065677 for ; Sun, 30 Mar 2008 21:11:07 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 71A098FC1A for ; Sun, 30 Mar 2008 21:11:07 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (unknown [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 892F417107; Sun, 30 Mar 2008 21:11:06 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2ULB6fM013325; Sun, 30 Mar 2008 21:11:06 GMT (envelope-from phk@critter.freebsd.dk) To: Matthew Dillon From: "Poul-Henning Kamp" In-Reply-To: Your message of "Sun, 30 Mar 2008 14:09:28 MST." <200803302109.m2UL9SV1015762@apollo.backplane.com> Date: Sun, 30 Mar 2008 21:11:06 +0000 Message-ID: <13324.1206911466@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Kirk McKusick , arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 21:11:07 -0000 In message <200803302109.m2UL9SV1015762@apollo.backplane.com>, Matthew Dillon w rites: > >: >:In message <200803302100.m2UL0FTd015654@apollo.backplane.com>, Matthew Dillon w >:rites: >: >:> Er, why don't you explain it again, because I can't find the reference. >: >:You'll find it if you search for it. >: >:And no, I really don't want to discuss it any further with you. >: >:-- >:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > > Well, no skin off my nose. 
I will say that I am not at all impressed > with your idiotic answer, though. ... and that's why. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 21:15:01 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D669C1065678 for ; Sun, 30 Mar 2008 21:15:01 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 900F98FC31 for ; Sun, 30 Mar 2008 21:15:01 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2ULExU2015829; Sun, 30 Mar 2008 14:15:01 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2ULExWt015828; Sun, 30 Mar 2008 14:14:59 -0700 (PDT) Date: Sun, 30 Mar 2008 14:14:59 -0700 (PDT) From: Matthew Dillon Message-Id: <200803302114.m2ULExWt015828@apollo.backplane.com> To: "Poul-Henning Kamp" References: <13324.1206911466@critter.freebsd.dk> Cc: Kirk McKusick , arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 21:15:01 -0000 :> Well, no skin off my nose. I will say that I am not at all impressed :> with your idiotic answer, though. : :... and that's why. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 You really like making declarations by fiat. I actually explain the reason, in depth. 
If you are unable or unwilling to have a technical conversation and insist on simply putting down one-liners with nothing to back them up, then that's your problem, not mine. -Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 21:42:43 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B3967106564A for ; Sun, 30 Mar 2008 21:42:43 +0000 (UTC) (envelope-from chris@arnold.se) Received: from mailstore.infotropic.com (mailstore.infotropic.com [213.136.34.3]) by mx1.freebsd.org (Postfix) with ESMTP id E9B4E8FC18 for ; Sun, 30 Mar 2008 21:42:42 +0000 (UTC) (envelope-from chris@arnold.se) Received: (qmail 96681 invoked by uid 89); 30 Mar 2008 21:15:58 -0000 Received: by simscan 1.2.0 ppid: 96676, pid: 96678, t: 0.1362s scanners: attach: 1.2.0 clamav: 0.90/m:42 Received: from unknown (HELO ?192.168.123.123?) (chris@arnold.se@212.71.168.45) by mailstore.infotropic.com with ESMTPA; 30 Mar 2008 21:15:57 -0000 Date: Sun, 30 Mar 2008 23:15:57 +0200 (CEST) From: Christopher Arnold X-X-Sender: chris@localhost To: arch@freebsd.org Message-ID: <20080330231544.A96475@localhost> X-message-flag: =?ISO-8859-1?Q?Outlook_isn=B4t_compliant_with_current_standards?= =?ISO-8859-1?Q?_please_install_another_mail_client!?= MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 21:42:43 -0000 On Sun, 30 Mar 2008, Poul-Henning Kamp wrote: > In message <200803302100.m2UL0FTd015654@apollo.backplane.com>, Matthew Dillon > w > rites: > >> Er, why don't you explain it again, because I can't find the reference. > > You'll find it if you search for it. 
>
I believe phk means that googling for "Flash Adaptation Layer" turns up some
results.

> And no, I really don't want to discuss it any further with you.
>
But please continue the discussion for the sake of the silent majority; there
are loads of us out here who are interested in flash fs development.

Also, I had the impression that newer flash-based hard drives had internal
logic to spread out writes evenly over the disk and to remap worn-out blocks,
and that the result of these algorithms increased MTBF to at least the MTBF
for spinning disks. Or have I misread something?

	/Chris

-- 
http://www.arnold.se/
http://www.infotropic.com/

From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 00:10:34 2008
Return-Path:
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id ECC08106566B for ; Mon, 31 Mar 2008 00:10:33 +0000 (UTC) (envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id BDF358FC22 for ; Mon, 31 Mar 2008 00:10:33 +0000 (UTC) (envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2V0ALA3017187; Sun, 30 Mar 2008 17:10:21 -0700 (PDT)
Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2V0ALRp017186; Sun, 30 Mar 2008 17:10:21 -0700 (PDT)
Date: Sun, 30 Mar 2008 17:10:21 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200803310010.m2V0ALRp017186@apollo.backplane.com>
To: Christopher Arnold
References: <20080330231544.A96475@localhost>
Cc: arch@freebsd.org
Subject: Re: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 31 Mar 2008
00:10:34 -0000

:I believe phk means that googling for "Flash Adaptation Layer" turns up some
:results.
:
:> And no, I really don't want to discuss it any further with you.
:>
:But please continue the discussion for the sake of the silent majority; there
:are loads of us out here who are interested in flash fs development.
:
:Also, I had the impression that newer flash-based hard drives had internal
:logic to spread out writes evenly over the disk and to remap worn-out blocks,
:and that the result of these algorithms increased MTBF to at least the MTBF
:for spinning disks. Or have I misread something?
:
:
:	/Chris

    I found some of it, though I dunno if it's what he was specifically
    referencing.  The slide show was interesting though there were a number
    of factual errors, but I didn't really see anything in-depth about
    'Flash Adaptation Layer'.  It seems to be a fairly generically coined
    term for something that is far from generic in actual implementation.

    The idea of remapping flash sectors could be considered a poor-man's
    way of dealing with wear issues in that remapping tends to be fairly
    limited... for example, you might use a fixed-sized table and once the
    table fills up the device is toast.  Remapping doesn't actually prevent
    the uneven wear from occurring, it just gives you a fixed factor of
    additional runway.

    If remapping gets complex enough to work with an arbitrary number of
    dead sectors then it is effectively a 'Flash Adaptation Layer'.
    Limited remapping (e.g. using a fixed-sized table) is really easy to
    code up.  But there are some huge differences between the two.  Really
    huge differences.

    Detecting a worn cell requires generating a CRC and correcting it
    requires generating an ECC code.  Neither CRCs nor ECCs are perfect and
    actually depending on them to handle situations that happen *normally*
    during the device's life-span is bad business.  A proper sector
    translation mechanism guarantees even wear of all the cells.
    You don't *GET* CRC errors under normal operation of the device.  You
    still want to have a CRC to detect the situation, and perhaps even a
    small ECC to try to correct it, but these exist to handle manufacturing
    defects (which can limit the life of individual cells) rather than to
    handle wear issues unrelated to manufacturing defects, which is what a
    limited remapping mechanic does.  A wear issue can cause many cells to
    die (see later on w/ regards to data retention) whereas a manufacturing
    defect tends to result in single bit errors.

    Insofar as indestructibility goes, in the short term flash storage is
    more resilient than disk storage, especially considering that there are
    no moving parts, but flash cells will degrade over time whether you
    write to them or not, depending on temperature.

    Look at any flash part, bring up the technical specifications and there
    will be an entry for 'data retention' time.  Usually it's around 10
    years at 20 C.  If it is hotter the data is retained for a shorter
    period of time, if it is colder the data is retained for a longer
    period of time.

    Retention is different from cell wear.  What retention means is that if
    you have a flash device, you need to rewrite the cells periodically.
    You can't just read the cell like a DRAM refresh, but you don't have to
    go through an erase cycle either; you only have to rewrite the cell.
    You need to do that at least once every 5 years to be safe, or you risk
    losing the data.  Rewriting the cell does add wear to it so you don't
    want to rewrite it too often.

    I have personally seen flash devices lose data... I'm trying to
    remember how many years it was but I think it was on the order of 15
    years in one unit out of 30 that was subject to fairly hot temperatures
    in the summer.

    A flash unit must therefore run a scrubber to really be reliable.  It
    is absolutely required if you use a remapping algorithm, and a bit less
    so if you use a proper storage layer which generates even wear.
    The real difference between the two comes down to shelf life (when you
    aren't scrubbing anything), since worn cells will die a lot more
    quickly than unworn cells.

    A scrubber in this case must validate the CRC and there is usually a
    way to tell the device to operate at a different detection threshold in
    order to detect a failing cell *before* it actually fails (write-verify
    usually does this when writing but you also want to do this when
    scrubbing, if you want to do it right).  The idea is for the scrubber
    to detect bit errors *before* the data becomes unrecoverable and, in
    fact, before the data even needs to be ECC'd.  You should not have to
    actually use ECC correction under normal operation of the device over
    its entire life span.

    If you have a wear situation where multiple cells are failing and you
    do not scan the data in the flash often enough (using write-verify
    thresholds, NOT normal operations thresholds) to detect the failing
    cells, and/or you do not have a verification voltage capability to
    detect failing cells before they fail (for example you take a worn
    device offline and store it on a shelf somewhere), then you risk
    detecting the failed cells too late, at a point where there are too
    many failed cells to correct.  This is of particular concern for very
    large flash storage.

    One side-effect of having a proper storage layer is that the scrubber
    is typically built in to it.  Just the mechanic of write-appending and
    having to repack the storage usually cycles the storage in a time frame
    less than 10 years.  You can scrub either way, though; it isn't hard to
    do and doesn't require remapping the cell unless it has failed, just
    re-writing the same data resets the energy levels.

    A flash is still more reliable than a hard drive in the short-term.
    However, disk media tends to retain magnetic orientation longer than a
    flash cell (longer than 10 years)... well, I'm not sure about the
    absolute latest technology but that was certainly the case 10 years
    ago.
    Disk media has similar thermal erasure issues so, really, both types of
    media have a limited data retention span.

    Recovering data from an aging flash chip is a lot harder, though,
    because you have to remove the flash packaging and even shave the chip
    (yes, it can be done, there have been numerous cases where supposedly
    secure execute-only flash and E^2prom could be read out by shaving the
    chip, though I dunno if it has been done with recent super-high-density
    flashes).  With disk media you can generally recover thermally erased
    bits using very expensive equipment with very sensitive detectors.  If
    the data is important, and you are willing to pay for it, you can
    recover it off a HD.

    Typically the only difference between 'consumer' and 'industrial' flash
    is how they sort the chips coming out of the plant.  It is possible to
    detect weak cells and sort the chips accordingly (thus consumer chips
    have fewer rewrite cycles), though frankly in most cases a consumer
    chip will be almost as good as an industrial one.  If you run a proper
    sector translation layer which generates even wear and you have the
    ability to use the write-verify mechanism in your scrubbing code, it
    doesn't really matter which grade you use.
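
    [Illustration, not part of the original message.]  The scrubbing policy
    described above can be sketched as a single pass over the sectors.
    Everything named here is an assumption made for the example: the Sector
    record, the numeric "margin" standing in for how strongly a cell still
    holds its charge, and both threshold values are hypothetical, and real
    parts expose the stricter verify level in device-specific ways.  The
    only figure taken from the discussion is the five-year refresh
    interval.

```python
from dataclasses import dataclass

YEAR = 365 * 24 * 3600
REFRESH_INTERVAL = 5 * YEAR     # "at least once every 5 years to be safe"
NORMAL_READ_THRESHOLD = 0.2     # assumed: below this a normal read fails
VERIFY_THRESHOLD = 0.5          # assumed: stricter level used when scrubbing


@dataclass
class Sector:
    data: bytes
    last_written: float         # seconds since some epoch
    margin: float               # hypothetical charge margin, 1.0 = fresh
    rewrites: int = 0           # wear added by scrubbing


def scrub(sectors, now):
    """Rewrite sectors that are old or marginal; return their indices."""
    refreshed = []
    for index, sector in enumerate(sectors):
        too_old = now - sector.last_written >= REFRESH_INTERVAL
        # Checked against the verify level, not the normal read level, so
        # a weakening sector is caught while it still reads back cleanly.
        marginal = sector.margin < VERIFY_THRESHOLD
        if too_old or marginal:
            sector.last_written = now
            sector.margin = 1.0     # rewriting the same data resets it
            sector.rewrites += 1    # each rewrite costs some wear
            refreshed.append(index)
    return refreshed
```

    Sectors that are neither old nor marginal are left alone, which is the
    point of the policy: rewriting adds wear, so the scrubber should touch
    a sector only when retention or a weak read calls for it.  How the
    rewrite itself is carried out is device-specific and is not modelled.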
-Matt

From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 01:36:04 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2C428106566B for ; Mon, 31 Mar 2008 01:36:04 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id F41B28FC17 for ; Mon, 31 Mar 2008 01:36:03 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2V1ZqdA018355; Sun, 30 Mar 2008 18:35:52 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2V1ZpiN018354; Sun, 30 Mar 2008 18:35:51 -0700 (PDT) Date: Sun, 30 Mar 2008 18:35:51 -0700 (PDT) From: Matthew Dillon Message-Id: <200803310135.m2V1ZpiN018354@apollo.backplane.com> To: Christopher Arnold , arch@freebsd.org References: <20080330231544.A96475@localhost> Cc: Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 01:36:04 -0000

I just finished reading up on the latest NAND stuff, so I am going to add an addendum.

There was one factual error in my last posting having to do with byte rewrites. I'm not sure this applies to all manufacturers but one spec sheet I looked at specifically limited non-erase rewriting to two consecutive page-write sequences. After that you have to perform an erase before you can write (and rewrite once) again.

**** I'd be interested in knowing if any chip vendors support multiple
**** consecutive page-write sequences without erase cycles in between
**** (i.e. allowing 1->0 transitions like you can do with NOR).
It looks like most vendors provide SECTOR_SIZE + 64 bytes of auxiliary information. The auxiliary information is where you typically store the CRC and ECC (they can be the same thing but it's a good idea to implement them separately). I was surprised that the vendors specified only a 2 bit detect / 1 bit correction code, which is actually the simplest Hamming code you can have.

Describing this type of Hamming code in a paragraph is actually pretty easy. You can think of it as a code which identifies which bit in a block is in error and needs to be 'flipped' (aka the '1' bit correction). For example, if you are ECC'ing 8192 bytes you have 65536 bits which means the Hamming code needs to be able to encode a 16 bit correction address, hence it requires 16 bits of storage for the correction, plus another (typically) log2(16) = 4 bits of storage for the detection, plus 1 more bit (you have to include the storage taken up by the ECC code itself). So ECC on 65536 bits requires 21 bits. I'm doing that from memory so don't quote me; we used those sorts of ECC in radio modem protocols 20 years ago. The actual construction of the correction address is a bit more complex but those are the basics of how a 2 bit detect / 1 bit correct Hamming code works.

The vendor bit error handling recommendation is to relocate the page and then erase the original rather than to rewrite the page, so the scrubbing code can't just rewrite the same page when it finds an error. You still have to scrub, though, or you risk accumulating too many errors to correct.

write-verify is typically automatic in the chips but the two I checked do not seem to have a variable threshold for read operations for early detection of leaking bits. Older chips had separate power supplies for the programming power but newer ones incorporate internal charge pumps so it may not be doable, which would be too bad.

Life span and shelf life information is correct.
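Returning to the 2 bit detect / 1 bit correct code described above, here is a small sketch of an extended Hamming (SEC-DED) code. It is a textbook construction, not the layout any particular NAND vendor uses, and the function names are made up for illustration. Under the standard bound (smallest r with 2^r >= k + r + 1, plus one overall parity bit) a 65536-bit block needs 18 check bits, somewhat fewer than the from-memory figure of 21 quoted above.

```python
def secded_check_bits(k):
    """Check bits for SEC-DED over k data bits: Hamming bits + 1 overall parity."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r + 1


def encode(data_bits):
    """Return a codeword list; index 0 is overall parity, powers of two are checks."""
    k = len(data_bits)
    r = secded_check_bits(k) - 1
    n = k + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):            # not a power of two: data position
            code[pos] = next(it)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos            # XOR of the addresses of all set bits
    for i in range(r):
        code[1 << i] = (syndrome >> i) & 1   # force the syndrome to zero
    code[0] = sum(code) & 1            # make the total parity even
    return code


def decode(code):
    """Return (status, codeword) where status is ok/corrected/double-error."""
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    parity = sum(code) & 1
    if syndrome == 0 and parity == 0:
        return "ok", code
    if parity == 1 and syndrome < len(code):
        fixed = list(code)
        fixed[syndrome] ^= 1           # the syndrome is the 'correction address'
        return "corrected", fixed
    return "double-error", code


print(secded_check_bits(65536))
word = encode([1, 0, 1, 1, 0, 0, 1, 0])
damaged = list(word)
damaged[5] ^= 1
print(decode(damaged)[0])
```

The syndrome here plays the role of the 'correction address' from the message: it is the XOR of the positions of all set bits, so a single flipped bit leaves exactly its own position behind.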
My assumption there is that the manufacturers are specifying the shelf life for leakage in the worst case write versus verify cycle (the verify is internal to the chip, the external entity just does a write and reads the verification status after it finishes). If there is no way to do a read at a lower sensitivity level there is really no way to locate failing bits before they actually fail. That doesn't seem right so I may be missing something in the spec.

With regards to averaging out the wear by not erase cycling the same page over and over again, my read from the chip specs is that you basically have no choice on the matter... you MUST average the wear out, period, end of story. This also precludes using a simple sector remapping algorithm, particularly if the re-writes between erase cycles for a page are limited.

The reason you MUST average the wear out is that the vendors do not appear to be guaranteeing even 100K erase cycles. I've read flash chip specs a billion times... when you read between the lines, what the vendor is saying, basically, is that the shelf life of a stored bit is only guaranteed to be 10 years if you don't rewrite the cell more than X number of times. So while it may be possible to write more than X number of times, you risk serious data degradation ('shelf life') if you do, even if the write does not fail. This is the only guarantee they make, and it is based on the damage the cell takes when you erase/write to it, which increases leakage, which reduces shelf life.

They do NOT guarantee that you can actually do X erase cycles, they simply say that the chip will tell you if an erase cycle fails, and that it can fail ANY TIME... the very first erase cycle you do on a particular page can fail. The ONLY thing the vendors guarantee is that the FIRST page on the device can go through a certain number of erase cycles, like 1000 or 10,000. No other page on the device has any sort of guarantee. This is very important.
This means you MUST average the wear out, period, whether it is consumer OR industrial grade. -Matt From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 02:13:36 2008 Return-Path: Delivered-To: arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 908E41065674 for ; Mon, 31 Mar 2008 02:13:36 +0000 (UTC) (envelope-from das@FreeBSD.ORG) Received: from zim.MIT.EDU (ZIM.MIT.EDU [18.95.3.101]) by mx1.freebsd.org (Postfix) with ESMTP id 252798FC16 for ; Mon, 31 Mar 2008 02:13:35 +0000 (UTC) (envelope-from das@FreeBSD.ORG) Received: from zim.MIT.EDU (localhost [127.0.0.1]) by zim.MIT.EDU (8.14.2/8.14.2) with ESMTP id m2V2F4m4001804; Sun, 30 Mar 2008 22:15:04 -0400 (EDT) (envelope-from das@FreeBSD.ORG) Received: (from das@localhost) by zim.MIT.EDU (8.14.2/8.14.2/Submit) id m2V2F4ju001803; Sun, 30 Mar 2008 22:15:04 -0400 (EDT) (envelope-from das@FreeBSD.ORG) Date: Sun, 30 Mar 2008 22:15:04 -0400 From: David Schultz To: Matthew Dillon Message-ID: <20080331021504.GA1465@zim.MIT.EDU> Mail-Followup-To: Matthew Dillon , Christopher Arnold , arch@FreeBSD.ORG References: <20080330231544.A96475@localhost> <200803310010.m2V0ALRp017186@apollo.backplane.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200803310010.m2V0ALRp017186@apollo.backplane.com> Cc: Christopher Arnold , arch@FreeBSD.ORG Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 02:13:36 -0000 On Sun, Mar 30, 2008, Matthew Dillon wrote: > The idea of remapping flash sectors could be considered a poor-man's > way of dealing with wear issues in that remapping tends to be fairly > limited... 
for example, you might use a fixed-sized table and once the
> table fills up the device is toast. Remapping doesn't actually prevent
> the uneven wear from occurring, it just gives you a fixed factor of
> additional runway.
[...]
> A flash unit must therefore run a scrubber to really be reliable. It is
> absolutely required if you use a remapping algorithm, and a bit less so
> if you use a proper storage layer which generates even wear.

Yes, this is essentially what modern NAND flash devices do. I suggest that you read this article before you write any more essays about it:

http://www.cs.tau.ac.il/~stoledo/Pubs/flash-survey.pdf

Now if you think about issues such as sector mapping updates, writes smaller than the mapping granularity, and running the cleaner on fragmented erase units, you'll quickly see why random writes perform so poorly.

You're right that you need additional algorithms to avoid uneven wear; remapping merely facilitates that even when the write access pattern is decidedly uneven. The article discusses several approaches.

Several people have proposed flash-aware filesystems, also described in the article, to obviate the need for this sort of remapping layer. Confusingly, one of them is called FFS, for "Flash File System". Most of them resemble log-structured filesystems like LFS and ZFS, but often with additional considerations such as wear leveling.

Your earlier characterization of ZFS wasn't quite right, by the way; ZFS arranges data and metadata in a tree of blocks, and even the indirect blocks, except for the top-level block, are copy-on-write. Unfortunately I can't find a good paper on it at the moment.
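The copy-on-write tree idea in the last paragraph can be illustrated with a generic path-copying sketch. This is not ZFS's on-disk format or code; `Leaf`, `Indirect` and `cow_update` are hypothetical names, and the only thing "updated in place" here is the variable that refers to the root.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    data: bytes


@dataclass(frozen=True)
class Indirect:
    children: Tuple[Union["Indirect", Leaf], ...]


def cow_update(node, path, data):
    """Return a new tree with the leaf at `path` replaced; `node` is never modified."""
    if not path:
        return Leaf(data)
    i = path[0]
    new_child = cow_update(node.children[i], path[1:], data)
    # Only blocks on the path from the leaf to the root are rewritten
    # (to new locations); every sibling subtree is shared with the old tree.
    return Indirect(node.children[:i] + (new_child,) + node.children[i + 1:])


old_root = Indirect((
    Indirect((Leaf(b"a"), Leaf(b"b"))),
    Indirect((Leaf(b"c"), Leaf(b"d"))),
))
new_root = cow_update(old_root, (0, 1), b"B")
print(new_root.children[0].children[1].data)
print(new_root.children[1] is old_root.children[1])
```

Every write therefore touches one leaf plus one indirect block per tree level, all written to fresh locations, which is why such a layout avoids overwriting existing blocks.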
From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 11:06:57 2008 Return-Path: Delivered-To: freebsd-arch@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8F85810656A6 for ; Mon, 31 Mar 2008 11:06:57 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 666178FC12 for ; Mon, 31 Mar 2008 11:06:57 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id m2VB6vpi038848 for ; Mon, 31 Mar 2008 11:06:57 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.2/8.14.1/Submit) id m2VB6uKW038844 for freebsd-arch@FreeBSD.org; Mon, 31 Mar 2008 11:06:56 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 31 Mar 2008 11:06:56 GMT Message-Id: <200803311106.m2VB6uKW038844@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-arch@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 11:06:57 -0000 Current FreeBSD problem reports Critical problems Serious problems Non-critical problems S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. 
From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 18:05:25 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 00FDC1065679 for ; Mon, 31 Mar 2008 18:05:25 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id E09298FC14 for ; Mon, 31 Mar 2008 18:05:24 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 8A94940A2A6; Mon, 31 Mar 2008 10:36:01 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Mon, 31 Mar 2008 10:36:09 -0700 Message-ID: In-Reply-To: <200803310135.m2V1ZpiN018354@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciSz5+GfnfZSxuDTqmryEuFc5lwBgAgsVwg References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> From: "Martin Fouts" To: "Matthew Dillon" , "Christopher Arnold" , Cc: Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 18:05:25 -0000 I came late to this discussion, so pardon me if I'm repeating stuff that's already been discussed. You can guess a lot from vendor specs, but NAND flash requires experience before you understand the nuances; especially since the vendors tend not to document most of what you need to know to get good performance and reliability from a flash device. 
There are, basically, two approaches to using NAND devices.

What PHK calls "flash adaptation layer" or, sometimes, "flash translation layer" is widely used in devices that are meant to be seen as removable MS-DOS file system devices, such as almost every USB NAND based flash device on the market. It is also used in at least two commercial flash file systems intended for embedded flash. It is also an approach available to the Linux MTD layer, although not used by any of the Linux filesystems. This approach works well enough for specific usage patterns and you will find several successful embedded devices on the CE marketplace that use it.

The second approach is to have a 'flash aware filesystem', which understands the write/read/erase properties of NAND flash parts. There are three variants on this approach that I'm aware of. The first takes a 'traditional' filesystem like FFS and, in effect, adds a flash translation layer. The second takes a log-like file system and adapts its GC to NAND. The third approach is to write a file system specific to NAND devices from scratch.

PalmOS Garnet's NAND file system is an example of the first. The modified version of LFS that Mike Chen and I did for PalmOS Cobalt is an example of the second. The MTD based file system jffs2 is an example of the third, and a cautionary tale for those who would write their own.

In addition to the various points Matt Dillon has figured out from reading specs, there are several features of NAND parts that I haven't seen mentioned here that play a fairly important role in designing file systems around them.
These include, but are probably not limited to: 1) Large page versus small page NAND 2) Broken or poorly performing hardware, especially ECC generation and write verification 3) Adjacent write effect Some interesting properties to take into account when designing a NAND file system: 1) No block can be assumed good, which means you have to scan the device to find your metadata starting point at boot time. 2) Small page NAND has less 'spare' available in the spare region than large page NAND, which means that you can do optimizations for large page nand that you can't for small. 3) write-back caching of writes makes NAND parts less reliable From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 18:48:55 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6F3CA106567C for ; Mon, 31 Mar 2008 18:48:55 +0000 (UTC) (envelope-from qpadla@gmail.com) Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.184]) by mx1.freebsd.org (Postfix) with ESMTP id 793228FC22 for ; Mon, 31 Mar 2008 18:48:53 +0000 (UTC) (envelope-from qpadla@gmail.com) Received: by nf-out-0910.google.com with SMTP id b2so847240nfb.33 for ; Mon, 31 Mar 2008 11:48:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:from:reply-to:to:subject:date:user-agent:cc:references:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:message-id; bh=fYuXiG8PASQrrvp1e4aPEK0LeTgJiB5x/UpFL5GGeZk=; b=JELOKyuPOnJtTETL8fnFZknqqtREY+xgmmGOMVD4Tv8G8grvVEc7H13fWsVdvCE2VvmKNpXsZCimHRQz7ZHkYeSBbZCwnwG7sQ8MMBUpNsQ1zUHghDFu+HrZwPuSHe9YUZ7miJ7dQuEIUm5pOWLpA4eixbZ6ZeJGMlMnjNxotrs= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=from:reply-to:to:subject:date:user-agent:cc:references:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:message-id; 
b=sCAhc/LqZ6+A116H/ioYLETLNsFd5WuCQ+WDnaGDO01UEGbOXtgjNfumzYXoF8A5wYnwliceDXLG7UsXk8lPZYj4seFstn0vv9QSwMhkwzFDKe7NDoAjGaa+6KAQqbQDgicmfC5KVMAODzSIoEgoBb01Sv6c5BBtq3BAU6x2Zsg= Received: by 10.78.182.17 with SMTP id e17mr22674199huf.57.1206987879793; Mon, 31 Mar 2008 11:24:39 -0700 (PDT) Received: from atlas ( [89.162.141.1]) by mx.google.com with ESMTPS id d23sm865337nfh.12.2008.03.31.11.24.37 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 31 Mar 2008 11:24:38 -0700 (PDT) From: Nikolay Pavlov To: freebsd-arch@freebsd.org Date: Mon, 31 Mar 2008 21:25:28 +0300 User-Agent: KMail/1.9.6 (enterprise 0.20070907.709405) References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803312125.29325.qpadla@gmail.com> Cc: Christopher Arnold , arch@freebsd.org, Martin Fouts Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: qpadla@gmail.com List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 18:48:55 -0000

On Monday 31 March 2008 20:36:09 Martin Fouts wrote:
> The MTD based file
> system jffs2 is an example of the third, and a cautionary tale for those
> who would write their own.

Interested parties may find this information useful:
http://kerneltrap.org/Linux/UBI_File_System
It relates to a new flash file system developed by Nokia engineers.

--
======================================================================
- Best regards, Nikolay Pavlov.
<<<----------------------------------- ====================================================================== From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 19:15:41 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 09714106566C for ; Mon, 31 Mar 2008 19:15:41 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id D05288FC17 for ; Mon, 31 Mar 2008 19:15:40 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VJFSqj027594; Mon, 31 Mar 2008 12:15:28 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VJFSoR027593; Mon, 31 Mar 2008 12:15:28 -0700 (PDT) Date: Mon, 31 Mar 2008 12:15:28 -0700 (PDT) From: Matthew Dillon Message-Id: <200803311915.m2VJFSoR027593@apollo.backplane.com> To: qpadla@gmail.com References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> Cc: Christopher Arnold , arch@freebsd.org, Martin Fouts , freebsd-arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 19:15:41 -0000 This is all very good information. I was unaware of the adjacent write effect, but it makes sense considering the cell size. Hard drives have a similar effect (it's one of the limiting factors for density). Hamming codes (ECC codes) are very fragile beasts. 
While they are in the same family as a CRC it is a really bad idea to try to use the ECC code as your CRC, which is why I recommended against it in my previous posting. A two-bit-detect/one-bit-correction code is utterly trivial to work with (both generating it and using it)... I've done such codes on 8-bit CPUs. Their fragility can be surprising to anyone who has never worked with them.

I've written numerous filesystems, including a NOR flash filesystem (whose characteristics are somewhat different due to the availability of byte-write). In my opinion, designing a filesystem *specifically* for NAND flash is a mistake because the technology is rapidly evolving and such a filesystem would wind up being obsolete in fairly short order. For example, the simple addition of some front-end non-volatile cache, such as a dime-cap-backed static ram, would have a very serious effect on any such filesystem design.

It is far, far better to design the filesystem around generally desired characteristics, such as good write locality of reference (though, again, indices still have to be updated and those usually do not have good locality of reference).

DragonFly's HAMMER has pretty good write-locality of reference but still does random updates for B-Tree indices and things like the mtime and atime fields. It also uses numerous blockmaps that could make direct use of a flash sector-mapping translation layer (1). It might be adaptable.

(1) A flash sector-mapping translation layer gives a filesystem the ability to use 'named block numbers'. For example, the NOR filesystem I did used 32 bit named block numbers regardless of the size of the flash (which was typically only 2MB). The filesystem topology was actually encoded into the block number itself. In other words, the filesystem is not bound to a linear range of block numbers it is simply bound

What does this mean?
This means that what you really want to do is not necessarily write a filesystem that is explicitly designed for NAND operation, but instead write a filesystem that is explicitly designed to run on top of an abstracted topology (such as one where you can have named block numbers), and which generally has the desired features for locality of reference. Such a filesystem would not become obsolete anywhere near as quickly as a nand-specific filesystem would and rebuilding an abstracted topology (whos underlying code would become obsolete as the technology changes) is a whole lot easier then redesigning a filesystem. I am quite partial to the named-block concept, I really think it's the best way to go for flash filesystem design. The flash already has to have a sector-translation mechanism, making the jump to a full blown named-block model is only a small additional step.
-Matt

From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 19:51:45 2008
Date: Mon, 31 Mar 2008 12:51:35 -0700
From: "Martin Fouts"
To: "Matthew Dillon"
Cc: Christopher Arnold, arch@freebsd.org, freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
In-Reply-To: <200803311915.m2VJFSoR027593@apollo.backplane.com>

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
>
> Hamming codes (ECC codes) are very fragile beasts. While they are in the same family as a CRC it is a really bad idea to try to use the ECC code as your CRC which is why I recommended against it in my previous posting.

True, but when you're working with a part that does ECC in HW, you're stuck with the ECC it does.

> I've written numerous filesystems, including a NOR flash filesystem (whose characteristics are somewhat different due to the availability of byte-write). In my opinion, designing a filesystem *specifically* for NAND flash is a mistake because the technology is rapidly evolving and such a filesystem would wind up being obsolete in fairly short order.

Well, those of us who are shipping devices with flash parts in them have a somewhat different view on that, which is why I've worked on three NAND specific file systems in the last four years. Two of those are in use in shipping devices, and are expected to be in use for five or more years.

> For example, the simple addition of some front-end non-volatile cache, such as a dime-cap-backed static ram, would have a very serious effect on any such filesystem design.

Yes. However, since the phone market makes such a change very unlikely, because of cost pressures, it's not one we take into consideration.

> It is far far better to design the filesystem around generally desired characteristics, such as good write locality of reference (though, again, indices still have to be updated and those usually do not have good locality of reference).

You've talked yourself into pretty much the same mistake that led to jffs2, which turned out to be a terrible idea.

> DragonFly's HAMMER has pretty good write-locality of reference but still does random updates for B-Tree indices and things like the mtime and atime fields. It also uses numerous blockmaps that could make direct use of a flash sector-mapping translation layer (1). It might be adaptable.

You are pretty much describing the data structures that have made jffs2 such a poor performer.

> (1) A flash sector-mapping translation layer gives a filesystem the ability to use 'named block numbers'. For example, the NOR filesystem I did used 32 bit named block numbers regardless of the size of the flash (which was typically only 2MB). The filesystem topology was actually encoded into the block number itself. In other words, the filesystem is not bound to a linear range of block numbers, it is simply bound

Works OK for NOR. Has interesting problems, mainly with maintaining the block number map reliably in storage, when attempted on NAND.

> What does this mean? This means that what you really want to do is not necessarily write a filesystem that is explicitly designed for NAND operation, but instead write a filesystem that is explicitly designed to run on top of an abstracted topology (such as one where you can have named block numbers), and which generally has the desired features for locality of reference. Such a filesystem would not become obsolete anywhere near as quickly as a NAND-specific filesystem would, and rebuilding an abstracted topology (whose underlying code would become obsolete as the technology changes) is a whole lot easier than redesigning a filesystem.

There's really only one topology that's efficient for a NAND device, and that's to do log-like writing coupled with garbage collection.

> I am quite partial to the named-block concept. I really think it's the best way to go for flash filesystem design. The flash already has to have a sector-translation mechanism, so making the jump to a full blown named-block model is only a small additional step.

The devil in the details of your naming scheme turns out to be managing the name translation information within the NAND storage itself.
This is the source of significant performance problems in jffs2, for example, and accounts for a huge amount of code complexity in the commercial system I work with.

From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 20:06:30 2008
Date: Mon, 31 Mar 2008 13:06:10 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200803312006.m2VK6Aom028133@apollo.backplane.com>
To: "Martin Fouts"
Cc: Christopher Arnold, arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics

:You've talked yourself into pretty much the same mistake that led to jffs2, which turned out to be a terrible idea.

I'm not familiar with jffs2 but a blockmap abstraction in and of itself just doesn't have the terrible characteristics you are implying.
The implementations might have been bad but the concept is quite sound.

Here's a question. Ok, so the best write performance is to essentially append to the NAND device. That's fairly obvious, though as long as you are able to fully complete a page it doesn't really matter where the data goes. So the main issue is being able to complete a page (since you can't rewrite it).

But how do you index that information? You can't simply append the information to the NAND unless you also have a way to access it. So does the filesystem have to scan the NAND (or significant portions of it) in order to build an index of the filesystem topology in system memory?

No matter what you do you have to index the information *SOMEWHERE*. That somewhere is either going to be in-NAND or in-memory or some combination of the two. If it is entirely in-memory you have to scan the auxiliary information in nearly the entire NAND array to build your index. If it is entirely in-NAND you have a significant updating problem.

A named-block model, done right, can serve as the index. That is, it is exactly the same problem just viewed from a different angle. A named-block model does not necessarily imply that the indexing topology has to be stored entirely in-NAND, it does not imply any sort of linear array, and it does not imply any random-updating requirement.

I don't know what the jffs2 folks did but you shouldn't take their performance failure as an indication that the general concept is incorrect.
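The "entirely in-memory" end of the spectrum described above can be sketched as a mount-time scan. This is only an illustration, not how jffs2 or any particular flash filesystem does it: the layout of the per-page auxiliary record (a block name plus a sequence number) and the `build_index` helper are assumptions introduced here.

```python
def build_index(pages):
    """Rebuild a name -> physical page index by scanning auxiliary data.

    `pages` is an iterable of (physical_page, aux) pairs. `aux` is None for
    an erased page, otherwise a (block_name, sequence) tuple, where the
    sequence number (an assumption of this sketch) orders successive copies
    of the same named block.
    """
    newest = {}  # block name -> (physical page, sequence) of the newest copy
    for phys, aux in pages:
        if aux is None:
            continue                      # erased page carries no name
        name, seq = aux
        seen = newest.get(name)
        if seen is None or seq > seen[1]:
            newest[name] = (phys, seq)    # later copy supersedes earlier one
    return {name: phys for name, (phys, _seq) in newest.items()}
```

The cost is one pass over the auxiliary area of every page, which is why the scan scales with the size of the array rather than with the amount of live data.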
-Matt

From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 21:34:30 2008
Date: Mon, 31 Mar 2008 14:34:29 -0700
From: "Martin Fouts"
To: "Matthew Dillon"
Cc: Christopher Arnold, arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
In-Reply-To: <200803312006.m2VK6Aom028133@apollo.backplane.com>

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Monday, March 31, 2008 1:06 PM
> To: Martin Fouts
> Cc: qpadla@gmail.com; freebsd-arch@freebsd.org; Christopher Arnold; arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics
>
> But how do you index that information? You can't simply append the information to the NAND unless you also have a way to access it. So does the filesystem have to scan the NAND (or significant portions of it) in order to build an index of the filesystem topology in system memory?
>
> No matter what you do you have to index the information *SOMEWHERE*.

And NAND devices have a *SOMEWHERE* that makes them different than other persistent storage devices in ways that make them interesting to do file systems for.

It's not _that_ you have to scan the NAND, by the way, it's _when_ you scan the NAND that has the major impact on performance.
From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:20:08 2008
Date: Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200803312219.m2VMJlkT029240@apollo.backplane.com>
To: "Martin Fouts"
Cc: Christopher Arnold, arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics

:True, but when you're working with a part that does ECC in HW, you're stuck with the ECC it does.

Well, I suppose if you can't get access to the original data *OR* the HW ECC code (to undo the broken correction) you would wind up with an uncorrectable double error (a failed ECC correction always makes things worse rather than better, particularly a 1 bit correct / 2 bit detect Hamming code).
If you DO have access to the original data or the HW ECC code you can undo the failed correction and then ignore the hardware ECC and do your own in the auxiliary storage.

I can see why people would want the hardware to do the data validation, it's a performance issue in many respects, but if the hardware is only doing a simple ECC and doesn't do a separate CRC it will do more harm than good and simply cannot be depended upon for anything.

:...
:    (some stuff I reordered later on below so I can get this out of the way)
:...
:> For example, the simple addition of some front-end non-volatile cache, such as a dime-cap-backed static ram, would have a very serious effect on any such filesystem design.
:
:Yes. However since the phone market makes such a change very unlikely, because of cost pressures, it's not one we take into consideration.

For flash storage systems competitive with hard drive storage, that is, any flash storage device of significant size (e.g. 64GB-1TB or more), the incremental cost of adding non-volatile front-end cache ram is going to be in the $1-$3 range. If that vendor can then advertise a performance advantage over their competitors they can easily price the drive to match the incremental cost. Very easily. This is what happened to the HD market.

The economics that drive front-end cache implementations for flash SSD are going to be the same economics that drive front-end cache implementations for hard drives. All hard drives these days have at least 8MB of sector cache (and many have 32MB or more), so the writing is on the wall. Any filesystem you design which does not take into account a front-end cache is going to be obsolete in probably less than 2 years (if not already).

For the phone market? You mean small flash storage devices? Performance is almost irrelevant there and you certainly do not need a front-end cache, or much of one.
A sector translation model for a small flash storage device (< 2G) is utterly trivial to implement but so is the log-append model. There is going to be a huge scaling difference between the two when you get into large amounts of storage.

:Well, those of us who are shipping devices with flash parts in them have a somewhat different view on that, which is why I've worked on three NAND specific file systems in the last four years. Two of those are in use in shipping devices, and are expected to be in use for five or more years.

Three in four years? Is that an illustration of my point with regards to flash filesystem design? Ok, that was a joke :-) But I don't think we can count small flash storage systems. Both models devolve into trivialities when you are managing small amounts of flash storage.

:...
:reference).
:
:You've talked yourself into pretty much the same mistake that led to jffs2, which turned out to be a terrible idea.
:
:> DragonFly's HAMMER has pretty good write-locality of reference but still does random updates for B-Tree indices and things like the mtime and atime fields. It also uses numerous blockmaps that could make direct use of a flash sector-mapping translation layer (1). It might be adaptable.
:
:You are pretty much describing the data structures that have made jffs2 such a poor performer.
:...
:Works OK for NOR. Has interesting problems, mainly with maintaining the block number map reliably in storage, when attempted on NAND.
:...
:The devil in the details of your naming scheme turns out to be managing the name translation information within the NAND storage itself. This is the source of significant performance problems in jffs2, for example, and accounts for a huge amount of code complexity in the commercial system I work with.
Again, I am not familiar with jffs2 but you are painting with a very broad brush what is more than likely an issue specifically with the jffs2 design and not with the concept of using named blocks in general.

I understand where you are coming from. Regardless of the model you use you have to index the data somehow. What you are advocating is a filesystem which uses an absolute sector referencing scheme. Any change made to the filesystem requires a new page to essentially be appended to the flash storage. In order to properly index the information and maintain the filesystem topology you also have to recopy *ALL* pages containing references to the updated absolute sector in order to repoint them to the new absolute sector. The root of your filesystem winds up being the last page appended, in simple terms. While some modifications to this scheme are possible, it's pretty much the way you have to do things if you use that model.

I really understand that model, and it has the advantage of simplicity but it also has some severe disadvantages when used as a general purpose filesystem (versus an embedded filesystem), not the least of which is that a single update can result in a chain reaction that requires considerably more write bandwidth, considerably more garbage collection, and some extra (but probably minor) wear of the flash.

In contrast, if a filesystem is referencing named blocks and you have to move a block (either due to an error or a modification of that block through normal filesystem activity), NO changes need to be made to those elements of the filesystem that pointed to the block that got moved. All you have to do is append the new block that is renaming the old one, which includes the name (aka 64 bit quantity) in its auxiliary data area, and cache the change in the translation in system memory until you decide to flush out the named block index (which I will describe a bit later on)... that's non-critical information, by the way, and does not have to be written synchronously in order to be crash recoverable.

Write bandwidth is greatly reduced, particularly because when using a named block you only have to flush the actual modified page to the flash and nothing else other than a topological rollup record (which I will describe a bit later on).

This works particularly well with a filesystem designed to use named blocks because there are *NO* indirect blockmaps to reference data or inodes in the filesystem. An absolute-sector-based filesystem has blockmaps, e.g. to locate a block in a file. In a named-block filesystem the blockmap *IS* the named block. That is, the 'name' of the named block is effectively the inode number and file block number combined into one 64 bit (or larger) key.

Let me be clear about this distinction. In a filesystem that references absolute sectors an append to a file requires (typically) updating a blockmap of absolute sectors which in turn requires the blockmap block to be rewritten, along with any reference to it and so on and so forth up the chain. In a named-block filesystem appending to a file simply means writing out a new named block. The filesystem itself has NO concept of a blockmap... the blockmap is built into the sector translation layer.

In other words, a filesystem using the named-block model is not any more complex than a filesystem using an absolute sector numbering model, and a filesystem using the named-block model is far easier from the perspective of caching changes in system memory without requiring a sync to flash for crash recovery. That is a huge deal.

Now is there some work involved with making the named block translations efficient? Yes, there is some... but it is really not much more complex than the work involved in an absolute-sector-based filesystem which must index files, directories, and so forth within the filesystem itself.
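To make the "the name *IS* the blockmap" point concrete, here is a small sketch of such a key. The 32/32 bit split between inode number and file block number is an assumption made for illustration (the message only says the two are combined into a 64 bit or larger key), and `block_key`, `split_key`, and `append_block` are hypothetical helpers, with a plain dict standing in for the translation layer.

```python
INODE_BITS = 32   # assumed split; the message does not specify one
BLOCK_BITS = 32


def block_key(inode, file_block):
    """Combine an inode number and a file block number into one 64-bit name."""
    if not (0 <= inode < 1 << INODE_BITS and 0 <= file_block < 1 << BLOCK_BITS):
        raise ValueError("inode or file_block out of range")
    return (inode << BLOCK_BITS) | file_block


def split_key(key):
    """Recover (inode, file_block) from a 64-bit name."""
    return key >> BLOCK_BITS, key & ((1 << BLOCK_BITS) - 1)


def append_block(store, inode, nblocks, payload):
    """Append one block to a file that currently has `nblocks` blocks.

    Only the new named block is written; no per-file blockmap exists to
    update. Returns the new block count.
    """
    store[block_key(inode, nblocks)] = payload
    return nblocks + 1
```

Because keys for one inode are contiguous, an ordered index over the names also gives the blocks of a file in order, which is the role an embedded blockmap plays in an absolute-sector design.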
In particular, when using a named-block model you still have to occasionally flush out the translation topology to the flash media. Since this topology references physical block numbers it, in fact, uses exactly the SAME mechanism that the absolute filesystem model used to maintain its topology. In other words, it is no more complex than the absolute filesystem model.

The big difference is that the translation topology does not have to be written synchronously and the frequency of the rollup writes is based ENTIRELY on how much system memory you are willing to dedicate to caching topological changes. E.g. if you dedicate, say, 100KB of system memory you can store the topology for, say, 3200 filesystem updates (using 32 byte structures) before you have to 'flush' it to the flash. A filesystem based on absolute blocks pretty much has to cache the related (modified) blocks in memory, which are far larger, and thus must flush them to storage far more frequently. But translations are tiny little records... tens of gigabytes worth of updates can be cached in a small amount of system memory.

The translation topology does NOT have to be synced to disk on fsync() because all the information can be recovered when the filesystem is mounted after a reboot. That is critical.

Going back to the absolute filesystem model, such a filesystem does have the advantage of locality of reference (that is, not so much seek-wise, which is irrelevant for flash, but more from the point of view of being able to chain to the desired information). A filesystem using the named-block model must look up the block, typically in a global index, which means it must maintain a cache of pointers into its translation tree. This is a little more expensive when looking up inodes but, actually, I use a very similar scheme in HAMMER (which has a global B-Tree for everything) and the caching required is so simple that it just becomes a NOP.
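As an aside on the memory budget quoted above: 100KB of 32 byte records is 102400 / 32 = 3200 records. A rough sketch of a cache that rolls up once that budget is used is shown below. The `TranslationCache` class and its `flush` callback are hypothetical, and coalescing repeated updates to the same name in a dict is a simplification of this sketch, not something the message specifies.

```python
RECORD_SIZE = 32                            # bytes per translation record (from the text)
CACHE_BUDGET = 100 * 1024                   # 100KB of system memory (from the text)
MAX_RECORDS = CACHE_BUDGET // RECORD_SIZE   # 3200 records


class TranslationCache:
    """Holds name -> sector changes in memory and rolls them up in batches."""

    def __init__(self, flush, max_records=MAX_RECORDS):
        self.pending = {}          # block name -> newest physical sector
        self.flush = flush         # callback that writes a rollup record to flash
        self.max_records = max_records

    def record(self, name, sector):
        """Note that `name` now lives at `sector`; flush when the budget is used."""
        self.pending[name] = sector
        if len(self.pending) >= self.max_records:
            self.rollup()

    def rollup(self):
        """Write all pending translations out as one batch and clear the cache."""
        if self.pending:
            self.flush(dict(self.pending))
            self.pending.clear()
```

Nothing here is forced by fsync(): losing the pending records in a crash only means the mount-time scan has more auxiliary data to replay.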
Just storing a cached absolute sector number in the in-memory inode structure for use as a starting point when looking up elements related to that inode winds up being no less efficient than embedding a blockmap in an inode as you see in a more typical filesystem.

In any case, I hope this clarifies the issues. I really do understand where you are coming from, the simplicity of chaining the physical topology cannot be denied, and I like the elegance, but I hope I've shown that it is not actually simplifying the overall design much over a named block scheme, and that there are some fairly severe issues that can crop up that are complete non-issues when using a named block scheme. Long winded, I know.

-Matt Matthew Dillon
From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:21:57 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D2B29106566B for ; Mon, 31 Mar 2008 22:21:57 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (ns1.bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id C3C888FC19 for ; Mon, 31 Mar 2008 22:21:56 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id C976C5B50; Mon, 31 Mar 2008 15:21:54 -0700 (PDT) To: Matthew Dillon In-reply-to: Your message of "Mon, 31 Mar 2008 13:06:10 PDT." <200803312006.m2VK6Aom028133@apollo.backplane.com> Date: Mon, 31 Mar 2008 15:21:54 -0700 From: Bakul Shah Message-Id: <20080331222154.C976C5B50@mail.bitblocks.com> Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org, Martin Fouts Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 22:21:57 -0000 On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon wrote: > But how do you index that information? You can't simply append the > information to the NAND unless you also have a way to access it. So > does the filesystem have to scan the NAND (or significant portions of it) > in order to build an index of the filesystem topology in system memory? One possible way: I'd design the system so that each update ends with the write of a root block[1]. I'd also write root blocks at fixed locations to find them easily without having to scan the whole disk. Given this, on reboot use binary search to locate the latest root block at a fixed location.
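The binary-search step just described can be sketched as follows. This is only an illustration under assumptions that the message does not state: the fixed locations are modeled as a ring of slots written in order, each written slot holds a strictly increasing sequence number, an erased slot reads back as None, and read_seq is a hypothetical callback that reads one slot.

```python
# Hypothetical illustration only; the slot layout and names are assumptions.
def latest_root_slot(read_seq, nslots):
    """Return the index of the fixed slot holding the newest root block.

    Written in ring order, the slots whose sequence number is >= slot 0's
    form a prefix of the ring, and the newest root block is the last slot
    of that prefix. Returns None if no root block was ever written.
    """
    first = read_seq(0)
    if first is None:
        return None
    lo, hi = 0, nslots - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        seq = read_seq(mid)
        if seq is not None and seq >= first:
            lo = mid          # mid is still inside the prefix
        else:
            hi = mid - 1      # mid is erased or holds an older generation
    return lo


slots = [8, 9, 5, 6, 7]       # ring has wrapped; newest sequence number is 9
print(latest_root_slot(lambda i: slots[i], len(slots)))
```

This reads only O(log n) slots at mount time; the forward scan for later updates would then start from the block that the chosen slot refers to.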
There may be further updates so scan forward until you locate the most up-to-date root block and once you have that, you are home free! Everything before that root block will be consistent with it. Even if the system crashes in the middle of a compacting GC, the design should be able to recover all data. What I am not sure about is whether one can do incremental GC. A stop-and-copy GC is always possible but I don't like the idea of long pauses. [1] The root block contains block # of the earliest valid block, a sequence number (that will not roll over in device's lifetime), block #s for various structures such as the root of inodes, superblock, freelist if any, etc. From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:23:42 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 753B71065673; Mon, 31 Mar 2008 22:23:42 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 3B5268FC13; Mon, 31 Mar 2008 22:23:41 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (unknown [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id DCFB017104; Mon, 31 Mar 2008 22:23:39 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2VMNbjD026081; Mon, 31 Mar 2008 22:23:38 GMT (envelope-from phk@critter.freebsd.dk) To: Bakul Shah From: "Poul-Henning Kamp" In-Reply-To: Your message of "Mon, 31 Mar 2008 15:21:54 MST."
<20080331222154.C976C5B50@mail.bitblocks.com> Date: Mon, 31 Mar 2008 22:23:37 +0000 Message-ID: <26080.1207002217@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Christopher Arnold , Martin Fouts , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 22:23:42 -0000 In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes: >On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon wrote: >> But how do you index that information? You can't simply append the >> information to the NAND unless you also have a way to access it. So >> does the filesystem have to scan the NAND (or significant portions of it) >> in order to build an index of the filesystem topology in system memory? > >One possible way: > >I'd design the system so that each update ends with the write >of a root block[1]. This is sort of the approach Margo Seltzer used for her (Kludge-)LFS it has many drawbacks, in particular when it comes to recovery. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:29:39 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 308DF1065673; Mon, 31 Mar 2008 22:29:39 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 22D468FC28; Mon, 31 Mar 2008 22:29:38 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id A779A1A4D8D; Mon, 31 Mar 2008 15:29:38 -0700 (PDT) Date: Mon, 31 Mar 2008 15:29:38 -0700 From: Alfred Perlstein To: Poul-Henning Kamp Message-ID: <20080331222938.GS95731@elvis.mu.org> References: <20080331222154.C976C5B50@mail.bitblocks.com> <26080.1207002217@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <26080.1207002217@critter.freebsd.dk> User-Agent: Mutt/1.4.2.3i Cc: Christopher Arnold , Martin Fouts , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 22:29:39 -0000 * Poul-Henning Kamp [080331 15:24] wrote: > In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes: > >On Mon, 31 Mar
2008 13:06:10 PDT Matthew Dillon wrote: > >> But how do you index that information? You can't simply append the > >> information to the NAND unless you also have a way to access it. So > >> does the filesystem have to scan the NAND (or significant portions of it) > >> in order to build an index of the filesystem topology in system memory? > > > >One possible way: > > > >I'd design the system so that each update ends with the write > >of a root block[1]. > > This is sort of the approach Margo Seltzer used for her (Kludge-)LFS > it has many drawbacks, in particular when it comes to recovery. Can you explain why? I could see it being a problem because recovering the filesystem's most recent change might require significant scanning? -- - Alfred Perlstein
From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:38:47 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C6F401065671; Mon, 31 Mar 2008 22:38:47 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (mail.bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id AD7E08FC19; Mon, 31 Mar 2008 22:38:47 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id CFD975BAE; Mon, 31 Mar 2008 15:38:46 -0700 (PDT) To: "Poul-Henning Kamp" In-reply-to: Your message of "Mon, 31 Mar 2008 22:23:37 -0000."
<26080.1207002217@critter.freebsd.dk> Date: Mon, 31 Mar 2008 15:38:46 -0700 From: Bakul Shah Message-Id: <20080331223846.CFD975BAE@mail.bitblocks.com> Cc: Christopher Arnold , Martin Fouts , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 22:38:47 -0000 On Mon, 31 Mar 2008 22:23:37 -0000 "Poul-Henning Kamp" wrote: > In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes: > >On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon > wrote: > >> But how do you index that information? You can't simply append the > >> information to the NAND unless you also have a way to access it. So > >> does the filesystem have to scan the NAND (or significant portions of > it) > >> in order to build an index of the filesystem topology in system memory > ? > > > >One possible way: > > > >I'd design the system so that each update ends with the write > >of a root block[1]. > > This is sort of the approach Margo Seltzer used for her (Kludge-)LFS > it has many drawbacks, in particular when it comes to recovery. [Poul, use positive encouragement and you'd inspire a lot more people!] Note that in effect this is exactly what zfs does. Update of any block implies finding a new place for the updated copy, which means the block pointing to it must be also updated, which means a new place for it etc. etc. But hey, I spent just a few minutes sketching out the idea so it is possible I missed a whole bunch of things! If I was actually implementing this (which I am tempted to...) I'd certainly want to know what others did. One thing I forgot to add: I'd let the lower level handle bad block forwarding and wear levelling (like on the m-tron device). 
From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:54:38 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DA727106568B; Mon, 31 Mar 2008 22:54:37 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id BCBD58FC1A; Mon, 31 Mar 2008 22:54:37 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VMsQqB029550; Mon, 31 Mar 2008 15:54:26 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VMsPqZ029549; Mon, 31 Mar 2008 15:54:25 -0700 (PDT) Date: Mon, 31 Mar 2008 15:54:25 -0700 (PDT) From: Matthew Dillon Message-Id: <200803312254.m2VMsPqZ029549@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com>
<200803312006.m2VK6Aom028133@apollo.backplane.com> Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 22:54:38 -0000

:> No matter what you do you have to index the information
:> *SOMEWHERE*.
:
:And NAND devices have a *SOMEWHERE* that makes them different than other
:persistent storage devices in ways that make them interesting to do file
:systems for.
:
:It's not _that_ you have to scan the NAND, by the way, it's _when_ you
:scan the NAND that has the major impact on performance.

I know where you are coming from there. The flash filesystem I did for our telemetry product (20 years ago, which is still in operation today) uses a named-block translation scheme but simply builds the topology out in main memory when the filesystem is mounted. These are small flash devices, two 1 MByte NOR chips if I remember right. It just scans the translation table which is just a linear array and builds the topology in ram, which takes maybe a few milliseconds to do on boot and after that, zero cost.

Of course, that was for a small flash device so I could get away with it. And it was NOR so the translation table was all in one place and could be trivially scanned and updated.

I have a similar issue in HAMMER. HAMMER is designed as a multi-terabyte filesystem. HAMMER isn't a flash filesystem but it effectively uses a naming mechanic to locate inodes and data, so the problem is similar. I was really worried about this mechanic as compared to, say, UFS, where the absolute location of the on-disk inode can be directly calculated from the inode number. HAMMER has to look the inode number up in the global B-Tree.
Even though it's a 15-way B-Tree (meaning it is fairly shallow and has good locality of reference in the buffer cache), I was really worried about performance so I implemented a B-Tree pointer cache in the in-memory inode structure. So, e.g. if you look up a filename in a directory the directory inode cached a pointer into the B-Tree 'near' the directory inode element, and another for the most recent inode number it looked up. These cached pointers then served as a heuristic starting point for the B-Tree lookup to locate the file in the directory and the inode number. Well, to shorten the description... the overhead of having to do the lookup turned out to not matter at all with the cache in place. Even better, since an inode's data blocks (and other information) are also indexed in the global B-Tree, the same cache also served for accesses into the file or directory itself. Whatever overhead might have been added from having to look up the inode was completely covered by the savings of not having to run through a multi-level blockmap like FFS does. In many cases a B-Tree search is so well localized that it doesn't even have to leave the node. (p.s. when I talk about localization here, I'm talking about in-memory disk cache, not seek localization). In any case, this is why I just don't worry about named-block translations. If one had a filesystem-layer blockmap AND named-block translations it could be pretty nasty due to the updating requirements. But if the filesystem revolves entirely around named-block translations and did not implement any blockmaps of its own, the only thing that happens is that some overheads that used to be in one part of the filesystem are now in another part instead, resulting in a net zero. HAMMER actually does implement a couple of blockmaps in addition to its global B-Tree, but in the case of HAMMER the blockmap is mapping *huge* physical translations... 8MB per block.
They aren't named blocks like the B-Tree, but instead a virtualized address space designed to localize records, B-Tree nodes, large data blocks, and small data blocks. It's a different sort of blockmap than what one typically hears described for a filesystem... really more of an allocation space. I do this for several implementation reasons, most specifically because HAMMER is designed for a hard disk and seek locality of reference is important, but also so information can be relocated in 8MB chunks to be able to add and remove physical storage. If I were reimplementing HAMMER as a flash filesystem (which I am NOT doing), I would probably do away with the blockmap layer entirely since seek locality of reference is not needed for a flash filesystem, and the global B-Tree would serve directly as the named-block topology. Kinda cool to think about. -Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:54:48 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5BDBB1065702 for ; Mon, 31 Mar 2008 22:54:48 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id 43B748FC12 for ; Mon, 31 Mar 2008 22:54:48 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id C976C5B50; Mon, 31 Mar 2008 15:21:54 -0700 (PDT) To: Matthew Dillon In-reply-to: Your message of "Mon, 31 Mar 2008 13:06:10 PDT."
<200803312006.m2VK6Aom028133@apollo.backplane.com> Date: Mon, 31 Mar 2008 15:21:54 -0700 From: Bakul Shah Message-Id: <20080331222154.C976C5B50@mail.bitblocks.com> Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org, Martin Fouts Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 22:54:48 -0000 On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon wrote: > But how do you index that information? You can't simply append the > information to the NAND unless you also have a way to access it. So > does the filesystem have to scan the NAND (or significant portions of it) > in order to build an index of the filesystem topology in system memory? One possible way: I'd design the system so that each update ends with the write of a root block[1]. I'd also write root blocks at fixed locations to find them easily without having to scan the whole disk. Given this, on reboot use binary search to locate the latest root block at a fixed location. There may be further updates so scan forward until you locate the most up-to-date root block and once you have that, you are home free! Everything before that root block will be consistent with it. Even if the system crashes in the middle of a compacting GC, the design should be able to recover all data. What I am not sure about is whether one can do incremental GC. A stop-and-copy GC is always possible but I don't like the idea of long pauses. [1] The root block contains block # of the earliest valid block, a sequence number (that will not roll over in the device's lifetime), block #s for various structures such as the root of inodes, superblock, freelist if any, etc.
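
A minimal sketch of the root-block lookup described above, not taken from any real filesystem. It assumes a hypothetical in-memory model of the device: a list of blocks in which a root block is the tuple ("root", seq), a root block is present at every fixed location (every `stride` blocks) that the log has reached, and writing starts at block 0. The names `root_seq` and `find_latest_root` are invented for illustration.

```python
def root_seq(block):
    """Sequence number of a root block, or -1 for data or erased blocks."""
    if block is not None and block[0] == "root":
        return block[1]
    return -1


def find_latest_root(dev, stride):
    """Return the index of the newest root block on dev, or None if empty."""
    nslots = len(dev) // stride
    first = root_seq(dev[0]) if nslots else -1
    if first < 0:
        return None  # nothing was ever committed

    # At the fixed slots, sequence numbers rise from slot 0 up to the newest
    # slot and are lower after it (erased, or left over from an older lap of
    # a circular log). Binary-search for the last slot with seq >= first.
    lo, hi = 0, nslots - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if root_seq(dev[mid * stride]) >= first:
            lo = mid
        else:
            hi = mid - 1
    best = lo * stride

    # Scan forward from that slot for root blocks appended after it.
    for i in range(best + 1, min(best + stride, len(dev))):
        if root_seq(dev[i]) > root_seq(dev[best]):
            best = i
    return best
```

The binary search touches only the fixed slots, so its cost grows with the logarithm of the number of slots; the forward scan is bounded by one stride.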
From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 23:06:29 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 892F4106566C for ; Mon, 31 Mar 2008 23:06:29 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 6D0C48FC25 for ; Mon, 31 Mar 2008 23:06:29 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VN6Smg029759; Mon, 31 Mar 2008 16:06:28 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VN6SRa029758; Mon, 31 Mar 2008 16:06:28 -0700 (PDT) Date: Mon, 31 Mar 2008 16:06:28 -0700 (PDT) From: Matthew Dillon Message-Id: <200803312306.m2VN6SRa029758@apollo.backplane.com> To: Bakul Shah References: <20080331223846.CFD975BAE@mail.bitblocks.com> Cc: Christopher Arnold , Martin Fouts , qpadla@gmail.com, arch@freebsd.org, Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 23:06:29 -0000 :[Poul, use positive encouragement and you'd inspire a lot more :people!] : :Note that in effect this is exactly what zfs does. Update of :any block implies finding a new place for the updated copy, :which means the block pointing to it must be also updated, :which means a new place for it etc. etc. : :But hey, I spent just a few minutes sketching out the idea so :it is possible I missed a whole bunch of things! If I was :actually implementing this (which I am tempted to...) 
I'd :certainly want to know what others did. : :One thing I forgot to add: I'd let the lower level handle bad :block forwarding and wear levelling (like on the m-tron :device). This is my understanding of what ZFS does too, and I considered it when I was designing HAMMER. I ultimately decided not to go that route because I was worried it would destroy seek-locality-of-reference on-disk (i.e. read/access performance). Seek locality of reference is of course very important for a disk-based filesystem but not so important for a flash-based filesystem. The one hard part I have left to do in HAMMER is the UNDO meta-data log. Or, more precisely, the recover-on-mount code for the UNDO meta-data log. Everything else is done and working. I knew it would be the hardest part of the filesystem when I ultimately decided not to go ZFS's route. The UNDO log is basically one seek-write per fsync or whenever the filesystem is flushed (every 30 seconds on BSDs)... not too bad, particularly because it stores only meta-data changes and not data-changes. Ultimately I think I can make it worthwhile by including data elements for small seek/write/fsync sequences in the UNDO record and just syncing it, which would be awesome for database applications. I have no immediate plans to do that right now, though. 
-Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 23:18:30 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 127CF106566C; Mon, 31 Mar 2008 23:18:30 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id C4A498FC27; Mon, 31 Mar 2008 23:18:29 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id 26D865B50; Mon, 31 Mar 2008 16:18:29 -0700 (PDT) To: Alfred Perlstein In-reply-to: Your message of "Mon, 31 Mar 2008 15:29:38 PDT." <20080331222938.GS95731@elvis.mu.org> Date: Mon, 31 Mar 2008 16:18:29 -0700 From: Bakul Shah Message-Id: <20080331231829.26D865B50@mail.bitblocks.com> Cc: Christopher Arnold , arch@freebsd.org, Poul-Henning Kamp , qpadla@gmail.com, Martin Fouts Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 23:18:30 -0000 On Mon, 31 Mar 2008 15:29:38 PDT Alfred Perlstein wrote: > * Poul-Henning Kamp [080331 15:24] wrote: > > In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes: > > >On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon wrote: > > >> But how do you index that information? You can't simply append the > > >> information to the NAND unless you also have a way to access it. So > > >> does the filesystem have to scan the NAND (or significant portions of it) > > >> in order to build an index of the filesystem topology in system memory?
> > > > > >One possible way: > > > > > >I'd design the system so that each update ends with the write > > >of a root block[1]. > > > > This is sort of the approach Margo Seltzer used for her (Kludge-)LFS > > it has many drawbacks, in particular when it comes to recovery. > > Can you explain why? > > I could see it being a problem because recovering the filesystem's > most recent change might require significant scanning? Let us take the mtron MSD-SATA3025-032 device for example. It has a capacity of 32GB + can do 16,000 random & 78,000 sequential reads per second (of 512 byte blocks). If you write a root block every megabyte, you have 2^15 potential root blocks. Locating the latest one will require 16 random reads + a scan of at most 1MB; which translates to about 26ms. Not too bad since this cost is incurred only on the first mount or reboot. From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 23:55:22 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AE2441065672 for ; Mon, 31 Mar 2008 23:55:22 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 831758FC15 for ; Mon, 31 Mar 2008 23:55:22 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VNt7Jf030309; Mon, 31 Mar 2008 16:55:07 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VNt7ZY030308; Mon, 31 Mar 2008 16:55:07 -0700 (PDT) Date: Mon, 31 Mar 2008 16:55:07 -0700 (PDT) From: Matthew Dillon Message-Id: <200803312355.m2VNt7ZY030308@apollo.backplane.com> To: Bakul Shah References: <20080331231829.26D865B50@mail.bitblocks.com> Cc: Christopher Arnold , Martin Fouts , Alfred Perlstein , qpadla@gmail.com, arch@freebsd.org, 
Poul-Henning Kamp Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 23:55:22 -0000 :Let us take the mtron MSD-SATA3025-032 device for example. It :has a capacity of 32GB + can do 16,000 random & 78,000 :sequential reads per second (of 512 byte blocks). If you :write a root block every megabyte, you have 2^15 potential :root blocks. Locating the latest one will require 16 random :reads + a scan of at most 1MB; which translates to about :26ms. Not too bad since this cost is incurred only on :the first mount or reboot. Yah, I think for NAND filesystems crash recovery is actually the easiest issue to deal with since all you need to do, pretty much, is rewind your append pointer a bit. Not only can you rewind the filesystem to an older state, but you can also reconstruct a great deal of the unsynced in-memory data by doing a limited reverse scan. It just does not take very long to do that and it can be done automatically by the filesystem at mount time. For example, if you write some file data and the new file block is flushed to flash but the related meta-data changes for the pointers to the now relocated data block have not yet been flushed to flash, on crash recovery it is possible to note this condition (the old physical block number would be stored in the aux data area of the new one) and regenerate the missing meta-data changes instead of being forced to back-out the write. 
Being able to do this has fairly substantial consequences because it means that fsync() only needs to flush some of the modified pages, not all of them, and that the remaining modified pages could in fact remain unflushed in the system cache and STILL be recoverable after a crash because their modification was merely a side effect of the operation that *was* flushed to flash. Once you are able to do that, you can also simply decide not to synchronously flush that meta-data at all and thus allow multiple changes to accumulate in system memory before flushing to flash. There is one issue here and that is the transactional nature of most filesystem operations. For example, if you append to a file with a write() you are not only writing new data to the backing store, you are also updating the on-disk inode's st_size field. Those are two widely separated pieces of information which must be transactionally bound --- all or nothing. In this regard the crash recovery code needs to understand that they are bound and either recover the whole transaction or none of it. Once you get to that point you also have to worry about interdependencies between transactions... a really sticky issue that is the reason for a huge chunk of soft updates' complexity in UFS. Basically you can wind up with interdependent transactions which must be properly recovered. An example would be doing a write() and then doing another write(). The second write() cannot be recovered unless the first can also be recovered. Separate transactions, but with a dependency. Such interdependencies can become arbitrarily complex the longer you leave meta-data changes unflushed. The question ultimately becomes whether the recovery code can deal with the complexity or not. If not you may be forced to flush rather than create the interdependency. HAMMER has precisely this issue with its UNDO sequencing.
The complexity is in the algorithm, but not the (fairly small) amount of time it takes to actually perform the recovery operation. -Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 01:03:34 2008 Return-Path: Delivered-To: arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 29EAF106564A for ; Tue, 1 Apr 2008 01:03:34 +0000 (UTC) (envelope-from das@FreeBSD.ORG) Received: from zim.MIT.EDU (ZIM.MIT.EDU [18.95.3.101]) by mx1.freebsd.org (Postfix) with ESMTP id DAF938FC12 for ; Tue, 1 Apr 2008 01:03:33 +0000 (UTC) (envelope-from das@FreeBSD.ORG) Received: from zim.MIT.EDU (localhost [127.0.0.1]) by zim.MIT.EDU (8.14.2/8.14.2) with ESMTP id m31159Zv007875; Mon, 31 Mar 2008 21:05:09 -0400 (EDT) (envelope-from das@FreeBSD.ORG) Received: (from das@localhost) by zim.MIT.EDU (8.14.2/8.14.2/Submit) id m31158QH007874; Mon, 31 Mar 2008 21:05:08 -0400 (EDT) (envelope-from das@FreeBSD.ORG) Date: Mon, 31 Mar 2008 21:05:08 -0400 From: David Schultz To: Poul-Henning Kamp Message-ID: <20080401010508.GA7708@zim.MIT.EDU> Mail-Followup-To: Poul-Henning Kamp , Bakul Shah , Christopher Arnold , Martin Fouts , arch@FreeBSD.ORG, qpadla@gmail.com, freebsd-arch@FreeBSD.ORG References: <20080331222154.C976C5B50@mail.bitblocks.com> <26080.1207002217@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <26080.1207002217@critter.freebsd.dk> Cc: Christopher Arnold , Martin Fouts , arch@FreeBSD.ORG, qpadla@gmail.com, freebsd-arch@FreeBSD.ORG Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 01:03:34 -0000 On Mon, Mar 31, 2008, Poul-Henning Kamp wrote: > In message
<20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes: > >On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon wrote: > >> But how do you index that information? You can't simply append the > >> information to the NAND unless you also have a way to access it. So > >> does the filesystem have to scan the NAND (or significant portions of it) > >> in order to build an index of the filesystem topology in system memory? > > > >One possible way: > > > >I'd design the system so that each update ends with the write > >of a root block[1]. This is exactly what ZFS does (except that it wasn't designed for flash, so the primary copy of the root block is always stored at a well-known location.) Countless other systems dating back to the use of shadow paging in System R use the same technique, including WAFL and several flash file systems. > This is sort of the approach Margo Seltzer used for her (Kludge-)LFS > it has many drawbacks, in particular when it comes to recovery. Generally not. Recovery is trivial, especially compared to other techniques such as journalling. You simply find the root block, and it has pointers to a consistent snapshot of the system. The main limitation is that making updates durable immediately (i.e., fsync()) is inefficient, since all the dirty indirect blocks up to the root need to be flushed to disk. ZFS addresses this by writing updates that must be synchronous to a logical redo log, which does introduce complications for recovery. 
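
A minimal sketch of the shadow-paging idea discussed above, under loudly stated assumptions: the "disk" is an append-only Python list, interior blocks are dicts mapping a name to a block index, leaves are bytes, and the path being updated already exists. `cow_update` and `read` are hypothetical names, not ZFS or WAFL interfaces.

```python
def cow_update(log, root, path, value):
    """Append fresh copies of every block along path; return the new root index.

    Nothing already in log is modified, so an older root index still
    describes a consistent snapshot.
    """
    if not path:
        log.append(value)             # new copy of the leaf
        return len(log) - 1
    node = dict(log[root])            # copy the interior block
    node[path[0]] = cow_update(log, node[path[0]], path[1:], value)
    log.append(node)                  # the parent must be rewritten as well
    return len(log) - 1


def read(log, root, path):
    """Follow path from the given root block down to a leaf."""
    node = log[root]
    for name in path:
        node = log[node[name]]
    return node
```

Recovery in this model amounts to picking a root index. The cost shows up in `cow_update`: changing one leaf appends one block per level of the tree, which is the fsync() overhead mentioned above.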
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 01:13:07 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8CE9B1065672; Tue, 1 Apr 2008 01:13:07 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id 22F4E8FC12; Tue, 1 Apr 2008 01:13:07 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id 2A4875B41; Mon, 31 Mar 2008 18:13:06 -0700 (PDT) To: Matthew Dillon In-reply-to: Your message of "Mon, 31 Mar 2008 16:55:07 PDT." <200803312355.m2VNt7ZY030308@apollo.backplane.com> Date: Mon, 31 Mar 2008 18:13:05 -0700 From: Bakul Shah Message-Id: <20080401011306.2A4875B41@mail.bitblocks.com> Cc: Christopher Arnold , Martin Fouts , Alfred Perlstein , qpadla@gmail.com, arch@freebsd.org, Poul-Henning Kamp Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 01:13:07 -0000 On Mon, 31 Mar 2008 16:55:07 PDT Matthew Dillon wrote: > There is one issue here and that is the transactional nature of most > filesystem operations. For example, if you append to a file with > a write() you are not only writing new data to the backing store, you > are also updating the on-disk inode's st_size field. Those are two > widely separated pieces of information which must be transactionally > bound --- all or nothing. In this regard the crash recovery code needs > to understand that they are bound and either recover the whole > transaction or none of it. 
> > Once you get to that point you also have to worry about > interdependencies between transactions... a really sticky issue > that is the reason for a huge chunk of soft updates' complexity in > UFS. Basically you can wind up with interdependent transactions > which must be properly recovered. An example would be doing a write() > and then doing another write(). The second write() cannot be recovered > unless the first can also be recovered. Separate transactions, but with > a dependency. My instinct is to not combine transactions. That is, every data write results in a sequence: {data, [indirect blocks], inode, ..., root block}. Until the root block is written to the disk this is not a "committed" transaction and can be thrown away. In a Log-FS we always append on write; we never overwrite any data/metadata so this is easy and the FS state remains consistent. FFS overwrites blocks so all this gets far more complicated. Sort of like the difference between reasoning about functional programs & imperative programs! Now, it may be possible to define certain rules that allow one to combine transactions. For instance: write1(block n), write2(block n) == write2(block n); write(block n of file f1), delete file f1 == delete file f1; etc. That is, as long as write1 & associated metadata writes are not flushed to the disk, and a later write (write2) comes along, the earlier write (write1) can be thrown away. [But I have no idea if this is worth doing or even doable!] This is reminiscent of the bottom-up rewrite system (BURS) used in some code generators (such as lcc's). The idea is the same here: replace a sequence of operations with an equivalent but lower-cost sequence. > Such interdependencies can become arbitrarily complex the longer > you leave meta-data changes unflushed. The question ultimately becomes > whether the recovery code can deal with the complexity or not. If not > you may be forced to flush rather than create the interdependency.
> > HAMMER has precisely this issue with its UNDO sequencing. The complexity > is in the algorithm, but not the (fairly small) amount of time it takes > to actually perform the recovery operation. I don't understand the complexity. Basically your log should allow you to define a functional programming abstraction -- where you never overwrite any data/metadata for any active transactions and so reasoning becomes easier. [But maybe we should take any hammer discussion offline] From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 01:33:43 2008 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7CE1C106564A for ; Tue, 1 Apr 2008 01:33:43 +0000 (UTC) (envelope-from das@FreeBSD.ORG) Received: from zim.MIT.EDU (ZIM.MIT.EDU [18.95.3.101]) by mx1.freebsd.org (Postfix) with ESMTP id 39D7E8FC17 for ; Tue, 1 Apr 2008 01:33:43 +0000 (UTC) (envelope-from das@FreeBSD.ORG) Received: from zim.MIT.EDU (localhost [127.0.0.1]) by zim.MIT.EDU (8.14.2/8.14.2) with ESMTP id m31159Zv007875; Mon, 31 Mar 2008 21:05:09 -0400 (EDT) (envelope-from das@FreeBSD.ORG) Received: (from das@localhost) by zim.MIT.EDU (8.14.2/8.14.2/Submit) id m31158QH007874; Mon, 31 Mar 2008 21:05:08 -0400 (EDT) (envelope-from das@FreeBSD.ORG) Date: Mon, 31 Mar 2008 21:05:08 -0400 From: David Schultz To: Poul-Henning Kamp Message-ID: <20080401010508.GA7708@zim.MIT.EDU> Mail-Followup-To: Poul-Henning Kamp , Bakul Shah , Christopher Arnold , Martin Fouts , arch@FreeBSD.ORG, qpadla@gmail.com, freebsd-arch@FreeBSD.ORG References: <20080331222154.C976C5B50@mail.bitblocks.com> <26080.1207002217@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <26080.1207002217@critter.freebsd.dk> Cc: Christopher Arnold , Martin Fouts , arch@FreeBSD.ORG, qpadla@gmail.com, freebsd-arch@FreeBSD.ORG Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 01:33:43 -0000 On Mon, Mar 31, 2008, Poul-Henning Kamp wrote: > In message <20080331222154.C976C5B50@mail.bitblocks.com>, Bakul Shah writes: > >On Mon, 31 Mar 2008 13:06:10 PDT Matthew Dillon wrote: > >> But how do you index that information? You can't simply append the > >> information to the NAND unless you also have a way to access it. So > >> does the filesystem have to scan the NAND (or significant portions of it) > >> in order to build an index of the filesystem topology in system memory? > > > >One possible way: > > > >I'd design the system so that each update ends with the write > >of a root block[1]. This is exactly what ZFS does (except that it wasn't designed for flash, so the primary copy of the root block is always stored at a well-known location.) Countless other systems dating back to the use of shadow paging in System R use the same technique, including WAFL and several flash file systems. > This is sort of the approach Margo Seltzer used for her (Kludge-)LFS > it has many drawbacks, in particular when it comes to recovery. Generally not. Recovery is trivial, especially compared to other techniques such as journalling. You simply find the root block, and it has pointers to a consistent snapshot of the system. The main limitation is that making updates durable immediately (i.e., fsync()) is inefficient, since all the dirty indirect blocks up to the root need to be flushed to disk. ZFS addresses this by writing updates that must be synchronous to a logical redo log, which does introduce complications for recovery. 
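[Illustration added to the archive, not part of the original message.] A minimal sketch of the root-block technique described above, under stated assumptions: the store is a Python list standing in for append-only flash pages, the record layout, the CRC check, and the names AppendOnlyStore, commit and recover are all hypothetical, and recovery scans backwards for the newest intact root rather than reading it from a well-known location as ZFS does. It is not code from ZFS, WAFL, or any flash file system named in the thread.

```python
import json
import zlib

ROOT_KIND = "root"


class AppendOnlyStore:
    """Toy append-only medium; each list element stands for one page."""

    def __init__(self):
        self.pages = []  # pages are only ever appended, never overwritten

    def append(self, record):
        self.pages.append(record)
        return len(self.pages) - 1  # address of the page just written


def _checksum(body):
    # CRC over a canonical encoding, so a torn or corrupt root is detectable.
    return zlib.crc32(json.dumps(body, sort_keys=True).encode())


def commit(store, files, generation):
    """Append the data pages first, then a single root page pointing at them."""
    pointers = {name: store.append({"kind": "data", "payload": data})
                for name, data in files.items()}
    body = {"generation": generation, "pointers": pointers}
    return store.append({"kind": ROOT_KIND, "body": body, "crc": _checksum(body)})


def recover(store):
    """Return the snapshot named by the newest intact root page.

    Pages written after that root belong to an update that never committed
    and are simply ignored.
    """
    for record in reversed(store.pages):
        body = record.get("body")
        if record.get("kind") == ROOT_KIND and record.get("crc") == _checksum(body):
            return {name: store.pages[addr]["payload"]
                    for name, addr in body["pointers"].items()}
    return {}
```

In this toy model, an update whose root page never reached the medium leaves only unreferenced data pages behind, so recover() returns the previous snapshot without replaying anything. The cost of forcing every fsync() through a full commit is not modelled here.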
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 04:53:48 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E4DB7106564A; Tue, 1 Apr 2008 04:53:48 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id C97878FC24; Tue, 1 Apr 2008 04:53:48 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 5F47F41310C; Mon, 31 Mar 2008 21:53:39 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Mon, 31 Mar 2008 21:53:47 -0700 Message-ID: In-Reply-To: <200803312254.m2VMsPqZ029549@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciTgjIRUdfyD0S9R1C3o5kg62MZ6QAMdO2Q References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> From: "Martin Fouts" To: "Matthew Dillon" Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 04:53:49 -0000 =20 > -----Original Message----- > From: Matthew Dillon [mailto:dillon@apollo.backplane.com]=20 > Sent: Monday, March 31, 2008 3:54 PM > To: Martin Fouts > Cc: 
qpadla@gmail.com; freebsd-arch@freebsd.org; Christopher > Arnold; arch@freebsd.org > Subject: RE: Flash disks and FFS layout heuristics > > If I were reimplementing HAMMER as a flash filesystem > (which I am NOT doing), I would probably do away > with the blockmap layer entirely since seek locality of > reference is not needed for a flash filesystem, and the global > B-Tree would serve directly as the named-block topology. Which would lead you almost directly to the sort of performance problems that jffs2 has. Until you've done it, you'll be surprised at the cost of maintaining b-trees in NAND. 
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 04:55:21 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8E7AF10656CD; Tue, 1 Apr 2008 04:55:21 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id 7617F8FC30; Tue, 1 Apr 2008 04:55:21 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 6F771409D23; Mon, 31 Mar 2008 21:55:07 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Mon, 31 Mar 2008 21:55:15 -0700 Message-ID: In-Reply-To: <20080331223846.CFD975BAE@mail.bitblocks.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciTf/3rlDZiIGCpSl2L4XsPLGVpdwANHLxA References: Your message of "Mon, 31 Mar 2008 22:23:37 -0000." 
<26080.1207002217@critter.freebsd.dk> <20080331223846.CFD975BAE@mail.bitblocks.com> From: "Martin Fouts" To: "Bakul Shah" , "Poul-Henning Kamp" Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 04:55:21 -0000 =20 > -----Original Message----- > From: Bakul Shah [mailto:bakul@bitblocks.com]=20 > Sent: Monday, March 31, 2008 3:39 PM > To: Poul-Henning Kamp > Cc: Matthew Dillon; Christopher Arnold; arch@freebsd.org;=20 > qpadla@gmail.com; freebsd-arch@freebsd.org; Martin Fouts > Subject: Re: Flash disks and FFS layout heuristics=20 > One thing I forgot to add: I'd let the lower level handle bad=20 > block forwarding and wear levelling (like on the m-tron device). >=20 One of the difficulties of doing things this way comes from the complexity of dealing with garbage collection when you want to reuse an erase unit. 
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 05:30:00 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 80DA91065674; Tue, 1 Apr 2008 05:30:00 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id 669398FC2D; Tue, 1 Apr 2008 05:30:00 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id F11E7414D25; Mon, 31 Mar 2008 22:27:41 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Mon, 31 Mar 2008 22:27:50 -0700 Message-ID: In-Reply-To: <200803312219.m2VMJlkT029240@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciTfnsz5JkHfGFbRHmgUX7Qxg+vbgANl4/w References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312219.m2VMJlkT029240@apollo.backplane.com> From: "Martin Fouts" To: "Matthew Dillon" Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 05:30:00 -0000 > -----Original Message----- > From: Matthew Dillon [mailto:dillon@apollo.backplane.com]=20 > Sent: Monday, March 31, 2008 3:20 PM >=20 > For flash storage systems competitive with hard drive storage,=20 In 
embedded systems, it's RAM that flash storage competes with, not hard drive storage. SSD is a completely different engineering problem. > For the phone market? You mean small flash storage > devices? Performance is almost irrelevant there Actually, we're very performance sensitive in this area, and getting more so as audio and video demands grow. > Three in five years? Is that an illustration of my point > with regards to flash filesystem design? Ok, that was a joke :-) It's illustrative of my changing career. Three different file systems for three different products. ;) > But I don't think we can count small flash storage systems. Both models > devolve into trivialities when you are managing small amounts of > flash storage. I don't know who your "we" is, but *my* "we" counts small flash storage systems as rather critical. And the 'trivialities' aren't so trivial when you have to maintain reliability in the face of easily removable batteries. > Again, I am not familiar with jffs2 but you are painting > with a very broad brush that is more than likely an issue specifically > with the jffs2 design and not the concept of using named blocks in > general. That's the assumption that led from jffs1 to jffs2. It's an incorrect assumption. > What you are advocating is a filesystem which uses an > absolute sector referencing scheme. I haven't actually advocated anything. Merely pointed out problems. But no, the scheme that we're currently using doesn't use the sort of absolute sector referencing scheme you're suggesting below. > Any change made to the filesystem requires a new > page to essentially be appended to the flash storage. In order to > properly index the information and maintain the > filesystem topology you also have to recopy *ALL* pages containing > references to the updated absolute sector in order to repoint them > to the new absolute sector. Sorry, no. Doesn't work like that at all. 
This is, after all, computer science, and indirection is your friend. > I really understand that model, and it has the advantage I'm sure you do. It's not the one we're using though. > I really do understand where you are coming from, the > simplicity of chaining the physical topology cannot be denied, > and I like the elegance, but I hope I've > shown that it is not actually simplifying the overall design much > over a named block scheme, and that there are some fairly severe > issues that can crop up that are complete non-issues when > using a named block scheme. All you've really shown is that the difference between theory and practice, as usual, remains larger in practice than in theory. You have made it painfully clear that you are immersed in large scale file systems, an area I left behind a decade ago when I abandoned my work on CUE at HP Labs. It is a fascinating and difficult area, and I heartily approve of experimentation in it. It also has almost no engineering tradeoffs in common with persistent storage for battery powered devices. In summary, then: NAND devices are critical to CE products, especially so-called convergent devices, in which there is no hard disk and persistent storage takes the form of an embedded NAND device and zero or more removable NAND devices. Power issues are critical and performance is becoming more so as the devices become more complex. Reliability of the file systems on these devices is also critical. The usual technique of disk performance optimization (throw more RAM at it in order to cache) is unavailable, the usual hardware needs for optimization (seek and rotational latency) are not present, and the peculiarities of NAND, most notably the size of the erase unit compared to the size of the write unit, the existence of the spare area, and the much higher bit error rates than either disk or RAM experience, coupled with those requirements, lead to a need for NAND-specific file systems on such devices. 
Experience has shown that brute force approaches based on flash translation layers work, but are inefficient and overly complex. Attempts to use generalized NOR file systems in NAND tend to have significant performance problems because of the cost of maintaining the embedded data structures, such as b-trees, that replaced the more straightforward data structures of earlier more linear file system designs. Experience has also shown that the file system needs to expose transaction semantics to the application, and that leaving bad block handling to a translation layer (even a block naming scheme) leads to performance problems consequent to garbage collection, which is inevitable in devices that have such large erase units. 
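[Illustration added to the archive, not part of the original message.] A rough sketch of why the erase unit being larger than the write unit forces garbage collection: before a unit can be erased, every page in it that is still referenced has to be copied somewhere else. The page count, the list-based model, and the name reclaim are assumptions made for illustration; this does not describe the file system discussed in the message or any real flash translation layer.

```python
PAGES_PER_ERASE_UNIT = 4  # assumed for the example; real parts have many more


def reclaim(erase_unit, live, spare_unit):
    """Copy still-live pages out of erase_unit, then erase the whole unit.

    erase_unit: list of page payloads in the unit being reclaimed
    live:       set of page indexes whose contents are still referenced
    spare_unit: an already-erased unit that receives the live pages
    Returns the number of page copies the reclaim cost.
    """
    copies = 0
    for index, payload in enumerate(erase_unit):
        if index in live:
            # Relocate the page; the caller must repoint references to it.
            spare_unit[copies] = payload
            copies += 1
    for index in range(len(erase_unit)):
        erase_unit[index] = None  # erase works on the whole unit at once
    return copies
```

The fewer dead pages a unit holds when it is reclaimed, the more copying each erase costs, which is one way to read the remark above about performance problems consequent to garbage collection.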
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 07:45:44 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C45ED1065676 for ; Tue, 1 Apr 2008 07:45:44 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: from palm.hoeg.nl (mx0.hoeg.nl [IPv6:2001:610:652::211]) by mx1.freebsd.org (Postfix) with ESMTP id 883448FC21 for ; Tue, 1 Apr 2008 07:45:44 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: by palm.hoeg.nl (Postfix, from userid 1000) id 8393E1CC30; Tue, 1 Apr 2008 09:45:43 +0200 (CEST) Date: Tue, 1 Apr 2008 09:45:43 +0200 From: Ed Schouten To: Martin Fouts Message-ID: <20080401074543.GK51074@hoeg.nl> References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="np6E2rbShIadjoVu" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.17 (2007-11-01) Cc: Christopher Arnold , arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 07:45:44 -0000 --np6E2rbShIadjoVu Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable * Martin Fouts wrote: > The MTD based file system jffs2 is an example of the third, and a > cautionary tale for those who would write their own. 
I can remember there is also a newer MTD based file system called LogFS: http://logfs.org/ --=20 Ed Schouten WWW: http://g-rave.nl/ --np6E2rbShIadjoVu Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (FreeBSD) iEYEARECAAYFAkfx6CcACgkQ52SDGA2eCwUlrQCff2XreQyrIcjzv0F9852VCHkf o1cAnjWRUYuPv2m1wjg2meh6lX16m1RN =ovEd -----END PGP SIGNATURE----- --np6E2rbShIadjoVu-- From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 07:56:15 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E0B751065671 for ; Tue, 1 Apr 2008 07:56:15 +0000 (UTC) (envelope-from wangyi6854@gmail.com) Received: from ti-out-0910.google.com (ti-out-0910.google.com [209.85.142.190]) by mx1.freebsd.org (Postfix) with ESMTP id 704248FC1B for ; Tue, 1 Apr 2008 07:56:15 +0000 (UTC) (envelope-from wangyi6854@gmail.com) Received: by ti-out-0910.google.com with SMTP id j2so619668tid.3 for ; Tue, 01 Apr 2008 00:56:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=DJW7dMd74MIdDXgD33AMUdw6IjftB2ZRGg2PHvDAybU=; b=cesp+97OYGZm/XliwUrwCb+63W46AyJ7u5S3q/7BHClJYSv1RicWH5BNFtgrUwD0wbBClWFje/ElgyKB8IMRH7b+wiSrhxPLfH5X7Vw53xP4Mr7R+RUT6p97toVDMfDFP7r/0OeD56kYSa+iw0rrX4z4VFbFKBfYplse7hG50Ic= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=bNwpShVcS6Kzw8Ku2mqPB7kVBvAzYhTzILfCz12WbSqSFyf2DsjxYMaWHvLCWJJfM2aPAnjKNKSY6FTHgVAOdghyrAp2r8xbfylwuR8wv3KNaQRBXQ2e2bYH6w1xdwBY/JkDDfT3+rKpLZL6M2jBtSEoytrqai04O7cGisgVol4= Received: by 10.110.31.11 with SMTP id e11mr3341169tie.56.1207034859245; Tue, 
01 Apr 2008 00:27:39 -0700 (PDT) Received: by 10.110.10.14 with HTTP; Tue, 1 Apr 2008 00:27:39 -0700 (PDT) Message-ID: <5ea5cca50804010027k51b59658mb28a481c516e84b0@mail.gmail.com> Date: Tue, 1 Apr 2008 15:27:39 +0800 From: "Yi Wang" To: "Attilio Rao" In-Reply-To: <3bbf2fe10802061700p253e68b8s704deb3e5e4ad086@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <3bbf2fe10802061700p253e68b8s704deb3e5e4ad086@mail.gmail.com> Cc: Yar Tikhiy , Doug Barton , Jeff Roberson , freebsd-fs@freebsd.org, Scot Hetzel , freebsd-arch@freebsd.org Subject: Re: [RFC] Remove NTFS kernel support X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 07:56:16 -0000 On 2/7/08, Attilio Rao wrote: > As exposed by several users, NTFS seems to be broken even before first > VFS commits happening around the end of December. Those commits exposed > some problems about NTFS which are currently under investigation. > Ultimately, this filesystem is also unmaintained at the moment. > > Speaking with Jeff, we agreed on what can be a possible compromise: > remove the kernel support for NTFS and maybe take care of the FUSE > implementation. > What I now propose is a small survey which can shed some light for us > about what you think about this idea and its implications: > - Do you use NTFS? Yes. I have a dual-boot machine. > - Are you interested in maintaining it? No. I'm not familiar with kernel/fs programming. > - Do you know a good reason to not use FUSE ntfs implementation? What Yes. Listening to music and watching video on NTFS disks stops frequently when using ntfs-3g. > the kernel counterpart adds? I've no idea. > - Do you think axing the kernel support a good idea? For servers, Yes. For desktops, NO! 
> > Thanks, > Attilio > > > > -- > Peace can only be achieved by understanding - A. Einstein > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > -- Regards, Wang Yi From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 17:33:31 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1469C106566B; Tue, 1 Apr 2008 17:33:31 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id D624D8FC15; Tue, 1 Apr 2008 17:33:30 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31HXFaJ039652; Tue, 1 Apr 2008 10:33:15 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31HXF6e039649; Tue, 1 Apr 2008 10:33:15 -0700 (PDT) Date: Tue, 1 Apr 2008 10:33:15 -0700 (PDT) From: Matthew Dillon Message-Id: <200804011733.m31HXF6e039649@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 17:33:31 -0000 
:> with the blockmap layer entirely since seek locality of :> reference is not needed for a flash filesystem, and the global :> B-Tree would serve directly as the named-block topology. : :Which would lead you almost directly to the sort of performance problems :that jffs2 has. : :Until you've done it, you'll be surprised at the cost of maintaining :b-trees in NAND. Well, I'm not advocating a B-Tree storage model for indexing in NAND. That would be kinda nasty. What I've done is simply describe a mechanism whereby a filesystem topology is able to make use of an abstraction to the point of being able to do away with what would normally have to be implemented by the filesystem itself. It doesn't have to be a B-Tree. You keep mentioning jffs2 and you keep mentioning 'the sort of performance problems that jffs2 has'... ok, but you aren't actually saying what they are with any specificity. Just saying that a blockmap or a named-block model is bad is wholly insufficient... it's way too broad a brush that ignores the literally thousands of ways such entities can be implemented. I've described numerous ways such entities can work, particularly if one is manipulating large blocks. If you want to address those please feel free but holding up jffs2 as a poster child of fail for an entire class of storage modeling is stupid. Please also remember, since you appear to have forgotten, that topologies can be implemented in both ram and storage together and are NOT necessarily ram intensive. This is going to be particularly true for any application reading or writing large files, such as an audio application, and is even more particularly true when dealing with fairly large files in fairly small amounts of storage. Synthesis is a major design component for small scale filesystems. I can't comment on your filesystem specifically, but you are welcome to describe it in more detail. 
I've been doing embedded work for over 20 years now, everything from single chip microcomputers with 256 bytes of ram to little ARM chipsets running linux. I still have all that goddamn machine code burned into my brain, in fact, like a lost cousin. Please do not make the inference that I somehow do not understand the issues involved. I know precisely what the issues are and I will only repeat that for small scale devices, particularly recording and playback devices, the filesystem design devolves into trivialities that are easily cached, even if you don't have a lot of ram. Large linear files are extremely well suited for synthetic topologies and their performance characteristics are ridiculously easy to manage. -Matt Matthew Dillon 
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 17:48:16 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 080171065676 for ; Tue, 1 Apr 2008 17:48:16 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id D871F8FC13 for ; Tue, 1 Apr 2008 17:48:15 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31HmE7w039801; Tue, 1 Apr 2008 10:48:14 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31HmE1h039800; Tue, 1 Apr 2008 10:48:14 -0700 (PDT) Date: Tue, 1 Apr 2008 10:48:14 -0700 (PDT) From: Matthew Dillon Message-Id: <200804011748.m31HmE1h039800@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312219.m2VMJlkT029240@apollo.backplane.com> Cc: freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 17:48:16 -0000 : :> -----Original Message----- :> From: Matthew Dillon [mailto:dillon@apollo.backplane.com] :> Sent: Monday, March 31, 2008 3:20 PM :> :> For flash storage systems competitive with hard drive storage, : :In embedded systems, it's RAM that flash storage competes with, not hard : :drive storage. : :SSD is a completely different engineering problem. 
You know, I think I've asked this already and you don't have to answer it if you don't want to, but exactly how large a flash device are you working with in your embedded project(s)? -Matt From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 17:56:16 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B755E1065672; Tue, 1 Apr 2008 17:56:16 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id 987C68FC2D; Tue, 1 Apr 2008 17:56:16 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 595724020D7; Tue, 1 Apr 2008 10:56:06 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Tue, 1 Apr 2008 10:56:14 -0700 Message-ID: In-Reply-To: <200804011733.m31HXF6e039649@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciUHoB86TsNjwfyS+ixG2NdJB9SywAACYKg References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> <200804011733.m31HXF6e039649@apollo.backplane.com> From: "Martin Fouts" To: "Matthew Dillon" Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 17:56:16 -0000 > -----Original Message----- > From: Matthew Dillon [mailto:dillon@apollo.backplane.com] > Sent: Tuesday, April 01, 2008 10:33 AM > To: Martin Fouts > Well, I'm not advocating a B-Tree storage model for > indexing in NAND. That would be kinda nasty. What I've done > is simply describe a mechanism whereby a filesystem topology > is able to make use of an abstraction to > the point of being able to do away with what would > normally have to be implemented by the filesystem itself. It > doesn't have to be a B-Tree. > It has to be a data structure with certain properties, most notably what's required to maintain consistency. It might in theory be possible to invent such a data structure that doesn't trip over NAND performance issues. In practice, it has not turned out to be so. I welcome your demonstration of such a design. > You keep mentioning jffs2 and you keep mentioning 'the sort of > performance problems that jffs2 has'... ok, but you > aren't actually saying what they are with any specificity. There's plenty of information on jffs2's performance problems available. > Just saying that a blockmap or a named-block model is bad > is wholly insufficient... Saying that it's good, and then describing an implementation that's known in practice to be bad is much less sufficient. > it's way too broad a brush that ignores the literally > thousands of ways such entities can be implemented. I've > described numerous ways such entities can work, particularly > if one is manipulating large blocks. And I've pointed out that your idea of 'large' is too large to be of value in CE devices. > If you want to address those please feel free but > holding up jffs2 as a poster child of fail for an > entire class of storage modeling is stupid. Indeed it would be. It's good that I haven't done so. 
The only times I've brought jffs2 up are when you've described approaches that are jffs2-like, and I've pointed out that those specific approaches have failed in jffs2. > Please also remember, since you appear to have > forgotten, that topologies can be implemented in both ram > and storage together and are NOT necessarily ram intensive. No, Matt, I haven't "forgotten". It's a trivial statement. At runtime *all* topologies have in-ram and on-storage components. > I've been doing embedded work for over 20 years now, But, by your own earlier admission, you have no experience with NAND in such systems. It is a common mistake to extrapolate from NOR flash to inappropriate assumptions about NAND flash. > Large linear files are extremely well suited for > synthetic topologies and ridiculously easy to manage the > performance characteristics of. "Large linear files" are fairly rare on the ground in convergent devices. What you say may well be true for a simple MP3 player, but that's not what we're talking about here. You've done the same thing in this email that you did in your earlier comparison. You've found a trivial subset of the problem and then made the generalization that solving that subset shows that the solution to the problem is trivial. 
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 18:06:35 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DDC28106564A for ; Tue, 1 Apr 2008 18:06:35 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id C324E8FC22 for ; Tue, 1 Apr 2008 18:06:35 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 88086403750; Tue, 1 Apr 2008 11:06:26 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Tue, 1 Apr 2008 11:06:35 -0700 Message-ID: In-Reply-To: <200804011748.m31HmE1h039800@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciUIJASYk2JPFFoQSiKOw+tRD4dxgAAkzWQ References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312219.m2VMJlkT029240@apollo.backplane.com> <200804011748.m31HmE1h039800@apollo.backplane.com> From: "Martin Fouts" To: "Matthew Dillon" Cc: freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 18:06:36 -0000 If you've asked, I've missed the question. We tend to size ram and embedded NAND the same. The latest numbers I can discuss are several years old and were 64mb/64mb. 
Engineering *always* wants more of each, but the BOM rules. > -----Original Message----- > From: Matthew Dillon [mailto:dillon@apollo.backplane.com] > Sent: Tuesday, April 01, 2008 10:48 AM > To: Martin Fouts > Cc: freebsd-arch@freebsd.org > Subject: RE: Flash disks and FFS layout heuristics > > : > :> -----Original Message----- > :> From: Matthew Dillon [mailto:dillon@apollo.backplane.com] > :> Sent: Monday, March 31, 2008 3:20 PM > :> > :> For flash storage systems competitive with hard drive storage, > : > :In embedded systems, it's RAM that flash storage competes > with, not hard > : > :drive storage. > : > :SSD is a completely different engineering problem. > > You know, I think I've asked this already and you don't > have to answer > it if you don't want to, but exactly how large a flash > device are you > working with in your embedded project(s)? > > -Matt > > 
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 18:07:58 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B7E111065671 for ; Tue, 1 Apr 2008 18:07:58 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 994DC8FC1A for ; Tue, 1 Apr 2008 18:07:58 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31I7iFF039980; Tue, 1 Apr 2008 11:07:44 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31I7g8I039974; Tue, 1 Apr 2008 11:07:42 -0700 (PDT) Date: Tue, 1 Apr 2008 11:07:42 -0700 (PDT) From: Matthew Dillon Message-Id: <200804011807.m31I7g8I039974@apollo.backplane.com> To: Bakul Shah References: <20080401011306.2A4875B41@mail.bitblocks.com> Cc: Christopher Arnold , Martin 
Fouts , Alfred Perlstein , qpadla@gmail.com, arch@freebsd.org, Poul-Henning Kamp Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 18:07:58 -0000 :My instinct is to not combine transactions. That is, every :data write results in a sequence: {data, [indirect blocks], :inode, ..., root block}. Until the root block is written to :the disk this is not a "committed" transaction and can be :thrown away. In a Log-FS we always append on write; we never :overwrite any data/metadata so this is easy and the FS state :remains consistent. FFS overwrites blocks so all this gets :far more complicated. Sort of like the difference between :reasoning about functional programs & imperative programs! : :Now, it may be possible to define certain rules that allow :one to combine transactions. For instance, : : write1(block n), write2(block n) == write2(block n) : write(block n of file f1), delete file f1 == delete file f1 : :etc. That is, as long as write1 & associated metadata writes :are not flushed to the disk, and a later write (write2) comes :along, the earlier write (write1) can be thrown away. [But I :have no idea if this is worth doing or even doable!] This is a somewhat different problem, one that is actually fairly easy to solve in larger systems because operating systems tend to want to cache everything. So really what is going on is that your operations (until you fsync()) are being cached in system memory and are not immediately committed to the underlying storage. Because of that, overwrites and deletions can simply destroy the related cache entities in system memory and never touch the disk. Ultimately you have to flush something to disk, and that is where the transactional atomicity and ordering issues start popping up. 
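A minimal sketch of the cache-side combining described above, assuming the pending operations are still sitting in system memory and nothing has been flushed yet. The tuple layout and the name coalesce() are hypothetical, chosen only for illustration; they are not taken from any filesystem discussed in this thread. It applies the two quoted rules: a later write to the same block replaces an earlier one, and deleting a file discards that file's cached writes.

```python
def coalesce(ops):
    """Collapse a list of not-yet-flushed operations.

    ops is a list of tuples (hypothetical layout):
        ("write", file, block_no, data)
        ("delete", file)
    Returns the reduced list that still has to reach storage.
    """
    pending = {}   # (file, block_no) -> data; the latest write wins
    deleted = {}   # files deleted, kept in first-seen order

    for op in ops:
        if op[0] == "write":
            _, f, n, data = op
            # Rule 1: write1(block n), write2(block n) == write2(block n)
            pending[(f, n)] = data
        elif op[0] == "delete":
            _, f = op
            # Rule 2: write(block n of f), delete f == delete f
            for key in [k for k in pending if k[0] == f]:
                del pending[key]
            deleted[f] = True
        else:
            raise ValueError("unknown operation: %r" % (op,))

    # Deletes are emitted first so that a file deleted and then
    # rewritten in the same batch ends up with only the new data.
    result = [("delete", f) for f in deleted]
    result += [("write", f, n, d) for (f, n), d in pending.items()]
    return result


ops = [
    ("write", "f1", 0, b"a"),
    ("write", "f1", 0, b"b"),   # supersedes the first write
    ("write", "f2", 3, b"x"),
    ("delete", "f2"),           # cancels the cached write to f2
]
reduced = coalesce(ops)
```

The sketch deliberately ignores the hard part raised in the text: once something must be flushed, ordering and atomicity between the surviving operations still have to be handled by the storage layer.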
:This is reminiscent of the bottom up rewrite system (BURS) :used in some code generators (such as lcc's). The idea is the :same here: replace a sequence of operations with an :equivalent but lower cost sequence. What it comes down to is how expensive do you want your fsync() to be? You can always commit everything down to the root block and your recovery code can always throw everything away until it finds a good root block, and avoid the whole issue, but if you do things that way then fsync() becomes an extremely expensive call to make. Certain applications, primarily database applications, really depend on having an efficient fsync(). Brute force is always simpler, but not necessarily always desirable. :... :> is in the algorithm, but not the (fairly small) amount of time it takes :> to actually perform the recovery operation. : :I don't understand the complexity. Basically your log should :allow you to define a functional programming abstraction -- :where you never overwrite any data/metadata for any active :transactions and so reasoning becomes easier. [But maybe we :should take any hammer discussion offline] The complexity is there because a filesystem is actually a multi-layer entity. One has a storage topology which must be indexed in some manner, but one also has the implementation on top of that storage topology which has its own consistency requirements. For example, UFS stores inodes in specific places and has bitmaps for allocating data blocks and blockmaps to access data from its inodes. But UFS also has to maintain the link count for a file, the st_size field in the inode, the directory entry in the directory, and so forth. Certain operations require multiple filesystem entities to be adjusted as one atomic operation. For example removing a file requires the link count in the inode to be decremented and the entry in the directory to be removed. 
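The brute-force scheme mentioned above (commit everything down to the root block, and on recovery throw everything away until a good root block is found) can be sketched roughly as follows. This is an illustration under stated assumptions, not any real filesystem's on-media format: the log is modeled as a Python list of records, each record carries a CRC32 of its payload, and the names make_record(), is_valid() and recover() are hypothetical.

```python
import zlib


def make_record(kind, payload):
    """Build a log record; kind is "data", "inode" or "root" (assumed set)."""
    return {"kind": kind, "payload": payload, "crc": zlib.crc32(payload)}


def is_valid(rec):
    """A record is good if its stored CRC matches its payload."""
    return zlib.crc32(rec["payload"]) == rec["crc"]


def recover(log):
    """Return the committed prefix of an append-only log.

    Scan backwards for the last valid root record. Everything after it
    belongs to a transaction that never committed and is discarded.
    """
    for i in range(len(log) - 1, -1, -1):
        rec = log[i]
        if rec["kind"] == "root" and is_valid(rec):
            return log[: i + 1]
    return []   # no committed root at all


log = [
    make_record("data", b"d1"),
    make_record("inode", b"i1"),
    make_record("root", b"r1"),    # transaction 1 committed here
    make_record("data", b"d2"),
    make_record("inode", b"i2"),   # crash before the next root block
]
committed = recover(log)
```

The recovery side is trivial, which is the appeal. The cost shows up on the write side: every fsync() has to push the whole chain up to a new root block, which is the expense the paragraph above warns about.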
Undo logs are very good at describing the low level entity, allowing you to undo changes in time order, but undo logs need additional logic to recognize groups of transactions which must be recovered or thrown away as a single atomic entity, or which depend on each other. One reason why it's a big issue is that portions of those transactions can be committed to disk out of order. The recovery code has to recognize that dependent pieces are not present even if other micro-transactions have been fully committed. Taking UFS as an example: UFS's fsck can clean up link counts and directory entries, but has no concept of lost file data so you can wind up with an inode specifying a 100K file which after recovery is actually full of zeros (or garbage) instead of the 100K of data that was written to it. That is an example of a low level recovery operation that is unable to take into account high level dependencies. -Matt Matthew Dillon 
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 20:10:24 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 706F91065670 for ; Tue, 1 Apr 2008 20:10:24 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 4F6728FC3E for ; Tue, 1 Apr 2008 20:10:24 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31KAMJV041012; Tue, 1 Apr 2008 13:10:22 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31KAMpu041011; Tue, 1 Apr 2008 13:10:22 -0700 (PDT) Date: Tue, 1 Apr 2008 13:10:22 -0700 (PDT) From: Matthew Dillon Message-Id: <200804012010.m31KAMpu041011@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> 
<200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312219.m2VMJlkT029240@apollo.backplane.com> <200804011748.m31HmE1h039800@apollo.backplane.com> Cc: freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 20:10:24 -0000 :If you've asked, I've missed the question. : :We tend to size ram and embedded NAND the same. The latest numbers I can :discuss are several years old and were 64mb/64mb. Engineering *always* :wants more of each, but the BOM rules. : 64MB is tiny. None of the problems with any of the approaches we've discussed even exist with devices that small in an embedded system. You barely need to even implement a filesystem topology, let alone anything sophisticated. To be clear, because I really don't understand how you can possibly argue that the named-block storage layer is bad in a device that small... the only sophistication a named-block storage model needs is when it must create a forward lookup on-flash to complement the reverse lookup you get from the auxiliary storage. Given that you can trivially cache many translations in memory, not to mention do a boot-up scan of a flash that small, the only performance impact would be writing out a portion of the forward translation topology every N (N > 1000) or so page writes (however many translations can be conveniently cached in system memory). In a small flash device the embedded application will determine whether you even need such a table... 
frankly, unless it's a general purpose computing device like an iPhone you probably wouldn't need an on-flash forward lookup, you could simply size the blocks to guarantee 99% flash utilization versus the number of files you expect to have to maintain (so, for example, the named block size could be 512K if you expected to have to maintain 100 files on a 64MB device). This doesn't mean the filesystem would have to use a 512K block size, that would only be the case if the filesystem were flash-unaware.

It's seriously a non-issue. You are making too many assumptions about how named blocks would be used, particularly if the filesystem is flash-aware. Named blocks do not have to be 'sectors' or 'filesystem blocks' or anything of the sort. They can be much larger... they can easily be multiples of a flash page though you don't want to make them too large because a failed page also fails the whole named-block covering that page. They can be erase units (probably the best fit). This leaves the filesystem layer (and here we are talking about a flash-aware filesystem), with a hellofalot more implementation flexibility.

FORWARD LOOKUP ON-FLASH TOPOLOGY

There are many choices available for the forward lookup topology, assuming you need one. Here again we are describing the need to have one (or at least one that would be considered sophisticated) only for larger flash devices -- really true solid state storage.

We aren't talking about having to write out tiny little updates to B-Tree elements... that's stupid. Only an idiot would do that. Because you can cache new lookups in system memory and because you do NOT have to flush the forward lookup topology to flash for crash recovery purposes, the sole limiting factor for the efficiency of the forward lookup flush to flash is the amount of system memory you are willing to set aside to cache new translations. 
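A minimal sketch of the caching argument, under stated assumptions: new name-to-page translations accumulate in RAM and go to flash only once enough of them are dirty to fill a whole page. The class name `TranslationCache`, the 8-byte entry size and the 8K page size are hypothetical values picked for illustration, not figures from the thread, and a real design would write modified lookup-tree nodes rather than a flat page of entries.

```python
class TranslationCache:
    """Hypothetical in-memory cache of named-block -> flash-page translations."""

    def __init__(self, page_size=8192, entry_size=8):
        # Number of translations that fit in one flash page (assumed sizes).
        self.entries_per_page = page_size // entry_size
        self.dirty = {}        # name -> physical page, not yet written to flash
        self.data_writes = 0   # page writes that carried file data
        self.map_writes = 0    # page writes that carried translations

    def record_write(self, name, phys_page):
        """Note that named block `name` now lives at `phys_page`."""
        self.data_writes += 1
        self.dirty[name] = phys_page
        # Flush only when a full page of translations has accumulated.
        # Nothing forces an earlier flush: crash recovery is assumed to
        # rebuild the map from the reverse lookup stored with each page.
        if len(self.dirty) >= self.entries_per_page:
            self.flush()

    def flush(self):
        """Stand-in for writing one page of translations to flash."""
        self.map_writes += 1
        self.dirty.clear()

    def overhead(self):
        """Translation page writes per data page write."""
        return self.map_writes / self.data_writes if self.data_writes else 0.0


cache = TranslationCache()
for i in range(100_000):
    cache.record_write(name=i, phys_page=i)
print(cache.entries_per_page, cache.map_writes, cache.overhead())
```

With these assumed sizes a page holds 1024 translations, so the extra write traffic stays just under one page per thousand data pages. More RAM set aside for the cache means fewer flushes, and RAM is the only limit.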
Since translations are fairly small structures we are probably talking not dozens, not hundreds, but probably at least a thousand translations before any sort of flush would be needed. Let's be clear here. That's ONE page write every THOUSAND page writes worth of overhead. There are no write performance issues.

The actual on-flash topology for the forward lookup? With such a large rollup cache available it almost doesn't matter, but let's say we wanted to limit forward lookups to 3 levels. Let's take a 2G flash device with 8K pages, just to be nice about it. That's 262144 named blocks. Not a very large number, eh? Why, you could almost fit that in system memory (and maybe you can!) and obviate the need for an on-flash forward lookup topology at all.

But let's say you COULDN'T fit that in system memory. Hmm. 3 levels, 262144 entries maximum (less in real life). That would be a 3-layer radix tree with a radix of 64. The top layer would almost certainly be cacheable in system memory (and maybe even two layers) so we are talking one or two page reads from flash to do the lookup and the update mechanic, being a radix tree, would be to sync the bits of the radix tree that were modified by the new translations all the way up to the root.

Clearly given the number of 'dirty' translations that would need to be synchronized, you could easily fill a flash page and then simply retire the synced translations from system memory, and repeat as often as necessary to maintain the dirty ratio in the cache in system memory at appropriate levels. You can also clearly accumulate enough dirty translations for the sync to be worthwhile... that is, be guaranteed to fill a whole page. You do NOT have to sync for recovery purposes so it becomes an issue that is solely related to the system cache and nothing else.

I'll add something else with regards to radix trees using large radices... you can usually cache just about the whole damn thing except the leaf level in system memory. 
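The arithmetic behind the 3-level, radix-64 figure can be checked with a short sketch. The helper names below (`named_block_count`, `radix_levels`, `radix_path`, `nodes_per_level`) are hypothetical and only illustrate the numbers quoted in the message, assuming one named block per 8K page.

```python
def named_block_count(flash_bytes, page_bytes):
    """Number of named blocks if each block covers exactly one page."""
    return flash_bytes // page_bytes


def radix_levels(entries, radix):
    """Smallest number of levels whose fan-out covers `entries` leaves."""
    levels, capacity = 0, 1
    while capacity < entries:
        capacity *= radix
        levels += 1
    return levels


def radix_path(block, radix, levels):
    """Slot index used at each level of the tree, root first."""
    path = []
    for _ in range(levels):
        path.append(block % radix)
        block //= radix
    return path[::-1]


def nodes_per_level(radix, levels):
    """Node count at each level, root first: 1, radix, radix**2, ..."""
    return [radix ** i for i in range(levels)]


blocks = named_block_count(2 * 1024**3, 8 * 1024)   # 2G device, 8K pages
levels = radix_levels(blocks, 64)
print(blocks, levels)
print(radix_path(4097, 64, levels))
print(nodes_per_level(64, levels))
```

The node counts show why caching everything above the leaf level is cheap: there are only 1 + 64 interior nodes, against 4096 leaf nodes holding the translations themselves.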
Think about that for a moment and in particular think about how it greatly reduces the number of actual flash reads needed to perform the lookup.

I'll add something else with regards to on-storage radix trees. You can also adjust the layout so the BOTTOM few levels of the radix tree, relative to some leaf, reside in the same page. So now we've reduced a random uncached translation lookup to, at worst, ONE flash page read operation that ALSO guarantees us locality of reference for nearby file blocks (and hence has no performance issues for streaming reads either).

--

Now, if you want to argue that this model would have serious performance penalties please go ahead, I'm all ears.

-Matt

From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 20:15:12 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5108F106564A; Tue, 1 Apr 2008 20:15:12 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 29A028FC3D; Tue, 1 Apr 2008 20:15:12 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31KEv0e041050; Tue, 1 Apr 2008 13:14:58 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31KEvTJ041049; Tue, 1 Apr 2008 13:14:57 -0700 (PDT) Date: Tue, 1 Apr 2008 13:14:57 -0700 (PDT) From: Matthew Dillon Message-Id: <200804012014.m31KEvTJ041049@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> 
<200804011733.m31HXF6e039649@apollo.backplane.com> Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 20:15:12 -0000 :You've done the same thing in this email that you did in your earlier :comparison. You've found a trivial subset of the problem and then make :the generalization that solving that subset shows that the solution to :the problem is trivial. You know as well as I do that embedded projects are ALWAYS a trivial subset of something. Until you get to the level of sophistication of an iPhone. You only need to solve the subset of the problem that the embedded project covers. Most general problem sets become trivialized when used in degenerate environments. This is not a description of a trivialized solution to the problem set being generalized up, it is a description of the generalized solution to the problem set being applied to a degenerate application which trivializes many aspects of the general solution. My interest is in large scale systems, OF COURSE I'm approaching the problem from the point of view of large scale systems and not small scale systems. Don't be silly. 
-Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 20:20:20 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F1E1F1065685 for ; Tue, 1 Apr 2008 20:20:20 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id D6F3F8FC27 for ; Tue, 1 Apr 2008 20:20:20 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 5A5E8402FDF; Tue, 1 Apr 2008 13:20:11 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Tue, 1 Apr 2008 13:20:19 -0700 Message-ID: In-Reply-To: <200804012010.m31KAMpu041011@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciUNGuMC9ZvsF/0TUeT8Q90DJXWjAAAFO/g References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> 
<200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312219.m2VMJlkT029240@apollo.backplane.com> <200804011748.m31HmE1h039800@apollo.backplane.com> <200804012010.m31KAMpu041011@apollo.backplane.com> From: "Martin Fouts" To: "Matthew Dillon" Cc: freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 20:20:21 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Tuesday, April 01, 2008 1:10 PM
> To: Martin Fouts
> Cc: freebsd-arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics
>
> 64MB is tiny. None of the problems with any of the
> approaches we've discussed even exist with devices that small in an
> embedded system.

It is fairly clear that you're not familiar with NAND devices on embedded systems, as you've just said that well known problems do not exist.

> To be clear, because I really don't understand how you
> can possibly argue that the named-block storage layer is bad in a
> device that small...

Yes, your lack of understanding is very apparent.

> It's seriously a non-issue. You are making too many
> assumptions about how named blocks would be used, particularly
> if the filesystem is flash-aware.

Now you're moving your goal posts. You came into this suggesting that the file system not be flash-aware. If I make the file system flash aware then many of the problems become manageable. That *was* my starting thesis, after all.

> Now, if you want to argue that this model would have
> serious performance penalities please go ahead,
> I'm all ears.

Feel free to implement it and see for yourself. 
The only point I had wished to make is that you get performance wins out of making the file system flash aware. Now that you've agreed to that, feel free to experiment with any of a number of ways of making it flash aware. From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 20:32:42 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 868EE106564A; Tue, 1 Apr 2008 20:32:42 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id 688538FC15; Tue, 1 Apr 2008 20:32:42 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 282AB403D0A; Tue, 1 Apr 2008 13:32:33 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Tue, 1 Apr 2008 13:32:41 -0700 Message-ID: In-Reply-To: <200804012014.m31KEvTJ041049@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciUNRcYybraZFx2R9SkG1318plQVgAAL2Xg References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> <200804011733.m31HXF6e039649@apollo.backplane.com> <200804012014.m31KEvTJ041049@apollo.backplane.com> From: "Martin Fouts" To: "Matthew Dillon" Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related 
to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 20:32:42 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Tuesday, April 01, 2008 1:15 PM
> To: Martin Fouts
> Cc: qpadla@gmail.com; freebsd-arch@freebsd.org; Christopher
> Arnold; arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics

> You know as well as I do that embedded projects are
> ALWAYS a trivial subset of something.

No, I don't know that. It is hard to "know" something that is not true.

> Until you get to the level of sophistication of
> an iPhone.

Although Apple is getting much hype about the sophistication of the iPhone, we've been shipping convergent devices of that complexity for some time now. Apple have better industrial design, but they're not doing anything, other than the touch screen, that we haven't already done.

You are now *starting* to understand the level of complexity of CE embedded devices.

> My interest is in large scale systems, OF COURSE I'm
> approaching the problem from the point of view
> of large scale systems and not small
> scale systems. Don't be silly.

Actually, Matt, it's you, by trying to solve a complex embedded systems problem as if it were a 'degenerate' large scale systems problem, who are "being silly." You keep handing me crowbars when I need a scalpel. 
From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 22:11:52 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4BAFB106567B for ; Tue, 1 Apr 2008 22:11:52 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 08C908FC23 for ; Tue, 1 Apr 2008 22:11:51 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.2/8.14.1) with ESMTP id m31M94wI093631; Tue, 1 Apr 2008 16:09:04 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Tue, 01 Apr 2008 16:09:52 -0600 (MDT) Message-Id: <20080401.160952.1678772361.imp@bsdimp.com> To: jroberson@chesapeake.net From: "M. Warner Losh" In-Reply-To: <20080326230322.H72156@desktop> References: <20080327.013229.1649766744.imp@bsdimp.com> <20080326230322.H72156@desktop> X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org Subject: Re: AsiaBSDCon DEVSUMMIT patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 22:11:52 -0000 In message: <20080326230322.H72156@desktop> Jeff Roberson writes: : : On Thu, 27 Mar 2008, M. Warner Losh wrote: : : > Greetings, : > : > We've been talking about the situation with suspend/resume in the : > tree. Here's a quick hack to allow one to suspend/resume an : > individual device. This may or may not work too well, but it is : > offered up for testing and criticism. : > : > http://people.freebsd.org/~imp/devctl.diff : > : > devctl -s ath 0 suspend ath0 : > devctl -r ath 0 resume ath0 : : Hey Warner, : : This is a great idea. 
Would it be possible to provide a little more : background about what the expected failure/success modes are? If we had : some easy to follow steps we could ask for testers on current@ and create : a wiki with a list of known working/broken hardware. That'd be a great : step towards having widespread suspend/resume support. There's two areas of testing/use here. The first is to run it like so: devctl -s ath 0 && sleep 10 && devctl -r ath 0 Eg, suspend and resume an individual device, or even tree of devices. At least one bug has been found with this technique (it is actually a rediscovery of an older bug, but I digress). You'd want the kernel to not panic, and you'd want things to be good after as before. One can also use it to test to make sure that a device remains sane after a long time suspended as well. This can have power savings potential too, but that's a secondary effect at this time. Warner From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 22:26:06 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 490FA106566C for ; Tue, 1 Apr 2008 22:26:06 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 279328FC21 for ; Tue, 1 Apr 2008 22:26:06 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31MQ4Di042174; Tue, 1 Apr 2008 15:26:04 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31MQ42O042173; Tue, 1 Apr 2008 15:26:04 -0700 (PDT) Date: Tue, 1 Apr 2008 15:26:04 -0700 (PDT) From: Matthew Dillon Message-Id: <200804012226.m31MQ42O042173@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> 
<200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312219.m2VMJlkT029240@apollo.backplane.com> <200804011748.m31HmE1h039800@apollo.backplane.com> <200804012010.m31KAMpu041011@apollo.backplane.com> Cc: freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 22:26:06 -0000

:> 64MB is tiny. None of the problems with any of the
:> approaches we've discussed even exist with devices that small in an
:> embedded system.
:
:It is fairly clear that you're not familiar with NAND devices on
:embedded systems, as you've just said that well known problems do not
:exist.
:
:> To be clear, because I really don't understand how you
:> can possibly argue that the named-block storage layer is bad in a
:> device that small...
:
:Yes, your lack of understanding is very apparent.

What complete bullshit. If you want to argue technical merits, be my guest. So far you haven't made one single technical point in any of your postings. You've posted about your experience with NAND flash in embedded systems, very clearly with SMALL flash devices and simple filesystems, and that's fine, it's similar to my flash filesystem experience (which, yes, was primarily on NOR devices but, no, that doesn't magically make you an expert on NAND and me an idiot about it). Considering I've pretty much spent my entire life working with hardware, that is about as ridiculous an assertion as you could make, but clearly you believe it.

But then you generalized to the entire market and that's not fine. 
Real filesystems are far more sophisticated than what you will ever see in the embedded flash product, and consequently real filesystems tend to be broken down into more abstract terms so the higher layers can actually implement the filesystem functions without it taking 10 man years of programming. My interest is squarely with real filesystems targeted to mass storage, these days.

I didn't start out smearing people, but if you are going to start acting like an asshole then I have no problem ratcheting it up to your level.

:> It's seriously a non-issue. You are making too many
:> assumptions about how named blocks would be used, particularly
:> if the filesystem is flash-aware.
:
:Now you're moving your goal posts. You came into this suggesting that
:the file system not be flash-aware. If I make the file system flash
:aware then many of the problems become manageable. That *was* my
:starting thesis, after all.

More bullshit. My first posting was not addressing performance issues, it was specifically addressing FFS and ZFS and the (bad) idea of making them more flash aware. ZFS on a 2G flash device? What the hell would be the point of that? We're talking about two completely different things.

It used a NOR flash translation table by way of example. I sure as hell would never say that a flash-unaware filesystem would perform better than a flash-aware one. Duh! You have seriously misread the meaning behind that posting and you clearly didn't read any of the other postings. I suggest you go back and READ THE POSTINGS and maybe you'll start to understand the issues being addressed.

Since you don't understand my position, let me lay it out for you in simple terms:

* There's no point trying to adapt a flash-unaware filesystem to become flash-aware. It is a complete waste of time. You might as well write a new filesystem. If you want to use a flash-unaware filesystem you use a translation layer, eat any performance issues, and be done with it. 
MAYBE spend a few days optimizing the one or two critical paths you want to eke a little more performance out of. This has nothing to do with having to use translation tables and everything to do with the fact that the existence and use of those REQUIRED translation tables are not integrated into the flash-unaware filesystem, so inefficiencies are compounded rather than reduced. It's like jamming a square peg into a round hole.

* Just because flash-unaware filesystems HAVE TO use a translation layer doesn't mean that a translation layer is bad for a flash-aware filesystem.

* A named-block translation layer can be an extremely valuable abstraction for use in filesystem designs which directly integrate its features (that is, the filesystem NAMES the block instead of ALLOCATES the block). There is absolutely NOTHING inherently bad about the model from a performance point of view, particularly if your storage media requires relocation (as NAND does).

The key point is that a named-block layer takes over the functionality of all the indirect pointers that would normally have to be manipulated by higher layers in the filesystem. If you can integrate that into the physical storage requirements then you kill two birds with one stone and get major performance benefits from doing so.

You are welcome to debate the points, but you'll get burned if you try to take some sort of moral high ground stand based on a few piddly flash filesystems written over the course of a few years. Coding at that level is fun and interesting but ultimately not very difficult.

:Feel free to implement it and see for yourself.
:
:The only point I had wished to make is that you get performance wins out
:of making the file system flash aware. Now that you've agreed to that,
:feel free to experiment with any of a number of ways of making it flash
:aware.

Right now my work is with HAMMER. 
It's fun to theorize how I could make HAMMER into a flash-aware filesystem but I have no intention of actually doing so any time soon, or ever. Frankly, if I wanted to write a ground-up flash filesystem I could, it would not be difficult... certainly not more difficult than HAMMER and HAMMER is probably the most sophisticated filesystem that exists in the open source world today.

But I have no desire to do that at this juncture and the lack of desire certainly does not invalidate my comments on the matter. It's kinda like saying a person has no right to comment about how to cut European apples if their focus in life is cutting American ones.

NAND is different from NOR but the differences can be explained pretty much in two paragraphs and most of the same concepts apply. You can't byte-write, you have auxiliary information, you need to add a little ECC, and scrub. It isn't rocket science.

I am a very technical person. If you are going to argue merit, then you damn well better say WHY something doesn't work, in detail, instead of simply stating that some random other entity couldn't make it work at some point in the past so therefore it is bad. If you do not know the WHY, precisely, then good $#%$#%$#% luck designing anything that's actually sophisticated. 
-Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Tue Apr 1 23:26:11 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 335B8106566B; Tue, 1 Apr 2008 23:26:11 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id E53E38FC1C; Tue, 1 Apr 2008 23:26:10 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m31NPxYl042552; Tue, 1 Apr 2008 16:25:59 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m31NPwM1042551; Tue, 1 Apr 2008 16:25:58 -0700 (PDT) Date: Tue, 1 Apr 2008 16:25:58 -0700 (PDT) From: Matthew Dillon Message-Id: <200804012325.m31NPwM1042551@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> <200804011733.m31HXF6e039649@apollo.backplane.com> <200804012014.m31KEvTJ041049@apollo.backplane.com> Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Apr 2008 23:26:11 -0000 :Although Apple is getting much hype about the sophistication of the :iPhone, we've been shipping convergent devices of that complexity for :some time now. 
Apple have better industrial design, but they're not
:doing anything, other than the touch screen, that we haven't already
:done.
:
:You are now *starting* to understand the level of complexity of CE
:embedded devices.

How condescending you are. Just remember, you started this fracas.

I can't believe it, you actually think you know more about embedded design than I do! What a laugh.

I don't know a thing about you, and you clearly don't know a thing about me. Here's a hint: When you don't know, you shouldn't assume.

:Actually, Matt, it's you, by trying to solve a complex embedded systems
:problem as if it were a 'degenerate' large scale systems problem, who
:are "being silly." You keep handing me crowbars when I need a scalpel.

Oooh. complex.... biiig word. What bullshit. You think these problems are complex? Embedded systems these days are nearly complete single-chip microcomputers running hacked up but nearly complete operating systems containing 95% off-the-shelf software, much of it open source, and much of it provided to the developer on a shiny platter, with a fully operational SDK and HDK and FPGA logic around the core cpu.

All in one chip. These days 'embedded' means you are sporting a completely functional linux operating system in a two chip solution with virtually no external parts required beyond those needed for the connectors. And it's all now written in C or C++ or whatever the hell language you want to write it in.

It's crazy easy to do embedded development work these days. No more difficult than writing software on a full blown PC.

I'm sorry, but if that is your idea of complex then it's roughly equivalent to my idea of ridiculously easy.
-Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 00:04:03 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DC0101065672 for ; Wed, 2 Apr 2008 00:04:03 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id BFEEA8FC1F for ; Wed, 2 Apr 2008 00:04:03 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 41B01404C87; Tue, 1 Apr 2008 17:03:53 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Date: Tue, 1 Apr 2008 17:04:01 -0700 Message-ID: In-Reply-To: <200804012226.m31MQ42O042173@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciUR2Bdnxq/kqqoSkSjUVdQHavnoAAAaoGw References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312219.m2VMJlkT029240@apollo.backplane.com> <200804011748.m31HmE1h039800@apollo.backplane.com> <200804012010.m31KAMpu041011@apollo.backplane.com> <200804012226.m31MQ42O042173@apollo.backplane.com> From: "Martin Fouts" To: "Matthew Dillon" Cc: freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 00:04:04 -0000
> -----Original Message-----
> From: Matthew Dillon
[mailto:dillon@apollo.backplane.com]
> Sent: Tuesday, April 01, 2008 3:26 PM
> To: Martin Fouts
> Cc: freebsd-arch@freebsd.org
> Subject: RE: Flash disks and FFS layout heuristics
>
>
> What complete bullshit. If you want to argue technical
> merits, be my guest. So far you haven't made one single
> technical point in any of your postings.

My, my. Mr Dillon likes to be rude to people and tell them they are 'stupid' and 'silly', but when he makes naïve comments about systems he doesn't understand and gets called on it, suddenly it's "complete bullshit."

I can see why PHK broke off trying to educate you.

> You've posted about your experience with NAND flash in embedded systems,
> very clearly with SMALL flash devices

Actually, you're jumping to conclusions, again, Matt. I mentioned what size devices we used 3 years ago. I haven't spoken at all about the size of devices I have experience with.

> and simple filesystems, and that's fine, it's similar to my
> flash filesystem

Actually, I mentioned some file systems which are not 'simple' by any reasonable metric. You're the one who keeps trying to impose 'simple'.

> experience (which, yes, was primarily on NOR devices but, no, that
> doesn't magically make you an expert on NAND and me an
> idiot about it).

You're not an 'idiot' about NAND, your knowledge is merely limited to reading specs, and as a consequence you're extrapolating beyond that knowledge when you try to apply your theory to NAND, and experience has shown that your extrapolations don't hold up.

> Considering I've pretty much spent my entire life working with hardware
> that is about as ridiculous an assertion as you could make, but clearly
> you believe it.

You need to stop being defensive in technical discussions; stop imposing your presumptions on other people's problems; and stop thinking that anyone cares enough about you to make any assertions about your background.
I have *not* made any assertions, other than that you've made comments about NAND which betray your lack of experience with it. You don't have the experience, and your comments about 'trivial' problems, and 'nonexistent' problems clearly show that. You don't need to take my word for it. You merely have to check on the state of the art in NAND file systems for CE products.

Oh, you should also stop putting words in my mouth. You're wrong again. I've never thought you were an idiot and I don't think so now. You're rude, arrogant, judgmental, and sure of your own skills beyond your actual ability, but you're no idiot.

> But then you generalized to the entire market and that's not fine.

I've not made any such generalizations, Matt. You're projecting again. *You* are the one who made the generalization that all embedded problems were trivial.

The only thing I speak with authority about in this discussion is convergent CE devices, and then I speak only of the ones I've worked on and what experience with them has been.

>
> Real filesystems are far more sophisticated than what you
> will ever see in the embedded flash product,

Now there's a hasty generalization that betrays your attitude problem. "real" file systems? First NAND filesystems are 'trivial'. Then it's 'degenerate'. Now it's not 'real.'

You'll never understand a problem that you dismiss without investigating.

> My interest is squarely with real filesystems targetted
> to mass storage, these days.

Yes. I pointed that out. Also pointed out that as a consequence you're trying to apply approaches that don't work in CE devices.

>
> I didn't start out smearing people, but if you are going to start
> acting like an asshole then I have no problem ratcheting it up
> to your level.

Dillon, after all these years, I would have thought you'd gotten past that blind spot.
You don't call people 'silly' and 'stupid' and the work they're doing "trivial" and "degenerate" *unless* you're acting like an asshole. As long as I've known you, you've liked starting pissing contests and then blaming the other party. PHK was wise to have begged off when you started down that path, but I had some time on my hands and thought others would benefit from a technical discussion. If you want a pissing match, I suggest alt.flames, where I'm sure they'll happily accommodate you.

> Since you don't understand my position, let me lay it out for you
> in simple terms:
>
> * There's no point trying to adapt a flash-unaware filesystem to
> become flash-aware. It is a complete waste of time.

"waste of time" is a value judgment that you don't have the background to make for anyone but yourself. The marketplace, which supports at least two such filesystems, disagrees with your judgment.

> You might as well write a new filesystem.
> If you want to use a flash-unaware
> filesystem you use a translation layer, eat any
> performance issues, and be done with it.

Congratulations. Welcome to FATFS on usb sticks.

> * Just because flash-unaware filesystems HAVE To use a
> translation layer
> doesn't mean that a translation layer is bad for a flash-aware
> filesystem.

That is correct. The FTL approach is suitable for certain types of flash file systems, as I pointed out some number of emails back. It is not suitable for all.

> * A named-block translation layer can be an extremely
> valuable abstraction
> for use in filesystem designs which directly integrate
> its features
> (that is, the filesystem NAMES the block instead of
> ALLOCATES the
> block).

'can be' makes for a pretty weak precondition, so sure, it 'can be'.

>
> There is absolutely NOTHING inherently bad about the
> model from a
> performance point of view, particularly if your storage
> media requires
> relocation (as NAND does).
Either 'relocation' doesn't mean what you think it means, or NAND doesn't require it.

> The key point is that a
> named-block layer
> takes over the functionality of all the indirect
> pointers that would
> normally have to be manipulated by higher layers in the
> filesystem.

Yes. This is what the FTL people do, except the granularity of their named-block is the write unit. It has performance issues.

> If you can integrate that into the physical storage
> requirements then
> you kill two birds with one stone and get major
> performance benefits
> from doing so.
>

That's a big if. It has in practice turned out to be unattainable. I await your demonstration to the contrary.

> You are welcome to debate the points, but you'll get
> burned if you try
> to take some sort of moral highground stand based on a
> few piddly flash
> filesystems written over the course of a few years.
> Coding at that
> level is fun and interesting but ultimately not very difficult.

'burned'. 'moral highground'. 'piddly'. 'not very difficult'. That's a hell of a blindspot to your own behavior that you've got there, Matt.

> Right now my work is with HAMMER. It's fun to theorize
> how I could make HAMMER into a flash-aware filesystem but
> I have no intention of actually doing so any time soon, or ever.
>

I didn't think so.

> Frankly, if I wanted to write a ground-up flash
> filesystem I could, it would not be difficult...

Of course not. People write file systems in undergraduate OS classes.

> But I have no desire to do that at
> this juncture and the lack of desire certainly does not
> invalidate my comments on the matter.

What 'invalidates' your comments, is that others have tried what you've outlined, in the way that you've outlined it, and it has failed.
That, coupled with your naïve claims about embedded systems not being complex and your several mistaken claims about where the problems are or aren't in such systems simply highlights that, as you say, you're having fun speculating, but, as I say, your speculation would take you down trodden paths to well known conclusions.

> NAND is different from NOR but the differences can be
> explained pretty much in two paragraphs and most of the same concepts
> apply.

The interesting aspects lie in the differences.

> It isn't rocket science.

I've done rocket science for a living. It's not that hard, and I've always found that statement silly.

>
> I am a very technical person. If you are going to argue
> merit, then you damn well better say WHY something doesn't
> work, in detail, instead of simply stating that some
> random other entity couldn't make it work at some point in
> the past so therefore it is bad.

You're a 'very technical person' with a very judgmental attitude and a tendency to use emotionally loaded language that you later disclaim. But no, I don't have to say WHY it doesn't work in detail, provided someone else has already said so. I merely have to point out the existence of the refutation.

> If you do not know the WHY, precisely, then good $#%$#%$#%
> luck designing anything that's actually sophisticated.
>

"sophisticated", which I suppose is a synonym for "complex", is an interesting metric for a "very technical person" to apply.

But actually, it's pretty easy to design sophisticated systems when you don't understand the underlying issues. In practice it's more common to make systems more sophisticated in the face of uncertainty, not less.
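The named-block idea argued over in this exchange (the filesystem NAMES a block and a lower layer decides where it physically lives, relocating it on every rewrite) can be sketched in a few lines of Python. This is only a toy under stated assumptions: `NamedBlockStore`, its methods, and the `(object, offset)` naming scheme are hypothetical, pages are written strictly in order, and garbage collection, wear levelling, and crash consistency are not modelled. It takes no side on the performance dispute; it is neither an FTL nor anything from HAMMER.

```python
class NamedBlockStore:
    """Toy name -> physical page map over an append-only page array."""

    def __init__(self, num_pages):
        self.pages = [None] * num_pages   # None stands for an erased page
        self.map = {}                     # block name -> physical page index
        self.stale = set()                # superseded pages awaiting cleanup
        self.next_free = 0

    def write(self, name, payload):
        # Never overwrite in place: each write relocates the named block.
        if self.next_free >= len(self.pages):
            raise RuntimeError("no erased pages left; cleanup is not modelled")
        phys = self.next_free
        self.next_free += 1
        # The name travels with the data, as it could in a page's spare area.
        self.pages[phys] = (name, payload)
        old = self.map.get(name)
        if old is not None:
            self.stale.add(old)
        # Only this one map entry changes; the caller updates no pointers.
        self.map[name] = phys
        return phys

    def read(self, name):
        return self.pages[self.map[name]][1]

    def rebuild_map(self):
        # Mount-time scan: the copy at the highest page index wins, which
        # is only valid because pages are written strictly in order here.
        rebuilt = {}
        for phys, entry in enumerate(self.pages):
            if entry is not None:
                rebuilt[entry[0]] = phys
        return rebuilt
```

For example, writing the name `("file7", 0)` twice leaves the first physical page in `stale` and the map pointing at the second. What the sketch leaves out, reclaiming stale pages and keeping the map cheap to maintain at a given granularity, is the part the two sides disagree about.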
From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 00:36:41 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CF6161065678; Wed, 2 Apr 2008 00:36:41 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id B1D568FC1B; Wed, 2 Apr 2008 00:36:41 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 0BFF4402FE5; Tue, 1 Apr 2008 17:36:32 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Tue, 1 Apr 2008 17:36:40 -0700 Message-ID: In-Reply-To: <200804012325.m31NPwM1042551@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciUT8TcTf9Q5HzrRyu3uwNh8yZ7yAABWEug References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> <200804011733.m31HXF6e039649@apollo.backplane.com> <200804012014.m31KEvTJ041049@apollo.backplane.com> <200804012325.m31NPwM1042551@apollo.backplane.com> From: "Martin Fouts" To: "Matthew Dillon" Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 00:36:42 -0000
> I can't believe it, you actually think you know more
> about embedded design than I do! What a laugh.
>
> I don't know a thing about you, and you clearly don't
> know a thing about me. Here's a hint: When you don't
> know you shouldn't assume.

So what part of "you think you know" is *not* an assumption?

> You think these problems are complex?

Yes. I do it. That's what makes them fun.

> Embedded systems these days are nearly complete
> single-chip microcomputers running hacked up but nearly complete
> operating systems containing 95% off-the-shelf software,
> much of it open source, and much of it provided to the developer on
> a shiny platter, with a fully operational SDK and HDK and FPGA logic
> around the core cpu.

It amazes me that you can claim to be so knowledgeable about embedded systems and then make such a glaringly wrong description of the ones I work on. Our current shipping product has *no* off-the-shelf software, beyond a few small libraries for image encoding, out of several million lines of code. There's no 'fully operational SDK', beyond a gcc cross-compiler that we've debugged ourselves. The SOC has no FPGA.

> All in one chip. These days 'embedded' means you are sporting a
> completely functional linux operating system in a two
> chip solution

It's not a single chip or even two chips. It doesn't run linux. Keep guessing wrong, Matt.

> with virtually no external parts required beyond those
> needed for the connectors.

There are a lot more parts than connectors in the BOM. Wrong again.

> And it's all now written in C or C++ or
> whatever the hell language you want to write it in.

Well, "whatever the hell language" gets you off on a technicality there, Matt.

> It's crazy easy to do embedded development work these
> days. No more difficult than writing software on a full blown PC.

There is a class of such development. Pity it's not the class I'm working in.

> I'm sorry, but if that is your idea of complex then it's roughly
> equivalent to my idea of ridiculously easy.

No, Matt, it's not my idea of complex. I see that you're more in need of your advice about not assuming than I am.
From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 00:47:58 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1543B106566B for ; Wed, 2 Apr 2008 00:47:58 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id E78628FC24 for ; Wed, 2 Apr 2008 00:47:57 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m320luwe043381; Tue, 1 Apr 2008 17:47:56 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m320lun8043380; Tue, 1 Apr 2008 17:47:56 -0700 (PDT) Date: Tue, 1 Apr 2008 17:47:56 -0700 (PDT) From: Matthew Dillon Message-Id: <200804020047.m320lun8043380@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312219.m2VMJlkT029240@apollo.backplane.com> <200804011748.m31HmE1h039800@apollo.backplane.com> <200804012010.m31KAMpu041011@apollo.backplane.com> <200804012226.m31MQ42O042173@apollo.backplane.com> Cc: freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion
related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 00:47:58 -0000
:My, my. Mr Dillon likes to be rude to people and tell them they are
:'stupid' and 'silly', but when he makes naïve comments about systems
:he doesn't understand and gets called on it, suddenly it's "complete
:bullshit."
:
:I can see why PHK broke off trying to educate you.

I really have no love for people who are so disrespectful to their peers. A few of the people unfortunately associated with a project I had an interest in fit that category, some more than others. Not too many, only two (well, three if you count yourself). On the bright side my list is very limited.

I do not believe that you are any more qualified than you think I am. Clearly it is an issue for you and just as clearly you are unwilling to engage in any sort of technical conversation about the matter. I really have no idea why.

If you decide you want to have a technical conversation, where you actually post meaningful information useful not only to me but to everyone reading this thread, instead of vague, broad, uninteresting references, then please go ahead and do so. If you think those vague bits of information you post, condescending and secretive as if they were something so secret and special nobody needs to know the details... if you think those actually contribute to the conversation, then you are deluded.

If it is important to you, then perhaps you should consider that the characteristics of NAND flash are only a small part of the equation. The characteristics are not this mystical scary beast that nobody understands, they are very well defined and fairly limited in scope, and thus can be discussed, theorized, implemented, and tested. None of these processes are absolute.
Hell, filesystem design is just as important and I dare say that the only person on this list with more experience than I have on filesystem design is Kirk.

I'm a technical theorist, a dreamer, and an implementer. Theory always comes before function, always. I don't know what your problem is and I really don't care, but it absolutely does not and never has required direct experience to have a technical conversation. If that were true nobody would ever invent anything, try anything, or make any progress.

So, yes, there is a great deal of value to having a technical conversation that mixes theory and actual direct experience. Very few people have the breadth of direct experience required to be able to comment definitively on something. Not a single person on this list, not myself, not you, not Poul... nobody has anywhere near the level of experience required to come to any sort of conclusion with regards to the material we are discussing. All we can do is experiment, theorize, and have a technical conversation about the merits of one thing or another.

So, again, if you have something to contribute to our technical conversation, perhaps some direct experience you've had trying to actually implement one of these 'failed' schemes???, I'm all ears. If not, then I recommend you stop posting.
-Matt From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 01:03:31 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 08EED106564A; Wed, 2 Apr 2008 01:03:31 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id B98BA8FC21; Wed, 2 Apr 2008 01:03:30 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m3213Jhl043507; Tue, 1 Apr 2008 18:03:19 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m3213JEt043506; Tue, 1 Apr 2008 18:03:19 -0700 (PDT) Date: Tue, 1 Apr 2008 18:03:19 -0700 (PDT) From: Matthew Dillon Message-Id: <200804020103.m3213JEt043506@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> <200804011733.m31HXF6e039649@apollo.backplane.com> <200804012014.m31KEvTJ041049@apollo.backplane.com> <200804012325.m31NPwM1042551@apollo.backplane.com> Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 01:03:31 -0000 :> chip solution : :It's not a single chip or even two chips. It doesn't run linux. Keep :guessing wrong, Matt. I'm not guessing at all. 
I don't really give a damn about your embedded project, or your constant innuendos about what it does or does not do. If you decide you want to talk about it, that's up to you. Personally speaking, I love talking about the projects I've done. I love talking about the cool technical details and the hard problems that had to be solved.

I'm talking about the embedded world in general and how it functions these days. What made you think I was talking about YOUR particular project? I have no information... getting anything from you is like pulling teeth, you are wholly unwilling to part with a single meaningful detail and yet you expect to have a technical conversation by referencing it? Give me a break.

Again, if you want to have an actual conversation, then the ball is in your court. You clearly believe that I am not qualified to have that conversation... well, put your money where your mouth is then. If you think my reasoning is so bad, then say something meaningful that directly addresses it, in technical terms. Hell, you can even quote papers rather than produce your own thoughts if you think it is relevant. The devil is in the details. That's what technical conversations are for.

If I went by your logic I would have never written Diablo, or dmail, or a database, or numerous filesystems, or HAMMER, or gotten involved with OSs (people kept saying they were harder than micro os's. Oops, I guess they weren't after all!). Sheesh.
-Matt From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 02:09:09 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3F512106568B for ; Wed, 2 Apr 2008 02:09:09 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outF.internet-mail-service.net (outf.internet-mail-service.net [216.240.47.229]) by mx1.freebsd.org (Postfix) with ESMTP id 208358FC24 for ; Wed, 2 Apr 2008 02:09:09 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP; Tue, 01 Apr 2008 19:10:06 -0700 Received: from julian-mac.elischer.org (localhost [127.0.0.1]) by idiom.com (Postfix) with ESMTP id 6E55E2D600F; Tue, 1 Apr 2008 19:09:06 -0700 (PDT) Message-ID: <47F2EAC4.1050206@elischer.org> Date: Tue, 01 Apr 2008 19:09:08 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.12 (Macintosh/20080213) MIME-Version: 1.0 To: Martin Fouts References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> <200804011733.m31HXF6e039649@apollo.backplane.com> <200804012014.m31KEvTJ041049@apollo.backplane.com> <200804012325.m31NPwM1042551@apollo.backplane.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 02:09:09 -0000 DING DING 
DING! Will the contestants please go to their respective corners and calm down.. both of you are viewing what the other has said in light of your own current viewpoints instead of theirs and it's not reflecting well on either of you. Can we call this to an end and maybe you two can discuss it some time over a beer with a whiteboard. It was fun and interesting at the start, but it's gone too far.. STOPPIT!! .... NOT ONE more post.... leave it as it is.. (gee being a parent does have its uses...)
From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 03:05:25 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EC1A81065672; Wed, 2 Apr 2008 03:05:25 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from mx.danger.com (wall.danger.com [216.220.212.140]) by mx1.freebsd.org (Postfix) with ESMTP id CEA078FC16; Wed, 2 Apr 2008 03:05:25 +0000 (UTC) (envelope-from mfouts@danger.com) Received: from danger.com (exchange3.danger.com [10.0.1.7]) by mx.danger.com (Postfix) with ESMTP id 40AD9403C21; Tue, 1 Apr 2008 20:05:16 -0700 (PDT) X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Date: Tue, 1 Apr 2008 20:05:25 -0700 Message-ID: In-Reply-To: <200804020103.m3213JEt043506@apollo.backplane.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics Thread-Index: AciUXV2iGs/PJPG5TaipOq5gR1lMmAAC4MkA References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com> <200803312254.m2VMsPqZ029549@apollo.backplane.com> <200804011733.m31HXF6e039649@apollo.backplane.com> <200804012014.m31KEvTJ041049@apollo.backplane.com> <200804012325.m31NPwM1042551@apollo.backplane.com> <200804020103.m3213JEt043506@apollo.backplane.com> From: "Martin Fouts" To: , Cc: Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 03:05:26 -0000

To summarize, so that it's all in one place:

1) NAND flash is sufficiently different from either NOR flash or rotational media that filesystem design optimizations aimed at either NOR or rotational media tend to be inefficient on NAND, and NAND offers opportunities for optimizations not present on either. It also presents challenges that don't exist for NOR or rotational media. In particular, seek and rotational latency are not present, but bit error rate is high, the size of the erase unit is large compared to the size of the write unit, and the presence of extra storage in the spare area makes optimizations possible that are not available in the other media, with the caveat that small page NAND devices cannot take advantage of the same degree of optimization as large page NAND devices.

2) It is *possible* to use a flash translation layer to hide the complexity of flash from a filesystem implementation, and commercial file systems exist which do this, most notably the FATFS implementation used on most NAND-based USB devices, on the M-Systems parts, and commercially from Datalight.

3) It is not possible on consumer electronics "convergent" devices to take advantage of the usual caching techniques for performance improvement that are available on systems with relatively large amounts of NAND. A CE device with an included NAND part does not optimize in the same way as an SSD using NAND parts.

4) Power management on battery powered devices makes for different optimization trade-offs than on wall-powered devices. Most notably, it is often desirable to turn off power to RAM when the system is inactive, which has a design impact on robustness and performance.

5) The reduction in BOM and the increase in performance due to customized filesystem design has proven the usefulness of NAND-aware filesystems, at least in the commercial marketplace.

6) There are good reasons for exposing transactional semantics to the users of NAND file systems, having to do with robustness.

7) These are the well known approaches, with different strengths and weaknesses, to NAND-aware file systems:

A) File system completely unaware of NAND, FTL takes care of the differences. This is used in USB devices, and has the advantage of being able to support those devices as if they were FATFS devices without changes to the host filesystem software. It has the disadvantage of performance and robustness penalties due to the filesystem making excessive writes to what it believes are fixed location datablocks.

B) File systems aware of NAND, with an FTL. Datalight's RelianceFS and FFX products combine to provide this sort of approach. The advantage is that they tend to be much more robust than systems without the knowledge and even have higher performance. The disadvantage is the complexity of the translation layer, and the interfaces between it and the filesystem layer and the device layer.

C) File systems that manage the NAND directly without an FTL. These fall into two camps:

i) filesystems that treat NAND like NOR using a flash adaptation layer. JFFS and JFFS2, combined with MTD, are the canonical examples.

ii) filesystems that optimize for NAND properties. YAFFS2 direct is the canonical example.

Because NAND provides no guaranteed good block, the performance issues with it are related to sensitivity to scan time to find state. JFFS2 failed in this area because of the nature of its embedded b-tree data structures, which are expensive to maintain robustly, difficult to garbage collect, and prone to needing frequent scanning and rewriting. It is conjectured that any filesystem which embeds a block renaming scheme into NAND will suffer the same fate. I for one would be interested in seeing a refutation of that conjecture, but there are now four different projects which have attempted to do so with no luck that I'm aware of. The issue is one of locality in the b-tree versus robustness. Sufficiently frequent updates of the structure to NAND to meet robustness requirements tend to put a great deal of write pressure on the device, as well as frequent garbage collection.

At PalmSource, Mike Chen and I took the NetBSD version of LFS and modified it sufficiently to produce a working log-structured file system that was used in the unshipped PalmOS Cobalt product. The conversion was relatively easy, taking somewhat less than 1.5 man years, and the resulting filesystem benchmarked favorably against other commercial products, but never saw field trial, so robustness is indeterminate. A key to the modification was reducing the amount of state that had to be read during mount scan to a single block per erase unit and being very careful about block selection for garbage collection.

Charles Manning had already taken that approach one step further, in yaffs2, when he was able to reduce the amount of information needing scanning to a single spare area per erase unit, greatly reducing the mount scan time. Both the modified LFS and YAFFS2 take advantage of other properties of the NAND to reduce metadata write frequency and both relax timestamp semantics to do so. YAFFS2 goes farther than we did by providing a checkpoint facility which is used to further speed mount time and reconstruction. Both take advantage of spare area writing to determine write transaction completion.
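
The mount-scan idea described above (reading one spare area per erase unit instead of every page) can be sketched roughly as follows. This is only an illustration against a simulated in-memory NAND: the geometry, the spare_tag layout, and the names nand_sim, read_spare and mount_scan are all assumptions made up for the example, not the actual YAFFS2 or modified-LFS on-flash format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 64	/* assumed geometry, for illustration only */
#define NUM_BLOCKS      8

/* Hypothetical per-page spare-area tag; NOT the real YAFFS2 layout. */
struct spare_tag {
	uint32_t seq;	/* block allocation sequence number, 0 = erased */
	uint8_t  bad;	/* nonzero marks a bad block */
};

/* In-memory stand-in for a NAND part; only spare areas are modelled. */
struct nand_sim {
	struct spare_tag spare[NUM_BLOCKS][PAGES_PER_BLOCK];
	unsigned spare_reads;	/* counts simulated spare-area reads */
};

/* Simulated spare-area read; a real driver would issue a NAND command. */
static const struct spare_tag *
read_spare(struct nand_sim *n, int blk, int page)
{
	n->spare_reads++;
	return &n->spare[blk][page];
}

/*
 * Mount scan: read only the first page's spare area in each erase unit.
 * Returns the index of the most recently allocated good block, or -1
 * if every block is bad or erased.
 */
int
mount_scan(struct nand_sim *n)
{
	int newest = -1;
	uint32_t best = 0;

	assert(n != NULL);
	for (int blk = 0; blk < NUM_BLOCKS; blk++) {
		const struct spare_tag *t = read_spare(n, blk, 0);

		if (t->bad || t->seq == 0)
			continue;	/* skip bad or erased blocks */
		if (t->seq > best) {
			best = t->seq;
			newest = blk;
		}
	}
	return newest;
}
```

With this layout the scan costs one spare-area read per erase unit (8 reads here) rather than one per page (512), which is the kind of mount-time reduction the summary attributes to this approach.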
From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 09:10:47 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id ABD0C106566B for ; Wed, 2 Apr 2008 09:10:47 +0000 (UTC) (envelope-from avg@icyb.net.ua) Received: from hosted.kievnet.com (hosted.kievnet.com [193.138.144.10]) by mx1.freebsd.org (Postfix) with ESMTP id 694368FC1A for ; Wed, 2 Apr 2008 09:10:47 +0000 (UTC) (envelope-from avg@icyb.net.ua) Received: from localhost ([127.0.0.1] helo=edge.pp.kiev.ua) by hosted.kievnet.com with esmtpa (Exim 4.62) (envelope-from ) id 1JgybC-0004fk-41 for freebsd-arch@freebsd.org; Wed, 02 Apr 2008 11:45:38 +0300 Message-ID: <47F347B1.2020509@icyb.net.ua> Date: Wed, 02 Apr 2008 11:45:37 +0300 From: Andriy Gapon User-Agent: Thunderbird 2.0.0.12 (X11/20080320) MIME-Version: 1.0 To: freebsd-arch@freebsd.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Subject: kobj method signature/prototype checking/enforcement X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 09:10:47 -0000

As you are most probably aware, currently there is no checking/enforcement for signatures of functions implementing kobj methods. Internally all function pointers are stored as pointers to 'int f(void)', and they are cast to and from as needed. So, for example, if you set a function 'char * g(void **)' as the device_probe method then the compiler will compile everything just fine; it is only at run-time that you will get into trouble because of mismatching arguments.

I propose to defend against this problem using the following macro for KOBJMETHOD:

#define KOBJMETHOD(NAME, FUNC) \
	{ &NAME##_desc, (kobjop_t) (FUNC != (NAME##_t *)NULL ? FUNC : NULL) }

This is the idea behind it:
1. the comparison expression is a NOP, its result is always the same as (kobjop_t)FUNC
2. the expression is evaluated at compile time, so it doesn't create any run-time overhead or binary differences
3. the purpose of the expression is to make use of the GCC feature that warns about comparing "distinct pointer types"

I tested this change with 6.3-RELEASE sources. It revealed a number of signature mismatches in different places. Obviously all of them are quite harmless - otherwise they would have already been discovered the hard way (by people bitten).

Here's a general overview of the issues discovered:
1. integer parameters differing in signedness (totally harmless, I think)
2. using void return type instead of int, usually for the device_shutdown method (not sure about this one)
3. using int return type instead of a specific size integer return type, typically for sound channel interface methods
4. 'char *' parameter instead of 'const char *' parameter (potentially can result in future problems)
5. significantly different signatures for several "dummy" methods that do not actually use any of the parameters and simply print a message or panic.

While the above issues are quite harmless, I still think that adding such checking code is a good thing. It will help with new code development and it will help general code quality and maintenance.

Unfortunately I don't have my FreeBSD development environment quite set up (yet) for large scale development, so at this point I cannot provide a patch for HEAD that would fix all the build breakages (on all the platforms) that would be caused by the proposed change (when -Werror is in effect).
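
For readers who want to try the proposed macro outside the kernel, here is a minimal user-space sketch. Only the KOBJMETHOD definition is taken from the proposal above; the simplified kobjop_desc and kobj_method structures, the demo_probe interface, and the functions good_probe, bad_probe and call_demo_probe are hypothetical stand-ins invented for this example, and the table entry is built as a local variable to keep the sketch portable.

```c
#include <assert.h>
#include <stddef.h>

/* All methods are stored through one generic pointer type. */
typedef int (*kobjop_t)(void);

/* Simplified stand-ins for the kernel structures (hypothetical layout). */
struct kobjop_desc { const char *name; };
struct kobj_method { struct kobjop_desc *desc; kobjop_t func; };

/* Hypothetical interface with a single method, demo_probe(dev). */
struct device { int unit; };
typedef int demo_probe_t(struct device *dev);
static struct kobjop_desc demo_probe_desc = { "demo_probe" };

/* The macro as proposed: the comparison exists only for type checking. */
#define KOBJMETHOD(NAME, FUNC) \
	{ &NAME##_desc, (kobjop_t) (FUNC != (NAME##_t *)NULL ? FUNC : NULL) }

/* Matches demo_probe_t, so the comparison is between identical types. */
static int
good_probe(struct device *dev)
{
	return dev->unit + 100;
}

/* Does NOT match demo_probe_t; see the commented-out entry below. */
char *
bad_probe(void **p)
{
	(void)p;
	return NULL;
}

/* Build a one-entry method table and dispatch through it. */
int
call_demo_probe(struct device *dev)
{
	struct kobj_method m = KOBJMETHOD(demo_probe, good_probe);
	/*
	 * Uncommenting the next line should make the compiler complain
	 * about comparing distinct pointer types:
	 * struct kobj_method bad = KOBJMETHOD(demo_probe, bad_probe);
	 */

	assert(m.desc == &demo_probe_desc);
	/* Cast back to the method's real type before calling. */
	return ((demo_probe_t *)m.func)(dev);
}
```

Depending on compiler and warning flags, the always-true comparison may itself draw a separate warning about a function address never being NULL; that is a side effect of the comparison trick and does not change the generated value.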
-- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 18:17:46 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 371D71065786 for ; Wed, 2 Apr 2008 18:17:46 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 22EF78FC24 for ; Wed, 2 Apr 2008 18:17:46 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from zion.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by elvis.mu.org (Postfix) with ESMTP id CFE8F1A4D80; Wed, 2 Apr 2008 11:17:45 -0700 (PDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Wed, 2 Apr 2008 13:09:54 -0400 User-Agent: KMail/1.9.7 References: <10004.1205307334@critter.freebsd.dk> <20080312152744.I29518@fledge.watson.org> <20080328202602.N72156@desktop> In-Reply-To: <20080328202602.N72156@desktop> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200804021309.54956.jhb@freebsd.org> Cc: Subject: Re: timeout/callout small step forward X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 18:17:46 -0000 On Saturday 29 March 2008 03:04:17 am Jeff Roberson wrote: > http://people.freebsd.org/~jeff/callout.diff > > This patch takes the current callout implementation and makes it per-cpu. > It also hides callout details from the rest of the kernel by making the > callwheel structure private to kern_timeout.c among other things. Looks good. The kern_intr.c diff has a small bug (forgot to remove the return (intr_event_create(...)) from swi_add()). 
A few style suggestions would be to always leave a blank line before a comment (I think I saw this in kern_calloutwheel_init()?) and usually there isn't a blank line before a SYSINIT(). Maybe make the panic messages when creating softclock threads more specific, but that's very minor. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 19:11:49 2008 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4CBF11065670 for ; Wed, 2 Apr 2008 19:11:49 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 08DE28FC15 for ; Wed, 2 Apr 2008 19:11:48 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.2/8.14.1) with ESMTP id m32J9TZT015462; Wed, 2 Apr 2008 13:09:29 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Wed, 02 Apr 2008 13:10:19 -0600 (MDT) Message-Id: <20080402.131019.-705186138.imp@bsdimp.com> To: dillon@apollo.backplane.com From: "M. Warner Losh" In-Reply-To: <200804012226.m31MQ42O042173@apollo.backplane.com> References: <200804012010.m31KAMpu041011@apollo.backplane.com> <200804012226.m31MQ42O042173@apollo.backplane.com> X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: freebsd-arch@FreeBSD.org, mfouts@danger.com Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 19:11:49 -0000 In message: <200804012226.m31MQ42O042173@apollo.backplane.com> Matthew Dillon writes: : : :> 64MB is tiny. 
None of the problems with any of the : :> approachs we've discussed even exist with devices that small in an : :> embedded system. : : : :It is fairly clear that you're not familiar with NAND devices on : :embedded systems, as you've just said that well known problems do not : :exist. : : : :> To be clear, because I really don't understand how you : :> can possibly argue that the named-block storage layer is bad in a : :> device that small... : : : :Yes, your lack of understanding is very apparent. : : What complete bullshit. If you want to argue technical merits, be : my guest. So far you haven't made one single technical point in : any of your postings. You've posted about your experience with NAND AHEM! Matt, you will keep a civil tongue, or you will be asked to leave the list. This goes for everybody else too. Warner From owner-freebsd-arch@FreeBSD.ORG Wed Apr 2 19:21:02 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3947E1065674 for ; Wed, 2 Apr 2008 19:21:02 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id C9EC38FC38 for ; Wed, 2 Apr 2008 19:21:01 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.2/8.14.1) with ESMTP id m32JGiYO015549; Wed, 2 Apr 2008 13:16:45 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Wed, 02 Apr 2008 13:17:34 -0600 (MDT) Message-Id: <20080402.131734.255331081.imp@bsdimp.com> To: avg@icyb.net.ua From: "M. 
Warner Losh" In-Reply-To: <47F347B1.2020509@icyb.net.ua> References: <47F347B1.2020509@icyb.net.ua> X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: kobj method signature/prototype checking/enforcement X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Apr 2008 19:21:02 -0000 In message: <47F347B1.2020509@icyb.net.ua> Andriy Gapon writes: : I propose to defend against this problem using the following macro for : KOBJMETHOD: : #define KOBJMETHOD(NAME, FUNC) \ : { &NAME##_desc, (kobjop_t) (FUNC != (NAME##_t *)NULL ? FUNC : NULL) } ... : Here's a general overview of issues discovered: : 1. integer parameters differing in signedness (totally harmless, I think) : 2. using void return type instead of int, usually for device_shutdown : method (not sure about this one) : 3. using int return type instead of specific size integer return type, : typically for sound channel interface methods : 4. 'char *' parameter instead of 'const char *' parameter (potentially : can result in future problems) : 5. significantly different signatures for several "dummy" methods that : do not actually use any of the parameters and simply print a message or : panic. : : While the above issues are quite harmless, I still think that adding : such a checking code is a good thing. It will help with new code : development and it will help general code quality and maintenance. 
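The type check performed by the KOBJMETHOD macro quoted above can be illustrated in isolation. What follows is a minimal userland sketch, not the kernel code: kobjop_t, struct kobjop_desc and struct kobj_method are simplified stand-ins, and the demo_probe method, good_probe(), bad_probe and call_demo_probe() are hypothetical names made up for the example. Only the macro body is taken from the message. The idea is that comparing FUNC with a null pointer of the expected type forces the compiler to look at FUNC's real type before the cast to kobjop_t hides it. Some newer compilers may additionally note that the comparison is always true.

```c
#include <stddef.h>

/* Generic method pointer; a simplified stand-in for the kernel's kobjop_t. */
typedef int (*kobjop_t)(void);

/* Simplified stand-in for the per-method descriptor. */
struct kobjop_desc {
	const char *name;
};

struct kobj_method {
	struct kobjop_desc *desc;
	kobjop_t func;
};

/* Hypothetical method "demo_probe": expected signature and descriptor. */
typedef int demo_probe_t(int unit);
static struct kobjop_desc demo_probe_desc = { "demo_probe" };

/*
 * The macro as quoted in the message. Comparing FUNC with a null pointer
 * of type NAME##_t * gives the compiler two pointer types to reconcile, so
 * a mismatched FUNC is diagnosed even though the result is cast to kobjop_t.
 */
#define KOBJMETHOD(NAME, FUNC) \
	{ &NAME##_desc, (kobjop_t) (FUNC != (NAME##_t *)NULL ? FUNC : NULL) }

/* Matches demo_probe_t, so the comparison is between compatible types. */
static int
good_probe(int unit)
{
	return (unit + 1);
}

/* Build a one-entry method table and call through it. */
static int
call_demo_probe(int unit)
{
	struct kobj_method methods[] = {
		KOBJMETHOD(demo_probe, good_probe),
		/*
		 * Given a hypothetical "void bad_probe(void)", the entry
		 * KOBJMETHOD(demo_probe, bad_probe) should draw a diagnostic
		 * about comparing distinct pointer types.
		 */
	};

	/* Cast back to the real type before calling, as a dispatcher would. */
	return (((demo_probe_t *)methods[0].func)(unit));
}
```

With the matching function the table builds without a type diagnostic and call_demo_probe(41) returns 42. The table is a local array here so that the initializer does not have to be a constant expression; whether the same expression is accepted in a static initializer is compiler-dependent and is not shown by this sketch.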
: : Unfortunately I don't have my FreeBSD development environment quite set : up (yet) for large scale development, so at this point I can not provide : a patch for HEAD that would fix all the build breakages (on all the : platforms) that would be caused by the proposed change (when -Werror is : in effect). Yes! I think I like this approach, and would like to see it fleshed out more. Warner From owner-freebsd-arch@FreeBSD.ORG Thu Apr 3 05:51:13 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2A2B3106566C; Thu, 3 Apr 2008 05:51:13 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id E64E88FC2A; Thu, 3 Apr 2008 05:51:12 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [10.0.1.199] (cpe-24-94-72-120.hawaii.res.rr.com [24.94.72.120]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id m335p6fA098719; Thu, 3 Apr 2008 01:51:09 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Wed, 2 Apr 2008 19:51:32 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: John Baldwin In-Reply-To: <200804021309.54956.jhb@freebsd.org> Message-ID: <20080402195001.O949@desktop> References: <10004.1205307334@critter.freebsd.dk> <20080312152744.I29518@fledge.watson.org> <20080328202602.N72156@desktop> <200804021309.54956.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: timeout/callout small step forward X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 03 Apr 2008 05:51:13 -0000 On Wed, 2 Apr 2008, John Baldwin 
wrote: > On Saturday 29 March 2008 03:04:17 am Jeff Roberson wrote: >> http://people.freebsd.org/~jeff/callout.diff >> >> This patch takes the current callout implementation and makes it per-cpu. >> It also hides callout details from the rest of the kernel by making the >> callwheel structure private to kern_timeout.c among other things. > > Looks good. The kern_intr.c diff has a small bug (forgot to remove the return > (intr_event_create(...)) from swi_add()). A few style suggestions would be Ah thanks. I had fixed this in a tree but didn't update the patch. Now it's in current. I'll check that in. > to always leave a blank line before a comment (I think I saw this in > kern_calloutwheel_init()?) and usually there isn't a blank line before a > SYSINIT(). Maybe make the panic messages when creating softclock threads > more specific, but that's very minor. Ok, I think kern_timeout.c could use some reformatting and refactoring as well, but I didn't want to tie this commit to that. Some of those functions get too deep and should be broken off into simpler routines.
Thanks, Jeff > > -- > John Baldwin >