To: Marcel Moolenaar
From: "Poul-Henning Kamp"
In-Reply-To: Your message of "Fri, 15 Feb 2008 15:46:02 PST."
Date: Sat, 16 Feb 2008 00:33:06 +0000
Message-ID: <94215.1203121986@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: geom@FreeBSD.org
Subject: Re: Brainstorm: NAND flash

In message , Marcel Moolenaar writes:

>> Mediasize is about addressability, not about usability, so this
>> assumption is wrong.
>>
>> A GEOM provider is just an addressable array of sectors; it
>> doesn't guarantee that you can read them all or write them all,
>> as is indeed the case when your disk develops a bad sector.
>>
>> NAND is only special due to the OOB stuff; the main page array
>> is just a pretty spotty disk, for all GEOM cares.
>
>The reason I thought this was good is that disks are shipped
>without bad blocks visible to the "application".  That is: the
>norm is no bad blocks.  With NAND flash the norm is that bad
>blocks are part of the deal.  I thought that dealing with bad
>blocks explicitly for NAND would level the playing field and
>make it more consistent...

Well, if you want to take that route, you should not use GEOM to
connect the wear-leveling to the NAND flash in the first place.

Which option you prefer there is sort of a toss-up.  Putting it under
GEOM gives you devices in /dev and other benefits; using a private
interface allows you to get it more precisely tailored to your needs.

I would say put it under GEOM: the bad blocks will not trouble GEOM,
and should somebody get perfect NAND (or care to handle the bad blocks
otherwise), they can stick their filesystem there directly, if they
don't need to write to it too much.

>>> dealt with at this level. NANDs don't have sectors.
>>> Attributes of this class include:
>>> blockcount - the raw number of blocks
>>
>> This goes in mediasize (as a byte count)
>>
>>> blocksize - the number of bytes or pages in a block
>>
>> This goes in sectorsize.
>
>Can't this cause race conditions?
>
>Suppose there happens to be an MBR in the first page at offset 0.
>The MBR class could end up taking the provider, when a
>wear-leveling geom should really take it.

At the moment the wear-leveling opens the NAND device for writing,
the MBR would get spoiled and disappear.  And the chances of the MBR
finding its metadata in the right physical sector are pretty small to
begin with, if the wear-leveling is worth anything.

Of course if you do simple bad-block substitution the chance would be
close to certainty, but the MBR would still get spoiled, so that would
still work.
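For what it's worth, the mediasize/sectorsize mapping above is all the
wear-leveling class would have to publish.  Something like this, an
untested sketch from memory, with the g_nand_wl names made up for the
example:

    struct g_nand_wl_softc {
        off_t   blockcount;     /* raw number of erase blocks */
        off_t   blocksize;      /* bytes per erase block */
    };

    static void
    g_nand_wl_announce(struct g_geom *gp, struct g_nand_wl_softc *sc)
    {
        struct g_provider *pp;

        /*
         * Addressability, not usability: bad blocks stay addressable,
         * they just fail when touched, like a bad sector on a disk.
         */
        pp = g_new_providerf(gp, "%s.raw", gp->name);
        pp->mediasize = sc->blockcount * sc->blocksize; /* byte count */
        pp->sectorsize = sc->blocksize;
        g_error_provider(pp, 0);   /* make it available for tasting */
    }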
>I'm ignorant of the obviousness of why sector mapping and
>wear-leveling are to be done at the same time...
>
>...and I presume you can't elaborate...

No, I can't.

But I can tell you something about filesystems under a BSD license
which might interest you.

Imagine you implement a filesystem that allocates space in 512-byte
sectors, even though the underlying device has a (much) larger sector
size.[1]

To reduce the amount of disk I/O, you would obviously want to avoid
doing

	read 64k block
	modify 512 bytes of those
	write 64k block
	read same 64k block
	modify some other 512 bytes of those
	write 64k block again

in particular if writes were very slow or otherwise expensive.

You would of course do this by implementing, as UNIX has always done,
a buffer-cache that does the logical/physical translation.

BUT, imagine now, as a complication, that your filesystem is
log-structured, in somewhat the same hacked-up way that Margo Seltzer
did with LFS.

The idea behind LFS is important in this context: the objective was to
gain write speed by always writing sequentially, basically treating the
disk as a circular buffer, hoping that the RAM cache would limit the
number of seeks for reading, and that the disk would have enough free
space to keep the workload of the cleaner process down.

The trouble with that, of course, is that both assumptions were wrong
until RAM and disks exploded in size just a few years ago.  On a 95%
filled filesystem, LFS sank under the weight of the cleaner, and RAM
was never big enough to cache everything you wanted; caching doesn't
help until the second access anyway.

The other important aspect of an LFS is that you need a "cleaner"
process to run ahead of the write pointer and scavenge space.  If it
finds a fully used big block, it leaves it alone, but if it finds a
64k block with only 512 bytes of live data, it copies those 512 bytes
into the write stream so it can mark the 64k block as free and recycle
it.

Margo's LFS was a fiasco, but we can still learn from it:

The source of the trouble, as far as I have been able to find out, is
that the filesystem naming layer (in her case UFS) needs a logical
block number, which must be determined before the physical block
number has been allocated, so the logical block number must be
translated to a physical one through some sort of table or other
means.

You obviously would _not_ want two copies of the data in the cache,
one under the logical and one under the physical block number, so you
have to pick one or the other.

Margo's choice of the easy solution to the logical/physical mapping
problem in LFS sucked badly when it came to writing the "cleaner"
process:

A mapping that gives you only the logical->physical translation
cheaply, but requires you to read many blocks of disk to reverse the
mapping, doesn't help you when you read a physical sector and need to
find out whether it is still in use, and where it belongs in the
logical space.  Which is exactly what the cleaner needs to do.

I believe in the end her choice made it so damn hard that the cleaner
never happened during the time she took an interest in LFS (exactly
until she got her PhD, I believe?).  Ousterhout had some very good and
relevant, but harsh, words for her about that.

(Sprite's LFS, by Ousterhout, is also worth a study; it was better
designed, but also more narrowly tailored to the Sprite OS, and thus
we cannot learn as much from it today.)
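To put the read/modify/write point further up in code: one cached copy
of the big physical block absorbs any number of 512-byte logical
writes, so the device sees at most one read and one write per block.
A toy userland sketch, untested, with made-up names and no real cache
replacement:

    #include <sys/types.h>
    #include <stdint.h>
    #include <string.h>

    #define PHYSBLK (64 * 1024)   /* device "sector" size */
    #define SECSIZE 512           /* filesystem allocation unit */

    struct wbuf {
        int     valid;            /* buffer holds a physical block */
        int     dirty;            /* buffer differs from the media */
        off_t   blkno;            /* which physical block it holds */
        uint8_t data[PHYSBLK];
    };

    /* Stand-ins for the slow device I/O, one physical block at a time. */
    extern void phys_read(off_t blkno, void *buf);
    extern void phys_write(off_t blkno, const void *buf);

    static void
    wbuf_flush(struct wbuf *b)
    {
        if (b->valid && b->dirty) {
            phys_write(b->blkno, b->data);  /* one write, not one per sector */
            b->dirty = 0;
        }
    }

    /* Write one 512-byte sector; the media is only touched on a miss or flush. */
    static void
    write_sector(struct wbuf *b, off_t sector, const uint8_t *src)
    {
        off_t  blkno = sector * SECSIZE / PHYSBLK;
        size_t off = (size_t)(sector * SECSIZE % PHYSBLK);

        if (!b->valid || b->blkno != blkno) {
            wbuf_flush(b);               /* push out the previous block */
            phys_read(blkno, b->data);   /* one read per block */
            b->blkno = blkno;
            b->valid = 1;
        }
        memcpy(b->data + off, src, SECSIZE);
        b->dirty = 1;
    }

The second read and the repeated writes from the list above simply
never reach the device; they hit the cached copy.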
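And the mapping point: the cleaner wants the reverse of the lookup that
normal I/O wants, so the map has to be kept (or be cheaply derivable)
both ways.  Again an untested sketch with invented names, using
trivially small in-core arrays instead of whatever on-disk structure a
real filesystem would need:

    #include <stddef.h>

    #define NOMAP (-1L)           /* block not mapped / not live */

    struct blkmap {
        long   *log2phys;         /* indexed by logical block number */
        long   *phys2log;         /* indexed by physical block number */
        size_t  nlog, nphys;
    };

    /* Stand-in for copying a live block into the log's write stream. */
    extern void copy_into_write_stream(long lbn, long pbn);

    /* Point logical block lbn at newly written physical block pbn. */
    static void
    blkmap_set(struct blkmap *m, long lbn, long pbn)
    {
        long old = m->log2phys[lbn];

        if (old != NOMAP)
            m->phys2log[old] = NOMAP;   /* the old copy is now garbage */
        m->log2phys[lbn] = pbn;
        m->phys2log[pbn] = lbn;
    }

    /* What the cleaner does as the write pointer approaches block pbn. */
    static void
    clean_block(struct blkmap *m, long pbn)
    {
        long lbn = m->phys2log[pbn];    /* the cheap reverse lookup */

        if (lbn == NOMAP)
            return;                     /* dead block, just recycle it */
        copy_into_write_stream(lbn, pbn);
        /* blkmap_set() repoints lbn when the copy lands in the log. */
    }

With only log2phys available, clean_block() would have to chase
metadata on disk to answer "is pbn live, and whose is it?", which is
where the cleaner gets stuck.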
This is all from memory; I haven't bothered to look up the LFS source
code or the correspondence on Ousterhout's page, so some details may
be slightly off, for which I apologize.

Poul-Henning

[1] It's interesting that Sun gave up on this and had to get special
firmware for CD-ROM drives, but that's an entirely different story and
not relevant :-)

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.