Date:      Sat, 21 Jan 2017 22:16:42 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        Karl Denninger <karl@denninger.net>
Cc:        "freebsd-arm@freebsd.org" <freebsd-arm@freebsd.org>
Subject:   Re: how to measure microsd wear
Message-ID:  <CANCZdfogv7aAvvke=U3Kit1XZxo8P4sg2B72bXMaoh1qFZn-Pg@mail.gmail.com>
In-Reply-To: <eef976be-7261-14cb-153b-9eb7a4bdd43b@denninger.net>
References:  <20170122002432.B16E8406061@ip-64-139-1-69.sjc.megapath.net> <eef976be-7261-14cb-153b-9eb7a4bdd43b@denninger.net>

On Sat, Jan 21, 2017 at 9:29 PM, Karl Denninger <karl@denninger.net> wrote:
> On 1/21/2017 18:24, Hal Murray wrote:
>> karl@denninger.net said:
>>> and this one is not a low-hour failure either, nor is it an off-brand --
>>> it's a Sandisk Ultra 32Gb and the machine has roughly a year of 24x7x365
>>> uptime on it.
>> Any idea how many writes it did?
> Offhand, no.  I did not expect this particular device to have a problem
> given its workload, but it did.  It could have been a completely random
> event (e.g. cosmic ray hits the "wrong" place in the controller's mapping
> tables, damages the data in it a critical way, and the controller throws
> up its hands and says "screw you, it's over.")  There's no real way to
> know - the card is effectively junk as the controller has write-locked
> it, so all I can do (and did) is get the config files and application it
> runs under the OS off it and put them on the new one.

If you are lucky, the SD card will fail 'read only' rather than 'read
never'. :) You're correct that once you go into that mode, however,
you can't get out of it with standard interfaces. There are rumors of
vendor specific ones that are used to diagnose failure modes, but I've
never been able to find out more about them as they are firmware
specific. The NAND chips, however, remain generally readable and you
can do a hunt to see what's what. I say generally, though, because the
list of failure modes for NAND chips is scary....

> The other failures were less-surprising; in particular the box on my
> desk, given that I compile on it frequently and that produces a lot of
> small write I/O activity, didn't shock me all that much when it failed.
>
> One of the big problems with NAND flash (in any form) is that it can
> only be written to "zeros."  That is, a blank page is all "1s" at a bit
> level, and a write actually just writes the zeros.

Yes and no. Erasing a page will set it to all 1's. Programming a page
will move the charge nodes from the 'erased' state to the 'programmed'
state. What this means varies a lot based on the type of NAND. SLC,
sure, it goes from 1 to 0 (although the node for 1 still moves a bit).
For MLC or TLC, it's a lot more complicated because you're encoding 2
or 3 bits into discrete voltage levels. You have to do the proper dance
and program the pages in the correct order with the correct
'randomizations' in the data (either inside the chip, or external to
it) to make sure that the 'white balance' of bits is very close to
50/50. Also added in this pipeline is the ECC or LDPC error coding to
ensure that the crappy NAND can recover from the inevitable bit errors
that you know will happen.
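
To make the 'white balance' part concrete, here's a minimal sketch of
the sort of data scrambler that sits in that pipeline: XOR the payload
with a pseudo-random stream so long runs of identical bits get spread
out to roughly 50/50 before they hit the cells. The LFSR taps and seed
below are textbook examples, not any vendor's actual scheme.

    /*
     * Minimal sketch of NAND data whitening: XOR the payload with a
     * pseudo-random stream so the programmed bits come out near 50/50.
     * The LFSR taps and seed are illustrative; real controllers use
     * chip- or firmware-specific values.
     */
    #include <stdint.h>
    #include <stddef.h>

    static uint16_t
    lfsr_step(uint16_t s)
    {
        /* 16-bit Fibonacci LFSR, taps 16,14,13,11 (maximal length) */
        uint16_t bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1;

        return (s >> 1) | (bit << 15);
    }

    /* The same routine scrambles and descrambles; XOR is its own inverse. */
    void
    scramble_page(uint8_t *buf, size_t len, uint16_t seed)
    {
        uint16_t s = seed ? seed : 0xACE1;  /* LFSR state must be non-zero */

        for (size_t i = 0; i < len; i++) {
            buf[i] ^= (uint8_t)s;
            s = lfsr_step(s);
        }
    }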

> This leads to what
> is called "write amplification" because changing one byte in a page
> requires reading the page in and writing an entire new page out, then
> (usually later) erasing the former page; you cannot update in-place.

The erase and program cycle causes this, yes. But write amplification
happens when the drive has to garbage collect blocks off its log to
free space it can write new blocks into. It is different from the
effect you are talking about, which seems to ignore the LBA-to-physical
translation layer in the drive that's hidden from the user.
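
Put another way, write amplification is just the NAND bytes actually
programmed divided by the bytes the host asked to write, and garbage
collection (relocating still-valid data) is what pushes the ratio
above 1. A toy accounting sketch, with made-up structure names:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy write-amp accounting: host writes vs. total NAND programs. */
    struct wa_counters {
        uint64_t host_bytes;  /* bytes the host asked the drive to write */
        uint64_t nand_bytes;  /* bytes programmed to NAND, including
                                 valid data relocated by garbage collection */
    };

    static double
    write_amp(const struct wa_counters *c)
    {
        return c->host_bytes ? (double)c->nand_bytes / c->host_bytes : 0.0;
    }

    int
    main(void)
    {
        /* 1 GB of host writes that forced GC to relocate 0.75 GB. */
        struct wa_counters c = {
            .host_bytes = 1ULL << 30,
            .nand_bytes = (1ULL << 30) + (3ULL << 28),
        };

        printf("write amp = %.2f\n", write_amp(&c));  /* prints 1.75 */
        return (0);
    }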

> If
> a page is 4k in size then writing a single byte results in an actual
> write of 4k bytes, or ~4,000 times as much as you think you wrote.

No, that's not how it works. First off, there's no interface for
writing one byte (just for programming a page). Second, the OS will
translate writing one byte into writing one block, which will cause a
new page to be written to the end of the log. Flash memory is erased
BLOCKS at a time (usually a few hundred pages) and programmed a page
at a time. When you write a new block, it gets appended to the end of
the "log" with a note about the new LBA-to-physical mapping. Also,
pages are usually larger than logical blocks, so you often wind up
with multiple blocks living inside a single page. Write amp is caused
when LBAs are re-written, creating "holes" in the map. The erase
blocks that are most empty are usually selected to be garbage
collected (the valid blocks are written to the end of the log and the
erase block is erased to use for new writes). So write amp tends to
trend as the inverse of the number of spare blocks in the system...
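
A stripped-down sketch of that mechanism: an LBA-to-physical map, a
log that only ever appends, and a GC pass that picks the erase block
with the fewest valid pages left to relocate. The sizes and structure
names are invented for illustration; a real FTL is far more involved.

    #include <stdint.h>
    #include <string.h>

    #define PAGES_PER_EB  256   /* pages per erase block (illustrative) */
    #define NUM_EB        64
    #define NUM_LBAS      (PAGES_PER_EB * NUM_EB * 3 / 4)  /* 25% spare */
    #define PPA_INVALID   UINT32_MAX

    static uint32_t map[NUM_LBAS];  /* LBA -> physical page address */
    static uint16_t valid[NUM_EB];  /* valid page count per erase block */
    static uint32_t head;           /* next free page at the log head */

    static void
    ftl_init(void)
    {
        memset(valid, 0, sizeof(valid));
        for (uint32_t i = 0; i < NUM_LBAS; i++)
            map[i] = PPA_INVALID;
        head = 0;
    }

    /* A host write just appends; the old copy becomes a "hole". */
    static void
    ftl_write(uint32_t lba)
    {
        uint32_t old = map[lba];

        if (old != PPA_INVALID)
            valid[old / PAGES_PER_EB]--;
        map[lba] = head;
        valid[head / PAGES_PER_EB]++;
        head++;    /* a real FTL wraps here and kicks off GC as needed */
    }

    /* GC victim: the erase block with the least valid data to relocate
       (a real FTL would skip the block the log head is writing into). */
    static uint32_t
    gc_pick_victim(void)
    {
        uint32_t best = 0;

        for (uint32_t eb = 1; eb < NUM_EB; eb++)
            if (valid[eb] < valid[best])
                best = eb;
        return best;
    }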

> This
> is also one of the reasons that random small-block write performance is
> much slower than big writes; if you write an even multiple of an on-card
> block the controller can simply lay down the new data onto pre-erased
> space, where if you write small pieces of data it cannot do that and
> winds up doing a lot of read/write cycling.

Actually, that's crap too. The reason that small writes are lower
performance comes down to internal structures on SSDs that group the
logical writes (say 512 or 4k) into a larger page (say 32k or 64k) to
keep the metadata for the LBA-to-physical mapping down. So they do a
read-modify-write, which adds a tREAD and some buffering time and
increases write amp.
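
That is, the small write turns into a read-modify-write of the larger
internal page. A rough sketch of that path; the 4k/32k sizes are just
the examples above, and the nand_* helpers are stand-ins, not a real
controller API:

    #include <stdint.h>
    #include <string.h>

    #define LOGICAL_BLK  4096   /* what the host writes */
    #define NAND_PAGE    32768  /* internal mapping granularity (example) */

    /* Stand-in NAND primitives; any real controller has its own. */
    void nand_read_page(uint32_t ppa, uint8_t *buf);
    void nand_program_page(uint32_t ppa, const uint8_t *buf);

    /*
     * A 4k host write that lands inside a 32k internal page: read the
     * old page, splice in the new 4k, program a fresh page at the log
     * head. The extra tREAD plus programming 8x the data is where the
     * small-write penalty (and the added write amp) comes from.
     */
    void
    small_write(uint32_t old_ppa, uint32_t new_ppa, uint32_t blk_in_page,
        const uint8_t *data)
    {
        static uint8_t page[NAND_PAGE];

        nand_read_page(old_ppa, page);
        memcpy(page + (size_t)blk_in_page * LOGICAL_BLK, data, LOGICAL_BLK);
        nand_program_page(new_ppa, page);  /* 32k programmed for 4k written */
    }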

> It gets worse (by a lot) if
> there's file metadata to update with each write as well because that
> metadata almost-certainly winds up carrying a (large) amount of write
> amplification irrespective of the file data itself.  All of this is a
> big part of why write I/O performance to these cards for actual
> filesystem use is stinky in the general case compared against
> pretty-much anything else.

Blocks are blocks. Metadata doesn't matter. Blocks get appended to the
device log, so when you write them, it doesn't matter where on the
disk they land.

> The controller's internal logic has much voodoo in it from a user's
> perspective; the manufacturers consider exactly how they do what they do
> to be proprietary and simply present to you an opaque block-level
> interface.  There are rumors that some controllers "know" about certain
> filesystems (specifically exFAT) and are optimized for it, which implies
> they may behave less-well if you're using something else.  How true this
> actually might be is unknown but a couple of years ago I had a card that
> appeared dead -- until it was reformatted with exFAT, at which point it
> started working again.  I didn't trust it, needless to say.

Different drives have different strategies. But the exFAT special
case, when it is still used at all, is confined to SD cards; SSDs
don't use it anymore.

> SSDs typically have a published endurance rating and a reasonable
> interface to get a handle on how much "wear" they have experienced.
> I've never seen either in any meaningful form for SD cards of any sort.
> In addition SSDs can (and do) "cheat" in that they all have RAM in them
> and thus can collate writes together before physically committing them
> in some instances, plus they typically will report that a write is
> "complete" when it is in RAM (and not actually in NAND!)  Needless to
> say if there's no proper power protection sufficient to flush that RAM
> if the power fails unexpectedly very bad things will happen to your
> data, and very few SSDs have said proper power protection (Intel 7xx and
> 3xxx series are two that are known to do this correctly; I have a bunch
> of the 7xx series drives in service and have never had a problem with
> any of them even under intentional cord-yank scenarios intended to test
> their power-loss protection.)  I'm unaware of SD cards that do any of
> this and I suspect their small size precludes it, never mind that they
> were not designed for a workload where this would be terribly useful.
> The use envisioned for most SD cards, and their intent when designed, is
> the sequential writing of anywhere from large to huge files (video or
> still pictures) and the later sequential reading back of same, all under
> some form of a FAT filesystem (exFAT for the larger cards now available.)

That's kinda true. SD cards have no room for supercaps. However, the
black art that goes into SD cards has allowed for this and they have
reliability guarantees of their own, but they go slower for it. They
generally don't have large DRAM buffers, and generally throttle
writes to the NAND rate pretty quickly. If you aren't lying to the OS
by saying the write is complete to win on IOPS benchmarks, the
reliability issues go away.

Where SD cards fall down is that their firmware is usually rubbish at
recovering from certain kinds of errors. The usual case of a camera
that loses power while writes are going on generally works (basically
a sequential workload) because the metadata they need to keep their
books is usually written in a reliable way. Usually, but not always,
which is why I usually rate them as rubbish.

> IMHO the best you can do with these cards in this application is to
> minimize writes to the extent you can, especially small and frequent
> writes of little actual value (e.g. mount with noatime!) and make sure
> you can reasonably recover from failures in a rational fashion.

That's true, but for none of the reasons that you suggest. The
reasons have more to do with the log structure the devices are forced
to use, coupled with newer NAND nodes that have lower and lower
endurance. The good 3D NAND that brings endurance back up is reserved
for the NVMe and SSD drives, but it is starting to show up in
high-performance SD cards as well, though they are still pricey.

BTW, I worked at Fusion I/O for a few years writing their 'on load'
driver that did all these things and learning about all the tricks
that are used in the industry. My main area of focus was the NAND
reliability models that were used to get better performance later in
life out of crappy NAND and to ensure the drives could go 5x the
manufacturer's stated endurance numbers with lots of clever tricks.
Secondarily, I worked on garbage collection and radix tree design to
help improve our drive's performance under a variety of workloads.

Warner


