Date:      Thu, 12 May 2011 01:34:29 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Daniel Kalchev <daniel@digsys.bg>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: ZFS: How to enable cache and logs.
Message-ID:  <20110512083429.GA58841@icarus.home.lan>
In-Reply-To: <4DCB7F22.4060008@digsys.bg>
References:  <4DCA5620.1030203@dannysplace.net> <4DCB455C.4020805@dannysplace.net> <alpine.GSO.2.01.1105112146500.20825@freddy.simplesystems.org> <20110512033626.GA52047@icarus.home.lan> <4DCB7F22.4060008@digsys.bg>

On Thu, May 12, 2011 at 09:33:06AM +0300, Daniel Kalchev wrote:
> On 12.05.11 06:36, Jeremy Chadwick wrote:
> >On Wed, May 11, 2011 at 09:51:58PM -0500, Bob Friesenhahn wrote:
> >>On Thu, 12 May 2011, Danny Carroll wrote:
> >>>Replying to myself in order to summarise the recommendations (when using
> >>>v28):
> >>>- Don't use SSD for the Log device.  Write speed tends to be a problem.
> >>DO use SSD for the log device.  The log device is only used for
> >>synchronous writes.  Except for certain usages (E.g. database and
> >>NFS server) most writes will be asynchronous and never be written to
> >>the log.  Huge synchronous writes will also bypass the SSD log
> >>device. The log device is for reducing latency on small synchronous
> >>writes.
> >Bob, please correct me if I'm wrong, but as I understand it a log device
> >(ZIL) effectively limits the overall write speed of the pool itself.
> >
> Perhaps I misstated it in my first post, but there is nothing wrong
> with using SSD for the SLOG.
> 
> You can of course create a usage/benchmark scenario where a (cheap)
> SSD-based SLOG will be worse than a (fast) HDD-based SLOG,
> especially if you are not concerned about latency. The SLOG addresses
> two issues: it increases the throughput of the pool (primary storage)
> by removing small synchronous writes that would otherwise introduce
> unnecessary head movement and extra IOPS, and it provides low latency
> for small synchronous writes.

I've been reading about this in detail here:

http://constantin.glez.de/blog/2010/07/solaris-zfs-synchronous-writes-and-zil-explained

I had no idea the primary point of a SLOG was to deal with applications
that make use of O_SYNC.  I thought it was supposed to improve write
performance for both asynchronous and synchronous writes.  Obviously I'm
wrong here.
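
To make sure I have the distinction straight, here's a minimal sketch
of the kind of write that, as I understand it, actually goes through
the ZIL/SLOG -- nothing ZFS-specific, just POSIX synchronous I/O (the
"testfile" path is made up):

    /* Minimal sketch of a synchronous write -- the kind of I/O a SLOG
     * absorbs.  Nothing here is ZFS-specific; "testfile" is a made-up
     * path used purely for illustration. */
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char buf[] = "committed transaction record\n";
            int fd;

            /* O_SYNC: write(2) does not return until the data is on
             * stable storage.  On ZFS, small writes like this go
             * through the ZIL, i.e. the separate log device if one
             * exists. */
            fd = open("testfile", O_WRONLY | O_CREAT | O_SYNC, 0644);
            if (fd == -1)
                    err(1, "open");
            if (write(fd, buf, sizeof(buf) - 1) == -1)
                    err(1, "write");

            /* Without O_SYNC the write would be asynchronous and only
             * become durable at the next transaction group commit, or
             * when the application calls fsync(2) explicitly. */
            if (fsync(fd) == -1)
                    err(1, "fsync");

            close(fd);
            return (0);
    }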

The author's description (at that URL) of an example scenario makes
little sense to me; he tells a story about a bank and a US$699
financial transaction which got cached in RAM before the system lost
power -- and how the filesystem's intent log would be replayed during
reboot.

What guarantee is there that the intent log -- which is written to the
disk -- actually got written to the disk in the middle of a power
failure?  There's a lot of focus there on the idea that "the intent log
will fix everything, but may lose writes", but what guarantee do I have
that the intent log isn't corrupt or botched during a power failure?

I guess this is why others have mentioned the importance of BBUs and
supercaps, but I don't know what guarantee there is that during a power
failure there won't be some degree of filesystem corruption or lost
data.

There's a lot about ensuring/guaranteeing filesystem integrity that I
have yet to learn.

> The latter is only valid if the SSD is sufficiently write-optimized.
> Most consumer SSDs end up saturated by writes. Sequential write IOPS
> is what matters here.

Oh, I absolutely agree on this point.  So basically consumer-level SSDs
that don't provide a big write-speed benefit (compared to a classic
mechanical HDD) -- not discussing seek times here, we all know SSDs win
there -- probably aren't good candidates for SLOGs.

What's interesting about the focus on IOPS is that Intel SSDs, in the
consumer class, still trump their competitors.  But given that your
statement above focuses on sequential writes, and the site I linked is
quite clear about what happens to sequential writes on an Intel SSD
that isn't getting TRIM... yeah, you get where I'm going with this. :-)

> About TRIM. As was already mentioned, you will use only a small
> portion of a (for example) 32GB SSD for the SLOG. If you do not
> allocate the entire SSD, then wear leveling will be able to do its
> job and it is very likely you will not suffer any performance
> degradation.

That sounds ideal, though I'm not sure about the "won't suffer ANY
performance degradation" part.  I think degradation is just less likely
to be witnessed.

I should clarify what "allocate" in the above paragraph means (for
readers, not for you Daniel :-) ): it means disk space actually used
(LBAs actually written to).  Wear leveling works better when there's
more available (unused) flash.  The fuller the disk (filesystem(s))
is, the worse the wear leveling algorithm performs.

> By the way, I do not believe a Windows benchmark has any significance
> for our ZFS usage of SSDs. How is TRIM implemented in Windows?
> How does it relate to SSD usage as SLOG and L2ARC?

Yeah, I knew someone would go down this road.  Sigh.  I strongly
believe it does have relevance.  The relevance is in the fact that the
non-TRIM benchmarks (read: a setup where either the OS or the SSD lacks
TRIM support, so TRIM cannot be used) are strong indicators that the
performance of the SSD -- sequential reads and writes both -- greatly
degrades over time without TRIM.  This is also why you'll find people
(who cannot use TRIM) regularly advocating a full format (writing zeros
to all LBAs on the disk) after prolonged use without TRIM.

I don't know how TRIM is implemented with NTFS in Windows.

> How can TRIM support ever influence reading from the drive?!

I guess you want more proof, so here you go.  Again, the authors wrote a
bunch of data to the filesystem, took a sequential read benchmark, then
induced TRIM and took another sequential read benchmark.  The difference
is obvious.  This is an X25-V, however, which is the "low-end" of the
consumer series, so the numbers are much worse -- but this is a drive
that runs for around US$100, making it appealing to people:

http://www.anandtech.com/show/3756/2010-value-ssd-100-roundup-kingston-and-ocz-take-on-intel/5

I imagine the reason this happens is similar to why memory performance
degrades under fragmentation or when there's a lot of "middle-man stuff"
going on.

"Middle-man stuff" in this case means the FTL inside of the SSD which is
used to correlate LBAs with physical NAND flash pages (and the
physically separate chips; it's not just one big flash chip you know).
NAND flash pages tend to be something like 256KByte or 512KByte in size,
so erasing one means no part of it should be in use by the OS or
underlying filesystem.

How does the SSD know what's used by the OS?  It has to literally keep
track of all the LBAs written to.  I imagine that list is extremely
large and takes time to iterate over.

TRIM allows the OS to tell the underlying SSD "LBAs x-y aren't in use
any more", which presumably removes entries from the FTL's LBA<->flash
map, and also lets the drive do things like move still-valid data
between flash pages so that a whole NAND erase block can be erased.  It
can do the latter because the FTL acts as the "middle-man" noted above.
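
If it helps, here's a toy model of that mental picture -- purely an
illustration of the mapping idea, not how any real SSD firmware is
implemented:

    /* Toy model of the FTL "middle-man": a map from LBAs to flash
     * pages.  Purely illustrative -- real FTLs are far more
     * complicated than this. */
    #include <stdio.h>

    #define NLBA     16             /* pretend-disk with 16 LBAs        */
    #define UNMAPPED (-1)

    static int ftl_map[NLBA];       /* LBA -> physical flash page index */

    static void
    ftl_write(int lba, int page)
    {
            /* Host writes an LBA; the FTL points it at some flash page. */
            ftl_map[lba] = page;
    }

    static void
    ftl_trim(int first, int last)
    {
            /* TRIM: the OS says "LBAs first..last are no longer in
             * use", so the FTL drops the mappings and can later erase
             * the underlying NAND without copying its contents first. */
            for (int lba = first; lba <= last; lba++)
                    ftl_map[lba] = UNMAPPED;
    }

    int
    main(void)
    {
            for (int lba = 0; lba < NLBA; lba++)
                    ftl_map[lba] = UNMAPPED;

            ftl_write(3, 42);       /* filesystem writes LBA 3      */
            ftl_write(4, 43);       /* ...and LBA 4                 */
            ftl_trim(3, 4);         /* file deleted; OS issues TRIM */

            for (int lba = 0; lba < NLBA; lba++)
                    if (ftl_map[lba] != UNMAPPED)
                            printf("LBA %d -> page %d\n", lba, ftl_map[lba]);
            printf("no live mappings left; pages 42 and 43 can be erased\n");
            return (0);
    }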

> TRIM is a slow operation. How often are these commands issued?

Good questions, for which I have no answers.  The same could be asked
of any OS, however, not just Windows.  And I've asked the same question
about SSDs' internal "garbage collection" too.  I have no answers, so
you and I are wondering the same thing.  And yes, I am aware TRIM is a
costly operation.

There's a description I found of the process that makes a lot of sense,
so rather than re-word it I'll just include it here:

http://www.enterprisestorageforum.com/technology/article.php/11182_3910451_2/Fixing-SSD-Performance-Degradation-Part-1.htm

See the paragraph starting with "Another long-awaited technique".

> What is the impact of issuing TRIM to an otherwise loaded SSD?

I'm not sure if "loaded" means "heavy I/O load" or "heavily used"
(space-wise).  If you meant "heavy I/O load": as I understand it --
from following forums, user experiences, etc. -- a heavily-used drive
which hasn't had TRIM issued tends to perform worse as time goes on.
Most people with OSes that don't have TRIM (OS X, Windows XP, etc.)
tend to resort to a full format of the SSD (every LBA written with
zeros, e.g. via the -E flag to newfs) every so often.

How often TRIM should be issued is almost certainly up for discussion,
but I can't provide any advice because no OS I run or use seems to
implement it (aside from FreeBSD UFS, which appears to issue TRIM via
BIO_DELETE requests passed down through GEOM).

(Inline EDIT: Holy crap, I just realised TRIM support has to be enabled
via tunefs on UFS filesystems.  I started digging through the code and I
found the FS_TRIM bit; gee, maybe I should use tunefs -t.  I wish I had
known this; I thought it just did this automatically if the underlying
storage device provided TRIM support.  Sigh)
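
Side note: if you want to poke at that BIO_DELETE path by hand from
userland, something like the following should work -- assuming your
FreeBSD version exposes the DIOCGDELETE ioctl from sys/disk.h, and
noting that /dev/ada1 is a made-up device name (do NOT point this at a
disk you care about):

    /* Hand-issue a delete/TRIM request to a raw device.  Assumes the
     * kernel provides the DIOCGDELETE ioctl (sys/disk.h); /dev/ada1 is
     * a placeholder -- this destroys data in the given range. */
    #include <sys/types.h>
    #include <sys/disk.h>
    #include <sys/ioctl.h>

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            off_t args[2];
            int fd;

            fd = open("/dev/ada1", O_RDWR);
            if (fd == -1)
                    err(1, "open");

            args[0] = 0;                  /* byte offset on the device */
            args[1] = 1024 * 1024;        /* length: 1MB               */

            /* The kernel turns this into a BIO_DELETE, GEOM passes it
             * down, and the disk driver translates it into TRIM if the
             * device advertises support for it. */
            if (ioctl(fd, DIOCGDELETE, args) == -1)
                    err(1, "DIOCGDELETE");

            close(fd);
            return (0);
    }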

Here's some data which probably won't mean much to you since it's from
a Windows machine, but the important part is that the machine runs
Windows XP SP3 -- and XP has no TRIM support.

 Disk: Intel 320-series SSD; model SSDSA2CW080G3; 80GB, MLC
   SB: Intel ICH9, in "Enhanced" mode (non-AHCI, non-RAID)
   OS: Windows XP SP3
   FS: NTFS, 4KB cluster size, NTFS atime turned off, NTFS partition
       properly 4KB-aligned
Space: Approximately 6GB of 80GB used.

This disk is very new (only 436 power-on hours).

Here are details of the disk:

http://jdc.parodius.com/freebsd/i320ssd/ssdsa2cw080g3_01.png

And a screen shot of a sequential read benchmark which should speak for
itself.  Block read size is 64KBytes.  This is a raw device read and not
a filesystem-level read, meaning NTFS isn't in the picture here.  What's
interesting is the degradation in performance around the 16GB region:

http://jdc.parodius.com/freebsd/i320ssd/ssdsa2cw080g3_02.png

Next, a screen shot of a filesystem-based benchmark.  This is writing
and reading a 256MByte file (on the NTFS filesystem) using different
block sizes.  The horizontal axis is block size, the vertical axis is
speed.  Reads are the blue bars, writes are the orange:

http://jdc.parodius.com/freebsd/i320ssd/ssdsa2cw080g3_03.png

And finally, the same device-level sequential read benchmark performed
again to show what effect the write benchmarks may have had on the disk:

http://jdc.parodius.com/freebsd/i320ssd/ssdsa2cw080g3_04.png

Sadly I can't test sequential writes because it's an OS disk.

So, my findings more or less mimic what other people are seeing.
Given that the read benchmarks are device-level and not
filesystem-level, one shouldn't be pondering Windows -- one should be
pondering the implications of the lack of TRIM and what's going on
within the drive itself.

I also have an Intel 320-series SSD in my home FreeBSD box as an OS disk
(UFS2 / UFS2+SU filesystems).  The amount of space used there is lower
(~4GB).  Do you know of some benchmarking utilities which do
device-level reads and can plot or provide metrics for LBA offsets or
equivalent?  I could compare that to the Windows benchmarks, but still,
I think we're barking up the wrong tree.  I'm really not comparing ZFS
to NTFS here; I'm saying that TRIM addresses performance problems (to
some degree) regardless of filesystem type.
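
In the meantime, a crude version of what I'm after seems simple enough
to sketch: raw sequential reads in 64KB chunks, reporting throughput
per 1GB region so a dip at a particular LBA offset (like the ~16GB one
above) shows up.  /dev/ada0 is a placeholder device name:

    /* Crude device-level sequential read benchmark.  Reads the raw
     * device in 64KB chunks and prints throughput per 1GB region, so
     * degradation at a particular offset is visible.  /dev/ada0 is a
     * placeholder device name. */
    #include <sys/types.h>
    #include <sys/time.h>

    #include <err.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLKSZ  (64 * 1024)
    #define REGION (1024LL * 1024 * 1024)   /* report every 1GB */

    int
    main(void)
    {
            static char buf[BLKSZ];
            struct timeval t0, t1;
            off_t done = 0;
            ssize_t n;
            double secs;
            int fd;

            /* Character disk devices on FreeBSD are unbuffered, so
             * this measures the device rather than the buffer cache. */
            fd = open("/dev/ada0", O_RDONLY);
            if (fd == -1)
                    err(1, "open");

            gettimeofday(&t0, NULL);
            while ((n = read(fd, buf, sizeof(buf))) > 0) {
                    done += n;
                    if (done % REGION == 0) {
                            gettimeofday(&t1, NULL);
                            secs = (t1.tv_sec - t0.tv_sec) +
                                (t1.tv_usec - t0.tv_usec) / 1e6;
                            printf("%4jd GB: %.1f MB/s\n",
                                (intmax_t)(done / REGION),
                                REGION / secs / (1024 * 1024));
                            t0 = t1;
                    }
            }
            if (n == -1)
                    err(1, "read");
            close(fd);
            return (0);
    }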

Anyway, I think that's enough from me for now.  I've written this over
the course of almost 2 hours.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |



