Date:      Thu, 5 May 2016 10:14:29 +0200
From:      Borja Marcos <borjam@sarenet.es>
To:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   ZFS and SSD, trim caused stalling
Message-ID:  <132CDFA3-0390-4208-B6D5-3F2AE49E4B47@sarenet.es>


Hello,

Doing some tests with Intel P3500 NVMe drives I have found a serious
performance problem caused by the TRIM operation.
Maybe it’s better not to use TRIM on these SSDs, I am not sure, but in any
case this reveals a serious performance problem which can happen with other
SSDs. I have seen comparable, although less serious, behavior with at least
one other SSD: trying with a 128 GB OCZ Vertex4, there was some stalling,
even though this particular SSD trims at around 2 GB/s while it can sustain
a write throughput of 200 MB/s until it reaches 50% capacity, falling to
around 100 MB/s afterwards.

I know this is close to a worst-case benchmark, but operations like the
deletion of a large snapshot or dataset could trigger similar problems.

In order to do a rough check of the I/O performance of this system, I
created a raidz2 pool with 10 NVMe drives. After creating it, I used
Bonnie++. As a single Bonnie instance is unable to generate enough I/O
activity, I actually ran four in parallel.

Doing a couple of tests, I noticed that the second time I launched four
Bonnies the writing activity was completely stalled. Repeating a single
test I noticed this (file OneBonnie.png):

The Bonnies were writing for 30 minutes, the read/write test took around
50 minutes, and the reading test took roughly 10 minutes. But after the
Bonnie processes finished, the deletion of the files took around 30 minutes
of heavy TRIM activity.

Running two tests, one after another, showed something far more serious.
The second group of four Bonnies was stalled for around 15 minutes while
there was heavy TRIM I/O activity. And according to the service times
reported by devstat, the stall didn’t happen in the disk I/O subsystem.
Looking at the activity between 8:30 and 8:45, it can be seen that the
service time reported for the write operations is 0, which means that the
write operations aren’t actually reaching the disk (files
TwoBonniesTput.png and TwoBonniesTimes.png).

ZFS itself is starving the whole vdev. Even trivial operations such as an
“ls” were a problem; overall system performance was awful.

Apart from disabling TRIM, I see two possible solutions to this problem:

1) Somewhat deferring the TRIM operations. Of course this implies that the
block-freeing work must be throttled, which can cause its own issues.

2) Skipping the TRIMs sometimes. Depending on the particular workload and
SSD model, TRIM can be almost mandatory or just a “nice to have” feature.
In a case like this, deleting large files (four 512 GB files) has caused a
very serious impact; here TRIM has done more harm than good.

The selective TRIM skipping could be based just on the number of TRIM
requests pending on the vdev queues (past some threshold, new TRIM requests
would simply be discarded), or maybe the ZFS block-freeing routines could
make a similar decision. I’m not sure where it’s better to implement this;
a rough sketch of the vdev queue approach follows below.
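
Just to make the idea concrete, here is a minimal user-space style sketch
of the threshold check. None of these names (trim_queue, trim_limit,
trim_should_skip and so on) come from the actual ZFS code; they are only
placeholders for wherever TRIM zios get queued to a vdev:

#include <stdbool.h>
#include <stdint.h>

struct trim_queue {
	uint64_t pending;		/* TRIM requests currently queued on the vdev */
};

static uint64_t trim_limit = 64;	/* threshold; imagine a sysctl tunable */
static uint64_t trims_skipped;		/* discarded TRIM requests */
static uint64_t trim_bytes_skipped;	/* bytes covered by discarded TRIMs */

/*
 * Decide whether a TRIM covering 'bytes' should simply be dropped
 * instead of being queued behind everything else.
 */
static bool
trim_should_skip(struct trim_queue *tq, uint64_t bytes)
{
	if (tq->pending > trim_limit) {
		trims_skipped++;
		trim_bytes_skipped += bytes;
		return (true);		/* drop it; the block is freed anyway */
	}
	tq->pending++;			/* account for the TRIM we will queue */
	return (false);
}

The real code would of course have to decrement the pending count on
completion and do the accounting atomically; the point is only that the
decision itself is nothing more than a comparison against a tunable.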

A couple of sysctl variables could keep a counter of discarded TRIM
operations and of the total “not trimmed” bytes, making it possible to know
the impact of this measure. The mechanism could be based on a static
threshold configured via a sysctl variable or, even better, ZFS could make
the decision based on the queue depth: in case write or read requests
suffered an unacceptable service time, the system would discard the pending
TRIM requests. A sketch of how the knobs and counters could be exposed is
included below.
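
Something along these lines, using the stock FreeBSD SYSCTL macros; the
variable names under vfs.zfs are made up here purely for illustration:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

static uint64_t zfs_trim_queue_limit = 64;	/* 0 could mean "never skip" */
static uint64_t zfs_trims_skipped = 0;
static uint64_t zfs_trim_bytes_skipped = 0;

SYSCTL_DECL(_vfs_zfs);
SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, trim_queue_limit, CTLFLAG_RWTUN,
    &zfs_trim_queue_limit, 0,
    "Pending TRIMs per vdev queue beyond which new TRIMs are discarded");
SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, trims_skipped, CTLFLAG_RD,
    &zfs_trims_skipped, 0,
    "TRIM requests discarded because the queue was too deep");
SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, trim_bytes_skipped, CTLFLAG_RD,
    &zfs_trim_bytes_skipped, 0,
    "Bytes covered by discarded TRIM requests");

With read-only counters like these it would be easy to see, on a given
workload, how often the threshold is being hit and how much of the device
is going untrimmed as a result.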

What do you think? In some cases it’s clear that TRIM can do more harm than
good. I think that this measure can buy the best of both worlds: TRIMming
when possible, during “normal” I/O activity, and avoiding the troubles
caused by it during exceptional activity (deletion of very large files,
large numbers of files, or large snapshots/datasets).









Borja.






