Date:      Sun, 28 Sep 2014 05:30:08 -0500
From:      Scott Bennett <bennett@sdf.org>
To:        Paul Kraus <paul@kraus-haus.org>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: ZFS and 2 TB disk drive technology :-(
Message-ID:  <201409281030.s8SAU8dR027634@sdf.org>

     On Wed, 24 Sep 2014 11:24:35 -0400 Paul Kraus <paul@kraus-haus.org>
wrote:
     Thanks for chiming in, Paul.

>On 9/24/14 7:08, Scott Bennett wrote:
>
><snip>
>
>What version of FreeBSD are you running ?

FreeBSD hellas 9.2-STABLE FreeBSD 9.2-STABLE #1 r264339: Fri Apr 11 05:16:25 CDT 2014     bennett@hellas:/usr/obj/usr/src/sys/hellas  i386

>What hardware are you running it on ?

     The CPU is a Q6600 running on a PCIE2 Gigabyte motherboard, whose model
number I did have written down around here somewhere but can't lay hands on
at the moment.  I looked it up at the gigabyte.com web site, and it is
supposed to be okay for a number of considerably faster CPU models.  It
has 4 GB of
memory, but FreeBSD ignores the last ~1.1 GB of it, so ~2.9 GB usable.  It
has been running mprime worker threads on all four cores with no apparent
problems for almost 11 months now.
     The USB 3.0 card is a "rocketfish USB 3.0 PCI Express Card", which
reports a NEC uPD720200 USB 3.0 controller.  It has two ports, into which
I have two USB 3.0 hubs plugged.  There are currently four 2 TB drives
plugged into those hubs.
     The Firewire 400 card reports itself as a "VIA Fire II (VT6306)", and
it has two ports on it with one 2 TB drive connected to each port.
>
>>       Then I copied the 1.08 TB file again from another Seagate 2 TB drive
>> to the mirror vdev.  No errors were detected during the copy.  Then I
>> began creating a tar file from large parts of a nearly full 1.2 TB file
>> system (UFS2) on yet another Seagate 2TB on the Firewire 400 bus with the
>> tar output going to a file in the mirror in order to try to have written
>> something to most of the sectors on the four-drive mirror.  I terminated
>> tar after the empty space in the mirror got down to about 3% because the
>> process had slowed to a crawl.  (Apparently, space allocation in ZFS
>> slows down far more than UFS2 when available space gets down to the last
>> few percent.:-( )
>
>ZFS's space allocation algorithm will have trouble (performance issues) 
>allocating new blocks long before you get a few percent free. This is 
>known behavior and the threshold for performance degradation varies with 
>work load and historical write patterns. My rule of thumb is that you 
>really do not want to go past 75-80% full, but I have seen reports over 
>on the ZFS list of issues with very specific write patterns and work 
>load with as little as 50% used. For your work load, writing very large 
>files once, I would expect that you can get close to 90% used before 
>seeing real performance issues.

     Thanks for that information.  Yeah, I think I saw it starting to slow
when it got into the low 90s% full, but I wasn't watching all of the time,
so I don't know when it first became noticeable.  Anyway, I'll keep those
examples in mind for planning purposes.
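     (If I understand the zpool(8) output correctly, the CAP column of
"zpool list" is the easy thing to keep an eye on for that, e.g.,

          zpool list testmirror

which would have shown me how close I was getting to those thresholds
while the tar job was still running.)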
>
>>       Next, I ran a scrub on the mirror and, after the scrub finished, got
>> the following output from a "zpool status -v".
>>
>>    pool: testmirror
>>   state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>> 	corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>> 	entire pool from backup.
>>     see: http://illumos.org/msg/ZFS-8000-8A
>>    scan: scrub repaired 1.38M in 17h59m with 1 errors on Mon Sep 15 19:53:45 2014
>
>The above means that ZFS was able to repair 1.38MB of bad data but still 
>ran into 1 situation (unknown size) that it could not fix.

     But I'm not sure that the repairs actually took.  Read on.
>
>> config:
>>
>> 	NAME        STATE     READ WRITE CKSUM
>> 	testmirror  ONLINE       0     0     1
>> 	  mirror-0  ONLINE       0     0     2
>> 	    da1p5   ONLINE       0     0     2
>> 	    da2p5   ONLINE       0     0     2
>> 	    da5p5   ONLINE       0     0     8
>> 	    da7p5   ONLINE       0     0     7
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>          /backups/testmirror/backups.s2A
>
>And here is the file that contains the bad data.
>
     Yes.  There are only two files, and that one is the larger one and
was written first.
>>
>>       Note that the choices of recommended action above do *not* include
>> replacing a bad drive and having ZFS rebuild its content on the
>> replacement.  Why is that so?
>
>Correct, because for some reason ZFS was not able to read enough of the 
>data without checksum errors to give you back your data intact.

     Okay.  Laying aside the question of why not even one drive out of the
four in a mirror vdev can provide the correct data, I see that that is why
a rebuild wouldn't work.
Couldn't it at least give a clue about drive(s) to be replaced/repaired?
I.e., the drive(s) and sector number(s)?  Otherwise, one would spend a lot
of time reloading data without knowing whether a failure at the same place(s)
would just happen again.
>
>>       Thinking, apparently naively, that the scrub had repaired some or
>> most of the errors
>
>It did, 1.38MB worth. But it also had errors it could not repair.

     It *says* it did, but did it really?  How does it know?  Did it read
the results of its correction back in from the drive(s) to see?
>
>> and wanting to know which drives had ended up with
>> permanent errors, I did a "zpool clear testmirror" and ran another scrub.
>> During this scrub, I got some kernel messages on the console:
>>
>> (da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
>> (da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
>> (da7:umass-sim5:5:0:0): Retrying command
>> (da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
>> (da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
>> (da7:umass-sim5:5:0:0): Retrying command
>
>How many device errors have you had since booting the system / creating 
>the zpool ?

     Those were the first two times that I've gotten kernel error messages
on any of those devices ever.  My (non-ZFS) comparison results presented here
some weeks ago showed that the errors found in the comparisons went
undetected by hardware or software while the file was being written to disk.
No errors were detected during readback either, except by the application
(cmp(1)).
>
>> I don't know how to decipher these error messages (i.e., what do the hex
>> digits after "CDB: " mean?)
>
>I do not know the specifics in this case, but whenever I have seen 
>device errors it has always been due to either bad communication with a 
>drive or a drive reporting an error. If there are ANY device errors you 
>must address them before you go any further.

     Yes, but I need to know what those messages actually say when I talk
to manufacturers.  For example, if they contain addresses of failed sectors,
then I need to know what those addresses are if the manufacturers want me
to attempt to reassign them.  Also, if I tell them there are kernel messages,
they want to look up Micro$lop messages, and when I point out that I don't
use Micro$lop and that I run FreeBSD, they usually try to tell me that "We
don't support that", so it really helps if I can translate the messages to
them.  IOW, I can't address the device errors if I don't know how to read
the messages, which I'm hoping someone reading this may be able to help
with.  (I wish the FreeBSD Handbook had a comprehensive list of kernel
messages and what they mean.)
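     (If I'm reading the SCSI command layout correctly, the hex digits
after "CDB: " are just the raw bytes of the WRITE(10) command that the
retry was for, which would break down roughly as follows (this is only my
guess, so corrections are welcome):

          2a          operation code (WRITE(10))
          00          flags
          3b 20 4d 36 starting LBA = 0x3b204d36 = sector 991972662
          00          group number
          00 05       transfer length = 5 blocks
          00          control byte

If that reading is right, each of those retries was a five-block write
starting at sector 991972662 of da7, which would at least give me an
address to quote to the manufacturer.)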
>
>As an anecdotal note, I have not had terribly good luck with USB 
>attached drives under FreeBSD, especially under 9.x. I suspect that the 
>USB stack just can't keep up and ends up dropping things (or hanging). I 
>have had better luck with the 10.x release but still do not trust it for 
>high traffic loads. I have had no issues with SAS or SATA interfaces 

     Okay.  I'll keep that in mind for the future, but for now I'm stuck
with 9.2 until I can get some stable disk space to work with to do the
upgrades to amd64 and then to later releases.  The way things have been
going, I may have to relegate at least four 2 TB drives to paperweight
supply and then wait until I can replace them with smaller capacity drives
that will actually work.  Also, I have four 2 TB drives in external cases
that have only USB 3.0 interfaces on them, so I have no other way to
connect them (except USB 2.0, of course); I'm stuck with (some) USB,
too.

>(using supported chipsets, I have had very good luck with any of the 
>Marvell JBOD SATA controllers), _except_ when I was using a SATA port 
>multiplier. Over on the ZFS list the consensus is that port multipliers 
>are problematic at best and they should be avoided.

     What kinds of problems did they mention?  Also, how are those Marvell
controllers connected to your system(s)?  I'm just wondering whether
I would be able to use any of those models of controllers.  I've not dealt
with SATA port multipliers.  Would an eSATA card with two ports on it be
classed as a port multiplier?
     At the moment, all of my ZFS devices are connected by either USB 3.0
or Firewire 400.  I now have an eSATA card with two ports on it that I
plan to install at some point, which will let me move the Firewire 400
drive to eSATA.  Should I expect any new problem for that drive after the
change?
>
>> When it had finished, another "zpool status
>>   -v" showed these results.
>>
>>    pool: testmirror
>>   state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>> 	corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>> 	entire pool from backup.
>>     see: http://illumos.org/msg/ZFS-8000-8A
>>    scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56 2014
>
>This time it fixed 1.25MB of data and still had an error (of unknown 
>size) that it could not fix.

     Unfortunately, ZFS did not report the addresses of any of the errors.
If there are hard errors, how can I find out where the bad sectors are
located on each disk?  It might be possible to reassign those sectors to
spares if ZFS can tell me their addresses.
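     (In the meantime, I suppose I can at least ask the drives themselves
what they think of their own media.  Assuming the sysutils/smartmontools
port is installed, and assuming the USB bridges pass SMART commands
through (not all of them do), something along these lines ought to show
any reallocated or pending sectors:

          smartctl -a -d sat /dev/da7 | \
              egrep 'Reallocated|Pending|Offline_Uncorrectable'

That still wouldn't tell me which sectors ZFS itself tripped over,
though.)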
>
>> config:
>>
>> 	NAME        STATE     READ WRITE CKSUM
>> 	testmirror  ONLINE       0     0     1
>> 	  mirror-0  ONLINE       0     0     2
>> 	    da1p5   ONLINE       0     0     2
>> 	    da2p5   ONLINE       0     0     2
>> 	    da5p5   ONLINE       0     0     6
>> 	    da7p5   ONLINE       0     0     8
>
>Once again you have errors on ALL your devices. This points to a 

     Why did it not fix them during the first scrub?

>systemic problem of some sort on your system. On the ZFS list people 
>have reported bad memory as sometimes being the cause of these errors. I 
>would look for a system component that is common to all the drives and 
>controllers. How healthy is your power supply? How close to its limits
>are you?

     The box is mostly empty and nowhere near the power supply's capacity.
The box and all other attached devices are plugged into a SmartUPS 1000,
which is plugged into a surge protector.  As for the health of the power
supply, I guess I don't know how to check that, but everything else that
depends upon the power supply seems to be working fine.
>
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>          /backups/testmirror/backups.s2A
>>
>>       So it is not clear to me that either scrub fixed *any* errors at
>> all.
>
>Why is it not clear? The message from zpool status is very clear:

     It is not clear because I got similar results reported after two
consecutive scrubs with no changes made by me in between them.  If it
really fixed all but one error during the first scrub, why are there
still more errors for the second one to correct?  Unless ZFS checks the
results of its corrections, how can it know whether it really succeeded
in fixing anything?
>
>scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56 
>2014
>
>There were errors that were repaired and an error that was not.

    Yes, like it said the first time.
>
>>  I next ran a comparison ("cmp -z -l") of the original against the
>> copy
>
>If you are comparing the file that ZFS reported was corrupt, then you 
>should not expect them to match.

     Well, you omitted the results of the comparison that I ran after
the second scrub had completed.  As I wrote before,
+      Another issue revealed above is that ZFS, in spite of having *four*
+ copies of the data and checksums of them, failed to detect any problem
+ while reading the data back for cmp(1), much less feed cmp(1) the correct
+ version of the data rather than a corrupted version.  Similarly, the hard
+ error (not otherwise logged by the kernel) apparently encountered by
+ vm_pager resulted in termination of cmp(1) rather than resulting in ZFS
+ reading the page from one of the other three drives.  I don't see how ZFS
+ is of much help here, so I guess I must have misunderstood the claims for
+ ZFS that I've read on this list and in the available materials on-line.

     Now, there were 10 errors widely separated in that file (before the
kernel-detected error that killed cmp(1)) that went undetected by
hardware, drivers, or ZFS when the file was written and went undetected
again by ZFS when it was read back in, in spite of all the data copies and
checksum copies.  On top of that, two consecutive scrubs, with no changes
made to the data between them, reported nearly identical numbers of errors
"fixed".  Given all that, I am skeptical of the claims for ZFS's
"self-healing" ability.  With access
to four copies of each of those 10 data blocks and four copies of each of the
checksums, why did ZFS detect nothing while reading the file?  And in the
case of the uncorrectable error, why could ZFS not find any copy of the data
block that matched any copy of the checksum or even two matching copies of
the data block, so that it could provide the correct data to the application
program anyway, while logging the sector(s) involved in the block containing
the uncorrectable error?  If it can't do any of those things, why have
redundancy?
>
>> [stuff deleted  --SB]
>
>It sounds like you are really pushing this system to do more than it 
>reasonably can. In a situation like this you should really not be doing 
>anything else at the same time given that you are already pushing what 
>the system can do.
>
     It seems to me that the only places that could fail to keep up would
be the motherboard's chip(set) or one of the controller cards.  The
motherboard controller knows the speed of the memory, so it will only
cycle the memory at that speed.  The CPU, of course, should be at a lower
priority for bus cycles, so it would just use whatever were left over.  There
is no overclocking involved, so that is not an issue here.  The machine goes
as fast as it goes and no faster.  If it takes longer for it to complete a
task, then that's how long it takes.  I don't see that "pushing this system
to do more than it reasonably can" is even possible for me to do.  It does
what it does, and it does it when it gets to it.  Would I like it to do
things faster?  Of course, I would, but what I want does not change physics.
I'm not getting any machine check or overrun messages, either.
     Further, because one of the drives is limited to 50 MB/s (Firewire 400)
transfer rates, ZFS really can't go any faster than that drive.  Most of the
time, a systat vmstat display during the scrubs showed the MB/s actually
transferred for all four drives as being about the same (~23 - ~35 MB/s).
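     (Those transfer figures came from a "systat -vmstat 2" display; I
gather "gstat -f 'da[1257]$'" would show much the same thing broken out
per drive, but the systat numbers were enough for the comparison above.)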
     The scrubs took from 5% to 25% of one core's time, and associated
kernel functions took from 0% to ~9% (combined) from other cores.  cmp(1)
took 25% - 35% of one core with associated kernel functions taking 5% - 15%
(combined) from other cores.  I used cpuset(1) to keep cmp(1) from bothering
the mprime thread I cared about the most.  (Note that mprime runs niced
to 18, so its threads should not slow any of the testing I was doing.)  It
really doesn't look to me like an overload situation, but I can try moving
the three USB 3.0 drives to USB 2.0 to slow things down even further.  That
leaves still unexplained ZFS's failure to make use of multiple copies for
error correction during the reading of a file or to fix in one scrub
everything that was fixable.
>>
>> Script started on Wed Sep 17 01:37:38 2014
>> [hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
>
>This is the file that ZFS told you was corrupt; all bets are off.
>
     There should be only one bad block because the scrubs fixed everything
else, right?  And that bad block is bad on all four drives, right?

><snip>
>
>> [point made again elsewhere deleted  --SB]
>
>ZFS told you that file was corrupt. You are choosing to try to read it. 
>ZFS used to not even let you try to access a corrupt file but that 
>behavior was changed to permit people to try to salvage what they could 
>instead of write it all off.

     See what I wrote above.  It was a *four-way mirror*, not an unreplicated
pool.  It strikes me as extremely unlikely that the *same* blocks would be
damaged on *all four* drives.  ZFS ought to be able to identify and provide
the correct version from one or more blocks and checksums.
>
>> [text included in a quotation above deleted here  --SB]
>
>I suggest that you are ignoring what ZFS is telling you, specifically 
>that your system is incapable of reliably writing to and reading from
>_any_ of the four drives you are trying to use and that there is a 
>corrupt file due to this, and here is the name of that corrupt file.
>
>Until you fix the underlying issues with your system, ZFS (or any FS for 
>that matter) will not be of much use to you.

     It seems to me nearly certain that the underlying issue is poor
manufacturing standards for 2 TB drives.
>
>>       I don't know where to turn next.  I will try to call Seagate/Samsung
>> later today again about the bad Samsung drive and the bad, refurbished
>> Seagate drive, but they already told me once that having a couple of kB
>> of errors in a ~1.08 TB file copy does not mean that the drive is bad.
>> I don't know whether they will consider a hard write error to mean the
>> drive is bad.  The kernel messages shown above are the first ones I've
>> gotten about any of the drives involved in the copy operation or the
>> tests described above.
>
>The fact that you have TWO different drives from TWO different vendors 
>exhibiting the same problem (and to the same degree) makes me think that 
>the problem is NOT with the drives but elsewhere with your system. I 
>have started tracking usage and failure statistics for my personal drives
>(currently 26 of them, but I have 4 more coming back from Seagate as 

     Whooweee!  That's a heap of drives!  IIRC, for a chi^2 distribution,
30 isn't bad for a sample size.  How many of those drives are of larger
capacity than 1 TB?

>warranty replacements). I know that I do not have a statistically 
>significant sample, but it is what I have to work with. Taking into 
>account the drives I have as well as the hundreds of drives I managed at
>a past client, I have never seen the kind of bad data failures you are 
>seeing UNLESS I had another underlying problem. Especially when the 
>problem appears on multiple drives. I suspect that the real odds of 
>having the same type of bad data failure on TWO drives in this case are
>so small that another cause needs to be identified.

     Recall that I had two 2 TB drives that failed this year at, IIRC,
11 and 13 months since purchase, which is why two of the drives I was
testing in the mirror were refurbished drives (supplied under warranty).
One of those drives was showing hard errors on many sectors for a while
before it failed completely.  Having two drives that are bad doesn't
seem so unlikely, although having two drives (much less four!) with an
identical scattering of bad sectors on each is rather a stretch.
     You are referring to the two drives that showed two checksum errors
each in both post-scrub status reports?  Yes, those were from two
manufacturers, but the two drives with the greatest numbers of checksum
errors, along with one of the drives showing only 2 checksum errors, were
all from one manufacturer, who claims that such occurrences are "normal"
for those drives because they have no parity checking or parity recording.
That manufacturer does not suggest that there is anything wrong with the
system to which they are attached.  (By that, I mean that the guy at that
manufacturer who spoke with me on the phone made those claims.)
>
>>       If anyone reading this has any suggestions for a course of action
>> here, I'd be most interested in reading them.  Thanks in advance for any
>> ideas and also for any corrections if I've misunderstood what a ZFS
>> mirror was supposed to have done to preserve the data and maintain
>> correct operation at the application level.
>
>The system you are trying to use ZFS on may just not be able to handle 
>the throughput (both memory and disk I/O) generated by ZFS without 
>breaking. This may NOT just be a question of amount of RAM, but of the 
>reliability of the motherboard/CPU/RAM/device interfaces when stressed. 

     I did do a fair amount of testing with mprime last year and found no
problems.  I monitor CPU temperatures frequently, especially when I'm
running a test like the ones I've been doing, and the temperatures have
remained reasonable throughout.  (My air-conditioning bill has not been
similarly reasonable, I'm sorry to say.)
     That having been said, though, between your remarks and Andrew Berg's,
there does seem to be cause to run another scrub, perhaps two, with those
three
drives connected via USB 2.0 instead of USB 3.0 to see what happens when
everything is slowed down drastically.  I'll give that a try when I find
time.  That won't address the ZFS-related questions or the differences
in error rates on different drives, but might reveal an underlying system
hardware issue.
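     (The retest itself would just repeat what I did before, i.e. roughly

          zpool clear testmirror
          zpool scrub testmirror
              ... wait for the scrub to finish ...
          zpool status -v testmirror

with the three drives moved to USB 2.0 ports first.)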
     Maybe a PCIE2 board is too slow for USB 3.0, although the motherboard
controller, BIOS, USB 3.0 controller, and kernel all declined to complain.
If it is, then the eSATA card I bought (SATA II) would likely be useless
as well. :-<

>In the early days of ZFS it was noticed that ZFS stressed the CPU and 
>memory systems of a server harder than virtually any other task.
>
     When would that have been, please?  (I don't know much ZFS history.)
I believe this machine dates to 2006 or more likely 2007, although the
USB 3.0 card was new last year.  The VIA Firewire card was installed at
the same time as the USB 3.0 card, but it was not new at that time.


                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************


