From: Paul Kraus <paul@kraus-haus.org>
To: Scott Bennett, freebsd-questions@freebsd.org
Date: Mon, 29 Sep 2014 22:29:51 -0400
Subject: Re: ZFS and 2 TB disk drive technology :-(

On 9/28/14 6:30, Scott Bennett wrote:
> On Wed, 24 Sep 2014 11:24:35 -0400 Paul Kraus <paul@kraus-haus.org>
> wrote:
>
> 	Thanks for chiming in, Paul.
>
>> On 9/24/14 7:08, Scott Bennett wrote:
>>
>> What version of FreeBSD are you running ?
>
> FreeBSD hellas 9.2-STABLE FreeBSD 9.2-STABLE #1 r264339: Fri Apr 11
> 05:16:25 CDT 2014 bennett@hellas:/usr/obj/usr/src/sys/hellas  i386

I asked this specifically because I have seen lots of issues with hard
drives connected via USB on 9.x. In some cases the system hangs
(silently, with no evidence after a hard reboot); in other cases there
was just flaky I/O to the drives that caused performance issues. And
_none_ of the attached drives were over 1 TB. There were three
different 1 TB drives (one Iomega and two Seagate) and one 500 GB
Seagate drive in a Gigaware enclosure. These were on three different
systems (two SuperMicro dual Quad-Xeon systems and one HP MicroProliant
N36L).
These were all USB 2; I would expect more problems, and weirder
problems, with USB 3 as it goes (or tries to go) much faster.

Skipping lots ...

> 	Okay, laying aside the question of why no drive out of four in a
> mirror vdev can provide the correct data, so that's why a rebuild
> wouldn't work.  Couldn't it at least give a clue about drive(s) to be
> replaced/repaired?  I.e., the drive(s) and sector number(s)?
> Otherwise, one would spend a lot of time reloading data without
> knowing whether a failure at the same place(s) would just happen
> again.

You can probably dig that out of the zpool using zdb, but I am no zdb
expert and refer you to the experts on the ZFS list (find out how to
subscribe at the bottom of this page:
http://wiki.illumos.org/display/illumos/illumos+Mailing+Lists ). A
simpler first step with `zpool status -v` is sketched a bit further
down.

>> As an anecdotal note, I have not had terribly good luck with USB
>> attached drives under FreeBSD, especially under 9.x. I suspect that
>> the USB stack just can't keep up and ends up dropping things (or
>> hanging). I have had better luck with the 10.x release but still do
>> not trust it for high traffic loads. I have had no issues with SAS
>> or SATA interfaces
>
> 	Okay.  I'll keep that in mind for the future, but for now I'm
> stuck with 9.2 until I can get some stable disk space to work with to
> do the upgrades to amd64 and then to later releases.  The way things
> have been going, I may have to relegate at least four 2 TB drives to
> paperweight supply and then wait until I can replace them with
> smaller capacity drives that will actually work.  Also, I have four
> 2 TB drives in external cases that have only USB 3.0 interfaces on
> them, so I have no other way to connect them (except USB 2.0, of
> course), so I'm stuck with (some) USB, too.

While I have had nothing but trouble every single time I tried using a
USB attached drive for more than a few MB at a time under 9.x, I have
had no problems with Marvell based JBOD SATA cards under 9.x or 10.0.

>> (using supported chipsets, I have had very good luck with any of the
>> Marvell JBOD SATA controllers), _except_ when I was using a SATA
>> port multiplier. Over on the ZFS list the consensus is that port
>> multipliers are problematic at best and they should be avoided.
>
> 	What kinds of problems did they mention?  Also, how are those
> Marvell controllers connected to your system(s)?  I'm just wondering
> whether I would be able to use any of those models of controllers.
> I've not dealt with SATA port multipliers.  Would an eSATA card with
> two ports on it be classed as a port multiplier?

I do not recall specifics, but I do recall a variety of issues, mostly
around SATA bus resets. The problem _I_ had was that if one of the four
drives in the enclosure (behind the port multiplier) failed, it knocked
all four off-line.

The cards were PCIe 1x and PCIe 2x. All of the Marvell cards I have
seen have been one logical port per physical port. The chipsets in the
add-on cards seem to come in sets of 4 ports (although the on-board
chipsets seem to be sets of 6). I currently have one 4 port card (2
internal, 2 external) and one 8 port card (4 internal, 4 external) with
no problems. They are in an HP MicroProliant N54L with 16 GB RAM.

Here is the series of cards that I have been using:
http://www.sybausa.com/productList.php?cid=142&currentPage=0

Specifically the SI-PEX40072 and SI-PEX40065; stay away from the RAID
versions and just go for the JBOD. The Marvell JBOD chips were
recommended over on the ZFS list.
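Getting back to identifying the damaged data: before reaching for zdb,
it is worth looking at plain `zpool status -v`. I do not think there is
anything version specific about it, but double check on 9.2; after a
scrub has found unrecoverable errors it appends a list of the affected
files to the normal status output. Roughly (the pool name here is just
a placeholder):

	zpool scrub yourpool
	zpool status -v yourpool

	# when there is permanent damage, the tail of the output reads
	# something like:
	#   errors: Permanent errors have been detected in the
	#   following files:
	#           <path to damaged file>

It will not give you sector numbers or tell you which side of the
mirror returned the bad data, but it does narrow things down to
specific files without any zdb archaeology.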
> 	At the moment, all of my ZFS devices are connected by either
> USB 3.0 or Firewire 400.

Are the USB drives directly attached or via hubs ? The hubs may be
introducing more errors (I have not had good luck finding USB hubs that
are reliable at transferring data, at least on my Mac; I have never
tried using external hubs on my servers).

> 	I now have an eSATA card with two ports on it that I plan to
> install at some point, which will let me move the Firewire 400 drive
> to eSATA.  Should I expect any new problem for that drive after the
> change?

I would expect a decrease in error counts for the eSATA attached drive.

I have been running LOTS of scrubs against my ~1.6 TB of data recently,
both on a 2-way 2-column mirror of 2 TB drives and on the current
config of a 2-way 3-column mirror of 1 TB drives. I have seen no errors
of any kind.

[ppk@FreeBSD2 ~]$ zpool status
  pool: KrausHaus
 state: ONLINE
  scan: scrub repaired 0 in 1h49m with 0 errors on Fri Sep 26 13:51:21 2014
config:

	NAME                            STATE     READ WRITE CKSUM
	KrausHaus                       ONLINE       0     0     0
	  mirror-0                      ONLINE       0     0     0
	    diskid/DISK-Seagate ES.3    ONLINE       0     0     0
	    diskid/DISK-Seagate ES.2    ONLINE       0     0     0
	  mirror-1                      ONLINE       0     0     0
	    diskid/DISK-WD-SE           ONLINE       0     0     0
	    diskid/DISK-HGST UltraStar  ONLINE       0     0     0
	  mirror-2                      ONLINE       0     0     0
	    diskid/DISK-WD-SE           ONLINE       0     0     0
	    diskid/DISK-Seagate ES.2    ONLINE       0     0     0
	spares
	  diskid/DISK-Seagate ES.3     AVAIL

errors: No known data errors

Note: Disk serial numbers replaced with type of drive. All are 1 TB.

>>
>> It sounds like you are really pushing this system to do more than it
>> reasonably can. In a situation like this you should really not be
>> doing anything else at the same time given that you are already
>> pushing what the system can do.
>
> 	It seems to me that the only places that could fail to keep up
> would be the motherboard's chip(set) or one of the controller cards.
> The motherboard controller knows the speed of the memory, so it will
> only cycle the memory at that speed.  The CPU, of course, should be
> at a lower priority for bus cycles, so it would just use whatever
> were left over.  There is no overclocking involved, so that is not an
> issue here.  The machine goes as fast as it goes and no faster.  If
> it takes longer for it to complete a task, then that's how long it
> takes.  I don't see that "pushing this system to do more than it
> reasonably can" is even possible for me to do.  It does what it does,
> and it does it when it gets to it.  Would I like it to do things
> faster?  Of course, I would, but what I want does not change physics.
> I'm not getting any machine check or overrun messages, either.

So you deny that race conditions can exist in a system as complex as a
modern computer running a modern OS ? The OS is an integral part of all
this, including all the myriad device drivers. And with multiple CPUs
the problem may be even worse.

> 	Further, because one of the drives is limited to 50 MB/s
> (Firewire 400) transfer rates, ZFS really can't go any faster than
> that drive.  Most of the time, a systat vmstat display during the
> scrubs showed the MB/s actually transferred for all four drives as
> being about the same (~23 - ~35 MB/s).

What does `iostat -x -w 1` show ? How many drives are at 100 %b ? How
many drives have a qlen of 10 ? For how many samples in a row ? That is
the limit of what ZFS will dispatch: once there are 10 outstanding I/O
requests for a given device, ZFS does not dispatch more I/O requests
until the qlen drops below 10. This is tunable (look through
sysctl -a | grep vfs.zfs).
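If I am remembering the name correctly (this is from memory, so please
confirm it against the sysctl list on your own box), the knob that
controls those 10 outstanding I/Os per device on 9.x is
vfs.zfs.vdev.max_pending. Something along these lines:

	# confirm the tunable exists and see the current value
	# (should default to 10)
	sysctl -a | grep vfs.zfs.vdev

	# try a smaller per-device queue depth
	sysctl vfs.zfs.vdev.max_pending=4

	# if it turns out to be settable only at boot on your release,
	# put it in /boot/loader.conf instead and reboot:
	#   vfs.zfs.vdev.max_pending=4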
On my system with the port multiplier I had to tune this down to 4
(found empirically) or I would see underlying SATA device errors and
retries.

I find it useful to look at 1 second as well as 10 second samples (to
see both the peak load on the drives and something closer to the
average). Here is my system with a scrub running on the above zpool and
a 10 second sample time (iostat -x -w 10):

                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen  svc_t   %b
ada0     783.8   3.0 97834.2    15.8   10   10.4   84
ada1     792.8   3.0 98649.5    15.8    4    3.7   49
ada2     789.9   3.0 98457.0    15.8    4    3.6   47
ada3       0.1  13.1     0.0    59.0    0    6.1    0
ada4       0.8  13.1     0.4    59.0    0    5.8    0
ada5     794.0   3.0 98703.7    15.8    0    4.1   62
ada6     785.9   3.0 98158.3    15.8   10   11.2   98
ada7       0.0   0.0     0.0     0.0    0    0.0    0
ada8     791.4   3.0 98458.2    15.8    0    3.0   53

In the above, ada0 and ada6 have hit their outstanding I/O limit (in
ZFS); both are slower than the others, with both longer service times
(svc_t) and higher % busy (%b). These are the oldest drives in the
zpool, Seagate ES.2 series, and are 5 years old (and just out of
warranty), so it is not surprising that they are the slowest. They are
the limiting factor on how fast the scrub can progress.

> 	The scrubs took from 5% to 25% of one core's time,

Because they are limited by the I/O stack between the kernel and the
device.

> and associated
> kernel functions took from 0% to ~9% (combined) from other cores.
> cmp(1) took 25% - 35% of one core with associated kernel functions
> taking 5% - 15% (combined) from other cores.  I used cpuset(1) to
> keep cmp(1) from bothering the mprime thread I cared about the most.
> (Note that mprime runs niced to 18, so its threads should not slow
> any of the testing I was doing.)  It really doesn't look to me like
> an overload situation, but I can try moving the three USB 3.0 drives
> to USB 2.0 to slow things down even further.

Do you have a way to look at errors directly on the USB bus ?

> 	That leaves still unexplained ZFS's failure to make use of
> multiple copies for error correction during the reading of a file or
> to fix in one scrub everything that was fixable.
>>>
>>> Script started on Wed Sep 17 01:37:38 2014
>>> [hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
>>
>> This is the file the ZFS told you was corrupt, all bets are off.
>
> 	There should be only one bad block because the scrubs fixed
> everything else, right?

Not necessarily. I have not looked at the ZFS code (feel free to, it is
all open source), so I do not know for certain whether it gives up on a
file once it finds corruption.

> 	And that bad block is bad on all four drives, right?

Or the I/O to all four drives was interrupted at the same TIME ... I
have seen that before when it was a device driver stack that was having
trouble (which is what I suspect here).

>> The fact that you have TWO different drives from TWO different
>> vendors exhibiting the same problem (and to the same degree) makes
>> me think that the problem is NOT with the drives but elsewhere with
>> your system. I have started tracking usage and failure statistics
>> for my personal drives (currently 26 of them, but I have 4 more
>> coming back from Seagate as
>
> 	Whooweee!  That's a heap of drives!  IIRC, for a chi^2
> distribution, 30 isn't bad for a sample size.  How many of those
> drives are of larger capacity than 1 TB?

Not really, I used to manage hundreds of drives.
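If you want to keep similar numbers for your own drives, smartmontools
(sysutils/smartmontools in ports) makes it easy to pull the raw
counters. A rough sketch (the device names are just examples, adjust
for your system, and I am not sure how much SMART data will make it
through the USB bridges in your external enclosures):

	# full SMART report for one drive
	smartctl -a /dev/ada0

	# just the counters I would watch most closely
	smartctl -a /dev/ada0 | egrep \
	    'Power_On_Hours|Reallocated_Sector|Current_Pending_Sector|UDMA_CRC_Error'

Reallocated and pending sector counts that keep climbing are the
classic "replace me soon" sign, while a growing UDMA CRC error count
usually points at cabling or the interface rather than the platters.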
When I have 2 out of 4 Seagate ES.2 1 TB drives and 1 out of 2 HGST
UltraStar 1 TB drives fail under warranty, I am still not willing to
say that overall both Seagate and HGST have a 50% failure rate ...
specifically because I do not consider 4 (or worse, 2) drives a
statistically significant sample :-)

In terms of drive sizes, a little over 50% are 1 TB or larger (not
counting the 4 Seagate 1 TB warranty replacement drives that arrived
today). Of the 11 1 TB drives in the sample (not counting the ones that
arrived today), 3 have failed under warranty (so far). Of the 4 2 TB
drives in the sample set, none have failed yet, but they are all less
than 1 year old.

>> The system you are trying to use ZFS on may just not be able to
>> handle the throughput (both memory and disk I/O) generated by ZFS
>> without breaking. This may NOT just be a question of amount of RAM,
>> but of the reliability of the motherboard/CPU/RAM/device interfaces
>> when stressed.
>
> 	I did do a fair amount of testing with mprime last year and
> found no problems.

From the brief research I did, it looks like mprime is a computational
program and will test only limited portions of a system (CPU and RAM
mostly).

> 	I monitor CPU temperatures frequently, especially when I'm
> running a test like the ones I've been doing, and the temperatures
> have remained reasonable throughout.  (My air-conditioning bill has
> not been similarly reasonable, I'm sorry to say.)
> 	That having been said, though, between your remarks and Andrew
> Berg's, there does seem cause to run another scrub, perhaps two, with
> those three drives connected via USB 2.0 instead of USB 3.0 to see
> what happens when everything is slowed down drastically.  I'll give
> that a try when I find time.  That won't address the ZFS-related
> questions or the differences in error rates on different drives, but
> might reveal an underlying system hardware issue.
> 	Maybe a PCIE2 board is too slow for USB 3.0, although the
> motherboard controller, BIOS, USB 3.0 controller, and kernel all
> declined to complain.  If it is, then the eSATA card I bought
> (SATA II) would likely be useless as well. :-<
>
>> In the early days of ZFS it was noticed that ZFS stressed the CPU
>> and memory systems of a server harder than virtually any other task.
>
> 	When would that have been, please?  (I don't know much ZFS
> history.)  I believe this machine dates to 2006 or more likely 2007,
> although the USB 3.0 card was new last year.  The VIA Firewire card
> was installed at the same time as the USB 3.0 card, but it was not
> new at that time.

That would have been the 2005-2007 timeframe. A Sun SF-V240 could be
brought to its knees by a large ZFS copy operation: both CPUs would
peg, and all the memory bandwidth would be consumed by the I/O
operations. The Sun T-2000 was much better, as it had (effectively) 32
logical CPUs (8 cores each with 4 execution threads) and ZFS really
likes multiprocessor environments.

--
Paul Kraus
paul@kraus-haus.org
Co-Chair Albacon 2014.5
http://www.albacon.org/2014/