From: Andrew Reilly <areilly@bigpond.net.au>
Subject: Hours of tiny transfers at the end of a ZFS resilver?
Date: Mon, 15 Feb 2016 21:18:59 +1100
To: freebsd-fs@freebsd.org
Message-Id: <120226C8-3003-4334-9F5F-882CCB0D28C5@bigpond.net.au>

Hi Filesystem experts,

I have a question about the nature of ZFS and the resilvering that occurs after a drive replacement in a raidz array.

I have a fairly simple home file server that (by way of gradually replaced pieces and upgrades) has effectively been doing great service since, well, forever, but its last re-build replaced its main UFS file systems with a four-drive ZFS raidz pool. It has been going very nicely over the years, and now it's full, so I've nearly finished replacing its 1TB drives with new 4TB ones. I'm doing that the slow way, replacing one drive at a time and resilvering before going on to the next, because that only requires a minute or two of down-time for each drive swap. Replacing the whole array and restoring from backup would have had the system off-line for many hours (I guess).
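For the record, each swap looks roughly like this (device names vary per drive, and the partitioning step is just how I happen to set the new disks up, from memory):

    zpool offline tank ada4p1            # take the old drive out of service
    # shut down briefly and swap the physical drive, then:
    gpart create -s gpt ada4             # new GPT table on the replacement
    gpart add -t freebsd-zfs -a 4k ada4  # creates ada4p1
    zpool replace tank ada4p1            # resilver onto the new partition
    zpool status tank                    # wait for the resilver before the next drive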
Now, one thing that I didn't realise at the start of this process was that the zpool has the original 512B sector size baked in at a fairly low level, so it is using some sort of work-around for the fact that the new drives actually have 4096B sectors (although they lie about that in smartctl -i queries).

The four new drives appear to smartctl as:

Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN724040ALE640
Serial Number:    PK1334PEHYSZ6S
LU WWN Device Id: 5 000cca 250dba043
Firmware Version: MJAOA5E0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Feb 15 20:57:30 2016 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

They show up in zpool status as:

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Feb 15 07:19:45 2016
        3.12T scanned out of 3.23T at 67.4M/s, 0h29m to go
        798G resilvered, 96.48% done
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       DEGRADED     0     0     0
          raidz1-0                 DEGRADED     0     0     0
            ada0p1                 ONLINE       0     0     0  block size: 512B configured, 4096B native
            ada1p1                 ONLINE       0     0     0  block size: 512B configured, 4096B native
            ada3p1                 ONLINE       0     0     0  block size: 512B configured, 4096B native
            replacing-3            DEGRADED     0     0     0
              17520966596084542745 UNAVAIL      0     0     0  was /dev/ada4p1/old
              ada4p1               ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)

errors: No known data errors

While clearly sub-optimal, I expect that the performance will still be good enough for my purposes: I can build a new, properly aligned file system when I do the next re-build.

The odd thing is that after charging through the resilver using large blocks (around 64k according to systat), when it gets to the end, as this one has now, the process drags on for hours with millions of tiny, sub-2K transfers. Here is the relevant part of the systat -vmstat output right now (the rest of the screen trimmed):

19 users    Load  0.28  0.30  0.27    15 Feb 21:01

 2.8%Sys   0.5%Intr  3.9%User  0.0%Nice 92.9%Idle

Disks   da0  ada0  ada1  ada2  ada3  ada4 pass0
KB/t   0.00  1.41  1.46  0.00  1.39  1.47  0.00
tps       0   173   138     0   176   151     0
MB/s   0.00  0.24  0.20  0.00  0.24  0.22  0.00
%busy     0    18    15     0    17    99     0

So there's a problem with the zpool status output: it's predicting half an hour to go based on the 67.4M/s averaged over the whole drive, not the <2MB/s that it's actually doing, and will probably continue to do so for several hours, if tonight goes the same way as last night.
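In the meantime, the only way I've found to see the real progress (rather than the whole-resilver average) is to log the scan counters every minute or so and difference them by hand, with something like:

    while :; do
        date
        zpool status tank | grep -E 'scanned|resilvered'
        sleep 60
    done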
Last night zpool status said "0h05m to go" for more than three hours, before I gave up waiting to start the next drive.

Is this expected behaviour, or something bad and peculiar about my system?

I'm clearly confused about how ZFS really works, given this behaviour. I had thought that the zpool layer did parity calculation in big 256k-ish stripes across the drives, and that the zfs filesystem layer coped with that large block size because it had lots of caching and wrote everything in log-structure. Clearly that mental model must be incorrect, because then the resilver would only ever be doing large transfers. Is there anywhere I could go to find a nice write-up of how ZFS actually works?

Cheers,

-- 
Andrew
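P.S. To spell out why the tiny transfers surprise me: if that mental model were right, a 256KiB stripe on a four-drive raidz1 would put roughly 256/3 ~= 85KiB of data on each of three drives plus an ~85KiB parity column on the fourth, so every per-disk transfer ought to be tens of kilobytes, nothing like the ~1.4KB per transfer that systat is reporting.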