Subject: Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
From: Steven Hartland <killing@multiplay.co.uk>
To: Karl Denninger, freebsd-stable@freebsd.org
Date: Sat, 20 Apr 2019 21:56:45 +0100

Thanks for the extra info. The next question would be: have you eliminated
the possibility that the corruption exists before the disk is removed? It
would be interesting to add a zpool scrub to confirm this isn't the case
before the disk removal is attempted.

    Regards
    Steve

On 20/04/2019 18:35, Karl Denninger wrote:
>
> On 4/20/2019 10:50, Steven Hartland wrote:
>> Have you eliminated geli as a possible source?
> No; I could conceivably do so by re-creating another backup volume set
> without geli-encrypting the drives, but I do not have an extra set of
> drives of the required capacity lying around to do that. I would have
> to do it with lower-capacity disks, which I can attempt if you think
> it would help.  I *do* have open slots in the drive backplane to set
> up a second "test" unit of this sort.
> For reasons below it will take at least a couple of weeks to get good
> data on whether the problem exists without geli, however.
>>
>> I've just set up an old server which has an LSI 2008 running old FW
>> (11.0) so was going to have a go at reproducing this.
>>
>> Apart from the disconnect steps below, is there anything else needed,
>> e.g. a read/write workload during the disconnect?
>
> Yes.  An attempt to recreate this on my sandbox machine using smaller
> disks (WD RE-320s) and a decent amount of read/write activity (tens to
> ~100 gigabytes) on a root mirror of three disks with one taken offline
> did not succeed.  It *reliably* appears, however, on my backup volumes
> with every drive swap.  The sandbox machine is physically identical
> other than the physical disks; both are Xeons with ECC RAM in them.
>
> The only operational difference is that the backup volume sets have a
> *lot* of data written to them via zfs send | zfs recv over the
> intervening period, whereas with "ordinary" I/O activity (which was the
> case on my sandbox) the I/O pattern is materially different.  The root
> pool on the sandbox where I tried to reproduce it synthetically *is*
> using geli (in fact it boots native-encrypted.)
>
> The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is a
> ~6-8 hour process.
>
> The usual process for the backup pool looks like this:
>
> Have 2 of the 3 physical disks mounted; the third is in the bank vault.
>
> Over the space of a week, the backup script is run daily.  It first
> imports the pool, and then for each zfs filesystem it is backing up
> (which is not all of them; I have a few volatile ones that I don't care
> if I lose, such as object directories for builds and such, plus some
> that are R/O data sets that are backed up separately) it does:
>
> If there is no "...@zfs-base":
>
>     zfs snapshot -r ...@zfs-base
>     zfs send -R ...@zfs-base | zfs receive -Fuvd $BACKUP
>
> else
>
>     zfs rename -r ...@zfs-base ...@zfs-old
>     zfs snapshot -r ...@zfs-base
>     zfs send -RI ...@zfs-old ...@zfs-base | zfs recv -Fudv $BACKUP
>
>     .... if ok then zfs destroy -vr ...@zfs-old, otherwise print a
>     complaint and stop.
>
> When all are complete it then does a "zpool export backup" to detach
> the pool, in order to reduce the risk of "stupid root user" (me)
> accidents.
>
> In short, I send an incremental of the changes since the last backup,
> which in many cases includes a bunch of automatic snapshots that are
> taken on a frequent basis out of cron.  Typically there are a week's
> worth of these that accumulate between swaps of the disk to the vault,
> and the offline'd disk remains that way for a week.  I also wait for
> the zfs destroy on each of the targets to drain before continuing, as
> not doing so back in the 9 and 10.x days was a good way to stimulate an
> instant panic on re-import the next day due to kernel stack page
> exhaustion if the previous operation destroyed hundreds of gigabytes of
> snapshots (which does routinely happen, as part of the backed-up data
> is Macrium images from PCs, so when a new month comes around the PC's
> backup routine removes a huge amount of old data from the filesystem.)
>
> Trying to simulate the checksum errors in a few hours' time thus far
> has failed.  But every time I swap the disks on a weekly basis I get a
> handful of checksum errors on the scrub.
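For illustration, the per-filesystem logic described above might look
roughly like the following sh sketch.  This is not the actual script from
this thread; the dataset list and the error handling are hypothetical, and
only the snapshot names, the send/receive flags and the destination pool
come from the description.

    #!/bin/sh
    # Illustrative sketch of the backup loop described above.
    BACKUP=backup
    DATASETS="pool/fs1 pool/fs2"   # hypothetical list of filesystems to back up

    zpool import "$BACKUP"

    for fs in $DATASETS; do
        if ! zfs list -t snapshot "${fs}@zfs-base" >/dev/null 2>&1; then
            # First pass: create the base snapshot and send it in full.
            zfs snapshot -r "${fs}@zfs-base"
            zfs send -R "${fs}@zfs-base" | zfs receive -Fuvd "$BACKUP"
        else
            # Later passes: roll the base forward and send the increment.
            zfs rename -r "${fs}@zfs-base" "${fs}@zfs-old"
            zfs snapshot -r "${fs}@zfs-base"
            if zfs send -RI "${fs}@zfs-old" "${fs}@zfs-base" | \
                    zfs recv -Fudv "$BACKUP"; then
                zfs destroy -vr "${fs}@zfs-old"
            else
                echo "incremental send of ${fs} failed" >&2
                exit 1
            fi
        fi
    done

    zpool export "$BACKUP"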
> If I export and re-import the backup mirror after that the counters are
> zeroed -- the checksum error count does *not* remain across an
> export/import cycle although the "scrub repaired" line remains.
>
> For example, after the scrub completed this morning I exported the pool
> (the script expects the pool exported before it begins) and ran the
> backup.  When it was complete:
>
> root@NewFS:~/backup-zfs # zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning
>         in a degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat
>         Apr 20 08:45:09 2019
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0     0     0
>           mirror-0                DEGRADED     0     0     0
>             gpt/backup61.eli      ONLINE       0     0     0
>             gpt/backup62-1.eli    ONLINE       0     0     0
>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
> It knows it fixed the checksums but the error count is zero -- I did
> NOT "zpool clear".
>
> This may have been present in 11.2; I didn't run that long enough in
> this environment to know.  It definitely was *not* present in 11.1 and
> before; the same data structure and script for backups has been in use
> for a very long time without any changes, and this first appeared when
> I upgraded from 11.1 to 12.0 on this specific machine, with the exact
> same physical disks being used for over a year (they're currently 6Tb
> units; the last change-out for those was ~1.5 years ago when I went
> from 4Tb to 6Tb volumes.)  I have both HGST-NAS and He-Enterprise disks
> in the rotation and both show identical behavior, so it doesn't appear
> to be related to a firmware problem in one disk vs. the other (e.g.
> firmware that fails to flush the on-drive cache before going to standby
> even though it was told to.)
>
>>
>> mps0: port 0xe000-0xe0ff mem 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff
>> irq 26 at device 0.0 on pci3
>> mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities: 185c
>>
>>     Regards
>>     Steve
>>
>> On 20/04/2019 15:39, Karl Denninger wrote:
>>> I can confirm that 20.00.07.00 does *not* stop this.
>>> The previous write/scrub on this device was on 20.00.07.00.  It was
>>> swapped back in from the vault yesterday, resilvered without incident,
>>> but a scrub says....
>>>
>>> root@NewFS:/home/karl # zpool status backup
>>>   pool: backup
>>>  state: DEGRADED
>>> status: One or more devices has experienced an unrecoverable error.  An
>>>         attempt was made to correct the error.  Applications are
>>>         unaffected.
>>> action: Determine if the device needs to be replaced, and clear the
>>>         errors using 'zpool clear' or replace the device with
>>>         'zpool replace'.
>>>    see: http://illumos.org/msg/ZFS-8000-9P
>>>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat
>>>         Apr 20 08:45:09 2019
>>> config:
>>>
>>>         NAME                      STATE     READ WRITE CKSUM
>>>         backup                    DEGRADED     0     0     0
>>>           mirror-0                DEGRADED     0     0     0
>>>             gpt/backup61.eli      ONLINE       0     0     0
>>>             gpt/backup62-1.eli    ONLINE       0     0    47
>>>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>>>
>>> errors: No known data errors
>>>
>>> So this is firmware-invariant (at least between 19.00.00.00 and
>>> 20.00.07.00); the issue persists.
>>>
>>> Again, in my instance these devices are never removed "unsolicited" so
>>> there can't be (or at least shouldn't be able to) unflushed data in the
>>> device or kernel cache.  The procedure is and remains:
>>>
>>> zpool offline .....
>>> geli detach .....
>>> camcontrol standby ...
>>>
>>> Wait a few seconds for the spindle to spin down.
>>>
>>> Remove disk.
>>>
>>> Then of course on the other side after insertion and the kernel has
>>> reported "finding" the device:
>>>
>>> geli attach ...
>>> zpool online ....
>>>
>>> Wait...
>>>
>>> If this is a boogered TXG that's held in the metadata for the
>>> "offline"'d device (maybe "off by one"?) that's potentially bad in that
>>> if there is an unknown failure in the other mirror component the
>>> resilver will complete but data has been irrevocably destroyed.
>>>
>>> Granted, this is a very low probability scenario (the area where the
>>> bad checksums are has to be where the corruption hits, and it has to
>>> happen between the resilver and access to that data.)  Those are long
>>> odds but nonetheless a window of "you're hosed" does appear to exist.
>>>
>>
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/
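For concreteness, the swap procedure quoted above, with the scrub suggested
at the top of this message folded in before the offline, might look roughly
like the following.  The pool and GPT label names are the ones shown in the
status output; the keyfile path and the daN device are hypothetical, and
the exact geli options depend on how the providers were initialised.

    # Swap-out: scrub first, then detach cleanly and spin the drive down.
    zpool scrub backup
    zpool status backup              # repeat until the scrub reports completed
    zpool offline backup gpt/backup62-2.eli
    geli detach gpt/backup62-2.eli
    camcontrol standby da4           # hypothetical CAM device for that disk
    # wait a few seconds for the spindle to stop, then remove the disk

    # Swap-in, once the kernel reports the inserted disk:
    geli attach -k /path/to/keyfile gpt/backup62-2   # creates gpt/backup62-2.eli
    zpool online backup gpt/backup62-2.eli           # resilver starts automatically
    zpool status backup

The point of the extra scrub is the check suggested at the start of this
message: if checksum errors already show up before the disk is ever taken
offline, then the removal/resilver path itself is not the culprit.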