Subject: Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
From: Steven Hartland <killing@multiplay.co.uk>
To: Karl Denninger, freebsd-stable@freebsd.org
Date: Sat, 20 Apr 2019 21:56:45 +0100

Thanks for the extra info. The next question would be: have you eliminated
the possibility that the corruption exists before the disk is removed? It
would be interesting to add a zpool scrub to confirm this isn't the case
before the disk removal is attempted.

    Regards
    Steve

On 20/04/2019 18:35, Karl Denninger wrote:
>
> On 4/20/2019 10:50, Steven Hartland wrote:
>> Have you eliminated geli as a possible source?
> No; I could conceivably do so by re-creating another backup volume set
> without geli-encrypting the drives, but I do not have an extra set of
> drives of the required capacity lying around to do that. I would have
> to do it with lower-capacity disks, which I can attempt if you think
> it would help.  I *do* have open slots in the drive backplane to set
> up a second "test" unit of this sort.
> For reasons below it will take at least a couple of weeks to get good
> data on whether the problem exists without geli, however.
>>
>> I've just set up an old server which has an LSI 2008 running old FW
>> (11.0) so was going to have a go at reproducing this.
>>
>> Apart from the disconnect steps below, is there anything else needed,
>> e.g. a read/write workload during the disconnect?
>
> Yes.  An attempt to recreate this on my sandbox machine using smaller
> disks (WD RE-320s) and a decent amount of read/write activity (tens to
> ~100 gigabytes) on a root mirror of three disks with one taken offline
> did not succeed.  It *reliably* appears, however, on my backup volumes
> with every drive swap.  The sandbox machine is physically identical
> other than the physical disks; both are Xeons with ECC RAM in them.
>
> The only operational difference is that the backup volume sets have a
> *lot* of data written to them via zfs send | zfs recv over the
> intervening period, whereas with "ordinary" I/O activity (which was the
> case on my sandbox) the I/O pattern is materially different.  The root
> pool on the sandbox where I tried to reproduce it synthetically *is*
> using geli (in fact it boots native-encrypted.)
>
> The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is a
> ~6-8 hour process.
>
> The usual process for the backup pool looks like this:
>
> Have 2 of the 3 physical disks mounted; the third is in the bank vault.
>
> Over the space of a week, the backup script is run daily.  It first
> imports the pool, and then for each zfs filesystem it is backing up
> (which is not all of them; I have a few volatile ones that I don't care
> if I lose, such as object directories for builds and such, plus some
> that are R/O data sets that are backed up separately) it does:
>
> If there is no "...@zfs-base":
>
>     zfs snapshot -r ...@zfs-base
>     zfs send -R ...@zfs-base | zfs receive -Fuvd $BACKUP
>
> else
>
>     zfs rename -r ...@zfs-base ...@zfs-old
>     zfs snapshot -r ...@zfs-base
>     zfs send -RI ...@zfs-old ...@zfs-base | zfs recv -Fudv $BACKUP
>
>     .... if ok then zfs destroy -vr ...@zfs-old, otherwise print a
>     complaint and stop.
>
> When all are complete it then does a "zpool export backup" to detach
> the pool, in order to reduce the risk of "stupid root user" (me)
> accidents.
>
> In short, I send an incremental of the changes since the last backup,
> which in many cases includes a bunch of automatic snapshots that are
> taken on a frequent basis out of cron.  Typically there are a week's
> worth of these that accumulate between swaps of the disk to the vault,
> and the offline'd disk remains that way for a week.  I also wait for
> the zfs destroy on each of the targets to drain before continuing, as
> not doing so back in the 9 and 10.x days was a good way to stimulate an
> instant panic on re-import the next day due to kernel stack page
> exhaustion if the previous operation destroyed hundreds of gigabytes of
> snapshots (which does routinely happen, as part of the backed-up data
> is Macrium images from PCs, so when a new month comes around the PC's
> backup routine removes a huge amount of old data from the filesystem.)
>
> Trying to simulate the checksum errors in a few hours' time thus far
> has failed.  But every time I swap the disks on a weekly basis I get a
> handful of checksum errors on the scrub.
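For illustration, the per-filesystem logic described above might look
roughly like the following sh sketch.  This is not the actual script from
this thread; the dataset list and the error handling are hypothetical, and
only the snapshot names, the send/receive flags and the destination pool
come from the description.

    #!/bin/sh
    # Illustrative sketch of the backup loop described above.
    BACKUP=backup
    DATASETS="pool/fs1 pool/fs2"   # hypothetical list of filesystems to back up

    zpool import "$BACKUP"

    for fs in $DATASETS; do
        if ! zfs list -t snapshot "${fs}@zfs-base" >/dev/null 2>&1; then
            # First pass: create the base snapshot and send it in full.
            zfs snapshot -r "${fs}@zfs-base"
            zfs send -R "${fs}@zfs-base" | zfs receive -Fuvd "$BACKUP"
        else
            # Later passes: roll the base forward and send the increment.
            zfs rename -r "${fs}@zfs-base" "${fs}@zfs-old"
            zfs snapshot -r "${fs}@zfs-base"
            if zfs send -RI "${fs}@zfs-old" "${fs}@zfs-base" | \
                    zfs recv -Fudv "$BACKUP"; then
                zfs destroy -vr "${fs}@zfs-old"
            else
                echo "incremental send of ${fs} failed" >&2
                exit 1
            fi
        fi
    done

    zpool export "$BACKUP"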
> If I export and re-import the backup mirror after that the counters are
> zeroed -- the checksum error count does *not* remain across an
> export/import cycle although the "scrub repaired" line remains.
>
> For example, after the scrub completed this morning I exported the pool
> (the script expects the pool exported before it begins) and ran the
> backup.  When it was complete:
>
> root@NewFS:~/backup-zfs # zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning
>         in a degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat
>         Apr 20 08:45:09 2019
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0     0     0
>           mirror-0                DEGRADED     0     0     0
>             gpt/backup61.eli      ONLINE       0     0     0
>             gpt/backup62-1.eli    ONLINE       0     0     0
>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
> It knows it fixed the checksums but the error count is zero -- I did
> NOT "zpool clear".
>
> This may have been present in 11.2; I didn't run that long enough in
> this environment to know.  It definitely was *not* present in 11.1 and
> before; the same data structure and script for backups has been in use
> for a very long time without any changes, and this first appeared when
> I upgraded from 11.1 to 12.0 on this specific machine, with the exact
> same physical disks being used for over a year (they're currently 6Tb
> units; the last change-out for those was ~1.5 years ago when I went
> from 4Tb to 6Tb volumes.)  I have both HGST-NAS and He-Enterprise disks
> in the rotation and both show identical behavior, so it doesn't appear
> to be related to a firmware problem in one disk vs. the other (e.g.
> firmware that fails to flush the on-drive cache before going to standby
> even though it was told to.)
>
>>
>> mps0: port 0xe000-0xe0ff mem 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff
>> irq 26 at device 0.0 on pci3
>> mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities: 185c
>>
>>     Regards
>>     Steve
>>
>> On 20/04/2019 15:39, Karl Denninger wrote:
>>> I can confirm that 20.00.07.00 does *not* stop this.
>>> The previous write/scrub on this device was on 20.00.07.00.  It was
>>> swapped back in from the vault yesterday, resilvered without incident,
>>> but a scrub says....
>>>
>>> root@NewFS:/home/karl # zpool status backup
>>>   pool: backup
>>>  state: DEGRADED
>>> status: One or more devices has experienced an unrecoverable error.  An
>>>         attempt was made to correct the error.  Applications are
>>>         unaffected.
>>> action: Determine if the device needs to be replaced, and clear the
>>>         errors using 'zpool clear' or replace the device with
>>>         'zpool replace'.
>>>    see: http://illumos.org/msg/ZFS-8000-9P
>>>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat
>>>         Apr 20 08:45:09 2019
>>> config:
>>>
>>>         NAME                      STATE     READ WRITE CKSUM
>>>         backup                    DEGRADED     0     0     0
>>>           mirror-0                DEGRADED     0     0     0
>>>             gpt/backup61.eli      ONLINE       0     0     0
>>>             gpt/backup62-1.eli    ONLINE       0     0    47
>>>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>>>
>>> errors: No known data errors
>>>
>>> So this is firmware-invariant (at least between 19.00.00.00 and
>>> 20.00.07.00); the issue persists.
>>>
>>> Again, in my instance these devices are never removed "unsolicited" so
>>> there can't be (or at least shouldn't be able to) unflushed data in the
>>> device or kernel cache.  The procedure is and remains:
>>>
>>> zpool offline .....
>>> geli detach .....
>>> camcontrol standby ...
>>>
>>> Wait a few seconds for the spindle to spin down.
>>>
>>> Remove disk.
>>>
>>> Then of course on the other side after insertion and the kernel has
>>> reported "finding" the device:
>>>
>>> geli attach ...
>>> zpool online ....
>>>
>>> Wait...
>>>
>>> If this is a boogered TXG that's held in the metadata for the
>>> "offline"'d device (maybe "off by one"?) that's potentially bad in that
>>> if there is an unknown failure in the other mirror component the
>>> resilver will complete but data has been irrevocably destroyed.
>>>
>>> Granted, this is a very low probability scenario (the area where the
>>> bad checksums are has to be where the corruption hits, and it has to
>>> happen between the resilver and access to that data.)  Those are long
>>> odds but nonetheless a window of "you're hosed" does appear to exist.
>>>
>>
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/
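For concreteness, the swap procedure quoted above, with the scrub suggested
at the top of this message folded in before the offline, might look roughly
like the following.  The pool and GPT label names are the ones shown in the
status output; the keyfile path and the daN device are hypothetical, and
the exact geli options depend on how the providers were initialised.

    # Swap-out: scrub first, then detach cleanly and spin the drive down.
    zpool scrub backup
    zpool status backup              # repeat until the scrub reports completed
    zpool offline backup gpt/backup62-2.eli
    geli detach gpt/backup62-2.eli
    camcontrol standby da4           # hypothetical CAM device for that disk
    # wait a few seconds for the spindle to stop, then remove the disk

    # Swap-in, once the kernel reports the inserted disk:
    geli attach -k /path/to/keyfile gpt/backup62-2   # creates gpt/backup62-2.eli
    zpool online backup gpt/backup62-2.eli           # resilver starts automatically
    zpool status backup

The point of the extra scrub is the check suggested at the start of this
message: if checksum errors already show up before the disk is ever taken
offline, then the removal/resilver path itself is not the culprit.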