From: Karl Denninger
To: Sean Chittenden
Cc: freebsd-fs@freebsd.org
Subject: Re: Panic in ZFS during zfs recv (while snapshots being destroyed)
Date: Thu, 27 Aug 2015 15:44:26 -0500
Message-ID: <55DF76AA.3040103@denninger.net>
References: <55BB443E.8040801@denninger.net> <55CF7926.1030901@denninger.net> <55DF7191.2080409@denninger.net>
List-Id: Filesystems (freebsd-fs@freebsd.org)

No, but that does sound like it might be involved.

And yeah, this did start when I moved the root pool from a pair of
spinning-rust WD RE4s to a mirrored pair of Intel 530s. (The 530s are
darn nice performance-wise and reasonably inexpensive, and thus very
suitable for a root filesystem drive; they also pass the "pull the
power cord" test, incidentally.)

You may be onto something -- I'll try shutting it off. But I can't make
this happen on demand: it's an "every week or two" panic, ALWAYS while
the zfs send | zfs recv is running and ALWAYS on the same filesystem.
So it will be a fair while before I know if it's fixed (over a month,
given the usual pattern here, as that would be 4 "average" periods
without a panic).

I also wonder if I could tune this out with some of the other TRIM
parameters instead of losing it entirely:

vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
vfs.zfs.trim.enabled: 1
vfs.zfs.vdev.trim_max_pending: 10000
vfs.zfs.vdev.trim_max_active: 64
vfs.zfs.vdev.trim_min_active: 1

That it's panicking in mtx_lock_sleep() might point this way. The trace
shows it coming from zfs_onexit_destroy(), which ends up calling
zfs_unmount_snap(), and then it blows up in dounmount() while executing
mtx_lock_sleep().

I do wonder if I'm begging for new and innovative performance issues if
I run with TRIM off for an extended period of time, however..... :-)

On 8/27/2015 15:30, Sean Chittenden wrote:
> Have you tried disabling TRIM?  We recently ran into an issue where a
> `zfs delete` on a large dataset caused the host to panic because TRIM
> was tripping over the ZFS deadman timer.  Disabling TRIM worked as a
> valid workaround for us.  You mentioned a recent move to SSDs, so this
> can happen, esp after the drive has experienced a little bit of actual
> work.
> -sc
>
> --
> Sean Chittenden
> sean@chittenden.org
>
>> On Aug 27, 2015, at 13:22, Karl Denninger wrote:
>>
>> On 8/15/2015 12:38, Karl Denninger wrote:
>>> Update:
>>>
>>> This /appears/ to be related to attempting to send or receive a
>>> /cloned/ snapshot.
>>>
>>> I use /beadm/ to manage boot environments and the crashes have all
>>> come while send/recv-ing the root pool, which is the one where these
>>> clones get created.  It is /not/ consistent within a given snapshot
>>> when it crashes and a second attempt (which does a "recovery"
>>> send/receive) succeeds every time -- I've yet to have it panic twice
>>> sequentially.
>>>
>>> I surmise that the problem comes about when a file in the cloned
>>> snapshot is modified, but this is a guess at this point.
>>>
>>> I'm going to try to force replication of the problem on my test
>>> system.
>>>
>>> On 7/31/2015 04:47, Karl Denninger wrote:
>>>> I have an automated script that runs zfs send/recv copies to bring a
>>>> backup data set into congruence with the running copies nightly.  The
>>>> source has automated snapshots running on a fairly frequent basis
>>>> through zfs-auto-snapshot.
>>>>
>>>> Recently I have started having a panic show up about once a week
>>>> during the backup run, but it's inconsistent.  It is in the same
>>>> place, but I cannot force it to repeat.
>>>>
>>>> The trap itself is a page fault in kernel mode in the zfs code at
>>>> zfs_unmount_snap(); here's the traceback from the kvm (sorry for the
>>>> image link but I don't have a better option right now.)
>>>>
>>>> I'll try to get a dump, this is a production machine with encrypted
>>>> swap so it's not normally turned on.
>>>>
>>>> Note that the pool that appears to be involved (the backup pool) has
>>>> passed a scrub and thus I would assume the on-disk structure is
>>>> ok..... but that might be an unfair assumption.
>>>> It is always occurring in the same dataset although there are a
>>>> half-dozen that are sync'd -- if this one (the first one)
>>>> successfully completes during the run then all the rest will as well
>>>> (that is, whenever I restart the process it has always failed here.)
>>>> The source pool is also clean and passes a scrub.
>>>>
>>>> traceback is at http://www.denninger.net/kvmimage.png; apologies for
>>>> the image traceback but this is coming from a remote KVM.
>>>>
>>>> I first saw this on 10.1-STABLE and it is still happening on FreeBSD
>>>> 10.2-PRERELEASE #9 r285890M, which I updated to in an attempt to see
>>>> if the problem was something that had been addressed.
>>>>
>>>>
>>> --
>>> Karl Denninger
>>> karl@denninger.net
>>> /The Market Ticker/
>>> /[S/MIME encrypted email preferred]/
>> Second update: I have now taken another panic on 10.2-Stable, same
>> deal, but without any cloned snapshots in the source image.  I had
>> thought that removing cloned snapshots might eliminate the issue; that
>> is now out the window.
>>
>> It ONLY happens on this one filesystem (the root one, incidentally)
>> which is fairly-recently created as I moved this machine from spinning
>> rust to SSDs for the OS and root pool -- and only when it is being
>> backed up by using zfs send | zfs recv (with the receive going to a
>> different pool in the same machine.)  I have yet to be able to provoke
>> it when using zfs send to copy to a different machine on the same LAN,
>> but given that it is not able to be reproduced on demand I can't be
>> certain it's timing related (e.g. performance between the two pools in
>> question) or just that I haven't hit the unlucky combination.
>>
>> This looks like some sort of race condition and I will continue to see
>> if I can craft a case to make it occur "on demand"
>>
>> --
>> Karl Denninger
>> karl@denninger.net
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
>

--
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
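
A rough check on the "4 average periods" wait mentioned in the reply
above. This is only a sketch, assuming panics arrive independently at a
constant average rate (a Poisson model, which nobody in the thread
claims) and taking 10 days as a hypothetical stand-in for "every week or
two". Under those assumptions it estimates how often a panic-free
stretch would turn up by luck alone, with nothing actually fixed.

```python
import math


def prob_no_panic(quiet_days: float, mean_interval_days: float) -> float:
    """Chance of zero panics in quiet_days if the bug is still present.

    Assumes a Poisson process: the number of panics in a window of
    length t has mean t / mean_interval_days, so P(0 panics) = exp(-mean).
    """
    return math.exp(-quiet_days / mean_interval_days)


# Hypothetical figure: the thread only says "every week or two".
MEAN_INTERVAL_DAYS = 10.0

for periods in (1, 2, 3, 4):
    quiet = periods * MEAN_INTERVAL_DAYS
    chance = prob_no_panic(quiet, MEAN_INTERVAL_DAYS)
    print(f"{periods} average period(s), {quiet:.0f} days quiet: "
          f"{chance:.1%} chance by luck alone")
```

Under this model the chosen interval cancels out: four average periods
without a panic would happen by chance roughly 2% of the time (exp(-4)),
versus about 37% after a single period, which is why one quiet fortnight
says little and a month or more is a reasonable bar.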