Subject: Re: head -r331499 amd64/threadripper panic in vm_page_free_prep during "poudriere bulk -a", after 14h 22m or so.
From: Mark Millard
In-Reply-To: <08B7C130-A38D-473A-8A73-CA79ED1A0044@yahoo.com>
Date: Thu, 5 Apr 2018 19:05:55 -0700
Cc: FreeBSD Current
To: Mark Johnston

On 2018-Mar-26, at 6:35 AM, Mark Millard wrote:

> [Unfortunately, I'd not be able to get back to this
> for many hours. I do not want to leave the machine
> at the db> prompt that long. So this is all there
> will be.]
>
> It got a different crash last night, after a little over 12
> hours of poudriere bulk -a activity, again while I was
> sleeping. Hand typed:
>
> kernel trap 12 with interrupts disabled
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 13; apic id = 0d
> fault virtual address = 0x20
> fault code = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff80b70867
> stack pointer = 0x28:0xfffffe00ebab8880
> frame pointer = 0x28:0xfffffe00ebab8890
> code segment = base 0x0, limit 0xfffff, type 0x1b
>              = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = resume, IOPL = 0
> current process = 44 (dom0)
> [ thread pid 44 tid 100277 ]
> Stopped at turnstile_broadcast+0x47: movq 0x20(%rbx,%rax,1),%rcx
>
> (So an offset from a null pointer, apparently.)
>
> bt shows:
>
> Tracing pid 44 tid 100277 td 0xfffff8010f938560
> turnstile_broadcast() at turnstile_broadcast+0x47/frame 0xfffffe00ebab8890
> __mtx_unlock_sleep() at __mtx_unlock_sleep+0xb9/frame 0xfffffe00ebab88c0
> vm_pageout_page_lock() at vm_pageout_page_lock+0x179/frame 0xfffffe00ebab8960
> vm_pageout_worker() at vm_pageout_worker+0xd3a/frame 0xfffffe00ebab8a50
> vm_pageout() at vm_pageout+0x133/frame 0xfffffe00ebab8a70
> fork_exit() at fork_exit+0x83/frame 0xfffffe00ebab8ab0
> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00ebab8ab0
> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>
> Dump again failed, the same way but with some byte
> value differences.
>
> (da1:storvsc1:0:0:0) WRITE(10). CDB 2a 00 35 39 8c c7 00 00 08 00
> (da1:storvsc1:0:0:0) CAM status Command timeout
> (da1:storvsc1:0:0:0) Error 5, Retries exhausted
> Aborting dump due to I/O error.
>
> ** DUMP FAILED (ERROR 5) **
> Cannot dump: unknown error (error=5)
>
> So this appears to be repeatable (for the Optane
> swap/page partition?).
>
> show reg:
>
> cs 0x20
> ds 0x3b ll+0x1a
> es 0x3b ll+0x1a
> fs 0x13
> gs 0x1b
> ss 0x28 ll+0x7
> rax 0
> rcx 0xfffff8010f938501
> rdx 0xfffff8010f938501
> rbx 0xfffffe00ebab8880
> rsp 0xfffffe00ebab8800
> rsi 0
> rdi 0
> r8 0
> r9 0
> r10 0
> r11 0
> r12 0
> r13 0xfffff8010f938560
> r14 0
> r15 0xffffffff81d67998 vm_dom+0x18
> rip 0xffffffff80b70867 turnstile_broadcast+0x47
> rflags 0x10056
> turnstile_broadcast+0x47: movq 0x20(%rbx,%rax,1),%rcx
>
> Around where rbx points:
>
> 0xfffffe00ebab8872: ab eb 0 fe ff ff 28 0 0 0 0 0 0 0
> 0xfffffe00ebab8880: 0 0 0 0 0 0 0 0 80 79 d6 81 ff ff
> 0xfffffe00ebab888e: ff ff c0 88 ab eb 0 fe ff ff 9 20 af 80
> 0xfffffe00ebab889c: ff ff ff ff 0 7b 2 d8 f f8 ff ff 98 79
>
> And it looks like we have that null pointer above.
>
> And I'm afraid that is it: I need to be off doing other things.
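
As a reading aid for the hex dump quoted above: the bytes are easier to
compare against the register and backtrace values once they are regrouped
into 8-byte words. What follows is only a minimal Python sketch, assuming
x86-64 little-endian byte order and that each dump line lists consecutive
bytes starting at the printed address. The helper names (parse_dump,
aligned_words) are made up for illustration and are not part of ddb.

```python
import struct

# Hand-typed dump from the quoted message, copied verbatim.
DUMP = """\
0xfffffe00ebab8872: ab eb 0 fe ff ff 28 0 0 0 0 0 0 0
0xfffffe00ebab8880: 0 0 0 0 0 0 0 0 80 79 d6 81 ff ff
0xfffffe00ebab888e: ff ff c0 88 ab eb 0 fe ff ff 9 20 af 80
0xfffffe00ebab889c: ff ff ff ff 0 7b 2 d8 f f8 ff ff 98 79
"""


def parse_dump(text):
    """Map each address to its byte value (hypothetical helper)."""
    mem = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        addr_text, byte_text = line.split(":", 1)
        addr = int(addr_text, 16)
        for offset, token in enumerate(byte_text.split()):
            mem[addr + offset] = int(token, 16)
    return mem


def aligned_words(mem):
    """Decode every fully covered, 8-byte-aligned little-endian word."""
    words = {}
    start = (min(mem) + 7) & ~7  # round up to the next 8-byte boundary
    for base in range(start, max(mem) + 1, 8):
        chunk = [mem.get(base + i) for i in range(8)]
        if None in chunk:
            continue  # word only partially present in the dump
        words[base] = struct.unpack("<Q", bytes(chunk))[0]
    return words


words = aligned_words(parse_dump(DUMP))
for base in sorted(words):
    print(f"{base:#018x}: {words[base]:#018x}")
```

Under those assumptions, the word at 0xfffffe00ebab8880 (where rbx points)
decodes to 0, and the word at 0xfffffe00ebab8890 decodes to
0xfffffe00ebab88c0, the same value bt lists as the frame of
__mtx_unlock_sleep().
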

I have since run 3 rounds of bulk -a, spanning over 126 hours in
total, and have not had any more failures. Between rounds I updated
/usr/src/ and did buildworld/buildkernel/install sequences so that
I would not fall far behind head.

I'm giving up on directly trying to replicate either of the two
types of failures that I'd reported. At least I know to use
"show panic" now.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away
in early 2018-Mar)