From owner-freebsd-bugs@freebsd.org Sun Jul 23 20:52:55 2017
Return-Path:
Delivered-To: freebsd-bugs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
	by mailman.ysv.freebsd.org (Postfix) with ESMTP id F2817DAD1E2
	for ; Sun, 23 Jul 2017 20:52:54 +0000 (UTC)
	(envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client did not present a certificate)
	by mx1.freebsd.org (Postfix) with ESMTPS id D338A6F368
	for ; Sun, 23 Jul 2017 20:52:54 +0000 (UTC)
	(envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118])
	by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v6NKqs3i086579
	for ; Sun, 23 Jul 2017 20:52:54 GMT
	(envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 219399] System panics after several hours of
	14-threads-compilation orgies using poudriere on AMD Ryzen...
Date: Sun, 23 Jul 2017 20:52:54 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 11.0-STABLE
X-Bugzilla-Keywords:
X-Bugzilla-Severity: Affects Only Me
X-Bugzilla-Who: truckman@FreeBSD.org
X-Bugzilla-Status: New
X-Bugzilla-Resolution:
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: freebsd-bugs@FreeBSD.org
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: attachments.created
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-bugs@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Bug reports
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 23 Jul 2017 20:52:55 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

--- Comment #89 from Don Lewis ---
Created attachment 184641
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=184641&action=edit
patch to move amd64 shared page to a lower address to avoid Ryzen problem
with executing code near user address upper limit

I've been doing a number of experiments with openjdk7 builds to try to
better characterize the Ryzen problem.

First I did a number of openjdk7 builds using cpuset to pin the build to
individual cores. Using cpuset -l 0 to pin the build to the first thread
on core 0 would consistently cause a silent reboot on the first or second
try. Pinning the build to any of the other cores allowed me to
successfully build openjdk7. I ran four builds on each of the other cores
to make sure that I wasn't just getting a successful build by chance.
Surprisingly, pinning the build to the second thread on core 0 was also
successful.
In any case, the results were consistent with my earlier tests where I
disabled SMT and also all but two cores in the BIOS, since those tests
always used the first thread on core 0.

I tried building openjdk7 on all cores except the first thread of core 0
by using cpuset -l 1-15 and was also successful.

Based on that positive result, I tried building my default set of ~1600
ports with cpuset -l 1-15. A little over two hours into the build, the
llvm40 build failed with:
  _arena.c:821: Failed assertion: "nstime_compare(&decay->epoch, &time) <= 0"
causing the ports that depend on it to be skipped, but everything else
built successfully. When I restarted poudriere, the llvm40 build
succeeded, but the system hung after about an hour while running java as
part of the openjdk7 build.

Next I tried building with cpuset -l 2-15. The only problem that I ran
into is that the gcc build failed with SIGBUS, causing its dependencies
to be skipped. When I restarted poudriere, gcc5 and the remaining ports
built successfully.

I wanted to try to eliminate the possibility of a subtle defect in core 0
as a potential cause of the problem, so I tried adding
  hint.lapic.0.disabled=1
  hint.lapic.1.disabled=1
to /boot/loader.conf, but FreeBSD does not allow the BSP to be disabled
B-(

The other thing that is unique about core 0 on my machine is that it
looks like all of the external interrupts (but not interprocessor
interrupts) go there. The biggest source of those seemed to be hpet, but
I couldn't figure out how to disable that (other than maybe disabling
ACPI totally). When I tried hint.hpet.0.clock=0, all of the CPUs got
assigned interrupts from another timer.

The next thing I tried was inspired by the Dragonfly patch. At least some
thread implementations use signals to communicate between threads.
I'm not familiar with OpenJDK, but it is possible that it is such an
implementation, so it might be a heavy signal user and spend a lot of
cycles in the signal trampoline code. Our signal trampoline code is in a
different location than Dragonfly uses, but it is still close to (in the
top page of) the top of user memory. Even though I got the impression
that the Dragonfly patch addresses an issue with SMT, it does involve an
interaction between interrupts and execution of code near the top of user
memory.

As an experiment, I patched the kernel to move the location of the shared
page lower by PAGE_SIZE. I'm not sure if it is necessary, but the page at
the old location has the same rwx permissions and is zero filled. I don't
know if the bug is triggered by executing code close to the upper address
boundary or close to a permission boundary. The preliminary results so
far are very promising. With the patch applied, I am able to successfully
build openjdk7 either unpinned or pinned to the first thread of core 0.

I just kicked off an unpinned ~1600 port poudriere run. I should have
results of that late today.

The patch is attached.

-- 
You are receiving this mail because:
You are the assignee for the bug.
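
As an illustration of the address arithmetic behind the experiment
described in the message above (moving the shared page lower by
PAGE_SIZE), here is a minimal sketch. It is not the attached patch. The
constant values and the helper name shared_page_base are assumptions made
for the example: a 4 KiB page and a user address limit of 2^47, which is
what amd64 is generally understood to use.

```python
# Illustrative address arithmetic only. The constants are assumptions for
# this sketch and were not read from the attached patch.
PAGE_SIZE = 4096              # assumed amd64 base page size
VM_MAXUSER_ADDRESS = 1 << 47  # assumed upper limit of user addresses


def shared_page_base(pages_below_top: int) -> int:
    """Return the base of the page that starts `pages_below_top` pages
    below the user address limit (hypothetical helper)."""
    if pages_below_top < 1:
        raise ValueError("the page must lie entirely below the user limit")
    return VM_MAXUSER_ADDRESS - pages_below_top * PAGE_SIZE


# Shared page in the top page of user memory (the original placement).
old_base = shared_page_base(1)
# Shared page moved lower by PAGE_SIZE, as in the experiment.
new_base = shared_page_base(2)

if __name__ == "__main__":
    for label, base in (("old", old_base), ("new", new_base)):
        # Show the first and last byte of each page to make its distance
        # from the user address limit visible.
        print(f"{label}: {base:#x}..{base + PAGE_SIZE - 1:#x}")
```

Under these assumptions the old placement ends exactly at the user
address limit, while the new one leaves a full page between the
trampoline code and that limit. That is consistent with the remark that
the page at the old location still exists with the same permissions.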