Date:      Sun, 23 Jul 2017 20:52:54 +0000
From:      bugzilla-noreply@freebsd.org
To:        freebsd-bugs@FreeBSD.org
Subject:   [Bug 219399] System panics after several hours of 14-threads-compilation orgies using poudriere on AMD Ryzen...
Message-ID:  <bug-219399-8-jg88T4h7cK@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-219399-8@https.bugs.freebsd.org/bugzilla/>
References:  <bug-219399-8@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

--- Comment #89 from Don Lewis <truckman@FreeBSD.org> ---
Created attachment 184641
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=184641&action=edit
patch to move amd64 shared page to a lower address to avoid Ryzen problem with
executing code near user address upper limit

I've been doing a number of experiments with openjdk7 builds to try to better
characterize the Ryzen problem.

First I did a number of openjdk7 builds using cpuset to pin the build to
individual cores.  Using cpuset -l 0 to pin the build to the first thread on
core 0 would consistently cause a silent reboot on the first or second try.
Pinning the build to any of the other cores allowed me to successfully build
openjdk7.  I ran four builds on each of the other cores to make sure that I
wasn't just getting a successful build by chance.  Surprisingly, pinning the
build to the second thread on core 0 was also successful.  In any case, the
results were consistent with my earlier tests where I disabled SMT and also
all but two cores in the BIOS, since those tests always used the first thread
on core 0.

I tried building openjdk7 on all cores except the first thread of core 0 by
using cpuset -l 1-15 and was also successful.
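
For illustration, the per-CPU pinning runs described above could be planned
with a small dry-run script like the one below.  This is only a sketch, not
the commands that were actually used: BUILD_CMD is a hypothetical placeholder
(the message does not say how each openjdk7 build was started), pin_plan is a
made-up helper name, and the 0-15 CPU range is inferred from the cpuset -l 1-15
list.  The script only prints the commands; it does not run them.

```shell
#!/bin/sh
# Dry-run sketch: print one "cpuset -l <cpu> <command>" line per CPU and run.
# BUILD_CMD is a hypothetical placeholder, not a command from the message.
BUILD_CMD=${BUILD_CMD:-"your-openjdk7-build-command"}

# pin_plan FIRST LAST RUNS
# Emit the pinned build commands for CPUs FIRST..LAST, RUNS times each.
pin_plan() {
    cpu=$1
    while [ "$cpu" -le "$2" ]; do
        run=1
        while [ "$run" -le "$3" ]; do
            echo "cpuset -l $cpu $BUILD_CMD"
            run=$((run + 1))
        done
        cpu=$((cpu + 1))
    done
}

# Assumed layout: 16 logical CPUs numbered 0-15, four builds per CPU.
pin_plan 0 15 4
```

Printing the plan first makes it easy to review or trim the list (for example,
dropping CPU 0) before piping it to sh on the test machine.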

Based on that positive result, I tried building my default set of ~1600 ports
with cpuset -l 1-15.  A little over two hours into the build, the llvm40 build
failed with the assertion:
  _arena.c:821: Failed assertion: "nstime_compare(&decay->epoch, &time) <= 0"
causing the ports that depend on it to be skipped, but everything else built
successfully.  When I restarted poudriere, the llvm40 build succeeded, but the
system hung after about an hour while running java as part of the openjdk7
build.

Next I tried building with cpuset -l 2-15.  The only problem that I ran into
is that the gcc build failed with SIGBUS, causing the ports that depend on it
to be skipped.  When I restarted poudriere, gcc5 and the remaining ports built
successfully.

I wanted to try to eliminate the possibility of a subtle defect in core 0 as a
potential cause of the problem, so I tried adding
 hint.lapic.0.disabled=1
 hint.lapic.1.disabled=1
to /boot/loader.conf, but FreeBSD does not allow the BSP to be disabled B-(

The other thing that is unique about core 0 on my machine is that it looks
like all of the external interrupts (but not interprocessor interrupts) go
there.  The biggest source of those seemed to be hpet, but I couldn't figure
out how to disable that (other than maybe disabling ACPI totally).  When I
tried hint.hpet.0.clock=0, all of the CPUs got assigned interrupts from
another timer.

The next thing I tried was inspired by the Dragonfly patch.  At least some
thread implementations use signals to communicate between threads.  I'm not
familiar with OpenJDK, but it is possible that it is such an implementation,
so it might be a heavy signal user and spend a lot of cycles in the signal
trampoline code.  Our signal trampoline code is in a different location than
Dragonfly uses, but it is still close to (in the top page of) the top of user
memory.  Even though I got the impression that the Dragonfly patch addresses
an issue with SMT, it does involve an interaction between interrupts and
execution of code near the top of user memory.

As an experiment, I patched the kernel to move the location of the shared page
lower by PAGE_SIZE.  I'm not sure if it is necessary, but the page at the old
location has the same rwx permissions and is zero filled.  I don't know if the
bug is triggered by executing code close to the upper address boundary or
close to a permission boundary.  The preliminary results so far are very
promising.
With the patch applied, I am able to successfully build openjdk7 either
unpinned or pinned to the first thread of core 0.
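
To make the "lower by PAGE_SIZE" move concrete, the address arithmetic can be
sketched as below.  The numbers are assumptions for illustration and are not
taken from the attached patch: a 4 KiB page and an amd64 user address upper
limit of 0x800000000000.  The helper name page_below_limit is made up for this
example.

```shell
#!/bin/sh
# Assumed values (not from the patch): 4 KiB pages and an amd64 user
# address upper limit of 0x800000000000.
PAGE_SIZE=4096
USER_LIMIT=$((0x800000000000))

# page_below_limit N
# Print the start address of the Nth page below the user address limit.
page_below_limit() {
    printf '0x%x\n' $((USER_LIMIT - $1 * PAGE_SIZE))
}

# Top page of user memory: where the shared page sits before the change.
echo "old shared page: $(page_below_limit 1)"
# One page lower: where the experiment moves it.
echo "new shared page: $(page_below_limit 2)"
```

Under those assumptions the shared page moves from 0x7ffffffff000 to
0x7fffffffe000, so the signal trampoline no longer executes in the last page
below the user address limit.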

I just kicked off an unpinned ~1600 port poudriere run.  I should have results
of that late today.

The patch is attached.

-- 
You are receiving this mail because:
You are the assignee for the bug.