Date:      Mon, 09 Oct 2017 14:45:53 +0000
From:      bugzilla-noreply@freebsd.org
To:        freebsd-bugs@FreeBSD.org
Subject:   [Bug 219399] System panics after several hours of 14-threads-compilation orgies using poudriere on AMD Ryzen...
Message-ID:  <bug-219399-8-KzodOyMRrb@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-219399-8@https.bugs.freebsd.org/bugzilla/>
References:  <bug-219399-8@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

--- Comment #263 from Lars Viklund <zao@zao.se> ---
Let me go on a tangent with my experience of Ryzen since April:

I've had both kinds of AMD problem with my Ryzen 1700 across two ASUS
motherboards, the PRIME B350M-A and the PRIME X370-PRO.

It may help clear up the discussion to note that there are indeed distinct
problems that AMD seems to address when you contact their RMA/support.

The first kind is the popular one, with segfaulting GCC builds. That was
resolved for me by going from a 1709PGT to a 1728SUS CPU via an RMA process,
in which the AMD person considered the stock cooler and case organization
sufficient.

The second kind is the one where the machine simply freezes after an hour or
more, not traceable to load patterns.

The symptoms for that have been described earlier in this bug: all activity
in the machine ceases. The NIC stops, the screen turns off, and there is no
trace of any panic on the display or over the network. The only things still
alive are the fans and the RGB lighting.

In my case, the only knob in firmware that had any effect was disabling SMT
completely (note, not restricting core pairs). This changed the machine from
hanging after sinking around 2.2T of data onto ZFS over 10 gigabit
networking (mlx4en) in about 1h20min, to being usable as a stable NAS for
weeks.
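
As a rough sanity check on those figures, the average ingest rate before the
hang can be worked out directly. This is only a sketch: it assumes "2.2T"
means either 2.2 TB (decimal) or 2.2 TiB (binary), since the comment does not
say which, and it treats "about 1h20min" as exactly 80 minutes. The helper
name `average_rate` is made up for illustration.

```python
def average_rate(total_bytes, seconds):
    """Return the average rate as (megabytes per second, gigabits per second)."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e9


DURATION_S = 80 * 60  # assumption: "about 1h20min" taken as exactly 80 minutes

# Assumption: the unit behind "2.2T" is not stated, so both readings are shown.
for label, total in (("2.2 TB", 2.2e12), ("2.2 TiB", 2.2 * 2**40)):
    mb_s, gbit_s = average_rate(total, DURATION_S)
    print(f"{label}: {mb_s:.0f} MB/s, {gbit_s:.2f} Gbit/s")
```

Under either reading the sustained average is roughly 3.7 to 4.0 Gbit/s, so
the hang occurred well below the 10 gigabit line rate.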

In my second interaction with AMD, they told me directly to disable all
forms of power saving, particularly C-states like C6. As my firmware doesn't
have that particular knob anymore, I could not fully comply and the machine
remains unstable in SMT configuration with FreeBSD. For SF's sake, no amount
of load levelling, poll frequency, voltage bumps, or other knobs has any
effect, to the degree that they make sense to tune. While horrible VRM
evidently may be one cause for problems, this seems to be a wider issue.

I should note that I'm running the commit that moves the top page of memory,
and have also modified it to have a larger gap, as others have done in this
bug, and the sigtramp tool reports that I seem to have applied it properly.
Instability remains.

Now, to the interesting part of the story... a bleeding edge Arch Linux
(kernel 4.13.3) is rock solid on the same hardware and same configuration.

The only aberration that it demonstrates is kernel log entries suggesting
that AMD's IOMMU may not quite be up to snuff, which I've ignored as the
machine seems to work and I only need the GPU for console.

> [Thu Oct  5 23:04:13 2017] nouveau 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x0000000000000000 flags=0x0000]
> [Thu Oct  5 23:04:13 2017] nouveau 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x0000000000000080 flags=0x0000]
> [Thu Oct  5 23:04:13 2017] nouveau 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x0000000000000180 flags=0x0000]
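
For anyone wanting to compare such entries across machines, a minimal sketch
of pulling the fields out of these lines follows. The regular expression is
modelled only on the three sample lines quoted above, so it is an assumption
about the format rather than a description of everything the kernel can emit;
the names `EVENT_RE` and `parse_events` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: pattern derived solely from the three sample lines in this
# comment; other AMD-Vi event types may be formatted differently.
EVENT_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+"
    r"(?P<driver>\S+)\s+"
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"AMD-Vi: Event logged \[(?P<event>\w+)\s+"
    r"domain=(?P<domain>0x[0-9a-f]+)\s+"
    r"address=(?P<address>0x[0-9a-f]+)\s+"
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_events(lines):
    """Return one dict per matching line, with hex fields converted to int."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)  # search() tolerates a leading "> " quote
        if match is None:
            continue
        fields = match.groupdict()
        for key in ("domain", "address", "flags"):
            fields[key] = int(fields[key], 16)
        events.append(fields)
    return events


SAMPLE = """\
> [Thu Oct  5 23:04:13 2017] nouveau 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x0000000000000000 flags=0x0000]
> [Thu Oct  5 23:04:13 2017] nouveau 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x0000000000000080 flags=0x0000]
> [Thu Oct  5 23:04:13 2017] nouveau 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x0000000000000180 flags=0x0000]
"""

events = parse_events(SAMPLE.splitlines())
faults_per_device = Counter(event["device"] for event in events)
print(faults_per_device)
print([hex(event["address"]) for event in events])
```

Grouping by PCI device this way makes it easy to see that all three faults
come from the same device (the nouveau-driven GPU at 0000:21:00.0) and hit
low addresses.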

Could it be an interaction with the NVIDIA card or other hardware in a
non-primary PCIe slot? Could it be that the IOMMU on these chipsets is so
broken that the OS needs workarounds, or needs to ignore it completely?

I've got no clue, but I offer this data and these points to ponder.

-- 
You are receiving this mail because:
You are the assignee for the bug.


