Date:      Sun, 09 Sep 2018 00:27:18 +0000
From:      bugzilla-noreply@freebsd.org
To:        virtualization@FreeBSD.org
Subject:   [Bug 225791] ena driver causing kernel panics on AWS EC2
Message-ID:  <bug-225791-27103-0P0UYppwQY@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-225791-27103@https.bugs.freebsd.org/bugzilla/>
References:  <bug-225791-27103@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

Leif Pedersen <leif@ofWilsonCreek.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |leif@ofWilsonCreek.com

--- Comment #18 from Leif Pedersen <leif@ofWilsonCreek.com> ---
(In reply to pete from comment #16)

I've been able to reproduce this repeatedly (but not predictably) on 11.2 on an
r4.large. Not to state the blindingly obvious, but smaller instances such as
t2.* aren't affected since they use xn instead of ena. It seems to be most
likely at times of high network IO, which again risks stating the
forehead-slappingly obvious. :)

Multiple times, the crash included the same back-trace shown in this bug.
However, at least once it panicked on a double-fault, which, if related,
suggests that the bug in ena could be causing memory corruption. Now granted,
I only know of one incident of a double-fault, so it could've been running on
a host with faulty RAM or something at the time. However, after each panic, I'd
stop/start the instance rather than reboot, to provoke it to move to new
hardware, so I'm not suggesting that the whole bug is merely from faulty host
hardware.

I might beg that the fix could be patched in 11.2, or at least included in 11.3
so it won't have to wait for 12. Otherwise, AWS users will find themselves
stuck on 11.1, and the approaching EOL of 11.1 will leave them without security
updates, which in turn makes this an indirect security issue. However, I
understand there are other considerations at play, and very much appreciate the
relentless work of the security team (not to mention the work on AWS support
and FreeBSD in general).

Probably too much detail: The particular case was our standby MySQL database on
an r4.large. It was stable on 11.1, and problematic after I upgraded it to 11.2
(with `freebsd-update upgrade`); after five or so crashes in a month, I
downgraded it back to 11.1 (again with `freebsd-update upgrade`), after which
it has been perfectly stable for a couple of weeks now. It's in master-master
replication with our production replica, and normally gets a fairly low but
steady stream of activity from the replication. However, we have several
nightly jobs that crank away on updating a model and cause a large volume of
traffic in the replication stream. I don't have proper metrics on bytes/sec, so
I don't have any idea whether it saturates the interface. It's enough that
replication falls behind for up to a few hours, but I wouldn't call our system
"huge" in terms of network traffic by any means.

The reason I included all that detail is to point out: (1) it seems to be a
regression between 11.1 and 11.2, (2) r4.* are for sure affected, and (3) it
may be that the problem is more likely to be triggered on moderate or bursty
network traffic with much task-switching between MySQL threads, compared to a
simple stream of a high-speed file transfer, for example.

-Leif

-- 
You are receiving this mail because:
You are the assignee for the bug.


