Date: Sun, 09 Sep 2018 00:27:18 +0000
From: bugzilla-noreply@freebsd.org
To: virtualization@FreeBSD.org
Subject: [Bug 225791] ena driver causing kernel panics on AWS EC2
Message-ID: <bug-225791-27103-0P0UYppwQY@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-225791-27103@https.bugs.freebsd.org/bugzilla/>
References: <bug-225791-27103@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

Leif Pedersen <leif@ofWilsonCreek.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |leif@ofWilsonCreek.com

--- Comment #18 from Leif Pedersen <leif@ofWilsonCreek.com> ---
(In reply to pete from comment #16)

I've been able to reproduce this repeatedly (but not predictably) on 11.2 on
an r4.large. Not to state the blindingly obvious, but smaller instances such
as t2.* aren't affected since they use xn instead of ena. It seems to be most
likely at times of high network IO, which again risks stating the
forehead-slappingly obvious. :)

Multiple times, the crash included the same back-trace shown in this bug.
However, at least once it panicked on a double-fault, which, if related,
suggests that the bug in ena could be incurring memory corruption. Now
granted, I only know of one incidence of a double-fault, so it could've been
running on a host with faulty RAM or something at the time. However, after
each panic, I'd stop/start the instance rather than reboot, to provoke it to
move to new hardware, so I'm not suggesting that the whole bug is merely from
faulty host hardware.

I might beg that the fix could be patched in 11.2, or at least included in
11.3 so it won't have to wait for 12. Otherwise, AWS users will find
themselves stuck on 11.1, and the approaching EOL of 11.1 will leave them
without security updates, which in turn makes this an indirect security
issue. However, I understand there are other considerations at play, and very
much appreciate the relentless work of the security team (not to mention the
work on AWS support and FreeBSD in general).

Probably too much detail: The particular case was our standby MySQL database
on an r4.large. It was stable on 11.1, and problematic after I upgraded it to
11.2 (with `freebsd-update upgrade`); after five or so crashes in a month, I
downgraded it back to 11.1 (again with `freebsd-update upgrade`), after which
it has been perfectly stable for a couple of weeks now. It's in master-master
replication with our production replica, and normally gets a fairly low but
steady stream of activity from the replication. However, we have several
nightly jobs that crank away on updating a model and cause a large volume of
traffic in the replication stream. I don't have proper metrics on bytes/sec,
so I don't have any idea whether it saturates the interface. It's enough that
replication falls behind for up to a few hours, but I wouldn't call our
system "huge" in terms of network traffic by any means.

The reason I included all that detail is to point out: (1) it seems to be a
regression between 11.1 and 11.2, (2) r4.* are for sure affected, and (3) it
may be that the problem is more likely to be triggered on moderate or bursty
network traffic with much task-switching between MySQL threads, compared to a
simple stream of a high speed file transfer, for example.

-Leif

-- 
You are receiving this mail because:
You are the assignee for the bug.