Date: Mon, 4 Jan 2016 23:07:07 +0300
From: Slawa Olhovchenkov
To: shahzaibcb
Cc: freebsd-current@freebsd.org
Subject: Re: FreeBsd MCA Panic Crash !!
Message-ID: <20160104200707.GI70867@zxy.spb.ru>
References: <1451903649383-6064691.post@n5.nabble.com>
In-Reply-To: <1451903649383-6064691.post@n5.nabble.com>
List-Id: Discussions about the use of FreeBSD-current

On Mon, Jan 04, 2016 at 03:34:09AM -0700, shahzaibcb wrote:

> Hi,
>
> We've switched to FreeBSD recently to accommodate large video storage, as we
> are running a video streaming website. The job of the FreeBSD server is to
> transcode the uploaded videos using ffmpeg and serve them to users via the nginx
> webserver, but so far our experience with it is not very good.
> It crashes every 2-3 days and we're unable to track down the problem.
> The server specs are pretty high:
>
> Supermicro X5690 (12 cores, 24 threads - 2u)
> 96GB RAM
> 12x3TB RAID-10 (HBA-LSI9211)
>
> Here is the screenshot of the most recent crash:
>
> http://prntscr.com/9er3pk
>
> One thing worth mentioning: before going down there's no load on the server,
> and free RAM is usually around 12GB. We've tried the following
> solutions so far:
>
> - Updated FreeBSD OS
> - Replaced 800W PS with 900W
> - We've reduced CMOS from MAX(26x) to 18x as suggested in this post

Did you try replacing the CPU?

> http://unix.stackexchange.com/questions/60574/determining-cause-of-linux-kernel-panic
>
> The solution we've not tried so far is:
>
> - Disable mca using (hw.mca.enabled: 0) - As we're getting MCA panics.
>
> Here is the crash dump:
>
> [root@cw001 /var/crash]# mcelog --no-dmi --ascii --file core.txt.1
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 3 BANK 5
> MISC 0 ADDR 802bf6a69
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 3 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 2 BANK 5
> MISC 0 ADDR 802bf6a69
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 2 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 3 BANK 5
> MISC 0 ADDR 802bf6a69
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 3 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 2 BANK 5
> MISC 0 ADDR 802bf6a69
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 2 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44
>
> -----------------------------------------------------------------------------------
>
> I showed those hardware errors to the vendor from whom we purchased the Supermicro
> servers. This is what he had to say:
>
> -----------------------------------
> Why don't you set up a test environment with CentOS or another Linux that
> you know how to use, and see if you get the same errors? If not, then you know
> that the errors come from the OS, not from the hardware. (CentOS, Red Hat, etc.
> work differently from FreeBSD: FreeBSD works directly on the hardware, and if you
> don't have the right kernel settings the server can crash. CentOS, Red Hat, etc.
> don't work directly on the hardware, distribute the resource load better, and
> give you better control, so you can debug a situation more easily.)
> -----------------------------------
>
> Now we're stuck and unable to determine whether the issue is with FreeBSD
> or the hardware. We're thinking of disabling MCA in loader.conf, but people
> advise against it. If you guys can help us, it'd be very kind.
>
> --
> View this message in context: http://freebsd.1045724.n5.nabble.com/FreeBsd-MCA-Panic-Crash-tp6064691.html
> Sent from the freebsd-current mailing list archive at Nabble.com.
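
For reference, the flags mcelog prints in the dump above can be cross-checked by hand against the raw STATUS and MCGSTATUS values. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS and IA32_MCG_STATUS bit layout from the Intel SDM; the function and table names are made up for illustration and are not part of mcelog or FreeBSD.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM layout).
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # error overflow
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Low bits of IA32_MCG_STATUS, in bit order 0, 1, 2.
MCG_STATUS_FLAGS = ["RIPV", "EIPV", "MCIP"]


def decode_mci_status(status):
    """Split a raw MCi_STATUS value into flag names and error codes."""
    flags = [name for bit, name in MCI_STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,             # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def decode_mcg_status(mcg_status):
    """Return the names of the flags set in a raw MCG_STATUS value."""
    return [name for bit, name in enumerate(MCG_STATUS_FLAGS)
            if (mcg_status >> bit) & 1]


# Values copied from the dump: STATUS be00000000800400, MCGSTATUS 4.
decoded = decode_mci_status(0xBE00000000800400)
print(decoded["flags"])
print(hex(decoded["mca_error_code"]), hex(decoded["model_specific_code"]))
print(decode_mcg_status(0x4))
```

The set flags (UC, EN, MISCV, ADDRV, PCC) line up with the "Uncorrected error", "Error enabled", "MCi_MISC register valid", "MCi_ADDR register valid" and "Processor context corrupt" lines, and the low 16 bits (0x0400) are the code mcelog reports as "Internal Timer error".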