From: Michael Powell
To: freebsd-questions@freebsd.org
Subject: Re: FreeBSD Crashes Intermittently !!
Date: Fri, 11 Mar 2016 14:04:44 -0500
Reply-To: nightrecon@hotmail.com

shahzaib shahzaib wrote:

> Hi,
>
> I am new to this mailing list so please pardon me for any mistakes.
> We've started using FreeBSD from past 4-5 months and facing auto-reboot
> crash issue since the beginning. Following are the servers specs :
>
> Supermicro X5690 (12 cores, 24 threads - 2u)
> 96GB RAM
> 12x3TB mirror+stripping (HBA-LSI9211)
> X8DT3 Board
>
> We've total of 5 supermicro servers built upon same hardware and all of
> them intermittently goes down and sometimes they crash and boot up
> automatically (within 6min) and sometimes they gets freeze and we've to
> manually boot them via IPMI interface. All the time we get 'MCA Internal
> Timer Error' in crash logs. Here is the recent one :
>
> http://pastebin.com/042SJ11c
>
> Once we reported this issue to our hardware vendor he said that its due to
> FreeBSD incompatibility with hardware and suggested us to try installing
> Linux on one of them and so did we proceeded with Debian on one of them
> them but all in vain and server was still crashing. Once we reported him
> about his failed proposal he then said that it could be related to
> application which is causing this crash.

He is just trying his best to point the finger somewhere else, anywhere
else. The bottom line is that he doesn't want to let you return the
machines and refund your money. If the hardware has the same problems under
different operating systems, something is wrong in the hardware.

> Now if he really is right then RAM should first swapped out to its full in
> order to make OS crash but never did that happened, we've never been out
> of Memory as 96GB RAM is pretty high. We've also took some precaution to
> debug this issue :
[snip]

This "let's blame it on an application" approach will never produce
positive results if the problem is truly hardware related. One
long-standing and well-known situation is that poorly engineered hardware
usually gets "fixed" in the Wintel world by patching workaround(s) into
driver code. This just hides the problem from the user.
So in a situation like this you will find the machine magically doesn't
crash when running Windows on it, but since these magic-bullet "fixes" do
not map directly into the Unix world it takes a lot more developer effort
to achieve a similar repair. When you see this effect, most of it is in
(but not necessarily limited to) driver code.

>
> Now i am confused if application really can crash server without swapping
> it out ? Could there be any php function which could make a crash :-| . Is
> FreeBSD is the cause of crash ? Things are pretty blurred right now :(.

If it also crashes with Debian, why would you want to blame FreeBSD?

> Here is the Kernel tuning values :
>
> http://pastebin.com/nEnxkV6y
>
[snip]

My own personal recommendation is to simplify things down. I usually start
by choosing the "default" BIOS CMOS settings. Usually there are two: one is
a bare-bones default and the other is an "optimized defaults". I always
start with the "optimized" choice as it is still very generic.

I would remove all customizations and reduce to a pristine OS install with
no tunings, even to the point of disabling the LSI controller and running
the box on just a SATA drive or two. This gets the driver for the LSI
controller out of the way.

But let me back up the train first. The first thing I'd do after setting
the BIOS to defaults is disable the HPET timer. The second thing I'd do is
disable the NUMA-aware OS setting. The third thing I'd do is take the ipmi
load line out of loader.conf. Also disable the entire USB subsystem at some
point in the experimentation to rule out FreeBSD's USB subsystem, etc.
Basically remove and strip things down until the problem goes away. Then
you have a smaller pool of possible subsystems causing the problem.

The HPET timer is best used in a Windows environment for synchronizing
multimedia.
If you disable it in the BIOS only to find that FreeBSD tries to utilize it
anyway, you can tell FreeBSD to use the TSC as its timecounter instead, in
loader.conf, with:

kern.timecounter.hardware=TSC

I mean, in the *Nix world do you really want microsecond-level time stamps
on all logging, just spinning CPU cycles and wasting performance? I suspect
*Nix systems run better without the HPET timer. That's my opinion (I don't
have benchmarks to prove it).

In my experience software bugs usually narrow down to a very specific
sequence of steps that can reliably reproduce the problem. Hardware
problems, on the other hand, can show little or no pattern whatsoever
(totally erratic). And the intermittent hardware failure is the absolute
worst, because you can only really troubleshoot during the period the
intermittent is showing. If you get an intermittent that is essentially
instantaneous, it happens and is gone. Very frustrating, as this usually
reboots the box with little or no info left behind to go on.

However, I'd also like to point out that 5 machines all doing the same
thing is not likely to be an intermittent. IMHO this correlates to a
hardware compatibility situation. My first thoughts there are almost always
the memory subsystem. If the RAM has not had a proper engineering
validation, I'd look at the list from SuperMicro and try to obtain
something that has been validated. Should this magically make the problem
disappear, then the vendor you bought the hardware from is putting RAM into
boxen that does NOT have a guarantee that it will work. I've seen machines
that would behave fine during normal operation but would reboot only when
running a make buildworld, as this pushes the RAM quite a bit harder than
regular day-to-day stuff.

Just a few $0.02 food-for-thought type things. It's nice to have the time
to be able to drill down and discover a solution. It's rewarding. But in
the real world, as soon as I saw Debian produce the same situation I'd be
on the phone to RMA.
If I had a little more time I might try Windows to see if it somehow "Just
Works". The data point here is that it would point to the possibility that
Wintel is releasing driver patch workaround(s) to cover up poor hardware
design. But really, I generally don't have this kind of time, and in order
to meet deadlines I sometimes have to go with a Plan B even if I don't like
it.

-Mike
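
P.S. For the "remove all customizations" step, here is a rough sketch of
how one might take stock of what is currently tuned before stripping it
out. The helper name list_tunings is made up for illustration, and the two
file paths are just the usual FreeBSD locations (an assumption about your
setup). The sysctl names at the end are the standard timecounter ones.
Treat it as a starting point, not a procedure.

```sh
#!/bin/sh
# Sketch only: inventory what has been customized before stripping a box down.

# list_tunings FILE (hypothetical helper): print the active lines of a
# config file, i.e. everything that is not blank and not a comment.
# Prints nothing if the file is missing or unreadable.
list_tunings() {
    [ -r "$1" ] || return 0
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Assumed locations of boot-time and runtime tunings on a FreeBSD install.
for f in /boot/loader.conf /etc/sysctl.conf; do
    echo "== $f"
    list_tunings "$f"
done

# On FreeBSD only: show which timecounters the kernel offers and which one
# is currently in use (HPET, TSC, ACPI-fast, ...).
if [ "$(uname -s)" = "FreeBSD" ]; then
    sysctl kern.timecounter.choice kern.timecounter.hardware
fi
```

Anything it prints from the two files is a candidate for commenting out
during the experiment. The timecounter can also be switched on a running
system with "sysctl kern.timecounter.hardware=TSC", which lets you try it
without a reboot.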