To: freebsd-questions@freebsd.org
From: Michael Powell <nightrecon@hotmail.com>
Subject: Re: FreeBSD Crashes Intermittently !!
Date: Fri, 11 Mar 2016 14:04:44 -0500

shahzaib shahzaib wrote:

> Hi,
> 
> I am new to this mailing list so please pardon me for any mistakes. We've
> started using FreeBSD from past 4-5 months and facing auto-reboot crash
> issue since the beginning. Following are the servers specs :
> 
> Supermicro X5690 (12 cores, 24 threads - 2u)
> 96GB RAM
> 12x3TB mirror+stripping (HBA-LSI9211)
> X8DT3 Board
> 
> We've total of 5 supermicro servers built upon same hardware and all of
> them intermittently goes down and sometimes they crash and boot up
> automatically (within 6min) and sometimes they gets freeze and we've to
> manually boot them via IPMI interface. All the time we get 'MCA Internal
> Timer Error' in crash logs. Here is the recent one :
> 
> http://pastebin.com/042SJ11c
> 
> Once we reported this issue to our hardware vendor he said that its due to
> FreeBSD incompatibility with hardware and suggested us to try installing
> Linux on one of them and so did we proceeded with Debian on one of them
> them but all in vain and server was still crashing. Once we reported him
> about his failed proposal he then said that it could be related to
> application which is causing this crash.

He is just trying his best to point the finger somewhere else, anywhere else. 
The bottom line is that he doesn't want to let you return the machines and 
have to refund your money. If the hardware has the same problems under 
different operating systems, something is wrong with the hardware. 
 
> Now if he really is right then RAM should first swapped out to its full in
> order to make OS crash but never did that happened, we've never been out
> of Memory as 96GB RAM is pretty high. We've also took some precaution to
> debug this issue :
[snip]

This "let's blame it on an application" will never produce positive results 
if the problem is truly hardware related.

One long-standing and well-known situation is that poorly engineered hardware 
usually gets "fixed" in the Wintel world by patching workarounds into 
driver code. This just hides the problem from the user. So in a situation 
like this you will find the machine magically doesn't crash when running 
Windows on it, but since these magic-bullet "fixes" do not map directly into 
the Unix world it takes a lot more developer effort to achieve a similar 
repair. When you see this effect, most of it is in (but not necessarily 
limited to) driver code. 

> 
> Now i am confused if application really can crash server without swapping
> it out ? Could there be any php function which could make a crash :-| . Is
> FreeBSD is the cause of crash ? Things are pretty blurred right now :(.

If it also crashes with Debian why would you want to blame FreeBSD?

> Here is the Kernel tuning values :
> 
> http://pastebin.com/nEnxkV6y
> 
[snip]

My own personal recommendation is to simplify things. I usually start 
by loading the "default" BIOS CMOS settings. Usually there are two choices: 
one is a bare-bones default and the other is an "optimized defaults" set. I 
start with the "optimized" choice, as it is still very generic. 

I would remove all customizations and go back to a pristine OS install with 
no tunings, even to the point of disabling the LSI controller and running 
the box on just a SATA drive or two. This gets the driver for the LSI 
controller out of the way.

But let me back up the train first. The first thing I'd do after setting BIOS 
to defaults is disable the HPET. The second thing I'd do is disable the 
NUMA-aware OS setting. The third thing I'd do is take the ipmi module load 
out of loader.conf. Also disable the entire USB subsystem at some point in 
the experimentation to rule out FreeBSD's USB subsystem, etc. 

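Basically, remove and strip things down until the problem goes away. Then you 
have a smaller pool of possible subsystems causing the problem. The HPET 
is best used in a Windows environment for synchronizing multimedia. If 
you disable it in BIOS only to find that FreeBSD tries to utilize it anyway, 
you can tell the kernel to use the TSC as its timecounter instead. As far as 
I know this is a sysctl rather than a loader tunable, so it goes in 
/etc/sysctl.conf (or can be set at runtime with sysctl) rather than in 
loader.conf:

kern.timecounter.hardware=TSC

Note that this only selects a different timecounter; it does not disable the 
HPET device itself.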

I mean, in the *Nix world do you really want microsecond-resolution time 
stamps on all logging, just spinning CPU cycles and wasting performance? I 
suspect *Nix systems run better without the HPET. That's my opinion (I don't 
have benchmarks to prove it).
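
The exact name of the TSC timecounter can vary (on some machines it shows up 
as TSC-low rather than TSC), so it is worth checking what the kernel offers 
before setting anything. Here is a minimal sh sketch, assuming the usual 
kern.timecounter.choice output of space-separated name(quality) pairs. 
tsc_name is a hypothetical helper made up for this example, not a FreeBSD 
tool, and the script only prints a suggested line without changing anything:

```sh
#!/bin/sh
# tsc_name: hypothetical helper. Given a kern.timecounter.choice string such as
# "TSC-low(1000) ACPI-fast(900) HPET(950)", print the first TSC-based name.
tsc_name() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^\(TSC[^(]*\)(.*/\1/p' | head -n 1
}

# Ask the running kernel which timecounters it offers (empty if sysctl fails).
choice=$(sysctl -n kern.timecounter.choice 2>/dev/null || true)
name=$(tsc_name "$choice")

if [ -n "$name" ]; then
    # Print the line to put in /etc/sysctl.conf; nothing is changed here.
    echo "kern.timecounter.hardware=$name"
else
    echo "no TSC timecounter offered; leaving the kernel's choice alone" >&2
fi
```

Running "sysctl kern.timecounter.hardware" afterwards shows which timecounter 
is actually in use.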

In my experience software bugs usually narrow down to a very specific 
sequence of steps that can reliably reproduce the problem. Hardware 
problems, on the other hand, can show little or no pattern whatsoever 
(totally erratic). And the intermittent hardware failure is the absolute 
worst, because you can only really troubleshoot during the period the 
intermittent is showing. If you get an intermittent that is essentially 
instantaneous, it happens and is gone. Very frustrating, as this usually 
reboots the box with little or no info left behind to go on.

However, I'd like to also point out that 5 machines all doing the same thing 
is not likely to be an intermittent. IMHO this correlates to a hardware 
compatibility situation. My first thoughts there are almost always the memory 
subsystem. If the RAM has not had a proper engineering validation I'd look 
at the list from SuperMicro and try and obtain something that has been 
validated. Should this magically make the problem disappear, then the vendor 
you bought the hardware from is putting RAM into boxen that does NOT have a 
guarantee that it will work. I've seen machines that would behave fine during 
normal operation but that would reboot only when running a make buildworld, 
as this pushes the RAM quite a bit harder than regular day to day stuff.

Just my $0.02, a few food-for-thought type things. It's nice to have the time 
to be able to drill down and discover a solution. It's rewarding. But in the 
real world, as soon as I saw Debian produce the same situation I'd be on the 
phone arranging an RMA. If I had a little more time I might try Windows to 
see if it somehow "Just Works". The data point there would be that it hints 
at the possibility that Wintel is releasing driver workarounds to cover up 
poor hardware design. But really, I generally don't have this kind of time, 
and in order to meet deadlines I sometimes have to go with a Plan B even if 
I don't like it.


-Mike