Date:      Fri, 14 Aug 2009 06:08:54 -0400
From:      Michael Powell <nightrecon@hotmail.com>
To:        freebsd-questions@freebsd.org
Subject:   Re: boot sector f*ed
Message-ID:  <h63d34$rqp$1@ger.gmane.org>
References:  <20090811173211.6FE4D106567B@hub.freebsd.org> <20090812193008.F19821@sola.nimnet.asn.au> <4A82A8D9.30406@videotron.ca> <20090812172704.GA27066@slackbox.xs4all.nl> <4A831DF7.9090506@videotron.ca> <20090812232810.GA37833@slackbox.xs4all.nl> <4A841AC2.1050809@videotron.ca> <20090814012551.H19821@sola.nimnet.asn.au>

Ian Smith wrote:
[snip]
> 
> Smells like flakey hardware .. intermittent, inexplicable glitches.  It
> might survive hours on one workload, minutes on another, no sense to it?
> 
>  > All that I am seeing is that there is either a problem with the bios
>  > (which I even reinstalled and that changed nothing in the functioning)
>  > or something is going on with the OS.
> 
> After you've thoroughly proven the hardware is AOK under sustained and
> varied pressure, then you can suspect software issues - which tend to be
> far more consistent and repeatable - but if the hardware's acting flakey
> then you likely won't see any consistency in software issues, which does
> seem to concur with your descriptions to date.
>

In my experience, hardware problems often show little pattern as to where 
and when in the usage of the machine they cause the box to flake. Hardware 
that malfunctions all the time is relatively easy to find. 

The intermittent is the bane of all troubleshooting. I hate the intermittent 
more than I hate anything. One pattern an intermittent will show is that, as 
the bad part gets worse, the period between flakes gets shorter, until at 
some point the part dies completely. Initially the period can be quite long, 
so proper troubleshooting is difficult: you can't troubleshoot during the 
'in between' when it's not malfunctioning.
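
A rough way to check for that shrinking-period pattern, if you have been 
noting down when each flake happens, is sketched below in Python. It is only 
an illustration: the function names and the sample times are made up, and 
comparing the average of the earlier gaps with the average of the later gaps 
is just one crude heuristic, not a proper statistical test.

```python
from statistics import mean

def failure_intervals(timestamps):
    """Return the gaps between consecutive failure times (same units as input)."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def intervals_shrinking(timestamps):
    """Crude heuristic: do the later gaps average shorter than the earlier ones?

    Returns None when there are too few failures to say anything.
    """
    gaps = failure_intervals(timestamps)
    if len(gaps) < 4:
        return None
    half = len(gaps) // 2
    return mean(gaps[half:]) < mean(gaps[:half])

# Hypothetical log: hours since the first flake was noticed.
crashes = [0, 40, 70, 90, 102, 108]
print(failure_intervals(crashes))    # gaps between flakes
print(intervals_shrinking(crashes))  # True would fit a part that is getting worse
```

A True result only says the flakes are getting closer together; it does not 
say which part is dying.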

I also have an 80/20 rule about hardware as to whether it is a hot or cold 
failure. The 80% part is that most hardware problems occur when very dense 
VLSI chips heat up. So a machine may not show any problem until it's been 
powered up for a while. The other 20% is the cold start. Turn the box on and 
there is immediately some kind of problem early on in the course of booting. 
Leave it powered on, walk away for 20 minutes to get a coffee, and reset it 
after it's had a chance to warm up and now it works fine the rest of the 
day. Both patterns are typical of hardware trouble.

A software error, on the other hand, most of the time shows itself as a well 
defined repeatable sequence of steps that cause the problem every time the 
sequence is executed. This can also usually be easily reproduced by others 
running the same, or similar enough, platform(s) by executing said sequence. 
This can get quite sticky as even the BIOS code is software! Buggy BIOS code 
reacting badly to the compiled boot loader binary, even though probably 
quite rare, is not totally outside the realm of possibility.
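
The repeatability test described above can be sketched in Python as 
follows. This is a hedged illustration only: run_trials and classify are 
hypothetical names, the command to run is whatever sequence you suspect, and 
the labels are hints in line with the generalizations here, not a diagnosis.

```python
import subprocess

def run_trials(command, trials=20):
    """Run the same command repeatedly; True means exit status 0 (passed)."""
    results = []
    for _ in range(trials):
        completed = subprocess.run(command, capture_output=True)
        results.append(completed.returncode == 0)
    return results

def classify(results):
    """Turn a list of pass/fail results into a rough hint about where to look."""
    failures = results.count(False)
    if failures == 0:
        return "no failure reproduced"
    if failures == len(results):
        return "fails every time: repeatable, leans software"
    return "fails intermittently: leans hardware, keep testing"

# Example with made-up results rather than a real command:
print(classify([True, False, True, True]))
```

To try it on a real sequence you would pass the command as a list, for 
example classify(run_trials(["make", "test"])), where "make test" is only a 
placeholder for whatever triggers your problem.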

Somewhere very near the root of the logic tree of troubleshooting you need 
to drive a wedge between hardware and software in a divide and conquer kind 
of way. Making an arbitrary assumption early on as to which side is the 
problem will blind the troubleshooter to avenues of 'hypothesize this and 
test that'. Assuming without proof that the hardware is 100% OK, so it must 
be a software problem, is a mistake, and vice versa.

And the test might be as simple as installing another OS such as a Linux 
distro or Windows to the box. If it is truly a hardware problem it may 
continue to malfunction and cause trouble no matter what the choice of OS. 
Or it may not, as sometimes hardware design flaws are compensated for with 
workarounds in drivers, thus hiding the flaw. It's the old 'have a <insert 
brand name> box with xyz hardware' that has a known problem, and the fix is 
to download and install <insert brand name> driver revision such and such 
from the OEM.

Since these kinds of things are not generally propagated far and wide, an OS 
such as FreeBSD may not be privy to such bad hardware details. Sometimes the 
developers do incorporate hacks for hardware. If you can accurately identify 
such a situation, the most likely way to get it fixed for the long run is to 
file a proper PR. If it is done well enough and catches the eye of a dev who 
is interested and actually possesses the piece of hardware, a workaround may 
get coded and become a part of FreeBSD.

Just a lot of generalizations here. As always, there is the YMMV clause. :-)
 
[snip]

-Mike




