Date:      Sun, 28 Jun 1998 23:11:54 -0700 (PDT)
From:      Tom <tom@uniserve.com>
To:        "Louis A. Mamakos" <louie@TransSys.COM>
Cc:        "Michael R. Gile" <gilem@wsg.net>, freebsd-stable@FreeBSD.ORG
Subject:   Re: determining ecc errors on freebsd-stable 
Message-ID:  <Pine.BSF.3.96.980628230424.23093A-100000@shell.uniserve.ca>
In-Reply-To: <199806290549.BAA02456@whizzo.transsys.com>


On Mon, 29 Jun 1998, Louis A. Mamakos wrote:

> > On Sun, 28 Jun 1998, Michael R. Gile wrote:
> > 
> > > >  There is no way to log ECC corrections as they are done
> > > >transparently in the hardware, and currently there is no mechanism for the
> > > >hardware to make available that kind of info.
> > > 
> > > there must be some status register that records these errors.  Otherwise what 
> > > good is ECC?  If it doesn't tell you that something is wrong, then it is useless 
> > 
> >   Either ECC fixes the error, or if the error is unfixable, the hardware
> > generates a NMI which will cause a panic and reboot.
> > 
> >   Basically, if a fixable error occurs, you won't know about it.  If an
> > unfixable error occurs, you'll know real fast.
> 
> Well, geez, it would be nice to know that you had bum memory in the
> machine so you could replace it at some time of your choosing.  ECC 
> memory ought to be better than just having your system crash later
> rather than sooner.

  Well, you could trap the NMI, kill whatever occupied the offending
location, and make sure it wasn't used again.  This is an operating
system issue, not a hardware one.
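
  To make that concrete, here is a rough sketch of the policy as a toy
model in Python.  It is not FreeBSD kernel code: FrameTable,
handle_uncorrectable_error and the KERNEL pseudo-owner are names made up
for illustration, and a real handler would have to work from whatever
address the chipset reports, if it reports one at all.

```python
# Toy model only: FrameTable, handle_uncorrectable_error and KERNEL are
# hypothetical names for illustration, not FreeBSD kernel interfaces.
from dataclasses import dataclass, field

KERNEL = 0  # pseudo-owner id standing in for kernel memory


@dataclass
class FrameTable:
    owner: dict = field(default_factory=dict)   # frame number -> owning pid
    retired: set = field(default_factory=set)   # frames never to be reused
    free: set = field(default_factory=set)      # frames available to allocate

    def allocate(self, pid):
        """Hand out the lowest-numbered free frame to pid."""
        if not self.free:
            raise MemoryError("no free frames")
        frame = min(self.free)
        self.free.remove(frame)
        self.owner[frame] = pid
        return frame

    def kill(self, pid):
        """Release every frame held by pid back to the free pool."""
        for frame in [f for f, p in self.owner.items() if p == pid]:
            del self.owner[frame]
            self.free.add(frame)

    def handle_uncorrectable_error(self, frame):
        """Return 'panic', 'killed <pid>' or 'retired' for a bad frame."""
        pid = self.owner.get(frame)
        if pid == KERNEL:
            return "panic"           # kernel memory lost: reboot regardless
        self.retired.add(frame)      # make sure the frame is not used again
        if pid is not None:
            self.kill(pid)           # kill whatever occupied the location
        self.free.discard(frame)     # the bad frame never returns to the pool
        return "retired" if pid is None else f"killed {pid}"
```

  For example, with table = FrameTable(free={0, 1, 2, 3}), a frame handed
to some process by table.allocate() and then passed to
table.handle_uncorrectable_error() ends up in table.retired, and the rest
of that process's frames go back into table.free.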

  An NMI panic is MUCH better than "crashing later", as you know precisely
what caused it.  Memory corruption on non-ECC/non-parity systems is very
difficult to track.  Plus, you could be corrupting valuable data in the
process.  With existing ECC systems, at least you get a clean reboot
before anything serious is wrecked.

> This is the kind of thing that separates toy computers from robust, 
> has-to-be-up-no-matter-what, mission-critical computers.  

  Yeah, yeah... Sun makes a big deal about this... fact of the matter is,
if you lose some memory containing the kernel you have to reboot anyhow.

  If you don't want a toy computer, you get a cluster anyhow, since there
is way more stuff that can fail than memory (and more often too).

> louie

Tom


