Date:      Tue, 28 Aug 2001 12:11:02 +0200 (CEST)
From:      Oliver Fromme <olli@secnetix.de>
To:        freebsd-stable@FreeBSD.ORG
Subject:   Re: 4.4-rc instability
Message-ID:  <200108281011.MAA27326@lurza.secnetix.de>
In-Reply-To: <20010828111949.B95570@freebie.xs4all.nl>

Wilko Bulte <wkb@freebie.xs4all.nl> wrote:
 > On Tue, Aug 28, 2001 at 11:00:37AM +0200, Oliver Fromme wrote:
 > > Unfortunately, FreeBSD does not support ECC RAM (or did
 > > that change recently?).
 > 
 > ? ECC is a hardware function. All of my Alpha machines have ECC memory
 > (for example).

Sure, I once had a PC with ECC memory, too.

When that ECC stick started dying, at first I didn't notice
at all, because the chipset (i.e. the memory controller in
the northbridge, I think) corrected the errors silently.
When the errors grew so that they weren't ECC-correctable
anymore, processes started dying with SIGSEGV, and it got
worse at a fast pace.  Soon I couldn't even boot into
single-user mode anymore, because /bin/sh died with SIGSEGV
instantly.

At first I thought the processor had gone bad, or maybe the
mainboard itself (a Gigabyte dual P2 board, Intel 440BX
chipset).  I believed in ECC, so I did not suspect the RAM
at that time.  I had seen bad ECC memory in Sun Sparc
workstations running Solaris at the university, which
started logging "bad memory page, ECC error" or similar in
the syslog, and automatically disabled that particular page
if a certain number of errors had occurred on it.  That was
a very cool feature, I thought.

Finally I ripped my DIMM out and put it into a different
board (an MSI Athlon board with AMD chipset, i.e. different
design, different processor, different BIOS).  Guess what?
It failed in the same ways.  So it was indeed the fault of
the ECC memory.  I took it to a computer shop where a
hardware memory tester was available, which confirmed that
this DIMM had gone foobar.

Since then, I have not bought expensive ECC memory again,
but have instead preferred well-known brands (such as
Infineon).  They're less expensive, and I haven't had any
memory problems since.

So, the bottom line is, ECC memory is good as long as there
are few enough errors that they can be corrected by the
chipset.  If there are more of them, you're doomed just as
if you had no ECC in the first place.  At least that's my
experience with i386 P2/Athlon mainboards.  Alpha might be
a different story.
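
To illustrate why "few enough errors" matters, here is a
minimal sketch of a SECDED (single error correct, double
error detect) code.  It is a toy: it protects 4 data bits
with an 8-bit extended Hamming codeword, whereas PC ECC
DIMMs typically protect 64 data bits with 72.  The function
names (encode, decode, flip) are made up for this example
and say nothing about how any particular chipset works.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) & 1             # make the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    odd_parity = sum(code) & 1          # 1 means an odd number of bits flipped
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Assumed to be a single-bit error: the syndrome names the bad position
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped, cannot be fixed.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    damaged = list(code)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


word = encode(0b1011)
print(decode(flip(word, 5)))        # one bad bit: corrected, data intact
print(decode(flip(word, 2, 5)))     # two bad bits: detected, not corrected
print(decode(flip(word, 3, 5, 6)))  # three bad bits: "corrected" to wrong data
```

The last case is the nasty one: with three or more flipped
bits the decoder can believe it fixed a single-bit error
and hand back wrong data without any warning, which would
be consistent with the random crashes described above.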

Frankly, I expected the machine to halt or freeze with
something like an NMI or "parity check error", like the old
PCs with parity SIMMs did.  Would have been better than
just randomly dying.

Even better would be if the operating system recognized the
correctable errors and logged them somewhere, and (_even_
better!) offered the possibility to disable memory pages
with known errors.  Tru64 on Alpha supports exactly this.
Solaris on Sparc does, too.  FreeBSD does not.  That's what
I meant when I wrote that FreeBSD does not support ECC RAM.
(I'm sorry, I should have elaborated on this.  Please
excuse me.)
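
The kind of behaviour I mean could be sketched roughly like
this.  It is only an illustration of the idea (count the
corrected errors per page, log them, retire the page after
a threshold); the class name, the threshold of 3 and the
4096-byte page size are assumptions for the example, not
taken from Tru64, Solaris or any FreeBSD code.

```python
from collections import Counter

PAGE_SIZE = 4096          # assumed page size in bytes
RETIRE_THRESHOLD = 3      # assumed number of corrected errors before retiring


class EccErrorTracker:
    """Hypothetical bookkeeping for corrected ECC errors, per physical page."""

    def __init__(self, threshold=RETIRE_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()   # page number -> corrected errors seen
        self.retired = set()      # pages that should no longer be handed out
        self.log = []             # stand-in for syslog

    def report_correctable(self, phys_addr):
        """Record one corrected error; return True if the page is now retired."""
        page = phys_addr // PAGE_SIZE
        self.counts[page] += 1
        self.log.append(
            f"ECC: corrected error at {phys_addr:#x} "
            f"(page {page}, count {self.counts[page]})"
        )
        if self.counts[page] >= self.threshold and page not in self.retired:
            self.retired.add(page)
            self.log.append(f"ECC: retiring page {page}")
        return page in self.retired


tracker = EccErrorTracker()
for addr in (0x1000, 0x1004, 0x1FF8):   # three errors, all in page 1
    tracker.report_correctable(addr)
print("\n".join(tracker.log))
```

With something like this, a dying DIMM would show up in the
logs long before the errors become uncorrectable.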

I think I still have that broken DIMM somewhere in a
drawer, and I'm willing to send it to anyone who wants to
look at it and improve FreeBSD's handling of this (I
already offered this a few months ago, but got no reply).
On the other hand, this particular one is probably too
broken even to be useful for this kind of stuff.

Anyhow, that's my story about ECC memory.

Regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"All that we see or seem is just a dream within a dream" (E. A. Poe)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message



