From owner-freebsd-doc  Fri Feb 18 12:20:32 2000
Delivered-To: freebsd-doc@freebsd.org
Received: from home.gtcs.com (home.gtcs.com [209.181.16.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id C478B37BA6B; Fri, 18 Feb 2000 12:20:24 -0800 (PST)
	(envelope-from bruce@gtcs.com)
Received: from gtcs.com (localhost [127.0.0.1])
	by  home.gtcs.com (8.8.8/8.8.8) with ESMTP id NAA29729;
	Fri, 18 Feb 2000 13:19:37 -0700 (MST)
	(envelope-from bruce@gtcs.com)
Message-Id: <200002182019.NAA29729@ home.gtcs.com>
Date: Fri, 18 Feb 2000 13:19:32 -0700 (MST)
From: Bruce Gingery <bgingery@gtcs.com>
Reply-To: bgingery@gtcs.com
Subject: Re: Recommended addition to FAQ (Troubleshooting) 
To: "Jordan K. Hubbard" <jkh@zippy.cdrom.com>
Cc: freebsd-doc@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
In-Reply-To: <79456.950901038@zippy.cdrom.com>
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=US-ASCII
Sender: owner-freebsd-doc@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On 18 Feb 2000, Jordan K. Hubbard <jkh@zippy.cdrom.com> 
issued forth a missive of approximately 3647 bytes,
entitled "Re: Recommended addition to FAQ (Troubleshooting) ",
the text of which in full or in part is quoted here:

-} The situation here, I hate to say, is that you were simply very lucky
-} in having a software memory tester show you anything at all.
-} 
-} If your experience had been more typical, you would have run memtest86
-} and it would have declared your memory to be free of errors.  Then
-} you'd have gone right on having problems and losing more hair until
-} you finally just came back to the memory and swapped it out on
-} suspicion.  Bingo, the problems are suddenly fixed and you're dragging
-} memtest86 to KDE's trashcan with a resolve to never trust it again.

    Hmmm, maybe TkDesk's trashcan  :)   I use Fvwm2.

-} The reason why software memory testers are so generally ineffectual is
-} that there's a whole bunch of things getting in their way.  Leaving
-} aside for the moment the nasty problem of having your memory checker
-} loaded into the bad memory in question, the cache also seriously gets
-} in your way (and I'll bet you never even thought to turn both L1 and
-} L2 caches off, did you? :-)
[munch]

   As I understand it, I was getting a report of page faults
   in kernel on that machine.  I KNEW something was wrong, and was
   quite sure it wasn't the build of FreeBSD :)  With the machine nearly
   a thousand miles from me, it was difficult to tell for sure that
   I got a good sense of _everything_ that was going on.

   This is the second time I've had bad RAM since starting to use
   FreeBSD (and once before that, while using NeXTstep) and had
   difficulty pinning it down.  On the previous occasion I tried
   another memory checker (DOS based, at the time), which revealed
   NOTHING.  That's why I was a little quicker this time to assume
   that it probably WAS bad RAM, rather than bad diskettes etc.

   In simple point of fact, you're right.  I did NOT manually disable
   cache in the BIOS, because it never got that far.  memtest86 itself
   runs a series of tests with and without cache...  That machine has
   one bank of 64M and all internal cache, so chip swapping wasn't
   an easy solution, ESPECIALLY at this distance.

[quote memtest86 docs]
Test  0 [Address test, walking ones, no cache] 
Test  1 [Moving Inv, ones&zeros, cached] 
Test  2 [Modulo 20, ones&zeros, cached] 
Test  3 [Address test, own address, no cache] 
Test  4 [Moving inv, 8 bit pat, cached] 
Test  5 [Moving inv, 32 bit pat, cached] 
Test  6 [Moving inv, ones&zeros, no cache] 
Test  7 [Modulo 20+, ones&zeros, cached] 
Test  8 [Moving inv, 8 bit pat, no cache] 
Test  9 [Modulo 20, 8 bit, cached] 
Test 10 [Moving inv, 32 bit pat, no cache] 

(AND) ...test memory using longer refresh rates. This makes it possible
to detect marginal errors that otherwise would go undetected with the 
normal refresh rate.  Three refresh rates are available, the normal 
rate of 15ms, an extended refresh rate of 150ms and an extra long rate 
of 500ms. The default refresh rate is used for test 0 and tests 1 - 7 
use an extended rate of 150ms. The extended tests (8 - 10) use the extra 
long refresh rate of 500ms. The refresh rate may be overridden at any 
time via online configuration commands. 
[end quote]
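
    To make the pattern names in that list a little more concrete,
    here is a rough sketch of two of them (an own-address test and
    a moving-inversions sweep) run against a SIMULATED memory with
    one stuck bit.  This is not memtest86's code: the FaultyMemory
    class, the function names, and the fault position are all made
    up for illustration, and nothing here models the cache or the
    refresh rate.

```python
class FaultyMemory:
    """Simulated array of 32-bit words; optionally one cell has a bit stuck at 0."""

    def __init__(self, words, bad_index=None, stuck_bit=0):
        self.cells = [0] * words
        self.bad_index = bad_index              # hypothetical faulty cell
        self.stuck_mask = ~(1 << stuck_bit) & 0xFFFFFFFF

    def write(self, i, value):
        value &= 0xFFFFFFFF
        if i == self.bad_index:
            value &= self.stuck_mask            # the faulty bit never holds a 1
        self.cells[i] = value

    def read(self, i):
        return self.cells[i]

    def __len__(self):
        return len(self.cells)


def own_address_test(mem):
    """Write each cell's own index into it, then read everything back."""
    for i in range(len(mem)):
        mem.write(i, i)
    return [i for i in range(len(mem)) if mem.read(i) != i]


def moving_inversions(mem, pattern):
    """Fill with a pattern, then sweep up and down, checking and inverting."""
    inverse = ~pattern & 0xFFFFFFFF
    errors = set()
    for i in range(len(mem)):
        mem.write(i, pattern)
    # Upward sweep: expect the pattern, leave the inverse behind.
    for i in range(len(mem)):
        if mem.read(i) != pattern:
            errors.add(i)
        mem.write(i, inverse)
    # Downward sweep: expect the inverse, restore the pattern.
    for i in reversed(range(len(mem))):
        if mem.read(i) != inverse:
            errors.add(i)
        mem.write(i, pattern)
    return sorted(errors)


# Cell 341 has bit 1 stuck at 0.  Index 341 does not have bit 1 set,
# so the own-address test passes, while moving inversions catches it.
bad = FaultyMemory(1024, bad_index=341, stuck_bit=1)
missed = own_address_test(bad)
caught = moving_inversions(bad, 0x00000000)
```

    The same fault slips past one pattern and is caught by another,
    which is the sense in which a passing test proves little while a
    failing one is still worth having.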

    While I FULLY understand your hesitancy, and yes, even if
    RAM passes all of these tests that is not a sure indication
    that the RAM is good, it IS a sure problem when it FAILS
    reproducibly.  With my daughter's machine, the errors started
    showing on test 4 and, from what I can tell, about 1/3 of the
    way up the 64M.  A failure at a different point, of course,
    might have given different symptoms, and a failure of a different
    type might indeed NOT have been detected by memtest86.
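
    For what "reproducibly" and "about 1/3 of the way up" amount to,
    a small sketch follows.  The helper names are hypothetical, the
    per-pass failure lists are assumed to come from whatever tester
    is in use, and 64M is taken to mean 64 * 1024 * 1024 bytes.

```python
MIB = 1024 * 1024


def approx_address(total_bytes, numerator, denominator):
    """Byte offset that lies numerator/denominator of the way up memory."""
    return total_bytes * numerator // denominator


def reproducible_failures(passes):
    """Addresses that failed in EVERY pass; each pass is an iterable of addresses."""
    sets = [set(p) for p in passes]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Roughly where "1/3 of the way up the 64M" falls.
one_third = approx_address(64 * MIB, 1, 3)
one_third_hex = hex(one_third)
```

    An address that shows up in every pass points at the RAM itself;
    one that appears once and never again is much harder to pin down.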

    Also, not all of us are as happy with fingers inside the case as
    on the keys or pointer, however long we've been at it.  :)

    The reason I was suggesting the second program be run at
    boot (before multiuser startup) was not to give a false
    sense of security, but rather one more possible safety
    margin.

    I can completely understand your hesitance, and the reasons
    for it.  I do wonder though, if there isn't some way to 
    make it clear to people that this is a precaution, that
    may be "better than nothing", even if it is not a 100%
    effective safeguard against hardware malfunctions.  It's
    EXTREMELY difficult to "bootstrap" a hardware test, by using
    the same hardware that is questionable.  I don't really think
    it could be 100% effective.

    OTOH, I really DO understand your position on this.  I just took
    the opposite side myself in an argument about the Stacheldraht/TFN2K
    checker for Solaris and Linux which has been put up on the FBI's
    NIPC site, using much the same reasoning you display about this
    RAM test software.

    That "signature checker", which is documented as tested on 3
    versions of Solaris and 2 RedHat distributions, but NOT able to
    handle a.out or COFF binaries, is (perhaps) worse than nothing at
    all, because of the false sense of security it might convey.  Of
    course, one additional thing against it is that it's a binary-only
    distribution that they say _must_be_run_as_root?! (Apparently
    it does direct memory access to loaded code in addition to
    scanning stored binaries for a recognizable compiled signature?)
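
    To illustrate why a fixed compiled signature can mislead, here is
    a minimal sketch of a naive byte-signature scan.  It is NOT how
    the NIPC tool works (that is binary-only, so I can't say); the
    signature bytes and the fake "binaries" are invented for the
    example.

```python
def find_signature(data: bytes, signature: bytes) -> int:
    """Return the offset of the signature in data, or -1 if it is absent."""
    return data.find(signature)


# Hypothetical signature and a fake binary image that contains it.
SIGNATURE = b"\xde\xad\xbe\xef-hypothetical"
original = b"\x7fELF" + b"\x00" * 16 + SIGNATURE + b"\x00" * 16

# A variant that differs by one byte inside the signature, standing in
# for a rebuild with other options or on another platform.
variant = original.replace(b"hypothetical", b"hypothetica1")

hit = find_signature(original, SIGNATURE)
miss = find_signature(variant, SIGNATURE)
```

    The variant comes back "clean" even though it is functionally the
    same thing, and that clean result is the false sense of security.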

    Am I the only one seeing something Windows-y about that?  Who gets
    to make the recurring profits this time for band-aid solutions to
    a spurting artery?
    
	http://www.fbi.gov/nipc/trinoo.htm

    So I'll leave it up to you.  There should at least be info in
    a FAQ somewhere that indicates that bad RAM cannot be ruled out
    until it has been adequately tested, and perhaps a checklist of
    symptoms that (virtually) ALWAYS indicate bad RAM, or at least
    should make it suspect.

	Bruce Gingery	<bgingery@gtcs.com>

