Message-Id: <200002182019.NAA29729@ home.gtcs.com>
Date: Fri, 18 Feb 2000 13:19:32 -0700 (MST)
From: Bruce Gingery
Reply-To: bgingery@gtcs.com
Subject: Re: Recommended addition to FAQ (Troubleshooting)
To: "Jordan K. Hubbard"
Cc: freebsd-doc@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
In-Reply-To: <79456.950901038@zippy.cdrom.com>

On 18 Feb 2000, Jordan K. Hubbard issued forth a missive of
approximately 3647 bytes, entitled
"Re: Recommended addition to FAQ (Troubleshooting)",
the text of which in full or in part is quoted here:

-} The situation here, I hate to say, is that you were simply very lucky
-} in having a software memory tester show you anything at all.
-}
-} If your experience had been more typical, you would have run memtest86
-} and it would have declared your memory to be free of errors.  Then
-} you'd have gone right on having problems and losing more hair until
-} you finally just came back to the memory and swapped it out on
-} suspicion.  Bingo, the problems are suddenly fixed and you're dragging
-} memtest86 to KDE's trashcan with a resolve to never trust it again.

Hmmm, maybe TkDesk's trashcan :)  I use Fvwm2.

-} The reason why software memory testers are so generally ineffectual is
-} that there's a whole bunch of things getting in their way.
-} Leaving aside for the moment the nasty problem of having your memory
-} checker loaded into the bad memory in question, the cache also
-} seriously gets in your way (and I'll bet you never even thought to
-} turn both L1 and L2 caches off, did you? :-)

[munch]

As I understand it, I was getting a report of page faults in the kernel
on that machine.  I KNEW something was wrong, and was quite sure it
wasn't the build of FreeBSD :)  With the machine nearly a thousand
miles from me, it was difficult to be sure that I had a good sense of
_everything_ that was going on.

This is the second time I've had bad RAM since starting to use FreeBSD
(and once, previously, while using NeXTstep) and had difficulty in
pinning it down.  On the previous occasion I tried another memory
checker (DOS-based at the time), which revealed NOTHING, and that's
why I was a little quicker to jump to the assumption that it probably
WAS bad RAM this time, rather than bad diskettes etc.

In simple point of fact, you're right.  I did NOT manually disable
cache in the BIOS, because it never got that far.  memtest86 runs a
series of tests with and without cache...  That machine has one bank
of 64M and all internal cache, so chip switching wasn't an easy
solution, ESPECIALLY at this distance.

[quote memtest86 docs]

  Test 0   [Address test, walking ones, no cache]
  Test 1   [Moving Inv, ones&zeros, cached]
  Test 2   [Modulo 20, ones&zeros, cached]
  Test 3   [Address test, own address, no cache]
  Test 4   [Moving inv, 8 bit pat, cached]
  Test 5   [Moving inv, 32 bit pat, cached]
  Test 6   [Moving inv, ones&zeros, no cache]
  Test 7   [Modulo 20+, ones&zeros, cached]
  Test 8   [Moving inv, 8 bit pat, no cache]
  Test 9   [Modulo 20, 8 bit, cached]
  Test 10  [Moving inv, 32 bit pat, no cache]

(AND)

  ...test memory using longer refresh rates.  This makes it possible
  to detect marginal errors that otherwise would go undetected with
  the normal refresh rate.
  Three refresh rates are available, the normal rate of 15ms, an
  extended refresh rate of 150ms and an extra long rate of 500ms.
  The default refresh rate is used for test 0 and tests 1 - 7 use an
  extended rate of 150ms.  The extended tests (8 - 10) use the extra
  long refresh rate of 500ms.  The refresh rate may be overridden at
  any time via online configuration commands.

[end quote]

I FULLY understand your hesitancy, and yes, RAM passing all of these
tests is not a sure indication that the RAM is good, but it's a sure
problem when it FAILS reproducibly.

With my daughter's machine, the errors started showing on test 4, and
from what I can tell, about 1/3 of the way up the 64M.  A failure at a
different point, of course, might have given different symptoms, and a
failure of a different type might indeed NOT have been detected by
memtest86.  Also, not all of us are as happy with fingers inside the
case as on the keys or pointer, however long we've been at it. :)

The reason I was suggesting the second program be run at boot (before
multiuser startup) was not to give a false sense of security, but
rather one more possible safety margin.

I can completely understand your hesitance, and the reasons for it.  I
do wonder, though, if there isn't some way to make it clear to people
that this is a precaution that may be "better than nothing", even if
it is not a 100% effective safeguard against hardware malfunctions.

It's EXTREMELY difficult to "bootstrap" a hardware test by using the
same hardware that is questionable.  I don't really think it could be
100% effective.

OTOH, I really DO understand your position on this.  I just argued the
opposite direction about the Stacheldraht/TFN2K checker for Solaris
and Linux which has been put up on the FBI's NIPC site, and with much
of the same reasoning you display about this RAM test software.
That "signature checker", which is documented as tested on 3 versions
of Solaris and 2 RedHat distributions but NOT able to handle a.out or
COFF binaries, is (perhaps) worse than nothing at all, because of the
false sense of security it might convey.

Of course, one additional thing against it is that it's a binary-only
distribution that they say _must_be_run_as_root?!  (Apparently it does
direct memory access to loaded code in addition to scanning stored
binaries for a recognizable compiled signature?)

Am I the only one seeing something windowsey about that?  Who gets to
make the recurring profits this time for band-aid solutions to a
spurting artery?

    http://www.fbi.gov/nipc/trinoo.htm

So I'll leave it up to you.  There should be info, at least in a FAQ
somewhere, indicating that bad RAM is not something that can be ruled
out until tested adequately, and perhaps a checklist of symptoms that
(virtually) ALWAYS indicate bad RAM, or at least should make it
suspect.

Bruce Gingery
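
P.S.  For anyone wondering what the "Moving inv" entries in the quoted
test list actually do, here is a toy sketch of the idea as I understand
it: fill memory with a pattern, walk upward checking the pattern and
writing its complement, then walk downward checking the complement and
restoring the pattern.  This is NOT memtest86's code.  It runs against
a simulated array rather than physical RAM, and the names FaultyRam and
moving_inversions, the 32-bit word size and the stuck-at-0 fault model
are all my own assumptions for illustration.  A real tester has to run
without an OS and control the cache itself, which is exactly the hard
part discussed above.

```python
WORD_MASK = 0xFFFFFFFF  # assumption: model memory as 32-bit words


class FaultyRam:
    """Simulated memory; optionally one bit at one address is stuck at 0."""

    def __init__(self, size, stuck_addr=None, stuck_bit=0):
        self.cells = [0] * size
        self.stuck_addr = stuck_addr
        self.stuck_mask = WORD_MASK & ~(1 << stuck_bit)

    def __len__(self):
        return len(self.cells)

    def write(self, addr, value):
        value &= WORD_MASK
        if addr == self.stuck_addr:
            value &= self.stuck_mask  # the faulty bit never holds a 1
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def moving_inversions(ram, pattern):
    """Return a sorted list of addresses that failed a read-back check."""
    pattern &= WORD_MASK
    inverse = ~pattern & WORD_MASK
    bad = set()
    n = len(ram)

    # Pass 1: fill memory with the pattern.
    for addr in range(n):
        ram.write(addr, pattern)

    # Pass 2: ascending; check the pattern, then write its complement.
    for addr in range(n):
        if ram.read(addr) != pattern:
            bad.add(addr)
        ram.write(addr, inverse)

    # Pass 3: descending; check the complement, then restore the pattern.
    for addr in reversed(range(n)):
        if ram.read(addr) != inverse:
            bad.add(addr)
        ram.write(addr, pattern)

    return sorted(bad)
```

Calling moving_inversions(FaultyRam(64), 0) on the fault-free model
reports nothing, while FaultyRam(64, stuck_addr=21, stuck_bit=3) gets
address 21 flagged.  The catch Jordan describes still applies: a clean
run only says this pattern found nothing, not that the RAM is good.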