From: jd1008
Date: Sun, 26 Apr 2015 18:35:30 -0600
To: freebsd-questions@freebsd.org
Subject: Re: Debugging bad memory problems
Message-ID: <553D8452.9050601@gmail.com>

On 04/26/2015 06:02 PM, Mehmet Erol Sanliturk wrote:
> On Sun, Apr 26, 2015 at 3:15 PM, Valeri Galtsev wrote:
>
>> On Sun, April 26, 2015 4:05 pm, Fernando Apesteguía wrote:
>>> On Sun, Apr 26, 2015 at 10:05 PM, Valeri Galtsev wrote:
>>>> On Sun, April 26, 2015 12:11 pm, Fernando Apesteguía wrote:
>>>>> Hi,
>>>>>
>>>>> I suspect my old and beloved AMD64 laptop is suffering from bad memory
>>>>> problems: I get random crashes of well tested programs like sh, which,
>>>>> etc even when I executed some of them from /rescue.
>>>> If RAM is a suspect the first thing I would do is re-seat memory
>>>> modules. Open the box. (Observe static precautions!) Remove memory
>>>> modules. Install them again.
>>>>
>>>> Do memtest86 (by booting into memtest86, you can have that in your boot
>>>> options, or you can boot off external media as others suggested).
>>>>
>>>> If you still have problems: try to run with one memory module instead of
>>>> two. At some point when they went to higher RAM speeds the memory bus
>>>> amplifier became more fragile (some chips, some manufacturers; as now it
>>>> is part of the CPU, this may be true only about some of the CPU models).
>>>> You sometimes can slightly fry it if you merely leave the laptop running
>>>> on battery, letting the battery run down and the laptop powering off due
>>>> to that.
>>>> With some chips this may lead to slightly frying it: the memory
>>>> controller portion of it, the address bus amplifier in particular.
>>>> The bus amplifier becomes slightly lower frequency, which results in
>>>> poorer handling of capacitive load (which is larger if you have more
>>>> RAM), and it is marginally OK, occasionally having address errors.
>>>> Going to one module may resolve this. You will know if this is
>>>> likely the case if memtest86 is successful with each of the single
>>>> RAM modules, but fails (in random places, often not reproducible)
>>>> with both.
>>>>
>>>> Good luck!
>>> I booted from a memtest CD-ROM. It passed a couple of tests fine and
>>> then it rebooted while doing a "bit fade" test at around 93%. Removing
>>> the modules is tricky since this laptop has screws all around in dark
>>> corners (even removing the battery needs a screwdriver). I will try
>>> to limit physical memory with hw.physmem and see if it makes any
>>> difference.
>> The last will not help against what I mentioned, as capacitive load on
>> the memory address bus is defined by what is physically attached to it.
>>
>> One usually runs memtest86 for 24 hours at least. One loop will catch
>> "solid defects" like adjacent lines on the board being connected (while
>> they shouldn't be). Memory related failures, to the contrary, are often
>> intermittent. In the worst case I've seen, they only manifested under
>> intense load of the box (whereas memtest86 is equivalent to almost zero
>> load).
>>
>> Good luck!
>>
>> Valeri
>>
>> ++++++++++++++++++++++++++++++++++++++++
>> Valeri Galtsev
>> Sr System Administrator
>> Department of Astronomy and Astrophysics
>> Kavli Institute for Cosmological Physics
>> University of Chicago
>> Phone: 773-702-4247
>> ++++++++++++++++++++++++++++++++++++++++
>>
>
>
> Failure may be in memory management circuits instead of memory chips.
> To test this situation, the existing memory may be replaced by memory
> chips that are known to work (if it can be done).
>
>
> Thank you very much.
>
>
> Mehmet Erol Sanliturk

One slight, and perhaps remote, possibility is that the memory is a hair
slower than what the memory controller expects, especially, as Valeri
mentioned, under heavy memory load.
On systems where the CPU clocking is unlocked, one might be able to slow
down the CPU clock just slightly to see if the problem is mitigated.
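
Since Valeri points out that these failures may only show up under load,
while memtest86 itself is close to zero load, a rough userland pattern
check can be left running next to a heavy job (a port build, say). The
sketch below is only an illustration in Python. The names PATTERNS,
find_mismatches and pattern_test are made up for this example, and it is
not a substitute for memtest86: a userland process only sees whatever
virtual pages the kernel hands it, and it cannot pick physical addresses.

```python
# Hypothetical helper names; a sketch, not a replacement for memtest86.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)  # all zeros, all ones, alternating bits


def find_mismatches(buf, pattern, limit=16):
    """Return up to `limit` offsets in buf whose byte differs from pattern."""
    # Fast path: one bulk comparison against the expected contents.
    if buf == bytes([pattern]) * len(buf):
        return []
    # Slow path, taken only when something is wrong: locate the bad bytes.
    bad = []
    for offset, value in enumerate(buf):
        if value != pattern:
            bad.append(offset)
            if len(bad) >= limit:
                break
    return bad


def pattern_test(size_bytes, passes=1):
    """Fill a buffer with each pattern, read it back, and report mismatches.

    Returns a list of (pass_number, pattern, offset) tuples; an empty list
    means every byte read back as written.
    """
    buf = bytearray(size_bytes)
    failures = []
    for pass_number in range(passes):
        for pattern in PATTERNS:
            buf[:] = bytes([pattern]) * size_bytes  # write the pattern
            for offset in find_mismatches(buf, pattern):  # read it back
                failures.append((pass_number, pattern, offset))
    return failures
```

Calling something like pattern_test(256 * 1024 * 1024, passes=100) while
the box is busy exercises 256 MiB repeatedly. Any non-empty result is
suspicious, but an empty one proves very little, for the reasons above.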