From owner-freebsd-stable  Thu Dec 27 10:48:10 2001
Date: Thu, 27 Dec 2001 10:47:59 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200112271847.fBRIlxh52129@apollo.backplane.com>
To: Nils Holland <nils@tisys.org>
Cc: Søren Schmidt <sos@freebsd.dk>,
	Matthew Gilbert <agilbertm@earthlink.net>,
	freebsd-stable@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: 4.4-STABLE crashes - suspects new ata-driver over wd-drivers
References: <200112262355.fBQNtfK48250@apollo.backplane.com> <200112270945.fBR9j1e97273@freebsd.dk> <20011227163252.A151@tisys.org>

    This is great news!  I'm crossing my fingers and hoping that Nils can't
    reproduce the crash any more with Soren's fix.

    Just to let you all know, Nils has been working his ass off helping me
    track his crash down.  I've been pulling my hair out... I gave him patch
    after patch to test various conditions & panic if the nfs_node's hash list
    somehow got broken, and for the last week not a single one of those tests
    detected the problem prior to the panic.  The nfs_node's hash list
    was being corrupted seemingly out of nowhere.
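
    The kind of consistency test described above can be sketched roughly
    as follows.  This is not the patch that was actually sent to Nils: the
    names struct node, chain_insert_head() and chain_check() are
    hypothetical, and the layout only assumes a BSD LIST-style chain in
    which each element records the address of the pointer that points at
    it.  A kernel version would presumably call panic() where this sketch
    returns a position.

```c
#include <stddef.h>

/* Hypothetical stand-in for an element on a kernel hash chain. */
struct node {
    struct node *next;    /* next element, or NULL at the tail */
    struct node **prevp;  /* address of the pointer that points at us */
    int id;
};

/* Insert n at the head of the chain rooted at *headp. */
void
chain_insert_head(struct node **headp, struct node *n)
{
    n->next = *headp;
    if (n->next != NULL)
        n->next->prevp = &n->next;
    *headp = n;
    n->prevp = headp;
}

/*
 * Walk one chain and verify every back-link.  Returns 0 if the chain is
 * consistent, otherwise the 1-based position of the first bad element.
 * maxlen bounds the walk so that a chain corrupted into a loop is also
 * reported instead of spinning forever.
 */
int
chain_check(struct node **headp, int maxlen)
{
    struct node **pp = headp;   /* pointer that should reference n */
    struct node *n;
    int pos = 0;

    for (n = *headp; n != NULL; n = n->next) {
        pos++;
        if (pos > maxlen)       /* longer than expected: likely a loop */
            return pos;
        if (n->prevp != pp)     /* back-link does not match the walk */
            return pos;
        pp = &n->next;
    }
    return 0;
}
```

    With three nodes inserted and the back-pointer of the second one on
    the chain overwritten, chain_check() returns 2; on an intact chain it
    returns 0.  A check like this only notices damage at the moment it
    runs, which fits the behaviour above: every test passed right up to
    the panic.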

    The last two days I've had Nils use hardware watchpoints in DDB to
    try to track down what was modifying the memory location, with no
    success.  The watchpoint caught the (correct) write to the list
    head but then failed to catch the corrupting write prior to the system
    panicking.  A hardware watchpoint only traps stores made by the CPU,
    not DMA writes from a device, which is what makes me believe it is
    some sort of chipset issue.

    Another thing to note:  One of the really weird things about Nils'
    crashes is that the same memory location was getting corrupted every
    time, five times in a row (which made it possible to use a hardware
    watchpoint).  The corruption changed somewhat when he added the
    hardware watchpoint.  Another similar set of crashes in the
    vm_page_list (that other people report, including a number of machines
    at Yahoo) has a similar M.O.: IDE drive, medium/heavy activity.  But
    while the corrupted address always winds up in the (static) vm_page
    array, it tends to be slightly different each time.  I'm hoping that
    it winds up being the same or similar issue.  I'm not ruling out the
    possibility that chipsets other than the 686B have problems too.

    In any case, Nils' description makes a lot of sense.  I've asked him to
    continue testing his system to make sure that this particular crash cannot
    be reproduced, and I am crossing my fingers.

    I'm also wondering how applicable this patch might be with regard to
    forcing a 'safe' mode for other PCI chipsets, to allow us to test
    it on non-686B machines that have similar problems.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


:On Thu, Dec 27, 2001 at 10:45:01AM +0100, Søren Schmidt stood up and spoke:
:> 
:> OK, here goes the VIA 686b patch, it is hand cut out from the bulk patches
:> to go into 4.5 so beware :)
:
:Well, as Matt has said, I reported a crash that he's trying to debug. Since
:I have the 686b in my machine, I applied the patch. Ever since then I was
:not able to reproduce the crash again, although yesterday it was so easy
:that I could do it twice an hour ;-)
:
:Anyway, you (Soren) said that the right way to fix this is a BIOS update.
:Now, could it be that some mainboard manufacturers are incapable of
:handling this? I'm using the latest BIOS for my board, and according to
:http://www.chaintech.com.tw/DL/7xMB/7AJA0.HTM, this should already have
:been fixed in their BIOS release from 2001-04-23...
:
:Second interesting thing: I was using a UDMA66 drive on my 686b until a few
:weeks ago and never had any problems - the stuff Matt is looking at only
:started to appear a short while after I exchanged that drive for a UDMA100
:one. So it seems that the slower drive probably didn't produce a high
:enough PCI workload for anything to actually happen.
:
:This fix will probably also have some influence on a few other similar
:problems (I read Matt was working on many of them). In the end I hope that
:this fix - or a variation thereof - will actually go into 4.5.
:
:Greetings
:Nils
:
:-- 
:Nils Holland
