Date: Thu, 27 Dec 2001 10:47:59 -0800 (PST)
From: Matthew Dillon
Message-Id: <200112271847.fBRIlxh52129@apollo.backplane.com>
To: Nils Holland
Cc: Søren Schmidt, Matthew Gilbert, freebsd-stable@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: 4.4-STABLE crashes - suspects new ata-driver over wd-drivers
References: <200112262355.fBQNtfK48250@apollo.backplane.com> <200112270945.fBR9j1e97273@freebsd.dk> <20011227163252.A151@tisys.org>

This is great news! I'm crossing my fingers and hoping that Nils can't
reproduce the crash any more with Soren's fix.

Just to let you all know, Nils has been working his ass off helping me
track his crash down. I've been pulling my hair out... I gave him patch
after patch to test various conditions and panic if the nfs_node's hash
list somehow got broken, and for the last week not a single one of those
tests detected the problem prior to the panic. The nfs_node's hash list
was being corrupted seemingly out of nowhere.

The last two days I've had Nils use hardware watchpoints in DDB to try
to track down what was modifying the memory location, with no success.
The watchpoint caught the (correct) write to the list head but then
failed to catch the corrupting write before the system panicked, which
is what makes me believe it is some sort of chipset issue.
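As an illustration of the kind of hash-list sanity check described above, here is a minimal userland sketch in C. It is not the actual patch that was sent to Nils: the struct node layout, the prevp back-link and the chain_check() name are all hypothetical stand-ins for the kernel's nfsnode hash chain, and a kernel version would call panic() where this one returns -1.

```c
#include <stddef.h>

/* Hypothetical stand-in for a hashed node; not the real struct nfsnode. */
struct node {
    struct node  *next;   /* next element in the hash chain */
    struct node **prevp;  /* address of the pointer that points at us */
    unsigned long key;    /* lookup key, unused by the check itself */
};

/*
 * Walk one hash chain and verify every back-link.
 *
 * Returns 0 if the chain is consistent, -1 at the first element whose
 * back-link does not match, or if more than max_len elements are seen
 * (which would suggest a loop).  Kernel code would panic() instead of
 * returning -1, so the corruption is caught as close to the damage as
 * possible.
 */
int
chain_check(struct node **head, size_t max_len)
{
    struct node **pp = head;   /* pointer that should reference n */
    struct node *n;
    size_t count = 0;

    for (n = *head; n != NULL; n = n->next) {
        if (n->prevp != pp)
            return -1;         /* back-link was overwritten */
        if (++count > max_len)
            return -1;         /* chain implausibly long */
        pp = &n->next;
    }
    return 0;
}
```

A check like this only finds damage when it next runs, so corruption that arrives between two checks, as seems to have happened here, still surfaces as a panic somewhere else.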
Another thing to note: one of the really weird things about Nils's
crashes is that the same memory location was getting corrupted every
time, five times in a row (which made it possible to use a hardware
watchpoint). The corruption changed somewhat when he added the hardware
watchpoint.

Another set of crashes, in the vm_page_list (reported by other people,
including on a number of machines at Yahoo), has a similar M.O.: IDE
drive, medium/heavy activity. But while the corrupted address always
winds up in the (static) vm_page array, it tends to be slightly
different each time. I'm hoping that it winds up being the same or a
similar issue. I'm not ruling out the possibility that chipsets other
than the 686B have problems too.

In any case, Nils's description makes a lot of sense. I've asked him to
continue testing his system to make sure that this particular crash
cannot be reproduced, and I am crossing my fingers. I'm also wondering
how applicable this patch might be for forcing a 'safe' mode on other
PCI chipsets, to allow us to test it on non-686B machines that have
similar problems.

-Matt
Matthew Dillon

:On Thu, Dec 27, 2001 at 10:45:01AM +0100, Søren Schmidt stood up and spoke:
:>
:> OK, here goes the VIA 686b patch, it is hand cut out from the bulk patches
:> to go into 4.5 so beware :)
:
:Well, as Matt has said, I reported a crash that he's trying to debug. Since
:I have the 686b in my machine, I applied the patch. Ever since then I was
:not able to reproduce the crash again, although yesterday it was so easy
:that I could do it twice an hour ;-)
:
:Anyway, you (Soren) said that the right way to fix this is a BIOS update.
:Now, could it be that some mainboard manufacturers are incapable of
:handling this? I'm using the latest BIOS for my board, and according to
:http://www.chaintech.com.tw/DL/7xMB/7AJA0.HTM, this should already have
:been fixed in their BIOS release from 2001-04-23...
:
:Second interesting thing: I was using a UDMA66 drive on my 686b until a few
:weeks ago and never had any problems - the stuff Matt is looking at only
:started to appear a short while after I exchanged that drive for a UDMA100
:one. So, it seems as if probably the slower drive didn't produce a high
:enough PCI workload for anything to actually happen.
:
:This fix will probably also have some influence on a few other similar
:problems (I read Matt was working on many of them). In the end I hope that
:this fix - or a variation thereof - will actually go into 4.5.
:
:Greetings
:Nils
:
:--
:Nils Holland