Message-ID: <410D853F.6080704@cryptography.com>
Date: Sun, 01 Aug 2004 17:05:19 -0700
From: Nate Lawson
To: Brian Fundakowski Feldman
cc: current@freebsd.org
cc: sos@deepcore.dk
Subject: Re: memory corruption/panic solved ("FAILURE - ATAPI_IDENTIFY no interrupt")
In-Reply-To: <20040731064433.GD33220@green.homeunix.org>

Brian Fundakowski Feldman wrote:
> On Fri, Jul 30, 2004 at 03:48:52PM -0700, Nate Lawson wrote:
>>I've tracked down the source of the memory corruption in -current that
>>results when booting with various CD and DVD drives (especially the ones
>>that come with Thinkpads including T23, R32, T41, etc.) The panic is
>>obvious when running with INVARIANTS ("memory modified after free") but
>>not so obvious in other configurations. For instance, without
>>INVARIANTS, part of the rt_info structure is corrupted on my wireless
>>card, resulting in a panic during ifconfig on boot. This is likely the
>>source of other problems, including phk's ACPI panic (again, only
>>triggered when booting with the CD drive in the bay.)
>>
>>The root problem is that ata_timeout() fires and calls ata_pio_read()
>>which overwrites 512 bytes random memory. There are actually two bugs
>>here that overwrite memory. The code path is as follows:
>
> Good job identifying it more exactly. I decided it should just fundamentally
> be using GEOM primitives everywhere to move the solutions to all these
> side cases into where they're already handled generically... still think
> that's probably the right solution, but I'm glad to see this specific
> problem fixed.

I'm not sure if this is a troll or not but I'll answer it seriously.
GEOM and other upper layers are never the right place to handle error
recovery for transactions initiated at the lower layers (like this
device scan). In every system I've seen, error recovery is the hardest
part of storage code to get right and is seldom well-tested. It's a
very difficult problem that involves a lot of careful fault
injection/testing. Divergence in hardware fault handling behavior only
complicates things.

-Nate