Date:      Sun, 1 Aug 2004 20:15:45 -0400
From:      Brian Fundakowski Feldman <green@freebsd.org>
To:        Nate Lawson <nate@cryptography.com>
Cc:        sos@deepcore.dk
Subject:   Re: memory corruption/panic solved ("FAILURE - ATAPI_IDENTIFY no interrupt")
Message-ID:  <20040802001545.GA91621@green.homeunix.org>
In-Reply-To: <410D853F.6080704@cryptography.com>
References:  <410AD054.8070202@root.org> <20040731064433.GD33220@green.homeunix.org> <410D853F.6080704@cryptography.com>

On Sun, Aug 01, 2004 at 05:05:19PM -0700, Nate Lawson wrote:
> Brian Fundakowski Feldman wrote:
> >On Fri, Jul 30, 2004 at 03:48:52PM -0700, Nate Lawson wrote:
> >>I've tracked down the source of the memory corruption in -current that
> >>results when booting with various CD and DVD drives (especially the ones
> >>that come with Thinkpads including T23, R32, T41, etc.)  The panic is
> >>obvious when running with INVARIANTS ("memory modified after free") but
> >>not so obvious in other configurations.  For instance, without
> >>INVARIANTS, part of the rt_info structure is corrupted on my wireless
> >>card, resulting in a panic during ifconfig on boot.  This is likely the
> >>source of other problems, including phk's ACPI panic (again, only
> >>triggered when booting with the CD drive in the bay.)
> >>
> >>The root problem is that ata_timeout() fires and calls ata_pio_read()
> >>which overwrites 512 bytes of random memory.  There are actually two bugs
> >>here that overwrite memory.  The code path is as follows:
> >
> >Good job identifying it more exactly.  I decided it should just
> >fundamentally be using GEOM primitives everywhere to move the solutions
> >to all these side cases into where they're already handled
> >generically... still think that's probably the right solution, but I'm
> >glad to see this specific problem fixed.
> 
> I'm not sure if this is a troll or not but I'll answer it seriously.
> GEOM and other upper layers are never the right place to handle error
> recovery for transactions initiated at the lower layers (like this 
> device scan).
> 
> In every system I've seen, error recovery is the hardest part of storage
> code to get right and is seldom well-tested.  It's a very difficult
> problem that involves a lot of careful fault injection/testing.
> Divergence in hardware fault handling behavior only complicates things.

What would make it a troll?  If GEOM were used so that all transactions
were centralized, and there were one timeout mechanism used to run the
request queues for ATA, it wouldn't be racing and crashing when a device
reset occurs (and it would be a net reduction in code).

-- 
Brian Fundakowski Feldman                           \'[ FreeBSD ]''''''''''\
  <> green@FreeBSD.org                               \  The Power to Serve! \
 Opinions expressed are my own.                       \,,,,,,,,,,,,,,,,,,,,,,\


