Date:      Thu, 31 Mar 2005 11:02:25 -0500
From:      Paul Mather <paul@gromit.dlib.vt.edu>
To:        Karl Denninger <karl@denninger.net>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE
Message-ID:  <1112284945.1048.3.camel@zappa.Chelsea-Ct.Org>
In-Reply-To: <20050330233018.B68235@denninger.net>
References:  <20050329200841.A772@denninger.net> <20050329230830.A3222@denninger.net> <20050329234318.A3883@denninger.net> <44027.128.222.32.10.1112202442.squirrel@mail.scadian.net> <424AF396.6010909@mykitchentable.net> <20050330233018.B68235@denninger.net>

On Wed, 2005-03-30 at 23:30 -0600, Karl Denninger wrote:

> BTW it's NOT your hardware at fault here - the same hardware that returns 
> these complaints for me on 5.x works perfectly with 4.11.  There have been 
> changes made to the ATA code that apparently interact VERY badly with 
> some controllers - particularly some very common SATA (SII chipset, used 
> on Adaptec and Bustek boards, among others) ones.  

It's not just a SATA problem.  I get the problem (though less
frequently than you seem to) on an Intel PIIX4 UDMA33 controller.
The problem occurs on two different systems (one Gateway, one Dell), and
only started happening some way through the 5.x life cycle, indicating
to me that a serious regression was introduced (in 5.2, I believe).  The
problem does not afflict 4.x.

> I don't know if GEOM/GMIRROR is truly involved here although that's the
> easiest way for me to provoke it - I suspect not - it's just that
> GEOM/GMIRROR produces an I/O load pattern that is conducive to the 
> breakage showing up.  Specifically, a "DD" from one or more disks does NOT
> fail - a mix of reads and writes and fairly significant load appears 
> necessary to cause trouble.  Of course installation produces a very nice
> load of that type....

On both systems that experience the problem, I am using some kind of
software mirroring.  On one I'm using geom_mirror, and on the other I'm
using geom_vinum.  Both suffer from the WRITE_DMA disconnect problem.
The Dell, using geom_mirror, is now running HEAD.  The Gateway running
RELENG_5 is annoying because when a drive becomes disconnected, the only
way right now to rebuild the plexes on the geom_vinum drive that is down
is to reboot the system.  (I've used "setstate" to flag the drive as up,
but then "gvinum start" of any down plex causes an immediate
panic/reboot.)

Ian Dowse posted a patch to the freebsd-current mailing list for the
WRITE_DMA issue
(http://lists.freebsd.org/mailman/htdig/freebsd-current/2005-February/046773.html).
According to Dowse, the patch "attempts to clean up the handling of
timeouts in the ATA code by using the new callout_init_mtx() function."
It was successful for me.  I still got the WRITE_DMA timeouts, but not
the disconnects.  I don't know if RELENG_5 has "the new
callout_init_mtx() function."  If it does, this patch might help there,
too.
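
For readers unfamiliar with why tying a timeout to a mutex matters, here
is a rough userland analogy in Python.  It is not Dowse's patch and not
the ATA code: the Request class, its state names, and the timer are all
hypothetical, made up purely for illustration.  The general idea behind
associating a mutex with a callout is that the timeout handler runs with
the same lock held as the completion path, so only one of the two can
act on a given request.

```python
import threading


class Request:
    """Hypothetical stand-in for an in-flight I/O request.

    This is an illustration only, not a FreeBSD kernel structure.
    """

    def __init__(self, timeout_s):
        self.lock = threading.Lock()
        self.state = "pending"  # pending -> done | timed_out
        # The timer plays the role of the timeout callout.
        self.timer = threading.Timer(timeout_s, self._on_timeout)
        self.timer.start()

    def _on_timeout(self):
        # The timeout handler takes the same lock as the completion
        # path, so it cannot act on a request that is being completed.
        with self.lock:
            if self.state == "pending":
                self.state = "timed_out"

    def complete(self):
        """Completion path; returns True only if it won the race."""
        with self.lock:
            if self.state != "pending":
                return False  # the timeout already claimed this request
            self.state = "done"
            self.timer.cancel()
            return True


if __name__ == "__main__":
    fast = Request(timeout_s=5.0)
    print(fast.complete(), fast.state)   # True done

    slow = Request(timeout_s=0.01)
    slow.timer.join()                    # wait for the timeout to fire
    print(slow.complete(), slow.state)   # False timed_out
```

Because both paths check and change the state under one lock, a request
ends up either completed or timed out, never both.  Whether that is the
whole story behind the disconnects is not something this sketch can
show.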

> I opened a PR on this quite some time ago - IMHO this sort of breakage
> should be considered a critical fault sufficient to stop a release until 
> it's completely resolved.  A workaround that stops the system from blowing up
> but leaves the pauses and errors isn't really a fix - I doubt anyone
> will consider that acceptable as a means of truly addressing the problem 
> (at least I hope not!)

I agree that it wouldn't be ideal, but having something that fixed just
the disconnects in the tree would be better than nothing at all.  It's a
pain to have to track third-party patches.

> I got "surprised" by this (in a bad way) and have been fighting 
> workarounds since 5.3 was deemed "production" quality.  Going back to 
> 4.x is possible for me, but highly undesirable for a number of reasons, not
> the least of which is the official FreeBSD posture on where work is and will
> be done on the OS down the road.

It's disappointing the way this problem appears to have been silently
ignored (except by those whom it afflicts), because it is a regression
that occurred during the 5.x lifecycle.  It's one thing to know that
your hardware won't work properly going from 4.x to 5.x, but another
thing to have it stop working going from one 5.x release to another.
(Or maybe it isn't, given the strange "Early Adopter" status of the
start of the 5.x release cycle.)

Anyway, I'm glad you are trying to keep this problem in the spotlight,
because an unreliable ATA subsystem is a miserable thing to have to
suffer. :-(

Cheers,

Paul.
-- 
e-mail: paul@gromit.dlib.vt.edu

"Without music to decorate it, time is just a bunch of boring production
 deadlines or dates by which bills must be paid."
        --- Frank Vincent Zappa


