Date: Tue, 29 Mar 2005 23:08:30 -0600
From: Karl Denninger
To: "Matthew N. Dodd"
cc: freebsd-stable@freebsd.org
Subject: Re: DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE
Message-ID: <20050329230830.A3222@denninger.net>
In-Reply-To: <20050329233843.L328@sasami.jurai.net>; from Matthew N. Dodd on Tue, Mar 29, 2005 at 11:40:48PM -0500
References: <20050329200841.A772@denninger.net> <20050329233843.L328@sasami.jurai.net>

On Tue, Mar 29, 2005 at 11:40:48PM -0500, Matthew N.
Dodd wrote:
> On Tue, 29 Mar 2005, Karl Denninger wrote:
> > 1.42: When resubmitting a timed out request, reset donecount.
> > 1.41: Reset timeout when we are back from interrupt.
> > 1.40: Correct logical error, result was that retries wasn't always made but
> > failure reported instead.
> > 1.39: Do not retry on requests that have lost their device during reinit.
> >
> > This change is EXTREMELY DANGEROUS.
> >
> > This change needs to be backed out immediately until it can be determined
> > why a requeued request destabilizes the system.
>
> The changes in question are very small. Could you attempt to isolate
> which one is the cause?
>
> Thanks.

Pretty sure it's the requeue (e.g. 1.40 and 1.42); I attempted to put this
patch in the system back before it was MFC'd (when it originally showed up
in -HEAD) and it failed in exactly the same way.

The first time it created a LOT of head-scratching ("how come my serial
board has suddenly gone deaf?!") and it wasn't until the console stopped
responding that the light went on and I said "oh, so THAT's what that patch
really does!" :-> That got backed out FAST :-)

I believe the previous version of that file in -STABLE was 1.38, which has
the 'errors don't actually get retried' problem that results in immediate
detaches. I tried the update again because I noted the commit and figured
that the problem from my last attempt at including it had either been fixed
or that I had missed some dependency in my earlier attempt.

I have an open PR on the underlying problem (SATA drives on a number of
common configurations returning false errors and detaching when part of a
geom mirror), which I've marked as "serious". It's at

http://www.freebsd.org/cgi/query-pr.cgi?pr=77643

There is a comment attached to the PR from another user who has duplicated
the underlying problem.

Note that back on 3/2/05 I attempted to apply the 1.42 version of this file
to -STABLE and got the same failure, and added that fact to the PR.
I also reported it here. It appears that both reports were either missed or
ignored and this change was committed to -RELENG_5.

I'm not sure if I can cobble up a test machine with the right configuration
of hardware to go through each of the above changes in turn to see if I can
isolate which of the three it is, but I'll give it a shot over the next
couple of days. I'm 1 SATA disk short of what I need to do this in my
sandbox.

If I do not trigger the requeue all appears to be fine.

This is one that IMHO has to either be found and fixed or backed out for the
impending -RELEASE.

--
--
Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net My home on the net - links to everything I do!
http://scubaforum.org Your UNCENSORED place to talk about DIVING!
http://www.spamcuda.net SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
http://genesis3.blogspot.com Musings Of A Sentient Mind