From owner-freebsd-current@FreeBSD.ORG Thu Dec 25 01:07:49 2003 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id BE81416A4CE; Thu, 25 Dec 2003 01:07:49 -0800 (PST) Received: from spider.deepcore.dk (cpe.atm2-0-53484.0x50a6c9a6.abnxx9.customer.tele.dk [80.166.201.166]) by mx1.FreeBSD.org (Postfix) with ESMTP id DBFAD43D46; Thu, 25 Dec 2003 01:07:46 -0800 (PST) (envelope-from sos@spider.deepcore.dk) Received: from spider.deepcore.dk (localhost [127.0.0.1]) by spider.deepcore.dk (8.12.10/8.12.10) with ESMTP id hBP97Atx031337; Thu, 25 Dec 2003 10:07:11 +0100 (CET) (envelope-from sos@spider.deepcore.dk) Received: (from sos@localhost) by spider.deepcore.dk (8.12.10/8.12.10/Submit) id hBP97AUG031336; Thu, 25 Dec 2003 10:07:10 +0100 (CET) (envelope-from sos) From: Soren Schmidt Message-Id: <200312250907.hBP97AUG031336@spider.deepcore.dk> In-Reply-To: <20031225145427.J16564@gamplex.bde.org> To: Bruce Evans Date: Thu, 25 Dec 2003 10:07:10 +0100 (CET) X-Mailer: ELM [version 2.4ME+ PL99f (25)] MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=ISO-8859-1 X-mail-scanned: by DeepCore Virus & Spam killer v1.3 X-Mailman-Approved-At: Thu, 25 Dec 2003 04:50:36 -0800 cc: current@FreeBSD.ORG cc: sos@FreeBSD.ORG Subject: Re: deadlock in ata_queue_request() X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Dec 2003 09:07:50 -0000 It seems Bruce Evans wrote: > ata_queue_request() sleeps in an interrupt handler here: Yes, I have a local fix to help this, the sleep was originally left in to make a backport to -stable easier (ie no mutexes), but this need to be changed here. I'll get it committed asap, but it is hollidays and the kids has alot of new toys :) > % void > % ata_queue_request(struct ata_request *request) > % { > % /* mark request as virgin (it might be a reused one) */ > % request->result = request->status = request->error = 0; > % request->flags &= ~ATA_R_DONE; > % > % /* put request on the locked queue at the specified location */ > % mtx_lock(&request->device->channel->queue_mtx); > % if (request->flags & ATA_R_AT_HEAD) > % TAILQ_INSERT_HEAD(&request->device->channel->ata_queue, request, chain); > % else > % TAILQ_INSERT_TAIL(&request->device->channel->ata_queue, request, chain); > % mtx_unlock(&request->device->channel->queue_mtx); > % > % /* should we skip start ? */ > % if (!(request->flags & ATA_R_SKIPSTART)) > % ata_start(request->device->channel); > % > % /* if this was a requeue op callback/sleep already setup */ > % if (request->flags & ATA_R_REQUEUE) > % return; > % > % /* if this is not a callback and we havn't seen DONE yet -> sleep */ > % if (!request->callback) { > % while (!(request->flags & ATA_R_DONE)) > % tsleep(request, PRIBIO, "atareq", hz/10); > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > % } > % } > > when it is called from an interrupt handler. It is called from an interrupt > handler as part of timeout processing: > > ... > msleep(...) > ata_queue_request(...) > ata_via_family_setmode(...) > ata_identify_devices(...) > ata_reinit(...) > ata_timeout(...) > softclock(...) > ithread_loop(...) > ... > > The timeout was called here shortly after ad2 hung: > > Dec 25 12:28:27 besplex kernel: ad2: TIMEOUT - READ_DMA retrying (2 retries left) > Dec 25 12:28:27 besplex kernel: ata1: resetting devices .. > Dec 25 12:28:27 besplex kernel: ad2: FAILURE - already active DMA on this device > Dec 25 12:28:27 besplex kernel: ad2: setting up DMA failed > > ATA_R_DONE was never set and wakeup_request() was never called either, so > softclock() was deadlocked and tsleep() never returned. > > The system ran surprisingly well with softclock() deadlocked. ad0 worked > and everything that didn't use timeouts worked. Examples of things that > didn't work because they use timeouts: > - syscons screen updates. > - statistics in top and systat. > - sleep 1 in shells. > - mbmon (shows the status, then never repeats). > > I tried the following to recover: > - call wakeup(request) using ddb. This worked, but ATA_R_DONE was never > set so ata_queue_request() just looped. > - also ignore the ATA_R_DONE check using ddb. This un-deadlocked > softclock(), but ata1 remained wedged. > - then call "atacontrol reinit 1". This partly worked: > > Dec 25 14:31:12 besplex kernel: ata1: resetting devices .. > Dec 25 14:31:44 besplex kernel: ad2: WARNING - removed from configuration > Dec 25 14:31:44 besplex kernel: done > > but the ata driver caused a null pointer panic an instant later. > > - ad2 didn't come back after a hard reset. > - ad2 came back after a power cycle. > > This was on an undermydesktop. Problems resuming on laptops may be similar. > The hardware may really be wedged. Then the software shouldn't make things > worse by sleeping or spinning in the timeout handler. > > Bruce > -Søren Yes I know it works under windows!!