Date: Mon, 21 Jan 2013 14:16:17 -0800
From: Jeremy Chadwick
To: freebsd-fs@freebsd.org
Subject: Re: disk "flipped" - a known problem?
Message-ID: <20130121221617.GA23909@icarus.home.lan>
Cc: mav@freebsd.org, avg@freebsd.org

(Please keep me CC'd as I am not subscribed)

WRT this:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html

I can reproduce the first problem 100% of the time on my home system
here. I can provide hardware specs if needed, but the important part is
that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI
mode (and does not share an IRQ), hot-swap bays are in use, and I'm
using ahci.ko.

I also want to make this clear to Andriy: I'm not saying "there's a
problem with your disk". In my case, I KNOW there's a problem with the
disk (that's the entire point of my tests! :-) ). The disk is a WD
Raptor (150GB, circa 2006) with very badly designed firmware that goes
completely catatonic when encountering certain sector-level conditions.
That's not the problem though -- the problem is with FreeBSD apparently
getting confused as to the internal state of its devices after a device
falls off the bus and comes back.
Explanation:

1. System powered off; disk is attached; system powered on, disk shows
   up as ada5. Can communicate with device in every way (the way I tend
   to test simple I/O is to use "smartctl -a /dev/ada5"). This disk has
   no filesystems or other "stuff" on it -- it's just a raw disk, so I
   believe the g_wither_washer oddity does not apply in this situation.

2. "dd if=/dev/zero of=/dev/ada5 bs=64k"

3. Drive hits a bad sector which it cannot remap/deal with. Drive
   firmware design flaw results in drive becoming 100% stuck trying to
   re-read the sector and decide internally whether to remap it. Drive
   audibly clicking during this time (not the noise of the actuator arm
   being reset to track 0; some other mechanical issue). Due to the
   firmware issue, drive remains in this state indefinitely.

4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using
   NCQ) errors every 30 seconds (kern.cam.ada.default_timeout), for a
   total of 5 times (kern.cam.ada.retry_count+1).

5. FreeBSD spits out messages similar to the ones you see; retries
   exhausted, cam_periph_alloc error, and devfs claims device removal.

6. Drive is still catatonic of course. Only way to reset the drive is
   to power-cycle it. Drive removed from hot-swap bay, left to sit for
   20 seconds, then reinserted.

7. FreeBSD sees the disk reappear, shows up much like it did during #1,
   except...

8. "smartctl -a /dev/ada5" claims no such device or unknown device type
   (I forget which). "ls -l /dev/ada5" shows an entry. "camcontrol
   devlist" shows the disk on the bus, yet I/O does not work. If I
   remember right, re-attempting the dd command returns some error (I
   forget which).

9. "camcontrol rescan all" stalls for quite some time when trying to
   communicate with entry 5, but eventually does return (I think with
   some error). "camcontrol reset all" works without a hitch.
   "camcontrol devlist" during this time shows the same disk on ada5
   (which to me means ATA IDENTIFY data, i.e. vendor strings, etc., is
   reobtained somehow, meaning I/O works at some level).
10. System otherwise works fine, but the only way to bring back
    usability of ada5 is to reboot ("shutdown -r now").

To me, this looks like FreeBSD at some layer within the kernel (or some
driver (I don't know which)) is internally confused about the true state
of things.

Alexander, do you have any ideas?

I can enable CAM debugging (I do use options CAMDEBUG so I can toggle
this with camcontrol) as well as take notes and do a full step-by-step
diagnosis (along with relevant kernel output seen during each phase) if
that would help you. And I can test patches but not against -CURRENT
(will be a cold day in hell before I run that, sorry).

Let me know, time permitting. :-)

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |