Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 25 Jan 2013 17:17:54 -0800
From:      Jeremy Chadwick <jdc@koitsu.org>
To:        Alexander Motin <mav@FreeBSD.org>
Cc:        freebsd-fs@freebsd.org, avg@freebsd.org
Subject:   Re: disk "flipped" - a known problem?
Message-ID:  <20130126011754.GA1806@icarus.home.lan>
In-Reply-To: <20130125213209.GA1858@icarus.home.lan>
References:  <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan> <20130125211232.GA3037@icarus.home.lan> <20130125212559.GA1772@icarus.home.lan> <20130125213209.GA1858@icarus.home.lan>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Jan 25, 2013 at 01:32:09PM -0800, Jeremy Chadwick wrote:
> On Fri, Jan 25, 2013 at 01:25:59PM -0800, Jeremy Chadwick wrote:
> > On Fri, Jan 25, 2013 at 01:12:32PM -0800, Jeremy Chadwick wrote:
> > > On Fri, Jan 25, 2013 at 12:36:19AM -0800, Jeremy Chadwick wrote:
> > > > On Tue, Jan 22, 2013 at 08:19:04PM +0200, Alexander Motin wrote:
> > > > > On 22.01.2013 00:16, Jeremy Chadwick wrote:
> > > > > > (Please keep me CC'd as I am not subscribed)
> > > > > > 
> > > > > > WRT this:
> > > > > > 
> > > > > > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html
> > > > > > 
> > > > > > I can reproduce the first problem 100% of the time on my home system
> > > > > > here.  I can provide hardware specs if needed, but the important part is
> > > > > > that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI
> > > > > > mode (and does not share an IRQ), hot-swap bays are in use, and I'm
> > > > > > using ahci.ko.
> > > > > > 
> > > > > > I also want to make this clear to Andriy: I'm not saying "there's a
> > > > > > problem with your disk".  In my case, I KNOW there's a problem with the
> > > > > > disk (that's the entire point to my tests! :-) ).
> > > > > > 
> > > > > > In my case the disk is a WD Raptor (150GB, circa 2006) that has a very
> > > > > > badly-designed firmware that goes completely catatonic when encountering
> > > > > > certain sector-level conditions.  That's not the problem though -- the
> > > > > > problem is with FreeBSD apparently getting confused as to the internal
> > > > > > state of its devices after a device falls off the bus and comes back.
> > > > > > Explanation:
> > > > > > 
> > > > > > 1. System powered off; disk is attached; system powered on, shows up as
> > > > > > ada5.  Can communicate with device in every way (the way I tend to test
> > > > > > simple I/O is to use "smartctl -a /dev/ada5").  This disk has no
> > > > > > filesystems or other "stuff" on it -- it's just a raw disk, so I believe
> > > > > > the g_wither_washer oddity does not apply in this situation.
> > > > > > 
> > > > > > 2. "dd if=/dev/zero of=/dev/ada5 bs=64k"
> > > > > > 
> > > > > > 3. Drive hits a bad sector which it cannot remap/deal with.  Drive
> > > > > > firmware design flaw results in drive becoming 100% stuck trying to
> > > > > > re-read the sector and work out internal decisions to do remapping or
> > > > > > not.  Drive audibly clicking during this time (not actuator arm being
> > > > > > reset to track 0 noise; some other mechanical issue).  Due to firmware
> > > > > > issue, drive remains in this state indefinitely.
> > > > > > 
> > > > > > 4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ)
> > > > > > errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5
> > > > > > times (kern.cam.da.retry_count+1).
> > > > > > 
> > > > > > 5. FreeBSD spits out similar messages you see; retries exhausted,
> > > > > > cam_periph_alloc error, and devfs claims device removal.
> > > > > > 
> > > > > > 6. Drive is still catatonic of course.  Only way to reset the drive is
> > > > > > to power-cycle it.  Drive removed from hot-swap bay, let sit for 20
> > > > > > seconds, then is reinserted.
> > > > > > 
> > > > > > 7. FreeBSD sees the disk reappear, shows up much like it did during #1,
> > > > > > except...
> > > > > > 
> > > > > > 8. "smartctl -a /dev/ada5" claims no such device or unknown device type
> > > > > > (I forget which).  "ls -l /dev/ada5" shows an entry.  "camcontrol
> > > > > > devlist" shows the disk on the bus, yet I/O does not work.  If I
> > > > > > remember right, re-attempting the dd command returns some error (I
> > > > > > forget which).
> > > > > > 
> > > > > > 9. "camcontrol rescan all" stalls for quite some time when trying to
> > > > > > communicate with entry 5, but eventually does return (I think with some
> > > > > > error).  camcontrol reset all" works without a hitch.  "camcontrol
> > > > > > devlist" during this time shows the same disk on ada5 (which to me means
> > > > > > ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning
> > > > > > I/O works at some level).
> > > > > > 
> > > > > > 10. System otherwise works fine, but the only way to bring back
> > > > > > usability of ada5 is to reboot ("shutdown -r now").
> > > > > > 
> > > > > > To me, this looks like FreeBSD at some layer within the kernel (or some
> > > > > > driver (I don't know which)) is internally confused about the true state
> > > > > > of things.
> > > > > > 
> > > > > > Alexander, do you have any ideas?
> > > > > > 
> > > > > > I can enable CAM debugging (I do use options CAMDEBUG so I can toggle
> > > > > > this with camcontrol) as well as take notes and do a full step-by-step
> > > > > > diagnosis (along with relevant kernel output seen during each phase) if
> > > > > > that would help you.  And I can test patches but not against -CURRENT
> > > > > > (will be a cold day in hell before I run that, sorry).
> > > > > 
> > > > > Command timeout itself is not a reason for AHCI driver to drop the disk,
> > > > > neither it is for CAM in case of payload requests. Disk can be dropped
> > > > > if controller report device absence detected by SATA PHY, or by errors
> > > > > during device reinitialization after reset by CAM SATA XPT.
> > > > 
> > > > I have some theories as to why this is happening and it relates to the
> > > > underlying design of the drive firmware and the drive controller used.
> > > > I could write some pseudo-code showing how I believe the drive behaves,
> > > > but it's really besides the point, as you point out below.
> > > > 
> > > > > What is interesting, is what exactly goes on after disk got stuck and
> > > > > you have removed it. In normal case controller should immediately report
> > > > > PHY status change, driver should run PHY reset and see that link is
> > > > > lost. It should trigger bus rescan for CAM, that should invalidate
> > > > > device. That should make dd abort with error. After dd gone, device
> > > > > should be destroyed and ready for reattachment.
> > > > 
> > > > Yup, that sounds exactly like what should happen.  I know that in
> > > > userland (dd) the command eventually does abort/fail with an error (I
> > > > believe I/O error or some other message), and that's good.  The device
> > > > disappearing can also be confirmed.  It's after the drive is
> > > > power-cycled (to bring it back online) where its re-tasted and I/O (at
> > > > the kernel level) works, but now userland utilities interfacing with
> > > > /dev/ada5 insist "unknown device" or "no such device".  It's easier to
> > > > show than it is to explain.  My theory is that there is some kind of
> > > > internal (kernel-level) "state" that is not being reset correctly when a
> > > > device is lost and then brought back.
> > > > 
> > > > > So it should be great if you start with the full verbose dmesg from the
> > > > > boot up to the moment when system becomes stable after disk removal. If
> > > > > it won't be enough, we can enable some more debugging with `camcontrol
> > > > > debug -IPXp BUS`, where BUS is the bus number from `camcontrol devlist`.
> > > > 
> > > > This is exactly what I needed; thank you!
> > > > 
> > > > I'll spend some time tomorrow collecting the data + documenting and will
> > > > provide the results once I've compiled them.  This will be more useful
> > > > than speculation on my part.
> > > 
> > > Finished.
> > > 
> > > http://jdc.koitsu.org/freebsd/ahci_cam_testing/
> > > 
> > > I was not able to get the cam_periph_alloc error message; I'll talk
> > > about that at the end of my mail in attempt to stay focused on what I
> > > did find / what I was able to reproduce.
> > > 
> > > Things start to get interesting around phase 23.
> > > 
> > > Phase 31 is where things are confirmed broken in some way ("no such file
> > > or directory" even though /dev/ada5 is there).
> > > 
> > > Direct I/O to /dev/ada5 still works (shown in phase 33), but smartctl
> > > ceases to work ***to that device only*** from then onward (e.g. smartctl
> > > ada0 works fine).  A reboot is needed to recover from this.
> > > 
> > > I'm aware that smartmontools uses xpt(4), and I think therein is where
> > > the issue is.  The only difference between the tests/phases is that I
> > > issued "camcontrol reset" and "camcontrol rescan" prior to the breakage.
> > > 
> > > Based on CAMDEBUG output in phase 36 it looks like xpt(4) is spinning on
> > > something internally and causing what I'm reporting.  I can reproduce it
> > > reliably at least.
> > > 
> > > Let me know what else I can do to help.  If I need to turn on CAMDEBUG
> > > and re-do some of the tests + provide full kernel/CAMDEBUG output during
> > > each phase, no problem, just let me know what you need.  I just hate
> > > risking interspersed kernel printf output...
> > 
> > I just realised that the numbering scheme I was using for the phases is
> > *completely* buggered.  I obviously made a manual typo at some point and
> > it just proliferated from there.
> > 
> > I'll see if I can figure out where my mistake was and clean it up.
> 
> Yeah, there's no way for me to easily work out what happened here.
> Sorry, I'm going to have to re-do these tests.  I'm still fairly certain
> the issue has to do with camcontrol reset or camcontrol rescan though,
> but I can't work out exactly where my typo was in the files.

Okay, I've figured out the exact, 100% reproducible condition that
causes the situation.  It took me a lot of tries and a digital pocket
recorder to take verbal notes (there are just too many things to look at
simultaneously), but I've figured it out.

I'm sorry for the verbosity, but it's necessary.

Assume the disk we're talking about is /dev/ada5.

1. Prior to any issues, we have this:

root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
crw-r-----  1 root  operator  0x8c Jan 25 16:41 /dev/ada5
crw-------  1 root  operator  0x75 Jan 25 16:35 /dev/pass5
crw-------  1 root  operator  0x51 Jan 25 16:35 /dev/xpt0

2. ada5 begins experiencing issues -- ATA commands (CDBs) submit do not
get a response (not going to discuss how/why that can happen).

3. These types of messages are seen on console (naturally the CDB and
request type will vary -- in this case it was because I was doing the dd
zero'ing, thus tickling the bad sector/naughty firmware on the drive):

Jan 25 16:29:28 icarus kernel: ahcich5: Timeout on slot 0 port 0
Jan 25 16:29:28 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017
Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset...
Jan 25 16:29:28 icarus kernel: ahcich5: SATA connect time=1000us status=00000113
Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset: device found
Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED.  ACB: 61 80 80 77 01 40 00 00 00 00 00 00
Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command

4. Any I/O submit to ada5 during this time blocks (this is normal).

5. **While this situation is happening**, something using xpt(4)
attempts to submit a CDB to the disk (ex. smartctl -a /dev/ada5).
This request also blocks (again, normal).

6. Physical device falls off bus, or CAM kicks the disk off the bus.
Doesn't matter which.  We see messages resembling this (boy am I tired
of this interspersed output problem):

Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): lost device
Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): lost device
Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): removing device entry
Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): passdevgonecb: devfs entry is gone

7. Standard I/O requests fail with errno=6 "Device not configured".
xpt(4) requests also fail with the same errno.

8. Device-wise, at this stage all we have is:

root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
crw-------  1 root  operator  0x51 Jan 25 16:35 /dev/xpt0

9. Device comes back online for whatever reason.  FreeBSD sees the disk,
blah blah blah:

Jan 25 16:30:16 icarus kernel: GEOM: new disk ada5
Jan 25 16:30:16 icarus kernel: ada5: <WDC WD1500ADFD-00NLR4 21.07QR4> ATA-7 SATA 1.x device
Jan 25 16:30:16 icarus kernel: ada5: Serial Number WD-WMAP41573589
Jan 25 16:30:16 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
Jan 25 16:30:16 icarus kernel: ada5: Command Queueing enabled
Jan 25 16:30:16 icarus kernel: ada5: 143089MB (293046768 512 byte sectors: 16H 63S/T 16383C)
Jan 25 16:30:16 icarus kernel: ada5: Previously was known as ad14

...um, where's pass5?

10. /dev/pass5 is now completely (permanently) missing:

root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
crw-r-----  1 root  operator  0x99 Jan 25 16:42 /dev/ada5
crw-------  1 root  operator  0x51 Jan 25 16:35 /dev/xpt0

11. Any further attempts to communicate via xpt(4) with ada5 fail.
Detaching and reattaching the disk does not fix the issue; the only fix
is to reboot the system.

12. "camcontrol debug -IPXp scbus5" results in tons and tons of output
all pertaining to xpt(4).  It looks like xpt(4) is in some kind of
loop.

Below is my verbose boot (with non-kernel things removed), which
also includes "camcontrol debug" output once things are in a bad state:

http://jdc.koitsu.org/freebsd/xpt_oddity.log

In this log you'll see that after 1 CAM timeout I yanked the drive, then
roughly 30 seconds later reinserted it.

If you need me to turn on CAM debugging *prior* to the above, I can do
that, just let me know.

The important step is #5.  Without that, the problem shown in #9/10/11
does not happen.

It's a good thing I don't run smartd(8) -- most users I see using that
software set the interval to something like 180s or 60s.  Imagine this
frustration: "okay so the disk fell off the bus, but what, now I can't
talk to it with SMART?  Uhhh... <reboots>  Err, works now?  Whatever".

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130126011754.GA1806>