Date:      Tue, 26 Apr 2011 14:41:04 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Conall O'Brien <conall@conall.net>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: Problems Terminating zpool scrub...
Message-ID:  <20110426214104.GA69929@icarus.home.lan>
In-Reply-To: <BANLkTikV262gQ-_GDiWQSe%2BFCDjPQxGOCA@mail.gmail.com>
References:  <BANLkTinYp674E=96PhMaR0%2BUy9e9B6boVA@mail.gmail.com> <BANLkTimQ4FWnC12O3cDtptJR%2BvA2PcNqYA@mail.gmail.com> <BANLkTikbPsf1d3p687RDVsaL_FO0KgKbfA@mail.gmail.com> <BANLkTi=Jban2q6h0HEpEMhWrfr56k1O_Jw@mail.gmail.com> <20110426134903.GA62578@icarus.home.lan> <BANLkTikV262gQ-_GDiWQSe%2BFCDjPQxGOCA@mail.gmail.com>

On Tue, Apr 26, 2011 at 06:39:58PM +0100, Conall O'Brien wrote:
> On 26 April 2011 14:49, Jeremy Chadwick <freebsd@jdc.parodius.com> wrote:
> > On Tue, Apr 26, 2011 at 02:25:00PM +0100, Conall O'Brien wrote:
> >> On 26 April 2011 13:15, ambrosehuang ambrose <ambrosehua@gmail.com> wrote:
> > >> > Could you post your PR number? I was curious about the driver used by
> > >> > Western Digital disks, because I use
> > >> > the WD10EARS.
> >>
> >> http://www.freebsd.org/cgi/query-pr.cgi?pr=156647
> >>
> >> I chalked it up to the SATA controller, since only 2 of my 5 identical
> >> WD20EARS disks were reporting DMA issues.
> >>
> >> >
> >> > 2011/4/25 Conall O'Brien <conall@conall.net>
> >> >>
> >> >> On 15 April 2011 15:59, Conall O'Brien <conall@conall.net> wrote:
> >> >> > Hello,
> >> >> >
> >> >> >
> > >> >> > I've got a NAS box running 8-STABLE [1], which I'm running with 5x
> >> >> > Western Digital 2TB disks.
> >> >> >
> >> >> >
> > >> >> > One of the disks was having DMA issues, as reported in dmesg, so I
> > >> >> > began the usual ZFS workflow: "zpool offline pool dev", physically
> > >> >> > removing the disk, then "zpool replace pool dev". My attempts to
> > >> >> > replace it fail; the zpool command keeps ending up in uninterruptible
> > >> >> > wait (the D state). Before I started replacing the disk, a zpool scrub
> > >> >> > was in progress. Now I can't stop it with "zpool scrub -s pool"
> > >> >> > either; that command also ends up in the D state.
> >> >> >
> >> >> >
> > >> >> > Is there a way other than "zpool scrub -s pool" to terminate a scrub,
> > >> >> > so I can proceed with the disk replacement? I'd rather resilver my
> > >> >> > pool first and get around to scrubbing it afterwards.
> >> >> >
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> >
> >> >> > [1] For completeness, uname -a reports FreeBSD galvatron.taku.ie
> >> >> > 8.2-STABLE FreeBSD 8.2-STABLE #1: Sat Mar 19 13:18:46 UTC 2011
> > >> >> > root@galvatron.taku.ie:/usr/src/obj/usr/src/sys/GALVATRON  amd64
> >> >>
> >> >> I worked out the problem. There's a regression in one of the drivers
> >> >> between the kernel I was running and my previous kernel:
> >> >>
> >> >> FreeBSD galvatron.taku.ie 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #0:
> >> >> Wed Dec 29 04:00:27 UTC 2010
> > >> >> root@galvatron.taku.ie:/usr/src/obj/usr/src/sys/GALVATRON  amd64
> >> >>
> >> >>
> >> >> I'll file a PR to get it fixed.
> >
> > The PR is extremely terse/sub-par quality.  There isn't actual evidence
> > of the problem being a driver regression.  What needs to be provided in
> > the PR:
> 
> Yeah, I wasn't sure what specifics would be needed, but I wanted to
> open a PR and go from there. It was the first time I've run into a
> kernel-related issue; PRs for bugs in the ports collection are so much
> easier to describe.

I understand the situation; "okay, there's some issue here, but I don't
know what to put into the PR... well, better file it anyway".  It's
better than not doing anything at all.

> > - Relevant dmesg output (pertaining to ataX and adX devices and anything
> >   else seen around that time; stuff from /var/log/messages might be more
> >   useful since it contains timestamps)
> > - Full dmesg seen during a fresh reboot
> > - vmstat -i
> > - atacontrol cap ataX (for each ataX channel.  You can XXX out the
> >   serial number if desired)
> > - smartctl -a /dev/adX (for each disk, be sure to label which disk
> >   is associated with what data.  You can XXX out the serial number if
> >   desired)
> >
> > What really needs to be shown are the actual errors themselves, and in
> > sequential order / with timestamps.  "DMA errors" is too vague; I want
> > to assume READ_DMA48 but I cannot assume that.
> 
> Now that my RAID array is healthy again, I'm happy to reboot into my
> suspect kernel and collect better diagnostics reports.

Perfect, thank you very much!
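Since what matters is the ata/ad errors in sequential order with their
timestamps, a small filter over the syslog file can pull out just those
kernel lines.  The sketch below is a hypothetical helper (ata_events is
not a FreeBSD tool).  It assumes the default syslogd line layout
"Mon DD HH:MM:SS host kernel: message" and device names of the form
ataN, adN or adaN, and it keeps matching lines in file order so the
original sequence is preserved.

```python
import re
import sys

# Assumed syslogd layout: "Mon DD HH:MM:SS host kernel: message".
LINE_RE = re.compile(
    r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+(.*)$"
)
# ata(4) channels (ataN), ata(4) disks (adN) and CAM/AHCI disks (adaN).
DEV_RE = re.compile(r"\b((?:ata|ada|ad)\d+)\b")


def ata_events(lines):
    """Return (timestamp, device, message) for kernel lines naming an ata/ad device.

    Lines keep their original order, so the sequence of errors is preserved.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # not a kernel message, or not in the assumed layout
        stamp, msg = m.groups()
        dev = DEV_RE.search(msg)
        if dev:
            events.append((stamp, dev.group(1), msg))
    return events


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 ata_events.py /var/log/messages
    with open(sys.argv[1], errors="replace") as fh:
        for stamp, dev, msg in ata_events(fh):
            print(f"{stamp}  {dev:<6} {msg}")
```

Run it against /var/log/messages (or a rotated copy) and paste the output
into the PR alongside the full dmesg.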

> > Next:
> >
> > I'm not sure if your system supports it, but can you run the controller
> > in AHCI mode (BIOS setting) and load ahci.ko instead (ahci_load="yes" in
> > /boot/loader.conf, your disks will change to /dev/adaX)?  If so, this
> > would allow you to narrow down whether or not the issue is truly a
> > driver problem.  You should try this *before* attempting the below.
> 
> I actually intended to convert my disks over to AHCI anyway, to
> facilitate hot swapping better. I assume I can do a "zpool import" to
> get my ZFS pool to work using the new devices.

You actually don't have to do anything (export or import); ZFS will
taste the disks during kernel start, find ZFS metadata, and naturally
import everything automatically.  Super convenient.

> > Try updating your source to something newer than March 19th.  There have
> > been ata(4) changes since then that might pertain to your issue.  If the
> > same issue happens on a present-day build of RELENG_8 then we can start
> > by trying to narrow it down to commits between, roughly, late December
> > 2010 to mid-March 2011.  Since you follow RELENG_8, you will need to
> > follow commits.  src/sys/dev/ata is what's relevant here, as well as the
> > chipsets/ directory under that.
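Narrowing a regression down to a range of commits is usually done by
bisection: test the midpoint of the remaining range, keep whichever half
contains the first bad revision, and repeat.  The sketch below is a
generic, hypothetical helper, not part of any FreeBSD tool.  is_bad
stands in for the expensive manual step (update src/sys/dev/ata to that
revision, rebuild and boot the kernel, check whether the DMA errors come
back), and the revision list is assumed to be in chronological order with
the first entry known good and the last known bad.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(revisions: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first revision for which is_bad() is True.

    Assumes revisions are in chronological order, revisions[0] is known good
    and revisions[-1] is known bad. is_bad is never called on the endpoints.
    """
    if len(revisions) < 2:
        raise ValueError("need at least one good and one bad revision")
    lo, hi = 0, len(revisions) - 1  # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return revisions[hi]
```

The number of kernel builds grows with log2 of the number of candidate
commits (at most 5 builds for 21 candidates, for instance) rather than
one build per commit.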
> 
> Agreed, I probably shouldn't have left it so long between kernel
> rebuilds. I guess I was hoping there weren't too many changes related
> to my SATA controller, but that does naively assume the problem is the
> SATA controller driver.
> 
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/chipsets/
> >
> > Let's get this figured out before other users start correlating their
> > problems with whatever this is.
> 
> Agreed.

I'll be watching the PR.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |



