FreeBSD Mail Archives

Date:      Mon, 20 Oct 2008 21:38:23 +1100
From:      "Kristian Rooke" <kristianr@gmail.com>
To:        "Jeremy Chadwick" <koitsu@freebsd.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: SETFEATURES SET TRANSFER MODE taskqueue timeout.. Error occuring constantly.. Please help!!
Message-ID:  <f9ccec500810200338m56b8c3ar2288e9c1f6415fd1@mail.gmail.com>
In-Reply-To: <20081018212543.GA58536@icarus.home.lan>
References:  <f9ccec500810180100j7969b1eeucb6e974f37b05961@mail.gmail.com> <20081018102403.GA46124@icarus.home.lan> <f9ccec500810180932k5fe192e1uc360afe41ae8581f@mail.gmail.com> <20081018212543.GA58536@icarus.home.lan>

I have made some changes and provided the requested details.
The issue is still occurring, so if it looks like it's going to be more
trouble than it's worth, I will probably just replace 3 of the PATA IDE
disks with a SATA disk and throw the remaining PATA disk on the Nvidia ATA
controller?

Thanks for your help thus far! :)

On Sun, Oct 19, 2008 at 8:25 AM, Jeremy Chadwick <koitsu@freebsd.org> wrote:
> On Sun, Oct 19, 2008 at 03:32:29AM +1100, Kristian Rooke wrote:
>> Thanks for the quick response!
>>
>> Please see requested output below:
>
> Cool, thanks.  One thing I forgot to ask for was "vmstat -i" output.

interrupt                          total       rate
irq1: atkbd0                           6          0
irq6: fdc0                             1          0
irq14: ata0                         2060          2
irq16: atapci1                       612          0
irq17: em0                           810          0
cpu0: timer                      1812646       1998
cpu1: timer                      1812344       1998
Total                            3628479       4000
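
For anyone who wants to check output like this mechanically, below is a
minimal Python sketch that parses the table above. The helper names
(parse_vmstat_i, irq_map) are made up for illustration and are not part of
any FreeBSD tool; the sample text is copied from this message. It only
confirms that the per-source totals add up and that atapci1 and em0 are
listed on different IRQ lines.

```python
import re

# Sample copied from the "vmstat -i" output in this message.
VMSTAT_OUTPUT = """\
interrupt                          total       rate
irq1: atkbd0                           6          0
irq6: fdc0                             1          0
irq14: ata0                         2060          2
irq16: atapci1                       612          0
irq17: em0                           810          0
cpu0: timer                      1812646       1998
cpu1: timer                      1812344       1998
Total                            3628479       4000
"""

# "source: device   total   rate"; the header and "Total" lines have no
# colon, so they do not match and are skipped.
LINE_RE = re.compile(
    r"^(?P<source>\S+):\s+(?P<device>.+?)\s+(?P<total>\d+)\s+(?P<rate>\d+)$"
)


def parse_vmstat_i(text):
    """Return a list of (source, device, total, rate) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append(
                (m["source"], m["device"], int(m["total"]), int(m["rate"]))
            )
    return rows


def irq_map(rows):
    """Map device name -> IRQ label, ignoring non-IRQ sources (cpu timers)."""
    return {dev: src for src, dev, _total, _rate in rows if src.startswith("irq")}


rows = parse_vmstat_i(VMSTAT_OUTPUT)
irqs = irq_map(rows)
print("sum of per-source totals:", sum(total for _s, _d, total, _r in rows))
print("atapci1 on", irqs["atapci1"], "/ em0 on", irqs["em0"])
```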

> For now, let's break it down for ease of understanding:
>
> FreeBSD 7.0-RELEASE i386, built February 2008.
>
> atapci0: nVidia nForce MCP73 ATA133 controller -- IRQ 14
> atapci1: Silicon Image 0680 ATA133 controller  -- IRQ 16
>
> ata0: attached to atapci0
> ata1: attached to atapci0
> ata2: attached to atapci1
> ata3: attached to atapci1
>
> ad0: <Seagate ST380011A 3.06>   at ata0-master PIO4
> ad4: <Seagate ST3320620A 3.AAF> at ata2-master PIO4
> ad5: <Seagate ST3320620A 3.AAF> at ata2-slave  PIO4
> ad6: <Seagate ST3750640A 3.AAE> at ata3-master PIO4
> ad7: <Seagate ST3320620A 3.AAD> at ata3-slave  PIO4
>
> ATA errors are reported for disks ad4, ad5, ad6, and ad7.  ad0 appears
> to be error-free.
>
> First and foremost: there are known problems with Silicon Image
> controllers on all operating systems (Windows, Linux, and FreeBSD in
> particular), known for causing data loss and other sporadic issues.
> This is at least confirmed on their SATA controllers, and I've become
> quite the "pick something else" advocate when it comes to their stuff.
> However: I've no idea about their PATA controllers.

I was originally using a Promise PATA IDE controller, but that's when the
issues first began, so I bought a cheap Silicon Image IDE controller to
replace it. After reading your email I have swapped the SI card back out
for the Promise controller. Below is the relevant line from dmesg:

atapci1: <Promise PDC20270 UDMA100 controller> port
0xcf00-0xcf07,0xce00-0xce03,0xcd00-0xcd07,0xcc00-0xcc03,0xcb00-0xcb0f mem
0xefbf0000-0xefbfffff irq 16 at device 5.0 on pci1

>
> Secondly, so far there isn't any evidence that the ad0 disk, which uses
> the nVidia controller, has any problem -- all the disks having problems
> are on the Silicon Image controller.  That is a very key piece of
> information here.
>
> If, when you're writing data to, say, the ad4 disk, you start to see
> errors on all disks (ad4 through ad7), then what this probably means is
> the controller has locked up or is behaving badly.  This adds further
> evidence that the Silicon Image controller may be at fault here.
>
> Thirdly, you said the system requires a hard reset to get things back in
> working order.  Sometimes this can be induced by a power supply that
> isn't providing decent/proper voltages, or is being overloaded,
> particularly during heavy disk I/O (drawing more power in some cases).
> It might be good to check your voltages inside of your system BIOS,
> write them down, and type them in here.  FreeBSD does not provide a
> decent set of tools for monitoring this stuff inside the OS (yet; I'm
> working on it, mainly for server boards.  I do what I can...)

When the error messages (the same ones pasted previously) start appearing
on the console, the system becomes unresponsive.
I can no longer SSH to the machine, and when I try to use the console it
just keeps scrolling the disk error messages.

I am currently using an Anter 550W PSU. Below are the voltage readings from
the BIOS:

Vcore - 1.19V
Vcc12V - 12.30V
Vcc3.3V - 3.28V
Vcc5.0V - 5.04V
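
As a rough sanity check of these readings, here is a small Python sketch
that compares each rail against its nominal value. It assumes the usual
+/-5% ATX tolerance on the 12V, 5V and 3.3V rails; Vcore is left out because
its target depends on the CPU. The function name check_rails is hypothetical,
and idle BIOS readings say nothing about behaviour under disk load.

```python
# BIOS readings from this message, in volts.
READINGS = {"Vcc12V": 12.30, "Vcc3.3V": 3.28, "Vcc5.0V": 5.04}

# Nominal rail voltages, in volts.
NOMINAL = {"Vcc12V": 12.0, "Vcc3.3V": 3.3, "Vcc5.0V": 5.0}

# Assumption: +/-5% tolerance on these three rails.
TOLERANCE = 0.05


def check_rails(readings, nominal, tolerance=TOLERANCE):
    """Return {rail: (relative_deviation, within_tolerance)}."""
    report = {}
    for rail, measured in readings.items():
        nom = nominal[rail]
        deviation = (measured - nom) / nom
        report[rail] = (deviation, abs(deviation) <= tolerance)
    return report


for rail, (deviation, ok) in check_rails(READINGS, NOMINAL).items():
    status = "within" if ok else "OUTSIDE"
    print(f"{rail}: {deviation:+.1%} ({status} +/-{TOLERANCE:.0%})")
```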

> But keep in mind that a controller locking up hard could also require a
> hard reset (pressing reset on the front of the PC) -- a soft reset
> (Ctrl-Alt-Del) would probably work, except much of the running kernel is
> spinning hard trying to deal with ATA problems.
>
> Fourthly, I see a "<some output omitted>" line in your original dmesg.
> Can you provide that output?  It's important -- sometimes people have
> seen issues where their ATA controller shows problems, but it turns out
> to be an IRQ sharing or device compatibility problem with another device
> (e.g. their board was showing ATA errors, but at the exact same time,
> also showing NIC watchdog timeouts or other anomalies).  They omitted
> the dmesg data thinking it had nothing to do with the problem, when in
> fact it helps determine if the issue is truly with one piece or the
> entire system.

The <some output omitted> lines were simply repeats of the error messages I
previously provided. I just had a look, and there was no mention of anything
but ad4-ad7 errors in /var/log/messages. However, if you believe the extra
logs would help, let me know and I will drop the whole lot in.

Also, it seems that when this error has occurred recently, no errors have
been written to /var/log/messages. I'm guessing this is due to the system
load during the ATA problem.
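
To make the "nothing but ad4-ad7" check repeatable, here is a minimal Python
sketch that counts ATA error lines per disk. The sample log lines are
illustrative only: the hostname, timestamps and exact wording are
assumptions modelled on the subject line of this thread, not copied from my
logs, and count_ata_errors is a made-up helper name.

```python
import re
from collections import Counter

# Illustrative lines only (assumed wording, hypothetical host and times).
SAMPLE_LOG = """\
Oct 20 20:01:02 host kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Oct 20 20:01:06 host kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Oct 20 20:01:10 host kernel: ad4: FAILURE - SETFEATURES SET TRANSFER MODE timed out
Oct 20 20:01:11 host kernel: em0: link state changed to UP
"""

# Match kernel lines of the form "adN: WARNING - ..." and similar.
ATA_ERROR_RE = re.compile(r"kernel: (ad\d+): (?:WARNING|FAILURE|TIMEOUT) - ")


def count_ata_errors(lines):
    """Count matching error lines per disk (e.g. 'ad4') in an iterable of lines."""
    counts = Counter()
    for line in lines:
        m = ATA_ERROR_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


for disk, n in sorted(count_ata_errors(SAMPLE_LOG.splitlines()).items()):
    print(disk, n)
```

To run it against a real log, pass an open file instead of the sample, e.g.
count_ata_errors(open("/var/log/messages")). A disk that never shows up in
the result (ad0 here) logged no matching lines.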

>
> Next, let's take a look at your SMART output, which tells a tale of
> something very very bad:
>
> Disk ad4 has a good temperature, and no sign of bad blocks/sectors.  The
> disk had been powered on for a total of 7799 hours.
>
> There was a CRC error detected when attempting to set specific
> capabilities on the device.  The error occurred at LBA 0 on the disk,
> which is completely bizarre, but the SMART error log might just say LBA
> 0 to indicate "no LBA was being accessed" (e.g. the error was purely
> during the mode setting attempts).  However, the SMART error "wraps" its
> timestamps at 49.710 days (roughly every 1193 hours), so it's going to be
> difficult to determine if the below SMART error log entry was from long
> ago, or was fairly recent.  Looking at other disks might help, so let's
> continue.
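
A short aside on the wrap figure: 49.710 days is what you get if the
timestamp is a 32-bit count of milliseconds. That is an assumption on my
part, but it matches the number quoted. The Python sketch below works out
the period in hours and how many times the counter would already have
wrapped for a given power-on time; the function names are made up for
illustration.

```python
# Assumption: the SMART error-log timestamp is a 32-bit millisecond counter.
WRAP_MS = 2 ** 32


def wrap_period_hours():
    """Length of one wrap period in hours."""
    return WRAP_MS / 1000 / 3600


def wrap_period_days():
    """Length of one wrap period in days."""
    return wrap_period_hours() / 24


def completed_wraps(power_on_hours):
    """Number of full wrap periods contained in the given power-on hours."""
    return int(power_on_hours // wrap_period_hours())


print(f"wrap period: {wrap_period_days():.3f} days "
      f"= {wrap_period_hours():.3f} hours")
# ad4 reports 7799 power-on hours in the quoted SMART summary.
print("full wraps for ad4:", completed_wraps(7799))
```

With several full wraps behind it, a single wrapped timestamp on ad4 could
belong to any of those periods, which is why the log entry alone cannot be
dated.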
>
> Disk ad5 has an excellent temperature, and no sign of bad blocks/sectors
> either.  The disk has been powered on for a total of 11956 hours.  No
> errors were found in the SMART log.
>
> Disk ad6 has a good temperature, and no sign of bad blocks/sectors.  No
> errors were found in the SMART log.
>
> Disk ad7 has an excellent temperature, and no sign of bad blocks/sectors
> either.  The disk had been powered on for a total of 12512 hours.
>
> However, much like disk ad4, this disk also witnessed a CRC error when
> attempting to either do a DMA read operation or when setting
> capabilities on the device.  I'm prone to believe it's when setting
> capabilities, because LBA 0 is also seen here, which isn't a likely LBA.
> This error happened at the 6310 hour mark, which was about half of its
> lifetime ago.
>
> All of this is somewhat of a mystery.  Disk ad4 is on a completely
> different physical cable than disk ad7, so that *could* rule out cabling
> problems.  The errors seen are only when setting device capabilities
> (making an educated guess, but I'm not 100% positive), not when actually
> accessing data on the disks.  Heck, I'm not even sure the errors in the
> SMART log are accurate, as the disks have been powered on for quite some
> time after the supposed errors occurred.
>
> Power draw could also explain this, ditto with the voltage possibility.
>
> I would start by doing 3 easy things:
>
> 1) Re-enable DMA mode; it's obviously not the cause of your problems
> since PIO mode shows the same problem for you,

DMA mode has now been re-enabled.

> 2) Replacing both sets of PATA cables with brand new ones.  There's no
> evidence this is the problem, but changing these is easy and cheap.  If
> it doesn't solve the problem, then you're one step closer to tracking it
> down,

Cables (and controller) have both been changed. I just ran some checks and
confirmed the issue is still occurring.
I have been using Samba to copy files over, but I also tested by mounting an
NTFS volume locally, and the issues still occurred.

> 3) Getting voltages from the BIOS and providing them here.  Again, this
> won't be an accurate representation of the system under load, but it's
> the best we've got right now.

As above.

> Assuming the problem continues after #2, and the voltages shown in #3
> look good, this is what I'd do for the next step:
>
> Buy a PCI, PCI-X (if so, make sure it's backwards-compatible with
> 32-bit 33MHz PCI slots, unless you actually have a PCI-X slot!) or PCI
> Express PATA controller -- specifically, one that does not use a Silicon
> Image chip.  This may be hard to accomplish since PATA is a dying
> interface (and good riddance!).
>
> I will also stress this in capitals, just to make it clear: DO NOT BUY A
> SATA CONTROLLER THEN USE PATA-TO-SATA ADAPTERS.  Those adapters will
> cause you even more problems.  If you go the SATA route, buy actual SATA
> disks and recycle or sell your old PATA ones.
>
> That said, Highpoint and Promise both make PATA controllers -- not to
> mention, I even see that you've tried to load the hptrr(4) driver on
> that system!  :-) Additionally, DO NOT use the "RAID" features of these
> cards (if you end up buying one that has such); just plug the disks in
> and use them in a JBOD fashion.
>
> You might find that the disk numbers (e.g. ad4) change on you when
> doing this; that's to be expected.
>
> Others might recommend that you should try replacing the PSU before
> buying a new PATA controller, but I have doubts the problem is with the
> PSU; I would expect more odd/awkward problems if the PSU was to blame.
> If you do try a different PSU, go with one that does 450W or more.  You
> DO NOT need a l33t-g4m3-d00dz-omgwtfbbq!! 850-1000W PSU; most of the
> power draw for hard disks happens during power-on, when the disks have
> to spin up, not once they're already spinning.
>
> Hope this helps, and good luck!
>
> --
> | Jeremy Chadwick                                jdc at parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                  Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |
>
>

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?f9ccec500810200338m56b8c3ar2288e9c1f6415fd1>
