From owner-freebsd-stable@FreeBSD.ORG Mon Oct 20 10:38:26 2008
Date: Mon, 20 Oct 2008 21:38:23 +1100
From: "Kristian Rooke"
To: "Jeremy Chadwick"
Cc: freebsd-stable@freebsd.org
Subject: Re: SETFEATURES SET TRANSFER MODE taskqueue timeout.. Error occuring constantly.. Please help!!
In-Reply-To: <20081018212543.GA58536@icarus.home.lan>
References: <20081018102403.GA46124@icarus.home.lan> <20081018212543.GA58536@icarus.home.lan>
List-Id: Production branch of FreeBSD source code

I have made some changes and provided the requested details. The issue is
still occurring, so if it looks like it's going to be more trouble than
it's worth, I will probably just replace three of the PATA IDE disks with a
SATA disk and put the remaining PATA disk on the Nvidia ATA controller.

Thanks for your help thus far! :)

On Sun, Oct 19, 2008 at 8:25 AM, Jeremy Chadwick wrote:

> On Sun, Oct 19, 2008 at 03:32:29AM +1100, Kristian Rooke wrote:
>> Thanks for the quick response!
>>
>> Please see requested output below:
>
> Cool, thanks.  One thing I forgot to ask for was "vmstat -i" output.

interrupt                          total       rate
irq1: atkbd0                           6          0
irq6: fdc0                             1          0
irq14: ata0                         2060          2
irq16: atapci1                       612          0
irq17: em0                           810          0
cpu0: timer                      1812646       1998
cpu1: timer                      1812344       1998
Total                            3628479       4000

> For now, let's break it down for ease of understanding:
>
> FreeBSD 7.0-RELEASE i386, built February 2008.
>
> atapci0: nVidia nForce MCP73 ATA133 controller -- IRQ 14
> atapci1: Silicon Image 0680 ATA133 controller -- IRQ 16
>
> ata0: attached to atapci0
> ata1: attached to atapci0
> ata2: attached to atapci1
> ata3: attached to atapci1
>
> ad0: at ata0-master PIO4
> ad4: at ata2-master PIO4
> ad5: at ata2-slave PIO4
> ad6: at ata3-master PIO4
> ad7: at ata3-slave PIO4
>
> ATA errors are reported for disks ad4, ad5, ad6, and ad7.  ad0 appears
> to be error-free.
>
> First and foremost: there are known problems with Silicon Image
> controllers on all operating systems (Windows, Linux, and FreeBSD in
> particular), known for causing data loss and other sporadic issues.
> This is at least confirmed on their SATA controllers, and I've become
> quite the "pick something else" advocate when it comes to their stuff.
> However: I've no idea about their PATA controllers.

I was originally using a Promise PATA IDE controller, but that's when the
issues first began, so I bought a cheap Silicon Image IDE controller to
replace it. After reading your email I have swapped the SI card back out
for the Promise controller. Below is the detail from dmesg:

atapci1: port 0xcf00-0xcf07,0xce00-0xce03,0xcd00-0xcd07,0xcc00-0xcc03,0xcb00-0xcb0f mem 0xefbf0000-0xefbfffff irq 16 at device 5.0 on pci1

>
> Secondly, so far there isn't any evidence that the ad0 disk, which uses
> the nVidia controller, has any problem -- all the disks having problems
> are on the Silicon Image controller.  That is a very key piece of
> information here.
>
> If when you're writing data to, say, the ad4 disk, and you start to see
> errors on all disks (ad4 through ad7), then what this probably means is
> the controller has locked up or is behaving badly.  This adds further
> evidence that the Silicon Image controller may be at fault here.
>
> Thirdly, you said the system requires a hard reset to get things back in
> working order.  Sometimes this can be induced by a power supply that
> isn't providing decent/proper voltages, or is being overloaded,
> particularly during heavy disk I/O (drawing more power in some cases).
> It might be good to check your voltages inside of your system BIOS,
> write them down, and type them in here.  FreeBSD does not provide a
> decent set of tools for monitoring this stuff inside the OS (yet; I'm
> working on it, mainly for server boards.  I do what I can...)

When the error messages (same as pasted previously) start appearing on the
console, the system becomes unresponsive. I can no longer SSH to the
machine, and when I try to use the console it just keeps scrolling the
disk error messages.

I am currently using an Antec 550W PSU.
Below are the voltage details from the BIOS:

Vcore   - 1.19V
Vcc12V  - 12.30V
Vcc3.3V - 3.28V
Vcc5.0V - 5.04V

> But keep in mind that a controller locking up hard could also require a
> hard reset (pressing reset on the front of the PC) -- a soft reset
> (Ctrl-Alt-Del) would probably work, except much of the running kernel is
> spinning hard trying to deal with ATA problems.
>
> Fourthly, I see a "" line in your original dmesg.
> Can you provide that output?  It's important -- sometimes people have
> seen issues where their ATA controller shows problems, but it turns out
> to be an IRQ sharing or device compatibility problem with another device
> (e.g. their board was showing ATA errors, but at the exact same time,
> also showing NIC watchdog timeouts or other anomalies).  They omitted
> the dmesg data thinking it had nothing to do with the problem, when in
> fact it helps determine if the issue is truly with one piece or the
> entire system.

That line was simply repeats of the error messages I previously provided.
I just had a look and there was no mention of anything but ad4-ad7 errors
in /var/log/messages. However, if you believe the extra logs would help,
let me know and I will drop the whole lot in.

Also, it seems that when this error has occurred recently, nothing has
been written to /var/log/messages; I'm guessing this is due to the system
load during the ATA problem.

>
> Next, let's take a look at your SMART output, which tells a tale of
> something very very bad:
>
> Disk ad4 has a good temperature, and no sign of bad blocks/sectors.  The
> disk had been powered on for a total of 7799 hours.
>
> There was a CRC error detected when attempting to set specific
> capabilities on the device.  The error occurred at LBA 0 on the disk,
> which is completely bizarre, but the SMART error log might just say LBA
> 0 to indicate "no LBA was being accessed" (e.g. the error was purely
> during the mode setting attempts).
> However, the SMART error log "wraps" its
> timestamps at 49.710 days (about every 1193 hours), so it's going to be
> difficult to determine if the below SMART error log entry was from long
> ago, or was fairly recent.  Looking at other disks might help, so let's
> continue.
>
> Disk ad5 has an excellent temperature, and no sign of bad blocks/sectors
> either.  The disk has been powered on for a total of 11956 hours.  No
> errors were found in the SMART log.
>
> Disk ad6 has a good temperature, and no sign of bad blocks/sectors.  No
> errors were found in the SMART log.
>
> Disk ad7 has an excellent temperature, and no sign of bad blocks/sectors
> either.  The disk had been powered on for a total of 12512 hours.
>
> However, much like disk ad4, this disk also witnessed a CRC error when
> attempting to either do a DMA read operation or when setting
> capabilities on the device.  I'm prone to believe it's when setting
> capabilities, because LBA 0 is also seen here, which isn't a likely LBA.
> This error happened at the 6310 hour mark, which was about half of its
> lifetime ago.
>
> All of this is somewhat of a mystery.  Disk ad4 is on a completely
> different physical cable than disk ad7, so that *could* rule out cabling
> problems.  The errors seen are only when setting device capabilities
> (making an educated guess, but I'm not 100% positive), not when actually
> accessing data on the disks.  Heck, I'm not even sure the errors in the
> SMART log are accurate, as the disks have been powered on for quite some
> time after the supposed errors occurred.
>
> Power draw could also explain this, ditto with the voltage possibility.
>
> I would start by doing 3 easy things:
>
> 1) Re-enable DMA mode; it's obviously not the cause of your problems
> since PIO mode shows the same problem for you,

This has now been re-enabled.

> 2) Replacing both sets of PATA cables with brand new ones.  There's no
> evidence this is the problem, but changing these is easy and cheap.
> If it doesn't solve the problem, then you're one step closer to tracking it
> down,

Cables (and controller) have both been changed. I just ran some checks and
confirmed the issue is still occurring. I have been using Samba to copy
files over, but I also tested by mounting an NTFS volume locally and the
issues still occurred.

> 3) Getting voltages from the BIOS and providing them here.  Again, this
> won't be an accurate representation of the system under load, but it's
> the best we've got right now.

As above.

> Assuming the problem continues after #2, and the voltages shown in #3
> look good, this is what I'd do for the next step:
>
> Buy a PCI, PCI-X (if so, make sure it's backwards-compatible with
> 32-bit 33MHz PCI slots, unless you actually have a PCI-X slot!) or PCI
> Express PATA controller -- specifically, one that does not use a Silicon
> Image chip.  This may be hard to accomplish since PATA is a dying
> interface (and good riddance!).
>
> I will also stress this in capitals, just to make it clear: DO NOT BUY A
> SATA CONTROLLER THEN USE PATA-TO-SATA ADAPTERS.  Those adapters will
> cause you even more problems.  If you go the SATA route, buy actual SATA
> disks and recycle or sell your old PATA ones.
>
> That said, Highpoint and Promise both make PATA controllers -- not to
> mention, I even see that you've tried to load the hptrr(4) driver on
> that system!  :-)  Additionally, DO NOT use the "RAID" features of these
> cards (if you end up buying one that has such); just plug the disks in
> and use them in a JBOD fashion.
>
> You might find that the disk numbers (e.g. ad4) change on you when
> doing this; that's to be expected.
>
> Others might recommend that you should try replacing the PSU before
> buying a new PATA controller, but I have doubts the problem is with the
> PSU; I would expect more odd/awkward problems if the PSU was to blame.
> If you do try a different PSU, go with one that does 450W or more.  You
> DO NOT need a l33t-g4m3-d00dz-omgwtfbbq!!
> 850-1000W PSU; most of the
> power draw for hard disks happens during power-on, when the disks have
> to spin up, not once they're already spinning.
>
> Hope this helps, and good luck!
>
> --
> | Jeremy Chadwick                                jdc at parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                  Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |
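
A footnote on the "voltages look good" check discussed in this message: the sketch below compares the BIOS readings given earlier against a +/-5% band around each nominal rail. The 5% figure is an assumption (the tolerance commonly cited for ATX supplies), not something stated in this thread, and the names `readings`, `deviation` and `check` are made up for illustration. Vcore is left out because its nominal value depends on the CPU. As noted in the quoted text, idle BIOS readings say little about behaviour under load, so this is only a rough sanity check.

```python
# Rough sanity check of the BIOS rail readings quoted in this message.
# TOLERANCE is an ASSUMPTION (the commonly cited +/-5% ATX band); it is
# not taken from the thread.
TOLERANCE = 0.05

# rail name -> (nominal volts, measured volts from the BIOS screen)
readings = {
    "Vcc12V": (12.0, 12.30),
    "Vcc3.3V": (3.3, 3.28),
    "Vcc5.0V": (5.0, 5.04),
}


def deviation(nominal, measured):
    """Signed fractional deviation of a reading from its nominal value."""
    return (measured - nominal) / nominal


def check(rails, tolerance=TOLERANCE):
    """Map each rail to (deviation, True if within the tolerance band)."""
    result = {}
    for rail, (nominal, measured) in rails.items():
        dev = deviation(nominal, measured)
        result[rail] = (dev, abs(dev) <= tolerance)
    return result


if __name__ == "__main__":
    for rail, (dev, ok) in check(readings).items():
        status = "within band" if ok else "OUT OF BAND"
        print(f"{rail}: {dev:+.2%} ({status})")
```

Under that assumed band, all three rails come out within tolerance, which matches the expectation in the quoted text that the PSU is not the most likely culprit.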