From owner-freebsd-stable@FreeBSD.ORG Tue Sep 16 18:19:17 2008 Return-Path: Delivered-To: stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4E02C106564A for ; Tue, 16 Sep 2008 18:19:17 +0000 (UTC) (envelope-from clint@0lsen.net) Received: from belle.0lsen.net (belle.0lsen.net [75.150.32.89]) by mx1.freebsd.org (Postfix) with ESMTP id 2F66F8FC1B for ; Tue, 16 Sep 2008 18:19:11 +0000 (UTC) (envelope-from clint@0lsen.net) Received: by belle.0lsen.net (Postfix, from userid 1001) id C4F387962D; Tue, 16 Sep 2008 11:19:03 -0700 (PDT) Date: Tue, 16 Sep 2008 11:19:03 -0700 From: Clint Olsen To: stable@freebsd.org Message-ID: <20080916181903.GC7540@0lsen.net> References: <20080916170452.GB4861@0lsen.net> <20080916175858.GA70396@icarus.home.lan> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="bg08WKrSYDhXBjb5" Content-Disposition: inline In-Reply-To: <20080916175858.GA70396@icarus.home.lan> User-Agent: Mutt/1.4.2.3i Organization: NULlsen Network X-Disclaimer: Mutt Bites! X-0lsen-net-MailScanner-Information: Please contact the ISP for more information X-MailScanner-ID: C4F387962D.3357D X-0lsen-net-MailScanner: Found to be clean X-0lsen-net-MailScanner-From: clint@0lsen.net X-Spam-Status: No Cc: Subject: Re: Help debugging DMA_READ errors X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Sep 2008 18:19:17 -0000 --bg08WKrSYDhXBjb5 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Hi Jeremy: Thanks for your detailed response. Here are the answers I have thus far: On Sep 16, Jeremy Chadwick wrote: > acd0 is a CD/DVD drive. ad4 is a hard disk. What exactly were you > doing with the system at the time these errors appeared? Were you using > the CD/DVD drive? Was there a disc in the drive that was mounted? > If none of these things, I'm baffled as to what would read acd0 and > cause what you see here. I was not at the system at the time. I never have had a disk in the drive nor is /cdrom mounted currently. I have dump backups that run in the middle of the night on the various filesystems. > Can you please provide full details of what these disks are connected > to? I'd like to see dmesg output for ata0, ata2, and ata3, as well as > the atapci devices those ataX devices are attached to, ditto with > vmstat -i output. Are there any other errors in your logs around > that time (e.g. watchdog timeouts of any kind on network devices, etc.?) # dmesg | grep -i ata atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0 ata0: on atapci0 ata1: on atapci0 atapci1: port 0xeff0-0xeff7,0xefe4-0xefe7,0xefa8-0xefaf,0xefe0-0xefe3,0xef60-0xef6f irq 18 at device 31.2 on pci0 ata2: on atapci1 ata3: on atapci1 I skipped the disks, of course. > Additionally, it would be very useful if you could install > ports/sysutils/smartmontools and provide the following output: > > # smartctl -a /dev/ad0 > # smartctl -a /dev/ad4 See attached. > http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting > > The bottom line is that, if the problems you're seeing are the "same > thing" others are seeing, then you are not alone. As I said initially, > finding the source of these problems is difficult, and they are often > "unique" to each individual's machine. For some, replacing cables, the > entire motherboard, disk controller, or just the PSU helped; for others, > the problem disappeared on its own; in other cases, the problem was > so severe that they ended up switching to Linux. I'll take a look at this page. Thanks, -Clint -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. --bg08WKrSYDhXBjb5 Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=ad0 smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF INFORMATION SECTION === Model Family: Western Digital Caviar SE family Device Model: WDC WD1200JB-32EVA0 Serial Number: WD-WMAEL1302890 Firmware Version: 15.05R15 User Capacity: 120,034,123,776 bytes Device is: In smartctl database [for details use: -P show] ATA Version is: 6 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Tue Sep 16 11:11:04 2008 PDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (3801) seconds. Offline data collection capabilities: (0x79) SMART execute Offline immediate. No Auto Offline data collection support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. No General Purpose Logging support. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 53) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000b 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0007 162 148 021 Pre-fail Always - 2433 4 Start_Stop_Count 0x0032 100 100 040 Old_age Always - 79 5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 7 7 Seek_Error_Rate 0x000b 100 253 051 Pre-fail Always - 0 9 Power_On_Hours 0x0032 042 042 000 Old_age Always - 42740 10 Spin_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0 11 Calibration_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 79 194 Temperature_Celsius 0x0022 111 253 000 Old_age Always - 39 196 Reallocated_Event_Count 0x0032 193 193 000 Old_age Always - 7 197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always - 0 199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0009 200 155 051 Pre-fail Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. --bg08WKrSYDhXBjb5 Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=ad4 smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF INFORMATION SECTION === Model Family: Western Digital Caviar SE Serial ATA family Device Model: WDC WD1200JD-00GBB0 Serial Number: WD-WMAET1326141 Firmware Version: 02.05D02 User Capacity: 120,034,123,776 bytes Device is: In smartctl database [for details use: -P show] ATA Version is: 6 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Tue Sep 16 11:11:17 2008 PDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (3801) seconds. Offline data collection capabilities: (0x79) SMART execute Offline immediate. No Auto Offline data collection support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. No General Purpose Logging support. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 53) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000b 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0007 160 139 021 Pre-fail Always - 2508 4 Start_Stop_Count 0x0032 100 100 040 Old_age Always - 55 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x000b 200 200 051 Pre-fail Always - 0 9 Power_On_Hours 0x0032 051 051 000 Old_age Always - 36471 10 Spin_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0 11 Calibration_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 55 194 Temperature_Celsius 0x0022 104 253 000 Old_age Always - 46 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always - 0 199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 2 200 Multi_Zone_Error_Rate 0x0009 200 155 051 Pre-fail Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. --bg08WKrSYDhXBjb5--