From: Ion-Mihai Tetcu
To: questions@freebsd.org, current@freebsd.org
Date: Mon, 11 Oct 2004 14:09:31 +0300
Subject: TIMEOUT - WRITE_DMA and smart questions
Message-ID: <20041011140931.7934d78b@it.buh.cameradicommercio.ro>

[ please reply only on questions@ if this is not appropriate for current@ ]

Hi,

While it was doing nothing special, the system started printing
TIMEOUT - WRITE_DMA errors and eventually, after an
atacontrol mode 0 PIO4 PIO4, hung completely at 04:20. After the restart
I got a few more TIMEOUTs but no hang; however, the machine is idle.
SMART was enabled, as seen below, but smartd wasn't running
(stupid, huh :-/ ).

Obvious question: is the hdd dying?

Second question, as I'm not familiar with SMART: how much can one trust
SMART reports?

Third question: could you suggest some settings for smartd? I'm asking
this because I don't fully understand the man pages for smartctl and
smartd; a link explaining more about SMART would also be appreciated.

System details:

Local system status (last daily mail):
 3:01AM up 2 days, 11:56, 2 users, load averages: 1.04, 1.07, 0.95

% uname -a
FreeBSD it.buh.cameradicommercio.ro 5.3-BETA7 FreeBSD 5.3-BETA7 #3: Mon Oct 4 21:57:25 EEST 2004 root@it.buh.tecnik93.com:/usr/obj/usr/src/sys/IT53_d i386

Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020
Oct 11 04:07:02 it kernel: ata0: reiniting channel ..
Oct 11 04:07:02 it kernel: ata0: reset tp1 mask=03 ostat0=d0 ostat1=d0
Oct 11 04:07:02 it kernel: ad0: stat=0xd0 err=0xd0 lsb=0xd0 msb=0xd0
Oct 11 04:07:02 it last message repeated 95 times
Oct 11 04:07:02 it kernel: ad0: stat=0x50 err=0x01 lsb=0x00 msb=0x00
Oct 11 04:07:02 it kernel: ata0-slave: stat=0x00 err=0x01 lsb=0x00 msb=0x00
Oct 11 04:07:02 it kernel: ata0: reset tp2 stat0=50 stat1=00 devices=0x1
Oct 11 04:07:02 it kernel: ata0: resetting done ..
Oct 11 04:07:02 it kernel: ad0: pio=0x0c wdma=0x22 udma=0x45 cable=80pin
Oct 11 04:07:02 it kernel: ad0: setting PIO4 on VIA 8235 chip
Oct 11 04:07:02 it kernel: ad0: setting UDMA100 on VIA 8235 chip
Oct 11 04:07:02 it kernel: ata0: device config done ..
Oct 11 04:07:16 it kernel: (probe0:ata0:0:0:0): error 22
Oct 11 04:07:16 it kernel: (probe0:ata0:0:0:0): Unretryable Error
Oct 11 04:07:16 it kernel: (probe1:ata0:0:1:0): error 22
Oct 11 04:07:16 it kernel: (probe1:ata0:0:1:0): Unretryable Error
.........

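In case it helps anyone tabulate these messages, here is a rough Python
sketch that pulls the device, retry count and LBA out of the TIMEOUT
lines. The regular expression is written only against the lines quoted
in this mail, and the field names (ts, dev, cmd, retries, lba) are my
own labels, so treat it as an assumption about the log format rather
than a general parser.

```python
import re

# Matches lines such as:
#   Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020
# The group names are illustrative labels, not kernel terminology.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"(?P<dev>\w+):\s+TIMEOUT - (?P<cmd>\w+) retrying "
    r"\((?P<retries>\d+) retr(?:y|ies) left\)\s+LBA=(?P<lba>\d+)$"
)


def parse_timeouts(lines):
    """Return one dict per TIMEOUT line; other lines are skipped."""
    events = []
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m is None:
            continue
        events.append({
            "ts": m.group("ts"),
            "dev": m.group("dev"),
            "cmd": m.group("cmd"),
            "retries": int(m.group("retries")),
            "lba": int(m.group("lba")),
        })
    return events


# A few lines taken from the logs in this mail.
SAMPLE = [
    "Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020",
    "Oct 11 04:07:02 it kernel: ata0: reiniting channel ..",
    "Oct 11 04:14:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421156",
    "Oct 11 04:14:24 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=175421156",
]

events = parse_timeouts(SAMPLE)
for e in events:
    print(e["ts"], e["dev"], e["cmd"], e["retries"], e["lba"])
```

Feeding it the whole of /var/log/messages instead of SAMPLE should give
a list that is easy to sort by LBA or to check for repeated sectors.
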
# grep LBA /var/log/messages
Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020
Oct 11 04:07:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165839908
Oct 11 04:08:48 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165849220
Oct 11 04:09:12 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165851556
Oct 11 04:09:32 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165859748
Oct 11 04:10:44 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103
Oct 11 04:11:23 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210916
Oct 11 04:11:36 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186211044
Oct 11 04:11:58 it kernel: acd0: FAILURE - ATA_IDENTIFY status=51 error=4 LBA=0
Oct 11 04:13:21 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=309294340
Oct 11 04:14:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421156
Oct 11 04:14:24 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=175421156
Oct 11 04:15:04 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421796
Oct 11 04:15:48 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=130261540
Oct 11 04:16:10 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421892
Oct 11 04:16:53 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=173918724
Oct 11 04:18:50 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=309924420
Oct 11 04:19:14 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4920283
Oct 11 04:40:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4918975
Oct 11 04:40:56 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6067199
Oct 11 10:46:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103

# grep sw /var/log/messages
Oct 11 04:14:24 it kernel: swap_pager: indefinite wait buffer: device: ad0s1e, blkno: 14841, size: 4096
Oct 11 04:14:24 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 14381, size: 4096
Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 60732, size: 4096
Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 33481, size: 4096
Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 33488, size: 4096

The disk is:

# atacontrol cap 0 0
ATA channel 0, Master, device ad0:

Protocol              ATA/ATAPI revision 6
device model          WDC WD1600JB-00EVA0
serial number         WD-WCAEK1298992
firmware revision     15.05R15
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       312579695 sectors
dma supported
overlap not supported

Feature                        Support  Enable  Value     Vendor
write cache                    yes      no
read ahead                     yes      yes
dma queued                     no       no      0/0x00
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      0/0x00
automatic acoustic management  yes      yes     254/0xFE  128/0x80

# smartctl -a /dev/ad0
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1600JB-00EVA0
Serial Number:    WD-WCAEK1298992
Firmware Version: 15.05R15
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Oct 11 12:37:32 2004 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

The SMART RETURN STATUS return value (smartmontools -H option/Directive)
can not be retrieved with this version of ATAng, please do not rely on this value

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x05) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  40) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                 (5061) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  67) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   155   147   021    Pre-fail  Always       -       2775
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       464
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000b   200   199   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3360
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       462
194 Temperature_Celsius     0x0022   124   253   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Interrupted (host reset)     80%         77           -
# 2  Extended offline    Aborted by host              90%         77           -
# 3  Conveyance offline  Completed without error      00%         76           -
# 4  Short offline       Completed without error      00%         76           -
# 5  Conveyance offline  Completed without error      00%        233           -
# 6  Short captive       Interrupted (host reset)     90%        233           -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

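For what it's worth, my reading of the smartctl man page is that an
attribute counts as failed once its normalized VALUE drops to or below
THRESH, so the interesting number is how far each VALUE still sits
above its threshold. A minimal Python sketch of that check, using a few
rows copied from the table above; the function names and the choice of
rows are mine, and the "value <= thresh" rule is my understanding of
the documentation, not something I have verified on this drive:

```python
# Rows copied from the smartctl attribute table in this mail.
ATTRIBUTES = """\
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2
"""


def parse_attributes(text):
    """Split each 10-column attribute row into a small dict."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": int(fields[9]),
        })
    return rows


rows = parse_attributes(ATTRIBUTES)

# Distance of the normalized value from its threshold, per attribute.
margins = {r["name"]: r["value"] - r["thresh"] for r in rows}

# Attributes at or below threshold (assumed failure rule, see above).
failing = [r["name"] for r in rows if r["value"] <= r["thresh"]]

for r in rows:
    print(f'{r["name"]:<24} value={r["value"]:>3} thresh={r["thresh"]:>3} '
          f'margin={margins[r["name"]]:>3} raw={r["raw"]}')
print("failing:", failing)
```

By this check nothing is at or below threshold yet, although the raw
counts show 8 reallocated sectors and 2 UDMA CRC errors, which is part
of why I'm asking how far the report can be trusted.
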
Thanks,

--
IOnut
Unregistered ;) FreeBSD "user"