From: JoaoBR
Organization: Infomatik
To: Jeremy Chadwick
Cc: freebsd-stable@freebsd.org
Date: Mon, 20 Oct 2008 15:07:30 -0200
Subject: Re: constant zfs data corruption
Message-Id: <200810201507.30778.joao@matik.com.br>
In-Reply-To: <20081020132208.GA3847@icarus.home.lan>

On Monday 20 October 2008 11:22:08 you wrote:
> On Mon, Oct 20, 2008 at 08:37:40AM -0200, JoaoBR wrote:
> > On Friday 17 October 2008 15:39:59 Chuck Swiger wrote:
> > > On Oct 17, 2008, at 11:30 AM, JoaoBR wrote:
> > > > constantly I find data corruption on ZFS volumes, always from
> > > > rrdtool; this corrupt data happens on SATA disks, never seen on SCSI
> > >
> > > Presumably your SATA drives are correctly being reported by ZFS as
> > > corrupting data, and you should do something like replace cables, the
> > > drives themselves, perhaps try downgrading to SATA-150 rather than
> > > -300 if you are using the latter. Also consider running a drive
> > > diagnostic utility from the mfgr (or smartmontools) and doing an
> > > extended self-test or destructive write surface check.
> >
> > well, the hardware seems to be ok and is not older than 6 months; it
> > also happens on more than one machine ... smartctl does not report any
> > hw failures on the disks
> >
> > regarding jumpering the drives to 150, do you suspect a driver problem?
>
> It's not because of a driver problem. There are known SATA chipsets
> which do not properly work with SATA300 (particularly VIA and SiS
> chipsets); they claim to support it, but data is occasionally corrupted.
> Capping the drive to SATA150 fixes this problem.
>
> http://en.wikipedia.org/wiki/Serial_ATA#SATA_1.5_Gbit.2Fs_and_SATA_3_Gbit.2Fs
>
> There are also known problems with Silicon Image chipsets (on Linux,
> Windows, and FreeBSD).
>
> Because you didn't provide your smartctl output, I can't really tell if
> the drives are in "good shape" or not.
> :-)
>

OK, here it comes:

smartctl version 5.38 [amd64-portbld-freebsd7.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar T7K500
Device Model:     Hitachi HDT725025VLA380
Serial Number:    VFL101RK0A9SDP
Firmware Version: V5DOA7EA
User Capacity:    250.058.268.160 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Mon Oct 20 15:07:01 2008 BRST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (4949) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  83) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   099   099   016    Pre-fail  Always       -       3
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   117   117   024    Pre-fail  Always       -       316 (Average 322)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       800
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       69
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       69
194 Temperature_Celsius     0x0002   130   130   000    Old_age   Always       -       46 (Lifetime Min/Max 19/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

> Also, do you not think it's a little odd that the only data corruption
> occurring for you are related to RRDtool?

Yes, that part I do find suspicious.

-- 
João

This message was scanned by the e-mail system and can be considered safe.
Service provided by Datacenter Matik https://datacenter.matik.com.br