Date: Sat, 26 Nov 2005 16:00:18 GMT
From: Jan Srzednicki <w@wrzask.pl>
To: freebsd-bugs@FreeBSD.org
Subject: Re: kern/50201: [twe] 3ware RAID 5 resulting in data corruption
Message-ID: <200511261600.jAQG0Id5007684@freefall.freebsd.org>
The following reply was made to PR kern/50201; it has been noted by GNATS.

From: Jan Srzednicki <w@wrzask.pl>
To: bug-followup@freebsd.org, bruce@engmail.uwaterloo.ca, dpk@dpk.net
Cc:
Subject: Re: kern/50201: [twe] 3ware RAID 5 resulting in data corruption
Date: Sat, 26 Nov 2005 16:58:36 +0100

I'm experiencing a similar problem, though with a few notable differences.

First of all, I'm running FreeBSD 5.4-RELEASE (with RELENG_5_4 fixes) on my machine. Here's the part of my dmesg output related to the 3ware controller:

  [16:32] hostname:~ # dmesg | grep twe
  twe0: <3ware Storage Controller. Driver version 1.50.01.002> port 0xcc00-0xcc0f mem 0xfe000000-0xfe7fffff irq 21 at device 0.0 on pci2
  twe0: 8 ports, Firmware FE7X 1.05.00.065, BIOS BE7X 1.08.00.048
  twed0: <Unit 0, RAID5, Normal> on twe0
  twed0: 1192370MB (2441975040 sectors)

The controller is a 7000-class 8-way RAID controller with PATA interfaces.

I'm experiencing repeatable data corruption, but it was far more difficult to pin down. I'm using the array for backups, which I'm doing via ssh over the network (100Mbit ethernet) in the following way:

  dump | gzip | md5checker | network(ssh) | md5checker | split twe0/files

md5checker is my small utility that calculates the MD5 sum of each 1MB chunk of data piped through it. It assured me that the data corruption does not occur on the network, as the MD5 sums on both sides match. The total size of the backed-up data after gzipping is about 43GB.

The strange thing was that performing _the same_ backup in the following way:

  dump | gzip > file
  cat file | md5checker | network(ssh) | md5checker | split twe0/files

.. did not produce any errors (I repeated both "ways" several times, to make sure). Well, it appears that the data corruption is somehow related to the speed of the data transmission, as dump output is quite irregular and becomes rather slow when it hits a bunch of small files. The whole dump process takes about 6 hours.
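The md5checker utility itself is not included in this message. As a rough illustration only, a pass-through filter of that kind might look like the Python sketch below. The function names, the log format (chunk index followed by the hex digest) and the choice of Python are all assumptions made for this example, not details of the original tool.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB per checksum, as described in the message


def read_exact(stream, size):
    """Read up to `size` bytes, looping because pipes may return short reads."""
    parts = []
    remaining = size
    while remaining > 0:
        data = stream.read(remaining)
        if not data:  # end of stream
            break
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)


def md5_filter(src, dst, log, chunk_size=CHUNK_SIZE):
    """Copy src to dst unchanged and write one MD5 line per chunk to log.

    Hypothetical stand-in for md5checker. Returns the number of chunks seen.
    """
    index = 0
    while True:
        chunk = read_exact(src, chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        log.write(f"{index} {hashlib.md5(chunk).hexdigest()}\n")
        index += 1
    dst.flush()
    return index
```

To use this as a pipe stage, one would call `md5_filter(sys.stdin.buffer, sys.stdout.buffer, sys.stderr)` and redirect stderr to a file on each side of the ssh link; matching logs on both ends then indicate that the stream was not altered in transit.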

I tried dumping the data onto an IDE disk on the machine with the controller, which resulted in no errors. I also tried turning off softupdates on the filesystem on the 3ware array, with no effect. It clearly appears the data corruption is somehow related to the 3ware controller.

After some investigation, I've discovered the following facts:

- data is corrupted in exact 128kB chunks; the whole 128kB is bad and appears to be random (that is, I could not find any similar chunk in other files on the partition).

- errors are pretty rare; in the whole 43GB stream I'm getting about 3 or 4 errors.

- I'm not able to reproduce the data corruption locally. Things like:

  cat /dev/(zero|urandom) | md5checker | split array/files

  .. did not produce _any_ errors, after piping about a terabyte of data.

It also appears that turning off write-cache on the controller fixed the problem, but writes are very slow now.

I don't have another 3ware controller, so I cannot check whether this is a hardware issue with this particular one.

I'm of course willing to provide any feedback needed on that issue, but because of the duration of the process, testing is rather slow.
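
The message does not say how the 128kB corruption boundaries were located. One possible way to find them, sketched here under stated assumptions, is to compare a known-good copy of a file against the copy read back from the array, block by block. The function name and the use of ordinary binary file objects (which return full blocks until end of file) are assumptions of this example.

```python
BLOCK_SIZE = 128 * 1024  # 128 kB, the corruption granularity reported above


def differing_blocks(original, copy, block_size=BLOCK_SIZE):
    """Return the byte offsets of blocks that differ between two binary streams.

    Hypothetical helper; both arguments are file-like objects opened in
    binary mode that return full blocks until end of file.
    """
    bad_offsets = []
    offset = 0
    while True:
        a = original.read(block_size)
        b = copy.read(block_size)
        if not a and not b:  # both streams exhausted
            break
        if a != b:
            bad_offsets.append(offset)
        offset += block_size
    return bad_offsets
```

Offsets that are always multiples of 128kB, with the blocks on either side intact, would be consistent with the pattern described in the list above.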