Date: Tue, 31 May 2011 11:25:56 +0200
From: Olaf Seibert <O.Seibert@cs.ru.nl>
To: Dan Nelson <dnelson@allantgroup.com>
Cc: freebsd-stable@freebsd.org, Olaf Seibert <O.Seibert@cs.ru.nl>, Jeremy Chadwick <freebsd@jdc.parodius.com>
Subject: Re: ZFS I/O errors
Message-ID: <20110531092556.GD6733@twoquid.cs.ru.nl>
In-Reply-To: <20110530171909.GE6688@dan.emsphone.com>
References: <20110530093546.GX6733@twoquid.cs.ru.nl> <20110530101051.GA49825@twoquid.cs.ru.nl> <20110530103349.GA73825@icarus.home.lan> <20110530110946.GC6733@twoquid.cs.ru.nl> <20110530171909.GE6688@dan.emsphone.com>
On Mon 30 May 2011 at 12:19:10 -0500, Dan Nelson wrote:
> The ZFS compression code will panic if it can't allocate the buffer needed
> to store the compressed data, so that's unlikely to be your problem.  The
> only time I have seen an "illegal byte sequence" error was when trying to
> copy raw disk images containing ZFS pools to different disks, and the
> destination disk was a different size than the original.  I wasn't even able
> to import the pool in that case, though.

Yet somehow some incorrect data got written, it seems. Fortunately that never
happened before, even though we have had crashes before that seemed to be
related to ZFS running out of memory.

> The zfs IO code overloads the EILSEQ error code and uses it as a "checksum
> error" code.  Returning that error for the same block on all disks is
> definitely weird.  Could you have run a partitioning tool, or some other
> program that would have done direct writes to all of your component disks?

I hope I would remember doing that if I did!

> Your scrub is also a bit worrying - 24k checksum errors definitely shouldn't
> occur during normal usage.

It turns out that the errors are easy to provoke: they happen every time I do
an ls of the affected directories. There were processes running that were
likely trying to write to those same directories (the file system is exported
over NFS), so it is easy to imagine that the numbers racked up quickly.

I moved those directories aside for the moment, but I haven't been able to
delete them yet. The data is a bit bigger than we are able to back up, so
"just restoring a backup" isn't an easy thing to do. Possibly I could make a
new filesystem in the same pool, if that would do the trick; the pool isn't
more than 50% full, but the affected filesystem is the biggest one in it.

The end result of the scrub is as follows:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 12h56m with 3 errors on Mon May 30 23:56:47 2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0 6.38K
          raidz2    ONLINE       0     0 25.4K
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/vol-fourquid-1:<0x0>
        tank/vol-fourquid-1@saturday:<0x0>
        /tank/vol-fourquid-1/.zfs/snapshot/saturday/backups/dumps/dump_usr_friday.dump
        /tank/vol-fourquid-1/.zfs/snapshot/saturday/sverberne/CLEF-IP11/parts_abs+desc
        /tank/vol-fourquid-1/.zfs/snapshot/sunday/sverberne/CLEF-IP11/parts_abs+desc
        /tank/vol-fourquid-1/.zfs/snapshot/monday/sverberne/CLEF-IP11/parts_abs+desc

-Olaf.
-- 
Pipe rene = new PipePicture(); assert(Not rene.GetType().Equals(Pipe));
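[For readers following this thread: the "new filesystem in the same pool" idea mentioned above could look roughly like the sketch below. The pool and old filesystem names are taken from the zpool output; the name vol-fourquid-2 and the rsync invocation are hypothetical, not commands from the thread. The plan is printed rather than executed, since these steps are destructive on a live pool.]

```shell
#!/bin/sh
# Sketch of migrating data off a filesystem with permanent errors by
# creating a sibling filesystem in the same pool, copying the readable
# data, and destroying the damaged filesystem afterwards.
POOL=tank
OLD=vol-fourquid-1
NEW=vol-fourquid-2    # hypothetical name for the replacement filesystem

# Build the plan as text and print it instead of running it.
plan="zfs create ${POOL}/${NEW}
rsync -aH /${POOL}/${OLD}/ /${POOL}/${NEW}/
zfs destroy -r ${POOL}/${OLD}"
echo "$plan"
```

Note that `zfs destroy -r` also removes the snapshots (saturday/sunday/monday) that hold the other copies of the corrupted files, so anything worth keeping would have to be copied out first.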