Date:      Tue, 15 Jul 2008 12:40:19 -0400 (EDT)
From:      Mike Andrews <mandrews@fark.com>
To:        FreeBSD-gnats-submit@FreeBSD.org
Subject:   kern/125644: zfs unfixable fs errors caused panic when trying to destroy filesystem
Message-ID:  <20080715164019.72BAD3161@whiskey.fark.com>
Resent-Message-ID: <200807151700.m6FH0AvP009219@freefall.freebsd.org>


>Number:         125644
>Category:       kern
>Synopsis:       zfs unfixable fs errors caused panic when trying to destroy filesystem
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Jul 15 17:00:10 UTC 2008
>Closed-Date:
>Last-Modified:
>Originator:     Mike Andrews
>Release:        FreeBSD 7.0-STABLE amd64
>Organization:
Fark, Inc
>Environment:
System: FreeBSD whiskey.fark.com 7.0-STABLE FreeBSD 7.0-STABLE #21: Thu Jul 3 16:13:09 EDT 2008 mandrews@vodka.int.fark.com:/usr/obj/usr/src/sys/FARK64 amd64


	Supermicro PDSMi+, Core 2 Quad Q6600, 6 GB memory
	Two ST3250820AS/3.AAE connected to onboard ICH7 in AHCI mode

>Description:

	The root filesystem is a 4 GB gmirror of ad4s1a+ad6s1a.  Both drives also
	have 4 GB swap partitions, and the remaining space (ad4s1d+ad6s1d) is a
	mirrored zpool.
	Last Friday, these messages appeared:

Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: zpool I/O failure, zpool=whiskey error=86
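
	To see at a glance which devices and offsets are involved, the messages
	above can be tallied.  This is only a minimal sketch: the regular
	expression is inferred from these sample lines (not from any documented
	log format), and the function name tally_mismatches is made up for
	illustration.

```python
import re
from collections import Counter

# Pattern inferred from the "checksum mismatch" lines in this report.
# It is an assumption based on these samples, not a documented format.
LINE_RE = re.compile(
    r"ZFS: checksum mismatch, zpool=(?P<pool>\S+) "
    r"path=(?P<path>\S+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def tally_mismatches(lines):
    """Count checksum-mismatch messages per (device path, offset)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[(m.group("path"), int(m.group("offset")))] += 1
    return counts


# The first four lines of the log above, one per device/offset pair.
sample = [
    "Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688",
    "Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688",
    "Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688",
    "Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=85723949056 size=50688",
]

counts = tally_mismatches(sample)

# Group the distinct offsets seen on each device.
offsets_by_device = {}
for path, offset in counts:
    offsets_by_device.setdefault(path, set()).add(offset)

for path in sorted(offsets_by_device):
    print(path, sorted(offsets_by_device[path]))
```

	In the full log above, each of the four device/offset pairs appears four
	times, and both halves of the mirror report the same two offsets.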

	I did a scrub on the zpool to see if ZFS could correct the errors, and it
	said it could not.  However, only one file was damaged, and it was in an
	old snapshot I didn't care about:

whiskey# zpool status -v                                                                                                            
  pool: whiskey                                                                                                                     
 state: ONLINE                                                                                                                      
status: One or more devices has experienced an error resulting in data                                                              
        corruption.  Applications may be affected.                                                                                  
action: Restore the file in question if possible.  Otherwise restore the                                                            
        entire pool from backup.                                                                                                    
   see: http://www.sun.com/msg/ZFS-8000-8A                                                                                          
 scrub: scrub completed with 1 errors on Fri Jul 11 10:56:41 2008                                                                   
config:                                                                                                                             
                                                                                                                                    
        NAME        STATE     READ WRITE CKSUM                                                                                      
        whiskey     ONLINE       0     0     4                                                                                      
          mirror    ONLINE       0     0     4                                                                                      
            ad4s1d  ONLINE       0     0     8                                                                                      
            ad6s1d  ONLINE       0     0     8                                                                                      
                                                                                                                                    
errors: Permanent errors have been detected in the following files:                                                                 
                                                                                                                                    
        whiskey/home@monthly.1:<filename snipped; it was a small unneeded jpeg>

	Here's the issue: attempting to destroy that snapshot resulted in a panic:

whiskey# zfs destroy whiskey/home@monthly.1
panic: solaris assert: end <= sm->sm_start + sm->sm_size (0x14454c7000 <= 0x1400000000), file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line: 93
kernel trap 12 with interrupts disabled

cpuid = 1

KDB: enter: panic
[thread pid 208 tid 100152 ]
Stopped at      kdb_enter_why+0x3d:     movq    $0,0x40ba01(%rip)
db> bt
Tracing pid 208 tid 100152 td 0xffffff00025ae9f0
kdb_enter_why() at kdb_enter_why+0x3d
panic() at panic+0x16c
space_map_add() at space_map_add+0x227
metaslab_free_dva() at metaslab_free_dva+0xfe
metaslab_free() at metaslab_free+0x6e
zio_dva_free() at zio_dva_free+0x20
arc_free() at arc_free+0x10a
dsl_dataset_destroy_sync() at dsl_dataset_destroy_sync+0x2df
dsl_sync_task_group_sync() at dsl_sync_task_group_sync+0x13e
dsl_pool_sync() at dsl_pool_sync+0xc3
spa_sync() at spa_sync+0x38a
txg_sync_thread() at txg_sync_thread+0x129
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffffc77c9d30, rbp = 0 ---
db>
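
	The failed assertion compares the end of a range against the upper bound
	of a space map.  The arithmetic can be sketched using only the two values
	printed in the panic message.  The variable and function names here are
	illustrative, the separate sm_start and sm_size values are not known from
	the output, and reading "end" as the end of the range being freed is an
	assumption based on the backtrace (metaslab_free_dva calling
	space_map_add).

```python
# The two values printed by the failed assertion in the panic message.
end = 0x14454C7000        # left-hand side: "end" (assumed: end of freed range)
sm_limit = 0x1400000000   # right-hand side: sm->sm_start + sm->sm_size

GIB = 1 << 30


def within_space_map(range_end, limit):
    """The asserted condition: end <= sm_start + sm_size."""
    return range_end <= limit


# How far past the space map's upper bound the range extends.
overshoot = end - sm_limit

print("assertion holds:", within_space_map(end, sm_limit))
print("upper bound:", sm_limit // GIB, "GiB")
print("overshoot:", hex(overshoot), f"({overshoot / GIB:.2f} GiB)")
```

	So the range ends roughly 1 GiB beyond an 80 GiB boundary, which would be
	consistent with (but does not prove) a damaged block pointer being handed
	to the free path.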

	Unfortunately, I had to get the machine back up and running quickly, so I
	did not dd the corrupted disk to an image file for further analysis...  I
	just backed up all the files, wiped the zpool and recreated it, and restored,
	and have been running fine since.  So I'm not able to do any further
	troubleshooting than this.  I'm mostly filing this as an FYI/heads-up about
	what may (not?) have been a one-off quirk.

	I guess for me the curiosities are: how did the corruption happen in a way
	that affected both disks (errno.h says error 86 is "illegal byte sequence"...),
	and why did zfs panic over it instead of allowing the bad data to be deleted?

	This system has hw.ata.wc = 1, which is known to be dangerous with UFS2
	(this is safe for ZFS, though, right?  Uh... right?) :)  However, I'm pretty
	sure the machine has not lost power abruptly in a very long time so I don't
	think that was an issue.


>How-To-Repeat:

	No idea, unfortunately, unless scribbling a small chunk of /dev/random onto
	the middle of a zpool would do it :)
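
	For anyone who wants to try that, here is a hedged sketch of the
	"scribble" step.  It assumes a scratch pool built on a regular file (zpool
	accepts files as vdevs) rather than a real disk.  The function name and
	the chunk size are hypothetical, and whether this reproduces the panic is
	unknown.

```python
import os
import tempfile


def scribble(path, length=4096):
    """Overwrite `length` bytes in the middle of `path` with random data.

    Meant only for a throwaway, file-backed test image, never a real disk.
    Returns the byte offset where the overwrite started.
    """
    size = os.path.getsize(path)
    if length > size:
        raise ValueError("file is smaller than the chunk to overwrite")
    offset = (size - length) // 2
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(os.urandom(length))
        f.flush()
        os.fsync(f.fileno())
    return offset


# Demo on a 64 KiB file of zeros standing in for a test image.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(65536))
    image = tmp.name

where = scribble(image, length=4096)
with open(image, "rb") as f:
    data = f.read()
os.remove(image)

print(f"overwrote 4096 bytes at offset {where}")
```

	Note that a single overwrite like this damages only one copy.  In the case
	reported here, both halves of the mirror were bad at the same offsets, so
	each backing file of a mirrored test pool would presumably need the same
	treatment.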

>Fix:

	Back up, destroy, recreate, and restore the zpool.


>Release-Note:
>Audit-Trail:
>Unformatted:


