Date:      Wed, 16 Dec 2009 16:21:10 GMT
From:      Tom Payne <Tom.Payne@unige.ch>
To:        freebsd-gnats-submit@FreeBSD.org
Subject:   misc/141685: zfs corruption on adaptec 5805 raid controller
Message-ID:  <200912161621.nBGGLAF8035555@www.freebsd.org>
Resent-Message-ID: <200912161630.nBGGU1tN084593@freefall.freebsd.org>

>Number:         141685
>Category:       misc
>Synopsis:       zfs corruption on adaptec 5805 raid controller
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Dec 16 16:30:01 UTC 2009
>Closed-Date:
>Last-Modified:
>Originator:     Tom Payne
>Release:        8.0-RELEASE
>Organization:
ISDC
>Environment:
FreeBSD isdc3202.isdc.unige.ch 8.0-RELEASE FreeBSD 8.0-RELEASE #0: Sat Nov 21 15:02:08 UTC 2009     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

>Description:
Short version:

zfs on a new 5.44T volume exported by an Adaptec 5805 hardware RAID5 controller reports many checksum errors, although hardware-level tests (controller verify, memtest) report that the hardware is working correctly.

Long version:

I have an Adaptec RAID 5805 controller with eight 1TB SAS disks:

# dmesg | grep aac
aac0: <Adaptec RAID 5805> mem 0xfbc00000-0xfbdfffff irq 16 at device 0.0 on pci9
aac0: Enabling 64-bit address support
aac0: Enable Raw I/O
aac0: Enable 64-bit array
aac0: New comm. interface enabled
aac0: [ITHREAD]
aac0: Adaptec 5805, aac driver 2.0.0-1
aacp0: <SCSI Passthrough Bus> on aac0
aacp1: <SCSI Passthrough Bus> on aac0
aacp2: <SCSI Passthrough Bus> on aac0
aacd0: <RAID 5> on aac0
aacd0: 16370MB (33525760 sectors)
aacd1: <RAID 5> on aac0
aacd1: 6657011MB (13633558528 sectors)


It's configured with a small logical drive (aacd0) for the root filesystem; the rest (aacd1) is a single large zpool:
# zpool create tank aacd1
# zfs list | head -n 2
NAME                                       USED  AVAIL  REFER  MOUNTPOINT
tank                                       792G  5.44T    18K  none
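
(As a sanity check: aacd1 is 13633558528 sectors * 512 bytes, roughly 6.35 TiB, which is consistent with zfs's USED + AVAIL of 792G + 5.44T, roughly 6.21 TiB, once pool metadata overhead is allowed for, so zfs is seeing the whole device.)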


After a few days of light use (rsync'ing data from older disk servers), zfs reports many checksum errors:

# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 1h17m with 49 errors on Mon Dec 14 13:35:50 2009
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0    98
          aacd1     ONLINE       0     0   196

These 49 errors are in various files scattered across the 200+ zfs filesystems on the disk.
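
The individual affected files can be enumerated with zpool's verbose status (file list omitted here):

# zpool status -v tank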


/var/log/messages contains, for example:
# grep ZFS /var/log/messages
Dec 14 13:23:50 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=79622307840 size=131072
Dec 14 13:23:50 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=79622307840 size=131072
Dec 14 13:23:50 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86
Dec 14 13:27:47 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=77752696832 size=131072
Dec 14 13:27:47 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=77752696832 size=131072
Dec 14 13:27:47 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86
Dec 14 13:28:07 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=1409111293952 size=131072
Dec 14 13:28:07 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=1409111293952 size=131072
Dec 14 13:28:07 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86


The 49 checksum errors occur at 49 distinct offsets, falling into three ranges (error count per range in parentheses):
  70743228416..  84649705472 ( 6)
1406828281856..1441780858880 (14)
2749871030272..2817199702016 (29)
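
Offsets like these can be extracted from the log with a pipeline along these lines (a sketch; each mismatch is logged twice, hence the uniq):

# grep 'ZFS: checksum mismatch' /var/log/messages | \
    sed 's/.*offset=\([0-9]*\).*/\1/' | sort -n | uniq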


The Adaptec controller firmware was updated to the latest version (at the time of writing) after the first errors were observed.  More errors have been observed since the update.
# arcconf getversion
Controllers found: 1
Controller #1
==============
Firmware           : 5.2-0 (17544)
Staged Firmware    : 5.2-0 (17544)
BIOS               : 5.2-0 (17544)
Driver             : 5.2-0 (17544)
Boot Flash         : 5.2-0 (17544)


I ran a verify task on the RAID controller with
# arcconf task start 1 logicaldrive 1 verify noprompt
As far as I can tell, this verify task did not find any errors: the array status is still reported as "optimal", and there appears to be nothing relevant in the logs.
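
Task status and the controller's device logs can be queried with commands along these lines (the exact subcommand syntax may differ between arcconf versions):

# arcconf getstatus 1
# arcconf getlogs 1 device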


A 24-hour memory test with memtest86+ version 4.00 did not detect any memory errors.


Similar problems have previously been reported with zfs on USB drives:
http://lists.freebsd.org/pipermail/freebsd-current/2009-April/005510.html


As I understand it, the situation is:
- zfs reports checksum errors
- the hardware RAID controller believes that the data on disk is consistent
- there are no obvious memory problems
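
One further experiment that might narrow this down: the logged offsets are all multiples of the 131072-byte block size (e.g. 79622307840 = 607470 * 131072), so a failing block can be re-read straight from the raw device and checksummed twice to see whether the reads are even stable:

# dd if=/dev/aacd1 bs=131072 skip=607470 count=1 2>/dev/null | md5
# dd if=/dev/aacd1 bs=131072 skip=607470 count=1 2>/dev/null | md5

If the two digests differ, the controller is returning inconsistent data on read; if they match, the bad data was most likely written that way.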


Could this be a FreeBSD bug?

>How-To-Repeat:
Unknown
>Fix:
Unknown

>Release-Note:
>Audit-Trail:
>Unformatted:


