Date:      Fri, 2 Sep 2005 00:43:26 +0200 (CEST)
From:      Philip Paeps <philip@FreeBSD.org>
To:        FreeBSD-gnats-submit@FreeBSD.org
Cc:        apeiron+usenet@coitusmentis.info
Subject:   kern/85603: FS corruption and 'uncorrectable' DMA errors on ATA disks after unclean shutdown
Message-ID:  <200509012243.j81MhQDY035598@fasolt.home.paeps.cx>
Resent-Message-ID: <200509012250.j81MoI8E096836@freefall.freebsd.org>

>Number:         85603
>Category:       kern
>Synopsis:       FS corruption and 'uncorrectable' DMA errors on ATA disks after unclean shutdown
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Sep 01 22:50:18 GMT 2005
>Closed-Date:
>Last-Modified:
>Originator:     Philip Paeps
>Release:        FreeBSD 7.0-CURRENT i386
>Organization:
>Environment:
System: FreeBSD fasolt.home.paeps.cx 7.0-CURRENT FreeBSD 7.0-CURRENT #39: Sun
Aug 21 15:52:38 CEST 2005
philip@fasolt.home.paeps.cx:/usr/obj/usr/src/sys/FASOLT i386

>Description:
	
Recently, after a power failure, I experienced some inexplicable problems with
an ATA disk, which could quite possibly be due to hardware.  However, after
having experienced the same problems on a second disk, and discovering, in a
discussion on comp.unix.bsd.freebsd.misc, that others have seen the same sort
of issue, I've begun to suspect a kernel issue.

The first time I saw the problem, the machine initially came up fine, and I
could dirty-mount the filesystem and let bgfsck take care of things.  Soon
after the fsck began, the kernel started spewing out errors along the lines
of 'uncorrectable' and 'dma_read'.  Unfortunately, I've not managed to
reproduce the problem with a loggable console.

After a reboot, the filesystem on the disk refused to mount again.  A manually
forced fsck complained about unreadable sectors.  Again, the kernel
spewed out the 'uncorrectable' and 'dma_read' errors.

According to SMART, the disk is quite healthy, though some errors were logged
in its error log:

 | Error 387 occurred at disk power-on lifetime: 5315 hours (221 days + 11 hours)
 |   When the command that caused the error occurred, the device was in an unknown state.
 | 
 |   After command completion occurred, registers were:
 |   ER ST SC SN CL CH DH
 |   -- -- -- -- -- -- --
 |   40 51 10 80 00 00 e0  Error: UNC 16 sectors at LBA = 0x00000080 = 128
 | 
 |   Commands leading to the command that caused the error were:
 |   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 |   -- -- -- -- -- -- -- --  ----------------  --------------------
 |   c8 00 10 80 00 00 e0 08      00:09:49.792  READ DMA
 |   25 00 01 ff 87 bd 40 08      00:08:28.160  READ DMA EXT
 |   c8 00 02 00 00 00 e0 08      00:08:28.160  READ DMA
 |   c8 00 01 01 00 00 e0 08      00:08:28.160  READ DMA
 |   c8 00 01 00 00 00 e0 08      00:08:28.160  READ DMA

Four other errors were logged, differing only in error number (decrementing by
one each time - 387 386 385) and LBA address (similarly decrementing).
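
As a rough illustration of how the register dump in the SMART entry above maps
to "UNC 16 sectors at LBA = 128", here is a minimal Python sketch.  The bit
names are an assumption based on the classic ATA Error and Status register
layout, and the function names are hypothetical; this is not how smartctl
itself is implemented.

```python
# Assumed bit names from the classic ATA register layout (not from this PR).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF",
               0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}


def decode_flags(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if value & bit]


def decode_registers(line):
    """Decode an 'ER ST SC SN CL CH DH' line of hex bytes (28-bit LBA)."""
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in line.split()[:7])
    # 28-bit LBA: low nibble of DH is bits 27..24, then CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "error": decode_flags(er, ERROR_BITS),
        "status": decode_flags(st, STATUS_BITS),
        "sector_count": sc,
        "lba": lba,
        "lba_mode": bool(dh & 0x40),  # bit 6 of DH selects LBA addressing
    }


# Register values copied from error 387 in the SMART log above.
result = decode_registers("40 51 10 80 00 00 e0")
print(result["error"], result["sector_count"], hex(result["lba"]))
```

Under those assumptions, the error byte 0x40 decodes to the UNC
(uncorrectable data) bit, the sector count register holds 0x10 = 16, and the
SN/CL/CH/DH bytes combine to LBA 0x80 = 128, matching the line smartctl printed.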

The funny thing is that, after running newfs on the disk and restoring the
data, everything seems to be working well again.  The first disk I had this
problem with has now been under medium-heavy use again for over a month; the
second disk (see below) has been in use again for two weeks.

In the case of the second disk, the machine panicked shortly after starting the
bgfsck - unfortunately, I wasn't able to capture the panic.  Following the
panic, the machine refused to boot, with an LBA error 16 in the boot loader.

Trying to mount the filesystems on another machine, read-only, produced the
same 'uncorrectable' and 'dma_read' errors as seen on the first disk with the
problem.  Forcing fsck also caused the same errors as before.  Possibly an
unrelated issue: ls on some directories on the dirty-mounted read-only
filesystem sometimes worked, but cp'ing the files somewhere else panicked the
kernel.

Again with the second disk, newfs and restoring data made all work happily
again.  Not a trace of any dma_read errors or uncorrectable reads.

I realize there's not much hard debugging information here, but maybe this
makes sense to a filesystem or ATA guru.  I experienced the problems on
5.x-STABLE kernels from late May, and -CURRENT kernels from the middle of June
and July.  I've not seen problems since, but then, I've not had any power
failures either.

I'm happy to help debug this further, if indeed it's a software bug, and not
something with flaky hardware.  I've Cc'ed Christopher Nehren, who reported
similar issues on Usenet and suggested a PR be filed.  He might be able to add
more useful information.

For what it's worth, the disks were Maxtor 6Y200P0 and Maxtor 6E040L0 on a 
VIA 8235 UDMA133 controller and a VIA 8231 UDMA100 controller in my case.

>How-To-Repeat:
	
Lose power or panic the machine with a filesystem on an ATA disk and wait for
the phase of the moon and other elements of faith to be properly aligned.  I
have been able to reproduce the problem (and the 'working well after newfs')
three times by accident, never yet by force.

>Fix:

Hopefully! :-)
>Release-Note:
>Audit-Trail:
>Unformatted:


