Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 17 Feb 2004 15:35:10 -0800 (PST)
From:      Don Lewis <truckman@FreeBSD.org>
To:        ma@dt.e-technik.uni-dortmund.de
Cc:        freebsd-stable@FreeBSD.org
Subject:   Re: ahc and massive ffs+softupdates corruption
Message-ID:  <200402172335.i1HNZB7E051322@gw.catspoiler.org>
In-Reply-To: <m38yj15m59.fsf@merlin.emma.line.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On 18 Feb, Matthias Andree wrote:
> Hi,
> 
> I have seen a SCSI timeout that preceded massive file system corruption
> some days ago. As this happened with the drive's write cache turned off
> and with ffs softupdates, I think it's worth reporting.
> 
> Hardware: Micropolis 4345WS (UWSCSI disk drive, 7200/min, device da0)
> attached to an Adaptec 2940UW Pro (ahc0, aic7880 chip); on the same bus
> there's a Yamaha CRW4416S CD writer (current firmware) and a Plextor
> PX-20TS CD-ROM (firmware 1.00).
> 
> This machine had a SCSI timeout problem on Friday Feb 6th and went down
> hard, suffering massive file system corruption on /var. At that time,
> the machine was running portupgrade -a. /var is using softupdates and
> uses default mount options. As said before, the drive's FWC enable was
> set to 0 in both the current and saved editions of mode page 8, and I
> wonder how such massive corruption can happen. I was under the
> impression that softupdates prevented any on-disk corruptions that
> require user intervention at fsck time. Given that the write cache was
> off, I am wondering if there are any ffs+softupdates or tagged command
> queueing bugs left (that might reorder writes - ordered tag forgotten or
> something).

The UNKNOWN FILE TYPE complains are a pretty good clue that a block
containing inodes got overwritten by garbage.  I've seen this sort of
thing happen if power to a drive fails.  It could also be caused by a
driver or firmware bug that causes data to get written to the wrong
place, or a cabling or termination problem that causes the drive to see
the wrong command.

> Here's the log, first the SCSI crash, then, on Feb 7, then excerpts from
> the reboot including fsck. / and /usr also suffered minor file system
> inconsistencies that "fsck -p" was able to fix automatically.
> 
> Feb  6 20:10:20 libertas /kernel: swap_pager: indefinite wait buffer: device: #da/0x20001, blkno: 4296, size: 24576
> Feb  6 20:10:41 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x0 - timed out
> Feb  6 20:13:12 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout
> Feb  6 20:13:12 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x21 - timed out
> Feb  6 20:13:21 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message buffer
> Feb  6 20:13:21 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x21 - timed out
> Feb  6 20:13:30 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
> Feb  6 20:13:30 libertas /kernel: ahc0: Issued Channel A Bus Reset. 12 SCBs aborted
> Feb  6 20:13:30 libertas /kernel: swap_pager: indefinite wait buffer: device: #da/0x20001, blkno: 4296, size: 24576
> Feb  6 20:13:30 libertas /kernel: swap_pager: indefinite wait buffer: device: #da/0x20001, blkno: 2056, size: 8192
> Feb  6 20:13:30 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x20 - timed out
> Feb  6 20:13:37 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout
> Feb  6 20:13:37 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
> Feb  6 20:13:44 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message buffer
> Feb  6 20:13:44 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
> Feb  6 20:13:51 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
> Feb  6 20:13:51 libertas /kernel: ahc0: Issued Channel A Bus Reset. 3 SCBs aborted
> Feb  6 20:13:51 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x22 - timed out
> Feb  6 20:13:58 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout
> Feb  6 20:13:58 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
> Feb  6 20:14:05 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message buffer
> Feb  6 20:14:05 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
> Feb  6 20:14:14 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
> Feb  6 20:14:14 libertas /kernel: ahc0: Issued Channel A Bus Reset. 4 SCBs aborted
> Feb  6 20:15:01 libertas /usr/sbin/cron[58193]: (root) CMD (/usr/libexec/atrun) 
> Feb  6 20:20:00 libertas /usr/sbin/cron[64038]: (root) CMD (/usr/libexec/atrun) 
> 
> Feb  7 12:37:03 libertas /kernel: ahc0: <Adaptec 2940 Pro Ultra SCSI adapter> port 0xe800-0xe8ff mem 0xe3020000-0xe3020fff irq 15 at devi
> ce 8.0 on pci0
> Feb  7 12:37:03 libertas /kernel: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
> 
> Feb  7 12:37:04 libertas /kernel: da0 at ahc0 bus 0 target 0 lun 0
> Feb  7 12:37:04 libertas /kernel: da0: <MICROP 4345WS x43h> Fixed Direct Access SCSI-2 device 
> Feb  7 12:37:04 libertas /kernel: da0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
> Feb  7 12:37:04 libertas /kernel: da0: 4340MB (8890029 512 byte sectors: 255H 63S/T 553C)
> 
> Feb  7 12:37:04 libertas /kernel: /dev/da0s1e: 
> Feb  7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=256
> Feb  7 12:37:04 libertas /kernel: 
> Feb  7 12:37:04 libertas /kernel: /dev/da0s1e: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY.
> Feb  7 12:37:04 libertas /kernel: THE FOLLOWING FILE SYSTEM HAD AN UNEXPECTED INCONSISTENCY:
> Feb  7 12:37:04 libertas /kernel: 
> Feb  7 12:37:04 libertas /kernel: /dev/da0s1e (/var)
> Feb  7 12:37:04 libertas /kernel: Automatic file system check failed . . . help!
> Feb  7 12:37:04 libertas /kernel: Enter full pathname of shell or RETURN for /bin/sh: 
> Feb  7 12:37:04 libertas /kernel: # 
> 
> Entered shell and typed "fsck":
> 
> Feb  7 12:37:04 libertas /kernel: ** /dev/da0s1e
> Feb  7 12:37:04 libertas /kernel: ** Last Mounted on /var
> Feb  7 12:37:04 libertas /kernel: ** Phase 1 - Check Blocks and Sizes
> 
> Feb  7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=256
> Feb  7 12:37:04 libertas /kernel: 
> Feb  7 12:37:04 libertas /kernel: UNEXPECTED SOFT UPDATE INCONSISTENCY
> Feb  7 12:37:04 libertas /kernel: 
> Feb  7 12:37:04 libertas /kernel: CLEAR? [yn] 
> Feb  7 12:37:04 libertas /kernel: 
> 
> Last 6 lines repeat with "I=" figures monotonically increasing, finally:
> 
> Feb  7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=264
> Feb  7 12:37:04 libertas /kernel: 
> Feb  7 12:37:04 libertas /kernel: UNEXPECTED SOFT UPDATE INCONSISTENCY
> Feb  7 12:37:04 libertas /kernel: 
> Feb  7 12:37:04 libertas /kernel: CLEAR? [yn] 
> Feb  7 12:37:04 libertas /kernel: 
> Feb  7 12:37:04 libertas /kernel: 
> Feb  7 12:37:04 libertas /kernel: UNREF FILE 
> 
> The log ends here, but isn't complete, I have agreed to reconnect dozens
> of files to lost+found after fsck asked, many but not all of them from
> /var/db/pkg.
> 



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200402172335.i1HNZB7E051322>