From owner-freebsd-stable@FreeBSD.ORG Tue Feb 17 15:36:36 2004 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C460516A4CE for ; Tue, 17 Feb 2004 15:36:36 -0800 (PST) Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 38C9943D1F for ; Tue, 17 Feb 2004 15:36:36 -0800 (PST) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.12.9p2/8.12.9) with ESMTP id i1HNZB7E051322; Tue, 17 Feb 2004 15:35:14 -0800 (PST) (envelope-from truckman@FreeBSD.org) Message-Id: <200402172335.i1HNZB7E051322@gw.catspoiler.org> Date: Tue, 17 Feb 2004 15:35:10 -0800 (PST) From: Don Lewis To: ma@dt.e-technik.uni-dortmund.de In-Reply-To: MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii cc: freebsd-stable@FreeBSD.org Subject: Re: ahc and massive ffs+softupdates corruption X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Feb 2004 23:36:36 -0000 On 18 Feb, Matthias Andree wrote: > Hi, > > I have seen a SCSI timeout that preceded massive file system corruption > some days ago. As this happened with the drive's write cache turned off > and with ffs softupdates, I think it's worth reporting. > > Hardware: Micropolis 4345WS (UWSCSI disk drive, 7200/min, device da0) > attached to an Adaptec 2940UW Pro (ahc0, aic7880 chip); on the same bus > there's a Yamaha CRW4416S CD writer (current firmware) and a Plextor > PX-20TS CD-ROM (firmware 1.00). > > This machine had a SCSI timeout problem on Friday Feb 6th and went down > hard, suffering massive file system corruption on /var. At that time, > the machine was running portupgrade -a. /var is using softupdates and > uses default mount options. As said before, the drive's FWC enable was > set to 0 in both the current and saved editions of mode page 8, and I > wonder how such massive corruption can happen. I was under the > impression that softupdates prevented any on-disk corruptions that > require user intervention at fsck time. Given that the write cache was > off, I am wondering if there are any ffs+softupdates or tagged command > queueing bugs left (that might reorder writes - ordered tag forgotten or > something). The UNKNOWN FILE TYPE complains are a pretty good clue that a block containing inodes got overwritten by garbage. I've seen this sort of thing happen if power to a drive fails. It could also be caused by a driver or firmware bug that causes data to get written to the wrong place, or a cabling or termination problem that causes the drive to see the wrong command. > Here's the log, first the SCSI crash, then, on Feb 7, then excerpts from > the reboot including fsck. / and /usr also suffered minor file system > inconsistencies that "fsck -p" was able to fix automatically. > > Feb 6 20:10:20 libertas /kernel: swap_pager: indefinite wait buffer: device: #da/0x20001, blkno: 4296, size: 24576 > Feb 6 20:10:41 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x0 - timed out > Feb 6 20:13:12 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout > Feb 6 20:13:12 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x21 - timed out > Feb 6 20:13:21 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message buffer > Feb 6 20:13:21 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x21 - timed out > Feb 6 20:13:30 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b > Feb 6 20:13:30 libertas /kernel: ahc0: Issued Channel A Bus Reset. 12 SCBs aborted > Feb 6 20:13:30 libertas /kernel: swap_pager: indefinite wait buffer: device: #da/0x20001, blkno: 4296, size: 24576 > Feb 6 20:13:30 libertas /kernel: swap_pager: indefinite wait buffer: device: #da/0x20001, blkno: 2056, size: 8192 > Feb 6 20:13:30 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x20 - timed out > Feb 6 20:13:37 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout > Feb 6 20:13:37 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out > Feb 6 20:13:44 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message buffer > Feb 6 20:13:44 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out > Feb 6 20:13:51 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b > Feb 6 20:13:51 libertas /kernel: ahc0: Issued Channel A Bus Reset. 3 SCBs aborted > Feb 6 20:13:51 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x22 - timed out > Feb 6 20:13:58 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout > Feb 6 20:13:58 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out > Feb 6 20:14:05 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message buffer > Feb 6 20:14:05 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out > Feb 6 20:14:14 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b > Feb 6 20:14:14 libertas /kernel: ahc0: Issued Channel A Bus Reset. 4 SCBs aborted > Feb 6 20:15:01 libertas /usr/sbin/cron[58193]: (root) CMD (/usr/libexec/atrun) > Feb 6 20:20:00 libertas /usr/sbin/cron[64038]: (root) CMD (/usr/libexec/atrun) > > Feb 7 12:37:03 libertas /kernel: ahc0: port 0xe800-0xe8ff mem 0xe3020000-0xe3020fff irq 15 at devi > ce 8.0 on pci0 > Feb 7 12:37:03 libertas /kernel: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs > > Feb 7 12:37:04 libertas /kernel: da0 at ahc0 bus 0 target 0 lun 0 > Feb 7 12:37:04 libertas /kernel: da0: Fixed Direct Access SCSI-2 device > Feb 7 12:37:04 libertas /kernel: da0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled > Feb 7 12:37:04 libertas /kernel: da0: 4340MB (8890029 512 byte sectors: 255H 63S/T 553C) > > Feb 7 12:37:04 libertas /kernel: /dev/da0s1e: > Feb 7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=256 > Feb 7 12:37:04 libertas /kernel: > Feb 7 12:37:04 libertas /kernel: /dev/da0s1e: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY. > Feb 7 12:37:04 libertas /kernel: THE FOLLOWING FILE SYSTEM HAD AN UNEXPECTED INCONSISTENCY: > Feb 7 12:37:04 libertas /kernel: > Feb 7 12:37:04 libertas /kernel: /dev/da0s1e (/var) > Feb 7 12:37:04 libertas /kernel: Automatic file system check failed . . . help! > Feb 7 12:37:04 libertas /kernel: Enter full pathname of shell or RETURN for /bin/sh: > Feb 7 12:37:04 libertas /kernel: # > > Entered shell and typed "fsck": > > Feb 7 12:37:04 libertas /kernel: ** /dev/da0s1e > Feb 7 12:37:04 libertas /kernel: ** Last Mounted on /var > Feb 7 12:37:04 libertas /kernel: ** Phase 1 - Check Blocks and Sizes > > Feb 7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=256 > Feb 7 12:37:04 libertas /kernel: > Feb 7 12:37:04 libertas /kernel: UNEXPECTED SOFT UPDATE INCONSISTENCY > Feb 7 12:37:04 libertas /kernel: > Feb 7 12:37:04 libertas /kernel: CLEAR? [yn] > Feb 7 12:37:04 libertas /kernel: > > Last 6 lines repeat with "I=" figures monotonically increasing, finally: > > Feb 7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=264 > Feb 7 12:37:04 libertas /kernel: > Feb 7 12:37:04 libertas /kernel: UNEXPECTED SOFT UPDATE INCONSISTENCY > Feb 7 12:37:04 libertas /kernel: > Feb 7 12:37:04 libertas /kernel: CLEAR? [yn] > Feb 7 12:37:04 libertas /kernel: > Feb 7 12:37:04 libertas /kernel: > Feb 7 12:37:04 libertas /kernel: UNREF FILE > > The log ends here, but isn't complete, I have agreed to reconnect dozens > of files to lost+found after fsck asked, many but not all of them from > /var/db/pkg. >