From owner-freebsd-bugs@FreeBSD.ORG Wed Mar 28 22:10:13 2007 Return-Path: X-Original-To: freebsd-bugs@hub.freebsd.org Delivered-To: freebsd-bugs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4CCE816A401 for ; Wed, 28 Mar 2007 22:10:13 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [69.147.83.40]) by mx1.freebsd.org (Postfix) with ESMTP id ED61A13C483 for ; Wed, 28 Mar 2007 22:10:12 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1]) by freefall.freebsd.org (8.13.4/8.13.4) with ESMTP id l2SMAC8r093318 for ; Wed, 28 Mar 2007 22:10:12 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.13.4/8.13.4/Submit) id l2SMACRf093317; Wed, 28 Mar 2007 22:10:12 GMT (envelope-from gnats) Date: Wed, 28 Mar 2007 22:10:12 GMT Message-Id: <200703282210.l2SMACRf093317@freefall.freebsd.org> To: freebsd-bugs@FreeBSD.org From: Heikki Suonsivu Cc: Subject: Re: kern/68978: [crash] firewire crashes with failing hard disk, loose pointers in kernel? X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: Heikki Suonsivu List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 28 Mar 2007 22:10:13 -0000 The following reply was made to PR kern/68978; it has been noted by GNATS. From: Heikki Suonsivu To: Harrison Grundy Cc: bug-followup@FreeBSD.org Subject: Re: kern/68978: [crash] firewire crashes with failing hard disk, loose pointers in kernel? Date: Thu, 29 Mar 2007 00:38:52 +0300 Harrison Grundy wrote: > As absurd as this might sound, can you confirm if this still occurs in 6.x? The external drives are still misbehaving, though the symptoms have been changing by time. I get panics after the umass gets confused. I was wrong in my original report, it was not just firewire, and not failing disk (unless all 9+ external drives I have used are bad). The same problem occurs with Maxtor, Lacie external disks, various models, and the problems occur both USB and firewire connected disks. This happens at least on two PentiumIII computers and one AMD64. I have not tested others. The more I load the disks, the more often this occurs. The rate of crashes on slow, but active computer with 5 drives is something like couple of 2-3 times per month, on passive fast but passive computer with 3 drives maybe every 2 months. I cannot see any difference between drives from different manufacturers, but I haven't done any scientific analysis :-/ I think I might have filed a new PR so it might a good idea to check for duplicates. Below are some excerpts from various symptoms. I keep getting these when USB or firewire disks are loaded a lot. When doing a large copy these pop into console at rate of every few seconds, at random-looking intervals. Mar 29 00:22:32 lelu kernel: Unknown service addr 0x0000:0x09aab800 WREQB(1) src=0xffc1 data=8 Mar 29 00:22:32 lelu kernel: Unknown service addr 0x0000:0x0ad03000 WREQB(1) src=0xffc1 data=8 Mar 29 00:22:37 lelu kernel: Unknown service addr 0x0000:0x0e37c000 WREQB(1) src=0xffc1 data=8 Mar 29 00:22:46 lelu kernel: Unknown service addr 0x0000:0x15fd6800 WREQB(1) src=0xffc1 data=8 Mar 29 00:22:46 lelu kernel: Unknown service addr 0x0000:0x0e716000 WREQB(1) src=0xffc1 data=8 Mar 29 00:22:47 lelu kernel: Unknown service addr 0x0000:0x0ab54000 WREQB(1) src=0xffc1 data=8 Mar 29 00:22:47 lelu kernel: Unknown service addr 0x0000:0x156e4000 WREQB(1) src=0xffc1 data=8 Mar 29 00:22:50 lelu kernel: Unknown service addr 0x0000:0x0df7a000 WREQB(1) src=0xffc1 data=8 Mar 29 00:22:57 lelu kernel: Unknown service addr 0x0000:0x10209000 WREQB(1) src=0xffc1 data=8 Mar 29 00:22:58 lelu kernel: Unknown service addr 0x0000:0x150f7000 WREQB(1) src=0xffc1 data=8 Mar 29 00:23:01 lelu kernel: Unknown service addr 0x0000:0x148a5800 WREQB(1) src=0xffc1 data=8 Mar 29 00:23:02 lelu kernel: Unknown service addr 0x0000:0x10f27000 WREQB(1) src=0xffc1 data=8 Mar 29 00:23:02 lelu kernel: Unknown service addr 0x0000:0x0702c000 WREQB(1) src=0xffc1 data=8 Then, at some point I might see Mar 5 12:18:47 lelu kernel: umass0: Invalid CSW: tag 145294 should be 145295 Mar 5 12:18:47 lelu kernel: umass0: Invalid CSW: tag 145295 should be 145296 Mar 5 12:18:48 lelu kernel: umass0: Invalid CSW: tag 145296 should be 145297 Mar 5 12:18:48 lelu kernel: umass0: Invalid CSW: tag 145297 should be 145298 Mar 5 12:18:49 lelu kernel: umass0: Invalid CSW: tag 145298 should be 145299 Mar 5 12:18:49 lelu kernel: g_vfs_done():da0s1e[READ(offset=150001205248, lengt h=16384)]error = 5 Mar 5 12:18:49 lelu kernel: g_vfs_done():da0s1e[READ(offset=150001221632, lengt h=16384)]error = 5 Mar 5 12:18:49 lelu kernel: g_vfs_done():da0s1e[WRITE(offset=114688, length=163 84)]error = 5 Mar 5 12:18:49 lelu kernel: umass0: Invalid CSW: tag 145299 should be 145300 Mar 5 12:18:50 lelu kernel: umass0: Invalid CSW: tag 145300 should be 145301 Mar 5 12:18:51 lelu kernel: umass0: Invalid CSW: tag 145301 should be 145302 Mar 5 12:18:51 lelu kernel: umass0: Invalid CSW: tag 145302 should be 145303 Mar 5 12:18:52 lelu kernel: umass0: Invalid CSW: tag 145303 should be 145304 Mar 5 12:18:52 lelu kernel: g_vfs_done():da0s1e[READ(offset=6209536, length=204 8)]error = 5 Mar 5 12:18:52 lelu kernel: umass0: Invalid CSW: tag 145304 should be 145305 Mar 5 12:18:53 lelu kernel: umass0: Invalid CSW: tag 145305 should be 145306 Mar 5 12:18:54 lelu kernel: umass0: Invalid CSW: tag 145308 should be 145309 Mar 5 12:18:54 lelu kernel: g_vfs_done():da0s1e[READ(offset=411775385600, lengt h=16384)]error = 5 Mar 5 12:18:54 lelu kernel: g_vfs_done():da0s1e[WRITE(offset=406895493120, leng th=16384)]error = 5 Mar 5 12:18:55 lelu kernel: umass0: Invalid CSW: tag 145309 should be 145310 Mar 5 12:22:10 lelu kernel: umass0: BBB reset failed, TIMEOUT Mar 5 12:23:15 lelu kernel: umass0: BBB bulk-in clear stall failed, TIMEOUT Mar 5 12:24:20 lelu kernel: umass0: BBB bulk-out clear stall failed, TIMEOUT Mar 5 12:26:30 lelu kernel: umass0: BBB reset failed, TIMEOUT Mar 5 12:27:35 lelu kernel: umass0: BBB bulk-in clear stall failed, TIMEOUT Mar 5 12:28:40 lelu kernel: umass0: BBB bulk-out clear stall failed, TIMEOUT Mar 5 12:30:50 lelu kernel: umass0: BBB reset failed, TIMEOUT Mar 5 12:31:55 lelu kernel: umass0: BBB bulk-in clear stall failed, TIMEOUT Mar 5 12:33:00 lelu kernel: umass0: BBB bulk-out clear stall failed, TIMEOUT Mar 5 12:35:10 lelu kernel: umass0: BBB reset failed, TIMEOUT Mar 5 12:36:15 lelu kernel: umass0: BBB bulk-in clear stall failed, TIMEOUT Mar 5 12:37:20 lelu kernel: umass0: BBB bulk-out clear stall failed, TIMEOUT ... Mar 5 21:43:15 lelu kernel: umass0: BBB bulk-out clear stall failed, TIMEOUT Mar 5 21:44:01 lelu kernel: umass0: BBB bulk-out clear stall failed, IOERROR Mar 5 21:44:01 lelu kernel: umass0: BBB reset failed, IOERROR Mar 5 21:44:01 lelu kernel: g_vfs_done():da0s1e[WRITE(offset=114688, length=163 84)]error = 5 Mar 5 21:44:01 lelu kernel: g_vfs_done():da0s1e[READ(offset=411775860736, lengt h=16384)]error = 5 Mar 5 21:44:01 lelu kernel: umass0: BBB bulk-out clear stall failed, IOERROR Mar 5 21:44:01 lelu kernel: umass0: BBB reset failed, IOERROR Mar 5 21:44:01 lelu kernel: umass0: BBB bulk-out clear stall failed, IOERROR Mar 5 21:44:01 lelu kernel: g_vfs_done():da0s1e[READ(offset=411775877120, lengt h=16384)]error = 5 Mar 5 21:44:01 lelu kernel: umass0: BBB reset failed, IOERROR (Sorry about wrapping) This usually ends up with panic, or "disk removed from system" thing (which also usually leads to panic when shutting down or rebooting). The panics look like coming from the filesystem code when a misbehaving disk disappears or locks up or gets confused, and filesystem code freaks out on something it does not expect. It might be triggered by buggy disks or USB/FW controllers, but if that would be true there are lots of bad hardware, as I have not yet found a single working combination. I have had 3 maxtor models and 3 Lacie models. Windows must get around this some way, as I have not seen people reporting this flaky operation under windows (though, it might be more difficult to tell the difference and people might use the drives less actively). Sometimes I also start getting various I/O errors, which disappear when resetting the system. fsck might start spewing out bad block messages, which disappear when resetting the computer and/or disks. Sometimes resetting the computer might not be enough, but also disks needs to be power cycled. This is rare (maybe couple of times a year). I not seen problems like these on Windows or Macs but the usage pattern is very different, mostly occasional backups. The drives causing grief above are native bsd formatted drives. I have not seen this on 2.5 inch external drives or other USB media, but I have never had one in active use like these drives are.