From: Johan Ström <johan@stromnet.se>
To: freebsd-stable@freebsd.org
Date: Wed, 25 Apr 2007 09:41:43 +0200
Subject: ATA driver/gmirror problems, multiple boxes...

Hello

I have a few boxes (elfi, crus and gw-1) running gmirror. These are three completely different boxes, but all are running FreeBSD 6.x (6.1 or 6.2). They all have multiple disks which are gmirrored; two of them are SATA-only and one has a mirror between one SATA and one ATA disk.

Every now and then each of them has problems with its mirror, all three in different ways. elfi crashes most often, but it is also the one with the most disk I/O, so that might be "expected" (not that it crashes, but that it is the one crashing most often).

First, some HW specs:

elfi:
FreeBSD elfi.stromnet.se 6.2-RELEASE FreeBSD 6.2-RELEASE #9: Thu Jan 18 16:53:20 CET 2007 root@:/usr/obj/usr/src/sys/ELFI i386
atapci1: port 0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xdc00-0xdc0f,0xe000-0xe07f irq 21 at device 10.0 on pci0
ad4: 286187MB at ata2-master SATA150
ad6: 286187MB at ata3-master SATA150
Mirror gm0s1 consists of ad4+ad6

crus:
FreeBSD crus.stromnet.org 6.1-RELEASE FreeBSD 6.1-RELEASE #3: Tue May 9 20:40:23 CEST 2006 johan@elfi.stromnet.org:/usr/obj/usr/src/sys/GENERIC i386
atapci1: port 0x7480-0x74ff,0x7800-0x78ff mem 0xfebdb000-0xfebdbfff,0xfebe0000-0xfebfffff irq 22 at device 14.0 on pci1
ad8: 305245MB at ata4-master SATA150
ad12: 305245MB at ata6-master SATA150
Mirror gm1 consists of ad8+ad12

gw-1:
FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7: Tue Feb 13 18:24:34 CET 2007 johan@elfi.stromnet.se:/usr/obj/usr/src/sys/ROUTER.POLLING i386
atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 9.0 on pci0
atapci1: port 0xec00-0xec07,0xe880-0xe883,0xe800-0xe807,0xe480-0xe483,0x7f00-0x7f0f,0x7c00-0x7c7f irq 20 at device 11.
ad2: 38166MB at ata1-master UDMA100
ad6: 152627MB at ata3-master SATA150
Mirror gm0 consists of ad6s1+ad2

A typical crash on elfi looks like this:

Apr 24 05:20:27 elfi kernel: ad6: FAILURE - device detached
Apr 24 05:20:27 elfi kernel: subdisk6: detached
Apr 24 05:20:27 elfi kernel: ad6: detached
Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.
Apr 24 05:20:27 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=16972791808, length=16384)]error = 6

This can happen at any time of day; this one was from around 5 in the morning. To recover I have to reboot the box (a soft reboot works), and the mirror then rebuilds once it is up. atacontrol cannot find the disk at all before rebooting. I've tried reinit and detach/attach, but it didn't help.

A crash on crus can look like this:

Apr 23 13:45:49 crus kernel: ad8: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - READ_DMA48 UDMA ICRC error (retrying request) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: FAILURE - READ_DMA48 timed out LBA=566657039
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Request failed (error=5). ad8[READ(offset=290128403968, length=16384)]
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.

On this box a gmirror forget followed by a gmirror insert is enough, and it will happily rebuild the array.

The worst box is gw-1:

Apr 20 03:10:59 gw-1 kernel: ad2: timeout waiting to issue command
Apr 20 03:10:59 gw-1 kernel: ad2: error issuing WRITE_DMA command
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Request failed (error=5). ad2[WRITE(offset=37578448384, length=16384)]
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Device gm0: provider ad2 disconnected.
Apr 20 07:23:57 gw-1 syslogd: kernel boot file is /boot/kernel/kernel
Apr 20 07:23:57 gw-1 kernel: Copyright (c) 1992-2007 The FreeBSD Project.

Yes, it fails and then the whole box HANGS completely. No input is possible at all; I had to hard-reboot it with the button. Not good at all. I previously ran the disks that are now in elfi in this machine, and at that time I had the same problem: disk problems -> total hang. That was with SATA only; this appears to be a problem with the ATA disk too?

I have never managed to force these crashes. They show up now and then, at no regular interval, and I can never reproduce them on demand.

For elfi:

Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.

(I actually can't find any other entry in the logs, but judging from IRC logs: March 28, March 12, Feb 13, Jan 22, Jan 18)

For crus:

Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.
Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.

I think it has happened once more, but that's it.

For gw-1 it's luckily only once so far, at least with the current install. It did have problems when the Maxtor disks were running in it (and I think it was 6.0 back then).

So: three different boxes, with three different chipsets, with three different crash scenarios. But they all have problems. So where is the actual problem? The HW? The chipset drivers? The gmirror code? I have run SMART tests on the crashing disks, no errors. I ran PowerMax (Maxtor's own test program) a while back on the Maxtor disks, no problems. I have tried changing SATA cables on some of the disks, no difference.

Does anyone have any clue about what can be causing this? What is most likely? How do we hunt this down?

Thank you.

Johan Ström
Stromnet
johan@stromnet.se
http://www.stromnet.se/
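
P.S. Since the disconnects are irregular and spread over three boxes, a per-box tally of the "provider ... disconnected" lines may help when comparing notes. Below is a minimal sketch in Python, not a tested tool: the function name tally_disconnects and the regular expression are made up for illustration, and it assumes the syslog line format shown in the excerpts above (one event per line, as it appears in the log file before mail wrapping).

```python
import re
from collections import defaultdict

# Assumed line format, taken from the log excerpts in this post:
#   Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.
DISCONNECT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"GEOM_MIRROR: Device (?P<mirror>\S+): provider (?P<provider>\S+)\s+disconnected\."
)


def tally_disconnects(lines):
    """Group disconnect timestamps by (host, mirror, provider).

    Hypothetical helper: lines that do not match the pattern are ignored.
    """
    events = defaultdict(list)
    for line in lines:
        match = DISCONNECT_RE.match(line.strip())
        if match:
            key = (match["host"], match["mirror"], match["provider"])
            events[key].append(match["stamp"])
    return dict(events)


if __name__ == "__main__":
    # Sample input copied from the excerpts above; in practice the lines
    # would be read from each box's syslog file instead.
    sample = [
        "Apr 24 05:20:27 elfi kernel: ad6: FAILURE - device detached",
        "Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.",
        "Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
        "Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
        "Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Device gm0: provider ad2 disconnected.",
    ]
    for (host, mirror, provider), stamps in tally_disconnects(sample).items():
        print(f"{host} {mirror} {provider}: {len(stamps)} disconnect(s) at {', '.join(stamps)}")
```

Running the same tally against the logs from all three boxes would at least show whether the disconnects cluster around particular times (for example nightly periodic jobs) or are spread at random.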