Date:      Wed, 25 Apr 2007 09:41:43 +0200
From:      Johan Ström <johan@stromnet.se>
To:        freebsd-stable@freebsd.org
Subject:   ATA driver/gmirror problems, multiple boxes...
Message-ID:  <A5C0BBF4-8954-445B-B691-90358A2DA819@stromnet.se>

Hello,

I have a few boxes, elfi, crus and gw-1, running gmirror. These are
three completely different boxes, all running FreeBSD 6.1 or 6.2. They
all have multiple disks that are gmirrored: two of them are SATA-only
and one has a mirror between one SATA and one ATA disk.
Every now and then they all have problems with the mirrors, each in a
different way. elfi crashes most often, but it is also the one with the
most disk I/O, so that might be "expected" (not that it crashes, but
that it is the one crashing most often).
First, some HW spec:

elfi:
FreeBSD elfi.stromnet.se 6.2-RELEASE FreeBSD 6.2-RELEASE #9: Thu Jan 18 16:53:20 CET 2007     root@:/usr/obj/usr/src/sys/ELFI  i386
atapci1: <nVidia nForce3 Pro SATA150 controller> port 0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xdc00-0xdc0f,0xe000-0xe07f irq 21 at device 10.0 on pci0
ad4: 286187MB <Maxtor 7L300S0 BANC1G10> at ata2-master SATA150
ad6: 286187MB <Maxtor 7L300S0 BANC1G10> at ata3-master SATA150
Mirror gm0s1 consists of ad4+ad6

crus:
FreeBSD crus.stromnet.org 6.1-RELEASE FreeBSD 6.1-RELEASE #3: Tue May  9 20:40:23 CEST 2006     johan@elfi.stromnet.org:/usr/obj/usr/src/sys/GENERIC  i386
atapci1: <Promise PDC40518 SATA150 controller> port 0x7480-0x74ff,0x7800-0x78ff mem 0xfebdb000-0xfebdbfff,0xfebe0000-0xfebfffff irq 22 at device 14.0 on pci1
ad8: 305245MB <Seagate ST3320620AS 3.AAE> at ata4-master SATA150
ad12: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA150
Mirror gm1 consists of ad8+ad12

gw-1:
FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7: Tue Feb 13 18:24:34 CET 2007     johan@elfi.stromnet.se:/usr/obj/usr/src/sys/ROUTER.POLLING  i386
atapci0: <nVidia nForce2 Pro UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 9.0 on pci0
atapci1: <nVidia nForce2 Pro SATA150 controller> port 0xec00-0xec07,0xe880-0xe883,0xe800-0xe807,0xe480-0xe483,0x7f00-0x7f0f,0x7c00-0x7c7f irq 20 at device 11.
ad2: 38166MB <WDC WD400BB-00CAA1 17.07W17> at ata1-master UDMA100
ad6: 152627MB <SAMSUNG HD160JJ ZM100-41> at ata3-master SATA150
Mirror gm0 consists of ad6s1+ad2

A typical crash on elfi looks like this:
Apr 24 05:20:27 elfi kernel: ad6: FAILURE - device detached
Apr 24 05:20:27 elfi kernel: subdisk6: detached
Apr 24 05:20:27 elfi kernel: ad6: detached
Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.
Apr 24 05:20:27 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=16972791808, length=16384)]error = 6

This can happen at any time of day; this one was from around 5 in the
morning. To recover I have to reboot the box (a soft reboot works), and
the mirror then rebuilds once it is back up. Before the reboot,
atacontrol cannot find the disk at all. I have tried atacontrol reinit
and detach/attach, but neither helps.

A crash on crus can look like this:
Apr 23 13:45:49 crus kernel: ad8: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - READ_DMA48 UDMA ICRC error (retrying request) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: FAILURE - READ_DMA48 timed out LBA=566657039
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Request failed (error=5). ad8[READ(offset=290128403968, length=16384)]
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.
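
A quick cross-check of the numbers in the crus log above: the byte offset in the GEOM_MIRROR line and the LBA in the ata lines appear to describe the same request. This is only a sketch and assumes the usual 512-byte sectors, which the log itself does not state; the helper name is made up for the illustration.

```python
# Values copied from the crus log above.
SECTOR_SIZE = 512  # assumption: 512-byte sectors, not stated in the log

lba = 566657039        # from "READ_DMA48 timed out LBA=566657039"
offset = 290128403968  # from "ad8[READ(offset=290128403968, length=16384)]"
length = 16384         # request length in bytes, same log line


def offset_to_lba(byte_offset, sector_size=SECTOR_SIZE):
    """Convert a byte offset on the provider into a sector number."""
    return byte_offset // sector_size


same_request = offset_to_lba(offset) == lba
sectors = length // SECTOR_SIZE
print(same_request, sectors)  # True 32
```

Under that assumption, the request gmirror reports as failed is the same 32-sector read that the ata driver timed out on, not a second, independent error.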

On this box a gmirror forget followed by a gmirror insert is enough,
and it will happily rebuild the array.
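
The forget/insert recovery described above could be scripted roughly as follows. This is a sketch, not a tested procedure: the function names are invented for the example, the mirror and provider names (gm1, ad8) are taken from the crus log, and actually running the commands would need root on the affected box. By default it only prints what it would run.

```python
import subprocess


def rebuild_commands(mirror, provider):
    """Return the two gmirror commands used for the forget/insert recovery."""
    return [
        ["gmirror", "forget", mirror],            # drop the disconnected component
        ["gmirror", "insert", mirror, provider],  # re-add the disk, which starts a rebuild
    ]


def rebuild(mirror, provider, dry_run=True):
    """Print the commands (dry run) or execute them in order, stopping on failure."""
    cmds = rebuild_commands(mirror, provider)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds


cmds = rebuild("gm1", "ad8")  # dry run: prints the two command lines
```

The dry-run default is deliberate, since inserting the wrong provider into a mirror would overwrite it during the rebuild.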

The worst box is gw-1:
Apr 20 03:10:59 gw-1 kernel: ad2: timeout waiting to issue command
Apr 20 03:10:59 gw-1 kernel: ad2: error issuing WRITE_DMA command
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Request failed (error=5). ad2[WRITE(offset=37578448384, length=16384)]
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Device gm0: provider ad2 disconnected.
Apr 20 07:23:57 gw-1 syslogd: kernel boot file is /boot/kernel/kernel
Apr 20 07:23:57 gw-1 kernel: Copyright (c) 1992-2007 The FreeBSD Project.

Yes, it fails and then the whole box HANGS completely. No input is
possible at all, so I had to hard-reboot it with the button. Not good
at all. The disks that are now in elfi used to run in this machine, and
back then I had the same problem: disk problems -> total hang. That was
with SATA only; this time it appears to be a problem with the ATA disk
too?

I have never managed to force these crashes. They happen now and then,
at no regular interval, and I cannot reproduce them on demand. For elfi:
Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.
(I actually can't find any other entry in the logs, but judging from
IRC logs: March 28, March 12, Feb 13, Jan 22, Jan 18)

For crus:
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.
Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.
I think it has happened once more, but that's it.

For gw-1 it has luckily happened only once so far, at least with the
current install. It did have problems when the Maxtor disks were
running in it (I think it was on 6.0 back then).
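
To build a per-box history like the one above without grepping by hand, something along these lines might do. It is a sketch: the regular expression is written against the four disconnect lines quoted in this mail, and the function name is made up. On a real box the input would come from the syslog files (assumed here, not shown) rather than a hard-coded list.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected."
DISCONNECT_RE = re.compile(
    r"^(?P<date>\w{3}\s+\d+) \S+ (?P<host>\S+) kernel: "
    r"GEOM_MIRROR: Device (?P<mirror>\S+): provider (?P<disk>\S+) disconnected\."
)


def disconnects_by_box(lines):
    """Map (host, mirror, disk) to the list of dates a disconnect was logged."""
    events = defaultdict(list)
    for line in lines:
        m = DISCONNECT_RE.match(line.strip())
        if m:
            events[(m["host"], m["mirror"], m["disk"])].append(m["date"])
    return dict(events)


# The disconnect lines quoted in this mail.
sample = [
    "Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.",
    "Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
    "Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
    "Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Device gm0: provider ad2 disconnected.",
]

events = disconnects_by_box(sample)
for key, dates in sorted(events.items()):
    print(key, dates)
```

Keying on host, mirror and disk together makes it easy to see whether it is always the same disk that drops out of a given mirror.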

So: three different boxes with three different chipsets and three
different crash scenarios, but they all have problems. So where is the
actual problem? The HW? The chipset drivers? The gmirror code? I have
run SMART tests on the crashing disks, with no errors. I ran PowerMax
(Maxtor's own test program) on the Maxtor disks a while back, with no
problems. I have tried changing SATA cables on some of the disks, with
no difference.

Does anyone have any clue about what can be causing this? What is
most likely? How do we hunt this down?

Thank you.

Johan Ström
Stromnet
johan@stromnet.se
http://www.stromnet.se/




