Date:      Wed, 16 Aug 2006 03:28:27 +0200
From:      Johan Ström <johan@stromnet.org>
To:        freebsd-stable@freebsd.org
Subject:   Re: ATA problems again ... This time system froze!
Message-ID:  <A30712DF-85D8-4393-AA88-D45732147BFA@stromnet.org>
In-Reply-To: <0B43BAB0-BBF0-4E2C-875D-6E1E00BAB1D4@stromnet.org>
References:  <DAFCD4DC-D2D4-4574-ACBF-367D642D9729@stromnet.org>	<8D08DDB6-6AC1-45B6-B2CE-08782F54968A@stromnet.org>	<884C01BC-3E97-46EC-AA8B-E70C3931F3A4@stromnet.org>	<36895211-2796-4213-B336-6279AB3AC3CB@stromnet.org>	<20060713132357.Y61840@fledge.watson.org>	<44B7EA39.4060509@quip.cz> <6.2.3.4.0.20060716185019.12a29240@64.7.153.2> <44BBAF52.9080007@quip.cz> <0B43BAB0-BBF0-4E2C-875D-6E1E00BAB1D4@stromnet.org>


On Jul 28, 2006, at 13:15, Johan Ström wrote:

>
> On 17 jul 2006, at 17.40, Miroslav Lachman wrote:
>
>> Mike Tancsa wrote:
>> [..]
>>> Install the smartmontools from
>>> /usr/ports/sysutils/smartmontools/
>>> and post the output of
>>> smartctl -a /dev/ad8
>>
>> smartmontools was previously installed and running as a daemon
>> without any bad reports.
>> I cannot run "smartctl -a /dev/ad8" now, because my server
>> housing provider replaced the HDD with a new one, and after an hour
>> of synchronization I got "ad8: FAILURE - device detached". So the
>> provider replaced the whole server; only ad4 is an original piece of HW.
>> On the new server, synchronization was much faster than on the
>> previous server (1:30 hours compared to 5 hours), so I
>> think it was a HW problem.
>> Now I am running a stress test, copying /usr/ports to another
>> partition in an infinite loop.
>> I will post results later. (On the bad server, the test failed after
>> about 30 minutes. On another server the test has been running fine
>> for a second day, so I think if the disk does not fail after 1 day,
>> the problem is solved.)
>>
>> Finally, I now think this was not GEOM/gmirror related. I tried
>> removing the ad8 provider from gmirror (gm0), booting the system from
>> gm0 with one provider (ad4), and testing ad8 mounted separately. ad8
>> failed again.
>
> Just got another one..
>
> Jul 25 13:30:47 elfi kernel: ad4: FAILURE - device detached
> Jul 25 13:30:47 elfi kernel: subdisk4: detached
> Jul 25 13:30:47 elfi kernel: ad4: detached
> Jul 25 13:30:47 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
> Jul 25 13:30:47 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=46318008320, length=2048)]error = 6
> Jul 25 13:30:47 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=77269614592, length=16384)]error = 6
>
> 6 days of uptime when this occurred... Both disks have been tested with
> PowerMax without a single problem (same with smartctl), and both SATA
> cables are new. So the only hardware problem that I can't rule out
> would be the mobo, but that is quite new too...
>
> Solutions? Try RELENG_6 as recommended earlier?

Okay, still on 6.1-RELEASE:

FreeBSD elfi.stromnet.org 6.1-RELEASE FreeBSD 6.1-RELEASE #3: Tue May  9 20:40:23 CEST 2006     johan@elfi.stromnet.org:/usr/obj/usr/src/sys/GENERIC  i386

Uptime was approx. 12 days since the last reboot for the RAID fix... I
just got home to find a box which doesn't respond to SSH. The monitor
tells me it has crashed totally. From /var/log/messages:

Aug 16 00:58:37 elfi kernel: ad4: FAILURE - device detached
Aug 16 00:58:37 elfi kernel: subdisk4: detached
Aug 16 00:58:37 elfi kernel: ad4: detached
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Cannot write metadata on ad4s1 (device=gm0s1, error=6).
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
Aug 16 00:58:37 elfi last message repeated 2 times
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
Aug 16 00:58:37 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6
Aug 16 00:58:37 labdator kernel: nfs: server 192.168.1.2 not responding, still trying
Aug 16 00:58:37 labdator kernel: nfs: server 192.168.1.2 OK
Aug 16 03:04:21 elfi syslogd: kernel boot file is /boot/kernel/kernel
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325184512, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325200896, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325217280, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325233664, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325250048, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2319169536, length=2048)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2312404992, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: Copyright (c) 1992-2006 The FreeBSD Project.
Aug 16 03:04:21 elfi kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Aug 16 03:04:21 elfi kernel: The Regents of the University of California. All rights reserved.
Aug 16 03:04:21 elfi kernel: FreeBSD 6.1-RELEASE #3: Tue May  9 20:40:23 CEST 2006
...(regular boot stuff)...

(labdator is a box with an NFS export from elfi mounted)

dmesg shows me some other stuff not in messages:

ad4: FAILURE - device detached
subdisk4: detached
ad4: detached
GEOM_MIRROR: Cannot write metadata on ad4s1 (device=gm0s1, error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6
ad6: FAILURE - device detached
subdisk6: detached
ad6: detached
GEOM_MIRROR: Cannot write metadata on ad6s1 (device=gm0s1, error=6).
GEOM_MIRROR: Cannot update metadata on disk ad6s1 (error=6).
GEOM_MIRROR: Device gm0s1: provider ad6s1 disconnected.
GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 destroyed.
GEOM_MIRROR: Device gm0s1 destroyed.
g_vfs_done():mirror/gm0s1f[READ(offset=27868381184, length=32768)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324807680, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324824064, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324840448, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324856832, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324873216, length=16384)]error = 6
g_vfs_done():mirror/gm0s1f[READ(offset=17173594112, length=32768)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325184512, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325200896, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325217280, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325233664, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325250048, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2319169536, length=2048)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2312404992, length=16384)]error = 6
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
         The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE #3: Tue May  9 20:40:23 CEST 2006
(...boot..)
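
Every g_vfs_done() line above ends in error = 6, which is ENXIO on
FreeBSD ("Device not configured"): the filesystem kept issuing requests
to a mirror provider that was already gone. With this many lines, a tally
per provider and operation can make the pattern easier to see. The
following is only a rough Python sketch, not part of any FreeBSD tool.
The function name summarize and the sample text are assumptions made for
illustration, and the regular expression is written only against the
line format shown in this message.

```python
import errno
import re
from collections import Counter

# Matches g_vfs_done() lines in the form seen in the log above, e.g.
# g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6
G_VFS_DONE = re.compile(
    r"g_vfs_done\(\):(?P<provider>[^\[]+)\[(?P<op>READ|WRITE)"
    r"\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]"
    r"\s*error = (?P<error>\d+)"
)


def summarize(log_text):
    """Count failed requests per (provider, operation, errno name)."""
    counts = Counter()
    for match in G_VFS_DONE.finditer(log_text):
        code = int(match.group("error"))
        # Fall back to the raw number if the errno is unknown on this platform.
        name = errno.errorcode.get(code, str(code))
        counts[(match.group("provider"), match.group("op"), name)] += 1
    return counts


# Hypothetical sample: three lines copied from the dmesg output above.
sample = """\
g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324807680, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6
"""

if __name__ == "__main__":
    for (provider, op, name), n in sorted(summarize(sample).items()):
        print(f"{provider} {op} {name}: {n}")
```

Feeding it the full dmesg text instead of the sample would show which
partitions (gm0s1d, gm0s1f) were still being read from or written to
after both providers had detached.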

03:04 was when I got home; from other sources I've been told the box
died around 01:21 (IRC pinged out; maybe this was just logs that
failed to write to disk, which froze irssi or something).

OK, so this time it didn't just fail the RAID (which it has done
before; a reboot and it started to rebuild). This time it took the
whole box down with it. This is the first time it has happened since
I got that new motherboard (see the earlier thread).

Later in boot:

Aug 16 03:04:21 elfi kernel: ad4: 286188MB <Maxtor 7L300S0 BANC1G10> at ata2-master SATA150
Aug 16 03:04:21 elfi kernel: ad6: 286188MB <Maxtor 7L300S0 BANC1G10> at ata3-master SATA150
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1 created (id=4118114647).
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad4s1 detected.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6s1 detected.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Component ad4s1 (device gm0s1) broken, skipping.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6s1 activated.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 launched.

Usually, when the box has been rebooted before, the failed component
has been rebuilt automatically. This time I solved it with:

$ gmirror forget
$ gmirror insert gm0s1 ad4s1

And now it's rebuilding ad4 again...
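
For anyone who wants to script the same recovery, here is a hedged
Python sketch of the sequence. It assumes the mirror name gm0s1 is what
"gmirror forget" was run against (the command above is shown without an
argument), and it adds a "gmirror status" call to watch the rebuild. The
helper names reinsert_commands and run are made up for this example.

```python
import subprocess


def reinsert_commands(mirror, provider):
    """Return the gmirror invocations to drop stale components and re-add one."""
    return [
        ["gmirror", "forget", mirror],            # forget components that are not connected
        ["gmirror", "insert", mirror, provider],  # add the provider back to the mirror
        ["gmirror", "status", mirror],            # show mirror state and its components
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("#", " ".join(cmd))
        if not dry_run:
            # Needs root on a FreeBSD box with the gmirror class loaded.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing is changed on the system.
    run(reinsert_commands("gm0s1", "ad4s1"))
```

The dry-run default is deliberate: with only one healthy provider left,
it seems safer to read the commands before letting anything touch the
mirror.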

Any new hints? Should I try RELENG_6 instead?

Johan


