Date:      Tue, 21 Aug 2007 14:15:08 +0200
From:      Johan Ström <johan@stromnet.se>
To:        freebsd-geom@freebsd.org, freebsd-stable@freebsd.org
Subject:   Crashed gmirror, single disk marked SYNC and wont boot...
Message-ID:  <8039436E-1824-4C2E-915B-9069DEF23B10@stromnet.se>

Hi

FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7:
Tue Feb 13 18:24:34 CET 2007
johan@elfi.stromnet.se:/usr/obj/usr/src/sys/ROUTER.POLLING  i386

(ROUTER.POLLING is GENERIC + options DEVICE_POLLING and ALTQ,
IPSEC, also pfsync and carp)

This weekend a disk failed on me in a machine running gmirror gm0
with 2 providers (ad0 and ad6). The whole box froze with no screen
output, and on hard reboot I got some LBA errors etc. from ad0. After
a few reboots it got up and running though (I wasn't at the screen and
had to do it by phone, so I couldn't really debug very well).
As soon as the box was up, I removed ad0 from the gmirror, so ad6 was
the only provider. Today I got a new disk to replace ad0.
Now remember, ad6 was the only disk in the mirror. I took the box down
cleanly and replaced the disk. ad0 was now gone and instead I had ad4
(ad4 and ad6 are SATA, ad0 was IDE). I changed the boot order so that I
booted off the old SATA disk.
Okay, here came the first problem: the boot loader gave me the usual
options, F1 FreeBSD, F5 Disk 2 (or whatever it said). If I pressed F1
I got the same prompt again, and F5 did nothing at all. Funny! The
system refused to load the loader (or whatever the 1-9 menu thingy is
called), the kernel, or anything.
So I finally plugged the old ad0 disk back into the machine to at least
get it booted, thinking it would come up on the gmirror. Nope:

(got the new ad4 out here)
ad0: 38166MB <WDC WD400BB-00CAA1 17.07W17> at ata0-master UDMA100
ad6: 152627MB <SAMSUNG HD160JJ ZM100-41> at ata3-master SATA150
GEOM_MIRROR: Device gm0 created (id=4029378995).
GEOM_MIRROR: Device gm0: provider ad6 detected.
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
GEOM_MIRROR: Force device gm0 start due to timeout.
Trying to mount root from ufs:/dev/mirror/gm0s1a

Manual root filesystem specification:
   <fstype>:<device>  Mount <device> using filesystem <fstype>
                        eg. ufs:da0s1a
   ?                  List valid disk boot devices
   <empty line>       Abort manual input

mountroot>

Okay... so why wouldn't it load my mirror from ad6 now? I had just done
a clean shutdown without problems. It didn't even recognize any slices
on ad6s1 (although ad6s1 itself was found).
I entered ad0s1 as root and booted from there. Of course I ended up in
the emergency shell, since fstab looked for the gmirror devices, which
didn't exist.

After some more digging into gmirror, I did a gmirror dump ad6:

Metadata on /dev/ad6:
      magic: GEOM::MIRROR
    version: 3
       name: gm0
        mid: 4029378995
        did: 449032193
        all: 3
      genid: 0
     syncid: 5
   priority: 0
      slice: 4096
    balance: round-robin
mediasize: 20416757248
sectorsize: 512
syncoffset: 0
     mflags: NONE
     dflags: SYNCHRONIZING
hcprovider:
   provsize: 160041885696
   MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f

Some googling indicated that SYNCHRONIZING means the disk is not
"complete" and so the mirror won't mount. Is that correct? Why would it
be in that state, when I had just shut it down cleanly? And where the
f*ck did my slices go?
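
For anyone who wants to check several disks for this flag, here is a
minimal sketch in Python. It assumes the "key: value" layout of the
gmirror dump output shown above; the names parse_gmirror_dump and
is_synchronizing are hypothetical helpers, not part of any FreeBSD
tool, and the sample text is just the dump from this message.

```python
def parse_gmirror_dump(text):
    """Parse the 'key: value' lines of `gmirror dump` output into a dict.

    Assumes the layout shown in this message; the real tool's output
    may differ between FreeBSD versions.
    """
    fields = {}
    for line in text.splitlines():
        # Skip the "Metadata on /dev/adX:" header line.
        if line.startswith("Metadata on"):
            continue
        # Split on the first colon only, so "GEOM::MIRROR" stays intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_synchronizing(fields):
    """Return True if the disk flags mention SYNCHRONIZING."""
    return "SYNCHRONIZING" in fields.get("dflags", "")


SAMPLE = """\
Metadata on /dev/ad6:
      magic: GEOM::MIRROR
    version: 3
       name: gm0
     syncid: 5
     mflags: NONE
     dflags: SYNCHRONIZING
hcprovider:
   provsize: 160041885696
   MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f
"""

meta = parse_gmirror_dump(SAMPLE)
print(meta["name"], meta["dflags"], is_synchronizing(meta))
```

This only reads the text of the dump. It does not touch the on-disk
metadata or change the state of the mirror.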

I set sysctl kern.geom.mirror.debug=2 and tried to gmirror activate
the mirror:

GEOM_MIRROR[1]: Creating device gm0 (id=4029378995).
GEOM_MIRROR[0]: Device gm0 created (id=4029378995).
GEOM_MIRROR[1]: root_mount_hold 0xc3539510
GEOM_MIRROR[1]: Adding disk ad6 to gm0.
GEOM_MIRROR[2]: Adding disk ad6.
GEOM_MIRROR[2]: Disk ad6 connected.
GEOM_MIRROR[1]: Disk ad6 state changed from NONE to NEW (device gm0).
GEOM_MIRROR[0]: Device gm0: provider ad6 detected.
GEOM_MIRROR[2]: Tasting ad6s1.
GEOM_MIRROR[0]: Force device gm0 start due to timeout.
GEOM_MIRROR[1]: root_mount_rel[2169] 0xc3539510
GEOM_MIRROR[2]: No I/O requests for gm0, it can be destroyed.
GEOM_MIRROR[2]: Metadata on ad6 updated.
GEOM_MIRROR[2]: Access ad6 r-1w-1e-1 = 0
GEOM_MIRROR[0]: Device gm0 destroyed.
GEOM_MIRROR[1]: Thread exiting.
GEOM_MIRROR[1]: Consumer ad6 destroyed.


So, what is going on here? Does anyone have any clues? I'm currently
running on the ad0 disk, with no RAID at all. Let's hope it doesn't die
on me (I haven't seen any signs of that since Sunday, when it froze and
gave boot errors, so I'm hoping). The data loss from using ad0 instead
of ad6 is probably minimal; it's a router, so it's more or less only
logging that seems to have been lost. For now I just want to get clear
about what happened here and how to prevent it, and how to get back up
on a gmirror with ad6 and ad4 (to be plugged in) so I can throw ad0
out.


Thanks

--
Johan Ström
Stromnet
johan@stromnet.se
http://www.stromnet.se/




