Date:      Mon, 20 Mar 2000 01:31:32 -0500 (EST)
From:      cjohnson@camelot.com
To:        FreeBSD-gnats-submit@freebsd.org
Subject:   kern/17499: Can't revive VINUM RAID5
Message-ID:  <20000320063132.3D74212C30@galahad.camelot.com>


>Number:         17499
>Category:       kern
>Synopsis:       Can't revive VINUM RAID5
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Mar 19 22:40:01 PST 2000
>Closed-Date:
>Last-Modified:
>Originator:     Christopher T. Johnson
>Release:        FreeBSD 4.0-RELEASE i386
>Organization:
Paladin Software
>Environment:
	SMP FreeBSD 4.0-RELEASE
	128Mbytes of Memory
	Dual 233MMX Pentiums

su-2.03#  camcontrol devlist
<QUANTUM FIREBALL ST4.3S 0F0C>     at scbus0 target 0 lun 0 (pass0,da0)
<EXABYTE EXB-8200 251K>            at scbus0 target 4 lun 0 (pass1,sa0)
<TOSHIBA CD-ROM XM-5701TA 0167>    at scbus0 target 5 lun 0 (pass2,cd0)
<IMS CDD2000/00 1.26>              at scbus0 target 6 lun 0 (pass3,cd1)
<FUJITSU M2954Q-512 0177>          at scbus1 target 0 lun 0 (pass4,da1)
<SEAGATE ST39140W 1281>            at scbus1 target 1 lun 0 (pass5,da2)
<SEAGATE ST43400N 1022>            at scbus1 target 2 lun 0 (pass6,da3)
<SEAGATE ST43400N 1028>            at scbus1 target 3 lun 0 (pass7,da4)
<QUANTUM FIREBALL_TM2110S 300X>    at scbus1 target 5 lun 0 (pass8,da5)
<SEAGATE ST43400N 1028>            at scbus1 target 6 lun 0 (pass9,da6)

su-2.03# vinum list
5 drives:
D drv3                  State: up       Device /dev/da1s1e      Avail: 1372/4149 MB (33%)
D drv1                  State: up       Device /dev/da3s1a      Avail: 0/2777 MB (0%)
D drv2                  State: up       Device /dev/da4s1a      Avail: 0/2777 MB (0%)
D drv4                  State: up       Device /dev/da5s1e      Avail: 0/2014 MB (0%)
D drv5                  State: up       Device /dev/da6s1e      Avail: 0/2776 MB (0%)

2 volumes:
V quick                 State: up       Plexes:       1 Size:       2013 MB
V myraid                State: up       Plexes:       1 Size:       8329 MB

2 plexes:
P quick.p0            C State: up       Subdisks:     1 Size:       2013 MB
P myraid.p0          R5 State: degraded Subdisks:     4 Size:       8329 MB

5 subdisks:
S quick.p0.s0           State: up       PO:        0  B Size:       2013 MB
S myraid.p0.s0          State: up       PO:        0  B Size:       2776 MB
S myraid.p0.s1          State: up       PO:      128 kB Size:       2776 MB
S myraid.p0.s2          State: R 57%    PO:      256 kB Size:       2776 MB
                        *** Revive process for myraid.p0.s2 has died ***
S myraid.p0.s3          State: up       PO:      384 kB Size:       2776 MB

>Description:
After replacing a bad drive in my RAID5 I've been unable to get the SubDisk
to revive.

After reinstalling the disk drive I used "fdisk -I da1" to put a slice map
on the drive.  When "disklabel -w -r da1s1 auto" failed, I used sysinstall
to put a disklabel on the drive, then used "disklabel -e da1s1" to add an
"e" partition of type vinum (da1s1e).

With that done I unmounted all the vinum filesystems and did "vinum stop"
followed by "vinum start".  Once vinum discovered the new drive, I used
"vinum create" to add it with "drive name drv3 device /dev/da1s1e".  At that
point drv3 came up and I did "start myraid.p0.s2" to start the revive
(sketched below).
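
In shell form the sequence was roughly as follows (the mount points and the
config file name are made up; the drive line is the one quoted above, give
or take exact syntax):

	umount /myraid /quick                    # mount points made up
	vinum stop
	vinum start                              # vinum rescans and finds the new drive
	echo 'drive drv3 device /dev/da1s1e' > /tmp/drv3.conf
	vinum create /tmp/drv3.conf
	vinum start myraid.p0.s2                 # kick off the revive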

After about 30 minutes, vinum reported:
	 can't revive myraid.p0.s2: Invalid argument
When I tried to restart the revive with another "start myraid.p0.s2", the
start locked up trying to get a "vlock" or "vrlock" (according to ctrl-T).
I was unable to interrupt vinum with ctrl-C, suspend it with ctrl-Z, or
kill it with SIGKILL.  After a shutdown -r plus pressing the reset button,
I did a low-level format on the drive and reinstalled the slice and
partition maps.  I started vinum again and hit the same failure mode.
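
(For anyone trying to reproduce this: the wait channel of the wedged
process can also be read without ctrl-T, for example:)

	ps -axlww | grep vinum        # the wait-channel column shows what the hung start is sleeping on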

This time a "vinum stop" and "vinum start" got things running again after
the failure.  The drive was then tested with:
	dd if=/dev/zero of=/dev/da1 bs=1024k

This generated NO errors.
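
(That only exercises writes; a read-back pass would cover reads as well, e.g.:)

	dd if=/dev/da1 of=/dev/null bs=1024k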

I reinstalled the slice and partitions, restarted vinum, and started the
revive again.  Same failure at roughly the same point.  I tried again with
a different block size and got the same failure at roughly the same point.

After reading the source for a while I checked whether debugging was
compiled in, and since it was, I turned on all the debug flags except
"big drive" and "drop into the debugger": "debug 635(?)", then
"start -w myraid.p0.s2".  After verifying that it was logging I aborted the
revive, set "debug 0", and started reviving again.  When we were close to
the failure point I stopped the revive, set the debug flags again, and
started the revive once more.
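
In shell form, that was roughly:

	vinum debug 635                # all flags except "big drive" and the debugger entry
	vinum start -w myraid.p0.s2    # confirmed it was logging, then aborted the revive
	vinum debug 0
	vinum start myraid.p0.s2       # ran most of the revive without logging
	# stopped shortly before the failure point, then:
	vinum debug 635
	vinum start myraid.p0.s2       # final run with logging; output below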

This looks correct: dev 91.1 is /dev/vinum/myraid, the offset looks good,
and the length is correct as well.  The 13.131104 looks "ok", though I don't
know if it is "right".  The offsets look good and the sd numbers are good.
We read the three "good" drives and write to the "reviving" drive.
Everything is fine here.

Mar 19 19:18:21 galahad /kernel: Read dev 91.1, offset 0x96c780, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324280,
	devoffset 0x324389, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324280,
	devoffset 0x324389, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324280,
	devoffset 0x324389, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324280,
	devoffset 0x324389, length 65536
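
A quick check on the arithmetic (taking the logged offsets as 512-byte sectors):

	printf '%d\n' $(( (0x324300 - 0x324280) * 512 ))   # 65536: subdisk offsets step by exactly one request length
	printf '%d\n' $(( 0x324389 - 0x324280 ))           # 265: the devoffset-offset gap, constant on every line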


The next block looks good too.  The offset goes up by 64 kB, which is right.

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ca00, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324300,
	devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324300,
	devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324300,
	devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324300,
	devoffset 0x324409, length 65536

And again, everything looks good.

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ca80, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324380,
	devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324380,
	devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324380,
	devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324380,
	devoffset 0x324489, length 65536

And again

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ce00, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324400,
	devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324400,
	devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324400,
	devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324400,
	devoffset 0x324509, length 65536

One more good block.

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ce80, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324480,
	devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324480,
	devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324480,
	devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324480,
	devoffset 0x324589, length 65536

Where is the "Request: .*" log message?  Where is the 91.1 read request?
0x96cf00 looks like the next block to fix in the PLEX, but why are we reading
it for the SUBDISK?  And where did this LENGTH come from?

Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x96cf00,
	devoffset 0x96d009, length 2146172928
Mar 19 19:18:22 galahad /kernel: Read dev 13.131084, sd 3, offset 0x96cf00,
	devoffset 0x96d009, length 2146172928
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x96cf00,
	devoffset 0x96d009, length 2146172928
Mar 19 19:17:53 galahad vinum[963]: reviving myraid.p0.s2
Mar 19 19:18:15 galahad vinum[963]: can't revive myraid.p0.s2: Invalid
	argument
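
Some arithmetic on that last block (my interpretation, so take it with a
grain of salt): the bogus length is just under 2^31, and the offset handed
to the subdisk reads is exactly the next PLEX offset from the sequence
above, which smells like a plex-relative offset and a near-overflow length
being used to build the subdisk requests.

	printf '0x%x\n' 2146172928               # 0x7fec0000, only 0x140000 bytes short of 2^31
	printf '0x%x\n' $(( 0x96ce80 + 0x80 ))   # 0x96cf00: the next plex offset, not a subdisk offset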

>How-To-Repeat:

	I don't know if this will repeat anywhere else, but I can offer
	ssh access to galahad to a vouched-for person who wants to test there.


>Fix:

	


>Release-Note:
>Audit-Trail:
>Unformatted:

