Date:      Wed, 1 Dec 1999 11:06:35 -0700 (MST)
From:      "Kenneth D. Merry" <ken@kdm.org>
To:        dgilbert@velocet.ca (David Gilbert)
Cc:        stable@FreeBSD.ORG
Subject:   Re: vinum experiences.
Message-ID:  <199912011806.LAA43219@panzer.kdm.org>
In-Reply-To: <14405.8810.777783.992833@trooper.velocet.net> from David Gilbert at "Dec 1, 1999 08:28:10 am"

[ If you want to comment on SCSI issues, I would suggest mailing the -scsi
list, since you'll get a wider audience of people who know about SCSI. ]

David Gilbert wrote...
> While I'm still chasing the memory corruption bug in vinum, I have a
> couple of observations.
> 
> 1. Removing a device (at least, with the ahc controller) locks the bus 
> even though I have a RAID hot-swap ready chassy (that properly
> isolates the bus between commands).  In my test, I had a completely
> quiet SCSI bus when I removed one of the drives.  I then wrote to the
> RAID array.  I got:
> 
> Nov 30 18:31:51 raid1 /kernel: (da8:ahc1:0:11:0): Invalidating pack
> Nov 30 18:31:51 raid1 /kernel: raid.p0.s6: fatal read I/O error
> Nov 30 18:31:51 raid1 /kernel: vinum: raid.p0.s6 is crashed by force
> Nov 30 18:31:52 raid1 /kernel: vinum: raid.p0 is degraded
> Nov 30 18:31:52 raid1 /kernel: d7: fatal drive I/O error
> Nov 30 18:31:52 raid1 /kernel: vinum: drive d7 is down
> Nov 30 18:31:52 raid1 /kernel: raid.p0.s6: fatal write I/O error
> Nov 30 18:31:52 raid1 /kernel: vinum: raid.p0.s6 is stale by force
> Nov 30 18:31:52 raid1 /kernel: d7: fatal drive I/O error
> Nov 30 18:31:52 raid1 /kernel: biodone: buffer already done

That looks like it may be a vinum issue.  You shouldn't be getting buffers
done twice, as that error message indicates.  Have you talked to Greg at
all about this?  If you're chasing down bugs in Vinum, it would make sense
to contact the author and work with him to either find the problem, or
trace it to some other part of the system.

> Nov 30 18:31:52 raid1 /kernel: (da8:ahc1:0:11:0): Synchronize cache failed, status == 0x4a, scsi status == 0x0
> Nov 30 18:33:16 raid1 /kernel: (da8:ahc1:0:11:0): lost device
> Nov 30 18:33:16 raid1 /kernel: (da8:ahc1:0:11:0): removing device entry
> 
> ... I got more than one of the Synchronize cache failed.  the "lost
> device" was when I "camcontrol rescan 1"  ... I did do a "camcontrol
> reset 1", but it didn't affect things.

All of that is normal.  The synchronize cache failed since there was no
device there to talk to.  You probably got more than one of those because
it was retried.

> The net result is that SCSI bus 1 was wedged after this.  I would
> conjecture that removing a device (and running with this device
> removed is precisely what the chassy was designed to do) should not
> wedge things.

How do you know the bus was wedged?  Could you issue SCSI commands with
camcontrol?  e.g.:

camcontrol tur da10 -v

That will issue a Test Unit Ready command to da10.  If the device responds,
the bus isn't wedged.
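
If you want to check several devices on the bus in one go, something along
these lines might do as a rough sketch.  The check_bus function and the
CAMCONTROL variable are names made up for illustration, and the sketch
assumes camcontrol exits non-zero when the test unit ready fails:

```sh
#!/bin/sh
# Hypothetical helper: send a test unit ready to each device named on the
# command line.  CAMCONTROL is an illustrative override so that a different
# binary can be substituted; it defaults to the real camcontrol.
# Assumption: "camcontrol tur" exits non-zero when the command fails.
check_bus() {
    rc=0
    for dev in "$@"; do
        if "${CAMCONTROL:-camcontrol}" tur "$dev" -v >/dev/null 2>&1; then
            echo "$dev: responded to test unit ready"
        else
            # A failure here is not proof of a wedged bus; the device
            # may simply be gone or not ready.
            echo "$dev: no response"
            rc=1
        fi
    done
    return $rc
}
```

Running "check_bus da9 da10" after sourcing that would report on each
device in turn.  A single device that answers is enough to show that
commands are still getting across the bus.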

> In fact, since the camcontrol rescan 1 was successful, I suggest that
> it was cam, not the ahc driver that was somehow wedged.

I don't think it's clear at all what wedged.  The fact that you were able
to rescan the bus indicates that the CAM side of things is probably working
properly.  One of the things that a rescan does is send a SCSI inquiry
command to every possible target ID on the bus.  You can't do that if the
bus is wedged.

Ken
-- 
Kenneth Merry
ken@kdm.org