Date:      Sat, 10 Feb 2001 22:02:13 -0800 (PST)
From:      Matthew Jacob <mjacob@feral.com>
To:        "Justin T. Gibbs" <gibbs@scsiguy.com>
Cc:        audit@freebsd.org, "Kenneth D. Merry" <ken@kdm.org>, Gerard Roudier <groudier@club-internet.fr>
Subject:   Re: a couple of minor but important changes to SCSI error handling
Message-ID:  <Pine.BSF.4.21.0102102158450.68317-100000@beppo.feral.com>
In-Reply-To: <200102110523.f1B5NbO10383@aslan.scsiguy.com>


On Sat, 10 Feb 2001, Justin T. Gibbs wrote:

> >
> >First is scsi_all.c:
> 
> This looks fine.  I also verified that the new error recovery code that
> Ken is reviewing right now also gets this right.

Good!

> 
> >Second is scsi_da.c:
> 
> ...
> 
> >10 retries with a .5 second delay between each is still only 5 seconds. 10
> >retries might be more appropriate to a SAN environment with at least a couple
> >of seconds of different initiators spasming the loop.
> 
> Depending on the error, I don't know that we would necessarily delay or not
> here.  If an initiator is spamming the loop, what does the peripheral driver
> see?  A command timeout?  Something reported as a "selection timeout"?  If
> you can be more specific, perhaps we can make the da error handler smarter
> so that certain types of errors get additional retries (similar perhaps to
> how we do a series of TURs for some errors in cam_periph_error()).

Well, the default action for a selection timeout is to delay 0.5 seconds.
That's what this affects.

There's a bit of uncertainty when a device leaves the loop (or the fabric) as
to whether it has really left for good or just temporarily. I'd like to give a
device we've seen before a bit more grace before we give up on it. When I did
the Solaris SCSA stuff, I did 30 retries, but I didn't give it enough grace
time. If it's a device with mounted filesystems, you should give somebody a
chance to see the messages spewing out and enough time to go back and
plug in the cable that they unplugged. So, really, 5 seconds isn't
enough. This may be more in the new error recovery zone.
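
To put numbers on the grace time, here is a minimal sketch of the
arithmetic. It is not the scsi_da.c or CAM code; the struct and function
names (retry_policy, grace_window_ms, retries_for_window) are hypothetical
and exist only for illustration. It assumes a fixed delay before every
retry, as with the 0.5 second selection timeout delay mentioned above.

```c
#include <stdint.h>

/* Hypothetical retry policy; these names are illustrative, not CAM's. */
struct retry_policy {
	uint32_t max_retries;	/* retries allowed before giving up */
	uint32_t delay_ms;	/* fixed delay before each retry, in ms */
};

/* Total time a vanished device gets before the I/O is failed. */
static uint32_t
grace_window_ms(const struct retry_policy *p)
{
	return (p->max_retries * p->delay_ms);
}

/*
 * Smallest retry count whose total delay covers wanted_ms.
 * Rounds up; returns 0 if delay_ms is 0, since no count would help.
 */
static uint32_t
retries_for_window(uint32_t wanted_ms, uint32_t delay_ms)
{
	if (delay_ms == 0)
		return (0);
	return ((wanted_ms + delay_ms - 1) / delay_ms);
}
```

With max_retries = 10 and delay_ms = 500, grace_window_ms() gives 5000,
i.e. the 5 seconds discussed here. Going the other way, if one assumed a
30 second window was wanted (a made-up target, not a number from this
thread), retries_for_window(30000, 500) gives 60 retries at the same delay.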

Note that this affects the read/write code only, not the probe, sync cache,
read capacity, or 'other' code.

-matt



