Date:      Fri, 25 Feb 2011 11:33:51 -0700
From:      "Kenneth D. Merry" <ken@freebsd.org>
To:        Joachim Tingvold <joachim@tingvold.com>
Cc:        freebsd-scsi@freebsd.org, Alexander Motin <mav@freebsd.org>
Subject:   Re: mps0-troubles
Message-ID:  <20110225183351.GA31590@nargothrond.kdm.org>
In-Reply-To: <2E532F21-B969-4216-9765-BC1CC1EAB522@tingvold.com>
References:  <20110204180011.GA38067@nargothrond.kdm.org> <DE11FC96-06DB-479F-8673-B9ACE2805390@tingvold.com> <20110208201310.GA97635@nargothrond.kdm.org> <4A14FA28-6C9E-4F22-B7A3-4295ACD77719@tingvold.com> <20110218171619.GB78796@nargothrond.kdm.org> <318745DD-B5F4-4693-B3F2-22DF8D437349@tingvold.com> <20110221155041.GA37922@nargothrond.kdm.org> <3037190B-6CF2-4C8E-8350-5BA4F13456A8@tingvold.com> <20110221214544.GA43886@nargothrond.kdm.org> <2E532F21-B969-4216-9765-BC1CC1EAB522@tingvold.com>

On Wed, Feb 23, 2011 at 04:58:14 +0100, Joachim Tingvold wrote:
> On Mon, Feb 21, 2011, at 22:45:44PM GMT+01:00, Kenneth D. Merry wrote:
> >>>Okay, good.  It looks like it is running as designed.
> >>It is? It's still terminating the commands, which I guess it shouldn't?
> >>
> >>	mps0: (0:40:0) terminated ioc 804b scsi 0 state c xfer 0
> >Sorry, I missed that, I was just looking at the first part.
> 
> No worries. (-:
> 
> >I'm still waiting for LSI to look at the SAS analyzer trace I sent them
> >for the "IOC terminated" bug.
> >
> >It appears to be (at least on my hardware) a backend issue of some sort,
> >and probably not anything we can fix in the driver.
> 
> I see. Good to know that you're able to reproduce it, since I can then
> most likely rule out a hardware issue on my controller.

And if you're only seeing it occasionally, it's probably not a big worry.
I have the driver set up to retry IOC terminated errors without
decrementing the retry count, so you won't run into any filesystem or other
errors because of it.

The only issue is that if you get into a situation where you're getting
those errors continuously, you'll wind up in an endless retry loop.  (I've
seen that on my test system, but 60 drives behind 8 expanders in multiple
levels is a bit of an exceptional case.)
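The retry policy described above can be sketched roughly as follows. This is an illustrative model only, not the actual mps(4)/CAM code: the structure and names (ccb, IOC_TERMINATED, should_retry) are hypothetical stand-ins for the real driver's error-recovery path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the policy: an "IOC terminated" completion
 * is requeued without charging the command's retry budget, so
 * occasional bursts never surface as filesystem I/O errors.  Other
 * errors consume one retry each, as usual. */

enum status { CMD_OK, IOC_TERMINATED, OTHER_ERROR };

struct ccb {
	int retry_count;	/* retries remaining */
	enum status status;
};

/* Returns true if the command should be resubmitted. */
static bool
should_retry(struct ccb *ccb)
{
	if (ccb->status == IOC_TERMINATED)
		return true;		/* retry; budget untouched */
	if (ccb->status == OTHER_ERROR && ccb->retry_count > 0) {
		ccb->retry_count--;	/* normal errors consume a retry */
		return true;
	}
	return false;			/* success, or retries exhausted */
}
```

Note that if the IOC terminates every command continuously, should_retry() returns true forever with the budget never shrinking, which is exactly the endless-retry case mentioned above.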

> >Since you've got an HP branded expander, that makes it a little more
> >difficult to determine whether it's an LSI, Maxim, or some other expander.
> >Can you try the following on your system?  You'll need the sg3_utils port:
> >
> >sg_inq -i ses0
> >
> >(I need to update camcontrol to parse page 0x83 output.)
> >
> >[...]
> >
> >Maxim expanders seem to report LUN descriptors in VPD page 0x83 instead
> >of target port descriptors.  We might get a slight clue from the output,
> >but it's hard to say for certain since HP could have customized the page
> >0x83 values in the expander firmware.
> 
> VPD INQUIRY: Device Identification page
>   Designation descriptor number 1, descriptor length: 12
>     transport: Serial Attached SCSI (SAS)
>     designator_type: NAA,  code_set: Binary
>     associated with the target port
>       NAA 5, IEEE Company_id: 0x1438
>       Vendor Specific Identifier: 0x101a2865
>       [0x50014380101a2865]
>   Designation descriptor number 2, descriptor length: 8
>     transport: Serial Attached SCSI (SAS)
>     designator_type: Relative target port,  code_set: Binary
>     associated with the target port
>       Relative target port: 0x1

Is this a 6Gb or a 3Gb expander?  Since it has a target port descriptor, it
might be an OEM LSI expander, but who knows.
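For anyone following along, the NAA 5 descriptor in the sg_inq output above decodes mechanically: per SPC, an NAA format-5 identifier is a 4-bit NAA field, a 24-bit IEEE company ID, and a 36-bit vendor-specific identifier. A minimal sketch (not part of any existing tool):

```c
#include <assert.h>
#include <stdint.h>

/* Decode an NAA format-5 SAS address such as 0x50014380101a2865
 * from the sg_inq output above.  Layout per SPC: 4-bit NAA field,
 * 24-bit IEEE company ID (OUI), 36-bit vendor-specific identifier. */
struct naa5 {
	uint8_t  naa;		/* should be 5 */
	uint32_t company_id;	/* IEEE OUI */
	uint64_t vendor_id;	/* 36-bit vendor-specific part */
};

static struct naa5
naa5_decode(uint64_t wwn)
{
	struct naa5 n;

	n.naa        = (uint8_t)(wwn >> 60);
	n.company_id = (uint32_t)((wwn >> 36) & 0xFFFFFF);
	n.vendor_id  = wwn & 0xFFFFFFFFFULL;
	return n;
}
```

Running this on the WWN from the output above recovers the same fields sg_inq printed: NAA 5, company ID 0x1438, vendor-specific 0x101a2865.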

> >>It just doesn't display the 'out of chain' errors, that's all, I think.
> >
> >Well, if you don't see the 'out of chain' errors with 2048 chain buffers,
> >that means the condition isn't happening.
> >
> >The cost of going from 1024 to 2048 is only 32K of extra memory, which
> >is not a big deal, so I think I'll go ahead and bump the limit up and
> >remove the printfs.  We've now proven the recovery strategy, so it'll
> >just slow things down slightly if anyone runs into that issue again.
> 
> Good. It has such a small impact, yes, so it shouldn't trouble anyone.

I just checked the change into -current, I'll merge it to -stable next
week.

> >>>What filesystem are you using by the way?
> >>ZFS.
> >Interesting.  I haven't been able to run out of chain elements with ZFS,
> >but I can use quite a few with UFS.  I had to artificially limit the
> >number of chain elements to test the change.
> 
> Maybe it's because of the number of disks I have in the same pool?
> Or that I have two uneven raidz2 vdevs in the same pool? The latter
> has to be forced when adding it to the pool, so I guess it's not an
> "ideal" solution... (but "everyone" does it, it seems).

I don't think that would affect the out of chain problem.  It has a lot
more to do with memory fragmentation and therefore how long the
scatter/gather lists are that get generated by busdma.
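To make the connection concrete: when busdma produces a long scatter/gather list (as it tends to with fragmented memory), entries that don't fit in the main request frame spill into chain frames. The sketch below is a deliberate simplification with made-up capacities (MAIN_SGES, CHAIN_SGES); the real mps(4) numbers differ, and real chaining also spends one element per frame on the chain pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: how S/G list length drives chain-buffer demand.
 * Assume the main request frame holds MAIN_SGES entries and each
 * chain frame holds CHAIN_SGES; these capacities are invented for
 * the example, not taken from the driver. */
#define MAIN_SGES	2
#define CHAIN_SGES	8

static size_t
chains_needed(size_t nsegs)
{
	size_t spill;

	if (nsegs <= MAIN_SGES)
		return 0;	/* fits entirely in the request frame */
	spill = nsegs - MAIN_SGES;
	return (spill + CHAIN_SGES - 1) / CHAIN_SGES;	/* ceiling */
}
```

The point is just that chain demand grows with S/G list length, so a workload whose buffers map to many small segments (fragmented memory) eats chain elements much faster than one issuing the same I/O from contiguous pages.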

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG
