Date:      Fri, 24 Jun 2011 20:45:33 -0600
From:      "Kenneth D. Merry" <ken@freebsd.org>
To:        Joachim Tingvold <joachim@tingvold.com>
Cc:        freebsd-scsi@freebsd.org, Alexander Motin <mav@freebsd.org>
Subject:   Re: mps0-troubles
Message-ID:  <20110625024533.GA86406@nargothrond.kdm.org>
In-Reply-To: <8AFCE2C0-A87D-414B-912F-C80C158B6D94@tingvold.com>
References:  <20110208201310.GA97635@nargothrond.kdm.org> <4A14FA28-6C9E-4F22-B7A3-4295ACD77719@tingvold.com> <20110218171619.GB78796@nargothrond.kdm.org> <318745DD-B5F4-4693-B3F2-22DF8D437349@tingvold.com> <20110221155041.GA37922@nargothrond.kdm.org> <3037190B-6CF2-4C8E-8350-5BA4F13456A8@tingvold.com> <20110221214544.GA43886@nargothrond.kdm.org> <2E532F21-B969-4216-9765-BC1CC1EAB522@tingvold.com> <20110225183351.GA31590@nargothrond.kdm.org> <8AFCE2C0-A87D-414B-912F-C80C158B6D94@tingvold.com>

On Sat, Jun 25, 2011 at 03:30:37 +0200, Joachim Tingvold wrote:
> On Fri, Feb 25, 2011, at 19:33:51PM GMT+01:00, Kenneth D. Merry wrote:
> >I just checked the change into -current, I'll merge it to -stable  
> >next week.
> 
> I'm back! Missed me? :-D
> 
> After running fine for a while, I decided to do some more testing.  
> Usual 'dd' in a while-loop over the night, and woke up to this;
> 
> ###
> mps0: (0:39:0) terminated ioc 804b scsi 0 state c xfer 65536
> mps0: (0:39:0) terminated ioc 804b scsi 0 state c xfer 65536
> mps0: (0:39:0) terminated ioc 804b scsi 0 state c xfer 65536
> mps0: (0:39:0) terminated ioc 804b scsi 0 state c xfer 65536
> mps0: (0:39:0) terminated ioc 804b scsi 0 state c xfer 0
> mps0: (0:39:0) terminated ioc 804b scsi 0 state 0 xfer 0
> mps0: (0:39:0) terminated ioc 804b scsi 0 state 0 xfer 0
> mps0: (0:39:0) terminated ioc 804b scsi 0 state 0 xfer 0
> mps0: (0:39:0) terminated ioc 804b scsi 0 state 0 xfer 0
> mps0: mpssas_remove_complete on target 0x0027, IOCStatus= 0x0
> (da7:mps0:0:39:0): lost device
> (da7:mps0:0:39:0): Invalidating pack
> (da7:mps0:0:39:0): Invalidating pack
> (da7:mps0:0:39:0): Invalidating pack
> (da7:mps0:0:39:0): Invalidating pack
> (da7:mps0:0:39:0): Synchronize cache failed, status == 0xa, scsi  
> status == 0x0
> (da7:mps0:0:39:0): removing device entry
> da7 at mps0 bus 0 scbus0 target 39 lun 0
> da7: <ATA WDC WD10EACS-00Z 1B01> Fixed Direct Access SCSI-5 device
> da7: 300.000MB/s transfers
> da7: Command Queueing enabled
> da7: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
> ###
> 
> Now, the disk was present at the time I checked, as camcontrol confirms;
> 
> [root@filserver /storage/tmp]# camcontrol devlist|grep da7
> <ATA WDC WD10EACS-00Z 1B01>        at scbus0 target 39 lun 0 (pass8,da7)

Yep, this looks like what I've seen with mps controllers talking to SATA
drives through an expander under high load.

I know I've asked this before, but what brand of expander do you have, and
is it 3Gb or 6Gb?  It looks like the drive is probing at 3Gb in any case.

It looks like the drive went away and came back.

> However, the disk was marked as "REMOVED" by 'zpool status';
> 
> ###
> [jocke@filserver /storage/tmp]$ zpool status
>   pool: storage
>  state: DEGRADED
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	storage     DEGRADED     0     0     0
> 	  raidz2-0  ONLINE       0     0     0
> 	    da8     ONLINE       0     0     0
> 	    da9     ONLINE       0     0     0
> 	    da10    ONLINE       0     0     0
> 	    da11    ONLINE       0     0     0
> 	    da15    ONLINE       0     0     0
> 	    da16    ONLINE       0     0     0
> 	  raidz2-1  DEGRADED     0     0     0
> 	    da0     ONLINE       0     0     0
> 	    da1     ONLINE       0     0     0
> 	    da2     ONLINE       0     0     0
> 	    da3     ONLINE       0     0     0
> 	    da4     ONLINE       0     0     0
> 	    da5     ONLINE       0     0     0
> 	    da6     ONLINE       0     0     0
> 	    da7     REMOVED      0     0     0
> 	    da12    ONLINE       0     0     0
> 	    da13    ONLINE       0     0     0
> 	spares
> 	  da14      AVAIL
> ###
> 
> A quick 'zpool online storage da7' works fine, as suspected, and pool  
> is resilvering at the moment.
> 
> I find it a bit worrisome that a disk was removed like that. It  
> _could_ be that the disk isn't completely good, however, due to my  
> previous experiences with mps, I suspect the disk is fine (smartctl- 
> readouts on the disk seems to be good as well).

The disk is probably fine.  That error tends to happen when there is a lot
of contention under high load.  I wish I knew why.  It is something that
LSI should fix.  I talked to them for a while trying to get an answer on
it, but got nowhere.

With some of the ZFS improvements that Justin is working on in -current,
I think the drive would probably have been put back into the pool
automatically when it came back.
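
In the meantime, something along these lines might help spot a drive that
dropped out and came back.  This is only a rough sketch, not a tested tool:
the removed_devices helper and the POOL variable are names made up for
illustration, and it assumes the 'zpool status' layout shown above, with the
device name in the first column and the state in the second.  It only prints
the 'zpool online' commands and leaves running them to you.

```shell
#!/bin/sh
# Print the names of vdevs that 'zpool status' reports as REMOVED.
# Reads 'zpool status' output on stdin, so it can be tried on saved output.
# (removed_devices is a hypothetical helper, not part of ZFS.)
removed_devices() {
    awk '$2 == "REMOVED" { print $1 }'
}

# POOL is an assumption; "storage" is the pool name from the output above.
POOL=${POOL:-storage}

# Only print the commands; whether to run them is left to the admin.
if command -v zpool >/dev/null 2>&1; then
    zpool status "$POOL" | removed_devices | while read -r dev; do
        echo "zpool online $POOL $dev"
    done
fi
```

If a drive keeps showing up in that list, it would be worth checking the
disk and cabling before onlining it again.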

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG


