From owner-freebsd-stable@FreeBSD.ORG  Wed Feb  8 22:38:21 2012
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 266A21065674
	for <freebsd-stable@freebsd.org>; Wed,  8 Feb 2012 22:38:21 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from qmta09.emeryville.ca.mail.comcast.net
	(qmta09.emeryville.ca.mail.comcast.net [76.96.30.96])
	by mx1.freebsd.org (Postfix) with ESMTP id 0B7048FC16
	for <freebsd-stable@freebsd.org>; Wed,  8 Feb 2012 22:38:20 +0000 (UTC)
Received: from omta01.emeryville.ca.mail.comcast.net ([76.96.30.11])
	by qmta09.emeryville.ca.mail.comcast.net with comcast
	id XaYS1i0090EPchoA9aeLV7; Wed, 08 Feb 2012 22:38:20 +0000
Received: from koitsu.dyndns.org ([67.180.84.87])
	by omta01.emeryville.ca.mail.comcast.net with comcast
	id XaeK1i00b1t3BNj8MaeKuQ; Wed, 08 Feb 2012 22:38:20 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
	id 3F578102C1E; Wed,  8 Feb 2012 14:38:19 -0800 (PST)
Date: Wed, 8 Feb 2012 14:38:19 -0800
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Alexander Motin <mav@FreeBSD.org>
Message-ID: <20120208223819.GA27488@icarus.home.lan>
References: <4F32E289.4080806@sentex.net>
	<mailpost.1328736521.3202974.81071.mailing.freebsd.stable@FreeBSD.cs.nctu.edu.tw>
	<4F32F5B0.2060203@FreeBSD.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4F32F5B0.2060203@FreeBSD.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-stable@freebsd.org
Subject: Re: siisch1: Error while READ LOG EXT
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 08 Feb 2012 22:38:21 -0000

On Thu, Feb 09, 2012 at 12:22:40AM +0200, Alexander Motin wrote:
> On 08.02.2012 23:27, Jeremy Chadwick wrote:
> >On Wed, Feb 08, 2012 at 04:00:57PM -0500, Mike Tancsa wrote:
> >>I have a 4 port eSata PCIe card with 3 external port multipliers attached on an AMD64 box (8G of RAM), RELENG8 from Feb1st.
> >>
> >>siis0@pci0:5:0:0:       class=0x010400 card=0x71241095 chip=0x31241095 rev=0x02 hdr=0x00
> >>     vendor     = 'Silicon Image Inc (Was: CMD Technology Inc)'
> >>     device     = 'PCI-X to Serial ATA Controller (SiI 3124)'
> >>     class      = mass storage
> >>     subclass   = RAID
> >>     bar   [10] = type Memory, range 64, base 0xb4408000, size 128, enabled
> >>     bar   [18] = type Memory, range 64, base 0xb4400000, size 32768, enabled
> >>     bar   [20] = type I/O Port, range 32, base 0x3000, size 16, enabled
> >>     cap 01[64] = powerspec 2  supports D0 D1 D2 D3  current D0
> >>     cap 07[40] = PCI-X 64-bit supports 133MHz, 2048 burst read, 12 split transactions
> >>     cap 05[54] = MSI supports 1 message, 64 bit enabled with 1 message
> >>
> >>siis0:<SiI3124 SATA controller>  port 0x3000-0x300f mem 0xb4408000-0xb440807f,0xb4400000-0xb4407fff irq 19 at device 0.0 on pci5
> >>siis0: [ITHREAD]
> >>siisch0:<SIIS channel>  at channel 0 on siis0
> >>siisch0: [ITHREAD]
> >>siisch1:<SIIS channel>  at channel 1 on siis0
> >>siisch1: [ITHREAD]
> >>siisch2:<SIIS channel>  at channel 2 on siis0
> >>siisch2: [ITHREAD]
> >>siisch3:<SIIS channel>  at channel 3 on siis0
> >>siisch3: [ITHREAD]
> >>
> >># camcontrol devlist
> >><WDC WD2001FASS-00U0B0 01.00101>    at scbus0 target 0 lun 0 (pass0,ada0)
> >><WDC WD2001FASS-00U0B0 01.00101>    at scbus0 target 1 lun 0 (pass1,ada1)
> >><WDC WD2001FASS-00U0B0 01.00101>    at scbus0 target 2 lun 0 (pass2,ada2)
> >><WDC WD2001FASS-00U0B0 01.00101>    at scbus0 target 3 lun 0 (pass3,ada3)
> >><Port Multiplier 47261095 1f06>     at scbus0 target 15 lun 0 (pass4,pmp1)
> >><WDC WD2002FAEX-007BA0 05.01D05>    at scbus1 target 0 lun 0 (pass5,ada4)
> >><WDC WD2002FAEX-007BA0 05.01D05>    at scbus1 target 1 lun 0 (pass6,ada5)
> >><WDC WD2002FAEX-007BA0 05.01D05>    at scbus1 target 2 lun 0 (pass7,ada6)
> >><WDC WD2002FAEX-007BA0 05.01D05>    at scbus1 target 3 lun 0 (pass8,ada7)
> >><WDC WD2002FAEX-007BA0 05.01D05>    at scbus1 target 4 lun 0 (pass9,ada8)
> >><Port Multiplier 37261095 1706>     at scbus1 target 15 lun 0 (pass10,pmp0)
> >><Areca usrvar R001>                 at scbus4 target 0 lun 0 (pass11,da0)
> >><Areca backup1 R001>                at scbus4 target 0 lun 1 (pass12,da1)
> >><Areca RAID controller R001>        at scbus4 target 16 lun 0 (pass13)
> >><AMCC 9650SE-2LP DISK 4.10>         at scbus5 target 0 lun 0 (pass14,da2)
> >><ST31000333AS SD35>                 at scbus6 target 0 lun 0 (pass15,ada9)
> >><ST31000528AS CC35>                 at scbus7 target 0 lun 0 (pass16,ada10)
> >><ST31000340AS SD1A>                 at scbus8 target 0 lun 0 (pass17,ada11)
> >><WDC WD1002FAEX-00Z3A0 05.01D05>    at scbus11 target 0 lun 0 (pass18,ada12)
> >>
> >>
> >>Ever since I added a new PM, I have been seeing a new error (READ LOG EXT) along with a the odd slot timeout error.
> >>
> >>
> >>Feb  7 23:49:32 backup3 kernel: siisch1:  ... waiting for slots 47000000
> >>Feb  7 23:49:32 backup3 kernel: siisch1: Timeout on slot 26
> >>Feb  7 23:49:32 backup3 kernel: siisch1: siis_timeout is 07040000 ss 7f17e8b9 rs 7f17e8b9 es 00000000 sts 801d2000 serr 00680000
> >>Feb  7 23:49:32 backup3 kernel: siisch1:  ... waiting for slots 43000000
> >>Feb  7 23:49:34 backup3 kernel: siisch1: Timeout on slot 30
> >>Feb  7 23:49:34 backup3 kernel: siisch1: siis_timeout is 07040000 ss 7f17e8b9 rs 7f17e8b9 es 00000000 sts 801d2000 serr 00680000
> >>Feb  7 23:49:34 backup3 kernel: siisch1:  ... waiting for slots 03000000
> >>Feb  7 23:49:34 backup3 kernel: siisch1: Timeout on slot 25
> >>Feb  7 23:49:34 backup3 kernel: siisch1: siis_timeout is 07040000 ss 7f17e8b9 rs 7f17e8b9 es 00000000 sts 801d2000 serr 00680000
> >>Feb  7 23:49:34 backup3 kernel: siisch1:  ... waiting for slots 01000000
> >>Feb  7 23:49:34 backup3 kernel: siisch1: Timeout on slot 24
> >>Feb  7 23:49:34 backup3 kernel: siisch1: siis_timeout is 07040000 ss 7f17e8b9 rs 7f17e8b9 es 00000000 sts 801d2000 serr 00680000
> >
> >This indicates the controller on channel 1 (siisch1) is "stalled"
> >waiting for underlying communication with the device attached to it.
> >
> >>Feb  7 23:57:59 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 00:13:36 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 00:21:53 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 00:22:16 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 00:39:13 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 01:24:25 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 01:33:52 backup3 last message repeated 2 times
> >>Feb  8 01:43:45 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 01:50:31 backup3 last message repeated 2 times
> >>Feb  8 01:55:20 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 02:26:26 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 02:27:24 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 03:16:28 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 03:36:20 backup3 kernel: siisch1: Error while READ LOG EXT
> >>Feb  8 04:04:05 backup3 kernel: siisch1: Error while READ LOG EXT
> >
> >This indicates the underlying device was handed a READ LOG EXT ATA
> >command (command 0x2f) and the device did not respond promptly
> >(resulting in the timeout messages you see).
> 
> There are hours between timeouts and READ LOG EXT errors. they are
> not directly related, but may have the same reason.
> 
> >>smartctl doesnt show any issues on the drives other than one that has some historical errors from a while ago.  What are these errors and do I need to worry about them ? The "READ LOG EXT" ones are new.
> >>
> >>{snipping SMART stats}
> >
> >You're focused heavily on the READ LOG EXT command.  READ LOG EXT is
> >intended for accessing the GP Log section of a drive.  EXT stands for
> >"Extended".  "GP Log" means "General Purpose Log", and is where all
> >sorts of logging information regarding drive performance is stored.
> >It's usually stored within a reserved section of the platters, or in the
> >HPA area.  It's not within a "standard" user-accessible LBA/sector
> >region.  This is a completely separate log from that of SMART logs.
> 
> READ LOG EXT commands here used to fetch status of some failed NCQ
> commands. It is normal (the only) way to get detailed error status
> in that case. Error of the READ LOG EXT commands may mean that it is
> not regular media error, but may be problem with communication,
> firmware or something else.

As usual -- thanks for your fantastic insights, Alexander.

I wasn't aware that to get NCQ error conditions you had to read from the
GPLog region, I was under the impression this was somehow returned via
extended SATA status codes (since I with NCQ the internal ordering CDBs
changes).  Interesting.  NCQ is still something (on an ATA and OS level,
for lack of better terminology) I am not fully familiar with, though
functionally I understand what it provides/gains.  Where in the ata-cam
code is this done?  I'd like to look at it for educational reasons.

Firmware problems are a possibility in this case, but I'm reluctant to
believe problems with communication between controller and drive.  His
CRC error count on the drive itself is zero.  So the only communication
failure I can think of would be between the controller and the port
multiplier (e.g. before anything actually reaches the drive), or some
kind of firmware-level problem with the port multiplier device (I
believe they do have some sort of ASIC on them?  I've never used one).

Mike, can you please provide output from the following commands?

* smartctl -x /dev/ada9
* smartctl -l sataphy /dev/ada9
* smartctl -l devstat /dev/ada9
* smartctl -l gplog,0x10 /dev/ada9

Note that some of these may fail with some error equivalent to "GPLog
0xXX not supported", and that's perfectly okay.

Thank you.

-- 
| Jeremy Chadwick                                 jdc@parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |