Date: Mon, 18 Jun 2012 16:27:28 -0500
From: Dustin Wenz <dustinwenz@ebureau.com>
To: freebsd-scsi@freebsd.org
Subject: Re: Marginal disks prevent boot with mps(4)
Message-ID: <B9718FCA-FFB6-4C23-99E2-CEF4AA606BC8@ebureau.com>
In-Reply-To: <4FDC2564.3070501@gneto.com>
References: <60F17E0E-EE4A-4F37-9925-055315B987B1@ebureau.com> <20120608215326.GA83721@nargothrond.kdm.org> <551EFA9B-74F7-4CFC-954C-C9E0440E2BDC@ebureau.com> <4FDC2564.3070501@gneto.com>
We do have current firmware on these drives; CC9D and CC4C - the latter being updated to the latest (CC4H) from a few weeks ago as we have time. The original firmware caused all manner of problems. So much so, that this model was basically unusable until they were updated.

	- .Dustin

On Jun 16, 2012, at 1:19 AM, Martin Nilsson wrote:

> Have you checked that you don't have buggy firmware in the ST3000DM001 drive?
>
> Seagate have updates for some versions from last fall on their web.
>
>
> On 2012-06-16 01:06, Dustin Wenz wrote:
>> I just received a SFF-8088->8087 cable via FedEx this morning, which allowed me to continue to isolate this problem.
>>
>> What I discovered is that it makes no difference whether a bad disk is connected to an expander, or if one is connected directly to the HBA. So, if this is a hardware bug, it must be present in the LSI SAS2008-based HBA that I'm using. The firmware on the card was also upgraded from v11.00.00.00 to v13.00.57.00, which is the latest as far as I am aware. That did not seem to change the behavior.
>>
>> I did notice that earlier during startup, I see this message a page or so before the endless ioc messages start:
>> mps0: polling failed
>> mpssas_get_sata_identify: poll for page completed with error 60_mapping_get_dev
>> info: failed to compute the hashed SAS address for SATA device with handle 0x0009
>>
>> It seems that the driver knows something is up; even before it gets stuck later on...
>>
>> So far, the only way I can get this configuration to boot is to change the status for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED to CAM_REQ_CMP_ERR, as Ken mentioned. That change will still cause the machine to report some "ioc terminated" messages, but will not hang the startup process indefinitely. However, I'm not sure what the implications of making that change on a production machine would be.
>>
>> If this is LSI's problem, I don't see why they would bother to fix it. As far as I know, they are the only 6Gb SAS/SATA HBA vendor that works on FreeBSD. We have no choice but to buy their stuff, even if it's not robust.
>>
>> - .Dustin
>>
>> On Jun 8, 2012, at 4:53 PM, Kenneth D. Merry wrote:
>>
>>> On Fri, Jun 08, 2012 at 16:25:31 -0500, Dustin Wenz wrote:
>>>> I just installed a build of 9.0-STABLE in order to test the changes since release. I was hoping that some of the error-handling in mps would alter the behavior I've seen with some SATA disks (particularly, Seagate ST3000DM001 disks) connected through an LSI SAS 9201-16e HBA.
>>>>
>>> Are you using an expander, or are the disks connected directly to the HBA?
>>>
>>> What firmware version are you using on the HBA? Make sure you have the
>>> latest firmware version on the card.
>>>
>>>> It is apparently possible for these disks to get in a state where their presence prevents the machine from booting. This problem has existed for some time, according to some archive-searching I've done, but there isn't much consensus on how to fix it.
>>>>
>>>> The disks are good enough that they can be probed at startup, but some part of initialization cannot complete. This is the message I see repeated forever upon boot (the probe number does change slightly):
>>>>
>>>> (probe14:mps0:0:14:0): INQUIRY. CDB: 12 0 0 0 24 0 length 36 SMID 215 terminated ioc 804b scsi 0 state c xfer 0
>>>>
>>>> There is a comment in mps_sas.c which suggests that this error is usually transient, but that seems not to be the case here. Can anyone suggest a modification that might permit booting in this state?
>>>>
>>> There is not a lot that the driver can do in this case. The command is
>>> getting terminated by the firmware in the HBA, and we really don't have a
>>> lot of information to indicate why.
>>>
>>> You could change the status returned for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED
>>> to CAM_REQ_CMP_ERR, and that would just mean that the probe for that disk
>>> would eventually fail and the kernel would boot. CAM_REQUEUE_REQ tells
>>> CAM to retry the command without decrementing the retry count. That is
>>> why you aren't able to boot.
>>>
>>> If upgrading the HBA firmware doesn't fix the problem, I would suggest
>>> contacting LSI support, and see if they can get additional diagnostics off
>>> the board to figure out what the problem is.
>>>
>>> Ken
>>> --
>>> Kenneth Merry
>>> ken@FreeBSD.ORG
>
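
Ken's explanation of the hang can be illustrated with a small toy model. This is only a sketch and not code from mps_sas.c or CAM: the `toy_*` names, the retry budget, and the attempt cap are hypothetical, and the three numeric constants are the values I recall from the MPI2 headers, so treat them as assumptions. It models a disk that always answers with the `ioc 804b` status from the log above, and compares a requeue mapping (retry count never decremented) with an error mapping (retry count decremented until the probe fails).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, recalled from the MPI2 headers; verify before relying on them. */
#define IOCSTATUS_FLAG_LOG_INFO_AVAILABLE 0x8000u
#define IOCSTATUS_MASK                    0x7FFFu
#define IOCSTATUS_SCSI_IOC_TERMINATED     0x004Bu

/* Hypothetical stand-ins for CAM_REQUEUE_REQ and CAM_REQ_CMP_ERR. */
enum toy_cam_status { TOY_REQUEUE_REQ, TOY_REQ_CMP_ERR };

/* What happened to the probe in this toy model. */
enum toy_outcome { TOY_PROBE_OK, TOY_PROBE_FAILED, TOY_PROBE_STUCK };

/* Strip the log-info flag so a raw 0x804b compares equal to 0x004b. */
static uint16_t
ioc_status_code(uint16_t raw)
{
	return (uint16_t)(raw & IOCSTATUS_MASK);
}

/*
 * Probe a device that answers every command with raw_status.
 * 'mapped' is the CAM status the driver would return for "IOC terminated".
 * 'retries' is the retry budget; 'cap' bounds the loop so that the
 * never-ending requeue case is reported as TOY_PROBE_STUCK, not a real hang.
 */
static enum toy_outcome
toy_probe(uint16_t raw_status, enum toy_cam_status mapped, int retries,
    int cap, int *attempts_out)
{
	int attempts;

	assert(retries >= 0 && cap > 0 && attempts_out != 0);
	for (attempts = 1; attempts <= cap; attempts++) {
		if (ioc_status_code(raw_status) != IOCSTATUS_SCSI_IOC_TERMINATED) {
			*attempts_out = attempts;
			return TOY_PROBE_OK;	/* command completed */
		}
		if (mapped == TOY_REQUEUE_REQ)
			continue;		/* retried, budget untouched */
		if (retries == 0) {
			*attempts_out = attempts;
			return TOY_PROBE_FAILED; /* budget spent, boot can go on */
		}
		retries--;			/* error mapping consumes a retry */
	}
	*attempts_out = cap;
	return TOY_PROBE_STUCK;
}
```

With a budget of 4 retries, the error mapping gives up after 5 attempts, while the requeue mapping runs until the artificial cap is hit. That cap exists only in this model; it stands in for the endless "terminated ioc 804b" messages at boot.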