Date:      Mon, 18 Jun 2012 16:35:58 -0500
From:      Dustin Wenz <dustinwenz@ebureau.com>
To:        freebsd-scsi@freebsd.org
Subject:   Re: Marginal disks prevent boot with mps(4)
Message-ID:  <1165F6D3-3207-4CEC-9D6C-4615FBEBE13A@ebureau.com>
In-Reply-To: <bgukri3lj2bv5ix5krybgmx8.1339803933926@email.android.com>
References:  <bgukri3lj2bv5ix5krybgmx8.1339803933926@email.android.com>

What part of CAM would be responsible for managing disk conditions such as this? I've looked through the cam(4) docs and some of the configurable options, but none of them seem likely to help. It's possible that I've overlooked something, but I'm not sure what.

It would be very helpful if there were a way to remove a device entry using camcontrol without it hanging. That would at least allow me to deal with these failures until a fix is found or created.

	- .Dustin
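
For reference, the retry behavior that Ken describes in the quoted message below can be sketched as a toy model. This is only an illustration under stated assumptions: the enum values, the function name, and the cap parameter are hypothetical and are not taken from the mps(4) or CAM sources. It assumes the firmware terminates every attempt, as seems to happen with these disks.

```c
/* Hypothetical stand-ins for the two CAM statuses discussed in the thread. */
enum sketch_status { SKETCH_REQUEUE_REQ, SKETCH_REQ_CMP_ERR };

/*
 * Re-issue a command that the simulated firmware always terminates.
 * Returns the number of attempts made before the probe fails, or -1 if
 * the safety cap is reached first (the model's stand-in for "retries
 * forever").
 */
int probe_attempts(enum sketch_status mapped, int retry_count, int cap)
{
    int attempts = 0;

    for (;;) {
        attempts++;                 /* command issued, then terminated */
        if (attempts >= cap)
            return -1;              /* still being retried at the cap */
        if (mapped == SKETCH_REQUEUE_REQ)
            continue;               /* retry; retry_count is not touched */
        if (retry_count-- == 0)
            return attempts;        /* retries exhausted: probe fails */
    }
}
```

With a retry count of 4, calling probe_attempts(SKETCH_REQ_CMP_ERR, 4, 1000) gives up after the initial attempt plus four retries, while probe_attempts(SKETCH_REQUEUE_REQ, 4, 1000) only stops because of the artificial cap. That matches the description of why the probe never finishes at boot.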

On Jun 15, 2012, at 6:45 PM, Kyle Creyts wrote:

> IIRC, this is a camctl problem.
>
> Dustin Wenz <dustinwenz@ebureau.com> wrote:
>
> I just received an SFF-8088->8087 cable via FedEx this morning, which allowed me to continue isolating this problem.
>
> What I discovered is that it makes no difference whether a bad disk is connected through an expander or directly to the HBA. So, if this is a hardware bug, it must be present in the LSI SAS2008-based HBA that I'm using. I also upgraded the firmware on the card from v11.00.00.00 to v13.00.57.00, which is the latest as far as I am aware. That did not seem to change the behavior.
>
> I did notice that earlier during startup, a page or so before the endless ioc messages start, I see these messages:
> 	mps0: polling failed
> 	mpssas_get_sata_identify: poll for page completed with error 60
> 	_mapping_get_dev_info: failed to compute the hashed SAS address for SATA device with handle 0x0009
>
> It seems that the driver knows something is up even before it gets stuck later on...
>
> So far, the only way I can get this configuration to boot is to change the status returned for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED to CAM_REQ_CMP_ERR, as Ken mentioned. With that change the machine still reports some "ioc terminated" messages, but the startup process no longer hangs indefinitely. However, I'm not sure what the implications of making that change on a production machine would be.
>
> If this is LSI's problem, I don't see why they would bother to fix it. As far as I know, they are the only 6Gb SAS/SATA HBA vendor that works on FreeBSD. We have no choice but to buy their stuff, even if it's not robust.
>
> 	- .Dustin
>
> On Jun 8, 2012, at 4:53 PM, Kenneth D. Merry wrote:
>
>> On Fri, Jun 08, 2012 at 16:25:31 -0500, Dustin Wenz wrote:
>>> I just installed a build of 9.0-STABLE in order to test the changes since release. I was hoping that some of the error-handling in mps would alter the behavior I've seen with some SATA disks (particularly, Seagate ST3000DM001 disks) connected through an LSI SAS 9201-16e HBA.
>>>
>>
>> Are you using an expander, or are the disks connected directly to the HBA?
>>
>> What firmware version are you using on the HBA?  Make sure you have the
>> latest firmware version on the card.
>>
>>> It is apparently possible for these disks to get into a state where their presence prevents the machine from booting. This problem has existed for some time, according to some archive-searching I've done, but there isn't much consensus on how to fix it.
>>>
>>> The disks are good enough that they can be probed at startup, but some part of initialization cannot complete. This is the message I see repeated forever upon boot (the probe number does change slightly):
>>>
>>> 	(probe14:mps0:0:14:0): INQUIRY. CDB: 12 0 0 0 24 0 length 36 SMID 215 terminated ioc 804b scsi 0 state c xfer 0
>>>
>>> There is a comment in mps_sas.c which suggests that this error is usually transient, but that seems not to be the case here. Can anyone suggest a modification that might permit booting in this state?
>>>
>>
>> There is not a lot that the driver can do in this case.  The command is
>> getting terminated by the firmware in the HBA, and we really don't have a
>> lot of information to indicate why.
>>
>> You could change the status returned for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED
>> to CAM_REQ_CMP_ERR, and that would just mean that the probe for that disk
>> would eventually fail and the kernel would boot.  CAM_REQUEUE_REQ tells
>> CAM to retry the command without decrementing the retry count.  That is
>> why you aren't able to boot.
>>
>> If upgrading the HBA firmware doesn't fix the problem, I would suggest
>> contacting LSI support, and see if they can get additional diagnostics off
>> the board to figure out what the problem is.
>>
>> Ken
>> -- 
>> Kenneth Merry
>> ken@FreeBSD.ORG
>
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1165F6D3-3207-4CEC-9D6C-4615FBEBE13A>