Date:      Thu, 12 Jul 2018 09:38:36 -0400
From:      Ken Merry <ken@freebsd.org>
To:        Oliver Sech <crimsonthunder@gmx.net>
Cc:        Stephen Mcconnell <stephen.mcconnell@broadcom.com>, FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject:   Re: problems with SAS JBODs 2
Message-ID:  <7C1E630B-65AD-4FE8-BFDF-F13068070B5E@freebsd.org>
In-Reply-To: <9e0bf18f-0689-b2a0-1da4-b70c497b2f14@gmx.net>
References:  <trinity-14d18077-ea73-40f6-9e87-d2d4000b1f7e-1530620937871@3c-app-gmx-bs01> <CAOtMX2h8r31AeNCKyckK2P0VLn1CKFogo9bWom2So1x2ngpa4A@mail.gmail.com> <237f77ab-89e2-188b-b2b1-84c6d88609b0@gmx.net> <b785fe02-9242-c95f-56cb-2130f90e17f5@gmx.net> <3caf8ccd6fde8cfc4db25bae5327c46b@mail.gmail.com> <0af047d477d15ec364140653bd967c89@mail.gmail.com> <54B10B7C-CDCE-4428-B584-59CE8F38B120@freebsd.org> <9e0bf18f-0689-b2a0-1da4-b70c497b2f14@gmx.net>

> On Jul 12, 2018, at 6:00 AM, Oliver Sech <crimsonthunder@gmx.net> wrote:
>
> On 07/11/2018 10:35 PM, Ken Merry wrote:
>> Oliver, what happens when you try to do I/O to the devices that don’t
>> go away after you pull the cable?  Does that cause the devices to go
>> away?
>
> I tried to 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least
> the "da" device disappears.

Ok, that’s good.  Can you send the dmesg output and check with
‘camcontrol devlist -v’ to make sure the device has fully gone away?

The reason I ask is that I have spent lots of time over the years
debugging device arrival and departure problems in CAM, GEOM and devfs,
and I want to make sure we aren’t running into any non-SAS related
problems.

>
>> Looking at the mprutil output, it also shows the devices sticking
>> around from the adapter’s standpoint.
>>
>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’
>> (where N is the scbus number shown by ‘camcontrol devlist -v’).  That
>> will do some basic probes for each of the devices and should in
>> theory cause them to go away if they aren’t accessible.
>>
>> It seems like the adapter may not be recognizing that the devices in
>> question have gone.
>
>
> I'm pretty sure that I tried this 'camcontrol rescan all' a few times.
> While I am not sure anymore whether that cleans up the non-working
> devices, I'm sure that no new devices were added.

If doing a read from the device with dd makes it go away, ‘camcontrol
rescan all’ should make it go away as well.  It sends a command to every
device, and if the mpr(4) driver tells CAM the drive is no longer there,
it’ll get removed.

If it doesn’t cause the device to get removed (and the rescan doesn’t
hang), it means that you’re getting a response from a device that is no
longer physically connected to the machine, which is impossible with
SAS.
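
For repeating the read test across several stale devices, something
like the following might help.  It is only a sketch: the function name
probe_dev is made up for illustration, and the device nodes (e.g.
/dev/da0 /dev/da1) are assumed to be passed as arguments.  It does the
same 1k read with dd for each node and reports whether the read
succeeded.

```shell
#!/bin/sh
# probe_dev (hypothetical helper): attempt one 1k read from a device
# node, the same way as the manual dd test, and report the result.
probe_dev() {
    if dd if="$1" of=/dev/null bs=1k count=1 2>/dev/null; then
        echo "$1: read ok"
    else
        echo "$1: read failed"
    fi
}

# Probe every node given on the command line.
for dev in "$@"; do
    probe_dev "$dev"
done
```

A failed read only says the I/O did not complete; ‘camcontrol devlist
-v’ is still the way to check whether the device has fully gone away.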

>
> Unfortunately I haven't yet gotten to Steve's 'clear controller
> mapping' script but I did a few other things:

Steve’s email made it sound like he was going to send it.  I just sent
it to you separately.

> * The last time I tried to upgrade the firmware I had all sorts of
> problems. "sas3flash" reported bad checksums while flashing some of
> the files.
> So I reflashed both controllers with the DOS version of sas3flash.
> This was basically a challenge in itself because the DOS version of
> this utility does not seem to run on computers of this decade.
> (ERROR:  Failed to initialize PAL.  Exiting program.)
> The equivalent sas3flash.EFI version seems to be out of date and
> caused the checksum problems described before.
> (This time I wiped them before flashing with "sas3flash -o -e 6".)

That is unfortunate… perhaps Steve has some insight.

>
> * I tried to change the mpr tunable "use_phy_num" after that but this
> has not improved the situation. I will retry and collect logs with
> Steve's script.

Changed it to what?  I think it defaults to 1.  Did you try 0?

> * I retried with the latest "mpr.ko" from the Broadcom download page.
> (Same problems, no "use_phy_num" tunable.)
>
> * I retested this hardware with Linux (4.15 and 4.17)
> ** Some shelves could be replugged reliably (i.e. 45 disks show up, 45
> disks disappear, 45 disks reappear)
> ** With the newest shelf, 2 disks were missing after the replugging
> (i.e. 44 disks show up, 44 disks disappear, 42 disks reappear) (kernel
> log mpt3sas_cm0: "device is not present handle)
>
> * I tried a different controller
> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216)
> (Firmware 16.00.01.00 or 15.00.00.00)
> ** Yesterday I switched to a new fresh out-of-the-box Broadcom LSI
> 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something
> similar with 09*))
> With the new controller everything seems to work on Linux. It might be
> the old firmware?...
> It is better with the new controller on FreeBSD in the sense that I
> at least get one out of two /dev/sesX devices back. But disks are
> still missing and are not getting completely cleaned up…

It does sound a bit like a mapping table problem.  Clearing it might
help, we’ll see.

> This whole thing is a bit frustrating, especially since up until now I
> thought that HBAs are kind of "connect and forget" devices. Next step
> is to set up a separate test environment and try to get it to work
> there. I will keep you updated and try to provide logs for all FreeBSD
> related problems.

Thanks for debugging this.  Unfortunately there are a number of ways it
can go wrong.  The mapping code has been the source of some problems,
sometimes enclosure vendors do the wrong thing, and sometimes there are
other bugs.

Ken



