Date:      Tue, 09 Dec 2014 09:25:10 +0000
From:      Steven Hartland <killing@multiplay.co.uk>
To:        freebsd-stable@freebsd.org
Subject:   Re: 10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
Message-ID:  <5486BFF6.10309@multiplay.co.uk>
In-Reply-To: <20141209093405.6dd2c268@orwell>
References:  <20141106003240.344dedf6@orwell> <545AB64F.1060502@multiplay.co.uk> <20141106012739.509b96b5@orwell> <545ACCEF.5000300@multiplay.co.uk> <20141209093405.6dd2c268@orwell>


On 09/12/2014 08:34, Kai Gallasch wrote:
> On Thu, 06 Nov 2014 01:20:47 +0000,
> Steven Hartland <killing@multiplay.co.uk> wrote:
>
>> Try recabling and re-seating; if it still happens, try to identify whether
>> it's the disk or the backplane by moving it in the chassis. We had a
>> machine here recently where it was a backplane issue and simply
>> replacing it fixed the issue.
> Steven.
>
> Over the last few weeks I took some time to track down the cause of the
> AHCI timeouts with the two Samsung SSD drives.
>
> Just for the record, my original post on the FreeBSD mailing
> list archive:
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2014-November/080914.html
>
>
> I changed/tried the following to get rid of the AHCI timeouts, but no
> luck; they still show up :-/
>
> Hardware:
>
> - Changed all four SATA cables with cables of an identical spare server
> - Changed all four SATA cables with certified SATA3 cables
> - Replaced the 2.5" -> 3.5" drive converters with ones of another
>    manufacturer
> - Replaced the drive backplane of the server
> - Hooked the two SSDs up directly to the SATA connectors on the
>    mainboard
> - Experimentally put an LSI 9212-4i4e PCIe SATA/SAS controller into the
>    server and connected the SATA cables to it.
> - Same as before, but using the certified SATA3 cables
> - Same as before, but this time connecting the two SSDs directly to the
>    9212-4i4e
> - Same as before, connecting the two SSDs directly to the 9212-4i4e, but
>    this time with the original SATA cables
>
>
> BIOS:
> - Temporarily disabled Power Management
> - Tried disabling the "Enable Hot Plug" option
>
>
> The difference between using the SATA connectors on the mainboard and
> using the LSI 9212-4i4e is that the LSI controller seems to be more
> picky about CRC errors on the SATA bus, and bus problems show up even
> without starting a zfs scrub. When doing a scrub on the LSI
> controller there are plenty of timeouts, and in one test one of the SSD
> drives even disappeared from the SATA bus.
>
> Of course, throughout testing the two Hitachi non-SSD SATA drives did
> not show any problems at all, although they were also connected to
> the mainboard or the LSI controller during testing.
>
> So I now think the whole problem centers on the Samsung 850 PRO
> 512GB SSDs. Too bad I do not have the budget to just buy two Intel (or
> other) SSDs of similar size and see if the timeouts disappear.
>
> I wonder if this is a firmware issue with the drive or just some
> misguided, fancy energy-saving feature of this particular drive
> model causing all the trouble.
>
> Both drives have serial numbers not far apart and smartctl claims there
> are no errors on the SSDs.
>
> Any ideas (left)?
>
Have you tried dropping the link speed on the AHCI controller, e.g. with
hint.ahcich.0.sata_rev="2" in /boot/loader.conf?
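A minimal /boot/loader.conf sketch of that suggestion, assuming (hypothetically) that the two SSDs sit on AHCI channels 0 and 1; camcontrol devlist -v should show which ahcich channel each ada device is actually attached to. Per ahci(4), sata_rev values 1, 2 and 3 limit the link to 1.5, 3 and 6 Gbps respectively.

```
# Hypothetical channel numbers: adjust to wherever the SSDs really are.
# Limit both SSD channels to SATA revision 2 (3 Gbps) instead of 6 Gbps.
hint.ahcich.0.sata_rev="2"
hint.ahcich.1.sata_rev="2"
```

Hints are read at boot, so a reboot is needed before the lower rate takes effect; the negotiated rate is then reported in the ada(4) probe messages in dmesg. This only affects ports driven by ahci(4), such as the mainboard SATA connectors, not disks behind the LSI card, which uses a different driver.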

I can't say I've used Samsung 850 Pros, but we have plenty of 840s in
service here attached to LSI controllers and have never had an issue.

Another possibility is a bad motherboard (MB), where the problem is
actually occurring in memory. Given you have already replaced the
controller and cables, this might be your issue, so try replacing the
memory, MB, CPU, etc. The last time I had really bad corruption issues,
the problem turned out to be a dodgy Intel CPU.

Also, someone posted on the list recently that they had constant CKSUM
errors from ZFS; it turned out to be their power causing the issue, and
running the server off a sine wave UPS made the problem go away.

     Regards
     Steve


