Date:      Wed, 01 Feb 2012 19:09:25 +0100
From:      Willem Jan Withagen <wjw@digiware.nl>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        "stable@freebsd.org" <stable@freebsd.org>
Subject:   Re: Troube with SSD
Message-ID:  <4F297FD5.7090809@digiware.nl>
In-Reply-To: <20120201163556.GA97343@icarus.home.lan>
References:  <4F2940C1.10901@digiware.nl> <20120201143942.GA96012@icarus.home.lan> <4F2960A7.8040705@digiware.nl> <20120201163556.GA97343@icarus.home.lan>

On 2012-02-01 17:35, Jeremy Chadwick wrote:
> On Wed, Feb 01, 2012 at 04:56:23PM +0100, Willem Jan Withagen wrote:
>> On 2012-02-01 15:39, Jeremy Chadwick wrote:
>>> On Wed, Feb 01, 2012 at 02:40:17PM +0100, Willem Jan Withagen wrote:
>>>> The device is a Corsair 60GB Force GT, and thus far I have not found any
>>>> suggestions that this series of devices is prone to doing this.
>>>
>>> Can you please provide the following output when that SSD is attached
>>> to the system?  You will need to install ports/sysutils/smartmontools
>>> for this (please make sure it's version 5.42 or newer).
>>>
>>> * smartctl -a /dev/whatever
>>> * smartctl -l devstat /dev/whatever
>>> * smartctl -l sataphy /dev/whatever
>>> * smartctl -l ssd /dev/whatever

......

> It is extremely taxing for me to track all of these things, because 99%
> of people do not write down/track what it is they do when they start
> moving hardware around/etc..  I'm not necessarily lecturing you, I'm
> more or less ranting -- I go through this situation two or three times a
> week with people I help online, and it wastes a lot of time.  I have
> another individual in a private Email who asked me for help with 2 disks
> (one SSD, one HD), and kept moving the drives around between 3 different
> machines, giving me random output from each one (behaviour differed per
> box).  I cannot deal with this kind of situation.

I know what you mean. I used to run an ISP in the nineties, and was
usually the last person standing for the really hard problems. By that
time nothing of the original problem was still in place.


>> Here is the output of the first command, but it contains some really
>> weird values...?
> 
> All the values below look fine to me.  I will try my best to explain.

>> SMART Attributes Data Structure revision number: 10
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000f   082   082   050    Pre-fail  Always       -       897651373777
>>   5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       0
>>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       121242631799621
>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
>> 171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
>> 172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
>> 174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       19
>> 177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       0
>> 181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always       -       0
>> 182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always       -       0
>> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
>> 194 Temperature_Celsius     0x0022   026   035   000    Old_age   Always       -       26 (Min/Max 21/35)
>> 195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline      -       897651373777
>> 196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
>> 201 Soft_Read_Error_Rate    0x001c   120   120   000    Old_age   Offline      -       897651373777
>> 204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline      -       897651373777
>> 230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       -       429496729700
>> 231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       0
>> 233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline      -       1260
>> 234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       1925
>> 241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       1925
>> 242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       1032
> 
> These values all look acceptable/excellent as best I can tell.  The only
> attribute above that interests me is attribute 174.  smartmontools
> doesn't know what this is, but I am curious to know what value "19"
> (which to me appears to be a counter or gauge) actually represents.
> Also, just for note: I think it's cool that Corsair put a thermistor or
> DTS inside of their drive for temperature readings.  Wise of them!
> 
> What you probably meant by "real weird values" are the extremely high
> numbers in the RAW_VALUE column.  This is a sign of an individual who
> lacks familiarity with SMART and does not know how to properly interpret
> attributes.  :-)

Count me in....
I'll leave it in for history as well

> I will make it crystal clear (since this is a mailing list and I'm sure
> someone will read this in the future): you cannot look at RAW_VALUE and
> assume it is a raw integer/counter or gauge.
> 
> SMART attributes and their associated 6-byte data values are not defined
> per ATA standard.  Thus, each vendor can implement them or store the
> data in the RAW_VALUE portion in any format they wish.
> 
> Common vendors who do this are Seagate and Hitachi, and apparently
> Corsair.  The behaviour varies from vendor to vendor, drive model to
> drive model, and firmware to firmware.
> 
> Vendor-encoded values often appear very large or "look scary" to the
> uneducated eye.  smartmontools can decode some of these, but the drive
> has to be in the smartmontools database (drivedb.h), **and** the code
> has to be written in smartmontools to properly decode the data.
> 
> Since the attributes are proprietary, figuring out the format is
> virtually impossible without help from the vendor.  Some (most) vendors
> choose not to disclose this information.  In the case of some Seagate
> drives, the smartmontools folks either got "tips" from someone within
> Seagate, or somehow managed to figure out how to decode some (not all)
> on their own.
> 
> You should probably start digging around on the Corsair forums, or
> within any online documentation you can find from Corsair, to see if
> they document what their SMART attributes are in their drives.  For
> example, Intel documents all of their SMART attributes in an official
> PDF.
> 
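
One way to look at such a value is to stop reading it as a single decimal
number and split the 6 bytes into fields. The sketch below is a guess at
the layout, not Corsair's documented format: it assumes the low 32 bits
and the high 16 bits are separate fields, and split_raw48 is a name made
up for this illustration.

```python
def split_raw48(raw):
    """Split a 48-bit SMART raw value into (high 16 bits, low 32 bits).

    The 16/32 split is an assumption for illustration only; the real
    layout is vendor-specific.
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("SMART raw values are 6 bytes (48 bits)")
    return raw >> 32, raw & 0xFFFFFFFF


# RAW_VALUE entries copied from the smartctl output quoted above.
samples = {
    "Raw_Read_Error_Rate": 897651373777,
    "Power_On_Hours": 121242631799621,
    "Head_Amplitude": 429496729700,
}

for name, raw in samples.items():
    high, low = split_raw48(raw)
    print(f"{name:22s} 0x{raw:012x}  high16={high}  low32={low}")
```

Under that assumption Power_On_Hours reads as 837 in the low field and
Head_Amplitude as 100 in both fields, which looks far less alarming than
the decimal values. Only vendor documentation can confirm what the fields
actually mean.
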
>> SMART Error Log not supported
> 
> Well that's disappointing.  That means that any kind of LBA (read/write)
> error inside of the drive will not be logged within the drive itself.
> Thus, the only kind of I/O errors or anomalies you'll be able to verify
> are purely OS-level.  Oh well, there isn't anything anyone can do about
> this.
> 
> So let's recap the original OS errors you saw in FreeBSD:
> 
>> Jan  7 10:04:24 zfs kernel: ahcich3: Timeout on slot 27 port 0
>> Jan  7 10:04:24 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 38000000 rs 38000000 tfd c0 serr 00000000 cmd 0004dd17
>> Jan  7 10:04:56 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
>> Jan  7 10:05:26 zfs kernel: ahcich3: Timeout on slot 29 port 0
>> Jan  7 10:05:26 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17
>> Jan  7 10:05:57 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
>> Jan  7 10:06:27 zfs kernel: ahcich3: Timeout on slot 29 port 0
>> Jan  7 10:06:27 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17
>> Jan  7 10:06:27 zfs kernel: (ada2:ahcich3:0:0:0): lost device
>> Jan  7 10:06:58 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
>> Jan  7 10:07:28 zfs kernel: ahcich3: Timeout on slot 29 port 0
>> Jan  7 10:07:28 zfs kernel: ahcich3: is 00000000 cs e0000000 ss e0000000 rs e0000000 tfd 80 serr 00000000 cmd 0004dd17
>> Jan  7 10:08:16 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
>> Jan  7 10:08:16 zfs kernel: ahcich3: Poll timeout on slot 31 port 0
>> Jan  7 10:08:16 zfs kernel: ahcich3: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
>> Jan  7 10:08:46 zfs kernel: ahcich3: Timeout on slot 31 port 0
>> Jan  7 10:08:46 zfs kernel: ahcich3: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
>> Jan  7 10:08:48 zfs kernel: (ada2:ahcich3:0:0:0): removing device entry
>> Jan  7 10:09:33 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
>> Jan  7 10:09:33 zfs kernel: ahcich3: Poll timeout on slot 31 port 0
>> Jan  7 10:09:33 zfs kernel: ahcich3: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
> 
> What is shown here appears to be the SSD disk simply falling off the
> SATA bus.  Do not get confused about the "slots" and how the numbers
> there change; that has nothing to do with SATA ports or anything like
> that, it's an internal AHCI protocol thing.  (I believe FreeBSD supports
> distributing commands across multiple slots for added benefits.)
> 
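
The slot numbers can be related to the cs/ss/rs fields with a small
sketch. It assumes, which the thread does not state, that cs, ss and rs
are 32-bit masks with one bit per AHCI command slot, and that the low
byte of tfd is the ATA status register (bit 7 BSY, bit 6 DRDY, bit 3 DRQ,
bit 0 ERR). The helper names are invented for illustration.

```python
import re

# ATA status register bits in the low byte of tfd (assumed mapping).
ATA_STATUS_BITS = {7: "BSY", 6: "DRDY", 3: "DRQ", 0: "ERR"}


def slots_in_mask(mask):
    """Return the slot numbers whose bit is set in a 32-bit mask."""
    return [bit for bit in range(32) if (mask >> bit) & 1]


def status_flags(tfd):
    """Name the status bits that are set in the low byte of tfd."""
    return [name for bit, name in ATA_STATUS_BITS.items() if (tfd >> bit) & 1]


def decode(line):
    """Pull the hex fields out of one ahcich status line and decode them."""
    pairs = re.findall(r"\b(is|cs|ss|rs|tfd|serr|cmd) ([0-9a-f]+)", line)
    fields = {key: int(value, 16) for key, value in pairs}
    return {
        "cs": slots_in_mask(fields["cs"]),
        "ss": slots_in_mask(fields["ss"]),
        "rs": slots_in_mask(fields["rs"]),
        "status": status_flags(fields["tfd"]),
    }


# First status line from the log quoted above.
first = ("ahcich3: is 00000000 cs 20000000 ss 38000000 rs 38000000 "
         "tfd c0 serr 00000000 cmd 0004dd17")
print(decode(first))
```

Read this way, 20000000 is bit 29 and 80000000 is bit 31, which lines up
with the "Timeout on slot 29" and "slot 31" messages, and tfd 80 would
mean the drive is holding BSY.
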
> Everything above indicates that after 30 seconds (well, 31 seconds
> exactly, but I imagine it's 30 seconds plus 1 extra second due to how
> the timeout loop might be written) the drive stopped responding to
> commands on the AHCI protocol level.

So you say that in that 30 secs, it did respond to other commands?

I would read it as:
	issue command
	wait for it to return with result
	BANG, after 30 secs timeout, because there was no result
	 returned

And then it finally gave up, due to too many tries.
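
That reading can be checked against the timestamps in the log. The
sketch below uses only the times of the first six messages and prints the
gap between consecutive ones; it says nothing about how the driver's
timeout loop is really written, and the function names are made up for
the example.

```python
def seconds_of_day(stamp):
    """Convert an HH:MM:SS string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def gaps(events):
    """Return the number of seconds between consecutive events."""
    times = [seconds_of_day(stamp) for stamp, _ in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Timestamps and messages copied from the kernel log quoted above.
events = [
    ("10:04:24", "Timeout on slot 27"),
    ("10:04:56", "AHCI reset: device not ready"),
    ("10:05:26", "Timeout on slot 29"),
    ("10:05:57", "AHCI reset: device not ready"),
    ("10:06:27", "Timeout on slot 29"),
    ("10:06:58", "AHCI reset: device not ready"),
]

for (stamp, message), gap in zip(events[1:], gaps(events)):
    print(f"{stamp}  +{gap:2d}s  {message}")
```

Each step follows the previous one by 30 to 32 seconds, which fits a
cycle of command timeout, reset attempt, reset timeout, and retry.
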

> This could be caused by a multitude of things, and it is very difficult
> for me remotely to diagnose any of these:
> 
> - Power supply issues (voltage ripple, not enough amps on that port,
>   shoddy or loose SATA power connector)

It's a Supermicro chassis with 16 hot-swap bays, of which
only 11 are in use.
It could be power, but it's a relatively new box.

> - SATA cable issues (cable too long, possibly some broken copper within
>   the cable itself (very unlikely though), etc.)

Too long, it might be. The SATA cable runs the full width of a 19" server.

> - SATA port (physical) problems; dust in connectors, etc.

???, I'll check, but the room it is in is reasonably clean.

> - SSD-level issues.  There are so many possibilities here (more than
>   on a MHDD) that it's almost impossible to list them all off:
>   -- Internal garbage collection mechanism (this is different than TRIM)
>      on drive may be overly aggressive and stalls all I/O to drive
>      during heavy GC.  This would be classified as a firmware bug
>   -- Power circuitry on PCB may be flaky
>   -- Drive may have locked up hard due to other firmware bugs or some
>      form of very low-level electrical/electronic error
>   -- Internal SSD SATA + NAND flash I/O controller failure

That was why I was asking if there was anybody else out there that ran
into something similar with this type.

> For those considering the remote possibility of interoperability issues
> between the Corsair SSD and the AHCI controller -- it's possible, but
> highly unlikely.  The controller itself is an Intel ICH9, which FreeBSD has
> excellent support for and is very reliable.  So, the controller here is
> probably not at fault.  I imagine if there were incompatibilities of
> this sort (between ICH9 and Corsair), we'd have heard about it.

That was my thinking as well.

> I have seen many drives in my time (many means hundreds, no
> exaggeration) "lock up" or fall off the bus, both on SCSI and SATA.
> It's very difficult to troubleshoot these kinds of issues as I said, and
> usually requires someone with extensive knowledge to figure it out.
> General "Tier 1" Technical Support at most companies does not have this
> level of expertise, so don't expect that from Corsair.
> 
> I look forward to seeing the output from the below 3 commands, as they
> may provide more insights to what actually transpired.  Whether or not
> Corsair chose to implement these in the General Purpose Log area of
> SMART is unknown, however.  Furthermore, they may actually implement
> them, but stick them in a non-common place (e.g. different GPLog
> offsets), but PLEASE DO NOT go tinkering around with -l gplog,0xXX
> values.
> 
>>> * smartctl -l devstat /dev/whatever
>>> * smartctl -l sataphy /dev/whatever
>>> * smartctl -l ssd /dev/whatever

It'll have to wait until later this evening, before I am able to swap the
disk back into the old box.

Thanx for explaining,
--WjW
