Date:      Thu, 30 Apr 2020 12:06:50 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Stefan Bethke <stb@lassitu.de>
Cc:        freebsd-stable <freebsd-stable@freebsd.org>
Subject:   Re: nvme0 error
Message-ID:  <CANCZdfqKPH-xxd2AWnAFmq9opNs=_-7T2d=txCVPVHDsvFxQ_g@mail.gmail.com>
In-Reply-To: <636DB3B3-E4C7-4A17-AB79-8AFDC6352712@lassitu.de>
References:  <636DB3B3-E4C7-4A17-AB79-8AFDC6352712@lassitu.de>

On Thu, Apr 30, 2020 at 11:48 AM Stefan Bethke <stb@lassitu.de> wrote:

> nvme0: async event occurred (type 0x1, info 0x00, page 0x02)
> nvme0: device reliability degraded
>

type 1: SMART event
info 0: reliability error
page 2: the SMART/Health log page, look there to see what's up
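
A minimal sketch of that decoding, assuming the async event completion
dword 0 layout from the NVMe spec (event type in bits 2:0, event info in
bits 15:8, log page in bits 23:16). The function and table names are made
up for illustration; they are not taken from the FreeBSD driver.

```python
# Illustrative only: these names are hypothetical, not from nvme(4).
ASYNC_EVENT_TYPES = {
    0x0: "error status",
    0x1: "SMART/health status",
    0x2: "notice",
}
SMART_EVENT_INFO = {
    0x00: "NVM subsystem reliability",
    0x01: "temperature threshold",
    0x02: "spare below threshold",
}

def decode_async_event(cdw0):
    """Split completion dword 0 into (event type, event info, log page)."""
    event_type = cdw0 & 0x7          # bits 2:0
    event_info = (cdw0 >> 8) & 0xFF  # bits 15:8
    log_page = (cdw0 >> 16) & 0xFF   # bits 23:16
    return event_type, event_info, log_page

# Rebuild the dword from the values in the console message.
cdw0 = (0x02 << 16) | (0x00 << 8) | 0x1
etype, info, page = decode_async_event(cdw0)
print(ASYNC_EVENT_TYPES.get(etype, "other"),
      "/", SMART_EVENT_INFO.get(info, "other"),
      "/ log page 0x%02x" % page)
```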

The NVMe 1.4 standard says:
NVM subsystem Reliability: NVM subsystem reliability has been compromised.
This may be due to significant media errors, an internal error, the media
being placed in read only mode, or a volatile memory backup device failing.
This status value shall not be used if the read-only condition on the media
is due to a change in the write protection state of a namespace (refer to
section 8.19.1).

> Should I be concerned? I'm using this Samsung SSD as cache and log for ZFS
> on a 12-stable machine.
>
> nvd0: <SAMSUNG MZVPW128HEGM-00000> NVMe namespace
> nvd0: 122104MB (250069680 512 byte sectors)
>
> # nvmecontrol logpage -p 2 nvme0
> SMART/Health Information Log
> ============================
> Critical Warning State:         0x04
>  Available spare:               0
>  Temperature:                   0
>  Device reliability:            1
>  Read only:                     0
>  Volatile memory backup:        0
> Temperature:                    311 K, 37.85 C, 100.13 F
> Available spare:                100
> Available spare threshold:      10
> Percentage used:                110
> Data units (512,000 byte) read: 18417596
> Data units written:             164091845
> Host read commands:             499986873
> Host write commands:            1491808067
> Controller busy time (minutes): 48315
> Power cycles:                   59
> Power on hours:                 20432
> Unsafe shutdowns:               26
> Media errors:                   0
> No. error info log entries:     22
> Warning Temp Composite Time:    0
> Error Temp Composite Time:      0
> Temperature Sensor 1:           311 K, 37.85 C, 100.13 F
> Temperature Sensor 2:           330 K, 56.85 C, 134.33 F
> Temperature 1 Transition Count: 0
> Temperature 2 Transition Count: 0
> Total Time For Temperature 1:   0
> Total Time For Temperature 2:   0
>
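
The "Critical Warning State" byte in the quoted log is a bitfield, and
0x04 is the device reliability bit on its own. A small sketch of the
decoding, assuming the bit order matches the order nvmecontrol prints the
flags in above (bit 0 first); the names are illustrative only.

```python
# Bit masks for the SMART/Health "Critical Warning" byte, lowest bit first.
CRITICAL_WARNING_BITS = [
    (0x01, "Available spare"),
    (0x02, "Temperature"),
    (0x04, "Device reliability"),
    (0x08, "Read only"),
    (0x10, "Volatile memory backup"),
]

def decode_critical_warning(state):
    """Return a dict mapping each flag name to 0 or 1."""
    return {name: int(bool(state & mask)) for mask, name in CRITICAL_WARNING_BITS}

for name, flag in decode_critical_warning(0x04).items():
    print("%-24s %d" % (name + ":", flag))
```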

I'm thinking the percentage used of 110 may be the thing it's alerting on.
The standard says:

Percentage Used: Contains a vendor specific estimate of the percentage of
NVM subsystem life used based on the actual usage and the manufacturer's
prediction of NVM life. A value of 100 indicates that the estimated
endurance of the NVM in the NVM subsystem has been consumed, but may not
indicate an NVM subsystem failure. The value is allowed to exceed 100.
Percentages greater than 254 shall be represented as 255. This value shall
be updated once per power-on hour (when the controller is not in a sleep
state). Refer to the JEDEC JESD218A standard for SSD device life and
endurance measurement techniques.
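
For a sense of scale, here is a rough back-of-the-envelope calculation from
the numbers in the quoted log. It assumes the data units are 512,000 bytes
each, as nvmecontrol labels them, and takes the capacity from the nvd0
line. The drive's rated endurance is not stated in this thread, so none is
assumed here; the variable names are illustrative.

```python
DATA_UNIT_BYTES = 512_000          # per the "Data units (512,000 byte)" label
data_units_written = 164_091_845   # from the SMART/Health log above
sectors = 250_069_680              # from the nvd0 line
sector_bytes = 512

bytes_written = data_units_written * DATA_UNIT_BYTES
capacity_bytes = sectors * sector_bytes
drive_writes = bytes_written / capacity_bytes

def reported_percentage_used(estimate):
    """Percentage Used is allowed to exceed 100 and is capped at 255."""
    return min(int(estimate), 255)

print("written: %.1f TB" % (bytes_written / 1e12))
print("full-drive writes: %.0f" % drive_writes)
print("percentage used as reported:", reported_percentage_used(110))
```

That works out to roughly 84 TB written, or several hundred full overwrites
of a 128 GB device, which is consistent with a vendor estimate above 100.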

Warner


>
> Stefan
>
> --
> Stefan Bethke <stb@lassitu.de>   Fon +49 151 14070811
>
>


