Date: Fri, 27 Mar 2020 17:39:33 -0700
From: David Christensen <dpchrist@holgerdanske.com>
To: freebsd-questions@freebsd.org
Subject: Re: drive selection for disk arrays
Message-ID: <1bcd7aa2-31e5-91f1-5151-926c9d16e16e@holgerdanske.com>
In-Reply-To: <20200327104555.1d6d7cd9.freebsd@edvax.de>
References: <20200325081814.GK35528@mithril.foucry.net> <713db821-8f69-b41a-75b7-a412a0824c43@holgerdanske.com> <20200326124648725158537@bob.proulx.com> <alpine.BSF.2.21.9999.2003261630030.47777@mail2.nber.org> <20200327104555.1d6d7cd9.freebsd@edvax.de>
On 2020-03-27 02:45, Polytropon wrote:
> When a drive _reports_ bad sectors, at least in the past
> it was an indication that it already _has_ lots of them.
> The drive's firmware will remap bad sectors to spare
> sectors, so "no error" so far.

If a drive detects an error, my guess is that it will report the error to
the OS regardless of the outcome of the particular I/O operation (data
read, data written, data lost) or the internal actions taken (block marked
bad, block remapped, etc.).  It is then up to the OS to decide what to do
next.  RAID and/or ZFS offer the means for shielding the application from
I/O and drive failures.

> When errors are being
> reported "upwards" ("read error" or "write error"
> visible to the OS), it's a sign that the disk has run
> out of spare sectors, and the firmware cannot silently
> remap _new_ bad sectors...
>
> Is this still the case with modern drives?
>
> How transparently can ZFS handle drive errors when the
> drives only report the "top results" (i. e., cannot cope
> with bad sectors internally anymore)? Do SMART tools help
> here, for example, by reading certain firmware-provided
> values that indicate how many sectors _actually_ have
> been marked as "bad sector", remapped internally, and
> _not_ reported to the controller / disk I/O subsystem /
> filesystem yet? This should be a good indicator of "will
> fail soon", so a replacement can be done while no data
> loss or other problems appears.

I have been using smartctl(8) occasionally for many years.  The "SMART
Attributes Data Structure" report would seem to hold statistics that
should be useful for predicting failures.

This is my SOHO server:

2020-03-27 17:20:00 toor@f3 ~
# freebsd-version ; uname -a
12.1-RELEASE-p2
FreeBSD f3.tracy.holgerdanske.com 12.1-RELEASE-p2 FreeBSD 12.1-RELEASE-p2 GENERIC amd64

This is a data drive:

2020-03-27 17:20:05 toor@f3 ~
# geom disk list ada1
Geom name: ada1
Providers:
1.
Name: ada1
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Mode: r1w1e3
   descr: SEAGATE ST33000650NS
   lunid: 5000c5004e7ce23f
   ident: <redacted>
   rotationrate: 7200
   fwsectors: 63
   fwheads: 16

2020-03-27 17:20:08 toor@f3 ~
# smartctl -x /dev/ada1 | grep -A 30 'SMART Attributes Data Structure'
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   078   066   044    -    78783152
  3 Spin_Up_Time            PO----   092   091   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    20
  5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
  7 Seek_Error_Rate         POSR--   066   060   030    -    4532285
  9 Power_On_Hours          -O--CK   100   100   000    -    612
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    20
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   051   046   045    -    49 (Min/Max 39/54)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
193 Load_Cycle_Count        -O--CK   100   100   000    -    20
194 Temperature_Celsius     -O---K   049   054   000    -    49 (0 21 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   033   031   000    -    78783152
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

The following attributes look like they may be related to drive failure,
but I do not know the engineering definition of these attributes nor the
engineering definition of the values reported:

    Reallocated_Sector_Ct
    Seek_Error_Rate
    End-to-End_Error
    Reported_Uncorrect
    Hardware_ECC_Recovered
    Offline_Uncorrectable
    UDMA_CRC_Error_Count

I do feel the need to implement automated SMART monitoring, but have yet
to embark on that journey.

David
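P.S.  A first step toward that automated monitoring could be a small
script that warns on nonzero raw values for a handful of the counters
above.  A rough sketch follows; the set of attribute IDs checked and the
"any nonzero raw value" threshold are my own assumptions, not anything
published by smartmontools, and in real use the input would come from
`smartctl -A /dev/ada1` rather than the canned here-document.  (smartd(8)
from sysutils/smartmontools does this kind of monitoring natively.)

```shell
#!/bin/sh
# Sketch: scan a smartctl(8) attribute table for counters commonly
# associated with impending failure and warn on any nonzero raw value.
# The list of IDs (5, 187, 188, 197, 198, 199) is an assumption.

check_attrs() {
    awk '$1 ~ /^(5|187|188|197|198|199)$/ {
        raw = $NF                      # RAW_VALUE is the last column
        if (raw + 0 > 0)
            printf "WARNING: %s (id %s) raw value %s\n", $2, $1, raw
        else
            printf "ok: %s (id %s) raw value 0\n", $2, $1
    }'
}

# Canned input re-using the ada1 table above; in real use, pipe in
# `smartctl -A /dev/ada1 | check_attrs` instead.
check_attrs <<'EOF'
  5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
EOF
```

Run from cron or periodic(8), the WARNING lines could be mailed to root,
which is roughly what smartd's -m directive does out of the box.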