Date: Sun, 11 Oct 2009 15:41:51 +0100
From: Alex Trull <alextzfs@googlemail.com>
To: freebsd-fs@freebsd.org
Cc: pjd@freebsd.org
Subject: zraid2 loses a single disk and becomes difficult to recover
Message-ID: <4d98b5320910110741w794c154cs22b527485c1938da@mail.gmail.com>
Hi All,

My raidz2 pool broke this morning on RELENG_7 with ZFS v13. The system failed this morning and came back without the pool, having lost a disk. This is how I found the system:

  pool: fatman
 state: FAULTED
status: One or more devices could not be used because the label is
        missing or invalid.  There are insufficient replicas for the pool
        to continue functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        fatman      FAULTED      0     0     1  corrupted data
          raidz2    DEGRADED     0     0     6
            da2     FAULTED      0     0     0  corrupted data
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad20    ONLINE       0     0     0
            ad22    ONLINE       0     0     0
            ad17    ONLINE       0     0     0
            da2     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad16    ONLINE       0     0     0

Initially it complained that da3 had moved to da2 (da2 had failed and was no longer seen). I replaced the original da2 and bumped what was originally da3 back up to da3 using the controller's ordering.

[root@potjie /dev]# zpool status
  pool: fatman
 state: FAULTED
status: One or more devices could not be used because the label is
        missing or invalid.  There are insufficient replicas for the pool
        to continue functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        fatman      FAULTED      0     0     1  corrupted data
          raidz2    ONLINE       0     0     6
            da2     UNAVAIL      0     0     0  corrupted data
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad20    ONLINE       0     0     0
            ad22    ONLINE       0     0     0
            ad17    ONLINE       0     0     0
            da3     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad16    ONLINE       0     0     0

The issue looks very similar to this one (JMR's issue):
http://freebsd.monkey.org/freebsd-fs/200902/msg00017.html

I've tried the methods there without much result.
Using JMR's patches/debugs to see what is going on, this is what I got:

JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46475459 ub_timestamp=1255231780
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46475458 ub_timestamp=1255231750
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46481472 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834

But JMR's patch still doesn't let me import, even with a decremented txg.

I then had a look around the drives using zdb and some dirty script:

[root@potjie /dev]# ls /dev/ad* /dev/da2 /dev/da3 | awk '{print "echo "$1";zdb -l "$1" |grep txg"}' | sh
/dev/ad10
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad16
txg=46408223  <- old TXG id?
txg=46408223
txg=46408223
txg=46408223
/dev/ad17
txg=46408223  <- old TXG id?
txg=46408223
txg=46408223
txg=46408223
/dev/ad18 (ssd)
/dev/ad19 (spare drive, removed from pool some time ago)
txg=0
create_txg=0
txg=0
create_txg=0
txg=0
create_txg=0
txg=0
create_txg=0
/dev/ad20
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad22
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad4
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad6
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/da2  <- new drive, replaced broken da2
/dev/da3
txg=46488654
txg=46488654
txg=46488654
txg=46488654

I did not see any checksum errors or other issues on ad16 and ad17 previously, and I do check regularly.

Any thoughts on what to try next?

Regards,
Alex
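[For anyone following along: the one-liner above just greps txg= out of each device's four ZFS labels; the thing to look for is labels that disagree, as with ad16/ad17 versus the rest of the pool here. A minimal standalone sketch of that check is below. It parses canned sample label text (made up to match the ad16 output above) rather than a real device, so the logic can be run anywhere; on a live system you would feed it `zdb -l /dev/ad16` instead.]

```shell
#!/bin/sh
# Check whether all txg= lines in a device's ZFS labels agree.
# sample_labels stands in for `zdb -l /dev/ad16 | grep txg` here,
# using invented-but-representative output for illustration.
sample_labels() {
cat <<'EOF'
txg=46408223
txg=46408223
txg=46408223
txg=46408223
EOF
}

# Collect the distinct txg values across the labels.
txgs=$(sample_labels | sed -n 's/^txg=//p' | sort -u)
count=$(echo "$txgs" | wc -l | tr -d ' ')

if [ "$count" -eq 1 ]; then
    echo "labels agree: txg=$txgs"
else
    echo "label mismatch across labels:"
    echo "$txgs"
fi
```

With the four identical sample lines this reports agreement; a device whose labels carry a stale txg relative to its siblings would show up as a per-pool mismatch when the same check is run over every member.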