Date:      Sat, 17 Feb 2018 07:41:49 +0000 (UTC)
From:      James Phillips <anti_spam256@yahoo.ca>
To:        <freebsd-fs@freebsd.org>
Subject:   ZFS trashed by bad import
Message-ID:  <366879496.203508.1518853309719@mail.yahoo.com>
References:  <366879496.203508.1518853309719.ref@mail.yahoo.com>

I was considering posting this on the forum, but the rules on topic selection suggested that really specific things should go on the mailing list.

Short version (reconstructed from notes):

On a fresh 11.1 install:

# zpool import  -> shows list of available pools, including degraded striped mirror.
# zpool import -f 8255478166520290766 granny
# zpool status (any zpool command causes the same error):
     internal error: failed to initialize ZFS library

Upon reboot, I was not able to switch VT consoles or log in.

I tried telling the BIOS to boot from my old installation (granny), and it failed after kernel device detection.

*Background*:

Granny was originally a 160GB ZFS mirror under FreeBSD 10. I later expanded the pool with a mirrored pair of 80GB drives.

I had successfully tested booting with a simulated controller failure (each mirror was on a different disk controller, and all drives had a boot partition set).

About a week ago, one of my drives appeared to fail:

(ada1:ata2:0:1:0): Error 5. retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) label/granny3p1.eli[READ(offset=32083968, lengt...
swap_pager: I/O error - pagein failed; blkno 2367129, size 4096, error 5
vm_fault: pager read error, pid 91969 (xfdesktop)
(ada1:ata2:0:1:0): READ_DMA. ACB: c8 00 32 9b 00 40 00 00 00 00 18 00
(ada1:ata2:0:1:0): CAM status: ATA Status Error
(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ata2:0:1:0): RES: 51 40 42 9b 00 00 00 00 00 08
(ada1:ata2:0:1:0): Retrying command
(ada1:ata2:0:1:0): READ_DMA. ACB: c8 00 32 9b 00 40 00 00 00 00 18 00
...

I was able to (temporarily) use my computer again by pulling one of the IDE cables. (By luck I guessed the right side the first time -- I did not notice the label in the log above until I typed this.)
I was a little surprised that it was not the drive that had been re-certified by the manufacturer's software after throwing errors (years ago).

I decided to resolve the problem by moving to a ZFS mirror on a pair of 2TB drives. Incidentally, I accidentally deleted pkg while trying to update the ports collection, so I decided a fresh BSD 11 install might be a good idea as well.

*Confounding variables*:

While pulling the defective half of the mirror, I tentatively attributed the failure to heat death due to dust build-up on the air intake. However, I also noticed the Northbridge heatsink was loose due to a broken clip.

Because my "real" machine (with ECC RAM, even) is going to be delayed at least a week, I decided to do a temporary board swap with an older machine I had lying around. That machine was overclocked by under-volting and pushing the thermal limits of the CPU (while under-clocking the RAM), then backing off a bit to tolerate summer heat.

I mention the overclocking because the system failed to boot properly after installation. I bumped the voltage a little, but the failure may instead have been caused by the BIOS booting from an unexpected drive (the 2TB disks were seen as ad2 and ad3). The overclock was stable when that machine went into storage around a year ago. However, it is now in a case with a different PSU (same wattage, more efficient) and more drives.

Tried all the ZFS options in the BSD 11 install wizard:
    2 disk mirror
    4k sectors - GPT partition
    Encrypted disks - 50GB swap (large for the memory: 3200MB)
    Mirror swap - Encrypt swap
-> Note: granny only had encrypted (non mirrored) swap: could not get encrypted striping to work.

System hardening:
- clean /tmp on startup
- disable opening syslogd network socket

At the time of the failure, I was running mprime (prime95) in the background and periodically monitoring CPU temperature and fan speed. This implies that ZFS had only ~1600MB to work with (3200MB total, minus 1600MB used by mprime).

*Next Steps*:

1. image all 4 drives (one at a time) onto a third 2TB drive with the System Rescue CD and dd-rescue.
2. Try to import the degraded mirror with a BSD live DVD (and re-export if successful, I guess)
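
For reference, steps 1 and 2 might look roughly like the dry-run sketch below, which only prints the commands rather than running them. Several things here are assumptions and not from my notes: that the tool is GNU ddrescue, that the rescue CD sees the four drives as /dev/sda through /dev/sdd, and that the third 2TB drive is mounted at /mnt/rescue. The function names are made up for illustration. Importing read-only under an alternate root is one cautious way to try step 2; it is not something I have tested on this pool.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# Assumed (hypothetical) values: device names and the /mnt/rescue mount point.
IMAGE_DIR="${IMAGE_DIR:-/mnt/rescue}"

# Print one GNU ddrescue invocation for a source device.
# -n skips the slow scraping phase on the first pass; the map file
# records progress so a later run can resume and retry bad areas.
plan_image() {
    dev="$1"
    name=$(basename "$dev")
    echo "ddrescue -n $dev $IMAGE_DIR/$name.img $IMAGE_DIR/$name.map"
}

# Step 1: image the drives one at a time.
plan_imaging() {
    for dev in "$@"; do
        plan_image "$dev"
    done
}

# Step 2: import read-only by pool ID under an alternate root,
# check the pool state, then export it again.
plan_import() {
    pool_id="$1"
    pool_name="$2"
    echo "zpool import -f -o readonly=on -R /mnt $pool_id $pool_name"
    echo "zpool status -v $pool_name"
    echo "zpool export $pool_name"
}

plan_imaging /dev/sda /dev/sdb /dev/sdc /dev/sdd
plan_import 8255478166520290766 granny
```

Printing the commands first makes it possible to check the device names against the actual drives before anything is written to the spare disk.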

Depending on results of step 2:
- find machine with ECC RAM, put granny3 on a fresh drive, and tell ZFS to scrub?
- copy over boot partitions that may have been clobbered by BSD 11 install?

If all else fails, I did do a full export in the last 90 days.

Regards,

James Phillips

Note: not subscribed to the list.


