From owner-freebsd-stable@FreeBSD.ORG Sat Oct 2 23:23:21 2010 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1666D106564A for ; Sat, 2 Oct 2010 23:23:21 +0000 (UTC) (envelope-from dan@langille.org) Received: from nyi.unixathome.org (nyi.unixathome.org [64.147.113.42]) by mx1.freebsd.org (Postfix) with ESMTP id B8E4C8FC13 for ; Sat, 2 Oct 2010 23:23:20 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by nyi.unixathome.org (Postfix) with ESMTP id 1E0BB508D8; Sun, 3 Oct 2010 00:23:20 +0100 (BST) X-Virus-Scanned: amavisd-new at unixathome.org Received: from nyi.unixathome.org ([127.0.0.1]) by localhost (nyi.unixathome.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id UthwVOVZhghc; Sun, 3 Oct 2010 00:23:19 +0100 (BST) Received: from smtp-auth.unixathome.org (smtp-auth.unixathome.org [10.4.7.7]) (Authenticated sender: hidden) by nyi.unixathome.org (Postfix) with ESMTPSA id C6052508AD ; Sun, 3 Oct 2010 00:23:19 +0100 (BST) Message-ID: <4CA7BEE4.9050201@langille.org> Date: Sat, 02 Oct 2010 19:23:16 -0400 From: Dan Langille Organization: The FreeBSD Diary User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4 MIME-Version: 1.0 To: Jeremy Chadwick References: <4CA73702.5080203@langille.org> <20101002141921.GC70283@icarus.home.lan> <4CA7AD95.9040703@langille.org> <20101002223626.GB78136@icarus.home.lan> In-Reply-To: <20101002223626.GB78136@icarus.home.lan> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-stable Subject: Re: out of HDD space - zfs degraded X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Oct 2010 23:23:21 -0000 On 10/2/2010 6:36 PM, Jeremy Chadwick wrote: > On Sat, Oct 02, 2010 at 06:09:25PM -0400, Dan Langille wrote: >> On 10/2/2010 10:19 AM, Jeremy Chadwick wrote: >>> On Sat, Oct 02, 2010 at 09:43:30AM -0400, Dan Langille wrote: >>>> Overnight I was running a zfs send | zfs receive (both within the >>>> same system / zpool). The system ran out of space, a drive went off >>>> line, and the system is degraded. >>>> >>>> This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18 >>>> 23:43:48 EDT 2010. >>>> >>>> The following logs are also available at >>>> http://www.langille.org/tmp/zfs-space.txt<- no line wrapping >>>> >>>> This is what was running: >>>> >>>> # time zfs send storage/bacula@transfer | mbuffer | zfs receive >>>> storage/compressed/bacula-mbuffer >>>> in @ 0.0 kB/s, out @ 0.0 kB/s, 3670 GB total, buffer 100% >>>> fullcannot receive new filesystem stream: out of space >>>> mbuffer: error: outputThread: error writing to at offset >>>> 0x395917c4000: Broken pipe >>>> >>>> summary: 3670 GByte in 10 h 40 min 97.8 MB/s >>>> mbuffer: warning: error during output to: Broken pipe >>>> warning: cannot send 'storage/bacula@transfer': Broken pipe >>>> >>>> real 640m48.423s >>>> user 8m52.660s >>>> sys 211m40.862s >>>> >>>> >>>> Looking in the logs, I see this: >>>> >>>> Oct 2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device >>>> Oct 2 00:50:54 kraken kernel: siisch0: Timeout on slot 30 >>>> Oct 2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss >>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 >>>> Oct 2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT >>>> Oct 2 00:50:55 kraken kernel: siisch0: Timeout on slot 30 >>>> Oct 2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss >>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 >>>> Oct 2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT >>>> Oct 2 00:50:56 kraken kernel: siisch0: Timeout on slot 30 >>>> Oct 2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss >>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 >>>> Oct 2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT >>>> Oct 2 00:50:57 kraken kernel: siisch0: Timeout on slot 30 >>>> Oct 2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss >>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 >>>> Oct 2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT >>>> Oct 2 00:50:58 kraken kernel: siisch0: Timeout on slot 30 >>>> Oct 2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss >>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 >>>> Oct 2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT >>>> Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage >>>> path=/dev/gpt/disk06-live offset=270336 size=8192 error=6 >>>> >>>> Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize >>>> cache failed >>>> Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry >>>> >>>> Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage >>>> path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6 >>>> Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage >>>> path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6 >>>> >>>> $ zpool status >>>> pool: storage >>>> state: DEGRADED >>>> scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go >>>> config: >>>> >>>> NAME STATE READ WRITE CKSUM >>>> storage DEGRADED 0 0 0 >>>> raidz2 DEGRADED 0 0 0 >>>> gpt/disk01-live ONLINE 0 0 0 >>>> gpt/disk02-live ONLINE 0 0 0 >>>> gpt/disk03-live ONLINE 0 0 0 >>>> gpt/disk04-live ONLINE 0 0 0 >>>> gpt/disk05-live ONLINE 0 0 0 >>>> gpt/disk06-live REMOVED 0 0 0 >>>> gpt/disk07-live ONLINE 0 0 0 >>>> >>>> $ zfs list >>>> NAME USED AVAIL REFER MOUNTPOINT >>>> storage 6.97T 1.91T 1.75G /storage >>>> storage/bacula 4.72T 1.91T 4.29T /storage/bacula >>>> storage/compressed 2.25T 1.91T 46.9K /storage/compressed >>>> storage/compressed/bacula 2.25T 1.91T 42.7K /storage/compressed/bacula >>>> storage/pgsql 5.50G 1.91T 5.50G /storage/pgsql >>>> >>>> $ sudo camcontrol devlist >>>> Password: >>>> at scbus2 target 0 lun 0 (pass1,ada1) >>>> at scbus3 target 0 lun 0 (pass2,ada2) >>>> at scbus4 target 0 lun 0 (pass3,ada3) >>>> at scbus5 target 0 lun 0 (pass4,ada4) >>>> at scbus6 target 0 lun 0 (pass5,ada5) >>>> at scbus7 target 0 lun 0 (pass6,ada6) >>>> at scbus8 target 0 lun 0 (pass7,ada7) >>>> at scbus9 target 0 lun 0 (cd0,pass8) >>>> at scbus10 target 0 lun 0 (pass9,ada8) >>>> >>>> I'm not yet sure if the drive is fully dead or not. This is not a >>>> hot-swap box. >>> >>> It looks to me like the disk labelled gpt/disk06-live literally stopped >>> responding to commands. The errors you see are coming from the OS and >>> the siis(4) controller, and both indicate the actual hard disk isn't >>> responding to the ATA command READ LOG EXT. error=6 means Device not >>> configured. >>> >>> I can't see how/why running out of space would cause this. It looks >>> more like that you had a hardware issue of some sort happen during the >>> course of the operations you were running. It may not have happened >>> until now because you hadn't utilised writes to that area of the disk >>> (could have bad sectors there, or physical media/platter problems). >>> >>> Please provide smartctl -a output for the drive that's gpt/disk06-live, >>> which I assume is /dev/ada6 (glabel sure makes correlation easy, doesn't >>> it? Sigh...). Please put the results up on the web somewhere, not >>> copy-pasted, otherwise I have to do a bunch of manual work with regarsd >>> to line wrapping/etc... I'll provide an analysis of SMART stats for >>> you, to see if anything crazy happened to the disk itself. >> >> It is ada0, I'm sure, based on the 'lost device' mentioned in >> /var/log/messages above. >> >> I'm getting nowhere. /dev/ada0 does not exist so there is nothing >> for smartctl to work on. >> >> [...] >> >> $ ls -l /dev/ada0* >> ls: /dev/ada0*: No such file or directory > > Okay, so gpt/disk06-live is /dev/ada0. (I won't ask why the label is > called "disk06", but whatever. :-) ) > >> I am tempted to reboot or do a camontrol scan. > > DO NOT REBOOT.You can try the following -- I'm not sure whether to > use scbus0 or scbus1 as the argument however, since I don't know what > scbusX number ada0 was attached to previously. "dmesg" from when the > machine booted would show this. > > camcontrol reset scbusX > camcontrol rescan scbusX I see this in /var/run/dmesg.boot: ada0 at siisch0 bus 0 scbus0 target 0 lun 0 ada0: ATA-8 SATA 2.x device ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada0: Command Queueing enabled $ sudo camcontrol reset scbus0 Password: Reset of bus 0 was successful $ sudo camcontrol rescan scbus0 Re-scan of bus 0 was successful > If the disk comes back, please smartctl -a it. I didn't come back: $ ls /dev/ada* /dev/ada1 /dev/ada2p1 /dev/ada4 /dev/ada5p1 /dev/ada7 /dev/ada1p1 /dev/ada3 /dev/ada4p1 /dev/ada6 /dev/ada8 /dev/ada2 /dev/ada3p1 /dev/ada5 /dev/ada6p1 FYI, there's nothing new in /var/log/messages as a results of those commands. -- Dan Langille - http://langille.org/