From owner-freebsd-stable@FreeBSD.ORG  Sat Oct  2 23:23:21 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1666D106564A
	for <freebsd-stable@freebsd.org>; Sat,  2 Oct 2010 23:23:21 +0000 (UTC)
	(envelope-from dan@langille.org)
Received: from nyi.unixathome.org (nyi.unixathome.org [64.147.113.42])
	by mx1.freebsd.org (Postfix) with ESMTP id B8E4C8FC13
	for <freebsd-stable@freebsd.org>; Sat,  2 Oct 2010 23:23:20 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by nyi.unixathome.org (Postfix) with ESMTP id 1E0BB508D8;
	Sun,  3 Oct 2010 00:23:20 +0100 (BST)
X-Virus-Scanned: amavisd-new at unixathome.org
Received: from nyi.unixathome.org ([127.0.0.1])
	by localhost (nyi.unixathome.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id UthwVOVZhghc; Sun,  3 Oct 2010 00:23:19 +0100 (BST)
Received: from smtp-auth.unixathome.org (smtp-auth.unixathome.org [10.4.7.7])
	(Authenticated sender: hidden)
	by nyi.unixathome.org (Postfix) with ESMTPSA id C6052508AD  ;
	Sun,  3 Oct 2010 00:23:19 +0100 (BST)
Message-ID: <4CA7BEE4.9050201@langille.org>
Date: Sat, 02 Oct 2010 19:23:16 -0400
From: Dan Langille <dan@langille.org>
Organization: The FreeBSD Diary
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US;
	rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4
MIME-Version: 1.0
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
References: <4CA73702.5080203@langille.org>
	<20101002141921.GC70283@icarus.home.lan>
	<4CA7AD95.9040703@langille.org>
	<20101002223626.GB78136@icarus.home.lan>
In-Reply-To: <20101002223626.GB78136@icarus.home.lan>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-stable <freebsd-stable@freebsd.org>
Subject: Re: out of HDD space - zfs degraded
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Oct 2010 23:23:21 -0000

On 10/2/2010 6:36 PM, Jeremy Chadwick wrote:
> On Sat, Oct 02, 2010 at 06:09:25PM -0400, Dan Langille wrote:
>> On 10/2/2010 10:19 AM, Jeremy Chadwick wrote:
>>> On Sat, Oct 02, 2010 at 09:43:30AM -0400, Dan Langille wrote:
>>>> Overnight I was running a zfs send | zfs receive (both within the
>>>> same system / zpool).  The system ran out of space, a drive went off
>>>> line, and the system is degraded.
>>>>
>>>> This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18
>>>> 23:43:48 EDT 2010.
>>>>
>>>> The following logs are also available at
>>>> http://www.langille.org/tmp/zfs-space.txt<- no line wrapping
>>>>
>>>> This is what was running:
>>>>
>>>> # time zfs send storage/bacula@transfer | mbuffer | zfs receive
>>>> storage/compressed/bacula-mbuffer
>>>> in @  0.0 kB/s, out @  0.0 kB/s, 3670 GB total, buffer 100%
>>>> fullcannot receive new filesystem stream: out of space
>>>> mbuffer: error: outputThread: error writing to<stdout>   at offset
>>>> 0x395917c4000: Broken pipe
>>>>
>>>> summary: 3670 GByte in 10 h 40 min 97.8 MB/s
>>>> mbuffer: warning: error during output to<stdout>: Broken pipe
>>>> warning: cannot send 'storage/bacula@transfer': Broken pipe
>>>>
>>>> real    640m48.423s
>>>> user    8m52.660s
>>>> sys     211m40.862s
>>>>
>>>>
>>>> Looking in the logs, I see this:
>>>>
>>>> Oct  2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device
>>>> Oct  2 00:50:54 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:55 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:56 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:57 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:58 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>>>> path=/dev/gpt/disk06-live offset=270336 size=8192 error=6
>>>>
>>>> Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize
>>>> cache failed
>>>> Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry
>>>>
>>>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>>>> path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6
>>>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>>>> path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6
>>>>
>>>> $ zpool status
>>>>    pool: storage
>>>>   state: DEGRADED
>>>>   scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go
>>>> config:
>>>>
>>>>          NAME                 STATE     READ WRITE CKSUM
>>>>          storage              DEGRADED     0     0     0
>>>>            raidz2             DEGRADED     0     0     0
>>>>              gpt/disk01-live  ONLINE       0     0     0
>>>>              gpt/disk02-live  ONLINE       0     0     0
>>>>              gpt/disk03-live  ONLINE       0     0     0
>>>>              gpt/disk04-live  ONLINE       0     0     0
>>>>              gpt/disk05-live  ONLINE       0     0     0
>>>>              gpt/disk06-live  REMOVED      0     0     0
>>>>              gpt/disk07-live  ONLINE       0     0     0
>>>>
>>>> $ zfs list
>>>> NAME                        USED  AVAIL  REFER  MOUNTPOINT
>>>> storage                    6.97T  1.91T  1.75G  /storage
>>>> storage/bacula             4.72T  1.91T  4.29T  /storage/bacula
>>>> storage/compressed         2.25T  1.91T  46.9K  /storage/compressed
>>>> storage/compressed/bacula  2.25T  1.91T  42.7K  /storage/compressed/bacula
>>>> storage/pgsql              5.50G  1.91T  5.50G  /storage/pgsql
>>>>
>>>> $ sudo camcontrol devlist
>>>> Password:
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus2 target 0 lun 0 (pass1,ada1)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus3 target 0 lun 0 (pass2,ada2)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus4 target 0 lun 0 (pass3,ada3)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus5 target 0 lun 0 (pass4,ada4)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus6 target 0 lun 0 (pass5,ada5)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus7 target 0 lun 0 (pass6,ada6)
>>>> <ST380815AS 4.AAB>                   at scbus8 target 0 lun 0 (pass7,ada7)
>>>> <TSSTcorp CDDVDW SH-S223C SB01>      at scbus9 target 0 lun 0 (cd0,pass8)
>>>> <WDC WD1600AAJS-75M0A0 02.03E02>     at scbus10 target 0 lun 0 (pass9,ada8)
>>>>
>>>> I'm not yet sure if the drive is fully dead or not.  This is not a
>>>> hot-swap box.
>>>
>>> It looks to me like the disk labelled gpt/disk06-live literally stopped
>>> responding to commands.  The errors you see are coming from the OS and
>>> the siis(4) controller, and both indicate the actual hard disk isn't
>>> responding to the ATA command READ LOG EXT.  error=6 means Device not
>>> configured.
>>>
>>> I can't see how/why running out of space would cause this.  It looks
>>> more like that you had a hardware issue of some sort happen during the
>>> course of the operations you were running.  It may not have happened
>>> until now because you hadn't utilised writes to that area of the disk
>>> (could have bad sectors there, or physical media/platter problems).
>>>
>>> Please provide smartctl -a output for the drive that's gpt/disk06-live,
>>> which I assume is /dev/ada6 (glabel sure makes correlation easy, doesn't
>>> it?  Sigh...).  Please put the results up on the web somewhere, not
>>> copy-pasted, otherwise I have to do a bunch of manual work with regarsd
>>> to line wrapping/etc...  I'll provide an analysis of SMART stats for
>>> you, to see if anything crazy happened to the disk itself.
>>
>> It is ada0, I'm sure, based on the 'lost device' mentioned in
>> /var/log/messages above.
>>
>> I'm getting nowhere.  /dev/ada0 does not exist so there is nothing
>> for smartctl to work on.
>>
>> [...]
>>
>> $ ls -l /dev/ada0*
>> ls: /dev/ada0*: No such file or directory
>
> Okay, so gpt/disk06-live is /dev/ada0.  (I won't ask why the label is
> called "disk06", but whatever.  :-) )
>
>> I am tempted to reboot or do a camontrol scan.
>
> DO NOT REBOOT.You can try the following -- I'm not sure whether to
> use scbus0 or scbus1 as the argument however, since I don't know what
> scbusX number ada0 was attached to previously.  "dmesg" from when the
> machine booted would show this.
>
> camcontrol reset scbusX
> camcontrol rescan scbusX

I see this in /var/run/dmesg.boot:

ada0 at siisch0 bus 0 scbus0 target 0 lun 0
ada0: <Hitachi HDS722020ALA330 JKAOA28A> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled

$ sudo camcontrol reset scbus0
Password:
Reset of bus 0 was successful

$ sudo camcontrol rescan scbus0
Re-scan of bus 0 was successful

> If the disk comes back, please smartctl -a it.

I didn't come back:

$ ls /dev/ada*
/dev/ada1   /dev/ada2p1 /dev/ada4   /dev/ada5p1 /dev/ada7
/dev/ada1p1 /dev/ada3   /dev/ada4p1 /dev/ada6   /dev/ada8
/dev/ada2   /dev/ada3p1 /dev/ada5   /dev/ada6p1

FYI, there's nothing new in /var/log/messages as a results of those 
commands.


-- 
Dan Langille - http://langille.org/