From owner-freebsd-stable@FreeBSD.ORG Sat Oct 2 23:50:25 2010 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D148A106566C for ; Sat, 2 Oct 2010 23:50:25 +0000 (UTC) (envelope-from jdc@koitsu.dyndns.org) Received: from qmta11.emeryville.ca.mail.comcast.net (qmta11.emeryville.ca.mail.comcast.net [76.96.27.211]) by mx1.freebsd.org (Postfix) with ESMTP id B563B8FC15 for ; Sat, 2 Oct 2010 23:50:25 +0000 (UTC) Received: from omta19.emeryville.ca.mail.comcast.net ([76.96.30.76]) by qmta11.emeryville.ca.mail.comcast.net with comcast id DzpT1f0041eYJf8ABzqRnK; Sat, 02 Oct 2010 23:50:25 +0000 Received: from koitsu.dyndns.org ([98.248.41.155]) by omta19.emeryville.ca.mail.comcast.net with comcast id DzqQ1f00A3LrwQ201zqQqd; Sat, 02 Oct 2010 23:50:25 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id 47C7C9B418; Sat, 2 Oct 2010 16:50:24 -0700 (PDT) Date: Sat, 2 Oct 2010 16:50:24 -0700 From: Jeremy Chadwick To: Dan Langille Message-ID: <20101002235024.GA80643@icarus.home.lan> References: <4CA73702.5080203@langille.org> <20101002141921.GC70283@icarus.home.lan> <4CA7AD95.9040703@langille.org> <20101002223626.GB78136@icarus.home.lan> <4CA7BEE4.9050201@langille.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4CA7BEE4.9050201@langille.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-stable Subject: Re: out of HDD space - zfs degraded X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Oct 2010 23:50:26 -0000 On Sat, Oct 02, 2010 at 07:23:16PM -0400, Dan Langille wrote: > On 10/2/2010 6:36 PM, Jeremy Chadwick wrote: > >On Sat, Oct 02, 2010 at 06:09:25PM -0400, Dan Langille wrote: > >>On 10/2/2010 10:19 AM, Jeremy Chadwick wrote: > >>>On Sat, Oct 02, 2010 at 09:43:30AM -0400, Dan Langille wrote: > >>>>Overnight I was running a zfs send | zfs receive (both within the > >>>>same system / zpool). The system ran out of space, a drive went off > >>>>line, and the system is degraded. > >>>> > >>>>This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18 > >>>>23:43:48 EDT 2010. > >>>> > >>>>The following logs are also available at > >>>>http://www.langille.org/tmp/zfs-space.txt<- no line wrapping > >>>> > >>>>This is what was running: > >>>> > >>>># time zfs send storage/bacula@transfer | mbuffer | zfs receive > >>>>storage/compressed/bacula-mbuffer > >>>>in @ 0.0 kB/s, out @ 0.0 kB/s, 3670 GB total, buffer 100% > >>>>fullcannot receive new filesystem stream: out of space > >>>>mbuffer: error: outputThread: error writing to at offset > >>>>0x395917c4000: Broken pipe > >>>> > >>>>summary: 3670 GByte in 10 h 40 min 97.8 MB/s > >>>>mbuffer: warning: error during output to: Broken pipe > >>>>warning: cannot send 'storage/bacula@transfer': Broken pipe > >>>> > >>>>real 640m48.423s > >>>>user 8m52.660s > >>>>sys 211m40.862s > >>>> > >>>> > >>>>Looking in the logs, I see this: > >>>> > >>>>Oct 2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device > >>>>Oct 2 00:50:54 kraken kernel: siisch0: Timeout on slot 30 > >>>>Oct 2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss > >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 > >>>>Oct 2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT > >>>>Oct 2 00:50:55 kraken kernel: siisch0: Timeout on slot 30 > >>>>Oct 2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss > >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 > >>>>Oct 2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT > >>>>Oct 2 00:50:56 kraken kernel: siisch0: Timeout on slot 30 > >>>>Oct 2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss > >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 > >>>>Oct 2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT > >>>>Oct 2 00:50:57 kraken kernel: siisch0: Timeout on slot 30 > >>>>Oct 2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss > >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 > >>>>Oct 2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT > >>>>Oct 2 00:50:58 kraken kernel: siisch0: Timeout on slot 30 > >>>>Oct 2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss > >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 > >>>>Oct 2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT > >>>>Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage > >>>>path=/dev/gpt/disk06-live offset=270336 size=8192 error=6 > >>>> > >>>>Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize > >>>>cache failed > >>>>Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry > >>>> > >>>>Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage > >>>>path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6 > >>>>Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage > >>>>path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6 > >>>> > >>>>$ zpool status > >>>> pool: storage > >>>> state: DEGRADED > >>>> scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go > >>>>config: > >>>> > >>>> NAME STATE READ WRITE CKSUM > >>>> storage DEGRADED 0 0 0 > >>>> raidz2 DEGRADED 0 0 0 > >>>> gpt/disk01-live ONLINE 0 0 0 > >>>> gpt/disk02-live ONLINE 0 0 0 > >>>> gpt/disk03-live ONLINE 0 0 0 > >>>> gpt/disk04-live ONLINE 0 0 0 > >>>> gpt/disk05-live ONLINE 0 0 0 > >>>> gpt/disk06-live REMOVED 0 0 0 > >>>> gpt/disk07-live ONLINE 0 0 0 > >>>> > >>>>$ zfs list > >>>>NAME USED AVAIL REFER MOUNTPOINT > >>>>storage 6.97T 1.91T 1.75G /storage > >>>>storage/bacula 4.72T 1.91T 4.29T /storage/bacula > >>>>storage/compressed 2.25T 1.91T 46.9K /storage/compressed > >>>>storage/compressed/bacula 2.25T 1.91T 42.7K /storage/compressed/bacula > >>>>storage/pgsql 5.50G 1.91T 5.50G /storage/pgsql > >>>> > >>>>$ sudo camcontrol devlist > >>>>Password: > >>>> at scbus2 target 0 lun 0 (pass1,ada1) > >>>> at scbus3 target 0 lun 0 (pass2,ada2) > >>>> at scbus4 target 0 lun 0 (pass3,ada3) > >>>> at scbus5 target 0 lun 0 (pass4,ada4) > >>>> at scbus6 target 0 lun 0 (pass5,ada5) > >>>> at scbus7 target 0 lun 0 (pass6,ada6) > >>>> at scbus8 target 0 lun 0 (pass7,ada7) > >>>> at scbus9 target 0 lun 0 (cd0,pass8) > >>>> at scbus10 target 0 lun 0 (pass9,ada8) > >>>> > >>>>I'm not yet sure if the drive is fully dead or not. This is not a > >>>>hot-swap box. > >>> > >>>It looks to me like the disk labelled gpt/disk06-live literally stopped > >>>responding to commands. The errors you see are coming from the OS and > >>>the siis(4) controller, and both indicate the actual hard disk isn't > >>>responding to the ATA command READ LOG EXT. error=6 means Device not > >>>configured. > >>> > >>>I can't see how/why running out of space would cause this. It looks > >>>more like that you had a hardware issue of some sort happen during the > >>>course of the operations you were running. It may not have happened > >>>until now because you hadn't utilised writes to that area of the disk > >>>(could have bad sectors there, or physical media/platter problems). > >>> > >>>Please provide smartctl -a output for the drive that's gpt/disk06-live, > >>>which I assume is /dev/ada6 (glabel sure makes correlation easy, doesn't > >>>it? Sigh...). Please put the results up on the web somewhere, not > >>>copy-pasted, otherwise I have to do a bunch of manual work with regarsd > >>>to line wrapping/etc... I'll provide an analysis of SMART stats for > >>>you, to see if anything crazy happened to the disk itself. > >> > >>It is ada0, I'm sure, based on the 'lost device' mentioned in > >>/var/log/messages above. > >> > >>I'm getting nowhere. /dev/ada0 does not exist so there is nothing > >>for smartctl to work on. > >> > >>[...] > >> > >>$ ls -l /dev/ada0* > >>ls: /dev/ada0*: No such file or directory > > > >Okay, so gpt/disk06-live is /dev/ada0. (I won't ask why the label is > >called "disk06", but whatever. :-) ) > > > >>I am tempted to reboot or do a camontrol scan. > > > >DO NOT REBOOT.You can try the following -- I'm not sure whether to > >use scbus0 or scbus1 as the argument however, since I don't know what > >scbusX number ada0 was attached to previously. "dmesg" from when the > >machine booted would show this. > > > >camcontrol reset scbusX > >camcontrol rescan scbusX > > I see this in /var/run/dmesg.boot: > > ada0 at siisch0 bus 0 scbus0 target 0 lun 0 > ada0: ATA-8 SATA 2.x device > ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) > ada0: Command Queueing enabled > > $ sudo camcontrol reset scbus0 > Password: > Reset of bus 0 was successful > > $ sudo camcontrol rescan scbus0 > Re-scan of bus 0 was successful > > >If the disk comes back, please smartctl -a it. > > I didn't come back: > > $ ls /dev/ada* > /dev/ada1 /dev/ada2p1 /dev/ada4 /dev/ada5p1 /dev/ada7 > /dev/ada1p1 /dev/ada3 /dev/ada4p1 /dev/ada6 /dev/ada8 > /dev/ada2 /dev/ada3p1 /dev/ada5 /dev/ada6p1 > > FYI, there's nothing new in /var/log/messages as a results of those > commands. Then I would recommend power-cycling (not rebooting or pressing of the reset button) the machine. There's a good chance the ada0 disk has fallen off the bus and needs a full power-cycle, since a LUN scan didn't result in its reappearance. I see this kind of problem on a weekly basis at my workplace, in 3 different datacenters, with Fujitsu SCSI-3 disks. A system reboot doesn't make the disk reappear on on the bus, nor does a reset (pressing of the reset button). Only a full power-cycle works. And when I say weekly, I'm not exaggerating. I realise your disks are Hitachi not Fujitsu, and are SATA not SCSI, but it really doesn't matter -- there are cases where the drive firmware is wedged so hard that a physical power-cycle is required. If a power-cycle works, smartctl -a /dev/ada0 the disk and save the SMART stats somewhere. If the same disk fails in this way again, I strongly recommend advance RMA'ing it (to ensure you get a completely different disk). Good luck! -- | Jeremy Chadwick jdc@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB |