From owner-freebsd-stable@FreeBSD.ORG Tue Oct 15 04:01:12 2013
Message-ID: <21084.48646.196295.776944@hergotha.csail.mit.edu>
Date: Tue, 15 Oct 2013 00:01:10 -0400
From: Garrett Wollman
To: freebsd-stable@freebsd.org
Subject: How to unstick ZFS resilver?
Cc: freebsd-fs@freebsd.org

I have a large (88-drive) zpool in which a drive was recently replaced.
(The pool has a bunch of duff Toshiba MK2001TRKB drives -- never ever pay
money for these! -- and I'm trying to replace them one by one before they
fail completely.) The resilver on the first drive replacement has been
taking much, much too long, and currently it's stuck in this state:

  pool: export
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Oct  9 14:54:47 2013
        86.5T scanned out of 86.8T at 1/s, (scan is slow, no estimated time)
        982G resilvered, 99.62% done

The overall progress hasn't changed in twelve hours, even across a
reboot, and the server is fairly lightly loaded. Searching the Web is no
help; can anyone suggest a remedial action? (This is on 9.1-RELEASE, with
our local patches, and all the drives are SAS.)

In exchange, I offer the following DTrace script, which I used to
identify the slow SAS drives:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option dynvarsize=2m

inline int TOO_SLOW = 100000000;    /* 100 ms, in nanoseconds */

dtrace:::BEGIN
{
    printf("Tracing... Hit Ctrl-C to end.\n");
}

/* Record the time at which each request is handed to the da driver. */
fbt::dastrategy:entry
{
    start_time[(struct buf *)arg0] = timestamp;
}

/* On completion, count requests that took longer than TOO_SLOW, per drive. */
fbt::dadone:entry
/(this->bp = (struct buf *)args[1]->ccb_h.periph_priv.entries[1].ptr)
 && start_time[this->bp] && (timestamp - start_time[this->bp]) > TOO_SLOW/
{
    @[strjoin("da", lltostr(args[0]->unit_number))] = count();
    start_time[this->bp] = 0;
}

-GAWollman
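
The stall described above (a scan line that does not move for twelve hours) can be checked mechanically by comparing two captures of the status output. What follows is only a minimal Python sketch, assuming the scan line keeps the "X scanned out of Y" form quoted in the message and that the size suffixes are powers of 1024. The names scan_progress and is_stalled are hypothetical helpers, not part of any ZFS tool.

```python
import re

# Assumption: zpool's human-readable sizes use binary (power-of-1024) units.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

# Matches e.g. "86.5T scanned out of 86.8T" as quoted in the message.
SCAN_RE = re.compile(r"([\d.]+)([KMGTP])\s+scanned out of\s+([\d.]+)([KMGTP])")


def to_bytes(value, unit):
    """Convert a size such as ("86.5", "T") to an approximate byte count."""
    return float(value) * UNITS[unit]


def scan_progress(status_text):
    """Return (scanned, total, remaining, percent) from status output text."""
    m = SCAN_RE.search(status_text)
    if m is None:
        raise ValueError("no 'scanned out of' line found")
    scanned = to_bytes(m.group(1), m.group(2))
    total = to_bytes(m.group(3), m.group(4))
    return scanned, total, total - scanned, 100.0 * scanned / total


def is_stalled(earlier_text, later_text):
    """True if the scanned amount did not grow between two captures."""
    return scan_progress(later_text)[0] <= scan_progress(earlier_text)[0]


if __name__ == "__main__":
    sample = "86.5T scanned out of 86.8T at 1/s, (scan is slow, no estimated time)"
    scanned, total, remaining, pct = scan_progress(sample)
    print(f"remaining: {remaining / UNITS['G']:.0f}G, {pct:.2f}% scanned")
    print("stalled:", is_stalled(sample, sample))
```

Because the status output rounds sizes to three significant digits, the comparison only notices progress of roughly 0.1T at this pool size, so captures should be taken hours apart rather than minutes.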