From: Dave Cundiff <syshackmin@gmail.com>
To: questions@freebsd.org
Date: Tue, 4 Oct 2011 02:43:45 -0400
Subject: ZFS Write Lockup

Hi,

I'm running 8.2-RELEASE and running into an IO lockup on ZFS that is
happening pretty regularly. The system is stock except for the following
set in loader.conf:

vm.kmem_size="30G"
vfs.zfs.arc_max="22G"
kern.hz=100

I know the kmem settings aren't SUPPOSED to be necessary now, but my ZFS
boxes were crashing until I added them. The machine has 24 gigs of RAM.
The kern.hz=100 was to stretch out the L2ARC bug that pops up at 28 days
with it set to 1000.

[root@san2 ~]# zpool status
  pool: san
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        san           ONLINE       0     0     0
          da1         ONLINE       0     0     0
        logs
          mirror      ONLINE       0     0     0
            ad6s1b    ONLINE       0     0     0
            ad14s1b   ONLINE       0     0     0
        cache
          ad6s1d      ONLINE       0     0     0
          ad14s1d     ONLINE       0     0     0

errors: No known data errors

Here's a zpool iostat from a machine in trouble.
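(These are plain zpool iostat samples taken at a short interval, i.e.
something along the lines of:

    zpool iostat san 1

The columns are allocated space, free space, read and write operations,
and read and write bandwidth for the pool.)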
san   9.08T  3.55T      0      0      0  7.92K
san   9.08T  3.55T      0    447      0  5.77M
san   9.08T  3.55T      0    309      0  2.83M
san   9.08T  3.55T      0      0      0      0
san   9.08T  3.55T     62      0  2.22M      0
san   9.08T  3.55T      0      2      0  23.5K
san   9.08T  3.55T      0      0      0      0
san   9.08T  3.55T      0      0      0      0
san   9.08T  3.55T      0    254      0  6.62M
san   9.08T  3.55T      0    249      0  3.16M
san   9.08T  3.55T      0      0      0      0
san   9.08T  3.55T     34      0   491K      0
san   9.08T  3.55T      0      6      0  62.7K
san   9.08T  3.55T      0      0      0      0
san   9.08T  3.55T      0     85      0  6.59M
san   9.08T  3.55T      0      0      0      0
san   9.08T  3.55T      0    452      0  4.88M
san   9.08T  3.55T    109      0  3.12M      0
san   9.08T  3.55T      0      0      0      0
san   9.08T  3.55T      0      0      0  7.84K
san   9.08T  3.55T      0    434      0  6.41M
san   9.08T  3.55T      0      0      0      0
san   9.08T  3.55T      0    304      0  2.90M
san   9.08T  3.55T     37      0   628K      0

It's supposed to look like:

san   9.07T  3.56T    162    167  3.75M  6.09M
san   9.07T  3.56T      5      0  47.4K      0
san   9.07T  3.56T     19      0   213K      0
san   9.07T  3.56T    120      0  3.26M      0
san   9.07T  3.56T     92      0   741K      0
san   9.07T  3.56T    114      0  2.86M      0
san   9.07T  3.56T     72      0   579K      0
san   9.07T  3.56T     14      0   118K      0
san   9.07T  3.56T     24      0   213K      0
san   9.07T  3.56T     25      0   324K      0
san   9.07T  3.56T      8      0   126K      0
san   9.07T  3.56T     28      0   505K      0
san   9.07T  3.56T     15      0   126K      0
san   9.07T  3.56T     11      0   158K      0
san   9.07T  3.56T     19      0   356K      0
san   9.07T  3.56T    198      0  3.55M      0
san   9.07T  3.56T     21      0   173K      0
san   9.07T  3.56T     18      0   150K      0
san   9.07T  3.56T     23      0   260K      0
san   9.07T  3.56T      9      0  78.3K      0
san   9.07T  3.56T     21      0   173K      0
san   9.07T  3.56T      2  4.59K  16.8K   142M
san   9.07T  3.56T     12      0   103K      0
san   9.07T  3.56T     26    454   312K  4.35M
san   9.07T  3.56T    111      0  3.34M      0
san   9.07T  3.56T     28      0   870K      0
san   9.07T  3.56T     75      0  3.88M      0
san   9.07T  3.56T     43      0  1.22M      0
san   9.07T  3.56T     26      0   270K      0

I don't know what triggers the problem, but I know how to fix it. If I
perform a couple of snapshot deletes, the IO comes back in line every
single time (see the P.S. at the bottom for roughly what I run).
Fortunately I have LOTS of snapshots to delete.

[root@san2 ~]# zfs list -r -t snapshot | wc -l
    5236
[root@san2 ~]# zfs list -r -t volume | wc -l
      17

Being fairly new to FreeBSD and ZFS, I'm pretty clueless on where to begin
tracking this down. I've been staring at gstat trying to see if a zvol is
getting a big burst of writes that may be flooding the drive controller,
but I haven't caught anything yet. top -S -H shows zio_write_issue threads
consuming massive amounts of CPU during the lockup; normally they sit
around 5-10%.

Any suggestions on where I could start to track this down would be greatly
appreciated.

Thanks,

--
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com
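P.S. In case the exact commands matter, this is roughly what I run to
clear a lockup. It's only a sketch; the dataset and snapshot names below
are placeholders, not my real ones:

[root@san2 ~]# zfs list -r -t snapshot -o name -s creation san | head
[root@san2 ~]# zfs destroy san/somevol@old-snapshot-1
[root@san2 ~]# zfs destroy san/somevol@old-snapshot-2

A couple of destroys like that is all it takes and writes start flowing
again.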