From owner-freebsd-fs@FreeBSD.ORG Sun Jul 14 07:55:15 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 92F6C112 for ; Sun, 14 Jul 2013 07:55:15 +0000 (UTC) (envelope-from zbeeble@gmail.com) Received: from mail-vc0-x231.google.com (mail-vc0-x231.google.com [IPv6:2607:f8b0:400c:c03::231]) by mx1.freebsd.org (Postfix) with ESMTP id 5ACBB8A4 for ; Sun, 14 Jul 2013 07:55:15 +0000 (UTC) Received: by mail-vc0-f177.google.com with SMTP id hv10so8526306vcb.22 for ; Sun, 14 Jul 2013 00:55:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=du1WtNY51RXPEOaCCtF1o5QrMpg3h3hi0MF/r3rLfxY=; b=0YFtDsoq/PJfNm6CwLPoz5QUGH5VdV2c2l6//hFa3vpjxBcQUYT/1IAKbsx8Zq+VpQ m6DOtW8HUhRiaD6ocOkT2lVSbnNCWOweNV8vt+O6UTy9qyYM5QsQZFFTjLvbrlfbdaGx w20kZlmv+epZsiUk62BR73mOXLM9f7eJJMrlf1Wxikg/0psRBnM1ML2gJnNC178SCO3K FWJxh0leYIprnKXZQ0hNNS7XpjTDzcXDiAQNEddG2oO+T8IdjI/iO1LSiBPztXBVRZCE /zmTaLz4oSfthhhie0nY+vnwiQVnraixhpl6DTfVvijG92vxXyjNLx/xa6SfnB9BGTzD 4ZrQ== MIME-Version: 1.0 X-Received: by 10.220.168.141 with SMTP id u13mr26953401vcy.23.1373788514638; Sun, 14 Jul 2013 00:55:14 -0700 (PDT) Received: by 10.221.22.199 with HTTP; Sun, 14 Jul 2013 00:55:14 -0700 (PDT) Date: Sun, 14 Jul 2013 03:55:14 -0400 Message-ID: Subject: Efficiency of ZFS ZVOLs. From: Zaphod Beeblebrox To: freebsd-fs Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 14 Jul 2013 07:55:15 -0000 I have a ZFS pool that consists of 9 1.5T drives (Z1) and 8 2T drives (Z1). I know this is not exactly recommended, but this is more a home machine that provides some backup space rather than a production machine --- and thus it gets what it gets. Anyways... a typical filesystem looks like: [1:7:307]root@virtual:~> zfs list vr2/tmp NAME USED AVAIL REFER MOUNTPOINT vr2/tmp 74.3G 7.31T 74.3G /vr2/tmp ... that is "tmp" uses 74.3G and the whole mess has 7.31T available. If tmp had children, "USED" could be larger than "REFER" because the children account for the rest Now... consider: [1:3:303]root@virtual:~> zfs list -rt all vr2/Steam NAME USED AVAIL REFER MOUNTPOINT vr2/Steam 3.25T 9.27T 1.18T - vr2/Steam@20130528-0029 255M - 1.18T - vr2/Steam@20130529-0221 172M - 1.18T - vr2/Steam is a ZVOL exported by iSCSI to my desktop and it contains an NTFS filesystem which is mounted into C:\Program Files (x86)\Steam. Windows sees this drive as a 1.99T drive of which 1.02T is used. Now... the value of "REFER" seems quite right: 1.18T vs. 1.02T is pretty good... but the value of "USED" seems _way_ out. 3.25T ... even regarding that more of the disk might have been "touched" (ie: used from the ZVOL's impression) than is used, it seems too large. Neither is it 1.18T + 255M + 172M. Now... I understand that the smallest effective "block" is 7x512 or 8x512 (depending on which part of the disk is in play) --- but does that really account for it? A quick google check says that NTFS uses a default cluster of 4096 (or larger). Is there a fundamental inefficiency in the way ZVOLs are stored on wide (or wide-ish) RAID stripes? 
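[A hedged aside, not part of the original message: before blaming RAID-Z padding, ZFS's own space breakdown will show whether the extra USED is the volume itself, the snapshots, or a reservation, and it is the zvol's volblocksize (not the 4096-byte NTFS cluster) that matters for any parity/padding arithmetic. Against the dataset name above:

$ zfs list -o space vr2/Steam
$ zfs get volsize,volblocksize,refreservation vr2/Steam

The USEDREFRESERV column from the first command is space that is merely reserved, not actually written.]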
From owner-freebsd-fs@FreeBSD.ORG Sun Jul 14 08:50:20 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 958D76E3 for ; Sun, 14 Jul 2013 08:50:20 +0000 (UTC) (envelope-from nowakpl@platinum.linux.pl) Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4]) by mx1.freebsd.org (Postfix) with ESMTP id 5B2799BF for ; Sun, 14 Jul 2013 08:50:18 +0000 (UTC) Received: by platinum.linux.pl (Postfix, from userid 87) id 41DEA2BC19D; Sun, 14 Jul 2013 10:50:11 +0200 (CEST) X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl X-Spam-Level: X-Spam-Status: No, score=-1.3 required=3.0 tests=ALL_TRUSTED,AWL autolearn=disabled version=3.3.2 Received: from [10.255.0.2] (unknown [83.151.38.73]) by platinum.linux.pl (Postfix) with ESMTPA id 0F1802BC196 for ; Sun, 14 Jul 2013 10:50:11 +0200 (CEST) Message-ID: <51E26632.8030907@platinum.linux.pl> Date: Sun, 14 Jul 2013 10:49:54 +0200 From: Adam Nowacki User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130620 Thunderbird/17.0.7 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Efficiency of ZFS ZVOLs. References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 14 Jul 2013 08:50:20 -0000 On 2013-07-14 09:55, Zaphod Beeblebrox wrote: > [1:3:303]root@virtual:~> zfs list -rt all vr2/Steam > NAME USED AVAIL REFER MOUNTPOINT > vr2/Steam 3.25T 9.27T 1.18T - > vr2/Steam@20130528-0029 255M - 1.18T - > vr2/Steam@20130529-0221 172M - 1.18T - > > vr2/Steam is a ZVOL exported by iSCSI to my desktop and it contains an NTFS > filesystem which is mounted into C:\Program Files (x86)\Steam. Windows > sees this drive as a 1.99T drive of which 1.02T is used. > > Now... the value of "REFER" seems quite right: 1.18T vs. 1.02T is pretty > good... but the value of "USED" seems _way_ out. 3.25T ... even regarding > that more of the disk might have been "touched" (ie: used from the ZVOL's > impression) than is used, it seems too large. Neither is it 1.18T + 255M + > 172M. This is how much space would be required to store the snapshots plus 2TB volume with no shared blocks between any of the snapshots. 1.18T from snapshots + 2T reservation = 3.18T, just about the 3.25T displayed. You can remove the reservation with 'zfs set refreservation=none vr2/Steam'. 
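[A sketch of Adam's suggestion in full; the property check on the first line is my addition, the set command is the one quoted above. Note that without a refreservation the zvol becomes thin-provisioned, so writes to it can fail with "out of space" if the pool ever fills up.

$ zfs get used,usedbydataset,usedbysnapshots,usedbyrefreservation vr2/Steam
$ zfs set refreservation=none vr2/Steam
$ zfs list vr2/Steam      (USED should now drop to roughly REFER plus the snapshot space)]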
From owner-freebsd-fs@FreeBSD.ORG Mon Jul 15 09:51:11 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id EA5EB70; Mon, 15 Jul 2013 09:51:11 +0000 (UTC) (envelope-from godders@gmail.com) Received: from mail-qc0-x22d.google.com (mail-qc0-x22d.google.com [IPv6:2607:f8b0:400d:c01::22d]) by mx1.freebsd.org (Postfix) with ESMTP id A2C1BC24; Mon, 15 Jul 2013 09:51:11 +0000 (UTC) Received: by mail-qc0-f173.google.com with SMTP id l10so6170420qcy.18 for ; Mon, 15 Jul 2013 02:51:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=bqPWOLlgO2DSqvvPF54slTA6ermS9eEi494GtoekWZk=; b=TXKAifsNQUfyW+hlWGlP0+FtnICKhuLSK4B28smJzaQOddkP0gf1Iy3EmBu1qKmXn7 xaKUgG50YbHv8Vx8Fd2c5AlL/srrlVBBGrPyy74NE9UYjI7wo6JIEl9FotSVIKimo96U tbwMXK5OecU8MqxPs+uiZ14J9y21CXUk6vmIZIX/j2ybZi67TSK22r4a50ARQbbPo5uS VNIDaUadTBJgwMbOz4m4obnVuDccs7irErHt3VOp2wmJUbBug8scRm5Qw0r9CdWv1Cvz qHMELBusYb4x/+AC0cF5ZkQ4EGz3YMMT3OUjzhNHT7hOZNferAtOS7S7dpWehROzKWvH 7p3w== MIME-Version: 1.0 X-Received: by 10.49.85.4 with SMTP id d4mr50212452qez.10.1373881870442; Mon, 15 Jul 2013 02:51:10 -0700 (PDT) Received: by 10.49.52.65 with HTTP; Mon, 15 Jul 2013 02:51:10 -0700 (PDT) In-Reply-To: <201306110017.r5B0HFct074482@chez.mckusick.com> References: <51B5A277.2060904@FreeBSD.org> <201306110017.r5B0HFct074482@chez.mckusick.com> Date: Mon, 15 Jul 2013 10:51:10 +0100 Message-ID: Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?) From: Dan Thomas To: Kirk McKusick Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-fs@freebsd.org, Palle Girgensohn , Jeff Roberson , Julian Akehurst X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Jul 2013 09:51:12 -0000 On 11 June 2013 01:17, Kirk McKusick wrote: > OK, good to have it narrowed down. I will look to devise some > additional diagnostics that hopefully will help tease out the > bug. I'll hopefully get back to you soon. Hi, Is there any news on this issue? We're still running several servers that are exhibiting this problem (most recently, one that seems to be leaking around 10gb/hour), and it's getting to the point where we're looking at moving to a different OS until it's resolved. We have access to several production systems with this problem and (at least from time to time) will have systems with a significant leak on them that we can experiment with. Is there any way we can assist with tracking this down? Any diagnostics or testing that would be useful? 
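[An aside, not something asked for in the thread: a rough way to put a number on the leak while waiting for proper diagnostics is the gap between df and du on the affected filesystem, since space held by unreferenced inodes is counted by the former but is unreachable by the latter. The path below is only a placeholder.

$ df -k /path/to/affected/fs
$ du -skx /path/to/affected/fs]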
Thanks, Dan From owner-freebsd-fs@FreeBSD.ORG Mon Jul 15 10:44:33 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 83230AA8 for ; Mon, 15 Jul 2013 10:44:33 +0000 (UTC) (envelope-from thomas@gibfest.dk) Received: from mail.tyknet.dk (mail.tyknet.dk [176.9.9.186]) by mx1.freebsd.org (Postfix) with ESMTP id 45C6EE71 for ; Mon, 15 Jul 2013 10:44:33 +0000 (UTC) Received: from [10.10.1.214] (217.71.4.82.static.router4.bolignet.dk [217.71.4.82]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mail.tyknet.dk (Postfix) with ESMTPSA id 60B0B15FA0F for ; Mon, 15 Jul 2013 12:36:59 +0200 (CEST) X-DKIM: OpenDKIM Filter v2.5.2 mail.tyknet.dk 60B0B15FA0F DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=gibfest.dk; s=default; t=1373884619; bh=jd3CNUoCqlon3YHs4YejpnVqhZlmmBrQMGF0O04lUUI=; h=Date:From:To:Subject:References:In-Reply-To; b=Hkelfq0gPNbNKFCbY99cOT570mEqswhubpijN8wn2VjFlCnnUp7yaEC9Cp70roQMn zF9Oeid6CKIPa9Orz8AKuVlWywlbdjcyKclUGS1CYAIFhdoXpz9e7kQiqssZRJ8qQU on/0A7rR3buMMayTRTyCiZ81LyWaeEPd1iofyeG8= Message-ID: <51E3D0C3.9020205@gibfest.dk> Date: Mon, 15 Jul 2013 12:36:51 +0200 From: Thomas Steen Rasmussen User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130620 Thunderbird/17.0.7 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Reproducible ZFS jailed dataset panic after upgrading to latest 9-stable References: <51C97EAF.3000901@gibfest.dk> In-Reply-To: <51C97EAF.3000901@gibfest.dk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Jul 2013 10:44:33 -0000 On 25-06-2013 13:27, Thomas Steen Rasmussen wrote: > Hello, > > To fix the mmap vulnerability I've upgraded one of my jail hosts from: > "FreeBSD 9.1-STABLE #1: Sun Mar 17 08:48:35 UTC 2013" > to: > "FreeBSD 9.1-STABLE #3: Tue Jun 18 12:49:39 UTC 2013" > > One of the jails on this machine has a jailed zfs dataset: > > $ zfs get jailed gelipool/backups > NAME PROPERTY VALUE SOURCE > gelipool/backups jailed on local > $ > > After the upgrade, when I start the jail, the machine panics. > > This is a remote zfs-only machine with swap on zfs, so far I have > been unable to get a proper coredump. I have access to the > console of the machine, and I have taken a couple of screenshots: > > http://imgur.com/2V0PBlf and http://imgur.com/OopP9Sp > > Any ideas what might have caused this ? It worked great before the > upgrade to latest 9-STABLE. This is a production server, but I am > willing to try any suggestions to get it working again. > Hello all, I just wanted to confirm that since the MFC in r252524 this has been fixed in stable/9: http://svnweb.freebsd.org/base?view=revision&revision=252524 Thanks! 
Thomas Steen Rasmussen From owner-freebsd-fs@FreeBSD.ORG Mon Jul 15 11:06:42 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id D7A16F41 for ; Mon, 15 Jul 2013 11:06:42 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id CAD8DFBB for ; Mon, 15 Jul 2013 11:06:42 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r6FB6gkf084410 for ; Mon, 15 Jul 2013 11:06:42 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r6FB6gfT084408 for freebsd-fs@FreeBSD.org; Mon, 15 Jul 2013 11:06:42 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 15 Jul 2013 11:06:42 GMT Message-Id: <201307151106.r6FB6gfT084408@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-fs@FreeBSD.org Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Jul 2013 11:06:43 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/180438 fs [smbfs] [patch] mount_smbfs fails on arm because of wr p kern/180236 fs [zfs] [nullfs] Leakage free space using ZFS with nullf o kern/178854 fs [ufs] FreeBSD kernel crash in UFS o kern/178713 fs [nfs] [patch] Correct WebNFS support in NFS server and o kern/178412 fs [smbfs] Coredump when smbfs mounted o kern/178388 fs [zfs] [patch] allow up to 8MB recordsize o kern/178349 fs [zfs] zfs scrub on deduped data could be much less see o kern/178329 fs [zfs] extended attributes leak o kern/178238 fs [nullfs] nullfs don't release i-nodes on unlink. 
f kern/178231 fs [nfs] 8.3 nfsv4 client reports "nfsv4 client/server pr o kern/178103 fs [kernel] [nfs] [patch] Correct support of index files o kern/177985 fs [zfs] disk usage problem when copying from one zfs dat o kern/177971 fs [nfs] FreeBSD 9.1 nfs client dirlist problem w/ nfsv3, o kern/177966 fs [zfs] resilver completes but subsequent scrub reports o kern/177658 fs [ufs] FreeBSD panics after get full filesystem with uf o kern/177536 fs [zfs] zfs livelock (deadlock) with high write-to-disk o kern/177445 fs [hast] HAST panic o kern/177240 fs [zfs] zpool import failed with state UNAVAIL but all d o kern/176978 fs [zfs] [panic] zfs send -D causes "panic: System call i o kern/176857 fs [softupdates] [panic] 9.1-RELEASE/amd64/GENERIC panic o bin/176253 fs zpool(8): zfs pool indentation is misleading/wrong o kern/176141 fs [zfs] sharesmb=on makes errors for sharenfs, and still o kern/175950 fs [zfs] Possible deadlock in zfs after long uptime o kern/175897 fs [zfs] operations on readonly zpool hang o kern/175449 fs [unionfs] unionfs and devfs misbehaviour o kern/175179 fs [zfs] ZFS may attach wrong device on move o kern/175071 fs [ufs] [panic] softdep_deallocate_dependencies: unrecov o kern/174372 fs [zfs] Pagefault appears to be related to ZFS o kern/174315 fs [zfs] chflags uchg not supported o kern/174310 fs [zfs] root point mounting broken on CURRENT with multi o kern/174279 fs [ufs] UFS2-SU+J journal and filesystem corruption o kern/173830 fs [zfs] Brain-dead simple change to ZFS error descriptio o kern/173718 fs [zfs] phantom directory in zraid2 pool f kern/173657 fs [nfs] strange UID map with nfsuserd o kern/173363 fs [zfs] [panic] Panic on 'zpool replace' on readonly poo o kern/173136 fs [unionfs] mounting above the NFS read-only share panic o kern/172942 fs [smbfs] Unmounting a smb mount when the server became o kern/172348 fs [unionfs] umount -f of filesystem in use with readonly o kern/172334 fs [unionfs] unionfs permits recursive union mounts; caus o kern/171626 fs [tmpfs] tmpfs should be noisier when the requested siz o kern/171415 fs [zfs] zfs recv fails with "cannot receive incremental o kern/170945 fs [gpt] disk layout not portable between direct connect o bin/170778 fs [zfs] [panic] FreeBSD panics randomly o kern/170680 fs [nfs] Multiple NFS Client bug in the FreeBSD 7.4-RELEA o kern/170497 fs [xfs][panic] kernel will panic whenever I ls a mounted o kern/169945 fs [zfs] [panic] Kernel panic while importing zpool (afte o kern/169480 fs [zfs] ZFS stalls on heavy I/O o kern/169398 fs [zfs] Can't remove file with permanent error o kern/169339 fs panic while " : > /etc/123" o kern/169319 fs [zfs] zfs resilver can't complete o kern/168947 fs [nfs] [zfs] .zfs/snapshot directory is messed up when o kern/168942 fs [nfs] [hang] nfsd hangs after being restarted (not -HU o kern/168158 fs [zfs] incorrect parsing of sharenfs options in zfs (fs o kern/167979 fs [ufs] DIOCGDINFO ioctl does not work on 8.2 file syste o kern/167977 fs [smbfs] mount_smbfs results are differ when utf-8 or U o kern/167688 fs [fusefs] Incorrect signal handling with direct_io o kern/167685 fs [zfs] ZFS on USB drive prevents shutdown / reboot o kern/167612 fs [portalfs] The portal file system gets stuck inside po o kern/167272 fs [zfs] ZFS Disks reordering causes ZFS to pick the wron o kern/167260 fs [msdosfs] msdosfs disk was mounted the second time whe o kern/167109 fs [zfs] [panic] zfs diff kernel panic Fatal trap 9: gene o kern/167105 fs [nfs] mount_nfs can not handle source exports wiht mor o kern/167067 fs 
[zfs] [panic] ZFS panics the server o kern/167065 fs [zfs] boot fails when a spare is the boot disk o kern/167048 fs [nfs] [patch] RELEASE-9 crash when using ZFS+NULLFS+NF o kern/166912 fs [ufs] [panic] Panic after converting Softupdates to jo o kern/166851 fs [zfs] [hang] Copying directory from the mounted UFS di o kern/166477 fs [nfs] NFS data corruption. o kern/165950 fs [ffs] SU+J and fsck problem o kern/165521 fs [zfs] [hang] livelock on 1 Gig of RAM with zfs when 31 o kern/165392 fs Multiple mkdir/rmdir fails with errno 31 o kern/165087 fs [unionfs] lock violation in unionfs o kern/164472 fs [ufs] fsck -B panics on particular data inconsistency o kern/164370 fs [zfs] zfs destroy for snapshot fails on i386 and sparc o kern/164261 fs [nullfs] [patch] fix panic with NFS served from NULLFS o kern/164256 fs [zfs] device entry for volume is not created after zfs o kern/164184 fs [ufs] [panic] Kernel panic with ufs_makeinode o kern/163801 fs [md] [request] allow mfsBSD legacy installed in 'swap' o kern/163770 fs [zfs] [hang] LOR between zfs&syncer + vnlru leading to o kern/163501 fs [nfs] NFS exporting a dir and a subdir in that dir to o kern/162944 fs [coda] Coda file system module looks broken in 9.0 o kern/162860 fs [zfs] Cannot share ZFS filesystem to hosts with a hyph o kern/162751 fs [zfs] [panic] kernel panics during file operations o kern/162591 fs [nullfs] cross-filesystem nullfs does not work as expe o kern/162519 fs [zfs] "zpool import" relies on buggy realpath() behavi o kern/161968 fs [zfs] [hang] renaming snapshot with -r including a zvo o kern/161864 fs [ufs] removing journaling from UFS partition fails on o kern/161579 fs [smbfs] FreeBSD sometimes panics when an smb share is o kern/161533 fs [zfs] [panic] zfs receive panic: system ioctl returnin o kern/161438 fs [zfs] [panic] recursed on non-recursive spa_namespace_ o kern/161424 fs [nullfs] __getcwd() calls fail when used on nullfs mou o kern/161280 fs [zfs] Stack overflow in gptzfsboot o kern/161205 fs [nfs] [pfsync] [regression] [build] Bug report freebsd o kern/161169 fs [zfs] [panic] ZFS causes kernel panic in dbuf_dirty o kern/161112 fs [ufs] [lor] filesystem LOR in FreeBSD 9.0-BETA3 o kern/160893 fs [zfs] [panic] 9.0-BETA2 kernel panic f kern/160860 fs [ufs] Random UFS root filesystem corruption with SU+J o kern/160801 fs [zfs] zfsboot on 8.2-RELEASE fails to boot from root-o o kern/160790 fs [fusefs] [panic] VPUTX: negative ref count with FUSE o kern/160777 fs [zfs] [hang] RAID-Z3 causes fatal hang upon scrub/impo o kern/160706 fs [zfs] zfs bootloader fails when a non-root vdev exists o kern/160591 fs [zfs] Fail to boot on zfs root with degraded raidz2 [r o kern/160410 fs [smbfs] [hang] smbfs hangs when transferring large fil o kern/160283 fs [zfs] [patch] 'zfs list' does abort in make_dataset_ha o kern/159930 fs [ufs] [panic] kernel core o kern/159402 fs [zfs][loader] symlinks cause I/O errors o kern/159357 fs [zfs] ZFS MAXNAMELEN macro has confusing name (off-by- o kern/159356 fs [zfs] [patch] ZFS NAME_ERR_DISKLIKE check is Solaris-s o kern/159351 fs [nfs] [patch] - divide by zero in mountnfs() o kern/159251 fs [zfs] [request]: add FLETCHER4 as DEDUP hash option o kern/159077 fs [zfs] Can't cd .. with latest zfs version o kern/159048 fs [smbfs] smb mount corrupts large files o kern/159045 fs [zfs] [hang] ZFS scrub freezes system o kern/158839 fs [zfs] ZFS Bootloader Fails if there is a Dead Disk o kern/158802 fs amd(8) ICMP storm and unkillable process. 
o kern/158231 fs [nullfs] panic on unmounting nullfs mounted over ufs o f kern/157929 fs [nfs] NFS slow read o kern/157399 fs [zfs] trouble with: mdconfig force delete && zfs strip o kern/157179 fs [zfs] zfs/dbuf.c: panic: solaris assert: arc_buf_remov o kern/156797 fs [zfs] [panic] Double panic with FreeBSD 9-CURRENT and o kern/156781 fs [zfs] zfs is losing the snapshot directory, p kern/156545 fs [ufs] mv could break UFS on SMP systems o kern/156193 fs [ufs] [hang] UFS snapshot hangs && deadlocks processes o kern/156039 fs [nullfs] [unionfs] nullfs + unionfs do not compose, re o kern/155615 fs [zfs] zfs v28 broken on sparc64 -current o kern/155587 fs [zfs] [panic] kernel panic with zfs p kern/155411 fs [regression] [8.2-release] [tmpfs]: mount: tmpfs : No o kern/155199 fs [ext2fs] ext3fs mounted as ext2fs gives I/O errors o bin/155104 fs [zfs][patch] use /dev prefix by default when importing o kern/154930 fs [zfs] cannot delete/unlink file from full volume -> EN o kern/154828 fs [msdosfs] Unable to create directories on external USB o kern/154491 fs [smbfs] smb_co_lock: recursive lock for object 1 p kern/154228 fs [md] md getting stuck in wdrain state o kern/153996 fs [zfs] zfs root mount error while kernel is not located o kern/153753 fs [zfs] ZFS v15 - grammatical error when attempting to u o kern/153716 fs [zfs] zpool scrub time remaining is incorrect o kern/153695 fs [patch] [zfs] Booting from zpool created on 4k-sector o kern/153680 fs [xfs] 8.1 failing to mount XFS partitions o kern/153418 fs [zfs] [panic] Kernel Panic occurred writing to zfs vol o kern/153351 fs [zfs] locking directories/files in ZFS o bin/153258 fs [patch][zfs] creating ZVOLs requires `refreservation' s kern/153173 fs [zfs] booting from a gzip-compressed dataset doesn't w o bin/153142 fs [zfs] ls -l outputs `ls: ./.zfs: Operation not support o kern/153126 fs [zfs] vdev failure, zpool=peegel type=vdev.too_small o kern/152022 fs [nfs] nfs service hangs with linux client [regression] o kern/151942 fs [zfs] panic during ls(1) zfs snapshot directory o kern/151905 fs [zfs] page fault under load in /sbin/zfs o bin/151713 fs [patch] Bug in growfs(8) with respect to 32-bit overfl o kern/151648 fs [zfs] disk wait bug o kern/151629 fs [fs] [patch] Skip empty directory entries during name o kern/151330 fs [zfs] will unshare all zfs filesystem after execute a o kern/151326 fs [nfs] nfs exports fail if netgroups contain duplicate o kern/151251 fs [ufs] Can not create files on filesystem with heavy us o kern/151226 fs [zfs] can't delete zfs snapshot o kern/150503 fs [zfs] ZFS disks are UNAVAIL and corrupted after reboot o kern/150501 fs [zfs] ZFS vdev failure vdev.bad_label on amd64 o kern/150390 fs [zfs] zfs deadlock when arcmsr reports drive faulted o kern/150336 fs [nfs] mountd/nfsd became confused; refused to reload n o kern/149208 fs mksnap_ffs(8) hang/deadlock o kern/149173 fs [patch] [zfs] make OpenSolaris installa o kern/149015 fs [zfs] [patch] misc fixes for ZFS code to build on Glib o kern/149014 fs [zfs] [patch] declarations in ZFS libraries/utilities o kern/149013 fs [zfs] [patch] make ZFS makefiles use the libraries fro o kern/148504 fs [zfs] ZFS' zpool does not allow replacing drives to be o kern/148490 fs [zfs]: zpool attach - resilver bidirectionally, and re o kern/148368 fs [zfs] ZFS hanging forever on 8.1-PRERELEASE o kern/148138 fs [zfs] zfs raidz pool commands freeze o kern/147903 fs [zfs] [panic] Kernel panics on faulty zfs device o kern/147881 fs [zfs] [patch] ZFS "sharenfs" doesn't allow different " o 
kern/147420 fs [ufs] [panic] ufs_dirbad, nullfs, jail panic (corrupt o kern/146941 fs [zfs] [panic] Kernel Double Fault - Happens constantly o kern/146786 fs [zfs] zpool import hangs with checksum errors o kern/146708 fs [ufs] [panic] Kernel panic in softdep_disk_write_compl o kern/146528 fs [zfs] Severe memory leak in ZFS on i386 o kern/146502 fs [nfs] FreeBSD 8 NFS Client Connection to Server o kern/145750 fs [unionfs] [hang] unionfs locks the machine s kern/145712 fs [zfs] cannot offline two drives in a raidz2 configurat o kern/145411 fs [xfs] [panic] Kernel panics shortly after mounting an f bin/145309 fs bsdlabel: Editing disk label invalidates the whole dev o kern/145272 fs [zfs] [panic] Panic during boot when accessing zfs on o kern/145246 fs [ufs] dirhash in 7.3 gratuitously frees hashes when it o kern/145238 fs [zfs] [panic] kernel panic on zpool clear tank o kern/145229 fs [zfs] Vast differences in ZFS ARC behavior between 8.0 o kern/145189 fs [nfs] nfsd performs abysmally under load o kern/144929 fs [ufs] [lor] vfs_bio.c + ufs_dirhash.c p kern/144447 fs [zfs] sharenfs fsunshare() & fsshare_main() non functi o kern/144416 fs [panic] Kernel panic on online filesystem optimization s kern/144415 fs [zfs] [panic] kernel panics on boot after zfs crash o kern/144234 fs [zfs] Cannot boot machine with recent gptzfsboot code o kern/143825 fs [nfs] [panic] Kernel panic on NFS client o bin/143572 fs [zfs] zpool(1): [patch] The verbose output from iostat o kern/143212 fs [nfs] NFSv4 client strange work ... o kern/143184 fs [zfs] [lor] zfs/bufwait LOR o kern/142878 fs [zfs] [vfs] lock order reversal o kern/142597 fs [ext2fs] ext2fs does not work on filesystems with real o kern/142489 fs [zfs] [lor] allproc/zfs LOR o kern/142466 fs Update 7.2 -> 8.0 on Raid 1 ends with screwed raid [re o kern/142306 fs [zfs] [panic] ZFS drive (from OSX Leopard) causes two o kern/142068 fs [ufs] BSD labels are got deleted spontaneously o kern/141950 fs [unionfs] [lor] ufs/unionfs/ufs Lock order reversal o kern/141897 fs [msdosfs] [panic] Kernel panic. 
msdofs: file name leng o kern/141463 fs [nfs] [panic] Frequent kernel panics after upgrade fro o kern/141091 fs [patch] [nullfs] fix panics with DIAGNOSTIC enabled o kern/141086 fs [nfs] [panic] panic("nfs: bioread, not dir") on FreeBS o kern/141010 fs [zfs] "zfs scrub" fails when backed by files in UFS2 o kern/140888 fs [zfs] boot fail from zfs root while the pool resilveri o kern/140661 fs [zfs] [patch] /boot/loader fails to work on a GPT/ZFS- o kern/140640 fs [zfs] snapshot crash o kern/140068 fs [smbfs] [patch] smbfs does not allow semicolon in file o kern/139725 fs [zfs] zdb(1) dumps core on i386 when examining zpool c o kern/139715 fs [zfs] vfs.numvnodes leak on busy zfs p bin/139651 fs [nfs] mount(8): read-only remount of NFS volume does n o kern/139407 fs [smbfs] [panic] smb mount causes system crash if remot o kern/138662 fs [panic] ffs_blkfree: freeing free block o kern/138421 fs [ufs] [patch] remove UFS label limitations o kern/138202 fs mount_msdosfs(1) see only 2Gb o kern/137588 fs [unionfs] [lor] LOR nfs/ufs/nfs o kern/136968 fs [ufs] [lor] ufs/bufwait/ufs (open) o kern/136945 fs [ufs] [lor] filedesc structure/ufs (poll) o kern/136944 fs [ffs] [lor] bufwait/snaplk (fsync) o kern/136873 fs [ntfs] Missing directories/files on NTFS volume o kern/136865 fs [nfs] [patch] NFS exports atomic and on-the-fly atomic p kern/136470 fs [nfs] Cannot mount / in read-only, over NFS o kern/135546 fs [zfs] zfs.ko module doesn't ignore zpool.cache filenam o kern/135469 fs [ufs] [panic] kernel crash on md operation in ufs_dirb o kern/135050 fs [zfs] ZFS clears/hides disk errors on reboot o kern/134491 fs [zfs] Hot spares are rather cold... o kern/133676 fs [smbfs] [panic] umount -f'ing a vnode-based memory dis p kern/133174 fs [msdosfs] [patch] msdosfs must support multibyte inter o kern/132960 fs [ufs] [panic] panic:ffs_blkfree: freeing free frag o kern/132397 fs reboot causes filesystem corruption (failure to sync b o kern/132331 fs [ufs] [lor] LOR ufs and syncer o kern/132237 fs [msdosfs] msdosfs has problems to read MSDOS Floppy o kern/132145 fs [panic] File System Hard Crashes o kern/131441 fs [unionfs] [nullfs] unionfs and/or nullfs not combineab o kern/131360 fs [nfs] poor scaling behavior of the NFS server under lo o kern/131342 fs [nfs] mounting/unmounting of disks causes NFS to fail o bin/131341 fs makefs: error "Bad file descriptor" on the mount poin o kern/130920 fs [msdosfs] cp(1) takes 100% CPU time while copying file o kern/130210 fs [nullfs] Error by check nullfs o kern/129760 fs [nfs] after 'umount -f' of a stale NFS share FreeBSD l o kern/129488 fs [smbfs] Kernel "bug" when using smbfs in smbfs_smb.c: o kern/129231 fs [ufs] [patch] New UFS mount (norandom) option - mostly o kern/129152 fs [panic] non-userfriendly panic when trying to mount(8) o kern/127787 fs [lor] [ufs] Three LORs: vfslock/devfs/vfslock, ufs/vfs o bin/127270 fs fsck_msdosfs(8) may crash if BytesPerSec is zero o kern/127029 fs [panic] mount(8): trying to mount a write protected zi o kern/126973 fs [unionfs] [hang] System hang with unionfs and init chr o kern/126553 fs [unionfs] unionfs move directory problem 2 (files appe o kern/126287 fs [ufs] [panic] Kernel panics while mounting an UFS file o kern/125895 fs [ffs] [panic] kernel: panic: ffs_blkfree: freeing free s kern/125738 fs [zfs] [request] SHA256 acceleration in ZFS o kern/123939 fs [msdosfs] corrupts new files o bin/123574 fs [unionfs] df(1) -t option destroys info for unionfs (a o kern/122380 fs [ffs] ffs_valloc:dup alloc (Soekris 4801/7.0/USB Flash o 
bin/122172 fs [fs]: amd(8) automount daemon dies on 6.3-STABLE i386, o bin/121898 fs [nullfs] pwd(1)/getcwd(2) fails with Permission denied o kern/121385 fs [unionfs] unionfs cross mount -> kernel panic o bin/121072 fs [smbfs] mount_smbfs(8) cannot normally convert the cha o kern/120483 fs [ntfs] [patch] NTFS filesystem locking changes o kern/120482 fs [ntfs] [patch] Sync style changes between NetBSD and F o kern/118912 fs [2tb] disk sizing/geometry problem with large array o kern/118713 fs [minidump] [patch] Display media size required for a k o kern/118318 fs [nfs] NFS server hangs under special circumstances o bin/118249 fs [ufs] mv(1): moving a directory changes its mtime o kern/118126 fs [nfs] [patch] Poor NFS server write performance o kern/118107 fs [ntfs] [panic] Kernel panic when accessing a file at N o kern/117954 fs [ufs] dirhash on very large directories blocks the mac o bin/117315 fs [smbfs] mount_smbfs(8) and related options can't mount o kern/117158 fs [zfs] zpool scrub causes panic if geli vdevs detach on o bin/116980 fs [msdosfs] [patch] mount_msdosfs(8) resets some flags f o conf/116931 fs lack of fsck_cd9660 prevents mounting iso images with o kern/116583 fs [ffs] [hang] System freezes for short time when using o bin/115361 fs [zfs] mount(8) gets into a state where it won't set/un o kern/114955 fs [cd9660] [patch] [request] support for mask,dirmask,ui o kern/114847 fs [ntfs] [patch] [request] dirmask support for NTFS ala o kern/114676 fs [ufs] snapshot creation panics: snapacct_ufs2: bad blo o bin/114468 fs [patch] [request] add -d option to umount(8) to detach o kern/113852 fs [smbfs] smbfs does not properly implement DFS referral o bin/113838 fs [patch] [request] mount(8): add support for relative p o bin/113049 fs [patch] [request] make quot(8) use getopt(3) and show o kern/112658 fs [smbfs] [patch] smbfs and caching problems (resolves b o kern/111843 fs [msdosfs] Long Names of files are incorrectly created o kern/111782 fs [ufs] dump(8) fails horribly for large filesystems s bin/111146 fs [2tb] fsck(8) fails on 6T filesystem o bin/107829 fs [2TB] fdisk(8): invalid boundary checking in fdisk / w o kern/106107 fs [ufs] left-over fsck_snapshot after unfinished backgro o kern/104406 fs [ufs] Processes get stuck in "ufs" state under persist o kern/104133 fs [ext2fs] EXT2FS module corrupts EXT2/3 filesystems o kern/103035 fs [ntfs] Directories in NTFS mounted disc images appear o kern/101324 fs [smbfs] smbfs sometimes not case sensitive when it's s o kern/99290 fs [ntfs] mount_ntfs ignorant of cluster sizes s bin/97498 fs [request] newfs(8) has no option to clear the first 12 o kern/97377 fs [ntfs] [patch] syntax cleanup for ntfs_ihash.c o kern/95222 fs [cd9660] File sections on ISO9660 level 3 CDs ignored o kern/94849 fs [ufs] rename on UFS filesystem is not atomic o bin/94810 fs fsck(8) incorrectly reports 'file system marked clean' o kern/94769 fs [ufs] Multiple file deletions on multi-snapshotted fil o kern/94733 fs [smbfs] smbfs may cause double unlock o kern/93942 fs [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D o kern/92272 fs [ffs] [hang] Filling a filesystem while creating a sna o kern/91134 fs [smbfs] [patch] Preserve access and modification time a kern/90815 fs [smbfs] [patch] SMBFS with character conversions somet o kern/88657 fs [smbfs] windows client hang when browsing a samba shar o kern/88555 fs [panic] ffs_blkfree: freeing free frag on AMD 64 o bin/87966 fs [patch] newfs(8): introduce -A flag for newfs to enabl o kern/87859 fs [smbfs] System 
reboot while umount smbfs. o kern/86587 fs [msdosfs] rm -r /PATH fails with lots of small files o bin/85494 fs fsck_ffs: unchecked use of cg_inosused macro etc. o kern/80088 fs [smbfs] Incorrect file time setting on NTFS mounted vi o bin/74779 fs Background-fsck checks one filesystem twice and omits o kern/73484 fs [ntfs] Kernel panic when doing `ls` from the client si o bin/73019 fs [ufs] fsck_ufs(8) cannot alloc 607016868 bytes for ino o kern/71774 fs [ntfs] NTFS cannot "see" files on a WinXP filesystem o bin/70600 fs fsck(8) throws files away when it can't grow lost+foun o kern/68978 fs [panic] [ufs] crashes with failing hard disk, loose po o kern/67326 fs [msdosfs] crash after attempt to mount write protected o kern/65920 fs [nwfs] Mounted Netware filesystem behaves strange o kern/65901 fs [smbfs] [patch] smbfs fails fsx write/truncate-down/tr o kern/61503 fs [smbfs] mount_smbfs does not work as non-root o kern/55617 fs [smbfs] Accessing an nsmb-mounted drive via a smb expo o kern/51685 fs [hang] Unbounded inode allocation causes kernel to loc o kern/36566 fs [smbfs] System reboot with dead smb mount and umount o bin/27687 fs fsck(8) wrapper is not properly passing options to fsc o kern/18874 fs [2TB] 32bit NFS servers export wrong negative values t o kern/9619 fs [nfs] Restarting mountd kills existing mounts 326 problems total. From owner-freebsd-fs@FreeBSD.ORG Mon Jul 15 19:32:34 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 33361F8D; Mon, 15 Jul 2013 19:32:34 +0000 (UTC) (envelope-from mckusick@mckusick.com) Received: from chez.mckusick.com (chez.mckusick.com [IPv6:2001:5a8:4:7e72:4a5b:39ff:fe12:452]) by mx1.freebsd.org (Postfix) with ESMTP id 0B13F303; Mon, 15 Jul 2013 19:32:33 +0000 (UTC) Received: from chez.mckusick.com (localhost [127.0.0.1]) by chez.mckusick.com (8.14.3/8.14.3) with ESMTP id r6FJWSxM087108; Mon, 15 Jul 2013 12:32:28 -0700 (PDT) (envelope-from mckusick@chez.mckusick.com) Message-Id: <201307151932.r6FJWSxM087108@chez.mckusick.com> To: Dan Thomas Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?) In-reply-to: Date: Mon, 15 Jul 2013 12:32:28 -0700 From: Kirk McKusick Cc: freebsd-fs@freebsd.org, Palle Girgensohn , Jeff Roberson , Julian Akehurst X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Jul 2013 19:32:34 -0000 > Date: Mon, 15 Jul 2013 10:51:10 +0100 > Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?) > From: Dan Thomas > To: Kirk McKusick > Cc: Palle Girgensohn , freebsd-fs@freebsd.org, > Jeff Roberson , > Julian Akehurst > X-ASK-Info: Message Queued (2013/07/15 02:51:22) > X-ASK-Info: Confirmed by User (2013/07/15 02:55:04) > > On 11 June 2013 01:17, Kirk McKusick wrote: > > OK, good to have it narrowed down. I will look to devise some > > additional diagnostics that hopefully will help tease out the > > bug. I'll hopefully get back to you soon. > > Hi, > > Is there any news on this issue? We're still running several servers > that are exhibiting this problem (most recently, one that seems to be > leaking around 10gb/hour), and it's getting to the point where we're > looking at moving to a different OS until it's resolved. 
>
> We have access to several production systems with this problem and (at
> least from time to time) will have systems with a significant leak on
> them that we can experiment with. Is there any way we can assist with
> tracking this down? Any diagnostics or testing that would be useful?
>
> Thanks,
> Dan

Hi Dan (and Palle),

Sorry for the long delay with no help / news. I have gotten side-tracked on several projects and have had little time to try and devise some tests that would help find the cause of the lost space. It almost certainly is a one-line fix (a missing vput or vrele, probably in some error path), but finding where it goes is the hard part :-) I have had little success in inserting code that tracks reference counts (too many false positives). So, I am going to need some help from you to narrow it down.

My belief is that there is some set of filesystem operations (system calls) that is leading to the problem. Notably, a file is being created, data put into it, then the file is deleted (either before or after being closed). Somehow a reference to that file persists even though nothing valid refers to it any longer, so the filesystem thinks it is still live and does not delete it. When you do the forcible unmount, these files get cleared and the space shows back up.

What I need to devise is a small test program doing the set of system calls that cause this to happen. The way that I would like to get it is to have you `ktrace -i' your application and then run your application just long enough to create at least one of these lost files. The goal is to minimize the amount of ktrace data through which we need to sift.

In preparation for doing this test you need to have a kernel compiled with `options DIAGNOSTIC', or, if you prefer, just add `#define DIAGNOSTIC 1' to the top of sys/kern/vfs_subr.c. You will know you have at least one offending file when you try to unmount the affected filesystem and find it busy. Before doing the `umount -f', enable busy printing using `sysctl debug.busyprt=1'. Then capture the console output, which will show the details of all the vnodes that had to be forcibly flushed. Hopefully we will then be able to correlate them back to the files (NAMI in the ktrace output) with which they were associated. We may need to augment the NAMI data with the inode number of the associated file to make the association with the busyprt output.

Anyway, once we have that, we can look at all the system calls done on those files and create a small test program that exhibits the problem. Given a small test program, Jeff or I can track down the offending system call path and nail this pernicious bug once and for all.
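[Condensed into commands, the procedure above runs roughly as follows; this is only a sketch, and the application name, trace file and mount point are placeholders:

add `options DIAGNOSTIC' to the kernel config (or `#define DIAGNOSTIC 1' in sys/kern/vfs_subr.c), rebuild and reboot
$ ktrace -i -f /var/tmp/app.ktrace your_application      (run it until at least one file has leaked)
# sysctl debug.busyprt=1
# umount -f /the/affected/filesystem                     (the console now lists the busy vnodes)
$ kdump -f /var/tmp/app.ktrace | grep NAMI               (pathnames the traced processes touched, to correlate with the busyprt output)]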
Kirk McKusick From owner-freebsd-fs@FreeBSD.ORG Mon Jul 15 21:59:53 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 52805F8 for ; Mon, 15 Jul 2013 21:59:53 +0000 (UTC) (envelope-from rmh.aybabtu@gmail.com) Received: from mail-qa0-x22d.google.com (mail-qa0-x22d.google.com [IPv6:2607:f8b0:400d:c00::22d]) by mx1.freebsd.org (Postfix) with ESMTP id 9AB30D5B for ; Mon, 15 Jul 2013 21:59:52 +0000 (UTC) Received: by mail-qa0-f45.google.com with SMTP id ci6so1899976qab.4 for ; Mon, 15 Jul 2013 14:59:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=rQHm/TzTV4kzSn6ocp6i6epGPTb1pK1fOz+JbMzYE2w=; b=C88GYWK6XZaSVk4dvb1vJ/K96A/ThJebD5KQIW02e9jDGSWs5KboC5syor8vPad3EF XTS6KGSgAoWZszca59MpfhQUW46bQvgIFd8L57mpRlg0+F3QMbSrY/BWV3xgA73oNlhr jGZSvffAwTX14i4eILSp10hIlvqRXQ/TlGnc0AiPRH+i2axrWIvInBVvKfyFODwwGPHx E216VbTkrREf3NVtI50PgAJr1HhY41cTxsZ4y/E+IYq4ppzsaagX1QPn+gb/b1aVtFr+ uAm6A2nK4pD2BbPZspRn8xcSjo8gTuogFiWxu+UOKa7uO0XLhDMJx0NDIf2RwNLsszMp 6TVw== MIME-Version: 1.0 X-Received: by 10.49.24.52 with SMTP id r20mr52491305qef.54.1373925592152; Mon, 15 Jul 2013 14:59:52 -0700 (PDT) Sender: rmh.aybabtu@gmail.com Received: by 10.49.26.193 with HTTP; Mon, 15 Jul 2013 14:59:52 -0700 (PDT) In-Reply-To: <201307132225.r6DMPP7p002100@chez.mckusick.com> References: <201307132225.r6DMPP7p002100@chez.mckusick.com> Date: Mon, 15 Jul 2013 23:59:52 +0200 X-Google-Sender-Auth: Ume9qtTXxYp2-xTaegFff6uUa5E Message-ID: Subject: Re: Compatibility options for mount(8) From: Robert Millan To: Kirk McKusick Content-Type: text/plain; charset=UTF-8 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Jul 2013 21:59:53 -0000 2013/7/14 Kirk McKusick : > OK to leave it. 
Committed then, thanks everyone :-) -- Robert Millan From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 11:41:44 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 33946236 for ; Tue, 16 Jul 2013 11:41:44 +0000 (UTC) (envelope-from daniel@digsys.bg) Received: from smtp-sofia.digsys.bg (smtp-sofia.digsys.bg [193.68.21.123]) by mx1.freebsd.org (Postfix) with ESMTP id BF224F9C for ; Tue, 16 Jul 2013 11:41:43 +0000 (UTC) Received: from dcave.digsys.bg (dcave.digsys.bg [193.68.6.1]) (authenticated bits=0) by smtp-sofia.digsys.bg (8.14.6/8.14.6) with ESMTP id r6GBfVaG010630 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO) for ; Tue, 16 Jul 2013 14:41:32 +0300 (EEST) (envelope-from daniel@digsys.bg) Message-ID: <51E5316B.9070201@digsys.bg> Date: Tue, 16 Jul 2013 14:41:31 +0300 From: Daniel Kalchev User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130627 Thunderbird/17.0.7 MIME-Version: 1.0 To: freebsd-fs Subject: ZFS vdev I/O questions Content-Type: text/plain; charset=windows-1251; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 11:41:44 -0000 I am observing some "strange" behaviour with I/O spread on ZFS vdevs and thought I might ask if someone has observed it too. The system hardware is an Supermicro X8DTH-6F board with integrated LSI2008 controller, two Xeon E5620 CPUs and 72GB or RAM (6x4 + 6x8 GB modules). Runs 9-stable r252690. It has currently 18 drive zpool, split on three 6 drive raidz2 vdevs, plus ZIL and L2ARC on separate SSDs (240GB Intel 520). The ZIL consists of two partitions of the boot SSDs (Intel 320), not mirrored. 
The zpool layout is pool: storage state: ONLINE scan: scrub canceled on Thu Jul 11 17:14:50 2013 config: NAME STATE READ WRITE CKSUM storage ONLINE 0 0 0 raidz2-0 ONLINE 0 0 0 gpt/disk00 ONLINE 0 0 0 gpt/disk01 ONLINE 0 0 0 gpt/disk02 ONLINE 0 0 0 gpt/disk03 ONLINE 0 0 0 gpt/disk04 ONLINE 0 0 0 gpt/disk05 ONLINE 0 0 0 raidz2-1 ONLINE 0 0 0 gpt/disk06 ONLINE 0 0 0 gpt/disk07 ONLINE 0 0 0 gpt/disk08 ONLINE 0 0 0 gpt/disk09 ONLINE 0 0 0 gpt/disk10 ONLINE 0 0 0 gpt/disk11 ONLINE 0 0 0 raidz2-2 ONLINE 0 0 0 gpt/disk12 ONLINE 0 0 0 gpt/disk13 ONLINE 0 0 0 gpt/disk14 ONLINE 0 0 0 gpt/disk15 ONLINE 0 0 0 gpt/disk16 ONLINE 0 0 0 gpt/disk17 ONLINE 0 0 0 logs ada0p2 ONLINE 0 0 0 ada1p2 ONLINE 0 0 0 cache da20p2 ONLINE 0 0 0 zdb output storage: version: 5000 name: 'storage' state: 0 txg: 5258772 pool_guid: 17094379857311239400 hostid: 3505628652 hostname: 'a1.register.bg' vdev_children: 5 vdev_tree: type: 'root' id: 0 guid: 17094379857311239400 children[0]: type: 'raidz' id: 0 guid: 2748500753748741494 nparity: 2 metaslab_array: 33 metaslab_shift: 37 ashift: 12 asize: 18003521961984 is_log: 0 create_txg: 4 children[0]: type: 'disk' id: 0 guid: 5074824874132816460 path: '/dev/gpt/disk00' phys_path: '/dev/gpt/disk00' whole_disk: 1 DTL: 378 create_txg: 4 children[1]: type: 'disk' id: 1 guid: 14410366944090513563 path: '/dev/gpt/disk01' phys_path: '/dev/gpt/disk01' whole_disk: 1 DTL: 53 create_txg: 4 children[2]: type: 'disk' id: 2 guid: 3526681390841761237 path: '/dev/gpt/disk02' phys_path: '/dev/gpt/disk02' whole_disk: 1 DTL: 52 create_txg: 4 children[3]: type: 'disk' id: 3 guid: 3773850995072323004 path: '/dev/gpt/disk03' phys_path: '/dev/gpt/disk03' whole_disk: 1 DTL: 51 create_txg: 4 children[4]: type: 'disk' id: 4 guid: 16528489666301728411 path: '/dev/gpt/disk04' phys_path: '/dev/gpt/disk04' whole_disk: 1 DTL: 50 create_txg: 4 children[5]: type: 'disk' id: 5 guid: 11222774817699257051 path: '/dev/gpt/disk05' phys_path: '/dev/gpt/disk05' whole_disk: 1 DTL: 44147 create_txg: 4 children[1]: type: 'raidz' id: 1 guid: 614220834244218709 nparity: 2 metaslab_array: 39 metaslab_shift: 37 ashift: 12 asize: 18003521961984 is_log: 0 create_txg: 40 children[0]: type: 'disk' id: 0 guid: 8076478524731550200 path: '/dev/gpt/disk06' phys_path: '/dev/gpt/disk06' whole_disk: 1 DTL: 2914 create_txg: 40 children[1]: type: 'disk' id: 1 guid: 1689851194543981566 path: '/dev/gpt/disk07' phys_path: '/dev/gpt/disk07' whole_disk: 1 DTL: 48 create_txg: 40 children[2]: type: 'disk' id: 2 guid: 9743236178648200269 path: '/dev/gpt/disk08' phys_path: '/dev/gpt/disk08' whole_disk: 1 DTL: 47 create_txg: 40 children[3]: type: 'disk' id: 3 guid: 10157617457760516410 path: '/dev/gpt/disk09' phys_path: '/dev/gpt/disk09' whole_disk: 1 DTL: 46 create_txg: 40 children[4]: type: 'disk' id: 4 guid: 5035981195206926078 path: '/dev/gpt/disk10' phys_path: '/dev/gpt/disk10' whole_disk: 1 DTL: 45 create_txg: 40 children[5]: type: 'disk' id: 5 guid: 4975835521778875251 path: '/dev/gpt/disk11' phys_path: '/dev/gpt/disk11' whole_disk: 1 DTL: 44149 create_txg: 40 children[2]: type: 'raidz' id: 2 guid: 7453512836015019221 nparity: 2 metaslab_array: 38974 metaslab_shift: 37 ashift: 12 asize: 18003521961984 is_log: 0 create_txg: 4455560 children[0]: type: 'disk' id: 0 guid: 11182458869377968267 path: '/dev/gpt/disk12' phys_path: '/dev/gpt/disk12' whole_disk: 1 DTL: 45059 create_txg: 4455560 children[1]: type: 'disk' id: 1 guid: 5844283175515272344 path: '/dev/gpt/disk13' phys_path: '/dev/gpt/disk13' whole_disk: 1 DTL: 44145 create_txg: 4455560 
children[2]: type: 'disk' id: 2 guid: 13095364699938843583 path: '/dev/gpt/disk14' phys_path: '/dev/gpt/disk14' whole_disk: 1 DTL: 44144 create_txg: 4455560 children[3]: type: 'disk' id: 3 guid: 5196507898996589388 path: '/dev/gpt/disk15' phys_path: '/dev/gpt/disk15' whole_disk: 1 DTL: 44143 create_txg: 4455560 children[4]: type: 'disk' id: 4 guid: 12809770017318709512 path: '/dev/gpt/disk16' phys_path: '/dev/gpt/disk16' whole_disk: 1 DTL: 44142 create_txg: 4455560 children[5]: type: 'disk' id: 5 guid: 7339883019925920701 path: '/dev/gpt/disk17' phys_path: '/dev/gpt/disk17' whole_disk: 1 DTL: 44141 create_txg: 4455560 children[3]: type: 'disk' id: 3 guid: 18011869864924559827 path: '/dev/ada0p2' phys_path: '/dev/ada0p2' whole_disk: 1 metaslab_array: 16675 metaslab_shift: 26 ashift: 12 asize: 8585216000 is_log: 1 DTL: 86787 create_txg: 5182360 children[4]: type: 'disk' id: 4 guid: 1338775535758010670 path: '/dev/ada1p2' phys_path: '/dev/ada1p2' whole_disk: 1 metaslab_array: 16693 metaslab_shift: 26 ashift: 12 asize: 8585216000 is_log: 1 DTL: 86788 create_txg: 5182377 features_for_read: Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640, firmware MKAOA3B0) -- these are 512 byte sector drives, but da0 has been replaced by Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, firmware CC24) -- this is an 4k sector drive of a new generation (notice the relatively 'old' firmware, that can't be upgraded). Drives da6-da17 are also Seagate Barracuda 7200.14 (AF) but (ST3000DM001-9YN166, firmware CC4H) -- the more "normal" part number. Some have firmware CC4C which I replace drive by drive (but other than the excessive load counts no other issues so far). The only ZFS related tuning is in /etc/sysctl.conf # improve ZFS resilver vfs.zfs.resilver_delay=0 vfs.zfs.scrub_delay=0 vfs.zfs.top_maxinflight=128 vfs.zfs.resilver_min_time_ms=5000 vfs.zfs.vdev.max_pending=24 # L2ARC: vfs.zfs.l2arc_norw=0 vfs.zfs.l2arc_write_max=83886080 vfs.zfs.l2arc_write_boost=83886080 The pool of course had dedup and had serious dedup ratios, like over 10x. In general, with the ZIL and L2ARC, the only trouble I have seen with dedup is when deleting lots of data... which this server has seen plenty of. During this experiment, I have moved most data to other server and un-dedup the last remaining TBs. While doing zfs destroy on an 2-3TB dataset, I observe very annoying behaviour. 
The pool would stay mostly idle, accepting almost no I/O and doing small random reads, like this $ zpool iostat storage 10 storage 45.3T 3.45T 466 0 1.82M 0 storage 45.3T 3.45T 50 0 203K 0 storage 45.3T 3.45T 45 25 183K 1.70M storage 45.3T 3.45T 49 0 199K 0 storage 45.3T 3.45T 50 0 202K 0 storage 45.3T 3.45T 51 0 204K 0 storage 45.3T 3.45T 57 0 230K 0 storage 45.3T 3.45T 65 0 260K 0 storage 45.3T 3.45T 68 25 274K 1.70M storage 45.3T 3.45T 65 0 260K 0 storage 45.3T 3.45T 64 0 260K 0 storage 45.3T 3.45T 67 0 272K 0 storage 45.3T 3.45T 66 0 266K 0 storage 45.3T 3.45T 64 0 258K 0 storage 45.3T 3.45T 62 25 250K 1.70M storage 45.3T 3.45T 57 0 231K 0 storage 45.3T 3.45T 58 0 235K 0 storage 45.3T 3.45T 66 0 267K 0 storage 45.3T 3.45T 64 0 257K 0 storage 45.3T 3.45T 60 0 241K 0 storage 45.3T 3.45T 50 0 203K 0 storage 45.3T 3.45T 52 25 209K 1.70M storage 45.3T 3.45T 54 0 217K 0 storage 45.3T 3.45T 51 0 205K 0 storage 45.3T 3.45T 54 0 216K 0 storage 45.3T 3.45T 55 0 222K 0 storage 45.3T 3.45T 56 0 226K 0 storage 45.3T 3.45T 65 0 264K 0 storage 45.3T 3.45T 71 0 286K 0 The write peaks are from processes syncing data to the pool - in this state it does not do reads (the data the sync process deals with is already in ARC). Then it goes into writing back to the pool (perhaps DDT metadata) storage 45.3T 3.45T 17 24.4K 69.6K 97.5M storage 45.3T 3.45T 0 19.6K 0 78.5M storage 45.3T 3.45T 0 14.2K 0 56.8M storage 45.3T 3.45T 0 7.90K 0 31.6M storage 45.3T 3.45T 0 7.81K 0 32.8M storage 45.3T 3.45T 0 9.54K 0 38.2M storage 45.3T 3.45T 0 7.07K 0 28.3M storage 45.3T 3.45T 0 7.70K 0 30.8M storage 45.3T 3.45T 0 6.19K 0 24.8M storage 45.3T 3.45T 0 5.45K 0 21.8M storage 45.3T 3.45T 0 5.78K 0 24.7M storage 45.3T 3.45T 0 5.29K 0 21.2M storage 45.3T 3.45T 0 5.69K 0 22.8M storage 45.3T 3.45T 0 5.52K 0 22.1M storage 45.3T 3.45T 0 3.26K 0 13.1M storage 45.3T 3.45T 0 1.77K 0 7.10M storage 45.3T 3.45T 0 1.63K 0 8.14M storage 45.3T 3.45T 0 1.41K 0 5.64M storage 45.3T 3.45T 0 1.22K 0 4.88M storage 45.3T 3.45T 0 1.27K 0 5.09M storage 45.3T 3.45T 0 1.06K 0 4.26M storage 45.3T 3.45T 0 1.07K 0 4.30M storage 45.3T 3.45T 0 979 0 3.83M storage 45.3T 3.45T 0 1002 0 3.91M storage 45.3T 3.45T 0 1010 0 3.95M storage 45.3T 3.45T 0 948 2.40K 3.71M storage 45.3T 3.45T 0 939 0 3.67M storage 45.3T 3.45T 0 1023 0 7.10M storage 45.3T 3.45T 0 1.01K 4.80K 4.04M storage 45.3T 3.45T 0 822 0 3.22M storage 45.3T 3.45T 0 434 0 1.70M storage 45.3T 3.45T 0 398 2.40K 1.56M For quite some time, there are no reads from the pool. 
When that happens, gstat (gstat -f 'da[0-9]*$') displays something like this: dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name 24 1338 0 0 0.0 1338 12224 17.8 100.9| da0 24 6888 0 0 0.0 6888 60720 3.5 100.0| da1 24 6464 0 0 0.0 6464 71997 3.7 100.0| da2 24 6117 0 0 0.0 6117 82386 3.9 99.9| da3 24 6455 0 0 0.0 6455 66822 3.7 100.0| da4 24 6782 0 0 0.0 6782 69207 3.5 100.0| da5 24 698 0 0 0.0 698 27533 34.1 99.6| da6 24 590 0 0 0.0 590 21627 40.9 99.7| da7 24 561 0 0 0.0 561 21031 42.8 100.2| da8 24 724 0 0 0.0 724 25583 33.1 99.9| da9 24 567 0 0 0.0 567 22965 41.4 98.0| da10 24 566 0 0 0.0 566 21834 42.4 99.9| da11 24 586 0 0 0.0 586 4899 43.5 100.2| da12 24 487 0 0 0.0 487 4008 49.3 100.9| da13 24 628 0 0 0.0 628 5007 38.9 100.2| da14 24 714 0 0 0.0 714 5706 33.8 99.9| da15 24 595 0 0 0.0 595 4831 39.8 99.8| da16 24 485 0 0 0.0 485 3932 49.2 100.1| da17 0 0 0 0 0.0 0 0 0.0 0.0| da18 0 0 0 0 0.0 0 0 0.0 0.0| da19 0 0 0 0 0.0 0 0 0.0 0.0| da20 0 0 0 0 0.0 0 0 0.0 0.0| ada0 0 0 0 0 0.0 0 0 0.0 0.0| ada1 (drives da8 and 19 are spares, da20 is the L2ARC SSD drive, ada0 and ada0 are the boot SSDs in separate zpool) Now, here comes the weird part. the gpart display would show intensive writes to all vdevs (da0-da5, da6-da11,da12-da17) then one of the vdevs would complete writing, and stop writing, while other vdevs continue, at the end only one vdev writes until as it seems, data is completely written to all vdevs (this can be observed in the zfs iostat output above with the decreasing write IOPS each 10 seconds), then there is a few seconds "do nothing" period and then we are back to small reads. The other observation I have is with the first vdev: the 512b drives do a lot of I/O fast, complete first and then sit idle, while da0 continues to write for many more seconds. They consistently show many more IOPS than the other drives for this type of activity -- on streaming writes all drives behave more or less the same. It is only on this un-dedup scenario where the difference is so much pronounced. All the vdevs in the pool are with ashift=12 so the theory that ZFS actually issues 512b writes to these drives can't be true, can it? Another worry is this Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, firmware CC24) drive. It seems constantly under-performing. Does anyone know if it is so different from the ST3000DM001-9YN166 drives? Might be, I should just replace it? My concern is the bursty and irregular nature of writing to vdevs. As it is now, an write operation to the pool needs to wait for all of the vdev writes to complete which is this case takes tens of seconds. A single drive in an vdev that underperforms will slow down the entire pool. Perhaps ZFS could prioritize vdev usage based on the vdev troughput, similar to how it prioritizes writes based on how much it is full. Also, what is ZFS doing during the idle periods? Are there some timeouts involved? It is certainly not using any CPU... The small random I/O is certainly not loading the disks. Then, I have 240GB L2ARC and secondarycache=metadata for the pool. Yet, the DDT apparently does not want to go there... Is there a way to "force" it to be loaded to L2ARC? Before the last big delete, I had zdb -D storage DDT-sha256-zap-duplicate: 19907778 entries, size 1603 on disk, 259 in core DDT-sha256-zap-unique: 30101659 entries, size 1428 on disk, 230 in core dedup = 1.98, compress = 1.00, copies = 1.03, dedup * compress / copies = 1.92 With time, the in core values stay more or less the same. 
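[A back-of-the-envelope aside on those numbers, simple arithmetic rather than anything from the thread: with the per-entry sizes zdb reports, holding the whole DDT in core would take roughly 11 GiB of ARC, and the on-disk form is about 70 GiB, which a 240GB L2ARC could in principle hold comfortably.

$ echo '(19907778*259 + 30101659*230) / 2^30' | bc -l      (about 11.2 GiB in core)
$ echo '(19907778*1603 + 30101659*1428) / 2^30' | bc -l    (about 69.8 GiB on disk)
$ sysctl kstat.zfs.misc.arcstats | grep l2_                (l2_size and l2_hits show how much the L2ARC actually holds and whether it is being hit)]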
I also discovered, that the L2ARC drive apparently is not subject to TRIM for some reason. TRIM works on the boot drives, but these are connected to the motherboard SATA ports). # sysctl kern.cam.da.20 kern.cam.da.20.delete_method: ATA_TRIM kern.cam.da.20.minimum_cmd_size: 6 kern.cam.da.20.sort_io_queue: 0 kern.cam.da.20.error_inject: 0 # sysctl -a | grep trim vfs.zfs.vdev.trim_on_init: 1 vfs.zfs.vdev.trim_max_pending: 64 vfs.zfs.vdev.trim_max_bytes: 2147483648 vfs.zfs.trim.enabled: 1 vfs.zfs.trim.max_interval: 1 vfs.zfs.trim.timeout: 30 vfs.zfs.trim.txg_delay: 32 kstat.zfs.misc.zio_trim.bytes: 139489971200 kstat.zfs.misc.zio_trim.success: 628351 kstat.zfs.misc.zio_trim.unsupported: 622819 kstat.zfs.misc.zio_trim.failed: 0 Yet, I don't observe any BIO_DELETE activity to this drive with gstat -d Wasn't TRIM supposed to work on drives attached to LSI2008 in 9-stable? Daniel From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 11:53:12 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 6778B4AC for ; Tue, 16 Jul 2013 11:53:12 +0000 (UTC) (envelope-from feld@freebsd.org) Received: from out1-smtp.messagingengine.com (out1-smtp.messagingengine.com [66.111.4.25]) by mx1.freebsd.org (Postfix) with ESMTP id 4266AC0 for ; Tue, 16 Jul 2013 11:53:11 +0000 (UTC) Received: from compute6.internal (compute6.nyi.mail.srv.osa [10.202.2.46]) by gateway1.nyi.mail.srv.osa (Postfix) with ESMTP id 50E0D20F78 for ; Tue, 16 Jul 2013 07:53:10 -0400 (EDT) Received: from frontend2 ([10.202.2.161]) by compute6.internal (MEProxy); Tue, 16 Jul 2013 07:53:10 -0400 DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d= messagingengine.com; h=date:from:to:subject:message-id :references:mime-version:content-type:in-reply-to; s=smtpout; bh=qX1EwoWbVQPKDn700pWCAjjGrYI=; b=n4RZQU1oA4t8qeZJs+MjzNdKAT8a NLe/Et3/QfSZ9wzhRxObj9bsIetrhSBiidSc1Mk7M7xUr5NMyux6ikKYGSn90Yv4 YzdNQ+offHF08VrOQL3d7dR9q9LDutAhX2+efWCJALQL05BVqB0C9df6mzbEvEGb rQrzb0vIK7c9IpA= X-Sasl-enc: +Ml65F9anGLoJbKs+tdj46YkjSfG8VO+dH+mPLkaSHS3 1373975590 Received: from mwi1.coffeenet.org (unknown [66.170.3.2]) by mail.messagingengine.com (Postfix) with ESMTPA id 0F0A56800AE for ; Tue, 16 Jul 2013 07:53:10 -0400 (EDT) Date: Tue, 16 Jul 2013 06:53:05 -0500 From: Mark Felder To: freebsd-fs@freebsd.org Subject: Re: ZFS vdev I/O questions Message-ID: <20130716115305.GA40918@mwi1.coffeenet.org> References: <51E5316B.9070201@digsys.bg> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E5316B.9070201@digsys.bg> User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 11:53:12 -0000 On Tue, Jul 16, 2013 at 02:41:31PM +0300, Daniel Kalchev wrote: > I am observing some "strange" behaviour with I/O spread on ZFS vdevs and > thought I might ask if someone has observed it too. > --SNIP-- > Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640, > firmware MKAOA3B0) -- these are 512 byte sector drives, but da0 has been > replaced by Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, firmware > CC24) -- this is an 4k sector drive of a new generation (notice the > relatively 'old' firmware, that can't be upgraded). 
--SNIP-- > The other observation I have is with the first vdev: the 512b drives do > a lot of I/O fast, complete first and then sit idle, while da0 continues > to write for many more seconds. They consistently show many more IOPS > than the other drives for this type of activity -- on streaming writes > all drives behave more or less the same. It is only on this un-dedup > scenario where the difference is so much pronounced. > > All the vdevs in the pool are with ashift=12 so the theory that ZFS > actually issues 512b writes to these drives can't be true, can it? > > Another worry is this Seagate Barracuda 7200.14 (AF) > (ST3000DM001-1CH166, firmware CC24) drive. It seems constantly > under-performing. Does anyone know if it is so different from the > ST3000DM001-9YN166 drives? Might be, I should just replace it? > A lot of information here. Those Hitachis are great drives. The addition of the Barracuda with different performance characteristics could be part of the problem. I'm glad you pointed out that the pool ashift=12 so we can try to rule that out. I'd be quite interested in knowing if some or perhaps even all of your issues go away simply by replacing that drive with another Hitachi. From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 11:58:22 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 08BDD788 for ; Tue, 16 Jul 2013 11:58:22 +0000 (UTC) (envelope-from prvs=190921e474=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 686A8FB for ; Tue, 16 Jul 2013 11:58:21 +0000 (UTC) Received: from r2d2 ([82.69.141.170]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50005008033.msg for ; Tue, 16 Jul 2013 12:58:18 +0100 X-Spam-Processed: mail1.multiplay.co.uk, Tue, 16 Jul 2013 12:58:18 +0100 (not processed: message from valid local sender) X-MDDKIM-Result: neutral (mail1.multiplay.co.uk) X-MDRemoteIP: 82.69.141.170 X-Return-Path: prvs=190921e474=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk X-MDaemon-Deliver-To: freebsd-fs@freebsd.org Message-ID: <3472068604314C9887FE3BD4CD42B7C8@multiplay.co.uk> From: "Steven Hartland" To: "Daniel Kalchev" , "freebsd-fs" References: <51E5316B.9070201@digsys.bg> Subject: Re: ZFS vdev I/O questions Date: Tue, 16 Jul 2013 12:58:43 +0100 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=response Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 11:58:22 -0000 One thing to check with this is to add -d to your gstat to see if your waiting deletes IO? I doubt it but worth checking. Regards Steve ----- Original Message ----- From: "Daniel Kalchev" To: "freebsd-fs" Sent: Tuesday, July 16, 2013 12:41 PM Subject: ZFS vdev I/O questions >I am observing some "strange" behaviour with I/O spread on ZFS vdevs and > thought I might ask if someone has observed it too. > > The system hardware is an Supermicro X8DTH-6F board with integrated > LSI2008 controller, two Xeon E5620 CPUs and 72GB or RAM (6x4 + 6x8 GB > modules). 
> Runs 9-stable r252690. > > It has currently 18 drive zpool, split on three 6 drive raidz2 vdevs, > plus ZIL and L2ARC on separate SSDs (240GB Intel 520). The ZIL consists > of two partitions of the boot SSDs (Intel 320), not mirrored. The zpool > layout is > > pool: storage > state: ONLINE > scan: scrub canceled on Thu Jul 11 17:14:50 2013 > config: > > NAME STATE READ WRITE CKSUM > storage ONLINE 0 0 0 > raidz2-0 ONLINE 0 0 0 > gpt/disk00 ONLINE 0 0 0 > gpt/disk01 ONLINE 0 0 0 > gpt/disk02 ONLINE 0 0 0 > gpt/disk03 ONLINE 0 0 0 > gpt/disk04 ONLINE 0 0 0 > gpt/disk05 ONLINE 0 0 0 > raidz2-1 ONLINE 0 0 0 > gpt/disk06 ONLINE 0 0 0 > gpt/disk07 ONLINE 0 0 0 > gpt/disk08 ONLINE 0 0 0 > gpt/disk09 ONLINE 0 0 0 > gpt/disk10 ONLINE 0 0 0 > gpt/disk11 ONLINE 0 0 0 > raidz2-2 ONLINE 0 0 0 > gpt/disk12 ONLINE 0 0 0 > gpt/disk13 ONLINE 0 0 0 > gpt/disk14 ONLINE 0 0 0 > gpt/disk15 ONLINE 0 0 0 > gpt/disk16 ONLINE 0 0 0 > gpt/disk17 ONLINE 0 0 0 > logs > ada0p2 ONLINE 0 0 0 > ada1p2 ONLINE 0 0 0 > cache > da20p2 ONLINE 0 0 0 > > > zdb output > > storage: > version: 5000 > name: 'storage' > state: 0 > txg: 5258772 > pool_guid: 17094379857311239400 > hostid: 3505628652 > hostname: 'a1.register.bg' > vdev_children: 5 > vdev_tree: > type: 'root' > id: 0 > guid: 17094379857311239400 > children[0]: > type: 'raidz' > id: 0 > guid: 2748500753748741494 > nparity: 2 > metaslab_array: 33 > metaslab_shift: 37 > ashift: 12 > asize: 18003521961984 > is_log: 0 > create_txg: 4 > children[0]: > type: 'disk' > id: 0 > guid: 5074824874132816460 > path: '/dev/gpt/disk00' > phys_path: '/dev/gpt/disk00' > whole_disk: 1 > DTL: 378 > create_txg: 4 > children[1]: > type: 'disk' > id: 1 > guid: 14410366944090513563 > path: '/dev/gpt/disk01' > phys_path: '/dev/gpt/disk01' > whole_disk: 1 > DTL: 53 > create_txg: 4 > children[2]: > type: 'disk' > id: 2 > guid: 3526681390841761237 > path: '/dev/gpt/disk02' > phys_path: '/dev/gpt/disk02' > whole_disk: 1 > DTL: 52 > create_txg: 4 > children[3]: > type: 'disk' > id: 3 > guid: 3773850995072323004 > path: '/dev/gpt/disk03' > phys_path: '/dev/gpt/disk03' > whole_disk: 1 > DTL: 51 > create_txg: 4 > children[4]: > type: 'disk' > id: 4 > guid: 16528489666301728411 > path: '/dev/gpt/disk04' > phys_path: '/dev/gpt/disk04' > whole_disk: 1 > DTL: 50 > create_txg: 4 > children[5]: > type: 'disk' > id: 5 > guid: 11222774817699257051 > path: '/dev/gpt/disk05' > phys_path: '/dev/gpt/disk05' > whole_disk: 1 > DTL: 44147 > create_txg: 4 > children[1]: > type: 'raidz' > id: 1 > guid: 614220834244218709 > nparity: 2 > metaslab_array: 39 > metaslab_shift: 37 > ashift: 12 > asize: 18003521961984 > is_log: 0 > create_txg: 40 > children[0]: > type: 'disk' > id: 0 > guid: 8076478524731550200 > path: '/dev/gpt/disk06' > phys_path: '/dev/gpt/disk06' > whole_disk: 1 > DTL: 2914 > create_txg: 40 > children[1]: > type: 'disk' > id: 1 > guid: 1689851194543981566 > path: '/dev/gpt/disk07' > phys_path: '/dev/gpt/disk07' > whole_disk: 1 > DTL: 48 > create_txg: 40 > children[2]: > type: 'disk' > id: 2 > guid: 9743236178648200269 > path: '/dev/gpt/disk08' > phys_path: '/dev/gpt/disk08' > whole_disk: 1 > DTL: 47 > create_txg: 40 > children[3]: > type: 'disk' > id: 3 > guid: 10157617457760516410 > path: '/dev/gpt/disk09' > phys_path: '/dev/gpt/disk09' > whole_disk: 1 > DTL: 46 > create_txg: 40 > children[4]: > type: 'disk' > id: 4 > guid: 5035981195206926078 > path: '/dev/gpt/disk10' > phys_path: '/dev/gpt/disk10' > whole_disk: 1 > DTL: 45 > create_txg: 40 > children[5]: > type: 'disk' > id: 5 > guid: 
4975835521778875251 > path: '/dev/gpt/disk11' > phys_path: '/dev/gpt/disk11' > whole_disk: 1 > DTL: 44149 > create_txg: 40 > children[2]: > type: 'raidz' > id: 2 > guid: 7453512836015019221 > nparity: 2 > metaslab_array: 38974 > metaslab_shift: 37 > ashift: 12 > asize: 18003521961984 > is_log: 0 > create_txg: 4455560 > children[0]: > type: 'disk' > id: 0 > guid: 11182458869377968267 > path: '/dev/gpt/disk12' > phys_path: '/dev/gpt/disk12' > whole_disk: 1 > DTL: 45059 > create_txg: 4455560 > children[1]: > type: 'disk' > id: 1 > guid: 5844283175515272344 > path: '/dev/gpt/disk13' > phys_path: '/dev/gpt/disk13' > whole_disk: 1 > DTL: 44145 > create_txg: 4455560 > children[2]: > type: 'disk' > id: 2 > guid: 13095364699938843583 > path: '/dev/gpt/disk14' > phys_path: '/dev/gpt/disk14' > whole_disk: 1 > DTL: 44144 > create_txg: 4455560 > children[3]: > type: 'disk' > id: 3 > guid: 5196507898996589388 > path: '/dev/gpt/disk15' > phys_path: '/dev/gpt/disk15' > whole_disk: 1 > DTL: 44143 > create_txg: 4455560 > children[4]: > type: 'disk' > id: 4 > guid: 12809770017318709512 > path: '/dev/gpt/disk16' > phys_path: '/dev/gpt/disk16' > whole_disk: 1 > DTL: 44142 > create_txg: 4455560 > children[5]: > type: 'disk' > id: 5 > guid: 7339883019925920701 > path: '/dev/gpt/disk17' > phys_path: '/dev/gpt/disk17' > whole_disk: 1 > DTL: 44141 > create_txg: 4455560 > children[3]: > type: 'disk' > id: 3 > guid: 18011869864924559827 > path: '/dev/ada0p2' > phys_path: '/dev/ada0p2' > whole_disk: 1 > metaslab_array: 16675 > metaslab_shift: 26 > ashift: 12 > asize: 8585216000 > is_log: 1 > DTL: 86787 > create_txg: 5182360 > children[4]: > type: 'disk' > id: 4 > guid: 1338775535758010670 > path: '/dev/ada1p2' > phys_path: '/dev/ada1p2' > whole_disk: 1 > metaslab_array: 16693 > metaslab_shift: 26 > ashift: 12 > asize: 8585216000 > is_log: 1 > DTL: 86788 > create_txg: 5182377 > features_for_read: > > Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640, > firmware MKAOA3B0) -- these are 512 byte sector drives, but da0 has been > replaced by Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, firmware > CC24) -- this is an 4k sector drive of a new generation (notice the > relatively 'old' firmware, that can't be upgraded). > Drives da6-da17 are also Seagate Barracuda 7200.14 (AF) but > (ST3000DM001-9YN166, firmware CC4H) -- the more "normal" part number. > Some have firmware CC4C which I replace drive by drive (but other than > the excessive load counts no other issues so far). > > The only ZFS related tuning is in /etc/sysctl.conf > # improve ZFS resilver > vfs.zfs.resilver_delay=0 > vfs.zfs.scrub_delay=0 > vfs.zfs.top_maxinflight=128 > vfs.zfs.resilver_min_time_ms=5000 > vfs.zfs.vdev.max_pending=24 > # L2ARC: > vfs.zfs.l2arc_norw=0 > vfs.zfs.l2arc_write_max=83886080 > vfs.zfs.l2arc_write_boost=83886080 > > > The pool of course had dedup and had serious dedup ratios, like over > 10x. In general, with the ZIL and L2ARC, the only trouble I have seen > with dedup is when deleting lots of data... which this server has seen > plenty of. During this experiment, I have moved most data to other > server and un-dedup the last remaining TBs. > > While doing zfs destroy on an 2-3TB dataset, I observe very annoying > behaviour. 
The pool would stay mostly idle, accepting almost no I/O and > doing small random reads, like this > > $ zpool iostat storage 10 > storage 45.3T 3.45T 466 0 1.82M 0 > storage 45.3T 3.45T 50 0 203K 0 > storage 45.3T 3.45T 45 25 183K 1.70M > storage 45.3T 3.45T 49 0 199K 0 > storage 45.3T 3.45T 50 0 202K 0 > storage 45.3T 3.45T 51 0 204K 0 > storage 45.3T 3.45T 57 0 230K 0 > storage 45.3T 3.45T 65 0 260K 0 > storage 45.3T 3.45T 68 25 274K 1.70M > storage 45.3T 3.45T 65 0 260K 0 > storage 45.3T 3.45T 64 0 260K 0 > storage 45.3T 3.45T 67 0 272K 0 > storage 45.3T 3.45T 66 0 266K 0 > storage 45.3T 3.45T 64 0 258K 0 > storage 45.3T 3.45T 62 25 250K 1.70M > storage 45.3T 3.45T 57 0 231K 0 > storage 45.3T 3.45T 58 0 235K 0 > storage 45.3T 3.45T 66 0 267K 0 > storage 45.3T 3.45T 64 0 257K 0 > storage 45.3T 3.45T 60 0 241K 0 > storage 45.3T 3.45T 50 0 203K 0 > storage 45.3T 3.45T 52 25 209K 1.70M > storage 45.3T 3.45T 54 0 217K 0 > storage 45.3T 3.45T 51 0 205K 0 > storage 45.3T 3.45T 54 0 216K 0 > storage 45.3T 3.45T 55 0 222K 0 > storage 45.3T 3.45T 56 0 226K 0 > storage 45.3T 3.45T 65 0 264K 0 > storage 45.3T 3.45T 71 0 286K 0 > > The write peaks are from processes syncing data to the pool - in this > state it does not do reads (the data the sync process deals with is > already in ARC). > Then it goes into writing back to the pool (perhaps DDT metadata) > > storage 45.3T 3.45T 17 24.4K 69.6K 97.5M > storage 45.3T 3.45T 0 19.6K 0 78.5M > storage 45.3T 3.45T 0 14.2K 0 56.8M > storage 45.3T 3.45T 0 7.90K 0 31.6M > storage 45.3T 3.45T 0 7.81K 0 32.8M > storage 45.3T 3.45T 0 9.54K 0 38.2M > storage 45.3T 3.45T 0 7.07K 0 28.3M > storage 45.3T 3.45T 0 7.70K 0 30.8M > storage 45.3T 3.45T 0 6.19K 0 24.8M > storage 45.3T 3.45T 0 5.45K 0 21.8M > storage 45.3T 3.45T 0 5.78K 0 24.7M > storage 45.3T 3.45T 0 5.29K 0 21.2M > storage 45.3T 3.45T 0 5.69K 0 22.8M > storage 45.3T 3.45T 0 5.52K 0 22.1M > storage 45.3T 3.45T 0 3.26K 0 13.1M > storage 45.3T 3.45T 0 1.77K 0 7.10M > storage 45.3T 3.45T 0 1.63K 0 8.14M > storage 45.3T 3.45T 0 1.41K 0 5.64M > storage 45.3T 3.45T 0 1.22K 0 4.88M > storage 45.3T 3.45T 0 1.27K 0 5.09M > storage 45.3T 3.45T 0 1.06K 0 4.26M > storage 45.3T 3.45T 0 1.07K 0 4.30M > storage 45.3T 3.45T 0 979 0 3.83M > storage 45.3T 3.45T 0 1002 0 3.91M > storage 45.3T 3.45T 0 1010 0 3.95M > storage 45.3T 3.45T 0 948 2.40K 3.71M > storage 45.3T 3.45T 0 939 0 3.67M > storage 45.3T 3.45T 0 1023 0 7.10M > storage 45.3T 3.45T 0 1.01K 4.80K 4.04M > storage 45.3T 3.45T 0 822 0 3.22M > storage 45.3T 3.45T 0 434 0 1.70M > storage 45.3T 3.45T 0 398 2.40K 1.56M > > For quite some time, there are no reads from the pool. 
When that > happens, gstat (gstat -f 'da[0-9]*$') displays something like this: > > > dT: 1.001s w: 1.000s filter: da[0-9]*$ > L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name > 24 1338 0 0 0.0 1338 12224 17.8 100.9| da0 > 24 6888 0 0 0.0 6888 60720 3.5 100.0| da1 > 24 6464 0 0 0.0 6464 71997 3.7 100.0| da2 > 24 6117 0 0 0.0 6117 82386 3.9 99.9| da3 > 24 6455 0 0 0.0 6455 66822 3.7 100.0| da4 > 24 6782 0 0 0.0 6782 69207 3.5 100.0| da5 > 24 698 0 0 0.0 698 27533 34.1 99.6| da6 > 24 590 0 0 0.0 590 21627 40.9 99.7| da7 > 24 561 0 0 0.0 561 21031 42.8 100.2| da8 > 24 724 0 0 0.0 724 25583 33.1 99.9| da9 > 24 567 0 0 0.0 567 22965 41.4 98.0| da10 > 24 566 0 0 0.0 566 21834 42.4 99.9| da11 > 24 586 0 0 0.0 586 4899 43.5 100.2| da12 > 24 487 0 0 0.0 487 4008 49.3 100.9| da13 > 24 628 0 0 0.0 628 5007 38.9 100.2| da14 > 24 714 0 0 0.0 714 5706 33.8 99.9| da15 > 24 595 0 0 0.0 595 4831 39.8 99.8| da16 > 24 485 0 0 0.0 485 3932 49.2 100.1| da17 > 0 0 0 0 0.0 0 0 0.0 0.0| da18 > 0 0 0 0 0.0 0 0 0.0 0.0| da19 > 0 0 0 0 0.0 0 0 0.0 0.0| da20 > 0 0 0 0 0.0 0 0 0.0 0.0| ada0 > 0 0 0 0 0.0 0 0 0.0 0.0| ada1 > > > (drives da8 and 19 are spares, da20 is the L2ARC SSD drive, ada0 and > ada0 are the boot SSDs in separate zpool) > Now, here comes the weird part. the gpart display would show intensive > writes to all vdevs (da0-da5, da6-da11,da12-da17) then one of the vdevs > would complete writing, and stop writing, while other vdevs continue, at > the end only one vdev writes until as it seems, data is completely > written to all vdevs (this can be observed in the zfs iostat output > above with the decreasing write IOPS each 10 seconds), then there is a > few seconds "do nothing" period and then we are back to small reads. > > The other observation I have is with the first vdev: the 512b drives do > a lot of I/O fast, complete first and then sit idle, while da0 continues > to write for many more seconds. They consistently show many more IOPS > than the other drives for this type of activity -- on streaming writes > all drives behave more or less the same. It is only on this un-dedup > scenario where the difference is so much pronounced. > > All the vdevs in the pool are with ashift=12 so the theory that ZFS > actually issues 512b writes to these drives can't be true, can it? > > Another worry is this Seagate Barracuda 7200.14 (AF) > (ST3000DM001-1CH166, firmware CC24) drive. It seems constantly > under-performing. Does anyone know if it is so different from the > ST3000DM001-9YN166 drives? Might be, I should just replace it? > > My concern is the bursty and irregular nature of writing to vdevs. As it > is now, an write operation to the pool needs to wait for all of the vdev > writes to complete which is this case takes tens of seconds. A single > drive in an vdev that underperforms will slow down the entire pool. > Perhaps ZFS could prioritize vdev usage based on the vdev troughput, > similar to how it prioritizes writes based on how much it is full. > > Also, what is ZFS doing during the idle periods? Are there some timeouts > involved? It is certainly not using any CPU... The small random I/O is > certainly not loading the disks. > > Then, I have 240GB L2ARC and secondarycache=metadata for the pool. Yet, > the DDT apparently does not want to go there... Is there a way to > "force" it to be loaded to L2ARC? 
Before the last big delete, I had > > zdb -D storage > DDT-sha256-zap-duplicate: 19907778 entries, size 1603 on disk, 259 in core > DDT-sha256-zap-unique: 30101659 entries, size 1428 on disk, 230 in core > > dedup = 1.98, compress = 1.00, copies = 1.03, dedup * compress / copies > = 1.92 > > With time, the in core values stay more or less the same. > > I also discovered, that the L2ARC drive apparently is not subject to > TRIM for some reason. TRIM works on the boot drives, but these are > connected to the motherboard SATA ports). > > # sysctl kern.cam.da.20 > kern.cam.da.20.delete_method: ATA_TRIM > kern.cam.da.20.minimum_cmd_size: 6 > kern.cam.da.20.sort_io_queue: 0 > kern.cam.da.20.error_inject: 0 > > # sysctl -a | grep trim > vfs.zfs.vdev.trim_on_init: 1 > vfs.zfs.vdev.trim_max_pending: 64 > vfs.zfs.vdev.trim_max_bytes: 2147483648 > vfs.zfs.trim.enabled: 1 > vfs.zfs.trim.max_interval: 1 > vfs.zfs.trim.timeout: 30 > vfs.zfs.trim.txg_delay: 32 > kstat.zfs.misc.zio_trim.bytes: 139489971200 > kstat.zfs.misc.zio_trim.success: 628351 > kstat.zfs.misc.zio_trim.unsupported: 622819 > kstat.zfs.misc.zio_trim.failed: 0 > > Yet, I don't observe any BIO_DELETE activity to this drive with gstat -d > > Wasn't TRIM supposed to work on drives attached to LSI2008 in 9-stable? > > Daniel > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. 
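(On the TRIM question quoted above: one way to narrow it down, sketched here as a guess rather than a verified recipe, is to confirm that the L2ARC device itself advertises TRIM and then watch whether delete requests ever reach it. da20 is the L2ARC device from the listing earlier; adjust the name if yours differs.)

# does the drive report DSM/TRIM support through the SAT layer?
camcontrol identify da20 | grep -i trim
# which delete method did CAM choose for it?
sysctl kern.cam.da.20.delete_method
# are ZFS-issued TRIMs being counted as unsupported? watch the counters over time
sysctl kstat.zfs.misc.zio_trim
# and watch the d/s (BIO_DELETE) column for just this device
gstat -d -f 'da20$'

If kstat.zfs.misc.zio_trim.unsupported keeps growing while gstat -d never shows deletes on da20, the TRIMs are apparently being refused somewhere below ZFS (device or driver path) rather than not being generated at all.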
From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 12:16:44 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 61619E15 for ; Tue, 16 Jul 2013 12:16:44 +0000 (UTC) (envelope-from daniel@digsys.bg) Received: from smtp-sofia.digsys.bg (smtp-sofia.digsys.bg [193.68.21.123]) by mx1.freebsd.org (Postfix) with ESMTP id D481E1FA for ; Tue, 16 Jul 2013 12:16:43 +0000 (UTC) Received: from dcave.digsys.bg (dcave.digsys.bg [193.68.6.1]) (authenticated bits=0) by smtp-sofia.digsys.bg (8.14.6/8.14.6) with ESMTP id r6GCGcZ4023385 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO) for ; Tue, 16 Jul 2013 15:16:39 +0300 (EEST) (envelope-from daniel@digsys.bg) Message-ID: <51E539A6.9090109@digsys.bg> Date: Tue, 16 Jul 2013 15:16:38 +0300 From: Daniel Kalchev User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130627 Thunderbird/17.0.7 MIME-Version: 1.0 To: freebsd-fs Subject: Re: ZFS vdev I/O questions References: <51E5316B.9070201@digsys.bg> <3472068604314C9887FE3BD4CD42B7C8@multiplay.co.uk> In-Reply-To: <3472068604314C9887FE3BD4CD42B7C8@multiplay.co.uk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 12:16:44 -0000 On 16.07.13 14:53, Mark Felder wrote: >> The other observation I have is with the first vdev: the 512b drives do >> a lot of I/O fast, complete first and then sit idle, while da0 continues >> to write for many more seconds. They consistently show many more IOPS >> than the other drives for this type of activity -- on streaming writes >> all drives behave more or less the same. It is only on this un-dedup >> scenario where the difference is so much pronounced. >> >> All the vdevs in the pool are with ashift=12 so the theory that ZFS >> actually issues 512b writes to these drives can't be true, can it? >> >> Another worry is this Seagate Barracuda 7200.14 (AF) >> (ST3000DM001-1CH166, firmware CC24) drive. It seems constantly >> under-performing. Does anyone know if it is so different from the >> ST3000DM001-9YN166 drives? Might be, I should just replace it? >> > A lot of information here. > > Those Hitachis are great drives. The addition of the Barracuda with > different performance characteristics could be part of the problem. I'm > glad you pointed out that the pool ashift=12 so we can try to rule that > out. I'd be quite interested in knowing if some or perhaps even all of > your issues go away simply by replacing that drive with another Hitachi. I don't have any more of these available and the vendor unfortunately could not supply more (but I am considering looking elsewhere) At the moment, I could only replace that Barracuda with one of the spare drives, which are Seagate SV35 (ST3000VX000-9YW166, firmware CV13) I have observed that these drives are not particularly good at random I/O however.... On 16.07.13 14:58, Steven Hartland wrote: > One thing to check with this is to add -d to your gstat to see > if your waiting deletes IO? I doubt it but worth checking. Here is with gstat -d. 
Actually, running this script makes more sense: while true ;do gstat -f 'da[0-9]*$' -d -b sleep 1 done (small random reads, heavy writes, then some reads) dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 11 1 4 20.8 10 44 0.4 0 0 0.0 2.2 da0 0 15 5 20 7.1 10 40 5.8 0 0 0.0 5.1 da1 0 7 0 0 0.0 7 28 0.4 0 0 0.0 0.2 da2 1 7 0 0 0.0 7 28 4.4 0 0 0.0 1.1 da3 1 11 0 0 0.0 11 44 0.5 0 0 0.0 0.2 da4 0 11 0 0 0.0 11 44 0.4 0 0 0.0 0.2 da5 0 25 16 68 7.8 9 36 0.4 0 0 0.0 5.4 da6 0 21 12 52 14.2 9 36 0.3 0 0 0.0 6.0 da7 0 9 0 0 0.0 9 36 0.3 0 0 0.0 0.1 da8 0 35 23 176 12.5 12 48 0.4 0 0 0.0 9.4 da9 0 26 17 140 13.7 9 36 0.3 0 0 0.0 7.3 da10 0 8 0 0 0.0 8 32 0.3 0 0 0.0 0.1 da11 0 158 29 212 17.4 129 4083 81.9 0 0 0.0 57.8 da12 0 162 25 192 16.7 137 4423 84.8 0 0 0.0 61.9 da13 0 118 0 0 0.0 118 3844 95.5 0 0 0.0 45.6 da14 5 152 20 112 10.8 132 4127 66.9 0 0 0.0 45.8 da15 0 171 14 88 17.4 157 4990 89.2 0 0 0.0 69.6 da16 0 124 1 4 6.6 123 3840 71.5 0 0 0.0 37.6 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 4558 4557 6736 0.5 1 1 0.2 0 0 0.0 14.7 da20 0 120 0 0 0.0 119 8083 35.2 0 0 0.0 17.4 ada0 0 120 0 0 0.0 119 8083 34.0 0 0 0.0 16.8 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 1 1 4 14.0 0 0 0.0 0 0 0.0 1.4 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 2 2 8 18.8 0 0 0.0 0 0 0.0 3.7 da2 0 1 1 4 35.1 0 0 0.0 0 0 0.0 3.5 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 1 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 0 3 3 12 12.8 0 0 0.0 0 0 0.0 3.8 da6 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da7 0 3 3 12 14.3 0 0 0.0 0 0 0.0 4.3 da8 0 1 1 4 9.8 0 0 0.0 0 0 0.0 1.0 da9 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da10 0 7 7 28 9.8 0 0 0.0 0 0 0.0 6.8 da11 0 1 1 4 7.4 0 0 0.0 0 0 0.0 0.7 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 3 3 12 34.0 0 0 0.0 0 0 0.0 10.2 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 2656 236 455 0.2 2420 12284 0.3 0 0 0.0 21.3 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 4 4 16 22.8 0 0 0.0 0 0 0.0 9.1 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 1 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 1 1 4 17.4 0 0 0.0 0 0 0.0 1.7 da3 0 3 3 12 13.2 0 0 0.0 0 0 0.0 4.0 da4 0 2 2 8 28.2 0 0 0.0 0 0 0.0 5.6 da5 0 1 1 4 9.7 0 0 0.0 0 0 0.0 1.0 da6 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da7 0 3 3 12 8.8 0 0 0.0 0 0 0.0 2.6 da8 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da9 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da10 0 5 5 20 10.1 0 0 0.0 0 0 0.0 5.0 da11 0 4 4 16 10.4 0 0 0.0 0 0 0.0 4.1 da12 0 1 1 4 16.0 0 0 0.0 0 0 0.0 1.6 da13 0 1 1 4 13.8 0 0 0.0 0 0 0.0 1.4 da14 0 4 4 16 7.3 0 0 0.0 0 0 0.0 2.9 da15 0 1 1 4 13.1 0 0 0.0 0 0 0.0 1.3 da16 0 1 1 4 21.3 0 0 0.0 0 0 0.0 2.1 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 515 515 979 0.2 0 0 0.0 0 0 0.0 9.2 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 1 1 4 32.3 0 0 0.0 0 0 0.0 3.2 da0 0 1 1 4 35.0 0 0 0.0 0 0 0.0 3.5 da1 0 4 4 16 4.3 0 0 0.0 0 0 0.0 1.7 da2 0 3 3 12 31.8 0 0 0.0 0 0 0.0 9.5 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 4 4 16 6.9 0 0 0.0 0 0 0.0 2.8 da5 0 2 2 8 13.6 0 0 0.0 0 0 0.0 2.7 da6 0 0 0 0 0.0 0 0 0.0 
0 0 0.0 0.0 da7 0 4 4 16 7.4 0 0 0.0 0 0 0.0 3.0 da8 0 1 1 4 12.6 0 0 0.0 0 0 0.0 1.3 da9 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da10 0 6 6 24 10.0 0 0 0.0 0 0 0.0 6.0 da11 0 1 1 4 21.6 0 0 0.0 0 0 0.0 2.2 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 1 1 1 4 10.8 0 0 0.0 0 0 0.0 1.1 da14 0 1 1 4 18.3 0 0 0.0 0 0 0.0 1.8 da15 0 1 1 4 9.9 0 0 0.0 0 0 0.0 1.0 da16 0 1 1 4 21.6 0 0 0.0 0 0 0.0 2.2 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 588 587 1130 0.2 1 2 0.6 0 0 0.0 10.5 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 23 1412 0 0 0.0 1412 24423 15.0 0 0 0.0 99.6 da0 23 4699 0 0 0.0 4699 72429 4.1 0 0 0.0 97.2 da1 24 4294 0 0 0.0 4294 67749 4.6 0 0 0.0 99.0 da2 23 4923 0 0 0.0 4923 75746 3.8 0 0 0.0 96.2 da3 24 4452 0 0 0.0 4452 67729 4.3 0 0 0.0 97.5 da4 23 4558 0 0 0.0 4558 68992 4.2 0 0 0.0 97.3 da5 24 813 0 0 0.0 813 10675 27.2 0 0 0.0 99.6 da6 24 866 0 0 0.0 866 11338 25.5 0 0 0.0 99.5 da7 24 845 0 0 0.0 845 11058 25.8 0 0 0.0 99.5 da8 24 866 0 0 0.0 866 11174 25.3 0 0 0.0 98.8 da9 24 928 0 0 0.0 928 12061 23.6 0 0 0.0 99.2 da10 24 864 0 0 0.0 864 11150 25.6 0 0 0.0 99.2 da11 24 674 0 0 0.0 674 14587 32.4 0 0 0.0 99.9 da12 24 705 0 0 0.0 705 15043 31.7 0 0 0.0 100.1 da13 24 670 0 0 0.0 670 14527 33.0 0 0 0.0 99.5 da14 24 648 0 0 0.0 648 13360 34.3 0 0 0.0 99.5 da15 24 716 0 0 0.0 716 14835 30.9 0 0 0.0 99.5 da16 24 730 0 0 0.0 730 15227 30.0 0 0 0.0 98.7 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 24 1169 0 0 0.0 1169 15954 20.6 0 0 0.0 100.3 da0 0 1811 0 0 0.0 1811 20961 3.9 0 0 0.0 29.6 da1 0 2392 0 0 0.0 2392 27444 3.7 0 0 0.0 37.1 da2 0 1199 0 0 0.0 1199 14571 3.7 0 0 0.0 18.5 da3 0 2305 0 0 0.0 2305 26924 4.0 0 0 0.0 38.1 da4 0 1820 0 0 0.0 1820 21573 3.8 0 0 0.0 28.8 da5 24 744 0 0 0.0 744 9248 32.6 0 0 0.0 99.5 da6 24 771 0 0 0.0 771 10722 31.1 0 0 0.0 100.0 da7 24 716 0 0 0.0 716 9268 33.5 0 0 0.0 100.2 da8 24 754 0 0 0.0 754 9280 31.8 0 0 0.0 99.9 da9 24 769 0 0 0.0 769 9384 31.4 0 0 0.0 99.9 da10 24 742 0 0 0.0 742 9508 32.1 0 0 0.0 98.5 da11 24 665 0 0 0.0 665 9967 36.5 0 0 0.0 99.3 da12 24 740 0 0 0.0 740 10958 32.2 0 0 0.0 98.7 da13 24 639 0 0 0.0 639 9619 37.4 0 0 0.0 99.7 da14 24 695 0 0 0.0 695 10367 35.1 0 0 0.0 99.8 da15 24 583 0 0 0.0 583 8429 41.5 0 0 0.0 100.2 da16 24 697 0 0 0.0 697 10287 34.3 0 0 0.0 100.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 24 895 0 0 0.0 895 14773 26.9 0 0 0.0 101.6 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 702 0 0 0.0 702 9327 33.8 0 0 0.0 97.6 da6 24 675 0 0 0.0 675 9191 35.3 0 0 0.0 100.1 da7 24 668 0 0 0.0 668 8575 36.1 0 0 0.0 100.7 da8 24 744 0 0 0.0 744 9179 31.4 0 0 0.0 97.1 da9 24 750 0 0 0.0 750 9003 32.3 0 0 0.0 100.1 da10 24 586 0 0 0.0 586 6909 40.9 0 0 0.0 100.0 da11 24 711 0 0 0.0 711 13990 33.6 0 0 0.0 99.4 da12 24 653 0 0 0.0 
653 10074 36.5 0 0 0.0 100.0 da13 24 645 0 0 0.0 645 13111 37.0 0 0 0.0 100.0 da14 24 687 0 0 0.0 687 12795 34.8 0 0 0.0 99.9 da15 24 678 0 0 0.0 678 12855 35.6 0 0 0.0 100.6 da16 24 696 0 0 0.0 696 13091 34.4 0 0 0.0 99.2 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 24 1177 0 0 0.0 1177 18586 20.6 0 0 0.0 100.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 780 0 0 0.0 780 11733 30.6 0 0 0.0 99.3 da6 23 776 0 0 0.0 776 12636 31.1 0 0 0.0 100.2 da7 24 636 0 0 0.0 636 9387 37.8 0 0 0.0 99.7 da8 24 778 0 0 0.0 778 13343 31.2 0 0 0.0 100.6 da9 24 821 0 0 0.0 821 14202 29.3 0 0 0.0 101.0 da10 24 786 0 0 0.0 786 12024 30.6 0 0 0.0 100.0 da11 24 666 0 0 0.0 666 9351 35.9 0 0 0.0 100.0 da12 24 720 0 0 0.0 720 9954 33.1 0 0 0.0 99.7 da13 24 706 0 0 0.0 706 10014 34.0 0 0 0.0 99.8 da14 24 801 0 0 0.0 801 11105 30.3 0 0 0.0 101.1 da15 24 738 0 0 0.0 738 10126 32.5 0 0 0.0 99.9 da16 24 670 0 0 0.0 670 9203 36.5 0 0 0.0 101.4 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 23 1650 0 0 0.0 1650 18674 14.4 0 0 0.0 100.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 669 0 0 0.0 669 13234 35.9 0 0 0.0 100.1 da6 24 573 0 0 0.0 573 12223 42.4 0 0 0.0 99.7 da7 23 519 0 0 0.0 519 10584 46.3 0 0 0.0 100.1 da8 24 659 0 0 0.0 659 15552 36.9 0 0 0.0 100.5 da9 24 687 0 0 0.0 687 18410 35.1 0 0 0.0 99.3 da10 23 671 0 0 0.0 671 13190 36.0 0 0 0.0 100.2 da11 24 807 0 0 0.0 807 9425 29.7 0 0 0.0 99.9 da12 24 665 0 0 0.0 665 7902 36.4 0 0 0.0 100.3 da13 24 861 0 0 0.0 861 10140 28.0 0 0 0.0 99.6 da14 24 714 0 0 0.0 714 8797 33.4 0 0 0.0 99.7 da15 24 753 0 0 0.0 753 9225 32.5 0 0 0.0 100.0 da16 24 760 0 0 0.0 760 9337 31.5 0 0 0.0 100.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 648 0 0 0.0 648 9954 37.0 0 0 0.0 100.0 da6 24 664 0 0 0.0 664 10194 36.1 0 0 0.0 100.0 da7 24 653 0 0 0.0 653 13163 36.7 0 0 0.0 99.9 da8 24 749 0 0 0.0 749 12716 31.8 0 0 0.0 100.4 da9 24 769 0 0 0.0 769 13239 31.4 0 0 0.0 100.3 da10 24 742 0 0 0.0 742 12017 32.4 0 0 0.0 100.1 da11 24 584 0 0 0.0 584 7393 40.8 0 0 0.0 101.1 da12 24 648 0 0 0.0 648 8336 37.7 0 0 0.0 99.7 da13 24 679 0 0 0.0 679 8756 35.3 0 0 0.0 100.8 da14 24 646 0 0 0.0 646 8160 37.0 0 0 0.0 100.5 da15 24 692 0 0 0.0 692 8900 35.5 0 0 0.0 99.9 da16 24 684 0 0 0.0 684 8692 35.6 0 0 0.0 101.7 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 
0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 662 0 0 0.0 662 14641 35.8 0 0 0.0 100.0 da6 24 705 0 0 0.0 705 16535 34.2 0 0 0.0 99.1 da7 24 644 0 0 0.0 644 10120 37.4 0 0 0.0 101.8 da8 24 560 0 0 0.0 560 14721 43.4 0 0 0.0 99.5 da9 24 677 0 0 0.0 677 20864 35.2 0 0 0.0 100.0 da10 24 685 0 0 0.0 685 19009 35.6 0 0 0.0 100.3 da11 24 714 0 0 0.0 714 10328 34.0 0 0 0.0 101.7 da12 24 738 0 0 0.0 738 10476 32.2 0 0 0.0 100.0 da13 24 587 0 0 0.0 587 8465 40.6 0 0 0.0 100.1 da14 24 720 0 0 0.0 720 9888 33.8 0 0 0.0 100.1 da15 24 603 0 0 0.0 603 8258 39.1 0 0 0.0 100.1 da16 24 623 0 0 0.0 623 9025 38.3 0 0 0.0 99.9 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 0 76 0 0 0.0 76 1702 39.6 0 0 0.0 10.9 da6 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da7 0 298 0 0 0.0 298 8704 33.8 0 0 0.0 40.5 da8 0 112 0 0 0.0 112 3065 48.0 0 0 0.0 22.1 da9 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da10 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da11 24 645 0 0 0.0 645 9942 37.3 0 0 0.0 100.9 da12 24 581 0 0 0.0 581 8660 41.1 0 0 0.0 99.8 da13 24 717 0 0 0.0 717 10910 33.6 0 0 0.0 101.4 da14 24 737 0 0 0.0 737 11737 32.6 0 0 0.0 99.9 da15 24 556 0 0 0.0 556 8796 42.3 0 0 0.0 99.5 da16 24 538 0 0 0.0 538 8444 44.1 0 0 0.0 100.1 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da6 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da7 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da8 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da9 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da10 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da11 0 642 0 0 0.0 642 8875 37.2 0 0 0.0 100.0 da12 24 672 0 0 0.0 672 9323 36.0 0 0 0.0 100.0 da13 24 765 0 0 0.0 765 10602 31.5 0 0 0.0 101.6 da14 0 684 0 0 0.0 684 9902 31.0 0 0 0.0 87.1 da15 23 683 0 0 0.0 683 10210 35.6 0 0 0.0 100.4 da16 24 654 0 0 0.0 654 9603 37.1 0 0 0.0 99.9 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.000s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 114 109 3811 1.9 4 16 0.5 0 0 0.0 12.6 da0 0 349 343 12129 0.7 4 16 0.5 0 0 0.0 24.5 da1 0 232 226 8322 1.0 4 16 0.5 0 0 0.0 23.6 da2 0 230 224 8338 2.7 4 16 0.5 0 0 0.0 28.0 da3 0 349 343 12129 0.7 4 
16 0.5 0 0 0.0 23.9 da4 0 122 116 3775 1.1 4 16 0.5 0 0 0.0 21.5 da5 0 5 0 0 0.0 4 16 0.4 0 0 0.0 4.8 da6 0 5 0 0 0.0 4 16 0.3 0 0 0.0 4.2 da7 0 356 351 12577 0.8 4 16 0.4 0 0 0.0 14.4 da8 0 349 344 12573 0.7 4 16 0.3 0 0 0.0 15.5 da9 0 359 354 12573 0.6 4 16 0.4 0 0 0.0 9.5 da10 0 360 355 12589 1.0 4 16 0.3 0 0 0.0 14.0 da11 0 6 0 0 0.0 4 16 0.3 0 0 0.0 16.9 da12 0 5 0 0 0.0 4 16 0.3 0 0 0.0 3.6 da13 0 5 0 0 0.0 4 16 0.3 0 0 0.0 5.3 da14 0 5 0 0 0.0 4 16 0.3 0 0 0.0 3.8 da15 0 5 0 0 0.0 4 16 0.3 0 0 0.0 4.9 da16 0 9 4 16 15.7 4 16 0.4 0 0 0.0 5.8 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 1 0 0 0.0 0 0 0.0 1 3839 0.4 0.0 ada0 0 2 0 0 0.0 0 0 0.0 2 3839 0.5 0.0 ada1 ^C There were some heavy writes to da20 (the L2ARC) but no delete requests. While deletes on ada0 and ada1 do happen as expected. The L2ARC should delete, as it is nearly full. zpool iostat -v storage 10 capacity operations bandwidth pool alloc free read write read write -------------- ----- ----- ----- ----- ----- ----- storage 44.8T 3.95T 77 9.34K 5.03M 43.1M raidz2 15.1T 1.13T 27 3.09K 2.37M 14.1M gpt/disk00 - - 8 485 303K 6.66M gpt/disk01 - - 16 477 603K 6.56M gpt/disk02 - - 9 476 320K 6.56M gpt/disk03 - - 8 476 322K 6.56M gpt/disk04 - - 16 477 603K 6.56M gpt/disk05 - - 8 476 301K 6.55M raidz2 14.7T 1.53T 32 3.13K 2.57M 14.5M gpt/disk06 - - 14 444 502K 6.74M gpt/disk07 - - 11 444 385K 6.74M gpt/disk08 - - 14 444 552K 6.74M gpt/disk09 - - 15 445 582K 6.73M gpt/disk10 - - 7 444 280K 6.72M gpt/disk11 - - 9 444 369K 6.72M raidz2 15.0T 1.29T 17 3.10K 87.0K 14.1M gpt/disk12 - - 4 434 40.8K 6.65M gpt/disk13 - - 2 434 16.3K 6.65M gpt/disk14 - - 0 433 8.75K 6.62M gpt/disk15 - - 4 435 40.6K 6.66M gpt/disk16 - - 2 433 15.3K 6.61M gpt/disk17 - - 0 434 8.54K 6.65M logs - - - - - - ada0p2 1.01M 7.94G 0 2 0 185K ada1p2 1.01M 7.94G 0 2 0 185K cache - - - - - - da20p2 215G 80M 439 674 703K 2.84M -------------- ----- ----- ----- ----- ----- ----- Daniel From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 13:16:15 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 0263E989 for ; Tue, 16 Jul 2013 13:16:15 +0000 (UTC) (envelope-from daniel@digsys.bg) Received: from smtp-sofia.digsys.bg (smtp-sofia.digsys.bg [193.68.21.123]) by mx1.freebsd.org (Postfix) with ESMTP id 75FF8670 for ; Tue, 16 Jul 2013 13:16:14 +0000 (UTC) Received: from dcave.digsys.bg (dcave.digsys.bg [193.68.6.1]) (authenticated bits=0) by smtp-sofia.digsys.bg (8.14.6/8.14.6) with ESMTP id r6GDG9rV061441 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO) for ; Tue, 16 Jul 2013 16:16:10 +0300 (EEST) (envelope-from daniel@digsys.bg) Message-ID: <51E54799.8070700@digsys.bg> Date: Tue, 16 Jul 2013 16:16:09 +0300 From: Daniel Kalchev User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130627 Thunderbird/17.0.7 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: ZFS vdev I/O questions References: <51E5316B.9070201@digsys.bg> <20130716115305.GA40918@mwi1.coffeenet.org> In-Reply-To: <20130716115305.GA40918@mwi1.coffeenet.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 13:16:15 -0000 On 16.07.13 14:53, 
Mark Felder wrote: > On Tue, Jul 16, 2013 at 02:41:31PM +0300, Daniel Kalchev wrote: >> I am observing some "strange" behaviour with I/O spread on ZFS vdevs and >> thought I might ask if someone has observed it too. >> > --SNIP-- > >> Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640, >> firmware MKAOA3B0) -- these are 512 byte sector drives, but da0 has been >> replaced by Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, firmware >> CC24) -- this is an 4k sector drive of a new generation (notice the >> relatively 'old' firmware, that can't be upgraded). > --SNIP-- > >> The other observation I have is with the first vdev: the 512b drives do >> a lot of I/O fast, complete first and then sit idle, while da0 continues >> to write for many more seconds. They consistently show many more IOPS >> than the other drives for this type of activity -- on streaming writes >> all drives behave more or less the same. It is only on this un-dedup >> scenario where the difference is so much pronounced. >> >> All the vdevs in the pool are with ashift=12 so the theory that ZFS >> actually issues 512b writes to these drives can't be true, can it? >> >> Another worry is this Seagate Barracuda 7200.14 (AF) >> (ST3000DM001-1CH166, firmware CC24) drive. It seems constantly >> under-performing. Does anyone know if it is so different from the >> ST3000DM001-9YN166 drives? Might be, I should just replace it? >> > A lot of information here. > > Those Hitachis are great drives. The addition of the Barracuda with > different performance characteristics could be part of the problem. I'm > glad you pointed out that the pool ashift=12 so we can try to rule that > out. I'd be quite interested in knowing if some or perhaps even all of > your issues go away simply by replacing that drive with another Hitachi. > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" I wanted to commend further on this. The Hitachi drives are only in the first vdev (da0-da5) together with that new Seagate Barracuda drive. However, I observe very irregular writing to all the vdevs, not just within the same vdev. 
Here is output of gstat -d with interval 1 second: dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 42 42 600 1.1 0 0 0.0 0 0 0.0 2.5 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 33 33 460 4.3 0 0 0.0 0 0 0.0 3.5 da5 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da6 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da7 0 30 30 656 2.1 0 0 0.0 0 0 0.0 3.0 da8 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da9 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da10 0 34 34 748 1.5 0 0 0.0 0 0 0.0 2.4 da11 0 43 43 1299 1.7 0 0 0.0 0 0 0.0 4.2 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 41 41 1395 1.5 0 0 0.0 0 0 0.0 3.3 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 1081 0 0 0.0 1081 14551 0.7 0 0 0.0 10.3 da20 0 124 0 0 0.0 97 286 0.5 25 273 3.7 1.2 ada0 0 119 0 0 0.0 92 286 0.4 25 273 3.5 1.1 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 24 501 0 0 0.0 501 18421 46.8 0 0 0.0 98.8 da0 24 690 0 0 0.0 690 34208 34.6 0 0 0.0 99.8 da1 24 691 0 0 0.0 691 33317 33.6 0 0 0.0 100.2 da2 24 750 0 0 0.0 750 37752 30.9 0 0 0.0 99.9 da3 24 672 0 0 0.0 672 32694 34.9 0 0 0.0 100.1 da4 24 722 0 0 0.0 722 36178 32.5 0 0 0.0 100.0 da5 24 633 0 0 0.0 633 9046 37.6 0 0 0.0 100.1 da6 24 601 0 0 0.0 601 8727 39.2 0 0 0.0 100.0 da7 24 620 0 0 0.0 620 9198 38.1 0 0 0.0 100.0 da8 24 619 0 0 0.0 619 8915 38.3 0 0 0.0 100.3 da9 24 539 0 0 0.0 539 7692 43.3 0 0 0.0 100.0 da10 24 715 0 0 0.0 715 10221 33.0 0 0 0.0 100.5 da11 24 584 0 0 0.0 584 44525 39.8 0 0 0.0 99.4 da12 24 543 0 0 0.0 543 41081 43.2 0 0 0.0 100.6 da13 24 523 0 0 0.0 523 40641 44.2 0 0 0.0 100.0 da14 24 521 0 0 0.0 521 40509 44.9 0 0 0.0 99.9 da15 24 505 0 0 0.0 505 40206 46.1 0 0 0.0 99.8 da16 24 524 0 0 0.0 524 40677 43.9 0 0 0.0 99.9 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 1082 0 0 0.0 1082 2941 0.2 0 0 0.0 6.5 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.000s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 24 507 0 0 0.0 507 30284 47.7 0 0 0.0 99.9 da0 24 625 0 0 0.0 625 36929 38.7 0 0 0.0 100.6 da1 24 724 0 0 0.0 724 44142 33.2 0 0 0.0 100.5 da2 24 775 0 0 0.0 775 53063 30.6 0 0 0.0 98.2 da3 24 630 0 0 0.0 630 40891 37.8 0 0 0.0 100.2 da4 24 698 0 0 0.0 698 38149 35.4 0 0 0.0 102.5 da5 24 784 0 0 0.0 784 11787 30.7 0 0 0.0 99.9 da6 24 707 0 0 0.0 707 10840 34.3 0 0 0.0 99.1 da7 24 689 0 0 0.0 689 10668 34.9 0 0 0.0 99.6 da8 24 635 0 0 0.0 635 9528 37.8 0 0 0.0 100.1 da9 24 669 0 0 0.0 669 10268 35.6 0 0 0.0 99.7 da10 24 675 0 0 0.0 675 10304 35.2 0 0 0.0 100.3 da11 24 507 0 0 0.0 507 23746 47.4 0 0 0.0 100.0 da12 24 476 0 0 0.0 476 24454 48.9 0 0 0.0 100.0 da13 24 495 0 0 0.0 495 31043 48.2 0 0 0.0 100.8 da14 24 582 0 0 0.0 582 34710 41.3 0 0 0.0 100.1 da15 24 592 0 0 0.0 592 34022 41.1 0 0 0.0 100.4 da16 24 559 0 0 0.0 559 34854 42.5 0 0 0.0 99.6 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.000s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 24 719 0 0 0.0 719 23063 33.0 0 0 0.0 99.2 da0 0 94 0 0 0.0 94 8274 43.6 0 0 0.0 16.0 da1 0 46 
0 0 0.0 46 3839 37.0 0 0 0.0 7.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 135 0 0 0.0 135 8966 39.3 0 0 0.0 21.9 da4 0 11 0 0 0.0 11 896 38.8 0 0 0.0 1.4 da5 24 648 0 0 0.0 648 9070 36.6 0 0 0.0 99.9 da6 24 679 0 0 0.0 679 9750 35.7 0 0 0.0 100.1 da7 24 686 0 0 0.0 686 9922 35.0 0 0 0.0 99.9 da8 24 666 0 0 0.0 666 9654 35.8 0 0 0.0 100.6 da9 24 682 0 0 0.0 682 9450 35.1 0 0 0.0 100.4 da10 24 700 0 0 0.0 700 9346 34.1 0 0 0.0 100.0 da11 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.000s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 428 0 0 0.0 428 4207 55.6 0 0 0.0 100.6 da6 24 447 0 0 0.0 447 4279 52.9 0 0 0.0 100.4 da7 24 432 0 0 0.0 432 4087 55.6 0 0 0.0 100.5 da8 24 524 0 0 0.0 524 6243 45.3 0 0 0.0 99.6 da9 24 554 0 0 0.0 554 6379 43.2 0 0 0.0 100.0 da10 24 439 0 0 0.0 439 4611 54.0 0 0 0.0 97.9 da11 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.000s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 350 0 0 0.0 350 3263 69.9 0 0 0.0 102.4 da6 24 326 0 0 0.0 326 3611 74.2 0 0 0.0 100.1 da7 24 335 0 0 0.0 335 3367 72.4 0 0 0.0 100.0 da8 24 329 0 0 0.0 329 2943 73.5 0 0 0.0 100.3 da9 24 326 0 0 0.0 326 2883 75.1 0 0 0.0 99.8 da10 24 369 0 0 0.0 369 2995 65.2 0 0 0.0 100.3 da11 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 525 0 0 0.0 525 8507 45.3 0 0 0.0 100.1 da6 24 430 0 0 0.0 430 6761 55.6 0 0 0.0 101.7 da7 24 479 0 0 0.0 479 7548 50.1 0 0 0.0 100.6 da8 24 542 0 0 0.0 542 9463 44.3 0 0 0.0 100.2 da9 23 593 0 0 0.0 593 10386 40.6 0 0 0.0 100.1 da10 24 555 0 0 0.0 555 9678 42.9 0 0 0.0 98.0 da11 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 
0 0 0.0 0 0 0.0 0.0 da14 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 566 0 0 0.0 566 9800 41.3 0 0 0.0 98.4 da6 24 526 0 0 0.0 526 12370 46.5 0 0 0.0 99.9 da7 24 577 0 0 0.0 577 13166 41.5 0 0 0.0 99.9 da8 24 538 0 0 0.0 538 11990 44.7 0 0 0.0 99.9 da9 24 631 0 0 0.0 631 12666 37.8 0 0 0.0 99.6 da10 24 650 0 0 0.0 650 12894 36.4 0 0 0.0 101.2 da11 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 365 0 0 0.0 365 3604 66.3 0 0 0.0 100.0 da6 24 361 0 0 0.0 361 3724 65.8 0 0 0.0 100.2 da7 24 363 0 0 0.0 363 3680 65.4 0 0 0.0 99.5 da8 24 342 0 0 0.0 342 3500 69.4 0 0 0.0 100.8 da9 24 355 0 0 0.0 355 3460 70.1 0 0 0.0 101.1 da10 24 373 0 0 0.0 373 3616 65.0 0 0 0.0 99.5 da11 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.000s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 539 0 0 0.0 539 4947 44.2 0 0 0.0 99.8 da6 24 468 0 0 0.0 468 12565 52.0 0 0 0.0 100.2 da7 24 493 0 0 0.0 493 10950 49.5 0 0 0.0 100.0 da8 24 450 0 0 0.0 450 12665 52.8 0 0 0.0 100.2 da9 24 528 0 0 0.0 528 11070 45.7 0 0 0.0 100.0 da10 24 542 0 0 0.0 542 10750 43.9 0 0 0.0 98.0 da11 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 39 0 0 0.0 38 2583 14.6 0 0 0.0 4.9 ada0 0 39 0 0 0.0 38 2583 14.5 0 0 0.0 4.9 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 
0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 24 367 0 0 0.0 367 7972 65.1 0 0 0.0 100.0 da6 24 360 0 0 0.0 360 5055 67.4 0 0 0.0 100.1 da7 24 345 0 0 0.0 345 6233 69.4 0 0 0.0 100.2 da8 24 359 0 0 0.0 359 5191 65.9 0 0 0.0 101.3 da9 24 383 0 0 0.0 383 7452 62.2 0 0 0.0 100.2 da10 24 368 0 0 0.0 368 7528 64.7 0 0 0.0 100.9 da11 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da1 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da4 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da5 0 181 0 0 0.0 181 6534 68.1 0 0 0.0 50.3 da6 0 379 0 0 0.0 379 14071 57.0 0 0 0.0 90.5 da7 0 229 0 0 0.0 229 8680 65.1 0 0 0.0 64.1 da8 24 400 0 0 0.0 400 13871 60.5 0 0 0.0 99.8 da9 0 193 0 0 0.0 193 6874 57.7 0 0 0.0 45.4 da10 0 222 0 0 0.0 222 7753 68.2 0 0 0.0 65.0 da11 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da12 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da13 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da14 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da15 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da16 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 dT: 1.001s w: 1.000s filter: da[0-9]*$ L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name 0 332 202 10093 8.8 130 11620 13.3 0 0 0.0 51.2 da0 0 253 121 4381 2.1 132 11620 10.9 0 0 0.0 20.0 da1 0 280 152 4976 4.6 128 11612 11.2 0 0 0.0 28.9 da2 0 376 247 9705 3.9 129 11612 11.3 0 0 0.0 34.6 da3 0 244 115 5540 5.4 129 11628 11.0 0 0 0.0 27.7 da4 0 268 138 5896 7.6 130 11620 10.9 0 0 0.0 38.7 da5 1 467 273 10988 7.9 194 12627 12.2 0 0 0.0 60.4 da6 0 349 147 6583 8.7 202 12659 9.4 0 0 0.0 49.6 da7 5 368 169 6803 7.0 199 12647 10.7 0 0 0.0 48.7 da8 0 451 253 11020 8.4 198 12651 9.7 0 0 0.0 61.1 da9 0 306 104 4493 6.2 202 12647 10.4 0 0 0.0 31.7 da10 0 350 151 4765 5.6 199 12627 10.4 0 0 0.0 40.3 da11 9 366 258 11652 6.9 108 12455 10.3 0 0 0.0 47.1 da12 0 302 194 8126 5.2 108 12455 13.2 0 0 0.0 36.8 da13 0 292 186 8162 3.1 106 12447 12.9 0 0 0.0 30.0 da14 0 370 264 12627 9.7 106 12447 13.3 0 0 0.0 54.1 da15 0 182 72 3110 10.2 110 12459 9.8 0 0 0.0 28.1 da16 0 171 62 3206 8.5 109 12455 9.8 0 0 0.0 25.5 da17 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da18 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 da19 4 744 216 394 0.4 529 1540 0.1 0 0 0.0 4.0 da20 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada0 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0.0 ada1 As you can see, the initial burst is to all vdevs, saturating drives at 100%. Then vdev 3 completes, then the Hitachi drives of vdev 1 complete with the Seagate drive writing some more and then for few more seconds, only vdev 2 drives are writing. It seems the amount of data is the same, just vdev 2 writes the data slower. However, drives in vdev 2 and vdev 3 are the same. 
They should have the same performance characteristics (and as long as the drives are not 100% saturated, all vdevs complete more or less at the same time). At other times, some other vdev would complete last -- it is never the same vdev that is 'slow'. Could this be DDT/metadata specific issue? Is the DDT/metadata vdev-specific? The pool initially had only two vdevs and after vdev 3 was added, most of the written data had no dedup enabled. Also, the ZIL was added later and initial metadata could be fragmented. But.. why should this affect writing? The zpool is indeed pretty full, but performance should degrade for all vdevs (which are more or less equally full). Daniel From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 13:28:12 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 3CA77AED for ; Tue, 16 Jul 2013 13:28:12 +0000 (UTC) (envelope-from feld@freebsd.org) Received: from out1-smtp.messagingengine.com (out1-smtp.messagingengine.com [66.111.4.25]) by mx1.freebsd.org (Postfix) with ESMTP id 1593F6E3 for ; Tue, 16 Jul 2013 13:28:11 +0000 (UTC) Received: from compute1.internal (compute1.nyi.mail.srv.osa [10.202.2.41]) by gateway1.nyi.mail.srv.osa (Postfix) with ESMTP id AB0A820DD9 for ; Tue, 16 Jul 2013 09:28:10 -0400 (EDT) Received: from web3 ([10.202.2.213]) by compute1.internal (MEProxy); Tue, 16 Jul 2013 09:28:10 -0400 DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d= messagingengine.com; h=message-id:from:to:mime-version :content-transfer-encoding:content-type:in-reply-to:references :subject:date; s=smtpout; bh=mddd0WTAXXbCrTook0FGMu24DSc=; b=n3z nXv31hzgW/bg+2diuacmdfPn4LE+Kw0+2hiqWqlevweTPJN6AQj2Yl/bKz3mHthp 8IiF87O7XpFSqVnf+HSbYl3Fe6wIkKrLl7PRI9bFt9XFTM5D8IumluWZcWO1vb8Z 71s6yVTVCJ7hIZw/d8eJi+isq7XnxoXbTEgXE0Pc= Received: by web3.nyi.mail.srv.osa (Postfix, from userid 99) id 8A324B00003; Tue, 16 Jul 2013 09:28:10 -0400 (EDT) Message-Id: <1373981290.1619.140661256268541.61E5E601@webmail.messagingengine.com> X-Sasl-Enc: JPgyAKDUogswwdNuepDwVrkVBIyjJ64y7196KMhTC34O 1373981290 From: Mark Felder To: freebsd-fs@freebsd.org MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain X-Mailer: MessagingEngine.com Webmail Interface - ajax-bdcdd1cb In-Reply-To: <51E54799.8070700@digsys.bg> References: <51E5316B.9070201@digsys.bg> <20130716115305.GA40918@mwi1.coffeenet.org> <51E54799.8070700@digsys.bg> Subject: Re: ZFS vdev I/O questions Date: Tue, 16 Jul 2013 08:28:10 -0500 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 13:28:12 -0000 On Tue, Jul 16, 2013, at 8:16, Daniel Kalchev wrote: > > Could this be DDT/metadata specific issue? Is the DDT/metadata > vdev-specific? The pool initially had only two vdevs and after vdev 3 > was added, most of the written data had no dedup enabled. Also, the ZIL > was added later and initial metadata could be fragmented. But.. why > should this affect writing? The zpool is indeed pretty full, but > performance should degrade for all vdevs (which are more or less equally > full). > I don't want to put you down the wrong path, but you're right -- the zpool is pretty full, and zfs is known to have issues writing when above ~80%. There's another thread where this was discussed briefly. 
However, you have quite a large pool so I find it hard to believe that your 3.45TB free is so fragmented that zfs is having issues choosing where to write. It's certainly possible, though. Hopefully someone will drop in their 2c as well From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 14:09:42 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A217462D for ; Tue, 16 Jul 2013 14:09:42 +0000 (UTC) (envelope-from Ivailo.Tanusheff@skrill.com) Received: from db9outboundpool.messaging.microsoft.com (mail-db9lp0249.outbound.messaging.microsoft.com [213.199.154.249]) by mx1.freebsd.org (Postfix) with ESMTP id 0743A93E for ; Tue, 16 Jul 2013 14:09:41 +0000 (UTC) Received: from mail214-db9-R.bigfish.com (10.174.16.236) by DB9EHSOBE012.bigfish.com (10.174.14.75) with Microsoft SMTP Server id 14.1.225.22; Tue, 16 Jul 2013 14:09:33 +0000 Received: from mail214-db9 (localhost [127.0.0.1]) by mail214-db9-R.bigfish.com (Postfix) with ESMTP id A6E9F1E01D2; Tue, 16 Jul 2013 14:09:33 +0000 (UTC) X-Forefront-Antispam-Report: CIP:157.56.249.213; KIP:(null); UIP:(null); IPV:NLI; H:AM2PRD0710HT004.eurprd07.prod.outlook.com; RD:none; EFVD:NLI X-SpamScore: -2 X-BigFish: PS-2(zz98dI9371I542I1432Izz1f42h1ee6h1de0h1fdah2073h1202h1e76h1d1ah1d2ah1fc6hzz17326ah8275dhz2fh2a8h668h839h944hd24hf0ah1220h1288h12a5h12a9h12bdh137ah13b6h1441h1504h1537h153bh162dh1631h1758h18e1h1946h19b5h19ceh1ad9h1b0ah1d07h1d0ch1d2eh1d3fh1de9h1dfeh1dffh1e1dh9a9j1155h) Received-SPF: pass (mail214-db9: domain of skrill.com designates 157.56.249.213 as permitted sender) client-ip=157.56.249.213; envelope-from=Ivailo.Tanusheff@skrill.com; helo=AM2PRD0710HT004.eurprd07.prod.outlook.com ; .outlook.com ; X-Forefront-Antispam-Report-Untrusted: SFV:NSPM; SFS:(189002)(199002)(51704005)(24454002)(13464003)(377454003)(49866001)(74706001)(80022001)(54316002)(47736001)(74662001)(81342001)(74366001)(74502001)(76576001)(50986001)(76786001)(56816003)(16406001)(47976001)(74316001)(15202345003)(81542001)(53806001)(77096001)(76796001)(56776001)(51856001)(66066001)(33646001)(63696002)(74876001)(79102001)(54356001)(4396001)(69226001)(83072001)(46102001)(77982001)(31966008)(65816001)(47446002)(76482001)(59766001)(24736002); DIR:OUT; SFP:; SCL:1; SRVR:DB3PR07MB060; H:DB3PR07MB059.eurprd07.prod.outlook.com; CLIP:217.18.249.148; RD:InfoNoRecords; MX:1; A:1; LANG:en; Received: from mail214-db9 (localhost.localdomain [127.0.0.1]) by mail214-db9 (MessageSwitch) id 1373983771484888_2080; Tue, 16 Jul 2013 14:09:31 +0000 (UTC) Received: from DB9EHSMHS025.bigfish.com (unknown [10.174.16.254]) by mail214-db9.bigfish.com (Postfix) with ESMTP id 7190220006E; Tue, 16 Jul 2013 14:09:31 +0000 (UTC) Received: from AM2PRD0710HT004.eurprd07.prod.outlook.com (157.56.249.213) by DB9EHSMHS025.bigfish.com (10.174.14.35) with Microsoft SMTP Server (TLS) id 14.16.227.3; Tue, 16 Jul 2013 14:09:30 +0000 Received: from DB3PR07MB060.eurprd07.prod.outlook.com (10.242.137.151) by AM2PRD0710HT004.eurprd07.prod.outlook.com (10.255.165.39) with Microsoft SMTP Server (TLS) id 14.16.329.3; Tue, 16 Jul 2013 14:09:24 +0000 Received: from DB3PR07MB059.eurprd07.prod.outlook.com (10.242.137.149) by DB3PR07MB060.eurprd07.prod.outlook.com (10.242.137.151) with Microsoft SMTP Server (TLS) id 15.0.731.16; Tue, 16 Jul 2013 14:09:23 +0000 Received: from DB3PR07MB059.eurprd07.prod.outlook.com ([169.254.2.117]) by DB3PR07MB059.eurprd07.prod.outlook.com ([169.254.2.117]) with mapi id 
15.00.0731.000; Tue, 16 Jul 2013 14:09:23 +0000 From: Ivailo Tanusheff To: Daniel Kalchev , "freebsd-fs@freebsd.org" Subject: RE: ZFS vdev I/O questions Thread-Topic: ZFS vdev I/O questions Thread-Index: AQHOghmmLmdfZD/4b0i3hLA3+rJ4o5lnMdyAgAAXNoCAAA4g4A== Date: Tue, 16 Jul 2013 14:09:23 +0000 Message-ID: <9d3cf0be165d4351acc5e757de3868ec@DB3PR07MB059.eurprd07.prod.outlook.com> References: <51E5316B.9070201@digsys.bg> <20130716115305.GA40918@mwi1.coffeenet.org> <51E54799.8070700@digsys.bg> In-Reply-To: <51E54799.8070700@digsys.bg> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [217.18.249.148] x-forefront-prvs: 09090B6B69 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: skrill.com X-FOPE-CONNECTOR: Id%0$Dn%*$RO%0$TLS%0$FQDN%$TlsDn% X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 14:09:42 -0000 Hi danbo :) Isn't this some kind of pool fragmentation? Because this is usually the case in such slow parts of the disk systems. I think your pool is getting full and it is heavily fragmented, that's why you have more data for each request on a different vdev. But this has nothing to do with the single, slow device :( Best regards, Ivailo Tanusheff -----Original Message----- From: owner-freebsd-fs@freebsd.org [mailto:owner-freebsd-fs@freebsd.org] On Behalf Of Daniel Kalchev Sent: Tuesday, July 16, 2013 4:16 PM To: freebsd-fs@freebsd.org Subject: Re: ZFS vdev I/O questions On 16.07.13 14:53, Mark Felder wrote: > On Tue, Jul 16, 2013 at 02:41:31PM +0300, Daniel Kalchev wrote: >> I am observing some "strange" behaviour with I/O spread on ZFS vdevs >> and thought I might ask if someone has observed it too. >> > --SNIP-- > >> Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640, >> firmware MKAOA3B0) -- these are 512 byte sector drives, but da0 has >> been replaced by Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, >> firmware >> CC24) -- this is an 4k sector drive of a new generation (notice the >> relatively 'old' firmware, that can't be upgraded). > --SNIP-- > As you can see, the initial burst is to all vdevs, saturating drives at 100%. Then vdev 3 completes, then the Hitachi drives of vdev 1 complete with the Seagate drive writing some more and then for few more seconds, only vdev 2 drives are writing. It seems the amount of data is the same, just vdev 2 writes the data slower. However, drives in vdev 2 and vdev 3 are the same. They should have the same performance characteristics (and as long as the drives are not 100% saturated, all vdevs complete more or less at the same time). At other times, some other vdev would complete last -- it is never the same vdev that is 'slow'. Could this be DDT/metadata specific issue? Is the DDT/metadata vdev-specific? The pool initially had only two vdevs and after vdev 3 was added, most of the written data had no dedup enabled. Also, the ZIL was added later and initial metadata could be fragmented. But.. why should this affect writing? The zpool is indeed pretty full, but performance should degrade for all vdevs (which are more or less equally full). 
Daniel _______________________________________________ freebsd-fs@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 14:23:41 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 4DF269FC for ; Tue, 16 Jul 2013 14:23:41 +0000 (UTC) (envelope-from daniel@digsys.bg) Received: from smtp-sofia.digsys.bg (smtp-sofia.digsys.bg [193.68.21.123]) by mx1.freebsd.org (Postfix) with ESMTP id CEE059DC for ; Tue, 16 Jul 2013 14:23:39 +0000 (UTC) Received: from dcave.digsys.bg (dcave.digsys.bg [193.68.6.1]) (authenticated bits=0) by smtp-sofia.digsys.bg (8.14.6/8.14.6) with ESMTP id r6GENbX7075932 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 16 Jul 2013 17:23:38 +0300 (EEST) (envelope-from daniel@digsys.bg) Message-ID: <51E55769.4030207@digsys.bg> Date: Tue, 16 Jul 2013 17:23:37 +0300 From: Daniel Kalchev User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130627 Thunderbird/17.0.7 MIME-Version: 1.0 To: Ivailo Tanusheff Subject: Re: ZFS vdev I/O questions References: <51E5316B.9070201@digsys.bg> <20130716115305.GA40918@mwi1.coffeenet.org> <51E54799.8070700@digsys.bg> <9d3cf0be165d4351acc5e757de3868ec@DB3PR07MB059.eurprd07.prod.outlook.com> In-Reply-To: <9d3cf0be165d4351acc5e757de3868ec@DB3PR07MB059.eurprd07.prod.outlook.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "freebsd-fs@freebsd.org" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 14:23:41 -0000 On 16.07.13 17:09, Ivailo Tanusheff wrote: > Isn't this some kind of pool fragmentation? Because this is usually the case in such slow parts of the disk systems. I think your pool is getting full and it is heavily fragmented, that's why you have more data for each request on a different vdev. The pool may be fragmented. But not because it is full. It is fragmented because I forgot to add an ZIL when creating the pool, then proceeded to heavily use dedup and even some compression. Now, I am rewriting the pool's data and hopefully metadata, in userland, for the lack of better technology, primarily by doing zfs send/receive of various datasets then removing the originals. That helps me both balance the data across all vdevs as well as get rid of dedup and compression (that go to other pools with less deletes). My guess is this is more specifically metadata fragmentation. But fragmentation does not fully explain why the writes are so irregular -- writes should be grouped easily, especially metadata rewrites... and what is ZFS doing while not reading or writing (many seconds)? Morale: always add an ZIL to an ZFS pool, as this will save you to deal with fragmentation later. Depending on the pool usage, even an normal drive could do. Writes to the ZIL are sequential. > But this has nothing to do with the single, slow device :( > That drive is slow only when doing lots of small I/O. For bulk writes (which ZFS should be doing anyway with the kind of data this pool holds), it is actually faster than the Hitachi's. It will eventually get replaced soon. 
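For the record, retrofitting a dedicated log device is a single operation. A rough sketch, with placeholder pool and device names, using a mirrored pair rather than a single disk:

# zpool add tank log mirror da21 da22
# zpool status tank

zpool status should then show a separate "logs" section for the pool.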
Daniel From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 15:33:36 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 3718CE53; Tue, 16 Jul 2013 15:33:36 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) by mx1.freebsd.org (Postfix) with ESMTP id 0E69FE91; Tue, 16 Jul 2013 15:33:36 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 2082DB917; Tue, 16 Jul 2013 11:33:34 -0400 (EDT) From: John Baldwin To: freebsd-fs@freebsd.org Subject: Re: RAID10 stripe size and PostgreSQL performance Date: Tue, 16 Jul 2013 11:04:53 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p25; KDE/4.5.5; amd64; ; ) References: In-Reply-To: MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201307161104.54089.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 16 Jul 2013 11:33:34 -0400 (EDT) Cc: freebsd-database@freebsd.org, Ivan Voras X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 15:33:36 -0000 On Friday, July 12, 2013 3:15:03 pm Artem Naluzhnyy wrote: > On Fri, Jul 12, 2013 at 4:55 PM, Ivan Voras wrote: > > I just looked at your RAID configuration at http://pastebin.com/F8uZEZdm > > and you have a mirror of stripes (RAID-01) nor a stripe of mirrors > > (RAID-10). And apparently, is I parse your configuration correctly, you > > have a 1M stripe in the MIRROR part of the RAID, and an unknown stripe > > size in the STRIPE part. > > This is probably a bug in mfiutil output. There is no "RAID 01" option > in the controller configuration, and its documentation says > (http://goo.gl/6X5pe): > > "RAID 10, a combination of RAID 0 and RAID 1, consists of striped data > across mirrored spans. A RAID 10 drive group is a spanned drive group > that creates a striped set from a series of mirrored drives. RAID 10 > allows a maximum of eight spans. You must use an even number of > configuration Scenarios 1-7 drives in each RAID virtual drive in the > span. The RAID 1 virtual drives must have the same stripe size." > > There is also no options to configure a different stripe size for the > mirrors, I can only set it globally for the whole RAID 10 volume. It is true that mfi only does stripes across RAID mirrors. mfiutil depends on the secondary raid level being set in the ddf info for detecting a RAID-10 vs a RAID-1, but not all mfi BIOS-configured volumes have that set. It should probably check if a volume spans multiple arrays instead. 
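To see what the firmware actually built, mfiutil can also dump the array/volume layout; roughly:

# mfiutil show config
# mfiutil show volumes

The config listing shows each array with its member drives and each volume with the array(s) it uses, which should make it clear whether a "RAID-10" volume really is a stripe over several mirrored arrays.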
-- John Baldwin From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 19:33:40 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 97BF6C55; Tue, 16 Jul 2013 19:33:40 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 79B0ACF2; Tue, 16 Jul 2013 19:33:39 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id WAA15503; Tue, 16 Jul 2013 22:33:37 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1UzB0D-000Lxa-CR; Tue, 16 Jul 2013 22:33:37 +0300 Message-ID: <51E59FD9.4020103@FreeBSD.org> Date: Tue, 16 Jul 2013 22:32:41 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: Adrian Chadd , freebsd-fs@FreeBSD.org Subject: Re: Deadlock in nullfs/zfs somewhere References: <51DCFEDA.1090901@FreeBSD.org> In-Reply-To: X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-current X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 19:33:40 -0000 on 10/07/2013 19:50 Adrian Chadd said the following: > On 9 July 2013 23:27, Andriy Gapon wrote: >> on 09/07/2013 16:03 Adrian Chadd said the following: >>> Does anyone have any ideas as to what's going on? >> >> Please provide output of 'thread apply all bt' from kgdb, then perhaps someone >> might be able to tell. > > Done - http://people.freebsd.org/~adrian/ath/20130710-vm0-zfs-hang.txt vmcore.0 was useless for some reason - an interesting address was not accessible. vmcore.1 seems to be very similar and is actually useful. This problem looks like an interesting deadlock involving ZFS and VFS and vnode shortage. The most obvious things are that many threads could not allocate a new vnode and are waiting in getnewvnode_reserve and also many threads are stuck waiting on vnode locks held by the former threads. In effect, they all wait for vnlru, which in turn is stuck in zfs_freebsd_reclaim on z_teardown_lock. That lock is held by a thread doing a rollback ioctl. And that thread waits for zfs sync thread to actually perform the rollback. The sync thread waits on zfs quiesce thread to declare the current transaction group as quiesced. The quiesce thread, obviously, waits for all operations running in the current transaction group to complete. Some of those operations are e.g. VOP_CREATE -> zfs_create. They already started a zfs transaction (as a part of the current transaction group) and they execute zfs_mknode which needs a new vnode. So these threads are waiting for a new vnode and do not let the current transaction group become quiesced. GOTO beginning. Compressing the above description to the extreme, it boils down to: ZFS needs a new vnode from vnlru and is waiting on it, while vnlru has to wait on ZFS. 
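On a live system the same chain can usually be spotted without a crash dump by printing kernel stacks for all threads and searching for the wait points mentioned above; for example (the grep pattern is only illustrative):

# procstat -kk -a | egrep 'getnewvnode|vnlru|zfs'

Many threads parked in getnewvnode_reserve, plus vnlru stuck inside zfs_freebsd_reclaim, would match the scenario described here.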
-- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 19:40:37 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4F3ADEEC; Tue, 16 Jul 2013 19:40:37 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-wi0-x233.google.com (mail-wi0-x233.google.com [IPv6:2a00:1450:400c:c05::233]) by mx1.freebsd.org (Postfix) with ESMTP id 8E8F2D66; Tue, 16 Jul 2013 19:40:36 +0000 (UTC) Received: by mail-wi0-f179.google.com with SMTP id hj3so1109517wib.12 for ; Tue, 16 Jul 2013 12:40:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=e4g4eB7EyD1Iooh4RReiuCMqz0WX8xTsz/RvmJVnyr4=; b=e9lvg/uR8ueYsWfKdCrYICdjf0cFt5NYEhsrzeH4PnXcbiYDJgFg25ONdoVSIBl+OU +ElnQ6nDnyTTCw6BEo6jyXhhXdmonxZcNh0FZ9C3YkfWaoNmE5aqwY+NiRNWOHFmxnjc x5lf20441lW/dagqWFajN7+lRYP/JBmN3bYA9P6lUEdX+YVe49BgSLPD6uq1nuGKnWRj RXAcgkJAbwZVTV86cW066V+f/A1++abFIvKrcCy9DcXFPfm/zSekwsbRzjmTzM6+vG0v mJ+0Kau6ID9NaeIlMzHubZTUnxGlX1neIrC972dPWlCnFcELVbtQMeiXhJRcLDkNTZpb ZxlA== MIME-Version: 1.0 X-Received: by 10.180.39.212 with SMTP id r20mr2286284wik.30.1374003635341; Tue, 16 Jul 2013 12:40:35 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.217.94.132 with HTTP; Tue, 16 Jul 2013 12:40:35 -0700 (PDT) In-Reply-To: <51E59FD9.4020103@FreeBSD.org> References: <51DCFEDA.1090901@FreeBSD.org> <51E59FD9.4020103@FreeBSD.org> Date: Tue, 16 Jul 2013 12:40:35 -0700 X-Google-Sender-Auth: GCnD91X04MXzJpmn1mauhv_YnNw Message-ID: Subject: Re: Deadlock in nullfs/zfs somewhere From: Adrian Chadd To: Andriy Gapon Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-fs@freebsd.org, freebsd-current X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 19:40:37 -0000 On 16 July 2013 12:32, Andriy Gapon wrote: > vmcore.0 was useless for some reason - an interesting address was not accessible. Eek. > vmcore.1 seems to be very similar and is actually useful. Oh good. > This problem looks like an interesting deadlock involving ZFS and VFS and vnode > shortage. > The most obvious things are that many threads could not allocate a new vnode and > are waiting in getnewvnode_reserve and also many threads are stuck waiting on > vnode locks held by the former threads. > In effect, they all wait for vnlru, which in turn is stuck in > zfs_freebsd_reclaim on z_teardown_lock. > That lock is held by a thread doing a rollback ioctl. > And that thread waits for zfs sync thread to actually perform the rollback. > The sync thread waits on zfs quiesce thread to declare the current transaction > group as quiesced. > The quiesce thread, obviously, waits for all operations running in the current > transaction group to complete. > Some of those operations are e.g. VOP_CREATE -> zfs_create. They already > started a zfs transaction (as a part of the current transaction group) and they > execute zfs_mknode which needs a new vnode. So these threads are waiting for a > new vnode and do not let the current transaction group become quiesced. > GOTO beginning. > > Compressing the above description to the extreme, it boils down to: ZFS needs a > new vnode from vnlru and is waiting on it, while vnlru has to wait on ZFS. 
:( So it's a deadlock. Ok, so what's next? -adrian From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 20:26:10 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 44BB7D6B for ; Tue, 16 Jul 2013 20:26:10 +0000 (UTC) (envelope-from javocado@gmail.com) Received: from mail-lb0-x22d.google.com (mail-lb0-x22d.google.com [IPv6:2a00:1450:4010:c04::22d]) by mx1.freebsd.org (Postfix) with ESMTP id C54C1F5C for ; Tue, 16 Jul 2013 20:26:09 +0000 (UTC) Received: by mail-lb0-f173.google.com with SMTP id v1so962541lbd.4 for ; Tue, 16 Jul 2013 13:26:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=STdwzaAPNi9jZB3iLbfPZJK29KnZSWsKkVMqYtxQplw=; b=cpAE+ftGI47DfaibXYo6lcFOsygTzmi+qPhuK67ceyuPiG42vIpIWMWfe5oGLWIugR PAQmoc0eefv6cTPBLEsNNj1uyfWPs+eMMOdyIL7nun4x1wpE+HYkFJ56ceZ8N64pfm2F 8WuSQN5vXST52xx4hwr+wlFfif+m8zy0lATdISnVw48NLxbVDo7lIi1FczbAtZf+AF9g ao1fUsuU32aQeByo7VA5GqnBKe5p3qQPk57eE/RhC4mYIqEfCrUH2nGbTZCOtuHOLacH NlpK1XjzCMK0Z7P0kyP7J4aFrZGvHR5RvEYrTjAXkBpj0rti8spQ29so3709QK1Ry47J 3fWQ== MIME-Version: 1.0 X-Received: by 10.152.27.9 with SMTP id p9mr1550507lag.4.1374006368684; Tue, 16 Jul 2013 13:26:08 -0700 (PDT) Received: by 10.114.98.42 with HTTP; Tue, 16 Jul 2013 13:26:08 -0700 (PDT) Date: Tue, 16 Jul 2013 13:26:08 -0700 Message-ID: Subject: ZFS memory exhaustion? From: javocado To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 20:26:10 -0000 I have a couple questions: - what does it look like when zfs needs more physical memory / is running out of memory for its operations? - what diagnostic numbers (vmstat, etc.) should I watch for that? swapinfo shows zero (basically zero) swap usage, so it doesn't look like things get that bad. 
Thanks From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 20:50:35 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id AFC9B3BB for ; Tue, 16 Jul 2013 20:50:35 +0000 (UTC) (envelope-from fjwcash@gmail.com) Received: from mail-qe0-x232.google.com (mail-qe0-x232.google.com [IPv6:2607:f8b0:400d:c02::232]) by mx1.freebsd.org (Postfix) with ESMTP id 7449BEF for ; Tue, 16 Jul 2013 20:50:35 +0000 (UTC) Received: by mail-qe0-f50.google.com with SMTP id f6so687080qej.37 for ; Tue, 16 Jul 2013 13:50:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=hIZPC6bzLhXC1kMfDLu3ymjyasMXmaZw7TeZgn7+9Yk=; b=j47h+yi5b7SZEMrWKfDOt1ng8DxYSByxfBMh+tKT7GkzbftVYcPMZJ2Ni8bRzlfDzg 5wYYrLgEPnmqYyjC+HnfIBSi2DH2lg31VPMHV8VAm24ASXiU4wdhgzW/gxAkll8LaGeX MKH5HYWdryrnJFk1pmRvV21zGKW72tOBLJThO/JsIER88o/jsV/icVU3yAy8wPYoRZ+O KyFH+sAPeQ6PbeOWIHv4Elpc91MB51Ky1GiE7sswLM7Y7+UvxOzdKQN+pNXTJwcjSAU0 OiC9JN4r4oAgpdwhxHDFLry/9H9hhFkWu9sbJeNELG+vRWP1QGOS/oPEPTvW/g6eTZzW lkWA== MIME-Version: 1.0 X-Received: by 10.224.79.14 with SMTP id n14mr5545663qak.114.1374007834974; Tue, 16 Jul 2013 13:50:34 -0700 (PDT) Received: by 10.49.49.135 with HTTP; Tue, 16 Jul 2013 13:50:34 -0700 (PDT) In-Reply-To: References: Date: Tue, 16 Jul 2013 13:50:34 -0700 Message-ID: Subject: Re: ZFS memory exhaustion? From: Freddie Cash To: javocado Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 20:50:35 -0000 On Tue, Jul 16, 2013 at 1:26 PM, javocado wrote: > I have a couple questions: > > - what does it look like when zfs needs more physical memory / is running > out of memory for its operations? > Disk I/O drops to 0, reading/writing any file from the pool appears to "hang" the console, programs already loaded into RAM continue to work so long as they don't touch the pool, etc. > > - what diagnostic numbers (vmstat, etc.) should I watch for that? > > Top output will show Wired at/near 100% of RAM. > swapinfo shows zero (basically zero) swap usage, so it doesn't look like > things get that bad. > > ZFS uses non-swappable kernel memory, so you won't ever see swap used when ZFS runs out of RAM. Those are the symptoms we've noticed when our ZFS systems have run out of RAM. 
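A couple of sysctls put numbers on that; for example:

# sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max
# sysctl vm.stats.vm.v_wire_count hw.physmem

kstat.zfs.misc.arcstats.size pressing against vfs.zfs.arc_max, together with wired memory (v_wire_count is in pages) approaching physical RAM, is the pattern described above.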
-- Freddie Cash fjwcash@gmail.com From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 21:55:05 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 8B745278 for ; Tue, 16 Jul 2013 21:55:05 +0000 (UTC) (envelope-from javocado@gmail.com) Received: from mail-la0-x231.google.com (mail-la0-x231.google.com [IPv6:2a00:1450:4010:c03::231]) by mx1.freebsd.org (Postfix) with ESMTP id 15383385 for ; Tue, 16 Jul 2013 21:55:04 +0000 (UTC) Received: by mail-la0-f49.google.com with SMTP id ea20so941602lab.22 for ; Tue, 16 Jul 2013 14:55:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=RyM3e43SBQoypfvbI0MCPx2WFRYOHVndSDQUETGCKNI=; b=bSvnIHDiuzoz61XNg0tipW+Fded7gVwEt3V/oV37s+VW+vzhIRrPWKsscDJ5n46w8Y 3tYEB8Y1Rnjlmhq8ytUlqaBjf+OKK/uMKQdeHERakuuAL+y04gWdsWME3CwPXP4JaIAJ VU7dLt/TTEu8BFKNnyqOx6sCkVu8keMfWscf+d7gVb+iUEtCE+Ad9V7mN20PXvRsuwd2 syqbaChJaZPbHqaKd6fmigXMufGuzxgTfzGV+PVtgUrft8mk3tAGFn4FfCO/9upqK5g0 jCVZYksSUCLZPGHuoduEnp7bY42BkvoHUx85pUjGE4R6Ce0IzdoMHtZyaLvgqrF3Vkq9 UBlw== MIME-Version: 1.0 X-Received: by 10.152.25.169 with SMTP id d9mr1594497lag.63.1374011704047; Tue, 16 Jul 2013 14:55:04 -0700 (PDT) Received: by 10.114.98.42 with HTTP; Tue, 16 Jul 2013 14:55:04 -0700 (PDT) In-Reply-To: References: Date: Tue, 16 Jul 2013 14:55:04 -0700 Message-ID: Subject: Re: ZFS memory exhaustion? From: javocado To: Freddie Cash Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 21:55:05 -0000 Thank you. Is gstat the best way to watch zfs i/o ? On Tue, Jul 16, 2013 at 1:50 PM, Freddie Cash wrote: > On Tue, Jul 16, 2013 at 1:26 PM, javocado wrote: > >> I have a couple questions: >> >> - what does it look like when zfs needs more physical memory / is running >> out of memory for its operations? >> > > Disk I/O drops to 0, reading/writing any file from the pool appears to > "hang" the console, programs already loaded into RAM continue to work so > long as they don't touch the pool, etc. > > >> >> - what diagnostic numbers (vmstat, etc.) should I watch for that? >> >> Top output will show Wired at/near 100% of RAM. > > >> swapinfo shows zero (basically zero) swap usage, so it doesn't look like >> things get that bad. >> >> > ZFS uses non-swappable kernel memory, so you won't ever see swap used when > ZFS runs out of RAM. > > Those are the symptoms we've noticed when our ZFS systems have run out of > RAM. 
> > > -- > Freddie Cash > fjwcash@gmail.com > From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 22:24:03 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id DA46EB49 for ; Tue, 16 Jul 2013 22:24:03 +0000 (UTC) (envelope-from gezeala@gmail.com) Received: from mail-la0-x22a.google.com (mail-la0-x22a.google.com [IPv6:2a00:1450:4010:c03::22a]) by mx1.freebsd.org (Postfix) with ESMTP id 63CFA67E for ; Tue, 16 Jul 2013 22:24:03 +0000 (UTC) Received: by mail-la0-f42.google.com with SMTP id eb20so990523lab.15 for ; Tue, 16 Jul 2013 15:24:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=JtFKpZrQ2tE+KnNo5PbghaEcbeKN152+dypnW3f4xTI=; b=Fy7GV+aHdVY5Jkgm70EQ8FxxBD6DLyahA859LrqjHmDp6o7dlJuypm5Fk4C8EMhc2m NZGF9o5E9be6MNSTIqp69Njbgc2Ct1Qn5Yo3qgFt73fcOc7K3Hp8PEyP7EfNJgKJGgKH 4Kp/SmpCd7dZIK9BdHbEd8zfK3L2aeoTQLyj9bMv4KiGgUlbKKldbUcZ2XkkKd4m5EMg bLG0ldqSs0QDH37Y8yLiHRvKtHdDFwNUtd6whkU2aRapaSfA2Kv18qSPPK+QHxM6PBkF PlCgw2Ykjj56bfziS23i4OwjiP6wz8yilaxFZbffFlInyBR77sZYr56l9IzFXqlrvhVB 9QKg== X-Received: by 10.152.19.70 with SMTP id c6mr1708656lae.13.1374013442273; Tue, 16 Jul 2013 15:24:02 -0700 (PDT) MIME-Version: 1.0 Received: by 10.114.82.72 with HTTP; Tue, 16 Jul 2013 15:23:22 -0700 (PDT) In-Reply-To: References: From: =?ISO-8859-1?Q?Gezeala_M=2E_Bacu=F1o_II?= Date: Tue, 16 Jul 2013 15:23:22 -0700 Message-ID: Subject: Re: ZFS memory exhaustion? To: javocado Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 22:24:03 -0000 On Tue, Jul 16, 2013 at 2:55 PM, javocado wrote: > Thank you. Is gstat the best way to watch zfs i/o ? > YMMV and stats you are looking for.. I use these a lot: zpool iostat 1 zpool iostat -v 1 zpool iostat -v _pool_name_ 1 zpool iostat -v _poolA_ _poolB_ 1 iostat -xz -w1 -h not much: systat -iostat ==> catch: there are several options although it still truncates output and stats displayed is limited with your screen size gstat From owner-freebsd-fs@FreeBSD.ORG Tue Jul 16 22:47:27 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id BDC6CF65 for ; Tue, 16 Jul 2013 22:47:27 +0000 (UTC) (envelope-from girgen@FreeBSD.org) Received: from melon.pingpong.net (melon.pingpong.net [79.136.116.200]) by mx1.freebsd.org (Postfix) with ESMTP id 5D0AD76A for ; Tue, 16 Jul 2013 22:47:27 +0000 (UTC) Received: from girgbook.lan (c-0f54e155.1525-1-64736c12.cust.bredbandsbolaget.se [85.225.84.15]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by melon.pingpong.net (Postfix) with ESMTPSA id BE2992E651; Wed, 17 Jul 2013 00:47:23 +0200 (CEST) Message-ID: <51E5CD7A.2020109@FreeBSD.org> Date: Wed, 17 Jul 2013 00:47:22 +0200 From: Palle Girgensohn User-Agent: Postbox 3.0.8 (Macintosh/20130427) MIME-Version: 1.0 To: Kirk McKusick Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?) 
References: <201307151932.r6FJWSxM087108@chez.mckusick.com> In-Reply-To: <201307151932.r6FJWSxM087108@chez.mckusick.com> X-Enigmail-Version: 1.2.3 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, Jeff Roberson , Julian Akehurst X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Jul 2013 22:47:27 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Kirk McKusick skrev: >> Date: Mon, 15 Jul 2013 10:51:10 +0100 Subject: Re: leaking lots of >> unreferenced inodes (pg_xlog files?) From: Dan Thomas >> To: Kirk McKusick Cc: >> Palle Girgensohn , freebsd-fs@freebsd.org, Jeff >> Roberson , Julian Akehurst >> X-ASK-Info: Message Queued (2013/07/15 >> 02:51:22) X-ASK-Info: Confirmed by User (2013/07/15 02:55:04) >> >> On 11 June 2013 01:17, Kirk McKusick >> wrote: >>> OK, good to have it narrowed down. I will look to devise some >>> additional diagnostics that hopefully will help tease out the >>> bug. I'll hopefully get back to you soon. >> Hi, >> >> Is there any news on this issue? We're still running several >> servers that are exhibiting this problem (most recently, one that >> seems to be leaking around 10gb/hour), and it's getting to the >> point where we're looking at moving to a different OS until it's >> resolved. >> >> We have access to several production systems with this problem and >> (at least from time to time) will have systems with a significant >> leak on them that we can experiment with. Is there any way we can >> assist with tracking this down? Any diagnostics or testing that >> would be useful? >> >> Thanks, Dan > > Hi Dan (and Palle), > > Sorry for the long delay with no help / news. I have gotten > side-tracked on several projects and have had little time to try and > devise some tests that would help find the cause of the lost space. > It almost certainly is a one-line fix (a missing vput or vrele > probably in some error path), but finding where it goes is the hard > part :-) > > I have had little success in inserting code that tracks reference > counts (too many false positives). So, I am going to need some help > from you to narrow it down. My belief is that there is some set of > filesystem operations (system calls) that are leading to the > problem. Notably, a file is being created, data put into it, then the > file is deleted (either before or after being closed). Somehow a > reference to that file is persisting despite there being no valid > reference to it. Hence the filesystem thinks it is still live and is > not deleting it. When you do the forcible unmount, these files get > cleared and the space shows back up. > > What I need to devise is a small test program doing the set of system > calls that cause this to happen. The way that I would like to try and > get it is to have you `ktrace -i' your application and then run your > application just long enough to create at least one of these lost > files. The goal is to minimize the amount of ktrace data through > which we need to sift. > > In preparation for doing this test you need to have a kernel compiled > with `option DIAGNOSTIC' or if you prefer, just add `#define > DIAGNOSTIC 1' to the top of sys/kern/vfs_subr.c. You will know you > have at least one offending file when you try to unmount the affected > filesystem and find it busy. Before doing the `umount -f', enable > busy printing using `sysctl debug.busyprt=1'. 
Then capture the > console output which will show the details of all the vnodes that had > to be forcibly flushed. Hopefully we will then be able to correlate > them back to the files (NAMI in the ktrace output) with which they > were associated. We may need to augment the NAMI data with the inode > number of the associated file to make the association with the > busyprt output. Anyway, once we have that, we can look at all the > system calls done on those files and create a small test program that > exhibits the problem. Given a small test program, Jeff or I can track > down the offending system call path and nail this pernicious bug once > and for all. > > Kirk McKusick Hi, I have run ktrace -i on pg_ctl (which forks off all the postgresql processes) and I got two "busy" files that where "lost" after a few hours. dmesg reveals this: vflush: busy vnode 0xfffffe067cdde960: tag ufs, type VREG usecount 1, writecount 0, refcount 2 mountedhere 0 flags (VI(0x200)) VI_LOCKed v_object 0xfffffe0335922000 ref 0 pages 0 lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) ino 11047146, on dev da2s1d vflush: busy vnode 0xfffffe039f35bb40: tag ufs, type VREG usecount 1, writecount 0, refcount 3 mountedhere 0 flags (VI(0x200)) VI_LOCKed v_object 0xfffffe03352701d0 ref 0 pages 0 lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) ino 11045961, on dev da2s1d I had to umount -f, so they where "lost". So, now I have 55 GB ktrace output... ;) Is there anything I can do to filter it, or shall I compress it and put it on a web server for you to fetch as it is? Palle -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJR5c16AAoJEIhV+7FrxBJDK0AH/RLG1QLdyQhwNC6USlqO2+2B 6HXmYwbmDCMIlUQZAaG4h0x6QPzWjXWYMa1KDdpk/BtRhfL7z8tFPdWjTzqBPuK1 aEEQjv/Cp5IgI6FqVbc2agW3GfUwomtjEL3lUk2zmKdPImEWte6ZkLzOFgQpqQao QAxFnN0I8/g+ynQNQIavGOo0foze89wAuOaNvoy9z1wa7tFbjlH2lsVK1xGU6eNj AQn4RJw+tMPMGkNMy6Xjy7B/WMXfxutz1f4O9B1KBwLRZ/cgKxhmppoZdF3N4JsK GNiQvcRbYR9GhBiK+Er87UXKBcj2NS+QQsdSqIb5Ik1ahp78hjxq3raHuOLCTLw= =8+W4 -----END PGP SIGNATURE----- From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 00:29:36 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id B14CC2E0 for ; Wed, 17 Jul 2013 00:29:36 +0000 (UTC) (envelope-from gezeala@gmail.com) Received: from mail-la0-x234.google.com (mail-la0-x234.google.com [IPv6:2a00:1450:4010:c03::234]) by mx1.freebsd.org (Postfix) with ESMTP id 3518BA17 for ; Wed, 17 Jul 2013 00:29:36 +0000 (UTC) Received: by mail-la0-f52.google.com with SMTP id fo12so1044101lab.11 for ; Tue, 16 Jul 2013 17:29:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=WTqss1WK/2Z2odUxS2+sI6OJu5G5J7nMzHOlcf4qoJA=; b=b1ZvuxDNkvIKtIqpnoSoJktpQR+RLzlelVswps8YgdL27f/JquItSP98n9M2ZmJfJY tJMIrJe8Qu+Sp/cAvNqPskCdFocxkK/RDYjegYs9+wVJ2VO6GweSjuCrrJCXNSWm+r6c HSM259UkyhJFtvqjx+AtrDPHIIvAwUEUQ5HXNJU9xiXoij+cb5QM+M4gWex3AbZZQElq meh01OdT0URnqMdPUgaqXwotGc1Nk6+GrfpYE496BdZBWi2ifKU11KFR6iIZnirZZsKt r6Bczcgjq3hUNHBnCLUV+DhHQfF6Vh438uNif5O+49/YuM5DzX4TKYk2+oR+aCNXEYRI +m5A== X-Received: by 10.112.167.100 with SMTP id zn4mr2143351lbb.44.1374020975173; Tue, 16 Jul 2013 17:29:35 -0700 (PDT) MIME-Version: 1.0 Received: by 10.114.82.72 with HTTP; 
Tue, 16 Jul 2013 17:28:55 -0700 (PDT) In-Reply-To: References: <51D42107.1050107@digsys.bg> <2EF46A8C-6908-4160-BF99-EC610B3EA771@alumni.chalmers.se> <51D437E2.4060101@digsys.bg> <20130704000405.GA75529@icarus.home.lan> <20130704171637.GA94539@icarus.home.lan> <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se> <20130704191203.GA95642@icarus.home.lan> <43015E9015084CA6BAC6978F39D22E8B@multiplay.co.uk> <3CFB4564D8EB4A6A9BCE2AFCC5B6E400@multiplay.co.uk> <51D6A206.2020303@digsys.bg> From: =?ISO-8859-1?Q?Gezeala_M=2E_Bacu=F1o_II?= Date: Tue, 16 Jul 2013 17:28:55 -0700 Message-ID: Subject: Re: Slow resilvering with mirrored ZIL To: Freddie Cash Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 00:29:36 -0000 On Fri, Jul 5, 2013 at 6:08 PM, Freddie Cash wrote: > > ZFS- on-Linux has added this as "-o ashift=" property for zpool create. > > There's a threat on the illumos list about standardising this s across all > ZFS- using OSes. > > > +1 on this. We tested zfs-on-linux last year and it does automatically handle disk partitioning for correct alignment. What we do is just add ashift=12 option during zpool create. No more gpart/gnop/ashift/import steps. http://zfsonlinux.org/faq.html#HowDoesZFSonLinuxHandlesAdvacedFormatDrives Back to FreeBSD ZFS, After reading the thread, I'm still at a loss on this (too much info I guess).. regarding gpart/gnop/ashift tweaks for alignment, do we still need to perform gpart on newly purchased (SSD/SATA/SAS) Advanced Format drives? Or, skip gpart and proceed with gnop/ashift only? 
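For comparison, the manual FreeBSD sequence being discussed usually looks roughly like this (disk, label and pool names are only placeholders):

# gpart create -s gpt da0
# gpart add -t freebsd-zfs -a 1m -l disk0 da0
# gnop create -S 4096 /dev/gpt/disk0
# zpool create tank /dev/gpt/disk0.nop
# zpool export tank
# gnop destroy /dev/gpt/disk0.nop
# zpool import tank
# zdb -C tank | grep ashift

The -a 1m flag aligns the partition, the temporary .nop provider makes zpool create pick ashift=12, and the final zdb line is just to verify the result.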
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 01:47:54 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id B25F7CB1 for ; Wed, 17 Jul 2013 01:47:54 +0000 (UTC) (envelope-from wblock@wonkity.com) Received: from wonkity.com (wonkity.com [67.158.26.137]) by mx1.freebsd.org (Postfix) with ESMTP id 532EEBD5 for ; Wed, 17 Jul 2013 01:47:54 +0000 (UTC) Received: from wonkity.com (localhost [127.0.0.1]) by wonkity.com (8.14.7/8.14.7) with ESMTP id r6H1lrRE084549; Tue, 16 Jul 2013 19:47:53 -0600 (MDT) (envelope-from wblock@wonkity.com) Received: from localhost (wblock@localhost) by wonkity.com (8.14.7/8.14.7/Submit) with ESMTP id r6H1lrCT084546; Tue, 16 Jul 2013 19:47:53 -0600 (MDT) (envelope-from wblock@wonkity.com) Date: Tue, 16 Jul 2013 19:47:53 -0600 (MDT) From: Warren Block To: =?ISO-8859-15?Q?Gezeala_M=2E_Bacu=F1o_II?= Subject: Re: Slow resilvering with mirrored ZIL In-Reply-To: Message-ID: References: <51D42107.1050107@digsys.bg> <2EF46A8C-6908-4160-BF99-EC610B3EA771@alumni.chalmers.se> <51D437E2.4060101@digsys.bg> <20130704000405.GA75529@icarus.home.lan> <20130704171637.GA94539@icarus.home.lan> <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se> <20130704191203.GA95642@icarus.home.lan> <43015E9015084CA6BAC6978F39D22E8B@multiplay.co.uk> <3CFB4564D8EB4A6A9BCE2AFCC5B6E400@multiplay.co.uk> <51D6A206.2020303@digsys.bg> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="3512871622-1111948915-1374025673=:84500" X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (wonkity.com [127.0.0.1]); Tue, 16 Jul 2013 19:47:53 -0600 (MDT) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 01:47:54 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --3512871622-1111948915-1374025673=:84500 Content-Type: TEXT/PLAIN; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 8BIT On Tue, 16 Jul 2013, Gezeala M. Bacuño II wrote: > On Fri, Jul 5, 2013 at 6:08 PM, Freddie Cash wrote: > >> >> ZFS- on-Linux has added this as "-o ashift=" property for zpool create. >> >> There's a threat on the illumos list about standardising this s across all >> ZFS- using OSes. >> >> >> > +1 on this. We tested zfs-on-linux last year and it does automatically > handle disk partitioning for correct alignment. What we do is just add > ashift=12 option during zpool create. No more gpart/gnop/ashift/import > steps. > > http://zfsonlinux.org/faq.html#HowDoesZFSonLinuxHandlesAdvacedFormatDrives > > > Back to FreeBSD ZFS, > > After reading the thread, I'm still at a loss on this (too much info I > guess).. regarding gpart/gnop/ashift tweaks for alignment, do we still need > to perform gpart on newly purchased (SSD/SATA/SAS) Advanced Format drives? > Or, skip gpart and proceed with gnop/ashift only? If ZFS goes on a bare drive, it will be aligned by default. If ZFS is going in a partition, yes, align that partition to 4K boundaries or larger multiples of 4K, like 1M. The gnop/ashift workaround is just to get ZFS to use the right block size. 
So if you don't take care to get partition alignment right, you might end up using the right block size but misaligned. And yes, it will be nice to be able to just explicitly tell ZFS the block size to use. --3512871622-1111948915-1374025673=:84500-- From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 05:34:42 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id F127B2C4; Wed, 17 Jul 2013 05:34:42 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 222EA6A4; Wed, 17 Jul 2013 05:34:41 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id r6H5YWki097217; Wed, 17 Jul 2013 08:34:32 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua r6H5YWki097217 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id r6H5YWcq097216; Wed, 17 Jul 2013 08:34:32 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 17 Jul 2013 08:34:31 +0300 From: Konstantin Belousov To: Palle Girgensohn Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?) Message-ID: <20130717053431.GN5991@kib.kiev.ua> References: <201307151932.r6FJWSxM087108@chez.mckusick.com> <51E5CD7A.2020109@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="LHvWgpbS7VDUdu2f" Content-Disposition: inline In-Reply-To: <51E5CD7A.2020109@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: Kirk McKusick , freebsd-fs@freebsd.org, Jeff Roberson , Julian Akehurst X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 05:34:43 -0000 --LHvWgpbS7VDUdu2f Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Jul 17, 2013 at 12:47:22AM +0200, Palle Girgensohn wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 >=20 > Kirk McKusick skrev: > >> Date: Mon, 15 Jul 2013 10:51:10 +0100 Subject: Re: leaking lots of > >> unreferenced inodes (pg_xlog files?) From: Dan Thomas > >> To: Kirk McKusick Cc: > >> Palle Girgensohn , freebsd-fs@freebsd.org, Jeff > >> Roberson , Julian Akehurst > >> X-ASK-Info: Message Queued (2013/07/15 > >> 02:51:22) X-ASK-Info: Confirmed by User (2013/07/15 02:55:04) > >>=20 > >> On 11 June 2013 01:17, Kirk McKusick > >> wrote: > >>> OK, good to have it narrowed down. I will look to devise some=20 > >>> additional diagnostics that hopefully will help tease out the=20 > >>> bug. I'll hopefully get back to you soon. > >> Hi, > >>=20 > >> Is there any news on this issue? We're still running several > >> servers that are exhibiting this problem (most recently, one that > >> seems to be leaking around 10gb/hour), and it's getting to the > >> point where we're looking at moving to a different OS until it's > >> resolved. 
> >>=20 > >> We have access to several production systems with this problem and > >> (at least from time to time) will have systems with a significant > >> leak on them that we can experiment with. Is there any way we can > >> assist with tracking this down? Any diagnostics or testing that > >> would be useful? > >>=20 > >> Thanks, Dan > >=20 > > Hi Dan (and Palle), > >=20 > > Sorry for the long delay with no help / news. I have gotten=20 > > side-tracked on several projects and have had little time to try and > > devise some tests that would help find the cause of the lost space. > > It almost certainly is a one-line fix (a missing vput or vrele > > probably in some error path), but finding where it goes is the hard > > part :-) > >=20 > > I have had little success in inserting code that tracks reference=20 > > counts (too many false positives). So, I am going to need some help=20 > > from you to narrow it down. My belief is that there is some set of=20 > > filesystem operations (system calls) that are leading to the > > problem. Notably, a file is being created, data put into it, then the > > file is deleted (either before or after being closed). Somehow a > > reference to that file is persisting despite there being no valid > > reference to it. Hence the filesystem thinks it is still live and is > > not deleting it. When you do the forcible unmount, these files get=20 > > cleared and the space shows back up. > >=20 > > What I need to devise is a small test program doing the set of system > > calls that cause this to happen. The way that I would like to try and > > get it is to have you `ktrace -i' your application and then run your > > application just long enough to create at least one of these lost > > files. The goal is to minimize the amount of ktrace data through > > which we need to sift. > >=20 > > In preparation for doing this test you need to have a kernel compiled > > with `option DIAGNOSTIC' or if you prefer, just add `#define > > DIAGNOSTIC 1' to the top of sys/kern/vfs_subr.c. You will know you > > have at least one offending file when you try to unmount the affected > > filesystem and find it busy. Before doing the `umount -f', enable > > busy printing using `sysctl debug.busyprt=3D1'. Then capture the > > console output which will show the details of all the vnodes that had > > to be forcibly flushed. Hopefully we will then be able to correlate > > them back to the files (NAMI in the ktrace output) with which they > > were associated. We may need to augment the NAMI data with the inode > > number of the associated file to make the association with the > > busyprt output. Anyway, once we have that, we can look at all the > > system calls done on those files and create a small test program that > > exhibits the problem. Given a small test program, Jeff or I can track > > down the offending system call path and nail this pernicious bug once > > and for all. > >=20 > > Kirk McKusick >=20 > Hi, >=20 > I have run ktrace -i on pg_ctl (which forks off all the postgresql > processes) and I got two "busy" files that where "lost" after a few > hours. 
dmesg reveals this: >=20 > vflush: busy vnode > 0xfffffe067cdde960: tag ufs, type VREG > usecount 1, writecount 0, refcount 2 mountedhere 0 > flags (VI(0x200)) > VI_LOCKed v_object 0xfffffe0335922000 ref 0 pages 0 > lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) > ino 11047146, on dev da2s1d > vflush: busy vnode > 0xfffffe039f35bb40: tag ufs, type VREG > usecount 1, writecount 0, refcount 3 mountedhere 0 > flags (VI(0x200)) > VI_LOCKed v_object 0xfffffe03352701d0 ref 0 pages 0 > lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) > ino 11045961, on dev da2s1d >=20 >=20 > I had to umount -f, so they where "lost". >=20 > So, now I have 55 GB ktrace output... ;) Is there anything I can do to > filter it, or shall I compress it and put it on a web server for you to > fetch as it is? I think that 55GB of ktrace is obviously useless. The Kirk' idea was to have an isolated test case that would only create the situation triggering the leak, without irrelevant activity. This indeed requires drilling down and isolating the file activities to get to the core of problem. FWIW, I and Peter Holm used the following alternative approach quite successfully when tracking down other vnode reference leaks. The approach still requires some understanding of the specifics of the problematic files to be useful, but not as much as isolated test. Basically, you take the patch below, and set the VV_DEBUGVREF flag for the vnode that has characteristics as much specific for the leaked vnode as possible. The patch has example of setting the flag for all new NFS=20 vnodes. You would probably want to do the same in vfs_vgetf(), checking e.g. for the partition where your leaks happen. The limiting of the vnodes for which the vref traces are accumulated is needed to save the kernel memory. Then after the leak was observed, you just print the vnode with ddb command 'show vnode addr' and send the output to developer. Index: sys/sys/vnode.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/sys/vnode.h (revision 248723) +++ sys/sys/vnode.h (working copy) @@ -94,6 +94,13 @@ struct vpollinfo { =20 #if defined(_KERNEL) || defined(_KVM_VNODE) =20 +struct debug_ref { + TAILQ_ENTRY(debug_ref) link; + int val; + const char *op; + struct stack stack; +}; + struct vnode { /* * Fields which define the identity of the vnode. These fields are @@ -169,6 +176,7 @@ struct vnode { int v_writecount; /* v ref count of writers */ u_int v_hash; enum vtype v_type; /* u vnode type */ + TAILQ_HEAD(, debug_ref) v_debug_ref; }; =20 #endif /* defined(_KERNEL) || defined(_KVM_VNODE) */ @@ -253,6 +261,7 @@ struct xvnode { #define VV_DELETED 0x0400 /* should be removed */ #define VV_MD 0x0800 /* vnode backs the md device */ #define VV_FORCEINSMQ 0x1000 /* force the insmntque to succeed */ +#define VV_DEBUGVREF 0x2000 =20 /* * Vnode attributes. 
A field value of VNOVAL represents a field whose val= ue Index: sys/kern/vfs_subr.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/kern/vfs_subr.c (revision 248723) +++ sys/kern/vfs_subr.c (working copy) @@ -71,6 +71,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include #include @@ -871,6 +872,23 @@ static struct kproc_desc vnlru_kp =3D { }; SYSINIT(vnlru, SI_SUB_KTHREAD_UPDATE, SI_ORDER_FIRST, kproc_start, &vnlru_kp); + +MALLOC_DEFINE(M_RECORD_REF, "recordref", "recordref"); +static void +v_record_ref(struct vnode *vp, int val, const char *op) +{ + struct debug_ref *r; + + if ((vp->v_type !=3D VREG && vp->v_type !=3D VBAD) || + (vp->v_vflag & VV_DEBUGVREF) =3D=3D 0) + return; + r =3D malloc(sizeof(struct debug_ref), M_RECORD_REF, M_NOWAIT | + M_USE_RESERVE); + r->val =3D val; + r->op =3D op; + stack_save(&r->stack); + TAILQ_INSERT_TAIL(&vp->v_debug_ref, r, link); +} =20 /* * Routines having to do with the management of the vnode table. @@ -1073,6 +1091,7 @@ alloc: vp->v_vflag |=3D VV_NOKNOTE; } rangelock_init(&vp->v_rl); + TAILQ_INIT(&vp->v_debug_ref); =20 /* * For the filesystems which do not use vfs_hash_insert(), @@ -1082,6 +1101,7 @@ alloc: */ vp->v_hash =3D (uintptr_t)vp >> vnsz2log; =20 + TAILQ_INIT(&vp->v_debug_ref); *vpp =3D vp; return (0); } @@ -2197,6 +2217,7 @@ vget(struct vnode *vp, int flags, struct thread *t vinactive(vp, td); vp->v_iflag &=3D ~VI_OWEINACT; } + v_record_ref(vp, 1, "vget"); VI_UNLOCK(vp); return (0); } @@ -2211,6 +2232,7 @@ vref(struct vnode *vp) CTR2(KTR_VFS, "%s: vp %p", __func__, vp); VI_LOCK(vp); v_incr_usecount(vp); + v_record_ref(vp, 1, "vref"); VI_UNLOCK(vp); } =20 @@ -2253,6 +2275,7 @@ vputx(struct vnode *vp, int func) KASSERT(func =3D=3D VPUTX_VRELE, ("vputx: wrong func")); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); VI_LOCK(vp); + v_record_ref(vp, -1, "vputx"); =20 /* Skip this v_writecount check if we're going to panic below. */ VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < 1, vp, @@ -2409,6 +2432,7 @@ void vdropl(struct vnode *vp) { struct bufobj *bo; + struct debug_ref *r, *r1; struct mount *mp; int active; =20 @@ -2489,6 +2513,9 @@ vdropl(struct vnode *vp) lockdestroy(vp->v_vnlock); mtx_destroy(&vp->v_interlock); mtx_destroy(BO_MTX(bo)); + TAILQ_FOREACH_SAFE(r, &vp->v_debug_ref, link, r1) { + free(r, M_RECORD_REF); + } uma_zfree(vnode_zone, vp); } =20 @@ -2888,6 +2915,8 @@ vn_printf(struct vnode *vp, const char *fmt, ...) va_list ap; char buf[256], buf2[16]; u_long flags; + int ref; + struct debug_ref *r; =20 va_start(ap, fmt); vprintf(fmt, ap); @@ -2960,8 +2989,21 @@ vn_printf(struct vnode *vp, const char *fmt, ...) 
vp->v_object->resident_page_count); printf(" "); lockmgr_printinfo(vp->v_vnlock); - if (vp->v_data !=3D NULL) - VOP_PRINT(vp); +#if DDB + if (kdb_active) { + if (vp->v_data !=3D NULL) + VOP_PRINT(vp); + } +#endif + + /* Getnewvnode() initial reference is not recorded due to VNON */ + ref =3D 1; + TAILQ_FOREACH(r, &vp->v_debug_ref, link) { + ref +=3D r->val; + printf("REF %d %s\n", ref, r->op); + stack_print(&r->stack); + } + } =20 #ifdef DDB Index: sys/fs/nfsclient/nfs_clport.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/fs/nfsclient/nfs_clport.c (revision 248723) +++ sys/fs/nfsclient/nfs_clport.c (working copy) @@ -273,6 +273,7 @@ nfscl_nget(struct mount *mntp, struct vnode *dvp, /* vfs_hash_insert() vput()'s the losing vnode */ return (0); } + vp->v_vflag |=3D VV_DEBUGVREF; *npp =3D np; =20 return (0); Index: sys/fs/nfsclient/nfs_clnode.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/fs/nfsclient/nfs_clnode.c (revision 248723) +++ sys/fs/nfsclient/nfs_clnode.c (working copy) @@ -179,6 +179,7 @@ ncl_nget(struct mount *mntp, u_int8_t *fhp, int fh /* vfs_hash_insert() vput()'s the losing vnode */ return (0); } + vp->v_vflag |=3D VV_DEBUGVREF; *npp =3D np; =20 return (0); --LHvWgpbS7VDUdu2f Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR5iznAAoJEJDCuSvBvK1BNLMP/1CiRpEvV7G5anywsWJTDMRl ZyrXwXUUSjWnN7gAb3sHz8JpJf86AW57e7z8V6YBlZCZh7D7SeVwh5UaN7onuOzN D1RQPmR2AMcU6eybvJwt2fuMvgaTRoncyPN5YdyQK/jWUpdFCoNFxh5wD5RAyIne wFxQoeF9XFop+kMIBvcT92r5qM6jZIaNwzAChggkBAoh/r9b/DH9WXlUlX/tj+JN ZJ34yHhCvz1hnRD5hJVMvkYGauZSv2J+0TYS8FCvLXCafOKDtwty8OjnfMGsJJOb GQ18mEgjpsxJ6lZCvRTsjgrXbgkhHhrIxITzl914cRFPAFnjyLa5lXVsWwO+tdUZ fQ2PVtFVIX1x5c7sl+UPMEpRWZU9gNs59zBEybDR8vwxyBnbrMyLdOhreZSLaWE/ 6KiGKggAg65XNz6DBrhFlJaxZbXH2zlTBeAs1ZThQtclCL9u6jWrKjT3kM7rWmOy 8MD3SrLmV1nSJMRJ62emMGHlDmtHzFfLhT0/1JlLnHKuacXcJdL9dsLZW3t14Qm+ IvjJOAtId5nt4d+e/7RDstGi5ItjOsGmiM1szp2N5tbTrbB6WNmGbEiIJskxn5g7 Ba+GDD9goYCwsY/2AlL7T33QgCeblya9V9cf9j1TTOTjWubQWbzpNe+iMJJvG/le Lym+qBFczH7WrR0VW1kA =bhM7 -----END PGP SIGNATURE----- --LHvWgpbS7VDUdu2f-- From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 06:37:33 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id E1865C86 for ; Wed, 17 Jul 2013 06:37:33 +0000 (UTC) (envelope-from hostdl@gmail.com) Received: from mail-ve0-x230.google.com (mail-ve0-x230.google.com [IPv6:2607:f8b0:400c:c01::230]) by mx1.freebsd.org (Postfix) with ESMTP id A9F8E89E for ; Wed, 17 Jul 2013 06:37:33 +0000 (UTC) Received: by mail-ve0-f176.google.com with SMTP id c13so1187005vea.7 for ; Tue, 16 Jul 2013 23:37:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=Ad9RO0udsfuUu6SrcRSdyW2Ij5tx9LUFUQkVQdv6lLQ=; b=o16OpoLlaDUu42ZlspD+XnK+mdoSclLskI60MF5pfw8RQ3CUBfUmzATFcBs/zvLxtU GER93dPHsyyWasTytnkalKW33qO2v3xyJ9EyDV8+YiJfIGEY5XhZKCtasg4yqzXa0dBJ jnv6doMcJ3e6XEX2EImC3NGrQRgM5OEkkjUw8qNVYdPB7VX5SmSkkiXObN8V2ePdRJIE CDn4CySpaZmT3x/5d0A0+/+aMzKqYk6PTl3EUBp+9nHr2FIJTwOoAyKkp3Lq3F+k23ax 
Z8slmwjxUAy8vZDpnLlRGSB+qzHOr5z47F8p2lajkyyDsWrGuJaPTgTL/7dwKgEOooS/ H7UQ== MIME-Version: 1.0 X-Received: by 10.52.237.164 with SMTP id vd4mr1306105vdc.118.1374043053120; Tue, 16 Jul 2013 23:37:33 -0700 (PDT) Received: by 10.59.6.227 with HTTP; Tue, 16 Jul 2013 23:37:33 -0700 (PDT) Date: Wed, 17 Jul 2013 11:07:33 +0430 Message-ID: Subject: XFS write support From: Host DL To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 06:37:33 -0000 Hello, I just want to port my current CentOS boxes to FreeBSD but noticed that XFS write support hasn't been implemented yet after many years or it is still experimental Please let me know what is the current status of this feature and how to enable it Regards From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 08:28:23 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 506183BB for ; Wed, 17 Jul 2013 08:28:23 +0000 (UTC) (envelope-from maurizio.vairani@cloverinformatica.it) Received: from smtpdg2.aruba.it (smtpdg220.aruba.it [62.149.158.220]) by mx1.freebsd.org (Postfix) with ESMTP id 8B63BC96 for ; Wed, 17 Jul 2013 08:28:21 +0000 (UTC) Received: from cloverinformatica.it ([188.10.129.202]) by smtpcmd01.ad.aruba.it with bizsmtp id 1LT91m00h4N8xN401LT9JH; Wed, 17 Jul 2013 10:27:10 +0200 Received: from [192.168.0.81] (ASUS-TERMINATOR [192.168.0.81]) by cloverinformatica.it (Postfix) with ESMTP id B66FBF5E6; Wed, 17 Jul 2013 10:27:09 +0200 (CEST) Message-ID: <51E6555D.2080803@cloverinformatica.it> Date: Wed, 17 Jul 2013 10:27:09 +0200 From: Maurizio Vairani User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1 MIME-Version: 1.0 To: freebsd-stable@FreeBSD.org, freebsd-fs@freebsd.org Subject: Shutdown problem with an USB memory stick as ZFS cache device Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 08:28:23 -0000 Hi all, on a Compaq Presario laptop I have just installed the latest stable #uname -a FreeBSD presario 9.2-PRERELEASE FreeBSD 9.2-PRERELEASE #0: Tue Jul 16 16:32:39 CEST 2013 root@presario:/usr/obj/usr/src/sys/GENERIC amd64 For speed up the compilation I have added to the pool, tank0, a SanDisk memory stick as cache device with the command: # zpool add tank0 cache /dev/da0 But when I shutdown the laptop the process will halt with this screen shot: http://www.dump-it.fr/freebsd-screen-shot/2f9169f18c7c77e52e873580f9c2d4bf.jpg.html and I need to press the power button for more than 4 seconds to switch off the laptop. The problem is always reproducible. 
Regards Maurizio From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 09:40:18 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id E8D45449; Wed, 17 Jul 2013 09:40:18 +0000 (UTC) (envelope-from Ivailo.Tanusheff@skrill.com) Received: from db9outboundpool.messaging.microsoft.com (mail-db9lp0253.outbound.messaging.microsoft.com [213.199.154.253]) by mx1.freebsd.org (Postfix) with ESMTP id 4BD67F8C; Wed, 17 Jul 2013 09:40:17 +0000 (UTC) Received: from mail9-db9-R.bigfish.com (10.174.16.230) by DB9EHSOBE017.bigfish.com (10.174.14.80) with Microsoft SMTP Server id 14.1.225.22; Wed, 17 Jul 2013 09:25:05 +0000 Received: from mail9-db9 (localhost [127.0.0.1]) by mail9-db9-R.bigfish.com (Postfix) with ESMTP id 97B8DC80216; Wed, 17 Jul 2013 09:25:05 +0000 (UTC) X-Forefront-Antispam-Report: CIP:157.56.249.213; KIP:(null); UIP:(null); IPV:NLI; H:AM2PRD0710HT004.eurprd07.prod.outlook.com; RD:none; EFVD:NLI X-SpamScore: -1 X-BigFish: PS-1(zz9371I542I14ffIzz1f42h1ee6h1de0h1fdah2073h1202h1e76h1d1ah1d2ah1fc6hzz17326ah8275dhz2fh2a8h668h839h944hd24hf0ah1220h1288h12a5h12a9h12bdh137ah13b6h1441h1504h1537h153bh162dh1631h1758h18e1h1946h19b5h19ceh1ad9h1b0ah1d07h1d0ch1d2eh1d3fh1de9h1dfeh1dffh1e1dh9a9j1155h) Received-SPF: pass (mail9-db9: domain of skrill.com designates 157.56.249.213 as permitted sender) client-ip=157.56.249.213; envelope-from=Ivailo.Tanusheff@skrill.com; helo=AM2PRD0710HT004.eurprd07.prod.outlook.com ; .outlook.com ; X-Forefront-Antispam-Report-Untrusted: SFV:NSPM; SFS:(377454003)(13464003)(199002)(189002)(53754006)(4396001)(69226001)(63696002)(79102001)(54356001)(33646001)(83072001)(74876001)(74366001)(59766001)(77982001)(65816001)(47446002)(46102001)(76786001)(16406001)(77096001)(74502001)(50986001)(76576001)(54316002)(81342001)(31966008)(49866001)(81542001)(47976001)(47736001)(74316001)(56816003)(66066001)(76796001)(56776001)(51856001)(74662001)(15202345003)(80022001)(53806001)(76482001)(74706001)(24736002); DIR:OUT; SFP:; SCL:1; SRVR:DB3PR07MB059; H:DB3PR07MB059.eurprd07.prod.outlook.com; CLIP:217.18.249.148; RD:InfoNoRecords; A:1; MX:1; LANG:en; Received: from mail9-db9 (localhost.localdomain [127.0.0.1]) by mail9-db9 (MessageSwitch) id 1374053102822740_658; Wed, 17 Jul 2013 09:25:02 +0000 (UTC) Received: from DB9EHSMHS009.bigfish.com (unknown [10.174.16.231]) by mail9-db9.bigfish.com (Postfix) with ESMTP id B9673920046; Wed, 17 Jul 2013 09:25:02 +0000 (UTC) Received: from AM2PRD0710HT004.eurprd07.prod.outlook.com (157.56.249.213) by DB9EHSMHS009.bigfish.com (10.174.14.19) with Microsoft SMTP Server (TLS) id 14.16.227.3; Wed, 17 Jul 2013 09:25:02 +0000 Received: from DB3PR07MB059.eurprd07.prod.outlook.com (10.242.137.149) by AM2PRD0710HT004.eurprd07.prod.outlook.com (10.255.165.39) with Microsoft SMTP Server (TLS) id 14.16.329.3; Wed, 17 Jul 2013 09:25:02 +0000 Received: from DB3PR07MB059.eurprd07.prod.outlook.com (10.242.137.149) by DB3PR07MB059.eurprd07.prod.outlook.com (10.242.137.149) with Microsoft SMTP Server (TLS) id 15.0.731.16; Wed, 17 Jul 2013 09:25:00 +0000 Received: from DB3PR07MB059.eurprd07.prod.outlook.com ([169.254.2.117]) by DB3PR07MB059.eurprd07.prod.outlook.com ([169.254.2.117]) with mapi id 15.00.0731.000; Wed, 17 Jul 2013 09:25:00 +0000 From: Ivailo Tanusheff To: Maurizio Vairani , "freebsd-stable@FreeBSD.org" , "freebsd-fs@freebsd.org" Subject: RE: Shutdown problem with an USB memory stick as ZFS cache device Thread-Topic: Shutdown problem with an USB memory stick as ZFS 
cache device Thread-Index: AQHOgseq21n5ubkB006RuCOsHZ/zxplomN5Q Date: Wed, 17 Jul 2013 09:25:00 +0000 Message-ID: <0243b7c6538240c69770fdd0aaa4e8e0@DB3PR07MB059.eurprd07.prod.outlook.com> References: <51E6555D.2080803@cloverinformatica.it> In-Reply-To: <51E6555D.2080803@cloverinformatica.it> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [217.18.249.148] x-forefront-prvs: 0910AAF391 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: skrill.com X-FOPE-CONNECTOR: Id%0$Dn%*$RO%0$TLS%0$FQDN%$TlsDn% X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 09:40:19 -0000

I think this is expected, as your screenshot shows the USB has been disconnected, so you actually lost the cache device on the shutdown. Maybe you should implement a shutdown script that removes the USB cache from the pool before the shutdown command is issued :)

Best regards, Ivailo Tanusheff

-----Original Message----- From: owner-freebsd-stable@freebsd.org [mailto:owner-freebsd-stable@freebsd.org] On Behalf Of Maurizio Vairani Sent: Wednesday, July 17, 2013 11:27 AM To: freebsd-stable@FreeBSD.org; freebsd-fs@freebsd.org Subject: Shutdown problem with an USB memory stick as ZFS cache device

Hi all, on a Compaq Presario laptop I have just installed the latest stable #uname -a FreeBSD presario 9.2-PRERELEASE FreeBSD 9.2-PRERELEASE #0: Tue Jul 16 16:32:39 CEST 2013 root@presario:/usr/obj/usr/src/sys/GENERIC amd64 For speed up the compilation I have added to the pool, tank0, a SanDisk memory stick as cache device with the command: # zpool add tank0 cache /dev/da0 But when I shutdown the laptop the process will halt with this screen shot: http://www.dump-it.fr/freebsd-screen-shot/2f9169f18c7c77e52e873580f9c2d4bf.jpg.html and I need to press the power button for more than 4 seconds to switch off the laptop. The problem is always reproducible.
Regards Maurizio _______________________________________________ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 09:50:32 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id BAD18703; Wed, 17 Jul 2013 09:50:32 +0000 (UTC) (envelope-from ronald-freebsd8@klop.yi.org) Received: from smarthost1.greenhost.nl (smarthost1.greenhost.nl [195.190.28.81]) by mx1.freebsd.org (Postfix) with ESMTP id 7F0D4A3; Wed, 17 Jul 2013 09:50:32 +0000 (UTC) Received: from smtp.greenhost.nl ([213.108.104.138]) by smarthost1.greenhost.nl with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.69) (envelope-from ) id 1UzONM-0002DW-CT; Wed, 17 Jul 2013 11:50:25 +0200 Received: from [81.21.138.17] (helo=ronaldradial.versatec.local) by smtp.greenhost.nl with esmtpsa (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.72) (envelope-from ) id 1UzONM-0002DC-8P; Wed, 17 Jul 2013 11:50:24 +0200 Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes To: freebsd-stable@freebsd.org, freebsd-fs@freebsd.org, "Maurizio Vairani" Subject: Re: Shutdown problem with an USB memory stick as ZFS cache device References: <51E6555D.2080803@cloverinformatica.it> Date: Wed, 17 Jul 2013 11:50:22 +0200 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit From: "Ronald Klop" Message-ID: In-Reply-To: <51E6555D.2080803@cloverinformatica.it> User-Agent: Opera Mail/12.16 (Win32) X-Virus-Scanned: by clamav at smarthost1.samage.net X-Spam-Level: - X-Spam-Score: -1.9 X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=disabled version=3.3.1 X-Scan-Signature: dfea3049d3b923820beb462d65569822 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 09:50:32 -0000 On Wed, 17 Jul 2013 10:27:09 +0200, Maurizio Vairani wrote: > Hi all, > > > on a Compaq Presario laptop I have just installed the latest stable > > > #uname -a > > FreeBSD presario 9.2-PRERELEASE FreeBSD 9.2-PRERELEASE #0: Tue Jul 16 > 16:32:39 CEST 2013 root@presario:/usr/obj/usr/src/sys/GENERIC amd64 > > > For speed up the compilation I have added to the pool, tank0, a SanDisk > memory stick as cache device with the command: > > > # zpool add tank0 cache /dev/da0 > > > But when I shutdown the laptop the process will halt with this screen > shot: > > > http://www.dump-it.fr/freebsd-screen-shot/2f9169f18c7c77e52e873580f9c2d4bf.jpg.html > > > and I need to press the power button for more than 4 seconds to switch > off the laptop. > > The problem is always reproducible. Does sysctl hw.usb.no_shutdown_wait=1 help? Ronald. 
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 09:58:51 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 8D3F4F20; Wed, 17 Jul 2013 09:58:51 +0000 (UTC) (envelope-from prvs=1910bf16bb=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 0243A286; Wed, 17 Jul 2013 09:58:49 +0000 (UTC) Received: from r2d2 ([82.69.141.170]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50005028799.msg; Wed, 17 Jul 2013 10:58:40 +0100 X-Spam-Processed: mail1.multiplay.co.uk, Wed, 17 Jul 2013 10:58:40 +0100 (not processed: message from valid local sender) X-MDDKIM-Result: neutral (mail1.multiplay.co.uk) X-MDRemoteIP: 82.69.141.170 X-Return-Path: prvs=1910bf16bb=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID: <6ACFDC285CB64C148915BC0FD8357B11@multiplay.co.uk> From: "Steven Hartland" To: , , "Maurizio Vairani" , "Ronald Klop" References: <51E6555D.2080803@cloverinformatica.it> Subject: Re: Shutdown problem with an USB memory stick as ZFS cache device Date: Wed, 17 Jul 2013 10:59:04 +0100 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=response Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 09:58:51 -0000 ----- Original Message ----- From: "Ronald Klop" > > Does sysctl hw.usb.no_shutdown_wait=1 help? That will just prevent the wait it won't stop the shutdown from happening. Regards Steve ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. 
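As Steven notes, the sysctl only shortens the wait; the USB stick is still attached to the pool as a cache device when the USB stack tears it down. A minimal sketch of the shutdown-script idea Ivailo raised might look like the rc.d stub below. The pool and device names (tank0, da0) are taken from the original report; the script name, ordering and everything else here is illustrative and untested, not something posted in this thread:

    #!/bin/sh
    #
    # PROVIDE: zcachedetach
    # REQUIRE: LOGIN
    # KEYWORD: shutdown

    . /etc/rc.subr

    name="zcachedetach"
    rcvar="zcachedetach_enable"
    start_cmd=":"
    stop_cmd="zcachedetach_stop"

    # At shutdown, drop the USB cache device from the pool so nothing
    # still holds it open when the USB stick is detached.
    zcachedetach_stop()
    {
            zpool remove tank0 da0
    }

    load_rc_config $name
    run_rc_command "$1"

Enabled with zcachedetach_enable="YES" in /etc/rc.conf. Removing an L2ARC device does not endanger the pool, so the only cost of doing this at every shutdown is a cold cache on the next boot.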
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 10:08:47 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 649C86FC for ; Wed, 17 Jul 2013 10:08:47 +0000 (UTC) (envelope-from bra@fsn.hu) Received: from people.fsn.hu (people.fsn.hu [195.228.252.137]) by mx1.freebsd.org (Postfix) with ESMTP id 20BDD6D1 for ; Wed, 17 Jul 2013 10:08:46 +0000 (UTC) Received: by people.fsn.hu (Postfix, from userid 1001) id 0E8FE1127287; Wed, 17 Jul 2013 12:03:16 +0200 (CEST) X-Bogosity: Ham, tests=bogofilter, spamicity=0.001777, version=1.2.3 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MF-ACE0E1EA [pR: 7.5406] X-CRM114-CacheID: sfid-20130717_12031_E9EBBC86 X-CRM114-Status: Good ( pR: 7.5406 ) X-DSPAM-Result: Whitelisted X-DSPAM-Processed: Wed Jul 17 12:03:16 2013 X-DSPAM-Confidence: 0.7005 X-DSPAM-Probability: 0.0000 X-DSPAM-Signature: 51e66be4282261417415160 X-DSPAM-Factors: 27, From*Attila Nagy , 0.00010, From*Attila, 0.00535, Mounted, 0.00594, Mounted+on, 0.00594, mount, 0.00712, Subject*files, 0.00762, USE, 0.00762, the+files, 0.00762, shutdown, 0.00888, fsck, 0.00888, fsck, 0.00888, (at+least, 0.00888, files, 0.00923, files, 0.00923, Received*online.co.hu+[195.228.243.99]), 0.01000, Received*[195.228.243.99]), 0.01000, I'm+waiting, 0.99000, file+system, 0.01000, Received*online.co.hu, 0.01000, From*Attila+Nagy, 0.01000, Date*03+11, 0.99000, Sizes, 0.01000, find+on, 0.99000, Received*(japan.t, 0.01000, From*Nagy+; Wed, 17 Jul 2013 12:03:15 +0200 (CEST) Message-ID: <51E66BDF.4010709@fsn.hu> Date: Wed, 17 Jul 2013 12:03:11 +0200 From: Attila Nagy MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: SU+J all files lost after a reboot? Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 10:08:47 -0000 Hi, SU+J file systems formatted with r248885. The file systems were in active use for some months, they were 70% full. Today, I rebooted the OS (clean shutdown, there were no crashes) with r251643 just to see this: /dev/da0p2 923G 32M 923G 0% /fs All files lost after a reboot??? But a quick find on the file system showed the files are(?) there, I can even read them (at least the ones I've tried so far). Starting an fsck gives: # fsck /fs ** /dev/da0p2 USE JOURNAL? [yn] y ** SU+J Recovering /dev/da0p2 Journal timestamp does not match fs mount time ** Skipping journal, falling through to full fsck ** Last Mounted on ** Phase 1 - Check Blocks and Sizes Now I'm waiting to see what this will do to the data. I'm somewhat inclined to think that SU+J is not production ready yet... 
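For anyone wanting to check what a file system actually has enabled before trusting the journal, the flags can be inspected and the journal dropped (while keeping plain soft updates) with tunefs on the unmounted device, followed by a forced full check. This is a generic illustration using the device name from the report, not advice given in this thread:

    # tunefs -p /dev/da0p2           # print current flags: soft updates, SU+J, ...
    # tunefs -j disable /dev/da0p2   # drop the journal, keep soft updates
    # fsck -f /dev/da0p2             # force a full check instead of the journal fast path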
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 10:28:33 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 6C059B14 for ; Wed, 17 Jul 2013 10:28:33 +0000 (UTC) (envelope-from maurizio.vairani@cloverinformatica.it) Received: from smtpdg9.aruba.it (smtpdg8.aruba.it [62.149.158.238]) by mx1.freebsd.org (Postfix) with ESMTP id C72F37E1 for ; Wed, 17 Jul 2013 10:28:32 +0000 (UTC) Received: from cloverinformatica.it ([188.10.129.202]) by smtpcmd03.ad.aruba.it with bizsmtp id 1NUN1m01L4N8xN401NUPLB; Wed, 17 Jul 2013 12:28:24 +0200 Received: from [192.168.0.100] (MAURIZIO-PC [192.168.0.100]) by cloverinformatica.it (Postfix) with ESMTP id 8571BF651; Wed, 17 Jul 2013 12:28:23 +0200 (CEST) Message-ID: <51E671C7.50409@cloverinformatica.it> Date: Wed, 17 Jul 2013 12:28:23 +0200 From: Maurizio Vairani User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1 MIME-Version: 1.0 To: Ronald Klop Subject: [SOLVED] Re: Shutdown problem with an USB memory stick as ZFS cache device References: <51E6555D.2080803@cloverinformatica.it> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 10:28:33 -0000 On 17/07/2013 11:50, Ronald Klop wrote: > On Wed, 17 Jul 2013 10:27:09 +0200, Maurizio Vairani > wrote: > >> Hi all, >> >> >> on a Compaq Presario laptop I have just installed the latest stable >> >> >> #uname -a >> >> FreeBSD presario 9.2-PRERELEASE FreeBSD 9.2-PRERELEASE #0: Tue Jul 16 >> 16:32:39 CEST 2013 root@presario:/usr/obj/usr/src/sys/GENERIC amd64 >> >> >> For speed up the compilation I have added to the pool, tank0, a >> SanDisk memory stick as cache device with the command: >> >> >> # zpool add tank0 cache /dev/da0 >> >> >> But when I shutdown the laptop the process will halt with this screen >> shot: >> >> >> http://www.dump-it.fr/freebsd-screen-shot/2f9169f18c7c77e52e873580f9c2d4bf.jpg.html >> >> >> >> and I need to press the power button for more than 4 seconds to >> switch off the laptop. >> >> The problem is always reproducible. > > Does sysctl hw.usb.no_shutdown_wait=1 help? > > Ronald. Thank you Ronald it works ! 
In /boot/loader.conf added the line hw.usb.no_shutdown_wait=1 Maurizio From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 10:30:44 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 3AC92CF9 for ; Wed, 17 Jul 2013 10:30:44 +0000 (UTC) (envelope-from bra@fsn.hu) Received: from people.fsn.hu (people.fsn.hu [195.228.252.137]) by mx1.freebsd.org (Postfix) with ESMTP id E8A68819 for ; Wed, 17 Jul 2013 10:30:43 +0000 (UTC) Received: by people.fsn.hu (Postfix, from userid 1001) id 4DED8112759F; Wed, 17 Jul 2013 12:30:42 +0200 (CEST) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000812, version=1.2.3 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MF-ACE0E1EA [pR: 13.5615] X-CRM114-CacheID: sfid-20130717_12304_19D79649 X-CRM114-Status: Good ( pR: 13.5615 ) X-DSPAM-Result: Whitelisted X-DSPAM-Processed: Wed Jul 17 12:30:42 2013 X-DSPAM-Confidence: 0.9956 X-DSPAM-Probability: 0.0000 X-DSPAM-Signature: 51e67252416701431120196 X-DSPAM-Factors: 27, From*Attila Nagy , 0.00010, >+Hi, 0.00128, in+>, 0.00199, wrote+>, 0.00209, Hi+>, 0.00223, I+>, 0.00282, this+>, 0.00298, >+>, 0.00357, >+>, 0.00357, References*fsn.hu>, 0.00357, In-Reply-To*fsn.hu>, 0.00383, with+>, 0.00383, on+>, 0.00383, wrote, 0.00514, From*Attila, 0.00535, Capacity, 0.00535, Mounted, 0.00594, Mounted, 0.00594, Avail, 0.00594, Used+Avail, 0.00594, Mounted+on, 0.00594, Mounted+on, 0.00594, Avail+Capacity, 0.00594, Capacity+Mounted, 0.00594, Filesystem, 0.00594, >+But, 0.00594, X-Spambayes-Classification: ham; 0.00 Received: from japan.t-online.private (japan.t-online.co.hu [195.228.243.99]) by people.fsn.hu (Postfix) with ESMTPSA id 29A56112758E for ; Wed, 17 Jul 2013 12:30:41 +0200 (CEST) Message-ID: <51E67250.2020102@fsn.hu> Date: Wed, 17 Jul 2013 12:30:40 +0200 From: Attila Nagy MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: SU+J all files lost after a reboot? References: <51E66BDF.4010709@fsn.hu> In-Reply-To: <51E66BDF.4010709@fsn.hu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 10:30:44 -0000 On 07/17/13 12:03, Attila Nagy wrote: > Hi, > > SU+J file systems formatted with r248885. The file systems were in > active use for some months, they were 70% full. > Today, I rebooted the OS (clean shutdown, there were no crashes) with > r251643 just to see this: > /dev/da0p2 923G 32M 923G 0% /fs > > All files lost after a reboot??? > > But a quick find on the file system showed the files are(?) there, I > can even read them (at least the ones I've tried so far). > Starting an fsck gives: > # fsck /fs > ** /dev/da0p2 > > USE JOURNAL? [yn] y > > ** SU+J Recovering /dev/da0p2 > Journal timestamp does not match fs mount time > ** Skipping journal, falling through to full fsck > > ** Last Mounted on > ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups SUMMARY BLK COUNT(S) WRONG IN SUPERBLK SALVAGE? 
[yn] y 6768204 files, 84410191 used, 36633989 free (27013 frags, 4575872 blocks, 0.0% fragmentation) ***** FILE SYSTEM IS CLEAN ***** ***** FILE SYSTEM WAS MODIFIED ***** # mount /fs # df -h /fs Filesystem Size Used Avail Capacity Mounted on /dev/da0p2 923G 644G 279G 70% /fs Scary. From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 10:38:41 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 862C4EF7; Wed, 17 Jul 2013 10:38:41 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 842F6857; Wed, 17 Jul 2013 10:38:40 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id NAA01409; Wed, 17 Jul 2013 13:38:38 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1UzP81-00002p-NS; Wed, 17 Jul 2013 13:38:37 +0300 Message-ID: <51E67409.3010901@FreeBSD.org> Date: Wed, 17 Jul 2013 13:38:01 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: freebsd-stable@FreeBSD.org, freebsd-fs@FreeBSD.org Subject: Re: Shutdown problem with an USB memory stick as ZFS cache device References: <51E6555D.2080803@cloverinformatica.it> In-Reply-To: X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 10:38:41 -0000 on 17/07/2013 12:50 Ronald Klop said the following: > Does sysctl hw.usb.no_shutdown_wait=1 help? I believe that the root cause of the issue is that ZFS does not perform full clean up on shutdown and thus does not release its devices. But perhaps I am mistaken. In any case, I think that doing the same kind of clean up as done on zfs module unload would be advantageous. 
-- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 11:04:31 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id EE5816A1; Wed, 17 Jul 2013 11:04:30 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 0040196E; Wed, 17 Jul 2013 11:04:29 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id OAA02116; Wed, 17 Jul 2013 14:04:21 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1UzPWu-00005Q-V1; Wed, 17 Jul 2013 14:04:21 +0300 Message-ID: <51E679FD.3040306@FreeBSD.org> Date: Wed, 17 Jul 2013 14:03:25 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: zfs-devel@FreeBSD.org, freebsd-fs@FreeBSD.org Subject: zfs_rename: another zfs+vfs deadlock X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=X-VIET-VPS Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 11:04:31 -0000 I received a report about what looked like a deadlock involving ZFS. Interesting bits from the report are: Thread 1156 (Thread 2038380): #0 sched_switch (td=0xfffffe01a9e56460, newtd=0xfffffe001c3ff000, flags=Variable "flags" is not available. ) at /usr/src/sys/kern/sched_ule.c:1860 #1 0xffffffff808ab51a in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:466 #2 0xffffffff808e3e63 in sleepq_switch (wchan=0xfffffe06d9df2310, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:538 #3 0xffffffff808e4a9d in sleepq_wait (wchan=0xfffffe06d9df2310, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:617 #4 0xffffffff8088aebb in __lockmgr_args (lk=0xfffffe06d9df2310, flags=524544, ilk=0xfffffe06d9df23d8, wmesg=Variable "wmesg" is not available. ) at /usr/src/sys/kern/kern_lock.c:214 #5 0xffffffff8092d349 in vop_stdlock (ap=Variable "ap" is not available. ) at lockmgr.h:97 #6 0xffffffff80bd62ab in VOP_LOCK1_APV (vop=0xffffffff8111cc80, a=0xffffff90729ee6e0) at vnode_if.c:1988 #7 0xffffffff8094cfa7 in _vn_lock (vp=0xfffffe06d9df2278, flags=524288, file=Variable "file" is not available. ) at vnode_if.h:859 #8 0xffffffff80942220 in vputx (vp=0xfffffe06d9df2278, func=1) at /usr/src/sys/kern/vfs_subr.c:2279 #9 0xffffffff816e75a4 in zfs_rename_unlock (zlpp=0xffffff90729ee878) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:3501 #10 0xffffffff816e8df4 in zfs_freebsd_rename (ap=Variable "ap" is not available. ) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:3927 #11 0xffffffff80bd67cb in VOP_RENAME_APV (vop=0xffffffff8175b900, a=0xffffff90729eeaa0) at vnode_if.c:1474 #12 0xffffffff80947844 in kern_renameat (td=Variable "td" is not available. 
) at vnode_if.h:636 #13 0xffffffff80b4eff2 in amd64_syscall (td=0xfffffe01a9e56460, traced=0) at subr_syscall.c:135 #14 0xffffffff80b39b97 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:387 fr 11 p *a $1 = {a_gen = {a_desc = 0xffffffff811594c0}, a_fdvp = 0xfffffe05fb094278, a_fvp = 0xfffffe04b7b62278, a_fcnp = 0xffffff90729eea58, a_tdvp = 0xfffffe0514137768, a_tvp = 0x0, a_tcnp = 0xffffff90729ee9a8} Thread 1158 (Thread 4174978): #0 sched_switch (td=0xfffffe088cbef000, newtd=0xfffffe001c40e000, flags=Variable "flags" is not available. ) at /usr/src/sys/kern/sched_ule.c:1860 #1 0xffffffff808ab51a in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:466 #2 0xffffffff808e3e63 in sleepq_switch (wchan=0xfffffe0514137800, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:538 #3 0xffffffff808e4a9d in sleepq_wait (wchan=0xfffffe0514137800, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:617 #4 0xffffffff8088b4e0 in __lockmgr_args (lk=0xfffffe0514137800, flags=2097152, ilk=0xfffffe05141378c8, wmesg=Variable "wmesg" is not available. ) at /usr/src/sys/kern/kern_lock.c:214 #5 0xffffffff8092d349 in vop_stdlock (ap=Variable "ap" is not available. ) at lockmgr.h:97 #6 0xffffffff80bd62ab in VOP_LOCK1_APV (vop=0xffffffff8111cc80, a=0xffffff9072813470) at vnode_if.c:1988 #7 0xffffffff8094cfa7 in _vn_lock (vp=0xfffffe0514137768, flags=2097152, file=Variable "file" is not available. ) at vnode_if.h:859 #8 0xffffffff816e5bdd in zfs_vnode_lock (vp=0xfffffe0514137768, flags=Variable "flags" is not available. ) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1704 #9 0xffffffff816e6d70 in zfs_lookup (dvp=0xfffffe06d9df2278, nm=0xffffff90728135b0 "toBeDeleted", vpp=0xffffff9072813930, cnp=0xffffff9072813958, nameiop=0, cr=0xfffffe0ba89b0a00, td=0xfffffe088cbef000, flags=0) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:1433 #10 0xffffffff816e7511 in zfs_freebsd_lookup (ap=0xffffff9072813710) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:5758 #11 0xffffffff80bd5d3f in VOP_CACHEDLOOKUP_APV (vop=0xffffffff8175b900, a=0xffffff9072813710) at vnode_if.c:187 #12 0xffffffff8092b103 in vfs_cache_lookup (ap=Variable "ap" is not available. ) at vnode_if.h:80 #13 0xffffffff80bd7187 in VOP_LOOKUP_APV (vop=0xffffffff8175b900, a=0xffffff90728137d0) at vnode_if.c:123 #14 0xffffffff8093260a in lookup (ndp=0xffffff90728138f0) at vnode_if.h:54 #15 0xffffffff8093354e in namei (ndp=0xffffff90728138f0) at /usr/src/sys/kern/vfs_lookup.c:297 #16 0xffffffff80944213 in kern_statat_vnhook (td=0xfffffe088cbef000, flag=Variable "flag" is not available. ) at /usr/src/sys/kern/vfs_syscalls.c:2432 #17 0xffffffff809443b5 in kern_statat (td=Variable "td" is not available. ) at /usr/src/sys/kern/vfs_syscalls.c:2413 #18 0xffffffff8094455a in sys_stat (td=Variable "td" is not available. ) at /usr/src/sys/kern/vfs_syscalls.c:2374 #19 0xffffffff80b4eff2 in amd64_syscall (td=0xfffffe088cbef000, traced=0) at subr_syscall.c:135 #20 0xffffffff80b39b97 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:387 As far as I understand the code, I think that zfs_rename_lock (called from zfs_rename) iterates up ancestor chain of target directory (tdzp) and obtains a reference on each of the ancestors via zfs_zget. zfs_rename_unlock does the opposite - it iterates in the reverse order and VN_RELE-s the ancestor znodes. 
As you can see above, on FreeBSD VN_RELE translates to vputx, which internally needs to obtain a vnode lock. The problem seems to be is that VOP_RENAME -> zfs_freebsd_rename is called with locked tdvp (and perhaps non-NULL and thus locked tvp). tdvp's vnode lock is released at the very end of zfs_freebsd_rename and so it is held over zfs_rename_unlock. And that means that vnode locks of tvp's ancestors can be acquired while tdvp's vnode lock is held. That violates the VFS lock ordering where a descendant's lock must always be acquired after an ancestor's lock. So that could lead to a deadlock with another VFS operation that acquires locks in the proper order. In the above snippet 0xfffffe06d9df2278 is a directory/ancestor of tdvp and 0xfffffe0514137768 is tdvp. VOP_LOOKUP -> zfs_lookup acquires the locks in the correct order (dvp is the ancestor while vp is the tdvp) while zfs_rename does it in the opposite order. A scenario to reproduce this bug could be like this. mkdir a mkdir a/b mv some-file a/b/ (in parallel with) stat a/b Of course it would have to be repeated many times to hit the right timing window. Also, namecache could interfere with this scenario, but I am not sure. -- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 11:27:16 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 6B8CDA3B; Wed, 17 Jul 2013 11:27:16 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 1858CA3E; Wed, 17 Jul 2013 11:27:14 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id OAA02725; Wed, 17 Jul 2013 14:27:13 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1UzPt3-00007u-04; Wed, 17 Jul 2013 14:27:13 +0300 Message-ID: <51E67F54.9080800@FreeBSD.org> Date: Wed, 17 Jul 2013 14:26:12 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: Adrian Chadd Subject: Re: Deadlock in nullfs/zfs somewhere References: <51DCFEDA.1090901@FreeBSD.org> <51E59FD9.4020103@FreeBSD.org> In-Reply-To: X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org, freebsd-current X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 11:27:16 -0000 on 16/07/2013 22:40 Adrian Chadd said the following: > :( So it's a deadlock. Ok, so what's next? A creative process... One possibility is to add getnewvnode_reserve() calls before the ZFS transaction beginnings in the places where a new vnode/znode may have to be allocated within a transaction. This looks like a quick and cheap solution but it makes the code somewhat messier. Another possibility is to change something in VFS machinery, so that VOP_RECLAIM getting blocked for one filesystem does not prevent vnode allocation for other filesystems. I could think of other possible solutions via infrastructural changes in VFS or ZFS... 
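A rough sh loop in the spirit of that scenario might look like the following (directory and file names are the ones from the example above; this is only an illustration of hammering the timing window, not a verified reproducer):

    # run inside a directory on the affected ZFS dataset
    mkdir -p a/b
    touch some-file
    while :; do
            mv some-file a/b/some-file && mv a/b/some-file some-file
    done &
    while :; do
            stat a/b > /dev/null
    done &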
-- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 15:30:02 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 7DAC37B9; Wed, 17 Jul 2013 15:30:02 +0000 (UTC) (envelope-from jhs@berklix.com) Received: from land.berklix.org (land.berklix.org [144.76.10.75]) by mx1.freebsd.org (Postfix) with ESMTP id EC7FC8D1; Wed, 17 Jul 2013 15:30:01 +0000 (UTC) Received: from park.js.berklix.net (pD9FBEF06.dip0.t-ipconnect.de [217.251.239.6]) (authenticated bits=128) by land.berklix.org (8.14.5/8.14.5) with ESMTP id r6HFTwND089577; Wed, 17 Jul 2013 15:29:59 GMT (envelope-from jhs@berklix.com) Received: from fire.js.berklix.net (fire.js.berklix.net [192.168.91.41]) by park.js.berklix.net (8.14.3/8.14.3) with ESMTP id r6HFTpWF004069; Wed, 17 Jul 2013 17:29:51 +0200 (CEST) (envelope-from jhs@berklix.com) Received: from fire.js.berklix.net (localhost [127.0.0.1]) by fire.js.berklix.net (8.14.4/8.14.4) with ESMTP id r6HFT4EK063849; Wed, 17 Jul 2013 17:29:10 +0200 (CEST) (envelope-from jhs@fire.js.berklix.net) Message-Id: <201307171529.r6HFT4EK063849@fire.js.berklix.net> To: Maurizio Vairani Subject: Re: [SOLVED] Re: Shutdown problem with an USB memory stick as ZFS cache device From: "Julian H. Stacey" Organization: http://berklix.com BSD Unix Linux Consultancy, Munich Germany User-agent: EXMH on FreeBSD http://berklix.com/free/ X-URL: http://www.berklix.com In-reply-to: Your message "Wed, 17 Jul 2013 12:28:23 +0200." <51E671C7.50409@cloverinformatica.it> Date: Wed, 17 Jul 2013 17:29:04 +0200 Sender: jhs@berklix.com Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org, Ronald Klop X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 15:30:02 -0000 Maurizio Vairani wrote: > On 17/07/2013 11:50, Ronald Klop wrote: > > On Wed, 17 Jul 2013 10:27:09 +0200, Maurizio Vairani > > wrote: > > > >> Hi all, > >> > >> > >> on a Compaq Presario laptop I have just installed the latest stable > >> > >> > >> #uname -a > >> > >> FreeBSD presario 9.2-PRERELEASE FreeBSD 9.2-PRERELEASE #0: Tue Jul 16 > >> 16:32:39 CEST 2013 root@presario:/usr/obj/usr/src/sys/GENERIC amd64 > >> > >> > >> For speed up the compilation I have added to the pool, tank0, a > >> SanDisk memory stick as cache device with the command: > >> > >> > >> # zpool add tank0 cache /dev/da0 > >> > >> > >> But when I shutdown the laptop the process will halt with this screen > >> shot: > >> > >> > >> http://www.dump-it.fr/freebsd-screen-shot/2f9169f18c7c77e52e873580f9c2d4bf.jpg.html > >> > >> > >> > >> and I need to press the power button for more than 4 seconds to > >> switch off the laptop. > >> > >> The problem is always reproducible. > > > > Does sysctl hw.usb.no_shutdown_wait=1 help? > > > > Ronald. > Thank you Ronald it works ! > > In /boot/loader.conf added the line > hw.usb.no_shutdown_wait=1 > > Maurizio I wonder (from ignorance as I dont use ZFS yet), if that merely masks the symptom or cures the fault ? Presumably one should use a ZFS command to disassociate whatever might have the cache open ? (in case something might need to be written out from cache, if it was a writeable cache ?) 
I too had a USB shutdown problem (non ZFS, now solved) & several people made useful comments on shutdown scripts etc, so I'm cross referencing: http://lists.freebsd.org/pipermail/freebsd-mobile/2013-July/012803.html Cheers, Julian -- Julian Stacey, BSD Unix Linux C Sys Eng Consultant, Munich http://berklix.com Reply below not above, like a play script. Indent old text with "> ". Send plain text. No quoted-printable, HTML, base64, multipart/alternative.

From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 17:01:31 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 97C63308 for ; Wed, 17 Jul 2013 17:01:31 +0000 (UTC) (envelope-from gezeala@gmail.com) Received: from mail-la0-x22d.google.com (mail-la0-x22d.google.com [IPv6:2a00:1450:4010:c03::22d]) by mx1.freebsd.org (Postfix) with ESMTP id 193E0DC3 for ; Wed, 17 Jul 2013 17:01:30 +0000 (UTC) Received: by mail-la0-f45.google.com with SMTP id fr10so1706083lab.18 for ; Wed, 17 Jul 2013 10:01:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=OCDhufSUWjnloz7XeBjVA+ybkZOpW2Ct111XH49MBew=; b=dP1xVA+ZqlZn+naAq4Iv7ONDKwjydCRy2FmifMLTssVUIWDQZXEYD3lpowObeeNZNx Z/fgn4flPZzHvu1eeavqlbDY0iZQckaa/Pvl5DIsR14g+VzMD9JTqD4UzYZH5OasmaU8 PGeW3Akeyaii/FEpHFMZvZc72TMDQrZS2OucqaZqO/HKiOSN53Omzhhzy1HVS0mPW07H CTzgJNx079B9pMsWhW5N3z+aTig6QYaK8SPFMrYj0qcymXagTrGYm63UXMGeXs2ED1MW 9KxR4qh9DAvZVpzRkNZbL9MKxjLH/lFet3FGP36joGfhJFL8uRhvZ4CGgeUPRkGV5rMb jf1g== X-Received: by 10.112.97.132 with SMTP id ea4mr3514560lbb.80.1374080490052; Wed, 17 Jul 2013 10:01:30 -0700 (PDT) MIME-Version: 1.0 Received: by 10.114.82.72 with HTTP; Wed, 17 Jul 2013 10:00:49 -0700 (PDT) In-Reply-To: References: <51D42107.1050107@digsys.bg> <2EF46A8C-6908-4160-BF99-EC610B3EA771@alumni.chalmers.se> <51D437E2.4060101@digsys.bg> <20130704000405.GA75529@icarus.home.lan> <20130704171637.GA94539@icarus.home.lan> <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se> <20130704191203.GA95642@icarus.home.lan> <43015E9015084CA6BAC6978F39D22E8B@multiplay.co.uk> <3CFB4564D8EB4A6A9BCE2AFCC5B6E400@multiplay.co.uk> <51D6A206.2020303@digsys.bg> From: =?ISO-8859-1?Q?Gezeala_M=2E_Bacu=F1o_II?= Date: Wed, 17 Jul 2013 10:00:49 -0700 Message-ID: Subject: Re: Slow resilvering with mirrored ZIL To: Warren Block Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 17:01:31 -0000

On Tue, Jul 16, 2013 at 6:47 PM, Warren Block wrote: > On Tue, 16 Jul 2013, Gezeala M. Bacuño II wrote: > > On Fri, Jul 5, 2013 at 6:08 PM, Freddie Cash wrote: >> >> >>> ZFS-on-Linux has added this as "-o ashift=" property for zpool create. >>> >>> There's a thread on the illumos list about standardising this across >>> all >>> ZFS-using OSes. >>> >>> >>> >>> +1 on this. We tested zfs-on-linux last year and it does automatically >> handle disk partitioning for correct alignment. What we do is just add >> ashift=12 option during zpool create. No more gpart/gnop/ashift/import >> steps.
>> http://zfsonlinux.org/faq.html#HowDoesZFSonLinuxHandlesAdvacedFormatDrives >> >> >> Back to FreeBSD ZFS, >> >> After reading the thread, I'm still at a loss on this (too much info I >> guess).. regarding gpart/gnop/ashift tweaks for alignment, do we still >> need >> to perform gpart on newly purchased (SSD/SATA/SAS) Advanced Format drives? >> Or, skip gpart and proceed with gnop/ashift only? >> > > If ZFS goes on a bare drive, it will be aligned by default. If ZFS is > going in a partition, yes, align that partition to 4K boundaries or larger > multiples of 4K, like 1M. > > Your statement is enlightening and concise, exactly what I need. Thanks. > The gnop/ashift workaround is just to get ZFS to use the right block size. > So if you don't take care to get partition alignment right, you might end > up using the right block size but misaligned. > > And yes, it will be nice to be able to just explicitly tell ZFS the block > size to use.

We do add the entire drive (no partitions) to ZFS, perform gnop/ashift and other necessary steps and then verify ashift=12 through zdb. The gpart/gnop/ashift steps, if I understand correctly (do correct me if I'm stating this incorrectly), is needed for further SSD performance tuning. Taking into consideration leaving a certain chunk for wear leveling and also if the SSD has a size that may be too big for L2ARC.

From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 17:19:04 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4B102611; Wed, 17 Jul 2013 17:19:04 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-we0-x22b.google.com (mail-we0-x22b.google.com [IPv6:2a00:1450:400c:c03::22b]) by mx1.freebsd.org (Postfix) with ESMTP id 8B044E95; Wed, 17 Jul 2013 17:19:03 +0000 (UTC) Received: by mail-we0-f171.google.com with SMTP id m46so2072009wev.30 for ; Wed, 17 Jul 2013 10:19:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=QcE8DTskG1MeoodPIsqtquSYJMiHeaUmzK/gBYHIom4=; b=wzgeE/Vid9VR66auA1B1uq7BT9BHFsG7BA9L4fliwv/IYB0Mu8zwKA+/O2ro23N3+D pwnChAlscZsi1r9U8eOK/OmHTTMS1/Vk6+hIcq/xpi9u1Shir4WHr3WZEELIynmTdCtS nFjUQxSG1UmrfVP00z/pO46xzj12xxkl728HBzgVuVNNBOh2jCjvlssk28f4zzgGvDCN 7vbWdyGRn3Q80e8RKlyqWfDjSHDKAl7Kztaj04DmzwD7U/3d5chRNu8v7cDPwu0SyRmM EIhKrI7xkocOYh1nCTmRbiokJwIaqkuausv/xKwFr7HSUcwLeZiQtF/O/mtzozyBD+Je gkKA== MIME-Version: 1.0 X-Received: by 10.194.63.229 with SMTP id j5mr5541008wjs.79.1374081542505; Wed, 17 Jul 2013 10:19:02 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.217.94.132 with HTTP; Wed, 17 Jul 2013 10:19:02 -0700 (PDT) In-Reply-To: <51E67F54.9080800@FreeBSD.org> References: <51DCFEDA.1090901@FreeBSD.org> <51E59FD9.4020103@FreeBSD.org> <51E67F54.9080800@FreeBSD.org> Date: Wed, 17 Jul 2013 10:19:02 -0700 X-Google-Sender-Auth: ptSTsrEDyA2Cg2Pxbh0saE1rqCw Message-ID: Subject: Re: Deadlock in nullfs/zfs somewhere From: Adrian Chadd To: Andriy Gapon Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-fs@freebsd.org, freebsd-current X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 17:19:04 -0000

On 17 July 2013 04:26, Andriy Gapon wrote: > on 16/07/2013
22:40 Adrian Chadd said the following: >> :( So it's a deadlock. Ok, so what's next? > > A creative process... Wonderful. :) > One possibility is to add getnewvnode_reserve() calls before the ZFS transaction > beginnings in the places where a new vnode/znode may have to be allocated within > a transaction. > This looks like a quick and cheap solution but it makes the code somewhat messier. > > Another possibility is to change something in VFS machinery, so that VOP_RECLAIM > getting blocked for one filesystem does not prevent vnode allocation for other > filesystems. > > I could think of other possible solutions via infrastructural changes in VFS or > ZFS... Well, what do others think? This seems like a showstopper for systems with lots and lots of ZFS filesystems doing lots and lots of activity. -adrian

From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 17:35:27 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id F37BCB85 for ; Wed, 17 Jul 2013 17:35:26 +0000 (UTC) (envelope-from wblock@wonkity.com) Received: from wonkity.com (wonkity.com [67.158.26.137]) by mx1.freebsd.org (Postfix) with ESMTP id C20D6F60 for ; Wed, 17 Jul 2013 17:35:26 +0000 (UTC) Received: from wonkity.com (localhost [127.0.0.1]) by wonkity.com (8.14.7/8.14.7) with ESMTP id r6HHZJk2091667; Wed, 17 Jul 2013 11:35:19 -0600 (MDT) (envelope-from wblock@wonkity.com) Received: from localhost (wblock@localhost) by wonkity.com (8.14.7/8.14.7/Submit) with ESMTP id r6HHZJjE091664; Wed, 17 Jul 2013 11:35:19 -0600 (MDT) (envelope-from wblock@wonkity.com) Date: Wed, 17 Jul 2013 11:35:19 -0600 (MDT) From: Warren Block To: =?ISO-8859-15?Q?Gezeala_M=2E_Bacu=F1o_II?= Subject: Re: Slow resilvering with mirrored ZIL In-Reply-To: Message-ID: References: <2EF46A8C-6908-4160-BF99-EC610B3EA771@alumni.chalmers.se> <51D437E2.4060101@digsys.bg> <20130704000405.GA75529@icarus.home.lan> <20130704171637.GA94539@icarus.home.lan> <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se> <20130704191203.GA95642@icarus.home.lan> <43015E9015084CA6BAC6978F39D22E8B@multiplay.co.uk> <3CFB4564D8EB4A6A9BCE2AFCC5B6E400@multiplay.co.uk> <51D6A206.2020303@digsys.bg> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="3512871622-236036210-1374082519=:91446" X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (wonkity.com [127.0.0.1]); Wed, 17 Jul 2013 11:35:20 -0600 (MDT) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 17:35:27 -0000

This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --3512871622-236036210-1374082519=:91446 Content-Type: TEXT/PLAIN; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8BIT

On Wed, 17 Jul 2013, Gezeala M. Bacuño II wrote: > If ZFS goes on a bare drive, it will be aligned by default. If ZFS is going in a partition, yes, align that partition to 4K boundaries or larger multiples of 4K, like 1M. > > Your statement is enlightening and concise, exactly what I need. Thanks. > > The gnop/ashift workaround is just to get ZFS to use the right block size.
So if you don't take care to get partition alignment right, you might end up using the right > block size but misaligned. > > And yes, it will be nice to be able to just explicitly tell ZFS the block size to use. > > > We do add the entire drive (no partitions) to ZFS, perform gnop/ashift and other necessary steps and then verify ashift=12 through zdb. > > The gpart/gnop/ashift steps, if I understand correctly (do correct me if I'm stating this incorrectly), is needed for further SSD performance tuning. Taking into consideration leaving a > certain chunk for wear leveling and also if the SSD has a size that may be too big for L2ARC. Well, there are several things going on. Partitions can be used for a couple of things. Limiting the size of space available to ZFS, leaving an unallocated part of the drive for wear leveling. Note that ZFS on FreeBSD now has TRIM, which should make leaving unused space on SSDs unnecessary. Aligning partitions preserves performance. If a partition is misaligned, writes can slow down to half speed. For example, a 4K filesystem block written to an aligned partition writes a single block. If the partition is misaligned, that 4K write is split over two disk blocks. Each block has to be read, partly modified, then written, taking roughly twice as long. Finally, ZFS's ashift controls the minimum size of block ZFS uses. ashift=12 (12 bits) sets that to 4K blocks (2^12=4096). Again, a performance thing, matching the filesystem block size to device block size. It would be interesting to see a benchmark of ZFS on a 4K drive with different ashift values. --3512871622-236036210-1374082519=:91446-- From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 18:22:33 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 177447B6 for ; Wed, 17 Jul 2013 18:22:33 +0000 (UTC) (envelope-from gezeala@gmail.com) Received: from mail-la0-x22c.google.com (mail-la0-x22c.google.com [IPv6:2a00:1450:4010:c03::22c]) by mx1.freebsd.org (Postfix) with ESMTP id 8CB5D224 for ; Wed, 17 Jul 2013 18:22:32 +0000 (UTC) Received: by mail-la0-f44.google.com with SMTP id er20so1795395lab.31 for ; Wed, 17 Jul 2013 11:22:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=yaEL1JjazyUBwi1Dsigqj7eHj8ZWc7148BhP83wQ7BM=; b=ZTA6qiWTLmMNVUI5c+7Fkja23IZE6aO4a7KjD2KZMsu/k3DrxpD1/fjROAhn6YPwJb VqukGDCxkvnDyXfPatndid7BcJmNNKKXjcJIoyEZKuoHXCJwBQDLTs58Q38wdm2IfLaN +jPf9lekFuwmUjXfilxYB3s+2jdytGg3LklV3bl38suNuDYOfyHowAD3XHJYDWEDH9rO 17LHK25wRVbjVQrrJ8sdieeOglbrmBc4H40J5zS91ORv/IzjXFpKYfJkr1eRG0l+t/Od xqEsQo2iGlaZQIiszI+dwjotHw1UrSZNeBAjEE0iXR7YzhS3QuXGM52hujc8UNquFBIp JqTQ== X-Received: by 10.112.51.16 with SMTP id g16mr3707852lbo.0.1374085351426; Wed, 17 Jul 2013 11:22:31 -0700 (PDT) MIME-Version: 1.0 Received: by 10.114.82.72 with HTTP; Wed, 17 Jul 2013 11:21:51 -0700 (PDT) In-Reply-To: References: <2EF46A8C-6908-4160-BF99-EC610B3EA771@alumni.chalmers.se> <51D437E2.4060101@digsys.bg> <20130704000405.GA75529@icarus.home.lan> <20130704171637.GA94539@icarus.home.lan> <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se> <20130704191203.GA95642@icarus.home.lan> <43015E9015084CA6BAC6978F39D22E8B@multiplay.co.uk> <3CFB4564D8EB4A6A9BCE2AFCC5B6E400@multiplay.co.uk> <51D6A206.2020303@digsys.bg> From: =?ISO-8859-1?Q?Gezeala_M=2E_Bacu=F1o_II?= Date: Wed, 17 Jul 2013 11:21:51 -0700 Message-ID: 
Subject: Re: Slow resilvering with mirrored ZIL To: Warren Block Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 18:22:33 -0000

On Wed, Jul 17, 2013 at 10:35 AM, Warren Block wrote: > On Wed, 17 Jul 2013, Gezeala M. Bacuño II wrote: > > If ZFS goes on a bare drive, it will be aligned by default. If ZFS is >> going in a partition, yes, align that partition to 4K boundaries or larger >> multiples of 4K, like 1M. >> >> Your statement is enlightening and concise, exactly what I need. Thanks. >> >> The gnop/ashift workaround is just to get ZFS to use the right >> block size. So if you don't take care to get partition alignment right, >> you might end up using the right >> block size but misaligned. >> >> And yes, it will be nice to be able to just explicitly tell ZFS the >> block size to use. >> >> >> We do add the entire drive (no partitions) to ZFS, perform gnop/ashift >> and other necessary steps and then verify ashift=12 through zdb. >> >> The gpart/gnop/ashift steps, if I understand correctly (do correct me if >> I'm stating this incorrectly), is needed for further SSD performance >> tuning. Taking into consideration leaving a >> certain chunk for wear leveling and also if the SSD has a size that may >> be too big for L2ARC. >> > > Well, there are several things going on. > > Partitions can be used for a couple of things. Limiting the size of space > available to ZFS, leaving an unallocated part of the drive for wear > leveling. Note that ZFS on FreeBSD now has TRIM, which should make leaving > unused space on SSDs unnecessary. > > Aligning partitions preserves performance. If a partition is misaligned, > writes can slow down to half speed. For example, a 4K filesystem block > written to an aligned partition writes a single block. If the partition is > misaligned, that 4K write is split over two disk blocks. Each block has to > be read, partly modified, then written, taking roughly twice as long. > > Finally, ZFS's ashift controls the minimum size of block ZFS uses. > ashift=12 (12 bits) sets that to 4K blocks (2^12=4096). Again, a > performance thing, matching the filesystem block size to device block size. > > It would be interesting to see a benchmark of ZFS on a 4K drive with > different ashift values.

Right on again. I forgot to include on my reply, that it is for a specific use case similar to ours, wherein we dedicate the entire drive to the pool. I believe it is totally time to put all these howto/faq stuff on a central FreeBSD repository, I think there's another thread requesting for the same thing. Scenarios: a] maximizing pool and drive size, don't need to partition -- these are the steps. a.1] new drives a.2] used drives b] For those with limited drives, limited enclosures etc -- these are the steps you may want to check out c] zfs-on-root d] and so on.. This will help a lot on deciding which steps to follow and which are necessary or not, therefore, avoiding all these repeated questions (just like mine) on ZFS setup/performance/tuning. https://wiki.freebsd.org/ZFSTuningGuide (WIP) - outdated, and there's no section for initial zpool/drive(s) setup.
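Since the gnop workaround keeps coming up in this thread without the exact commands, the commonly circulated recipe looks roughly like the following. The disk (da0) and pool (tank) names are placeholders, and this is a sketch of the usual sequence rather than commands taken from the thread, so verify against your own setup:

    # whole-disk pool, forcing ashift=12 via a temporary 4K gnop provider
    gnop create -S 4096 /dev/da0
    zpool create tank /dev/da0.nop
    zpool export tank
    gnop destroy /dev/da0.nop
    zpool import tank
    zdb -C tank | grep ashift        # expect ashift: 12

    # partitioned variant with 1M alignment, per Warren's note above
    gpart create -s gpt da0
    gpart add -t freebsd-zfs -a 1m -l disk0 da0
    zpool create tank gpt/disk0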
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 17 19:46:05 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A8030E47; Wed, 17 Jul 2013 19:46:05 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 0E2C07BC; Wed, 17 Jul 2013 19:46:04 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id r6HJjvNV095405; Wed, 17 Jul 2013 22:45:57 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua r6HJjvNV095405 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id r6HJjvUH095403; Wed, 17 Jul 2013 22:45:57 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 17 Jul 2013 22:45:57 +0300 From: Konstantin Belousov To: Andriy Gapon Subject: Re: zfs_rename: another zfs+vfs deadlock Message-ID: <20130717194557.GU5991@kib.kiev.ua> References: <51E679FD.3040306@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="/Isdj7O9hWi8F9Bn" Content-Disposition: inline In-Reply-To: <51E679FD.3040306@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-fs@FreeBSD.org, zfs-devel@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Jul 2013 19:46:05 -0000

--/Isdj7O9hWi8F9Bn Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Wed, Jul 17, 2013 at 02:03:25PM +0300, Andriy Gapon wrote: > A scenario to reproduce this bug could be like this. > mkdir a > mkdir a/b > mv some-file a/b/ (in parallel with) stat a/b > Of course it would have to be repeated many times to hit the right timing > window. Also, namecache could interfere with this scenario, but I am not sure. >

There are no questions or proposals on how to approach the fix; is this a JFYI mail? I recommend you to look at the ufs_checkpath() and its use in the ufs_rename().
--/Isdj7O9hWi8F9Bn Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR5vR0AAoJEJDCuSvBvK1BevYP/03MlbINCVbX1tI9KuT02IPF KK1YPWykvf11h/GmeONiZv3qZjvYWe9jwkga4f9Hrb6DjAhIZS+3MuIwLK12yANd xfNNFF7XMHcoxyvuF4wDeufgn04ttRgREV0vaDFnODL+fMhzuz7sfjXI4lM9x6+0 nZaAjsS8eR2rYgC2z0oPRyBK+/mMldayM5FWUXBynLpkjgwlk7XP7A6BX9Fw7Mtp vFVKtGSg613ugUYZWwgI5gzJbUjtGCO7l6gQyYQCDGBeetWmyPLRHfz2aS+KsPEI cpG5vi7ruXcA9KMUg8jW9M+9qyMcCKWsnkkTUcpUOXNhbpDMaRKthGM1MVSu8HA6 Q1KfdVuXWPYgg8GJvrBXo6UjgPQmzp/Gw2a4SE/DcHhZ4ouusU0lxX0TOErf+wHW 4i8vWCJO4zk7HIpX546wLqF7eOzDSGJ3VdCkWNheeO6ca7f8wAW8f2/8mD1iBdZo s3wcGSfAKcYXJMX5J7SwTtFtv8V36lU4+XxOo0KiW/tDTu07sPyo7Zgw6iRwnlr+ +KYJzqTI0RftjD0lKlJPYZJTSYIPYffzu9fweiyrO9BbzQf/k+amDK00k30oy1D9 zf0olSwJN+2FhfnzQJf9P+3Urq10JilpmH4xJwuy3M8yKtqQ/eLh4no2ojAORErl nr17M0hGUNV9MUHmaDwL =zpSb -----END PGP SIGNATURE----- --/Isdj7O9hWi8F9Bn-- From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 07:29:23 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id DF647735; Thu, 18 Jul 2013 07:29:23 +0000 (UTC) (envelope-from joe@karthauser.co.uk) Received: from babel.karthauser.co.uk (212-13-197-151.karthauser.co.uk [212.13.197.151]) by mx1.freebsd.org (Postfix) with ESMTP id 3C0EBBC7; Thu, 18 Jul 2013 07:29:22 +0000 (UTC) Received: from [192.168.10.240] (unknown [81.144.225.214]) (Authenticated sender: joemail@tao.org.uk) by babel.karthauser.co.uk (Postfix) with ESMTPSA id B4FFC290E; Thu, 18 Jul 2013 07:29:13 +0000 (UTC) Mime-Version: 1.0 (Mac OS X Mail 6.5 \(1508\)) Subject: Drive failures with ada on FreeBSD-9.1, driver bug or wiring issue? From: Dr Josef Karthauser Date: Thu, 18 Jul 2013 08:29:14 +0100 Message-Id: <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk> References: <20130716225013.1C63B23A@babel.karthauser.co.uk> To: "freebsd-fs@freebsd.org" X-Mailer: Apple Mail (2.1508) Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: "freebsd-stable@freebsd.org" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 07:29:23 -0000 Hi there, I'm scratching my head. I've just migrated to a super micro chassis and = at the same time gone from FreeBSD 9.0 to 9.1-RELEASE. The machine in question is running a ZFS mirror configuration on two ada = devices (with a 8gb gmirror carved out for swap). Since doing so I've been having strange drop outs on the drives; the = just disappear from the bus like so: (ada2:ahcich2:0:0:0): removing device entry (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00 (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 = (ABRT ) (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted (aprobe0:ahcich2:0:0:0): NOP. 
ACB: 00 00 00 00 00 00 00 00 00 00 00 00 (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT ) (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted At first I thought it was a failing drive - one of the drives did this, and I limped on a single drive for a week until I could get someone up to the rack to plug a third drive in. We resilvered the zpool onto the new device and ran with the failed drive still plugged in (but not responding to a reset on the ada bus with camcontrol) for a week or so. Then, the new drive dropped out in exactly the same way, followed in short order by the remaining original drive!!! After rebooting the machine, and observing all three drives probing and available, I resilvered the gmirror and zpool again on the two devices that I thought were reliable, but before the resilvering was completed the new drive dropped out again. I'm scratching my head now. I can't imagine that it's a wiring problem, as they are all on individual SATA buses and individually cabled. Smart isn't reporting any drive issues either…. :/ So, I'm wondering, is it a driver issue with 9.1-RELEASE, and if I upgrade to 9-RELENG would I expect that to resolve the problem? (Have there been any ada bus issues reported since last December?) The hardware in question is: ahci0: port 0xf050-0xf057,0xf040-0xf043,0xf030-0xf037,0xf020-0xf023,0xf000-0xf01f mem 0xdfb02000-0xdfb027ff irq 19 at device 31.2 on pci0 ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported ahcich0: at channel 0 on ahci0 ahcich1: at channel 1 on ahci0 ahcich2: at channel 2 on ahci0 ahcich3: at channel 3 on ahci0 ahcich4: at channel 4 on ahci0 ahcich5: at channel 5 on ahci0 ada0 at ahcich0 bus 0 scbus0 target 0 lun 0 ada0: ATA-8 SATA 2.x device ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada0: Command Queueing enabled ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C) ada0: Previously was known as ad4 ada1 at ahcich1 bus 0 scbus1 target 0 lun 0 ada1: ATA-8 SATA 2.x device ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada1: Command Queueing enabled ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C) ada1: Previously was known as ad6 ada2 at ahcich2 bus 0 scbus2 target 0 lun 0 ada2: ATA-8 SATA 2.x device ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada2: Command Queueing enabled ada2: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C) ada2: Previously was known as ad8 Any ideas would be greatly welcomed.
Thanks, Joe From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 07:33:14 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 822B993A; Thu, 18 Jul 2013 07:33:14 +0000 (UTC) (envelope-from prvs=1911771df7=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 02AF4C0C; Thu, 18 Jul 2013 07:33:13 +0000 (UTC) Received: from r2d2 ([82.69.141.170]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50005041117.msg; Thu, 18 Jul 2013 08:33:10 +0100 X-Spam-Processed: mail1.multiplay.co.uk, Thu, 18 Jul 2013 08:33:10 +0100 (not processed: message from valid local sender) X-MDDKIM-Result: neutral (mail1.multiplay.co.uk) X-MDRemoteIP: 82.69.141.170 X-Return-Path: prvs=1911771df7=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID: <33EF2240EDC1446D8E45F8C51974136B@multiplay.co.uk> From: "Steven Hartland" To: "Dr Josef Karthauser" , References: <20130716225013.1C63B23A@babel.karthauser.co.uk> <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk> Subject: Re: Drive failures with ada on FreeBSD-9.1, driver bug or wiring issue? Date: Thu, 18 Jul 2013 08:33:37 +0100 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="Windows-1252"; reply-type=original Content-Transfer-Encoding: 8bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: freebsd-stable@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 07:33:14 -0000 What chassis is this? ----- Original Message ----- From: "Dr Josef Karthauser" To: Cc: Sent: Thursday, July 18, 2013 8:29 AM Subject: Drive failures with ada on FreeBSD-9.1, driver bug or wiring issue? Hi there, I'm scratching my head. I've just migrated to a super micro chassis and at the same time gone from FreeBSD 9.0 to 9.1-RELEASE. The machine in question is running a ZFS mirror configuration on two ada devices (with a 8gb gmirror carved out for swap). Since doing so I've been having strange drop outs on the drives; the just disappear from the bus like so: (ada2:ahcich2:0:0:0): removing device entry (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00 (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT ) (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00 (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT ) (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted At first I though it was a failing drive - one of the drives did this, and I limped on a single drive for a week until I could get someone up to the rack to plug a third drive in. We resilvered the zpool onto the new device and ran with the failed drive still plugged in (but not responding to a reset on the ada bus with camcontrol) for a week or so. 
Then, the new drive dropped out in exactly the same way, followed in short order by the remaining original drive!!! After rebooting the machine, and observing all three drives probing and available, I resilvered the gmirror and zpool again on the two devices expected that I thought were reliable, but before the resilvering was completed the new drive dropped out again. I'm scratching my head now. I can't imagine that it's a wiring problem, as they are all on individual SATA buses and individually cabled. Smart isn't reporting an drive issues either…. :/ So, I'm wondering, is it a driver issuer with 9.1-RELEASE, if I upgrade to 9-RELENG would I expect that to resolve the problem? (Have there been any reported ada bus issuer reported since last December?) The hardware in question is: ahci0: port 0xf050-0xf057,0xf040-0xf043,0xf030-0xf037,0xf020-0xf023,0xf000-0xf01f mem 0xdfb02000-0xdfb027ff irq 19 at device 31.2 on pci0 ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported ahcich0: at channel 0 on ahci0 ahcich1: at channel 1 on ahci0 ahcich2: at channel 2 on ahci0 ahcich3: at channel 3 on ahci0 ahcich4: at channel 4 on ahci0 ahcich5: at channel 5 on ahci0 ada0 at ahcich0 bus 0 scbus0 target 0 lun 0 ada0: ATA-8 SATA 2.x device ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada0: Command Queueing enabled ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C) ada0: Previously was known as ad4 ada1 at ahcich1 bus 0 scbus1 target 0 lun 0 ada1: ATA-8 SATA 2.x device ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada1: Command Queueing enabled ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C) ada1: Previously was known as ad6 ada2 at ahcich2 bus 0 scbus2 target 0 lun 0 ada2: ATA-8 SATA 2.x device ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada2: Command Queueing enabled ada2: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C) ada2: Previously was known as ad8 Any ideas would be greatly welcomed. Thanks, Joe _______________________________________________ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org" ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. 
From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 07:53:33 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C388FF67 for ; Thu, 18 Jul 2013 07:53:33 +0000 (UTC) (envelope-from maurizio.vairani@cloverinformatica.it) Received: from smtpdg10.aruba.it (smtpdg4.aruba.it [62.149.158.234]) by mx1.freebsd.org (Postfix) with ESMTP id ACD9CCFE for ; Thu, 18 Jul 2013 07:53:31 +0000 (UTC) Received: from cloverinformatica.it ([188.10.129.202]) by smtpcmd04.ad.aruba.it with bizsmtp id 1jsB1m0114N8xN401jsCrs; Thu, 18 Jul 2013 09:52:15 +0200 Received: from [192.168.0.100] (MAURIZIO-PC [192.168.0.100]) by cloverinformatica.it (Postfix) with ESMTP id 3FDB9FCB3; Thu, 18 Jul 2013 09:52:12 +0200 (CEST) Message-ID: <51E79EAD.5040602@cloverinformatica.it> Date: Thu, 18 Jul 2013 09:52:13 +0200 From: Maurizio Vairani User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1 MIME-Version: 1.0 To: "Julian H. Stacey" Subject: Re: [SOLVED] Re: Shutdown problem with an USB memory stick as ZFS cache device References: <201307171529.r6HFT4EK063849@fire.js.berklix.net> In-Reply-To: <201307171529.r6HFT4EK063849@fire.js.berklix.net> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org, Ronald Klop X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 07:53:33 -0000 On 17/07/2013 17:29, Julian H. Stacey wrote: > Maurizio Vairani wrote: >> On 17/07/2013 11:50, Ronald Klop wrote: >>> On Wed, 17 Jul 2013 10:27:09 +0200, Maurizio Vairani >>> wrote: >>> >>>> Hi all, >>>> >>>> >>>> on a Compaq Presario laptop I have just installed the latest stable >>>> >>>> >>>> #uname -a >>>> >>>> FreeBSD presario 9.2-PRERELEASE FreeBSD 9.2-PRERELEASE #0: Tue Jul 16 >>>> 16:32:39 CEST 2013 root@presario:/usr/obj/usr/src/sys/GENERIC amd64 >>>> >>>> >>>> For speed up the compilation I have added to the pool, tank0, a >>>> SanDisk memory stick as cache device with the command: >>>> >>>> >>>> # zpool add tank0 cache /dev/da0 >>>> >>>> >>>> But when I shutdown the laptop the process will halt with this screen >>>> shot: >>>> >>>> >>>> http://www.dump-it.fr/freebsd-screen-shot/2f9169f18c7c77e52e873580f9c2d4bf.jpg.html >>>> >>>> >>>> >>>> and I need to press the power button for more than 4 seconds to >>>> switch off the laptop. >>>> >>>> The problem is always reproducible. >>> Does sysctl hw.usb.no_shutdown_wait=1 help? >>> >>> Ronald. >> Thank you Ronald it works ! >> >> In /boot/loader.conf added the line >> hw.usb.no_shutdown_wait=1 >> >> Maurizio > I wonder (from ignorance as I dont use ZFS yet), > if that merely masks the symptom or cures the fault ? > > Presumably one should use a ZFS command to disassociate whatever > might have the cache open ? (in case something might need to be > written out from cache, if it was a writeable cache ?) > > I too had a USB shutdown problem (non ZFS, now solved)& several people > made useful comments on shutdown scripts etc, so I'm cross referencing: > > http://lists.freebsd.org/pipermail/freebsd-mobile/2013-July/012803.html > > Cheers, > Julian Probably it masks the symptom. 
Andriy Gapon hypothesizes a bug in the ZFS clean up code: http://lists.freebsd.org/pipermail/freebsd-fs/2013-July/017857.html Surely one can use a startup script with the command: zpool add tank0 cache /dev/da0 and a shutdown script with: zpool remove tank0 /dev/da0 but this mask the symptom too. I prefer the Ronald solution because: - is simpler: it adds only one line (hw.usb.no_shutdown_wait=1) to one file (/boot/loader.conf). - is fastest: the zpool add/remove commands take time and “hw.usb.no_shutdown_wait=1” in /boot/loader.conf speeds up the shutdown process. - is cleaner: the zpool add/remove commands pair will fill up the tank0 pool history. Regards Maurizio From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 08:25:27 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 147B58BF; Thu, 18 Jul 2013 08:25:27 +0000 (UTC) (envelope-from rb@gid.co.uk) Received: from mx0.gid.co.uk (mx0.gid.co.uk [194.32.164.250]) by mx1.freebsd.org (Postfix) with ESMTP id C0550E7B; Thu, 18 Jul 2013 08:25:26 +0000 (UTC) Received: from [194.32.164.26] (80-46-130-69.static.dsl.as9105.com [80.46.130.69]) by mx0.gid.co.uk (8.14.2/8.14.2) with ESMTP id r6I8POHj066332; Thu, 18 Jul 2013 09:25:25 +0100 (BST) (envelope-from rb@gid.co.uk) Subject: Re: Drive failures with ada on FreeBSD-9.1, driver bug or wiring issue? Mime-Version: 1.0 (Apple Message framework v1283) Content-Type: text/plain; charset=windows-1252 From: Bob Bishop In-Reply-To: <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk> Date: Thu, 18 Jul 2013 09:25:19 +0100 Content-Transfer-Encoding: quoted-printable Message-Id: <281DBD06-81D5-4DDD-9464-B96C80C22C3F@gid.co.uk> References: <20130716225013.1C63B23A@babel.karthauser.co.uk> <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk> To: Dr Josef Karthauser X-Mailer: Apple Mail (2.1283) Cc: "freebsd-fs@freebsd.org" , "freebsd-stable@freebsd.org" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 08:25:27 -0000 Hi, On 18 Jul 2013, at 08:29, Dr Josef Karthauser wrote: > Hi there, >=20 > I'm scratching my head. I've just migrated to a super micro chassis = and at the same time gone from FreeBSD 9.0 to 9.1-RELEASE. >=20 > The machine in question is running a ZFS mirror configuration on two = ada devices (with a 8gb gmirror carved out for swap). >=20 > Since doing so I've been having strange drop outs on the drives; the = just disappear from the bus like so: >=20 > (ada2:ahcich2:0:0:0): removing device entry > (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00 > (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error > (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 = (ABRT ) > (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff > (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted > (aprobe0:ahcich2:0:0:0): NOP. 
ACB: 00 00 00 00 00 00 00 00 00 00 00 00 > (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error > (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 = (ABRT ) > (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff > (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted >=20 >=20 > At first I though it was a failing drive - one of the drives did this, = and I limped on a single drive for a week until I could get someone up = to the rack to plug a third drive in. We resilvered the zpool onto the = new device and ran with the failed drive still plugged in (but not = responding to a reset on the ada bus with camcontrol) for a week or so. >=20 > Then, the new drive dropped out in exactly the same way, followed in = short order by the remaining original drive!!! >=20 > After rebooting the machine, and observing all three drives probing = and available, I resilvered the gmirror and zpool again on the two = devices expected that I thought were reliable, but before the = resilvering was completed the new drive dropped out again. >=20 > I'm scratching my head now. I can't imagine that it's a wiring = problem, as they are all on individual SATA buses and individually = cabled. >=20 > Smart isn't reporting an drive issues either=85. :/ >=20 > So, I'm wondering, is it a driver issuer with 9.1-RELEASE, if I = upgrade to 9-RELENG would I expect that to resolve the problem? (Have = there been any reported ada bus issuer reported since last December?) >=20 > The hardware in question is: >=20 > ahci0: port = 0xf050-0xf057,0xf040-0xf043,0xf030-0xf037,0xf020-0xf023,0xf000-0xf01f = mem 0xdfb02000-0xdfb027ff irq 19 at device 31.2 on pci0 > ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported > ahcich0: at channel 0 on ahci0 > ahcich1: at channel 1 on ahci0 > ahcich2: at channel 2 on ahci0 > ahcich3: at channel 3 on ahci0 > ahcich4: at channel 4 on ahci0 > ahcich5: at channel 5 on ahci0 > ada0 at ahcich0 bus 0 scbus0 target 0 lun 0 > ada0: ATA-8 SATA 2.x device > ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) > ada0: Command Queueing enabled > ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C) > ada0: Previously was known as ad4 > ada1 at ahcich1 bus 0 scbus1 target 0 lun 0 > ada1: ATA-8 SATA 2.x device > ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) > ada1: Command Queueing enabled > ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C) > ada1: Previously was known as ad6 > ada2 at ahcich2 bus 0 scbus2 target 0 lun 0 > ada2: ATA-8 SATA 2.x device > ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) > ada2: Command Queueing enabled > ada2: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C) > ada2: Previously was known as ad8 >=20 >=20 > Any ideas would be greatly welcomed. >=20 > Thanks, > Joe Me too (over a long period, with various hardware). There is a general problem with energy-saving drives that controllers = don't understand them. Typically the drive decides to go into some = power-saving mode, the controller wants to do some operation, the drive = takes too long to come ready, the controller decides the drive has gone = away. You have to persuade the controller to wait longer for the drive to come = ready, and/or persuade the drive to stay awake. This isn't necessarily = easy, eg the controller's ready wait may not be programmable. (Or avoid such drives like the plague, life's too short). 
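If drive power saving is indeed the culprit as described above, one way to check for it and (where the drive supports it) switch it off is sketched below. This assumes sysutils/smartmontools is installed and that the drive honours APM and standby settings; ada2 is just the example device from this thread:

  # does the drive advertise Advanced Power Management?
  camcontrol identify ada2 | grep -i 'power management'
  # if supported, ask the drive not to power itself down
  smartctl -s apm,off /dev/ada2
  smartctl -s standby,off /dev/ada2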
-- Bob Bishop rb@gid.co.uk From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 09:34:58 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id C5400504; Thu, 18 Jul 2013 09:34:58 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id B404C1AA; Thu, 18 Jul 2013 09:34:57 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id MAA19305; Thu, 18 Jul 2013 12:34:55 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1Uzkbv-0004kP-I2; Thu, 18 Jul 2013 12:34:55 +0300 Message-ID: <51E7B686.4090509@FreeBSD.org> Date: Thu, 18 Jul 2013 12:33:58 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: Adrian Chadd , Konstantin Belousov Subject: Re: Deadlock in nullfs/zfs somewhere References: <51DCFEDA.1090901@FreeBSD.org> <51E59FD9.4020103@FreeBSD.org> <51E67F54.9080800@FreeBSD.org> In-Reply-To: X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 09:34:58 -0000 on 17/07/2013 20:19 Adrian Chadd said the following: > On 17 July 2013 04:26, Andriy Gapon wrote: >> One possibility is to add getnewvnode_reserve() calls before the ZFS transaction >> beginnings in the places where a new vnode/znode may have to be allocated within >> a transaction. >> This looks like a quick and cheap solution but it makes the code somewhat messier. >> >> Another possibility is to change something in VFS machinery, so that VOP_RECLAIM >> getting blocked for one filesystem does not prevent vnode allocation for other >> filesystems. >> >> I could think of other possible solutions via infrastructural changes in VFS or >> ZFS... > > Well, what do others think? This seems like a showstopper for systems > with lots and lots of ZFS filesystems doing lots and lots of activity. > Looks like others are not speaking yet :-) My current idea is that ZFS should set MNTK_SUSPEND in zfs_suspend_fs() path before acquiring its z_teardown* locks. This should make intentions of ZFS visible to VFS. And thus it should prevent VOP_RECLAIM call on a suspended ZFS filesystem and that should prevent vnlru_free() getting stuck. Hopefully this should break the deadlock cycle. Kostik, what is your opinion? 
For your convenience here is a message with my analysis of this issue: http://thread.gmane.org/gmane.os.freebsd.current/150889/focus=18534 -- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 10:43:57 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id E11CC69B; Thu, 18 Jul 2013 10:43:56 +0000 (UTC) (envelope-from godders@gmail.com) Received: from mail-qc0-x22a.google.com (mail-qc0-x22a.google.com [IPv6:2607:f8b0:400d:c01::22a]) by mx1.freebsd.org (Postfix) with ESMTP id 9755D664; Thu, 18 Jul 2013 10:43:56 +0000 (UTC) Received: by mail-qc0-f170.google.com with SMTP id s1so1641380qcw.1 for ; Thu, 18 Jul 2013 03:43:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=ZBPYMT1IGIdD6eDH1wQWLEzJObvK/lBROJZcfLNReWQ=; b=FdXSHoGfCY02bvlYMk51OOoKoi55LajFHINJ1U9faZoKySI1JwTCEo8ACcZZRcJJzB 3zdO8xutOhlfQAC+dqldZFILsp6FLqntFv1qnABrRBgVimg9IkA4dG7RDDGXPQbkYdLm 4NSzuqRnB26hjRPPmotkcndCqL32DKHv3r1hNP1wX1fqgPyxoxQr/FrPJfZ5+fdC1E8N P73WUaViq3Pg0/luZVuj0OZRNzIaVNIyYPNezYokfxQED8Buc2R7rpxRthkXryA8nY46 +ZUGrS5j5shDlgNVXZIioWFTzRoHESilO3TrJR1pfyXhsCyyG+OFSZPgeVj/J9NJxjxd fTOg== MIME-Version: 1.0 X-Received: by 10.229.105.218 with SMTP id u26mr2821014qco.8.1374144235032; Thu, 18 Jul 2013 03:43:55 -0700 (PDT) Received: by 10.49.52.65 with HTTP; Thu, 18 Jul 2013 03:43:54 -0700 (PDT) In-Reply-To: <20130717053431.GN5991@kib.kiev.ua> References: <201307151932.r6FJWSxM087108@chez.mckusick.com> <51E5CD7A.2020109@FreeBSD.org> <20130717053431.GN5991@kib.kiev.ua> Date: Thu, 18 Jul 2013 11:43:54 +0100 Message-ID: Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?) From: Dan Thomas To: Konstantin Belousov Content-Type: text/plain; charset=ISO-8859-1 Cc: Kirk McKusick , freebsd-fs@freebsd.org, Palle Girgensohn , Jeff Roberson , Julian Akehurst X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 10:43:57 -0000 After a bit of experimentation, we've managed to nail down a reasonably short run that exhibits this leak. Postgres' verbose log output is linked below - whatever is causing the leak is in there somewhere, but alas I lack the necessary understanding of Postgres' internals to be able to pin it down any further. https://dl.dropboxusercontent.com/u/13916028/pg_leak_log.txt I've also got a 2.4M ktrace of this run, which is still pretty big, I'll admit. Unfortunately it's got some data in it that I'd rather not publish, but I'm happy to send it directly to anyone who might find it useful. Thanks, Dan On 17 July 2013 06:34, Konstantin Belousov wrote: > On Wed, Jul 17, 2013 at 12:47:22AM +0200, Palle Girgensohn wrote: >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Kirk McKusick skrev: >> >> Date: Mon, 15 Jul 2013 10:51:10 +0100 Subject: Re: leaking lots of >> >> unreferenced inodes (pg_xlog files?) From: Dan Thomas >> >> To: Kirk McKusick Cc: >> >> Palle Girgensohn , freebsd-fs@freebsd.org, Jeff >> >> Roberson , Julian Akehurst >> >> X-ASK-Info: Message Queued (2013/07/15 >> >> 02:51:22) X-ASK-Info: Confirmed by User (2013/07/15 02:55:04) >> >> >> >> On 11 June 2013 01:17, Kirk McKusick >> >> wrote: >> >>> OK, good to have it narrowed down. 
I will look to devise some >> >>> additional diagnostics that hopefully will help tease out the >> >>> bug. I'll hopefully get back to you soon. >> >> Hi, >> >> >> >> Is there any news on this issue? We're still running several >> >> servers that are exhibiting this problem (most recently, one that >> >> seems to be leaking around 10gb/hour), and it's getting to the >> >> point where we're looking at moving to a different OS until it's >> >> resolved. >> >> >> >> We have access to several production systems with this problem and >> >> (at least from time to time) will have systems with a significant >> >> leak on them that we can experiment with. Is there any way we can >> >> assist with tracking this down? Any diagnostics or testing that >> >> would be useful? >> >> >> >> Thanks, Dan >> > >> > Hi Dan (and Palle), >> > >> > Sorry for the long delay with no help / news. I have gotten >> > side-tracked on several projects and have had little time to try and >> > devise some tests that would help find the cause of the lost space. >> > It almost certainly is a one-line fix (a missing vput or vrele >> > probably in some error path), but finding where it goes is the hard >> > part :-) >> > >> > I have had little success in inserting code that tracks reference >> > counts (too many false positives). So, I am going to need some help >> > from you to narrow it down. My belief is that there is some set of >> > filesystem operations (system calls) that are leading to the >> > problem. Notably, a file is being created, data put into it, then the >> > file is deleted (either before or after being closed). Somehow a >> > reference to that file is persisting despite there being no valid >> > reference to it. Hence the filesystem thinks it is still live and is >> > not deleting it. When you do the forcible unmount, these files get >> > cleared and the space shows back up. >> > >> > What I need to devise is a small test program doing the set of system >> > calls that cause this to happen. The way that I would like to try and >> > get it is to have you `ktrace -i' your application and then run your >> > application just long enough to create at least one of these lost >> > files. The goal is to minimize the amount of ktrace data through >> > which we need to sift. >> > >> > In preparation for doing this test you need to have a kernel compiled >> > with `option DIAGNOSTIC' or if you prefer, just add `#define >> > DIAGNOSTIC 1' to the top of sys/kern/vfs_subr.c. You will know you >> > have at least one offending file when you try to unmount the affected >> > filesystem and find it busy. Before doing the `umount -f', enable >> > busy printing using `sysctl debug.busyprt=1'. Then capture the >> > console output which will show the details of all the vnodes that had >> > to be forcibly flushed. Hopefully we will then be able to correlate >> > them back to the files (NAMI in the ktrace output) with which they >> > were associated. We may need to augment the NAMI data with the inode >> > number of the associated file to make the association with the >> > busyprt output. Anyway, once we have that, we can look at all the >> > system calls done on those files and create a small test program that >> > exhibits the problem. Given a small test program, Jeff or I can track >> > down the offending system call path and nail this pernicious bug once >> > and for all. 
>> > >> > Kirk McKusick >> >> Hi, >> >> I have run ktrace -i on pg_ctl (which forks off all the postgresql >> processes) and I got two "busy" files that where "lost" after a few >> hours. dmesg reveals this: >> >> vflush: busy vnode >> 0xfffffe067cdde960: tag ufs, type VREG >> usecount 1, writecount 0, refcount 2 mountedhere 0 >> flags (VI(0x200)) >> VI_LOCKed v_object 0xfffffe0335922000 ref 0 pages 0 >> lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) >> ino 11047146, on dev da2s1d >> vflush: busy vnode >> 0xfffffe039f35bb40: tag ufs, type VREG >> usecount 1, writecount 0, refcount 3 mountedhere 0 >> flags (VI(0x200)) >> VI_LOCKed v_object 0xfffffe03352701d0 ref 0 pages 0 >> lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) >> ino 11045961, on dev da2s1d >> >> >> I had to umount -f, so they where "lost". >> >> So, now I have 55 GB ktrace output... ;) Is there anything I can do to >> filter it, or shall I compress it and put it on a web server for you to >> fetch as it is? > > I think that 55GB of ktrace is obviously useless. The Kirk' idea was to > have an isolated test case that would only create the situation triggering > the leak, without irrelevant activity. This indeed requires drilling down > and isolating the file activities to get to the core of problem. > > FWIW, I and Peter Holm used the following alternative approach quite > successfully when tracking down other vnode reference leaks. The approach > still requires some understanding of the specifics of the problematic > files to be useful, but not as much as isolated test. > > Basically, you take the patch below, and set the VV_DEBUGVREF flag for > the vnode that has characteristics as much specific for the leaked vnode > as possible. The patch has example of setting the flag for all new NFS > vnodes. You would probably want to do the same in vfs_vgetf(), > checking e.g. for the partition where your leaks happen. The limiting > of the vnodes for which the vref traces are accumulated is needed to > save the kernel memory. > > Then after the leak was observed, you just print the vnode with ddb > command 'show vnode addr' and send the output to developer. > > Index: sys/sys/vnode.h > =================================================================== > --- sys/sys/vnode.h (revision 248723) > +++ sys/sys/vnode.h (working copy) > @@ -94,6 +94,13 @@ struct vpollinfo { > > #if defined(_KERNEL) || defined(_KVM_VNODE) > > +struct debug_ref { > + TAILQ_ENTRY(debug_ref) link; > + int val; > + const char *op; > + struct stack stack; > +}; > + > struct vnode { > /* > * Fields which define the identity of the vnode. These fields are > @@ -169,6 +176,7 @@ struct vnode { > int v_writecount; /* v ref count of writers */ > u_int v_hash; > enum vtype v_type; /* u vnode type */ > + TAILQ_HEAD(, debug_ref) v_debug_ref; > }; > > #endif /* defined(_KERNEL) || defined(_KVM_VNODE) */ > @@ -253,6 +261,7 @@ struct xvnode { > #define VV_DELETED 0x0400 /* should be removed */ > #define VV_MD 0x0800 /* vnode backs the md device */ > #define VV_FORCEINSMQ 0x1000 /* force the insmntque to succeed */ > +#define VV_DEBUGVREF 0x2000 > > /* > * Vnode attributes. 
A field value of VNOVAL represents a field whose value > Index: sys/kern/vfs_subr.c > =================================================================== > --- sys/kern/vfs_subr.c (revision 248723) > +++ sys/kern/vfs_subr.c (working copy) > @@ -71,6 +71,7 @@ __FBSDID("$FreeBSD$"); > #include > #include > #include > +#include > #include > #include > #include > @@ -871,6 +872,23 @@ static struct kproc_desc vnlru_kp = { > }; > SYSINIT(vnlru, SI_SUB_KTHREAD_UPDATE, SI_ORDER_FIRST, kproc_start, > &vnlru_kp); > + > +MALLOC_DEFINE(M_RECORD_REF, "recordref", "recordref"); > +static void > +v_record_ref(struct vnode *vp, int val, const char *op) > +{ > + struct debug_ref *r; > + > + if ((vp->v_type != VREG && vp->v_type != VBAD) || > + (vp->v_vflag & VV_DEBUGVREF) == 0) > + return; > + r = malloc(sizeof(struct debug_ref), M_RECORD_REF, M_NOWAIT | > + M_USE_RESERVE); > + r->val = val; > + r->op = op; > + stack_save(&r->stack); > + TAILQ_INSERT_TAIL(&vp->v_debug_ref, r, link); > +} > > /* > * Routines having to do with the management of the vnode table. > @@ -1073,6 +1091,7 @@ alloc: > vp->v_vflag |= VV_NOKNOTE; > } > rangelock_init(&vp->v_rl); > + TAILQ_INIT(&vp->v_debug_ref); > > /* > * For the filesystems which do not use vfs_hash_insert(), > @@ -1082,6 +1101,7 @@ alloc: > */ > vp->v_hash = (uintptr_t)vp >> vnsz2log; > > + TAILQ_INIT(&vp->v_debug_ref); > *vpp = vp; > return (0); > } > @@ -2197,6 +2217,7 @@ vget(struct vnode *vp, int flags, struct thread *t > vinactive(vp, td); > vp->v_iflag &= ~VI_OWEINACT; > } > + v_record_ref(vp, 1, "vget"); > VI_UNLOCK(vp); > return (0); > } > @@ -2211,6 +2232,7 @@ vref(struct vnode *vp) > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > VI_LOCK(vp); > v_incr_usecount(vp); > + v_record_ref(vp, 1, "vref"); > VI_UNLOCK(vp); > } > > @@ -2253,6 +2275,7 @@ vputx(struct vnode *vp, int func) > KASSERT(func == VPUTX_VRELE, ("vputx: wrong func")); > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > VI_LOCK(vp); > + v_record_ref(vp, -1, "vputx"); > > /* Skip this v_writecount check if we're going to panic below. */ > VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < 1, vp, > @@ -2409,6 +2432,7 @@ void > vdropl(struct vnode *vp) > { > struct bufobj *bo; > + struct debug_ref *r, *r1; > struct mount *mp; > int active; > > @@ -2489,6 +2513,9 @@ vdropl(struct vnode *vp) > lockdestroy(vp->v_vnlock); > mtx_destroy(&vp->v_interlock); > mtx_destroy(BO_MTX(bo)); > + TAILQ_FOREACH_SAFE(r, &vp->v_debug_ref, link, r1) { > + free(r, M_RECORD_REF); > + } > uma_zfree(vnode_zone, vp); > } > > @@ -2888,6 +2915,8 @@ vn_printf(struct vnode *vp, const char *fmt, ...) > va_list ap; > char buf[256], buf2[16]; > u_long flags; > + int ref; > + struct debug_ref *r; > > va_start(ap, fmt); > vprintf(fmt, ap); > @@ -2960,8 +2989,21 @@ vn_printf(struct vnode *vp, const char *fmt, ...) 
> vp->v_object->resident_page_count); > printf(" "); > lockmgr_printinfo(vp->v_vnlock); > - if (vp->v_data != NULL) > - VOP_PRINT(vp); > +#if DDB > + if (kdb_active) { > + if (vp->v_data != NULL) > + VOP_PRINT(vp); > + } > +#endif > + > + /* Getnewvnode() initial reference is not recorded due to VNON */ > + ref = 1; > + TAILQ_FOREACH(r, &vp->v_debug_ref, link) { > + ref += r->val; > + printf("REF %d %s\n", ref, r->op); > + stack_print(&r->stack); > + } > + > } > > #ifdef DDB > Index: sys/fs/nfsclient/nfs_clport.c > =================================================================== > --- sys/fs/nfsclient/nfs_clport.c (revision 248723) > +++ sys/fs/nfsclient/nfs_clport.c (working copy) > @@ -273,6 +273,7 @@ nfscl_nget(struct mount *mntp, struct vnode *dvp, > /* vfs_hash_insert() vput()'s the losing vnode */ > return (0); > } > + vp->v_vflag |= VV_DEBUGVREF; > *npp = np; > > return (0); > Index: sys/fs/nfsclient/nfs_clnode.c > =================================================================== > --- sys/fs/nfsclient/nfs_clnode.c (revision 248723) > +++ sys/fs/nfsclient/nfs_clnode.c (working copy) > @@ -179,6 +179,7 @@ ncl_nget(struct mount *mntp, u_int8_t *fhp, int fh > /* vfs_hash_insert() vput()'s the losing vnode */ > return (0); > } > + vp->v_vflag |= VV_DEBUGVREF; > *npp = np; > > return (0); From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 11:23:33 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 07292DFE; Thu, 18 Jul 2013 11:23:33 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 2AABB82F; Thu, 18 Jul 2013 11:23:31 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id r6IBNPfn014753; Thu, 18 Jul 2013 14:23:25 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua r6IBNPfn014753 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id r6IBNPwO014752; Thu, 18 Jul 2013 14:23:25 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 18 Jul 2013 14:23:25 +0300 From: Konstantin Belousov To: Dan Thomas Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?) 
Message-ID: <20130718112325.GZ5991@kib.kiev.ua> References: <201307151932.r6FJWSxM087108@chez.mckusick.com> <51E5CD7A.2020109@FreeBSD.org> <20130717053431.GN5991@kib.kiev.ua> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="MrbiU6dcJfOZ616B" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: Kirk McKusick , freebsd-fs@freebsd.org, Palle Girgensohn , Jeff Roberson , Julian Akehurst X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 11:23:33 -0000 --MrbiU6dcJfOZ616B Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jul 18, 2013 at 11:43:54AM +0100, Dan Thomas wrote: > After a bit of experimentation, we've managed to nail down a > reasonably short run that exhibits this leak. Postgres' verbose log > output is linked below - whatever is causing the leak is in there > somewhere, but alas I lack the necessary understanding of Postgres' > internals to be able to pin it down any further. >=20 > https://dl.dropboxusercontent.com/u/13916028/pg_leak_log.txt This is of no use, at least for me. >=20 > I've also got a 2.4M ktrace of this run, which is still pretty big, > I'll admit. Unfortunately it's got some data in it that I'd rather not > publish, but I'm happy to send it directly to anyone who might find it > useful. Such big ktrace is also unusable. If you want me to look at the leak, use the patch which I sent earlier, and add the flag to the vnodes which are likely to be leaked. Then, after 'show vnode', I would be able to see what is going on, I hope. >=20 > Thanks, >=20 > Dan >=20 > On 17 July 2013 06:34, Konstantin Belousov wrote: > > On Wed, Jul 17, 2013 at 12:47:22AM +0200, Palle Girgensohn wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- > >> Hash: SHA1 > >> > >> Kirk McKusick skrev: > >> >> Date: Mon, 15 Jul 2013 10:51:10 +0100 Subject: Re: leaking lots of > >> >> unreferenced inodes (pg_xlog files?) From: Dan Thomas > >> >> To: Kirk McKusick Cc: > >> >> Palle Girgensohn , freebsd-fs@freebsd.org, Jeff > >> >> Roberson , Julian Akehurst > >> >> X-ASK-Info: Message Queued (2013/07/15 > >> >> 02:51:22) X-ASK-Info: Confirmed by User (2013/07/15 02:55:04) > >> >> > >> >> On 11 June 2013 01:17, Kirk McKusick > >> >> wrote: > >> >>> OK, good to have it narrowed down. I will look to devise some > >> >>> additional diagnostics that hopefully will help tease out the > >> >>> bug. I'll hopefully get back to you soon. > >> >> Hi, > >> >> > >> >> Is there any news on this issue? We're still running several > >> >> servers that are exhibiting this problem (most recently, one that > >> >> seems to be leaking around 10gb/hour), and it's getting to the > >> >> point where we're looking at moving to a different OS until it's > >> >> resolved. > >> >> > >> >> We have access to several production systems with this problem and > >> >> (at least from time to time) will have systems with a significant > >> >> leak on them that we can experiment with. Is there any way we can > >> >> assist with tracking this down? Any diagnostics or testing that > >> >> would be useful? 
> >> >> > >> >> Thanks, Dan > >> > > >> > Hi Dan (and Palle), > >> > > >> > Sorry for the long delay with no help / news. I have gotten > >> > side-tracked on several projects and have had little time to try and > >> > devise some tests that would help find the cause of the lost space. > >> > It almost certainly is a one-line fix (a missing vput or vrele > >> > probably in some error path), but finding where it goes is the hard > >> > part :-) > >> > > >> > I have had little success in inserting code that tracks reference > >> > counts (too many false positives). So, I am going to need some help > >> > from you to narrow it down. My belief is that there is some set of > >> > filesystem operations (system calls) that are leading to the > >> > problem. Notably, a file is being created, data put into it, then the > >> > file is deleted (either before or after being closed). Somehow a > >> > reference to that file is persisting despite there being no valid > >> > reference to it. Hence the filesystem thinks it is still live and is > >> > not deleting it. When you do the forcible unmount, these files get > >> > cleared and the space shows back up. > >> > > >> > What I need to devise is a small test program doing the set of system > >> > calls that cause this to happen. The way that I would like to try and > >> > get it is to have you `ktrace -i' your application and then run your > >> > application just long enough to create at least one of these lost > >> > files. The goal is to minimize the amount of ktrace data through > >> > which we need to sift. > >> > > >> > In preparation for doing this test you need to have a kernel compiled > >> > with `option DIAGNOSTIC' or if you prefer, just add `#define > >> > DIAGNOSTIC 1' to the top of sys/kern/vfs_subr.c. You will know you > >> > have at least one offending file when you try to unmount the affected > >> > filesystem and find it busy. Before doing the `umount -f', enable > >> > busy printing using `sysctl debug.busyprt=3D1'. Then capture the > >> > console output which will show the details of all the vnodes that had > >> > to be forcibly flushed. Hopefully we will then be able to correlate > >> > them back to the files (NAMI in the ktrace output) with which they > >> > were associated. We may need to augment the NAMI data with the inode > >> > number of the associated file to make the association with the > >> > busyprt output. Anyway, once we have that, we can look at all the > >> > system calls done on those files and create a small test program that > >> > exhibits the problem. Given a small test program, Jeff or I can track > >> > down the offending system call path and nail this pernicious bug once > >> > and for all. > >> > > >> > Kirk McKusick > >> > >> Hi, > >> > >> I have run ktrace -i on pg_ctl (which forks off all the postgresql > >> processes) and I got two "busy" files that where "lost" after a few > >> hours. 
dmesg reveals this: > >> > >> vflush: busy vnode > >> 0xfffffe067cdde960: tag ufs, type VREG > >> usecount 1, writecount 0, refcount 2 mountedhere 0 > >> flags (VI(0x200)) > >> VI_LOCKed v_object 0xfffffe0335922000 ref 0 pages 0 > >> lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) > >> ino 11047146, on dev da2s1d > >> vflush: busy vnode > >> 0xfffffe039f35bb40: tag ufs, type VREG > >> usecount 1, writecount 0, refcount 3 mountedhere 0 > >> flags (VI(0x200)) > >> VI_LOCKed v_object 0xfffffe03352701d0 ref 0 pages 0 > >> lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723) > >> ino 11045961, on dev da2s1d > >> > >> > >> I had to umount -f, so they where "lost". > >> > >> So, now I have 55 GB ktrace output... ;) Is there anything I can do to > >> filter it, or shall I compress it and put it on a web server for you to > >> fetch as it is? > > > > I think that 55GB of ktrace is obviously useless. The Kirk' idea was to > > have an isolated test case that would only create the situation trigger= ing > > the leak, without irrelevant activity. This indeed requires drilling d= own > > and isolating the file activities to get to the core of problem. > > > > FWIW, I and Peter Holm used the following alternative approach quite > > successfully when tracking down other vnode reference leaks. The appro= ach > > still requires some understanding of the specifics of the problematic > > files to be useful, but not as much as isolated test. > > > > Basically, you take the patch below, and set the VV_DEBUGVREF flag for > > the vnode that has characteristics as much specific for the leaked vnode > > as possible. The patch has example of setting the flag for all new NFS > > vnodes. You would probably want to do the same in vfs_vgetf(), > > checking e.g. for the partition where your leaks happen. The limiting > > of the vnodes for which the vref traces are accumulated is needed to > > save the kernel memory. > > > > Then after the leak was observed, you just print the vnode with ddb > > command 'show vnode addr' and send the output to developer. > > > > Index: sys/sys/vnode.h > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > --- sys/sys/vnode.h (revision 248723) > > +++ sys/sys/vnode.h (working copy) > > @@ -94,6 +94,13 @@ struct vpollinfo { > > > > #if defined(_KERNEL) || defined(_KVM_VNODE) > > > > +struct debug_ref { > > + TAILQ_ENTRY(debug_ref) link; > > + int val; > > + const char *op; > > + struct stack stack; > > +}; > > + > > struct vnode { > > /* > > * Fields which define the identity of the vnode. These fields= are > > @@ -169,6 +176,7 @@ struct vnode { > > int v_writecount; /* v ref count of write= rs */ > > u_int v_hash; > > enum vtype v_type; /* u vnode type */ > > + TAILQ_HEAD(, debug_ref) v_debug_ref; > > }; > > > > #endif /* defined(_KERNEL) || defined(_KVM_VNODE) */ > > @@ -253,6 +261,7 @@ struct xvnode { > > #define VV_DELETED 0x0400 /* should be removed */ > > #define VV_MD 0x0800 /* vnode backs the md device */ > > #define VV_FORCEINSMQ 0x1000 /* force the insmntque to succe= ed */ > > +#define VV_DEBUGVREF 0x2000 > > > > /* > > * Vnode attributes. 
A field value of VNOVAL represents a field whose= value > > Index: sys/kern/vfs_subr.c > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > --- sys/kern/vfs_subr.c (revision 248723) > > +++ sys/kern/vfs_subr.c (working copy) > > @@ -71,6 +71,7 @@ __FBSDID("$FreeBSD$"); > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -871,6 +872,23 @@ static struct kproc_desc vnlru_kp =3D { > > }; > > SYSINIT(vnlru, SI_SUB_KTHREAD_UPDATE, SI_ORDER_FIRST, kproc_start, > > &vnlru_kp); > > + > > +MALLOC_DEFINE(M_RECORD_REF, "recordref", "recordref"); > > +static void > > +v_record_ref(struct vnode *vp, int val, const char *op) > > +{ > > + struct debug_ref *r; > > + > > + if ((vp->v_type !=3D VREG && vp->v_type !=3D VBAD) || > > + (vp->v_vflag & VV_DEBUGVREF) =3D=3D 0) > > + return; > > + r =3D malloc(sizeof(struct debug_ref), M_RECORD_REF, M_NOWAIT | > > + M_USE_RESERVE); > > + r->val =3D val; > > + r->op =3D op; > > + stack_save(&r->stack); > > + TAILQ_INSERT_TAIL(&vp->v_debug_ref, r, link); > > +} > > > > /* > > * Routines having to do with the management of the vnode table. > > @@ -1073,6 +1091,7 @@ alloc: > > vp->v_vflag |=3D VV_NOKNOTE; > > } > > rangelock_init(&vp->v_rl); > > + TAILQ_INIT(&vp->v_debug_ref); > > > > /* > > * For the filesystems which do not use vfs_hash_insert(), > > @@ -1082,6 +1101,7 @@ alloc: > > */ > > vp->v_hash =3D (uintptr_t)vp >> vnsz2log; > > > > + TAILQ_INIT(&vp->v_debug_ref); > > *vpp =3D vp; > > return (0); > > } > > @@ -2197,6 +2217,7 @@ vget(struct vnode *vp, int flags, struct thread *t > > vinactive(vp, td); > > vp->v_iflag &=3D ~VI_OWEINACT; > > } > > + v_record_ref(vp, 1, "vget"); > > VI_UNLOCK(vp); > > return (0); > > } > > @@ -2211,6 +2232,7 @@ vref(struct vnode *vp) > > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > > VI_LOCK(vp); > > v_incr_usecount(vp); > > + v_record_ref(vp, 1, "vref"); > > VI_UNLOCK(vp); > > } > > > > @@ -2253,6 +2275,7 @@ vputx(struct vnode *vp, int func) > > KASSERT(func =3D=3D VPUTX_VRELE, ("vputx: wrong func")); > > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > > VI_LOCK(vp); > > + v_record_ref(vp, -1, "vputx"); > > > > /* Skip this v_writecount check if we're going to panic below. = */ > > VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < = 1, vp, > > @@ -2409,6 +2432,7 @@ void > > vdropl(struct vnode *vp) > > { > > struct bufobj *bo; > > + struct debug_ref *r, *r1; > > struct mount *mp; > > int active; > > > > @@ -2489,6 +2513,9 @@ vdropl(struct vnode *vp) > > lockdestroy(vp->v_vnlock); > > mtx_destroy(&vp->v_interlock); > > mtx_destroy(BO_MTX(bo)); > > + TAILQ_FOREACH_SAFE(r, &vp->v_debug_ref, link, r1) { > > + free(r, M_RECORD_REF); > > + } > > uma_zfree(vnode_zone, vp); > > } > > > > @@ -2888,6 +2915,8 @@ vn_printf(struct vnode *vp, const char *fmt, ...) > > va_list ap; > > char buf[256], buf2[16]; > > u_long flags; > > + int ref; > > + struct debug_ref *r; > > > > va_start(ap, fmt); > > vprintf(fmt, ap); > > @@ -2960,8 +2989,21 @@ vn_printf(struct vnode *vp, const char *fmt, ...) 
> > vp->v_object->resident_page_count); > > printf(" "); > > lockmgr_printinfo(vp->v_vnlock); > > - if (vp->v_data !=3D NULL) > > - VOP_PRINT(vp); > > +#if DDB > > + if (kdb_active) { > > + if (vp->v_data !=3D NULL) > > + VOP_PRINT(vp); > > + } > > +#endif > > + > > + /* Getnewvnode() initial reference is not recorded due to VNON = */ > > + ref =3D 1; > > + TAILQ_FOREACH(r, &vp->v_debug_ref, link) { > > + ref +=3D r->val; > > + printf("REF %d %s\n", ref, r->op); > > + stack_print(&r->stack); > > + } > > + > > } > > > > #ifdef DDB > > Index: sys/fs/nfsclient/nfs_clport.c > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > --- sys/fs/nfsclient/nfs_clport.c (revision 248723) > > +++ sys/fs/nfsclient/nfs_clport.c (working copy) > > @@ -273,6 +273,7 @@ nfscl_nget(struct mount *mntp, struct vnode *dvp, > > /* vfs_hash_insert() vput()'s the losing vnode */ > > return (0); > > } > > + vp->v_vflag |=3D VV_DEBUGVREF; > > *npp =3D np; > > > > return (0); > > Index: sys/fs/nfsclient/nfs_clnode.c > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > --- sys/fs/nfsclient/nfs_clnode.c (revision 248723) > > +++ sys/fs/nfsclient/nfs_clnode.c (working copy) > > @@ -179,6 +179,7 @@ ncl_nget(struct mount *mntp, u_int8_t *fhp, int fh > > /* vfs_hash_insert() vput()'s the losing vnode */ > > return (0); > > } > > + vp->v_vflag |=3D VV_DEBUGVREF; > > *npp =3D np; > > > > return (0); --MrbiU6dcJfOZ616B Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR59AsAAoJEJDCuSvBvK1BYsMQAJZj+18pwjeJQfyBoaK4Rif7 FK32mFljb85RxdGs9nX4Pmq91/00B6lv+8HmdlydYbOw4qiRm8x/hNverAr2GHSf 5ArGEJeTCwfXheG+kulivTo+sMrapeyR6XN5THIHjglBjrSBu8nUrAyNzyjaOFRq 2tLDn/NdibMJeBUKkVWMV3L7cmrIw3snF+kJc6f/1iDnBahOKPADxAHo/N1Exg2s AC5wUuG+d5lrb/jFYaSoND1eDnWIVVu588GQuXrIo9N9GM9D+UVk1OM2pLEOISLM 7hL5mKagLD4wHpzW9FW6nlQjDcGQqJfkvYp+PqsVO6KGaVCk740N3rItDk9WGTfn WNMGl0i9rDopvuaOBX/BwLrwdN/TQaXTHPdVdOHjDWyjqlmaG2r37AYZ2OOVfvhr EH2UOAj1bcRfucGOrczxjSRFEI7honiOuw48RYZYNd4WUujSaA61vINdDXIhrYky /+kTwvGpoBxNvE7pMmUg1fI0Ww/Kp1QsaObh/Kb9KmbOKDSNc8luVfLNbU+EnQaS bfS9eUUkPvk86lgMLVoXRoXV707IF0r7SBosRgrc9IVc9xZjj4fNqDIvUiZrSoow hQOPX6X9fZ1DSCngp9DuL4GZocv3levC/8wirpAydi5dTHs2XoP66P8Opn+SBzct TRwTCYGSbZslIgTHuvMd =NmSV -----END PGP SIGNATURE----- --MrbiU6dcJfOZ616B-- From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 11:28:18 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 8BA2FED2; Thu, 18 Jul 2013 11:28:18 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id F295F85A; Thu, 18 Jul 2013 11:28:17 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id r6IBSEJA015801; Thu, 18 Jul 2013 14:28:14 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua r6IBSEJA015801 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id r6IBSEMo015800; Thu, 18 Jul 2013 14:28:14 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik 
set sender to kostikbel@gmail.com using -f Date: Thu, 18 Jul 2013 14:28:14 +0300 From: Konstantin Belousov To: Andriy Gapon Subject: Re: Deadlock in nullfs/zfs somewhere Message-ID: <20130718112814.GA5991@kib.kiev.ua> References: <51DCFEDA.1090901@FreeBSD.org> <51E59FD9.4020103@FreeBSD.org> <51E67F54.9080800@FreeBSD.org> <51E7B686.4090509@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="GFPlsJ7YtLjXgs8j" Content-Disposition: inline In-Reply-To: <51E7B686.4090509@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-fs@FreeBSD.org, Adrian Chadd X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 11:28:18 -0000 --GFPlsJ7YtLjXgs8j Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jul 18, 2013 at 12:33:58PM +0300, Andriy Gapon wrote: > on 17/07/2013 20:19 Adrian Chadd said the following: > > On 17 July 2013 04:26, Andriy Gapon wrote: > >> One possibility is to add getnewvnode_reserve() calls before the ZFS t= ransaction > >> beginnings in the places where a new vnode/znode may have to be alloca= ted within > >> a transaction. > >> This looks like a quick and cheap solution but it makes the code somew= hat messier. > >> > >> Another possibility is to change something in VFS machinery, so that V= OP_RECLAIM > >> getting blocked for one filesystem does not prevent vnode allocation f= or other > >> filesystems. > >> > >> I could think of other possible solutions via infrastructural changes = in VFS or > >> ZFS... > >=20 > > Well, what do others think? This seems like a showstopper for systems > > with lots and lots of ZFS filesystems doing lots and lots of activity. > >=20 >=20 > Looks like others are not speaking yet :-) >=20 > My current idea is that ZFS should set MNTK_SUSPEND in zfs_suspend_fs() p= ath > before acquiring its z_teardown* locks. This should make intentions of Z= FS > visible to VFS. And thus it should prevent VOP_RECLAIM call on a suspend= ed ZFS > filesystem and that should prevent vnlru_free() getting stuck. > Hopefully this should break the deadlock cycle. >=20 > Kostik, >=20 > what is your opinion? > For your convenience here is a message with my analysis of this issue: > http://thread.gmane.org/gmane.os.freebsd.current/150889/focus=3D18534 Well, I have no opinion. Making the fs suspended, in other words, preventi= ng writers from entering the filesystem code, is probably good. I do not know zfs code to usefully comment on the approach. Note that you must drain existing writers, i.e. call vfs_write_suspend(), to set MNTK_SUSPEND. 
--GFPlsJ7YtLjXgs8j Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR59FNAAoJEJDCuSvBvK1ByYMP/3njOyjvWN3fDjUVHiJmFgL+ 9STNHDkzaHDTBd7TDtybGrqkljLoSrjkC9LVl6MyRwq2olJ1yhYQKmOlkaBeOaJt rnuXvyGA2Wz4XTUIZVWaV/wtEPUMPskYv60ibYx00JuOFwA/oNR7J7fp/7bPirJ6 jPwQ+W9wU/Qzls3rMmhV2owqhSIUQD8egTB3Es/5Cda/+8zjR9yoQK0KLLCU4GbY n8740XueGxZkTvM2C0ZstQ4JvRAbrRLKT7mCHadISov+ErPPwnnuWIYtYhB/gcq0 i9U5/JMNRyiTlyyDSEiePBtxf+iY9sxWYHi1hwWIWG28rLH3exEGn6kKzXB4q4Pe NzRGJB4p8drGZb4NoUAikhqquY7Jmm8to5NMJzepV9AKa2a08WSHM4SMgk60oeUq NO+XSpnazZK9Bu7shrYnlWdUjXAPzUzUlQArTRmI9cQjkEWiTzwpY2TFn6AFbvwM HUu/AdDP4EBvrW/dyAeLmgocbErqZpNLlemLTBl6I3kfgB/Ytd3VcHbWZCMgP8cS 3DDbbaPqj6eFxXqObDgp+hAPhUaFvO8RW+FH3/SMj+zGjQ9+tmW9L47hB1jHbO8z QIqXAQaAoqhATNurGVqj4qUtb3YX157Csw5+nRMTQ/IRmJghb5W5OxEALNyranmY d4655Qai0ShZPuB/v8ZD =aHUZ -----END PGP SIGNATURE----- --GFPlsJ7YtLjXgs8j-- From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 13:41:49 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id E90F79FC; Thu, 18 Jul 2013 13:41:49 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 0878DEBE; Thu, 18 Jul 2013 13:41:48 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id QAA22718; Thu, 18 Jul 2013 16:41:39 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1UzoSg-00057j-T6; Thu, 18 Jul 2013 16:41:38 +0300 Message-ID: <51E7F05A.5020609@FreeBSD.org> Date: Thu, 18 Jul 2013 16:40:42 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: Konstantin Belousov , Adrian Chadd Subject: Re: Deadlock in nullfs/zfs somewhere References: <51DCFEDA.1090901@FreeBSD.org> <51E59FD9.4020103@FreeBSD.org> <51E67F54.9080800@FreeBSD.org> <51E7B686.4090509@FreeBSD.org> <20130718112814.GA5991@kib.kiev.ua> In-Reply-To: <20130718112814.GA5991@kib.kiev.ua> X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 13:41:50 -0000 on 18/07/2013 14:28 Konstantin Belousov said the following: > Well, I have no opinion. Making the fs suspended, in other words, preventing > writers from entering the filesystem code, is probably good. I do not > know zfs code to usefully comment on the approach. OK, fair. > Note that you must drain existing writers, i.e. call vfs_write_suspend(), > to set MNTK_SUSPEND. Here is my take on it, not tested at all. 
diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c index 0fc59cc..59c8cbd 100644 --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c @@ -2263,8 +2263,12 @@ zfs_suspend_fs(zfsvfs_t *zfsvfs) { int error; - if ((error = zfsvfs_teardown(zfsvfs, B_FALSE)) != 0) + if ((error = vfs_write_suspend(zfsvfs->z_vfs)) != 0) return (error); + if ((error = zfsvfs_teardown(zfsvfs, B_FALSE)) != 0) { + vfs_write_resume(mp, 0); + return (error); + } dmu_objset_disown(zfsvfs->z_os, zfsvfs); return (0); @@ -2339,5 +2343,6 @@ bail: rrw_exit(&zfsvfs->z_teardown_lock, FTAG); + vfs_write_resume(mp, 0); if (err) { /* * Since we couldn't reopen zfsvfs::z_os, or -- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 18:52:20 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id B9B36FDA; Thu, 18 Jul 2013 18:52:20 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 13BFFFD1; Thu, 18 Jul 2013 18:52:19 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id r6IIqF2c013999; Thu, 18 Jul 2013 21:52:15 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua r6IIqF2c013999 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id r6IIqFBP013998; Thu, 18 Jul 2013 21:52:15 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 18 Jul 2013 21:52:15 +0300 From: Konstantin Belousov To: Andriy Gapon Subject: Re: Deadlock in nullfs/zfs somewhere Message-ID: <20130718185215.GE5991@kib.kiev.ua> References: <51DCFEDA.1090901@FreeBSD.org> <51E59FD9.4020103@FreeBSD.org> <51E67F54.9080800@FreeBSD.org> <51E7B686.4090509@FreeBSD.org> <20130718112814.GA5991@kib.kiev.ua> <51E7F05A.5020609@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="6PtN/jU//tuarfdA" Content-Disposition: inline In-Reply-To: <51E7F05A.5020609@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-fs@FreeBSD.org, Adrian Chadd X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 18:52:20 -0000 --6PtN/jU//tuarfdA Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jul 18, 2013 at 04:40:42PM +0300, Andriy Gapon wrote: > on 18/07/2013 14:28 Konstantin Belousov said the following: > > Well, I have no opinion. Making the fs suspended, in other words, prev= enting > > writers from entering the filesystem code, is probably good. I do not > > know zfs code to usefully comment on the approach. >=20 > OK, fair. >=20 > > Note that you must drain existing writers, i.e. call vfs_write_suspend(= ), > > to set MNTK_SUSPEND. >=20 > Here is my take on it, not tested at all. 
>=20 > diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c > b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c > index 0fc59cc..59c8cbd 100644 > --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c > +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c > @@ -2263,8 +2263,12 @@ zfs_suspend_fs(zfsvfs_t *zfsvfs) > { > int error; >=20 > - if ((error =3D zfsvfs_teardown(zfsvfs, B_FALSE)) !=3D 0) > + if ((error =3D vfs_write_suspend(zfsvfs->z_vfs)) !=3D 0) > return (error); > + if ((error =3D zfsvfs_teardown(zfsvfs, B_FALSE)) !=3D 0) { > + vfs_write_resume(mp, 0); > + return (error); > + } > dmu_objset_disown(zfsvfs->z_os, zfsvfs); >=20 > return (0); > @@ -2339,5 +2343,6 @@ bail: > rrw_exit(&zfsvfs->z_teardown_lock, FTAG); >=20 > + vfs_write_resume(mp, 0); > if (err) { > /* > * Since we couldn't reopen zfsvfs::z_os, or There is VFS method VFS_SUSP_CLEAN, called when the suspension is lifted. UFS uses it to clean the back-queue of work which were not performed during the suspend, mostly inactivate the postponed inactive vnodes. ZFS probably does not need it, since it does not check for MNTK_SUSPEND, but if it starts care, there is a place to put the code. On the other hand, I believe that your patch is notoriously incomplete, because there should be a lot of threads which mutate ZFS mounts state and which do not call vn_start_write() around the mutations. I.e. all ZFS top-level code which calls into ZFS ops and which is not coming from VFS. --6PtN/jU//tuarfdA Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR6DlfAAoJEJDCuSvBvK1BMcAP/jw+dYeUw6RjJZCmJ7I6XrHa aMYHMVjPuUH3h5jyE3avPZXd1OU67GkKmu+pQMF9qaMzPie6IkNK7FnK8Ey2Wy9D 1chQXv1ccaAvJTDk1VZbEWtctZ38V5CPm34ZbwGWk0wImjMU03C6D+fZRc9VzozM hyxQAyDc89Nmmcxn6BnL4INJAAIASVB3QcRY2lO7/FKjAR/yxmy6R74/HvvRNDhk QYvvSFJzNnB9wQByupYY69hbz18VQI3hyTMzh3xZkt9JcH2oCHanl8WkQUIxB4eJ ZCBsKFZdV+rKLnXHfP4tlXmmXLTzFVJlf5u+vS1l4PhLWsV2IZceImFDRLaxq1OC v2pF0DM2txWRODrsGY5Ie/DUwm7DoUghpu27rirkTcx84w3BWoD4/F5iEM1J+ZWY MXyC7Gj4560v0lE4mDyw0ZKDVfOsmWH5dx4ElwFCk6Wvxk4/+Wg3qtnZlTsKMZkM tJkw8lCj7xzg/FCg5ukXRnfuPixB3eiAbyEaQF1qR391YYypwfwSTfg3E2lYnSs0 BoijDzDaCj+8xlvl6CPY+/YKY6j1rTXFXcQQ5ayviOHVJjWo3kQAHukyWnZw/glx fZR8rZkv13iAnXSxulJW4AZA3O0Ahy42IuWfVdL7g3cRFEzHPvkXkbi4h/l6J/Wf uLiHucS8MyFMDh5GyDg/ =Bddh -----END PGP SIGNATURE----- --6PtN/jU//tuarfdA-- From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 19:18:22 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 54DB782D; Thu, 18 Jul 2013 19:18:22 +0000 (UTC) (envelope-from joe@karthauser.co.uk) Received: from babel.karthauser.co.uk (212-13-197-151.karthauser.co.uk [212.13.197.151]) by mx1.freebsd.org (Postfix) with ESMTP id 24191188; Thu, 18 Jul 2013 19:18:21 +0000 (UTC) Received: from phoenix.fritz.box (unknown [81.187.183.70]) (Authenticated sender: joemail@tao.org.uk) by babel.karthauser.co.uk (Postfix) with ESMTPSA id 2C0A12AF5; Thu, 18 Jul 2013 19:18:21 +0000 (UTC) Subject: Re: Drive failures with ada on FreeBSD-9.1, driver bug or wiring issue? 
Mime-Version: 1.0 (Mac OS X Mail 6.5 \(1508\)) Content-Type: text/plain; charset=windows-1252 From: Dr Josef Karthauser X-Priority: 3 In-Reply-To: <33EF2240EDC1446D8E45F8C51974136B@multiplay.co.uk> Date: Thu, 18 Jul 2013 20:18:20 +0100 Content-Transfer-Encoding: 7bit Message-Id: References: <20130716225013.1C63B23A@babel.karthauser.co.uk> <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk> <33EF2240EDC1446D8E45F8C51974136B@multiplay.co.uk> To: "Steven Hartland" X-Mailer: Apple Mail (2.1508) Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 19:18:22 -0000 On 18 Jul 2013, at 08:33, "Steven Hartland" wrote: > What chassis is this? Hey Steven, It's a Supermicro CSE-813MTQ-350CB. Cheers, Joe From owner-freebsd-fs@FreeBSD.ORG Thu Jul 18 19:35:45 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 12B244A2; Thu, 18 Jul 2013 19:35:45 +0000 (UTC) (envelope-from prvs=1911771df7=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 8719A2BC; Thu, 18 Jul 2013 19:35:44 +0000 (UTC) Received: from r2d2 ([82.69.141.170]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50005048668.msg; Thu, 18 Jul 2013 20:35:41 +0100 X-Spam-Processed: mail1.multiplay.co.uk, Thu, 18 Jul 2013 20:35:41 +0100 (not processed: message from valid local sender) X-MDDKIM-Result: neutral (mail1.multiplay.co.uk) X-MDRemoteIP: 82.69.141.170 X-Return-Path: prvs=1911771df7=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID: <964B87D56B7C4B529E995E7A660E9AAD@multiplay.co.uk> From: "Steven Hartland" To: "Dr Josef Karthauser" References: <20130716225013.1C63B23A@babel.karthauser.co.uk> <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk> <33EF2240EDC1446D8E45F8C51974136B@multiplay.co.uk> Subject: Re: Drive failures with ada on FreeBSD-9.1, driver bug or wiring issue? Date: Thu, 18 Jul 2013 20:36:03 +0100 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Jul 2013 19:35:45 -0000 ----- Original Message ----- From: "Dr Josef Karthauser" > On 18 Jul 2013, at 08:33, "Steven Hartland" wrote: > >> What chassis is this? > > Hey Steven, > > It's a Supermicro CSE-813MTQ-350CB. We've seen issues on supermicro chassis before which cause timeouts and in extreme cases device drops so if you can try wiring the disks up directly to the MB via sata cables bypassing the hotswap midplane and see if that helps. Regards Steve ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. 
In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. From owner-freebsd-fs@FreeBSD.ORG Fri Jul 19 10:19:31 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id CC5088D4; Fri, 19 Jul 2013 10:19:31 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id E07A5EB2; Fri, 19 Jul 2013 10:19:30 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id NAA07949; Fri, 19 Jul 2013 13:19:27 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1V07mZ-0009cr-FX; Fri, 19 Jul 2013 13:19:27 +0300 Message-ID: <51E91277.3070309@FreeBSD.org> Date: Fri, 19 Jul 2013 13:18:31 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: Konstantin Belousov Subject: Re: Deadlock in nullfs/zfs somewhere References: <51DCFEDA.1090901@FreeBSD.org> <51E59FD9.4020103@FreeBSD.org> <51E67F54.9080800@FreeBSD.org> <51E7B686.4090509@FreeBSD.org> <20130718112814.GA5991@kib.kiev.ua> <51E7F05A.5020609@FreeBSD.org> <20130718185215.GE5991@kib.kiev.ua> In-Reply-To: <20130718185215.GE5991@kib.kiev.ua> X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org, Adrian Chadd X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Jul 2013 10:19:31 -0000 on 18/07/2013 21:52 Konstantin Belousov said the following: > There is VFS method VFS_SUSP_CLEAN, called when the suspension is > lifted. UFS uses it to clean the back-queue of work which were > not performed during the suspend, mostly inactivate the postponed > inactive vnodes. ZFS probably does not need it, since it does > not check for MNTK_SUSPEND, but if it starts care, there is a place > to put the code. I will keep this in mind. > On the other hand, I believe that your patch is notoriously incomplete, > because there should be a lot of threads which mutate ZFS mounts state > and which do not call vn_start_write() around the mutations. I.e. > all ZFS top-level code which calls into ZFS ops and which is not > coming from VFS. I agree. What I am trying to fix right now is VFS<->ZFS interaction. I think that ZFS<->ZFS should already be fine - it's protected by internal ZFS locking. OTOH, perhaps my understanding of what you said is incomplete or incorrect, because VFS suspension mechanism is completely unknown to me yet. 
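To make the "non-VFS mutators" point above concrete, here is a minimal sketch of what bracketing a ZFS-internal entry point could look like; the function name is invented, and it assumes the mount-based form of vn_start_write() (NULL vnode, mount supplied by the caller) is acceptable in such paths:

    /* Hypothetical ZFS-internal operation that mutates fs state. */
    static int
    zfs_internal_mutator(zfsvfs_t *zfsvfs)
    {
            struct mount *mp;
            int error;

            mp = zfsvfs->z_vfs;
            /* Honour an active suspension the same way VFS callers do. */
            error = vn_start_write(NULL, &mp, V_WAIT);
            if (error != 0)
                    return (error);
            /* ... mutate in-core/on-disk state under ZFS's own locks ... */
            vn_finished_write(mp);
            return (0);
    }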
-- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Fri Jul 19 10:22:18 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id BBB96A92; Fri, 19 Jul 2013 10:22:18 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id D466BED5; Fri, 19 Jul 2013 10:22:17 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id NAA07981; Fri, 19 Jul 2013 13:22:15 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1V07pG-0009d9-Vk; Fri, 19 Jul 2013 13:22:15 +0300 Message-ID: <51E9131F.1060707@FreeBSD.org> Date: Fri, 19 Jul 2013 13:21:19 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: Konstantin Belousov Subject: Re: zfs_rename: another zfs+vfs deadlock References: <51E679FD.3040306@FreeBSD.org> <20130717194557.GU5991@kib.kiev.ua> In-Reply-To: <20130717194557.GU5991@kib.kiev.ua> X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org, zfs-devel@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Jul 2013 10:22:18 -0000 on 17/07/2013 22:45 Konstantin Belousov said the following: > On Wed, Jul 17, 2013 at 02:03:25PM +0300, Andriy Gapon wrote: >> A scenario to reproduce this bug could be like this. >> mkdir a >> mkdir a/b >> mv some-file a/b/ (in parallel with) stat a/b >> Of course it would have to be repeated many times to hit the right timing >> window. Also, namecache could interfere with this scenario, but I am not sure. >> > > There is no questions or proposals on how to approach the fix, JFYI mail ? I was just reporting the problem and my analysis of it. A question of "how to fix" was implied. > I recommend you to look at the ufs_checkpath() and its use in the > ufs_rename(). Thank you. That code is enlightening. I do not think that the approach is directly applicable to zfs_rename, unfortunately. But I will try to see if the same kind of approach could be used. Also, I noticed that ufs_rename() checks for cross-device rename. Should all filesystems do that or should that check belong to VFS layer (if not already done there)? 
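For context, the check in question is small; roughly, in the style of ufs_rename() (error-path cleanup elided, variable names as in that function):

    /*
     * fvp: source vnode, tdvp: target directory vnode, tvp: existing
     * target vnode or NULL.
     */
    if (fvp->v_mount != tdvp->v_mount ||
        (tvp != NULL && fvp->v_mount != tvp->v_mount)) {
            error = EXDEV;
            goto abortit;
    }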
-- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Fri Jul 19 10:30:29 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 83AD6D48; Fri, 19 Jul 2013 10:30:29 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 2654AF3A; Fri, 19 Jul 2013 10:30:28 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id r6JAUPGc027572; Fri, 19 Jul 2013 13:30:25 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua r6JAUPGc027572 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id r6JAUPxH027549; Fri, 19 Jul 2013 13:30:25 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 19 Jul 2013 13:30:25 +0300 From: Konstantin Belousov To: Andriy Gapon Subject: Re: Deadlock in nullfs/zfs somewhere Message-ID: <20130719103025.GJ5991@kib.kiev.ua> References: <51E59FD9.4020103@FreeBSD.org> <51E67F54.9080800@FreeBSD.org> <51E7B686.4090509@FreeBSD.org> <20130718112814.GA5991@kib.kiev.ua> <51E7F05A.5020609@FreeBSD.org> <20130718185215.GE5991@kib.kiev.ua> <51E91277.3070309@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="WHG05yakhlzm8Hk1" Content-Disposition: inline In-Reply-To: <51E91277.3070309@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-fs@FreeBSD.org, Adrian Chadd X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Jul 2013 10:30:29 -0000 --WHG05yakhlzm8Hk1 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Jul 19, 2013 at 01:18:31PM +0300, Andriy Gapon wrote: > on 18/07/2013 21:52 Konstantin Belousov said the following: > > There is VFS method VFS_SUSP_CLEAN, called when the suspension is > > lifted. UFS uses it to clean the back-queue of work which were > > not performed during the suspend, mostly inactivate the postponed > > inactive vnodes. ZFS probably does not need it, since it does > > not check for MNTK_SUSPEND, but if it starts care, there is a place > > to put the code. >=20 > I will keep this in mind. >=20 > > On the other hand, I believe that your patch is notoriously incomplete, > > because there should be a lot of threads which mutate ZFS mounts state > > and which do not call vn_start_write() around the mutations. I.e. > > all ZFS top-level code which calls into ZFS ops and which is not > > coming from VFS. >=20 > I agree. What I am trying to fix right now is VFS<->ZFS interaction. I = think > that ZFS<->ZFS should already be fine - it's protected by internal ZFS lo= cking. > OTOH, perhaps my understanding of what you said is incomplete or incorrec= t, > because VFS suspension mechanism is completely unknown to me yet. 
>=20 I think that you should satisfy the VFS invariants, and prevent mutators =66rom operating on the filesystem when MNTK_SUSPEND is set, for the case mutators are running outside the context where VFS could call vn_start_write() around. --WHG05yakhlzm8Hk1 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR6RVAAAoJEJDCuSvBvK1Bjq4P/2tWXpxKdyvuEJiqeer5wEsm DfErzv7U3fE9tDE0XNUgzDroXTWEQATJr3brdxUTpOvBVcQYKGWVR2jEAavZaWz9 YIH8tbat783yjiem0mvULlNRUJ7QRY12yPLMzetJkgmZAt3ocH4P6k+aHgqyisVu InzNL+Ekc1+0uD4AqEShuueQ2raypLUnY8B7FfAM6APcSO4ARvo8O8Z808hjXk4g cO8VGwvwwFxVT8j+7Woocs0pypRXyQkIhR6xVeBjst81VOzPdvvut8Ic9EH0nOdu 62YPkq4zwGQnyNLoYlWWYMYqNoA1D8AyzPpnmrT2PlVI6lZ3uBcRTRVIKxoQ37b9 h8zIrkHZK7f0o/f8X77VDVlFgzxQst637CjtDio+t9FKWYh5fG3DnR5kFqetXM6G uRuGjn2f2YLKfL2om2bYNdb0CQePdwhehnnegiIA/atAnPGHY6+YZTLi/CTD1Aal 3RwGChQuVsFecZuhlCdaEAeiWMCj+e2wuJEnkX5zzwrWc93t7QYUqBPMXLvYEcEy RwSlm2oN1HO3NX5q7vn9bWlRCyiYQILB4iGC5TIEM8RxYI46kQC8/+NVnIP5AC+n JzWd7cS99E8QdCBkr1DLmEa8H9dQMAtpJegeN6pILtzh56DmASFtXgcKo58PkIHj S8rrSKhnP9Cu5faDKVS2 =9m4K -----END PGP SIGNATURE----- --WHG05yakhlzm8Hk1-- From owner-freebsd-fs@FreeBSD.ORG Fri Jul 19 15:36:56 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 68B131FF for ; Fri, 19 Jul 2013 15:36:56 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id AD391DD for ; Fri, 19 Jul 2013 15:36:55 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id SAA11234; Fri, 19 Jul 2013 18:36:52 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1V0Cjk-000A2D-Md; Fri, 19 Jul 2013 18:36:52 +0300 Message-ID: <51E95CDD.7030702@FreeBSD.org> Date: Fri, 19 Jul 2013 18:35:57 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: Konstantin Belousov Subject: Re: Deadlock in nullfs/zfs somewhere References: <51E59FD9.4020103@FreeBSD.org> <51E67F54.9080800@FreeBSD.org> <51E7B686.4090509@FreeBSD.org> <20130718112814.GA5991@kib.kiev.ua> <51E7F05A.5020609@FreeBSD.org> <20130718185215.GE5991@kib.kiev.ua> <51E91277.3070309@FreeBSD.org> <20130719103025.GJ5991@kib.kiev.ua> In-Reply-To: <20130719103025.GJ5991@kib.kiev.ua> X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Jul 2013 15:36:56 -0000 on 19/07/2013 13:30 Konstantin Belousov said the following: > I think that you should satisfy the VFS invariants, and prevent mutators > from operating on the filesystem when MNTK_SUSPEND is set, for the > case mutators are running outside the context where VFS could call > vn_start_write() around. I would like to inquire more about this suggestion. 
With the proposed patch zfs_suspend_fs would first call vfs_write_suspend, which would wait for all threads that came via VFS (and called vn_start_write) to leave and it would also mark a filesystem as suspended and that would prevent new VFS writers. Then zfs_suspend_fs calls zfsvfs_teardown, which would wait for all threads in ZFS vnode ops and vfs ops to leave and would block new calls to those ops. So there is a window between the filesystem being marked as "VFS-suspended" and it becoming fully "ZFS-suspended". As I understand you are concerned about this window. I would like to understand what assumptions VFS code makes or could make about a filesystem marked as suspended. I also would like to be pointed to the code that makes any such assumptions. I need to understand this, because if there is any code that assumes that a suspended filesystem is really frozen, then there can be a much larger problem. Unlike UFS, ZFS does not use fs suspension for creating snapshots. It does not need to because of its COW nature and use of transactions. ZFS uses suspension for rollbacks, receiving of ZFS streams and fs version upgrades. That is for operations that modify the on-disk and in-memory data and metadata. So even without that window the filesystem is going to be modified. That's the whole purpose of ZFS suspend. -- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Fri Jul 19 16:28:39 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id B970FF8; Fri, 19 Jul 2013 16:28:39 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id BA03A2D6; Fri, 19 Jul 2013 16:28:38 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id TAA11557; Fri, 19 Jul 2013 19:28:36 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1V0DXo-000A5u-FQ; Fri, 19 Jul 2013 19:28:36 +0300 Message-ID: <51E968FC.20905@FreeBSD.org> Date: Fri, 19 Jul 2013 19:27:40 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: freebsd-fs@FreeBSD.org, freebsd-arch@FreeBSD.org Subject: VOP_MKDIR/VOP_CREATE and namecache X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=X-VIET-VPS Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Jul 2013 16:28:39 -0000 Should VOP_MKDIR and VOP_CREATE immediately insert newly created vnodes into the namecache? If yes, where would it be done best? FS code, VFS code, VOP post-hooks, something else? 
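One possible shape of the per-filesystem answer, sketched only to make the question concrete (the filesystem name and the creation helper are invented; whether MAKEENTRY should gate the insertion is part of what is being asked):

    static int
    xxx_mkdir(struct vop_mkdir_args *ap)
    {
            int error;

            error = xxx_create_dir(ap);     /* invented helper: do the real mkdir */
            if (error == 0 && (ap->a_cnp->cn_flags & MAKEENTRY) != 0)
                    cache_enter(ap->a_dvp, *ap->a_vpp, ap->a_cnp);
            return (error);
    }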
-- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Fri Jul 19 18:35:04 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id D9C407A5; Fri, 19 Jul 2013 18:35:04 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 501CDB06; Fri, 19 Jul 2013 18:35:04 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id r6JIZ0E7029586; Fri, 19 Jul 2013 21:35:00 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua r6JIZ0E7029586 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id r6JIZ0Kr029585; Fri, 19 Jul 2013 21:35:00 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 19 Jul 2013 21:35:00 +0300 From: Konstantin Belousov To: Andriy Gapon Subject: Re: zfs_rename: another zfs+vfs deadlock Message-ID: <20130719183500.GL5991@kib.kiev.ua> References: <51E679FD.3040306@FreeBSD.org> <20130717194557.GU5991@kib.kiev.ua> <51E9131F.1060707@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="Vo48LVc30GAQuLuW" Content-Disposition: inline In-Reply-To: <51E9131F.1060707@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-fs@FreeBSD.org, zfs-devel@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Jul 2013 18:35:04 -0000 --Vo48LVc30GAQuLuW Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Fri, Jul 19, 2013 at 01:21:19PM +0300, Andriy Gapon wrote: > Also, I noticed that ufs_rename() checks for cross-device rename. Should all > filesystems do that or should that check belong to VFS layer (if not already > done there)? In principle yes, this sounds right. The only concern I see is layered filesystems like nullfs interaction with filesystems below the bypass. In other words, if any bypass provided the aggregation, this should be checked at the bypass layer too, in addition to the kern_renameat(). 
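To make the layered-filesystem concern concrete, a purely illustrative fragment: even when the upper (nullfs) vnodes share a mount, the lower vnodes they map to may not, so a bypass-level check would compare the lower mounts. NULLVPTOLOWERVP() is the existing nullfs accessor; where exactly such a check would live is an assumption here.

    /* fvp/tdvp are the upper vnodes involved in the rename. */
    struct vnode *lfvp, *ltdvp;

    lfvp = NULLVPTOLOWERVP(fvp);
    ltdvp = NULLVPTOLOWERVP(tdvp);
    if (lfvp->v_mount != ltdvp->v_mount)
            return (EXDEV); /* lower objects live on different filesystems */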
--Vo48LVc30GAQuLuW Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR6YbTAAoJEJDCuSvBvK1BhGsQAJ7PmCb57BzSzDUJCwydMcr3 fOD9M8UwR1NKxAWboxmIbhqjL7bfzzzeNGTHwhPj6NqQZQeBg8Lq0lvoKEqKTT6Z lPl0acR7+V2IIwBD5wj7NBN6LkZvztXc92pUt7PmLOTi7sbNOC2r8eUIvEjyMjCC O1tN4/eZiKGOk3F6ityRNjn4h2JUkwAhfn85gMrJOQvOuxVvo/AgARcxdplZdZIv 1WzZFtfWYrRGCjNwxQ0w4qE2amZ5aJudcXJdU3qiKh8Ss9s9TkLV+ZDj6+kofng+ YCbVuQ3xD9N8EpG/bmYnZV4gzWuD4hDsHBYf3Ba3DE7rdJfek7/K4TRVLnQxBCa6 toTkJijznXFjM33qpjORaNwOvFu+dWnWKmzgDMs6Ky32eeRPPqQz7Fe8IgJMD1C9 JDMZbGHJ/wqCR+vNKGaGrlZO4EL/L54IhqY2i1r3f2/fyMKBVq5bxwIxs3c3F2sw qqF64vwsnfd1aeKUTtgCVVdaSRmrsG6hdjfgri4sMqX6GfjppAcqXf6sah9SzEcv ibNiMut4q8Z6lfb9xPwsYrzubmyelQilf111bB9g7VzZuEDsEfTJoSbIPWKXPikP 6tn29wD6+E04zvL0KrCB5QtMHUhoS1l6mfrEMPXrm2wDgGhI1zBGNIdvZfSE4XxG ToPDy+vHlxrf9lDmwM3J =ojft -----END PGP SIGNATURE----- --Vo48LVc30GAQuLuW-- From owner-freebsd-fs@FreeBSD.ORG Fri Jul 19 18:42:47 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 7E326A39; Fri, 19 Jul 2013 18:42:47 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id CCCABB53; Fri, 19 Jul 2013 18:42:46 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id r6JIghO5031678; Fri, 19 Jul 2013 21:42:43 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua r6JIghO5031678 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id r6JIghWh031677; Fri, 19 Jul 2013 21:42:43 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 19 Jul 2013 21:42:43 +0300 From: Konstantin Belousov To: Andriy Gapon Subject: Re: Deadlock in nullfs/zfs somewhere Message-ID: <20130719184243.GM5991@kib.kiev.ua> References: <51E67F54.9080800@FreeBSD.org> <51E7B686.4090509@FreeBSD.org> <20130718112814.GA5991@kib.kiev.ua> <51E7F05A.5020609@FreeBSD.org> <20130718185215.GE5991@kib.kiev.ua> <51E91277.3070309@FreeBSD.org> <20130719103025.GJ5991@kib.kiev.ua> <51E95CDD.7030702@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="GD0jJf8rm+K0B4Sk" Content-Disposition: inline In-Reply-To: <51E95CDD.7030702@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Jul 2013 18:42:47 -0000 --GD0jJf8rm+K0B4Sk Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Jul 19, 2013 at 06:35:57PM +0300, Andriy Gapon wrote: > on 19/07/2013 13:30 Konstantin Belousov said the following: > > I think that you should satisfy the VFS invariants, and prevent mutators > > from operating on the filesystem when MNTK_SUSPEND is set, for the > > case mutators are running outside the context where VFS could call > > vn_start_write() around. 
>=20 > I would like to inquire more about this suggestion. > > With the proposed patch zfs_suspend_fs would first call > vfs_write_suspend, which would wait for all threads that came via > VFS (and called vn_start_write) to leave and it would also mark a > filesystem as suspended and that would prevent new VFS writers. Then > zfs_suspend_fs calls zfsvfs_teardown, which would wait for all threads > in ZFS vnode ops and vfs ops to leave and would block new calls to > those ops. > > So there is a window between the filesystem being marked as > "VFS-suspended" and it becoming fully "ZFS-suspended". As I understand > you are concerned about this window. I would like to understand what > assumptions VFS code makes or could make about a filesystem marked as > suspended. I also would like to be pointed to the code that makes any > such assumptions. > > I need to understand this, because if there is any code that assumes > that a suspended filesystem is really frozen, then there can be a much > larger problem. The expectation that the suspended filesystem does not have user-visible changes (e.g. seeing changes using the syscalls) or on-disk structures changes is the guarantee of the suspend mechanism. > > Unlike UFS, ZFS does not use fs suspension for creating snapshots. It > does not need to because of its COW nature and use of transactions. > ZFS uses suspension for rollbacks, receiving of ZFS streams and fs > version upgrades. That is for operations that modify the on-disk and > in-memory data and metadata. > > So even without that window the filesystem is going to be modified. > That's the whole purpose of ZFS suspend. > Then, you cannot use VFS suspension. Or, in other words, you are directed to abuse the VFS interface. I assure you that any changes to the interface would not take into account such abuse and probably break your hack. 
--GD0jJf8rm+K0B4Sk Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR6YiiAAoJEJDCuSvBvK1BjiYP/RaiZQSt+pHZaceUt8aUrNUl iAOoEsM+pwOzOcbHHovn4m/XnXWtC5UJDAJZH6M1HXjehOlRLx8tphJtanE9yorq Q0mMzq3SjnoMRf9ZvUzA0xDakplA/Zlk4CfxyQ/KdizCFVM6QlrfTyw/OOQijvl+ uncNQ/6t6HYxh/UVqZPkUZvOKtlH1soG7qyBV5XDi7FVGhvweJlLdJCkKlidEaZi XQMsLtoIYSCJrtldpZ/1Ah7sYUEPXOLbktTCdlhEr17YD+N0OPfrISEZO+vL4HW6 vK5yAAXiH730b+jgsAt/PuqIQCDjeIoWz/1v68deBQilZJQElV78aE4Iv8uP+w0e 5+4IPjvu1iM43sBzQG9f1gfUB3JuqgvgFQoQ1nDgXLhuops9+hAQpQC1Qv1Uzkrj dYR5aoHEVHR5WIuJfunRPwpqWKPJR0VcO8YNtBzsIdbZ9Xwl+dRbSQYbHd9vY1ng WAT/zK8PC2ntH13PQIVCHTdLU24/2gXEI6LnR8LWVm40ap0WVUn6fyDt/h55txcA KmaSFghN21/S6atZm/Gx6vf8Y/TJAuoOLTU/ikNNCw1qY+ejpR34JeSYu/700kP+ A77JnWZP9XkwA7x7Q4HQZT5GU63Zy87uK497S/d+lDKYaCLY1xmjXyHzx+h9P1lR hJ4/E5DCcV2dieownWKj =Veoy -----END PGP SIGNATURE----- --GD0jJf8rm+K0B4Sk-- From owner-freebsd-fs@FreeBSD.ORG Fri Jul 19 19:34:11 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 83BF578B for ; Fri, 19 Jul 2013 19:34:11 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id CB4BDD76 for ; Fri, 19 Jul 2013 19:34:10 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id WAA12708; Fri, 19 Jul 2013 22:34:07 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1V0GRL-000AKP-G5; Fri, 19 Jul 2013 22:34:07 +0300 Message-ID: <51E99477.1030308@FreeBSD.org> Date: Fri, 19 Jul 2013 22:33:11 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130708 Thunderbird/17.0.7 MIME-Version: 1.0 To: Konstantin Belousov Subject: Re: Deadlock in nullfs/zfs somewhere References: <51E67F54.9080800@FreeBSD.org> <51E7B686.4090509@FreeBSD.org> <20130718112814.GA5991@kib.kiev.ua> <51E7F05A.5020609@FreeBSD.org> <20130718185215.GE5991@kib.kiev.ua> <51E91277.3070309@FreeBSD.org> <20130719103025.GJ5991@kib.kiev.ua> <51E95CDD.7030702@FreeBSD.org> <20130719184243.GM5991@kib.kiev.ua> In-Reply-To: <20130719184243.GM5991@kib.kiev.ua> X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Jul 2013 19:34:11 -0000 on 19/07/2013 21:42 Konstantin Belousov said the following: > Then, you cannot use VFS suspension. Or, in other words, you are directed > to abuse the VFS interface. I assure you that any changes to the interface > would not take into account such abuse and probably break your hack. So what would be your recommendation about this problem? Should we add another flavor of VFS suspension? The one that would mean "all external accesses to this fs must be put on hold", but would not imply "this fs is frozen". 
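If it helps to make the proposal concrete, the semantics might look like the sketch below; every name and value here is invented purely for illustration, nothing like it exists in the tree, and how waiters would actually block is left open:

    /*
     * Hypothetical flag: external (VFS-originating) accesses are held
     * off, but the filesystem remains free to change its own state.
     * The bit value would have to be an unused one in mnt_kern_flag.
     */
    #define MNTK_EXTHOLD    0x01000000      /* hypothetical */

    /* Conceptually next to the existing MNTK_SUSPEND checks: */
    if ((mp->mnt_kern_flag & MNTK_EXTHOLD) != 0)
            return (EBUSY);         /* or sleep until the hold is lifted */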
-- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Fri Jul 19 20:08:29 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id D758CF95 for ; Fri, 19 Jul 2013 20:08:29 +0000 (UTC) (envelope-from david.i.noel@gmail.com) Received: from mail-wi0-x235.google.com (mail-wi0-x235.google.com [IPv6:2a00:1450:400c:c05::235]) by mx1.freebsd.org (Postfix) with ESMTP id 74F1CEB4 for ; Fri, 19 Jul 2013 20:08:29 +0000 (UTC) Received: by mail-wi0-f181.google.com with SMTP id hq4so192546wib.14 for ; Fri, 19 Jul 2013 13:08:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:content-type; bh=e5t5AjGio/+oZfur6aZ2BerWgq6RPVcFkMdDhQT7abE=; b=ub0W/93Uoe8hfQUUvyJSdhkZDDpwhanNczWbo2AcRhh3qGwoVkDHOfr3xYzBiUpWlq tPjBvaO1McWegQAKA/5WLtolruZtOMf4jcHIkZPHNBp69XayWznrof8QJ2oXZjJNpQlz D1dJ3W8YnQWa0DTs6E+TcGF8RtJMJoEV5b6ADdKFheZo2KOz3cIY/YrUPkgE6kuIq4ZH 5Cx9RXmfqMaTskRAiRlZQ3yrvDBfXUo8gBWyZKrWoadq28/RvPsKAkWomTTuGH3eCCnh tndZ6Q/YlsckS8iUaRHOnWrBidAdCNeXBEnNtIZuCzzzb5DVDirRQWsaC/VNnsDcnPoi 7OMg== MIME-Version: 1.0 X-Received: by 10.180.20.228 with SMTP id q4mr12685116wie.1.1374264508482; Fri, 19 Jul 2013 13:08:28 -0700 (PDT) Received: by 10.216.180.138 with HTTP; Fri, 19 Jul 2013 13:08:28 -0700 (PDT) In-Reply-To: References: Date: Fri, 19 Jul 2013 15:08:28 -0500 Message-ID: Subject: Re: FreeBSD upgrade woes (8.3 -> 8.4) From: David Noel To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: David.I.Noel@gmail.com List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Jul 2013 20:08:29 -0000 On 7/11/13, David Noel wrote: > I've been directed to the freebsd-fs list, so hopefully I'm in the > right place for this question. > > I have 4 servers I'm upgrading from 8.3 to 8.4. Two of them went > without a hitch, two of them blew up in my face. The only difference > between the two is the ones that worked have a 2-disk ZFS mirror and > the ones that didn't have a 4-disk ZFS striped mirror configuration > (RAID10). They both use the GPT. > > After installworld && installkernel they made it through boot, but > right before the login prompt I'm getting a panic and stack dump. The > backtrace looks something like this (roughly): > > 0 kdb_backtrace > 1 panic > 2 trap_fatal > 3 trap_pfault > 4 trap > 5 calltrap > 6 vdev_mirror_child_select > 7 vdev_mirror_io_start > 8 zio_vdev_io_start > 9 zio_execute > 10 arc_read > 11 dbuf_read > 12 dbuf_findbp > 13 dbuf_hold_impl > 14 dbuf_hold > 15 dnode_hold_impl > 16 dmu_buf_hold > 17 zap_lockdir > > Does anyone have any idea what went wrong? > > Does anyone have any suggestions on how to get past this? > > Is there any more information I could provide to help debug this? > > Thanks, > > David I replaced the kernel with the one on the 8.4 memstick and it booted just fine. I then built and installed a kernel without using the j flag to test the idea suggested on freebsd-questions@ that it could have been a buggy kernel caused by j>1. It booted without problem. Maybe there's something to this -j >1 causing buggy kernels rumor? At any rate, I don't think I'll try buildkernel with j>1 again. 
From owner-freebsd-fs@FreeBSD.ORG Sat Jul 20 17:10:03 2013 Return-Path: Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 82BAEF55; Sat, 20 Jul 2013 17:10:03 +0000 (UTC) (envelope-from linimon@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id 5E8E3BAC; Sat, 20 Jul 2013 17:10:03 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.7/8.14.7) with ESMTP id r6KHA32B086702; Sat, 20 Jul 2013 17:10:03 GMT (envelope-from linimon@freefall.freebsd.org) Received: (from linimon@localhost) by freefall.freebsd.org (8.14.7/8.14.7/Submit) id r6KHA3Q8086701; Sat, 20 Jul 2013 17:10:03 GMT (envelope-from linimon) Date: Sat, 20 Jul 2013 17:10:03 GMT Message-Id: <201307201710.r6KHA3Q8086701@freefall.freebsd.org> To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-fs@FreeBSD.org From: linimon@FreeBSD.org Subject: Re: kern/180678: [NFS] succesfully exported filesystems being reported as failed X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 20 Jul 2013 17:10:03 -0000 Old Synopsis: succesfully exported filesystems being reported as failed New Synopsis: [NFS] succesfully exported filesystems being reported as failed Responsible-Changed-From-To: freebsd-bugs->freebsd-fs Responsible-Changed-By: linimon Responsible-Changed-When: Sat Jul 20 17:09:44 UTC 2013 Responsible-Changed-Why: reclassify. http://www.freebsd.org/cgi/query-pr.cgi?pr=180678