From owner-freebsd-fs@FreeBSD.ORG Sun Apr 27 07:26:56 2008 Return-Path: Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1C7011065670 for ; Sun, 27 Apr 2008 07:26:56 +0000 (UTC) (envelope-from randy@psg.com) Received: from rip.psg.com (rip.psg.com [IPv6:2001:418:1::39]) by mx1.freebsd.org (Postfix) with ESMTP id 07AEB8FC1C for ; Sun, 27 Apr 2008 07:26:56 +0000 (UTC) (envelope-from randy@psg.com) Received: from 50.216.138.210.bn.2iij.net ([210.138.216.50] helo=rmac.psg.com) by rip.psg.com with esmtpsa (TLSv1:AES256-SHA:256) (Exim 4.69 (FreeBSD)) (envelope-from ) id 1Jq1Hj-0006zX-DA for freebsd-fs@FreeBSD.ORG; Sun, 27 Apr 2008 07:26:55 +0000 Message-ID: <48142ABE.4050107@psg.com> Date: Sun, 27 Apr 2008 16:26:54 +0900 From: Randy Bush User-Agent: Thunderbird 2.0.0.12 (Macintosh/20080213) MIME-Version: 1.0 To: freebsd-fs@FreeBSD.ORG X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Subject: zfs and vfs.zfs.prefetch_disable="1" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Apr 2008 07:26:56 -0000 i have in my incantations for zfs on i386 to stick the following in /boot/loader.conf.local vm.kmem_size=600M vm.kmem_size_max=600M zfs_load=YES vfs.zfs.prefetch_disable=1 but i have no idea where that last one crept in. any clues? 
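[Editorial note: for readers puzzling over the "600M" shorthand above — loader.conf tunables accept K/M/G suffixes, conventionally binary (2^20 for M). A small sketch of the equivalent arithmetic; `to_bytes` is a made-up helper, not a FreeBSD tool:]

```shell
# Hypothetical helper: expand the K/M/G suffix shorthand used in
# loader.conf tunables such as vm.kmem_size=600M into plain bytes
# (assuming binary suffixes, i.e. M = 2^20).
to_bytes() {
    case "$1" in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

to_bytes 600M   # 629145600 bytes
```

At runtime the effective values can be read back with `sysctl vm.kmem_size vfs.zfs.prefetch_disable`.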
randy From owner-freebsd-fs@FreeBSD.ORG Sun Apr 27 12:01:42 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7DE5E1065679 for ; Sun, 27 Apr 2008 12:01:42 +0000 (UTC) (envelope-from engywook@gmail.com) Received: from wf-out-1314.google.com (wf-out-1314.google.com [209.85.200.172]) by mx1.freebsd.org (Postfix) with ESMTP id 52C6B8FC23 for ; Sun, 27 Apr 2008 12:01:42 +0000 (UTC) (envelope-from engywook@gmail.com) Received: by wf-out-1314.google.com with SMTP id 25so3522584wfa.7 for ; Sun, 27 Apr 2008 05:01:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=8Ap+hTVtDT3nzp3NnAKPupTCwz3uORQHQ30OXVhyyD4=; b=SGtoF4Fo8R8k9CWLK5+muP63u2ETEUEz3iKQ1u6oWcpoHMGJyovlDRr+dryg79/COuRm9ZcjdQUcdYvK5M3BU9kjj9/P71kGVOib8WvlR9pOuWiVx9l9IXyCn81q12qqueXVU3X7gUBPejiL847SJSvPZCrYTP6bwLUL+Oyz0lc= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=o3acPjaBc/RpZ9yD84qtt+27eyYeRcRehtGdBsa+FgXiP5NUquPdC46GuY6MdzwPZNNlwpHc368lMhqHOHP7ZA1cBUKxBj14XKx0eYMS3DwGBeYKQzoqaE1uzC1GVgDhvNm0L35aoyYxmQ+XPNjedNETF5TpGpvGONkuPkZW4Hw= Received: by 10.142.104.9 with SMTP id b9mr1570188wfc.48.1209297702071; Sun, 27 Apr 2008 05:01:42 -0700 (PDT) Received: by 10.143.3.10 with HTTP; Sun, 27 Apr 2008 05:01:42 -0700 (PDT) Message-ID: <24adbbc00804270501t48b9a1c5le2f1d0bce18572cf@mail.gmail.com> Date: Sun, 27 Apr 2008 14:01:42 +0200 From: "Daniel Andersson" To: yalur@mail.ru In-Reply-To: <200804162212.32560.yalur@mail.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline 
References: <24adbbc00804151529m2a74085ds468eaac55ba94a32@mail.gmail.com> <200804162212.32560.yalur@mail.ru> Cc: freebsd-fs@freebsd.org Subject: Re: Choppy performance. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 Apr 2008 12:01:42 -0000 >How do you calculate totall memory use in top? Real memory use is present >in "RES" column but not in "SIZE" column. > >################################ >PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND >1215 engy 1 99 0 2085M 139M zfs:(& 49:32 >19.53% rtorrent >################################ Well, I checked memory usage in rtorrent and it said it was higher than 2 GB (the total physical RAM), then used top to read the swap line: Swap: 1024M Total, 39M Used, 985M Free, 3% Inuse But if it isn't really using that much memory, how come I get memory allocation errors in rtorrent if there's more memory available? 
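[Editorial note: the SIZE-vs-RES distinction in that top output — virtual address space reserved vs. pages actually resident — can be demonstrated directly. A minimal sketch; it reads Linux's /proc for the two counters purely for illustration (FreeBSD's top(1) reports the same two quantities as SIZE and RES):]

```python
import mmap
import os

def vsize_and_rss():
    """Virtual size and resident set of this process, in bytes (Linux /proc)."""
    with open("/proc/self/statm") as f:
        vpages, rpages = [int(x) for x in f.read().split()[:2]]
    page = os.sysconf("SC_PAGE_SIZE")
    return vpages * page, rpages * page

v0, r0 = vsize_and_rss()

# Reserve 256 MiB of anonymous memory without ever touching it:
# virtual size (SIZE) grows by the full amount, resident (RES) barely
# moves -- which is how a process can show 2085M SIZE but 139M RES.
region = mmap.mmap(-1, 256 * 1024 * 1024)

v1, r1 = vsize_and_rss()
size_grew = (v1 - v0) >= 200 * 1024 * 1024      # SIZE jumped by ~256 MiB
res_grew_little = (r1 - r0) < 16 * 1024 * 1024  # RES stayed small
```

Allocation failures can therefore occur from address-space exhaustion (or per-process limits) even while plenty of physical memory and swap remain free.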
Cheers, Daniel From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 05:35:33 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5F0EF1065673; Mon, 28 Apr 2008 05:35:33 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from tarsier.delphij.net (unknown [IPv6:2001:470:1f03:2c9::2]) by mx1.freebsd.org (Postfix) with ESMTP id 1468C8FC13; Mon, 28 Apr 2008 05:35:32 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from tarsier.geekcn.org (tarsier.geekcn.org [202.108.54.204]) (using TLSv1 with cipher ADH-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by tarsier.delphij.net (Postfix) with ESMTPS id 0372528449; Mon, 28 Apr 2008 13:35:28 +0800 (CST) Received: from localhost (tarsier.geekcn.org [202.108.54.204]) by tarsier.geekcn.org (Postfix) with ESMTP id B5B3BEB73F9; Mon, 28 Apr 2008 13:35:27 +0800 (CST) X-Virus-Scanned: amavisd-new at geekcn.org Received: from tarsier.geekcn.org ([202.108.54.204]) by localhost (mail.geekcn.org [202.108.54.204]) (amavisd-new, port 10024) with ESMTP id PG8oHbNk0H0K; Mon, 28 Apr 2008 13:35:18 +0800 (CST) Received: from charlie.delphij.net (c-69-181-135-56.hsd1.ca.comcast.net [69.181.135.56]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by tarsier.geekcn.org (Postfix) with ESMTPSA id DE9CDEB73D7; Mon, 28 Apr 2008 13:35:15 +0800 (CST) DomainKey-Signature: a=rsa-sha1; s=default; d=delphij.net; c=nofws; q=dns; h=message-id:date:from:reply-to:organization:user-agent: mime-version:to:cc:subject:x-enigmail-version:openpgp:content-type:content-transfer-encoding; b=Qwyxu08d6yEWoAuQm02dR7Zf9mTnFhKEq71MvXb36bNJqaI0OduipHwIyII+zix42 FoIiovsahlh1KJBpHAWTg== Message-ID: <4815620F.3090005@delphij.net> Date: Sun, 27 Apr 2008 22:35:11 -0700 From: Xin LI Organization: The FreeBSD Project User-Agent: Thunderbird 2.0.0.12 (X11/20080422) MIME-Version: 1.0 To: 
freebsd-fs@freebsd.org X-Enigmail-Version: 0.95.6 OpenPGP: id=18EDEBA0; url=http://www.delphij.net/delphij.asc Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Jeff Roberson , kib@FreeBSD.org Subject: [7.0-R] Possible ufs livelock during coredump path? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: d@delphij.net List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 05:35:33 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, It seems that we have a potential livelock during coredump on 7.0-R; the case was two processes trying to dump core at the same time (e.g. if I configure kern.corefile=/var/tmp/%N.core and a lot of instances dump core at the same time), perhaps when paging is involved. It would not recover but wait infinitely until a reboot. The box is running 7.0-R/i386, UP (Origin = "GenuineIntel" Id = 0xf34 Stepping = 4). Is this a known issue? This is my own server, but I do not have my hands on it because it is in China; however, I can provide some help if the experiment can be recovered with a power-cycle :) Cheers, - -- Xin LI http://www.delphij.net/ FreeBSD - The Power to Serve! 
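[Editorial note: to make the collision scenario concrete — with a %N-only pattern, every instance of the same program expands to the same core path, so many simultaneous crashes all contend on one file. A toy model of the expansion; `expand_corefile` is a made-up helper, not the kernel's routine, though the %N (process name) and %P (pid) tokens follow core(5):]

```python
def expand_corefile(pattern: str, progname: str, pid: int) -> str:
    """Toy model of kern.corefile expansion: %N -> process name, %P -> pid."""
    return pattern.replace("%N", progname).replace("%P", str(pid))

# Three crashing instances of the same daemon all map to ONE path:
paths = {expand_corefile("/var/tmp/%N.core", "httpd", pid)
         for pid in (27221, 27222, 27223)}
# paths == {"/var/tmp/httpd.core"}

# A %P-qualified pattern gives each process its own dump file instead:
unique = {expand_corefile("/var/tmp/%N.%P.core", "httpd", pid)
          for pid in (27221, 27222, 27223)}
# len(unique) == 3
```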
-----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (FreeBSD) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkgVYg8ACgkQi+vbBBjt66DTMQCfXQ4q327phAzDeEmUhtgUoJxS Ap8AniSdbCY0HN9m5wf9nAbKyLFifUQg =V94G -----END PGP SIGNATURE----- From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 07:19:32 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2865F1065670; Mon, 28 Apr 2008 07:19:32 +0000 (UTC) (envelope-from peter@holm.cc) Received: from wbm3.pair.net (wbm3.pair.net [209.68.3.66]) by mx1.freebsd.org (Postfix) with ESMTP id 033F08FC12; Mon, 28 Apr 2008 07:19:31 +0000 (UTC) (envelope-from peter@holm.cc) Received: by wbm3.pair.net (Postfix, from userid 65534) id 1B9C76B178; Mon, 28 Apr 2008 03:00:28 -0400 (EDT) Received: from 193.234.247.50 ([193.234.247.50]) (SquirrelMail authenticated user holm@aedde.pair.com) by webmail3.pair.com with HTTP; Mon, 28 Apr 2008 09:00:28 +0200 (CEST) Message-ID: <64011.193.234.247.50.1209366028.squirrel@webmail3.pair.com> In-Reply-To: <4815620F.3090005@delphij.net> References: <4815620F.3090005@delphij.net> Date: Mon, 28 Apr 2008 09:00:28 +0200 (CEST) From: "Peter Holm" To: d@delphij.net User-Agent: SquirrelMail/1.4.5 MIME-Version: 1.0 Content-Type: text/plain;charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Priority: 3 (Normal) Importance: Normal Cc: freebsd-fs@freebsd.org, Jeff Roberson , kib@freebsd.org Subject: Re: [7.0-R] Possible ufs livelock during coredump path? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 07:19:32 -0000 > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi, > > It seems that we have a potential livelock during coredump on 7.0-R, the > case was that two processes trying to coredump in the same time (e.g. 
if > I configure kern.corefile=/var/tmp/%N.core and a lot of instances > coredump in the same time), perhaps when paging involved with it. Upon > reboot, it would not recover but wait infinitely. The box is running > 7.0-R/i386, UP (Origin = "GenuineIntel" Id = 0xf34 Stepping = 4). > > Is this an known issue? This is my own server but I do not have my > hands on it because it is in China, however I can provide some help if > the experiment can be recovered with a power-cycle :) > AFAIK it is an old problem. I have some test where I had to disable core dumps for the same reason. I seem to remember that the problem is related to running out of VM? - Peter > Cheers, > - -- > Xin LI http://www.delphij.net/ > FreeBSD - The Power to Serve! > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v2.0.9 (FreeBSD) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iEYEARECAAYFAkgVYg8ACgkQi+vbBBjt66DTMQCfXQ4q327phAzDeEmUhtgUoJxS > Ap8AniSdbCY0HN9m5wf9nAbKyLFifUQg > =V94G > -----END PGP SIGNATURE----- > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 07:46:44 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3F183106564A; Mon, 28 Apr 2008 07:46:44 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from tarsier.delphij.net (unknown [IPv6:2001:470:1f03:2c9::2]) by mx1.freebsd.org (Postfix) with ESMTP id B33D18FC22; Mon, 28 Apr 2008 07:46:42 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from tarsier.geekcn.org (tarsier.geekcn.org [202.108.54.204]) (using TLSv1 with cipher ADH-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by tarsier.delphij.net (Postfix) with ESMTPS id 9012D28449; Mon, 28 Apr 2008 
15:46:41 +0800 (CST) Received: from localhost (tarsier.geekcn.org [202.108.54.204]) by tarsier.geekcn.org (Postfix) with ESMTP id A2513EB77EF; Mon, 28 Apr 2008 15:46:39 +0800 (CST) X-Virus-Scanned: amavisd-new at geekcn.org Received: from tarsier.geekcn.org ([202.108.54.204]) by localhost (mail.geekcn.org [202.108.54.204]) (amavisd-new, port 10024) with ESMTP id YU4odJ+QeeBY; Mon, 28 Apr 2008 15:46:29 +0800 (CST) Received: from charlie.delphij.net (c-69-181-135-56.hsd1.ca.comcast.net [69.181.135.56]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by tarsier.geekcn.org (Postfix) with ESMTPSA id B7C4CEB77DA; Mon, 28 Apr 2008 15:46:23 +0800 (CST) DomainKey-Signature: a=rsa-sha1; s=default; d=delphij.net; c=nofws; q=dns; h=message-id:date:from:reply-to:organization:user-agent: mime-version:to:cc:subject:references:in-reply-to: x-enigmail-version:openpgp:content-type:content-transfer-encoding; b=jrdo0Zv6AtxMOfxXvznaj++Pcz3I8H3uUiOlClT4A2ysl2AoPcPtm0WOimZ3xUKAy PB2ThUdZW78s4oWa0f9bg== Message-ID: <481580CB.1000800@delphij.net> Date: Mon, 28 Apr 2008 00:46:19 -0700 From: Xin LI Organization: The FreeBSD Project User-Agent: Thunderbird 2.0.0.12 (X11/20080422) MIME-Version: 1.0 To: Peter Holm References: <4815620F.3090005@delphij.net> <64011.193.234.247.50.1209366028.squirrel@webmail3.pair.com> In-Reply-To: <64011.193.234.247.50.1209366028.squirrel@webmail3.pair.com> X-Enigmail-Version: 0.95.6 OpenPGP: id=18EDEBA0; url=http://www.delphij.net/delphij.asc Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, kib@freebsd.org, Jeff Roberson , d@delphij.net Subject: Re: [7.0-R] Possible ufs livelock during coredump path? 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: d@delphij.net List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 07:46:44 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Peter Holm wrote: | Hi, | | It seems that we have a potential livelock during coredump on 7.0-R, the | case was that two processes trying to coredump in the same time (e.g. if | I configure kern.corefile=/var/tmp/%N.core and a lot of instances | coredump in the same time), perhaps when paging involved with it. Upon | reboot, it would not recover but wait infinitely. The box is running | 7.0-R/i386, UP (Origin = "GenuineIntel" Id = 0xf34 Stepping = 4). | | Is this an known issue? This is my own server but I do not have my | hands on it because it is in China, however I can provide some help if | the experiment can be recovered with a power-cycle :) | | |> AFAIK it is an old problem. I have some test where I had to disable core |> dumps for the same reason. I seem to remember that the problem is related |> to running out of VM? In my case it does not seem to be a matter of running out of VM (at least the system did not print out any messages: the log has a lot of "kernel: pid 27223 (httpd), uid 80: exited on signal 11 (core dumped)" but not the out-of-swap one). So, presumably we can reliably trigger this situation (or at least yours :)? Cheers, - -- Xin LI http://www.delphij.net/ FreeBSD - The Power to Serve! 
-----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (FreeBSD) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkgVgMsACgkQi+vbBBjt66BnOwCeJLB5xoE27b3CN/x/VIL+0EAI +c8AoJyYiqCi7tBeqZBx6cj/+gzBLmFn =qZmb -----END PGP SIGNATURE----- From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 08:10:31 2008 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 69470106566B for ; Mon, 28 Apr 2008 08:10:31 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.246.90]) by mx1.freebsd.org (Postfix) with ESMTP id 414298FC1E for ; Mon, 28 Apr 2008 08:10:31 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from parancell.ongs.co.jp (dullmdaler.ongs.co.jp [202.216.246.94]) by natial.ongs.co.jp (Postfix) with ESMTP id 12A99125438; Mon, 28 Apr 2008 17:10:30 +0900 (JST) Message-ID: <48158675.1060809@freebsd.org> Date: Mon, 28 Apr 2008 17:10:29 +0900 From: Daichi GOTO User-Agent: Thunderbird 2.0.0.12 (X11/20080423) MIME-Version: 1.0 To: Kostik Belousov References: <4811B0A0.8040702@freebsd.org> <20080426100116.GL18958@deviant.kiev.zoral.com.ua> In-Reply-To: <20080426100116.GL18958@deviant.kiev.zoral.com.ua> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: fs@freebsd.org Subject: Re: Approval request of some additions to src/sys/kern/vfs_subr.c and src/sys/sys/vnode.h X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 08:10:31 -0000 Kostik Belousov wrote: > On Fri, Apr 25, 2008 at 07:21:20PM +0900, Daichi GOTO wrote: >> Hi Konstantin :) >> >> To fix a unionfs issue of http://www.freebsd.org/cgi/query-pr.cgi?pr=109377, >> we need to add new functions >> >> void vkernrele(struct vnode *vp); >> void 
vkernref(struct vnode *vp); >> >> and one value >> >> int v_kernusecount; /* i ref count of kernel */ >> >> to src/sys/sys/vnode.h and rc/sys/kern/vfs_subr.c. >> >> Unionfs will be panic when lower fs layer is forced umounted by >> "umount -f". So to avoid this issue, we've added >> "v_kernusecount" value that means "a vnode count that kernel are >> using". vkernrele() and vkernref() are functions that manage >> "v_kernusecount" value. >> >> Please check those and give us an approve or some comments! > > There is already the vnode reference count. From your description, I cannot > understand how the kernusecount would prevent the panic when forced unmount > is performed. Could you, please, show the actual code ? PR mentioned > does not contain any patch. Oops, sorry. patch is follow: http://people.freebsd.org/~daichi/unionfs/experiments/unionfs-p20-3.diff > The problem you described is common for the kernel code, and right way > to handle it, for now, is to keep refcount _and_ check for the forced > reclaim. 
-- Daichi GOTO, http://people.freebsd.org/~daichi From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 08:29:45 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 58BD7106567D for ; Mon, 28 Apr 2008 08:29:45 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.freebsd.org (Postfix) with ESMTP id 1793F8FC13 for ; Mon, 28 Apr 2008 08:29:44 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from list by ciao.gmane.org with local (Exim 4.43) id 1JqOk3-00020z-NQ for freebsd-fs@freebsd.org; Mon, 28 Apr 2008 08:29:43 +0000 Received: from lara.cc.fer.hr ([161.53.72.113]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Mon, 28 Apr 2008 08:29:43 +0000 Received: from ivoras by lara.cc.fer.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Mon, 28 Apr 2008 08:29:43 +0000 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-fs@freebsd.org From: Ivan Voras Date: Mon, 28 Apr 2008 10:29:23 +0200 Lines: 30 Message-ID: References: <48142ABE.4050107@psg.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig54E08CDECC35E8AB718D4E48" X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: lara.cc.fer.hr User-Agent: Thunderbird 2.0.0.12 (X11/20080227) In-Reply-To: <48142ABE.4050107@psg.com> X-Enigmail-Version: 0.95.0 Sender: news Subject: Re: zfs and vfs.zfs.prefetch_disable="1" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 08:29:45 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig54E08CDECC35E8AB718D4E48 Randy Bush wrote: > i have in my incantations for zfs on i386 to stick 
the following in > /boot/loader.conf.local > > vm.kmem_size=600M > vm.kmem_size_max=600M > zfs_load=YES > vfs.zfs.prefetch_disable=1 Cannot say for sure but AFAIK it was mentioned during the Great ZFS Flamewars as a possible way to reduce memory usage by ZFS, and also as a possible way of avoiding some deadlocks (possibly by Pawel). --------------enig54E08CDECC35E8AB718D4E48 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFIFYrwldnAQVacBcgRAlpAAJ9ZaJRh9JzqoOxyM0tNoGQepimwhgCgjzy4 /7yuWLYVYnAUezVaMhEXzfs= =lUQJ -----END PGP SIGNATURE----- --------------enig54E08CDECC35E8AB718D4E48-- From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 08:31:33 2008 Return-Path: Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id ECC69106566B; Mon, 28 Apr 2008 08:31:33 +0000 (UTC) (envelope-from jdc@parodius.com) Received: from mx01.sc1.parodius.com (mx01.sc1.parodius.com [72.20.106.3]) by mx1.freebsd.org (Postfix) with ESMTP id CA7EE8FC1B; Mon, 28 Apr 2008 08:31:33 +0000 (UTC) (envelope-from jdc@parodius.com) Received: by mx01.sc1.parodius.com (Postfix, from userid 1000) id B94AB1CC033; Mon, 28 Apr 2008 01:31:33 -0700 (PDT) Date: Mon, 28 Apr 2008 01:31:33 -0700 From: Jeremy Chadwick To: Randy Bush Message-ID: <20080428083133.GA81628@eos.sc1.parodius.com> References: <48142ABE.4050107@psg.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <48142ABE.4050107@psg.com> User-Agent: Mutt/1.5.17 (2007-11-01) Cc: freebsd-fs@FreeBSD.ORG, ivoras@freebsd.org Subject: Re: zfs and vfs.zfs.prefetch_disable="1" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: 
List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 08:31:34 -0000 On Sun, Apr 27, 2008 at 04:26:54PM +0900, Randy Bush wrote: > i have in my incantations for zfs on i386 to stick the following in > /boot/loader.conf.local > > vm.kmem_size=600M > vm.kmem_size_max=600M > zfs_load=YES > vfs.zfs.prefetch_disable=1 > > but i have no idea where that last one crept in. any clues? It probably came from the old version of the "ZFS Tuning Guide" section of the ZFS on FreeBSD Wiki. It was removed on August 30th 2007 by Ivan Voras. http://wiki.freebsd.org/ZFSTuningGuide?action=diff&rev2=12&rev1=11 http://wiki.freebsd.org/ZFSTuningGuide?action=recall&rev=12 http://wiki.freebsd.org/ZFSTuningGuide?action=recall&rev=11 I've CC'd him here. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 08:33:27 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4EE16106567A; Mon, 28 Apr 2008 08:33:27 +0000 (UTC) (envelope-from peter@holm.cc) Received: from wbm3.pair.net (wbm3.pair.net [209.68.3.66]) by mx1.freebsd.org (Postfix) with ESMTP id 299698FC30; Mon, 28 Apr 2008 08:33:26 +0000 (UTC) (envelope-from peter@holm.cc) Received: by wbm3.pair.net (Postfix, from userid 65534) id 007F86B179; Mon, 28 Apr 2008 04:33:23 -0400 (EDT) Received: from 193.234.247.50 ([193.234.247.50]) (SquirrelMail authenticated user holm@aedde.pair.com) by webmail3.pair.com with HTTP; Mon, 28 Apr 2008 10:33:23 +0200 (CEST) Message-ID: <35682.193.234.247.50.1209371603.squirrel@webmail3.pair.com> In-Reply-To: <481580CB.1000800@delphij.net> References: <4815620F.3090005@delphij.net> <64011.193.234.247.50.1209366028.squirrel@webmail3.pair.com> <481580CB.1000800@delphij.net> Date: Mon, 28 Apr 
2008 10:33:23 +0200 (CEST) From: "Peter Holm" To: d@delphij.net User-Agent: SquirrelMail/1.4.5 MIME-Version: 1.0 Content-Type: text/plain;charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Priority: 3 (Normal) Importance: Normal Cc: freebsd-fs@freebsd.org, Jeff Roberson , d@delphij.net, kib@freebsd.org Subject: Re: [7.0-R] Possible ufs livelock during coredump path? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 08:33:27 -0000 > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Peter Holm wrote: > | Hi, > | > | It seems that we have a potential livelock during coredump on 7.0-R, the > | case was that two processes trying to coredump in the same time (e.g. if > | I configure kern.corefile=/var/tmp/%N.core and a lot of instances > | coredump in the same time), perhaps when paging involved with it. Upon > | reboot, it would not recover but wait infinitely. The box is running > | 7.0-R/i386, UP (Origin = "GenuineIntel" Id = 0xf34 Stepping = 4). > | > | Is this an known issue? This is my own server but I do not have my > | hands on it because it is in China, however I can provide some help if > | the experiment can be recovered with a power-cycle :) > | > | > |> AFAIK it is an old problem. I have some test where I had to disable > core > |> dumps for the same reason. I seem to remember that the problem is > related > |> to running out of VM? > > For my case it does not seem to be ran out of VM (at least the system > did not printed out any messages, the log has a lot of kernel: pid 27223 > (httpd), uid 80: exited on signal 11 (core dumped) but not the out of > swap one. > Nor did I, as I remember. > So, presumably we can reliably trigger this situation (or at least your > ones :)? > It's a long time since I looked at this problem, but this would seem to be a good excuse to look at it again. 
- Peter > Cheers, > - -- > Xin LI http://www.delphij.net/ > FreeBSD - The Power to Serve! > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v2.0.9 (FreeBSD) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iEYEARECAAYFAkgVgMsACgkQi+vbBBjt66BnOwCeJLB5xoE27b3CN/x/VIL+0EAI > +c8AoJyYiqCi7tBeqZBx6cj/+gzBLmFn > =qZmb > -----END PGP SIGNATURE----- > From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 09:12:00 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5EEB21065670 for ; Mon, 28 Apr 2008 09:12:00 +0000 (UTC) (envelope-from ivoras@gmail.com) Received: from rv-out-0506.google.com (rv-out-0506.google.com [209.85.198.231]) by mx1.freebsd.org (Postfix) with ESMTP id 3764C8FC0C for ; Mon, 28 Apr 2008 09:11:59 +0000 (UTC) (envelope-from ivoras@gmail.com) Received: by rv-out-0506.google.com with SMTP id b25so3144944rvf.43 for ; Mon, 28 Apr 2008 02:11:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; bh=2yIAgDssI4Sm+vBTRgfcYPI1CSMB6pV2G7M0ZbblSrs=; b=qAclsUdhRp1g/vdodbFisUOBIduUfmybPPczxjKknnkKV+x0E2BSxEG0zmfKwm3+sSvlifZRqXaqzQ8RsmdMZb+jKxD3Deec/NxXV+IwEfgosvY0QVcJTq2P/p+qOmIpxo374fVsZOo0LmS8YtG9Sc0L7POiPhKaHnzUXYu90f0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=CBOI8KnqmiYhnzLYde9kHA8ge5zSEMJLmezJcvAWMO0SiDUUANHQ63hOJRAmuwdY66MiU0C63xh8qz9vzjXKSSMz8FBsO3aE57f3dee1dUNUcIK5VKaP91T2+OguJuD/JFvOLftsfkiqKF4zTr5rB30H9uu6BAxpUFam0VDZk3k= Received: by 10.141.37.8 with SMTP id p8mr3133220rvj.53.1209372311759; Mon, 28 Apr 2008 01:45:11 
-0700 (PDT) Received: by 10.141.212.1 with HTTP; Mon, 28 Apr 2008 01:45:11 -0700 (PDT) Message-ID: <9bbcef730804280145x6961c43ekab916ec289396361@mail.gmail.com> Date: Mon, 28 Apr 2008 10:45:11 +0200 From: "Ivan Voras" Sender: ivoras@gmail.com To: "Jeremy Chadwick" In-Reply-To: <20080428083133.GA81628@eos.sc1.parodius.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <48142ABE.4050107@psg.com> <20080428083133.GA81628@eos.sc1.parodius.com> X-Google-Sender-Auth: f32a2f258c169cce Cc: Randy Bush , freebsd-fs@freebsd.org Subject: Re: zfs and vfs.zfs.prefetch_disable="1" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 09:12:00 -0000 2008/4/28 Jeremy Chadwick : > On Sun, Apr 27, 2008 at 04:26:54PM +0900, Randy Bush wrote: > > i have in my incantations for zfs on i386 to stick the following in > > /boot/loader.conf.local > > > > vm.kmem_size=600M > > vm.kmem_size_max=600M > > zfs_load=YES > > vfs.zfs.prefetch_disable=1 > > > > but i have no idea where that last one crept in. any clues? > > It probably came from the old version of the "ZFS Tuning Guide" section > of the ZFS on FreeBSD Wiki. It was removed on August 30th 2007 by Ivan > Voras. > > http://wiki.freebsd.org/ZFSTuningGuide?action=diff&rev2=12&rev1=11 > http://wiki.freebsd.org/ZFSTuningGuide?action=recall&rev=12 > http://wiki.freebsd.org/ZFSTuningGuide?action=recall&rev=11 > > I've CC'd him here. The change you reference is apparently about zil_disable, which wasn't removed, just moved. But the prefetch_disable setting was added and removed a couple of times to the page, the latest being that it was added since Pawel uses it in his post. 
From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 09:37:10 2008 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9C452106566C for ; Mon, 28 Apr 2008 09:37:10 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.246.90]) by mx1.freebsd.org (Postfix) with ESMTP id 75C3B8FC0A for ; Mon, 28 Apr 2008 09:37:10 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from parancell.ongs.co.jp (dullmdaler.ongs.co.jp [202.216.246.94]) by natial.ongs.co.jp (Postfix) with ESMTP id C3195125438; Mon, 28 Apr 2008 18:37:09 +0900 (JST) Message-ID: <48159AC5.3030000@freebsd.org> Date: Mon, 28 Apr 2008 18:37:09 +0900 From: Daichi GOTO User-Agent: Thunderbird 2.0.0.12 (X11/20080423) MIME-Version: 1.0 To: Kostik Belousov References: <4811B0A0.8040702@freebsd.org> <20080426100116.GL18958@deviant.kiev.zoral.com.ua> In-Reply-To: <20080426100116.GL18958@deviant.kiev.zoral.com.ua> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: fs@freebsd.org Subject: Re: Approval request of some additions to src/sys/kern/vfs_subr.c and src/sys/sys/vnode.h X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 09:37:10 -0000 Kostik Belousov wrote: > On Fri, Apr 25, 2008 at 07:21:20PM +0900, Daichi GOTO wrote: >> Hi Konstantin :) >> >> To fix a unionfs issue of http://www.freebsd.org/cgi/query-pr.cgi?pr=109377, >> we need to add new functions >> >> void vkernrele(struct vnode *vp); >> void vkernref(struct vnode *vp); >> >> and one value >> >> int v_kernusecount; /* i ref count of kernel */ >> >> to src/sys/sys/vnode.h and rc/sys/kern/vfs_subr.c. >> >> Unionfs will be panic when lower fs layer is forced umounted by >> "umount -f". 
So to avoid this issue, we've added >> "v_kernusecount" value that means "a vnode count that kernel are >> using". vkernrele() and vkernref() are functions that manage >> "v_kernusecount" value. >> >> Please check those and give us an approve or some comments! > > There is already the vnode reference count. From your description, I cannot > understand how the kernusecount would prevent the panic when forced unmount > is performed. Could you, please, show the actual code ? PR mentioned > does not contain any patch. Our patch avoids the kernel panic by making the "umount -f" operation fail with EBUSY while the vnode is in kernel use. In the current implementation (without our patch), "umount -f" tries to release a vnode regardless of its reference count value; because of that, unionfs and nullfs access an invalid vnode, which leads to a kernel panic. To prevent this, we need some kind of mechanism to refuse the unmount in the invalid case (e.g. a filesystem inside a unionfs stack must be unmounted in the correct order), and we think the current vnode reference count is not enough to realize that. If you have any ideas to achieve the same with the current vnode reference count, would you please tell us your idea :) > The problem you described is common for the kernel code, and right way > to handle it, for now, is to keep refcount _and_ check for the forced > reclaim. 
-- Daichi GOTO, http://people.freebsd.org/~daichi From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 11:06:56 2008 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4D2D210656AA for ; Mon, 28 Apr 2008 11:06:56 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 42BC98FC32 for ; Mon, 28 Apr 2008 11:06:56 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id m3SB6u6b056105 for ; Mon, 28 Apr 2008 11:06:56 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.2/8.14.1/Submit) id m3SB6tmj056101 for freebsd-fs@FreeBSD.org; Mon, 28 Apr 2008 11:06:55 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 28 Apr 2008 11:06:55 GMT Message-Id: <200804281106.m3SB6tmj056101@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-fs@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 11:06:56 -0000 Current FreeBSD problem reports Critical problems Serious problems S Tracker Resp. 
Description -------------------------------------------------------------------------------- o kern/112658 fs [smbfs] [patch] smbfs and caching problems (resolves b o kern/114676 fs [ufs] snapshot creation panics: snapacct_ufs2: bad blo o kern/116170 fs [panic] Kernel panic when mounting /tmp o bin/121072 fs [smbfs] mount_smbfs(8) cannot normally convert the cha o bin/122172 fs [amd] [fs]: amd(8) automount daemon dies on 6.3-STABLE 5 problems total. Non-critical problems S Tracker Resp. Description -------------------------------------------------------------------------------- o bin/113049 fs [patch] [request] make quot(8) use getopt(3) and show o bin/113838 fs [patch] [request] mount(8): add support for relative p o bin/114468 fs [patch] [request] add -d option to umount(8) to detach o kern/114847 fs [ntfs] [patch] [request] dirmask support for NTFS ala o kern/114955 fs [cd9660] [patch] [request] support for mask,dirmask,ui o bin/118249 fs mv(1): moving a directory changes its mtime 6 problems total. 
From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 13:24:25 2008 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CA7971065677; Mon, 28 Apr 2008 13:24:25 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from anti-4.kiev.sovam.com (anti-4.kiev.sovam.com [62.64.120.202]) by mx1.freebsd.org (Postfix) with ESMTP id 625968FC1C; Mon, 28 Apr 2008 13:24:25 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from [212.82.216.226] (helo=skuns.kiev.zoral.com.ua) by anti-4.kiev.sovam.com with esmtps (TLSv1:AES256-SHA:256) (Exim 4.67) (envelope-from ) id 1JqTLD-0003Wq-8T; Mon, 28 Apr 2008 16:24:23 +0300 Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by skuns.kiev.zoral.com.ua (8.14.2/8.14.2) with ESMTP id m3SDOJZh040737 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 28 Apr 2008 16:24:19 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.2/8.14.2) with ESMTP id m3SDODBx052241; Mon, 28 Apr 2008 16:24:13 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.2/8.14.2/Submit) id m3SDODTT052240; Mon, 28 Apr 2008 16:24:13 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 28 Apr 2008 16:24:13 +0300 From: Kostik Belousov To: Daichi GOTO Message-ID: <20080428132413.GS18958@deviant.kiev.zoral.com.ua> References: <4811B0A0.8040702@freebsd.org> <20080426100116.GL18958@deviant.kiev.zoral.com.ua> <48159AC5.3030000@freebsd.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="A61Eau4L8twGtri1" Content-Disposition: inline In-Reply-To: <48159AC5.3030000@freebsd.org> 
User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: ClamAV version 0.91.2, clamav-milter version 0.91.2 on skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.4 X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on skuns.kiev.zoral.com.ua X-Scanner-Signature: e7d6a74e6a41bfc4865bf8c76b21c35c X-DrWeb-checked: yes X-SpamTest-Envelope-From: kostikbel@gmail.com X-SpamTest-Group-ID: 00000000 X-SpamTest-Info: Profiles 2733 [Apr 28 2008] X-SpamTest-Info: helo_type=3 X-SpamTest-Info: {received from trusted relay: not dialup} X-SpamTest-Method: none X-SpamTest-Method: Local Lists X-SpamTest-Rate: 0 X-SpamTest-Status: Not detected X-SpamTest-Status-Extended: not_detected X-SpamTest-Version: SMTP-Filter Version 3.0.0 [0255], KAS30/Release Cc: fs@freebsd.org Subject: Re: Approval request of some additions to src/sys/kern/vfs_subr.c and src/sys/sys/vnode.h X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 13:24:25 -0000 --A61Eau4L8twGtri1 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Apr 28, 2008 at 06:37:09PM +0900, Daichi GOTO wrote: > Kostik Belousov wrote: > >On Fri, Apr 25, 2008 at 07:21:20PM +0900, Daichi GOTO wrote: > >>Hi Konstantin :) > >> > >>To fix a unionfs issue of > >>http://www.freebsd.org/cgi/query-pr.cgi?pr=109377, > >>we need to add new functions > >> > >> void vkernrele(struct vnode *vp); > >> void vkernref(struct vnode *vp); > >> > >>and one value > >> > >> int v_kernusecount; /* i ref count of kernel */ > >> > >>to src/sys/sys/vnode.h and rc/sys/kern/vfs_subr.c. > >> > >>Unionfs will be panic when lower fs layer is forced umounted by > >>"umount -f".
So to avoid this issue, we've added > >>"v_kernusecount" value that means "a vnode count that kernel are > >>using". vkernrele() and vkernref() are functions that manage > >>"v_kernusecount" value. > >> > >>Please check those and give us an approve or some comments! > > > >There is already the vnode reference count. From your description, I cannot > >understand how the kernusecount would prevent the panic when forced unmount > >is performed. Could you, please, show the actual code ? PR mentioned > >does not contain any patch. > > Our patch realizes avoiding kernel panic by "umount -f" operation using with > EBUSY process. > > On current implementation (not applied our patch), "umount -f" tries to > release vnode at any vnode reference count value. Since that, unionfs > and nullfs access invalid vnode and lead kernel panic. To prevent this > issue, we need a some kind of not-umount-accept-mechanism in invalid case > (e.x. fs in unionfsed stack, it must be umounted in correct order) and > to realize that, current vnode reference count is not enough we are > thinking. > > If you have any ideas to realize the same solution with current vnode > reference, would you please tell us your idea :) > > >The problem you described is common for the kernel code, and right way > >to handle it, for now, is to keep refcount _and_ check for the forced > >reclaim. Your patch in essence disables the forced unmount. I would object against such decision. Even if taking this direction, I believe more cleaner solution would be to introduce a counter that disables the (forced) unmount into the struct mount, instead of the struct vnode. Having the counter in the vnode, the unmount -f behaviour is non-deterministic and depended on the presence of the cached vnodes of the upper layer. The mount counter would be incremented by unionfs cover mount. But, as I said above, this looks like a wrong solution.
The right way to handle the forced reclaim with the current VFS is to add the explicit checks for the reclaimed vnodes where it is needed. The vnode cannot be reclaimed while the vnode lock is held. When obtaining the vnode lock, the reclamation can be detected. For instance, the vget() without LK_RETRY shall be checked for ENOENT. You said that that nullfs is vulnerable to the problem. Could you, please, point me to the corresponding stack trace ? At least, the nullfs vop_lock() seems to carefully check the possible problems. --A61Eau4L8twGtri1 Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (FreeBSD) iEYEARECAAYFAkgVz/wACgkQC3+MBN1Mb4isvwCfbECmYEu6lJ2FXIqaU3zYPTZs 5I0AoNzrqhXvT5XHDQs+l65owxM8rfp3 =eTfF -----END PGP SIGNATURE----- --A61Eau4L8twGtri1-- From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 14:36:39 2008 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 16D251065674 for ; Mon, 28 Apr 2008 14:36:39 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.246.90]) by mx1.freebsd.org (Postfix) with ESMTP id BDCBD8FC0A for ; Mon, 28 Apr 2008 14:36:38 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from parancell.ongs.co.jp (dullmdaler.ongs.co.jp [202.216.246.94]) by natial.ongs.co.jp (Postfix) with ESMTP id E5077125438; Mon, 28 Apr 2008 23:36:37 +0900 (JST) Message-ID: <4815E0F5.30706@freebsd.org> Date: Mon, 28 Apr 2008 23:36:37 +0900 From: Daichi GOTO User-Agent: Thunderbird 2.0.0.12 (X11/20080423) MIME-Version: 1.0 To: Kostik Belousov References: <4811B0A0.8040702@freebsd.org> <20080426100116.GL18958@deviant.kiev.zoral.com.ua> <48159AC5.3030000@freebsd.org> <20080428132413.GS18958@deviant.kiev.zoral.com.ua> In-Reply-To: <20080428132413.GS18958@deviant.kiev.zoral.com.ua> Content-Type: text/plain; charset=UTF-8; format=flowed 
Content-Transfer-Encoding: 7bit Cc: fs@freebsd.org Subject: Re: Approval request of some additions to src/sys/kern/vfs_subr.c and src/sys/sys/vnode.h X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 14:36:39 -0000 Kostik Belousov wrote: > On Mon, Apr 28, 2008 at 06:37:09PM +0900, Daichi GOTO wrote: >> Kostik Belousov wrote: >>> On Fri, Apr 25, 2008 at 07:21:20PM +0900, Daichi GOTO wrote: >>>> Hi Konstantin :) >>>> >>>> To fix a unionfs issue of >>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=109377, >>>> we need to add new functions >>>> >>>> void vkernrele(struct vnode *vp); >>>> void vkernref(struct vnode *vp); >>>> >>>> and one value >>>> >>>> int v_kernusecount; /* i ref count of kernel */ >>>> >>>> to src/sys/sys/vnode.h and rc/sys/kern/vfs_subr.c. >>>> >>>> Unionfs will be panic when lower fs layer is forced umounted by >>>> "umount -f". So to avoid this issue, we've added >>>> "v_kernusecount" value that means "a vnode count that kernel are >>>> using". vkernrele() and vkernref() are functions that manage >>>> "v_kernusecount" value. >>>> >>>> Please check those and give us an approve or some comments! >>> There is already the vnode reference count. From your description, I cannot >>> understand how the kernusecount would prevent the panic when forced unmount >>> is performed. Could you, please, show the actual code ? PR mentioned >>> does not contain any patch. >> Our patch realizes avoiding kernel panic by "umount -f" operation using with >> EBUSY process. >> >> On current implementation (not applied our patch), "umount -f" tries to >> release vnode at any vnode reference count value. Since that, unionfs >> and nullfs access invalid vnode and lead kernel panic. To prevent this >> issue, we need a some kind of not-umount-accept-mechanism in invalid case >> (e.x. 
fs in unionfsed stack, it must be umounted in correct order) and >> to realize that, current vnode reference count is not enough we are >> thinking. >> >> If you have any ideas to realize the same solution with current vnode >> reference, would you please tell us your idea :) >> >>> The problem you described is common for the kernel code, and right way >>> to handle it, for now, is to keep refcount _and_ check for the forced >>> reclaim. > > Your patch in essence disables the forced unmount. I would object against > such decision. > > Even if taking this direction, I believe more cleaner solution would be > to introduce a counter that disables the (forced) unmount into the > struct mount, instead of the struct vnode. Having the counter in the > vnode, the unmount -f behaviour is non-deterministic and depended on > the presence of the cached vnodes of the upper layer. The mount counter > would be incremented by unionfs cover mount. But, as I said above, this > looks like a wrong solution. > > The right way to handle the forced reclaim with the current VFS is to > add the explicit checks for the reclaimed vnodes where it is needed. The > vnode cannot be reclaimed while the vnode lock is held. When obtaining > the vnode lock, the reclamation can be detected. For instance, the > vget() without LK_RETRY shall be checked for ENOENT. > > You said that that nullfs is vulnerable to the problem. Could you, > please, point me to the corresponding stack trace ? At least, the nullfs > vop_lock() seems to carefully check the possible problems. 
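[Editorial illustration] The vget()-without-LK_RETRY check recommended in the quoted text can be sketched roughly as follows. This is a non-compilable, kernel-style fragment against the FreeBSD 7-era VFS API; the helper name unionfs_use_lower() and the surrounding locking discipline are illustrative assumptions, not code from any patch in this thread:

```c
/*
 * Sketch only: take the lower vnode lock without LK_RETRY, so a vnode
 * reclaimed by a forced unmount is reported as an error instead of
 * being dereferenced.
 */
static int
unionfs_use_lower(struct vnode *lvp, struct thread *td)
{
	int error;

	vhold(lvp);				/* keep the vnode memory valid */
	error = vget(lvp, LK_EXCLUSIVE, td);	/* note: no LK_RETRY */
	vdrop(lvp);
	if (error != 0)
		return (error);			/* ENOENT: lvp was reclaimed,
						   e.g. by umount -f below us */

	/* While the vnode lock is held, lvp cannot be reclaimed,
	 * so it is safe to use it here. */

	vput(lvp);				/* unlock, drop vget's reference */
	return (0);
}
```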
-- Daichi GOTO, http://people.freebsd.org/~daichi From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 14:36:57 2008 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 391391065687 for ; Mon, 28 Apr 2008 14:36:57 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.246.90]) by mx1.freebsd.org (Postfix) with ESMTP id AC3648FC12 for ; Mon, 28 Apr 2008 14:36:56 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from parancell.ongs.co.jp (dullmdaler.ongs.co.jp [202.216.246.94]) by natial.ongs.co.jp (Postfix) with ESMTP id CAAA7125438; Mon, 28 Apr 2008 23:36:55 +0900 (JST) Message-ID: <4815E107.9030902@freebsd.org> Date: Mon, 28 Apr 2008 23:36:55 +0900 From: Daichi GOTO User-Agent: Thunderbird 2.0.0.12 (X11/20080423) MIME-Version: 1.0 To: Kostik Belousov References: <4811B0A0.8040702@freebsd.org> <20080426100116.GL18958@deviant.kiev.zoral.com.ua> <48159AC5.3030000@freebsd.org> <20080428132413.GS18958@deviant.kiev.zoral.com.ua> In-Reply-To: <20080428132413.GS18958@deviant.kiev.zoral.com.ua> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: fs@freebsd.org Subject: Re: Approval request of some additions to src/sys/kern/vfs_subr.c and src/sys/sys/vnode.h X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 14:36:57 -0000 Thanks for your response and explanation :) Kostik Belousov wrote: > On Mon, Apr 28, 2008 at 06:37:09PM +0900, Daichi GOTO wrote: >> Kostik Belousov wrote: >>> On Fri, Apr 25, 2008 at 07:21:20PM +0900, Daichi GOTO wrote: >>>> Hi Konstantin :) >>>> >>>> To fix a unionfs issue of >>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=109377, >>>> we need to add new functions >>>> >>>> void vkernrele(struct vnode 
*vp); >>>> void vkernref(struct vnode *vp); >>>> >>>> and one value >>>> >>>> int v_kernusecount; /* i ref count of kernel */ >>>> >>>> to src/sys/sys/vnode.h and rc/sys/kern/vfs_subr.c. >>>> >>>> Unionfs will be panic when lower fs layer is forced umounted by >>>> "umount -f". So to avoid this issue, we've added >>>> "v_kernusecount" value that means "a vnode count that kernel are >>>> using". vkernrele() and vkernref() are functions that manage >>>> "v_kernusecount" value. >>>> >>>> Please check those and give us an approve or some comments! >>> There is already the vnode reference count. From your description, I cannot >>> understand how the kernusecount would prevent the panic when forced unmount >>> is performed. Could you, please, show the actual code ? PR mentioned >>> does not contain any patch. >> Our patch realizes avoiding kernel panic by "umount -f" operation using with >> EBUSY process. >> >> On current implementation (not applied our patch), "umount -f" tries to >> release vnode at any vnode reference count value. Since that, unionfs >> and nullfs access invalid vnode and lead kernel panic. To prevent this >> issue, we need a some kind of not-umount-accept-mechanism in invalid case >> (e.x. fs in unionfsed stack, it must be umounted in correct order) and >> to realize that, current vnode reference count is not enough we are >> thinking. >> >> If you have any ideas to realize the same solution with current vnode >> reference, would you please tell us your idea :) >> >>> The problem you described is common for the kernel code, and right way >>> to handle it, for now, is to keep refcount _and_ check for the forced >>> reclaim. > > Your patch in essence disables the forced unmount. I would object against > such decision. Oooooo.... OK. We understand. > Even if taking this direction, I believe more cleaner solution would be > to introduce a counter that disables the (forced) unmount into the > struct mount, instead of the struct vnode. 
Having the counter in the > vnode, the unmount -f behaviour is non-deterministic and depended on > the presence of the cached vnodes of the upper layer. The mount counter > would be incremented by unionfs cover mount. But, as I said above, this > looks like a wrong solution. > > The right way to handle the forced reclaim with the current VFS is to > add the explicit checks for the reclaimed vnodes where it is needed. The > vnode cannot be reclaimed while the vnode lock is held. When obtaining > the vnode lock, the reclamation can be detected. For instance, the > vget() without LK_RETRY shall be checked for ENOENT. At last, we want to check that vnode is released or not where unionfs does not know. If we can do that check, our patch is not needed for solving that issue. Would you please give us the way to check that target vnode is released or not before accessing it. > You said that that nullfs is vulnerable to the problem. Could you, > please, point me to the corresponding stack trace ? At least, the nullfs > vop_lock() seems to carefully check the possible problems. 
-- Daichi GOTO, http://people.freebsd.org/~daichi From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 16:22:51 2008 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CEB14106567B; Mon, 28 Apr 2008 16:22:51 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from anti-4.kiev.sovam.com (anti-4.kiev.sovam.com [62.64.120.202]) by mx1.freebsd.org (Postfix) with ESMTP id 631588FC1B; Mon, 28 Apr 2008 16:22:51 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from [212.82.216.226] (helo=skuns.kiev.zoral.com.ua) by anti-4.kiev.sovam.com with esmtps (TLSv1:AES256-SHA:256) (Exim 4.67) (envelope-from ) id 1JqW7t-000C1E-Fe; Mon, 28 Apr 2008 19:22:50 +0300 Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by skuns.kiev.zoral.com.ua (8.14.2/8.14.2) with ESMTP id m3SGMiaC048965 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 28 Apr 2008 19:22:44 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.2/8.14.2) with ESMTP id m3SGMdL0057121; Mon, 28 Apr 2008 19:22:39 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.2/8.14.2/Submit) id m3SGMctY057120; Mon, 28 Apr 2008 19:22:38 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 28 Apr 2008 19:22:38 +0300 From: Kostik Belousov To: Daichi GOTO Message-ID: <20080428162238.GT18958@deviant.kiev.zoral.com.ua> References: <4811B0A0.8040702@freebsd.org> <20080426100116.GL18958@deviant.kiev.zoral.com.ua> <48159AC5.3030000@freebsd.org> <20080428132413.GS18958@deviant.kiev.zoral.com.ua> <4815E107.9030902@freebsd.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; 
protocol="application/pgp-signature"; boundary="FJ2D5YQYG6NL2pc1" Content-Disposition: inline In-Reply-To: <4815E107.9030902@freebsd.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: ClamAV version 0.91.2, clamav-milter version 0.91.2 on skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.4 X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on skuns.kiev.zoral.com.ua X-Scanner-Signature: 00e80b47c0af1091817fe870f3098f37 X-DrWeb-checked: yes X-SpamTest-Envelope-From: kostikbel@gmail.com X-SpamTest-Group-ID: 00000000 X-SpamTest-Info: Profiles 2737 [Apr 28 2008] X-SpamTest-Info: helo_type=3 X-SpamTest-Info: {received from trusted relay: not dialup} X-SpamTest-Method: none X-SpamTest-Method: Local Lists X-SpamTest-Rate: 0 X-SpamTest-Status: Not detected X-SpamTest-Status-Extended: not_detected X-SpamTest-Version: SMTP-Filter Version 3.0.0 [0255], KAS30/Release Cc: fs@freebsd.org Subject: Re: Approval request of some additions to src/sys/kern/vfs_subr.c and src/sys/sys/vnode.h X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 16:22:51 -0000 --FJ2D5YQYG6NL2pc1 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Apr 28, 2008 at 11:36:55PM +0900, Daichi GOTO wrote: > Thanks for your response and explanation :) > > Kostik Belousov wrote: > >On Mon, Apr 28, 2008 at 06:37:09PM +0900, Daichi GOTO wrote: > >>Kostik Belousov wrote: > >>>On Fri, Apr 25, 2008 at 07:21:20PM +0900, Daichi GOTO wrote: > >>>>Hi Konstantin :) > >>>> > >>>>To fix a unionfs issue of > >>>>http://www.freebsd.org/cgi/query-pr.cgi?pr=109377, > >>>>we need to add new functions > >>>> > >>>> void vkernrele(struct vnode *vp); > >>>> void vkernref(struct vnode *vp); > >>>> >
>>>>and one value > >>>> > >>>> int v_kernusecount; /* i ref count of kernel */ > >>>> > >>>>to src/sys/sys/vnode.h and rc/sys/kern/vfs_subr.c. > >>>> > >>>>Unionfs will be panic when lower fs layer is forced umounted by > >>>>"umount -f". So to avoid this issue, we've added > >>>>"v_kernusecount" value that means "a vnode count that kernel are > >>>>using". vkernrele() and vkernref() are functions that manage > >>>>"v_kernusecount" value. > >>>> > >>>>Please check those and give us an approve or some comments! > >>>There is already the vnode reference count. From your description, I > >>>cannot > >>>understand how the kernusecount would prevent the panic when forced > >>>unmount > >>>is performed. Could you, please, show the actual code ? PR mentioned > >>>does not contain any patch. > >>Our patch realizes avoiding kernel panic by "umount -f" operation using > >>with > >>EBUSY process. > >> > >>On current implementation (not applied our patch), "umount -f" tries to > >>release vnode at any vnode reference count value. Since that, unionfs > >>and nullfs access invalid vnode and lead kernel panic. To prevent this > >>issue, we need a some kind of not-umount-accept-mechanism in invalid case > >>(e.x. fs in unionfsed stack, it must be umounted in correct order) and > >>to realize that, current vnode reference count is not enough we are > >>thinking. > >> > >>If you have any ideas to realize the same solution with current vnode > >>reference, would you please tell us your idea :) > >> > >>>The problem you described is common for the kernel code, and right way > >>>to handle it, for now, is to keep refcount _and_ check for the forced > >>>reclaim. > > > >Your patch in essence disables the forced unmount. I would object against > >such decision. > > Oooooo.... OK. We understand.
> > >Even if taking this direction, I believe more cleaner solution would be > >to introduce a counter that disables the (forced) unmount into the > >struct mount, instead of the struct vnode. Having the counter in the > >vnode, the unmount -f behaviour is non-deterministic and depended on > >the presence of the cached vnodes of the upper layer. The mount counter > >would be incremented by unionfs cover mount. But, as I said above, this > >looks like a wrong solution. > > > >The right way to handle the forced reclaim with the current VFS is to > >add the explicit checks for the reclaimed vnodes where it is needed. The > >vnode cannot be reclaimed while the vnode lock is held. When obtaining > >the vnode lock, the reclamation can be detected. For instance, the > >vget() without LK_RETRY shall be checked for ENOENT. > > At last, we want to check that vnode is released or not where > unionfs does not know. If we can do that check, our patch is > not needed for solving that issue. > > Would you please give us the way to check that target vnode is > released or not before accessing it. The basic rules of our VFS are: 1. You _must_ hold the vnode unless the vnode is locked. Hold count prevents the vnode memory from being reused and guarantees the validity of the counters, v_vnlock, v_mount and vop (but please note that validity != stability). E.g., v_mount may be NULLed and vop become the deadfs_vop due to reclamation. 2. The vnode lock is held when the vnode is vgone(9)'ed. In the other words, if you have a pointer to the non-reclaimed vnode that is locked, the vnode cannot be reclaimed until the lock is freed. 3. The verbs that lock a vnode (vget() and vn_lock(9)) have two mode of operations. - If you specify the LK_RETRY in the lock flags, you would get even the reclaimed vnode locked. - If you do not specified LK_RETRY, you would get ENOENT for the reclaimed vnode.
[See the #1 for the reason why you must have a vnode held while calling vget() or vn_lock()]. 4. The reclaimed vnode has the VI_DOOMED flag set; you must have vnode interlock locked to check the context of the v_iflag. Most filesystems, as opposed to the VFS, use the other technique to detect the reclaimed vnode, if needed. They clear the v_data in the vop_reclaim, and verification of the (v_data != NULL) is enough to check for reclamation. Very good example of the practical usage of the rules above are the nullfs routines null_reclaim(), null_lock() and null_nodeget(). > > > >You said that that nullfs is vulnerable to the problem. Could you, > >please, point me to the corresponding stack trace ? At least, the nullfs > >vop_lock() seems to carefully check the possible problems. --FJ2D5YQYG6NL2pc1 Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (FreeBSD) iEYEARECAAYFAkgV+c4ACgkQC3+MBN1Mb4haKACfXxdcHAicJTki0O0Iw60E3WmG 4y8An2qfC3GYLpvDljGmgrbxKqtJY8uS =2gda -----END PGP SIGNATURE----- --FJ2D5YQYG6NL2pc1-- From owner-freebsd-fs@FreeBSD.ORG Mon Apr 28 17:26:49 2008 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4E1B31065672 for ; Mon, 28 Apr 2008 17:26:49 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from natial.ongs.co.jp (natial.ongs.co.jp [202.216.246.90]) by mx1.freebsd.org (Postfix) with ESMTP id 1AA318FC21 for ; Mon, 28 Apr 2008 17:26:49 +0000 (UTC) (envelope-from daichi@freebsd.org) Received: from parancell.ongs.co.jp (dullmdaler.ongs.co.jp [202.216.246.94]) by natial.ongs.co.jp (Postfix) with ESMTP id 87A0D125438; Tue, 29 Apr 2008 02:26:48 +0900 (JST) Message-ID: <481608D8.1080308@freebsd.org> Date: Tue, 29 Apr 2008 02:26:48 +0900 From: Daichi GOTO User-Agent: Thunderbird 2.0.0.12 (X11/20080423) MIME-Version: 1.0 To: Kostik Belousov References:
<4811B0A0.8040702@freebsd.org> <20080426100116.GL18958@deviant.kiev.zoral.com.ua> <48159AC5.3030000@freebsd.org> <20080428132413.GS18958@deviant.kiev.zoral.com.ua> <4815E107.9030902@freebsd.org> <20080428162238.GT18958@deviant.kiev.zoral.com.ua> In-Reply-To: <20080428162238.GT18958@deviant.kiev.zoral.com.ua> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: fs@freebsd.org Subject: Re: Approval request of some additions to src/sys/kern/vfs_subr.c and src/sys/sys/vnode.h X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Apr 2008 17:26:49 -0000 Kostik Belousov wrote: > On Mon, Apr 28, 2008 at 11:36:55PM +0900, Daichi GOTO wrote: >> Thanks for your response and explanation :) >> >> Kostik Belousov wrote: >>> On Mon, Apr 28, 2008 at 06:37:09PM +0900, Daichi GOTO wrote: >>>> Kostik Belousov wrote: >>>>> On Fri, Apr 25, 2008 at 07:21:20PM +0900, Daichi GOTO wrote: >>>>>> Hi Konstantin :) >>>>>> >>>>>> To fix a unionfs issue of >>>>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=109377, >>>>>> we need to add new functions >>>>>> >>>>>> void vkernrele(struct vnode *vp); >>>>>> void vkernref(struct vnode *vp); >>>>>> >>>>>> and one value >>>>>> >>>>>> int v_kernusecount; /* i ref count of kernel */ >>>>>> >>>>>> to src/sys/sys/vnode.h and rc/sys/kern/vfs_subr.c. >>>>>> >>>>>> Unionfs will be panic when lower fs layer is forced umounted by >>>>>> "umount -f". So to avoid this issue, we've added >>>>>> "v_kernusecount" value that means "a vnode count that kernel are >>>>>> using". vkernrele() and vkernref() are functions that manage >>>>>> "v_kernusecount" value. >>>>>> >>>>>> Please check those and give us an approve or some comments! >>>>> There is already the vnode reference count. 
From your description, I >>>>> cannot >>>>> understand how the kernusecount would prevent the panic when forced >>>>> unmount >>>>> is performed. Could you, please, show the actual code ? PR mentioned >>>>> does not contain any patch. >>>> Our patch realizes avoiding kernel panic by "umount -f" operation using >>>> with >>>> EBUSY process. >>>> >>>> On current implementation (not applied our patch), "umount -f" tries to >>>> release vnode at any vnode reference count value. Since that, unionfs >>>> and nullfs access invalid vnode and lead kernel panic. To prevent this >>>> issue, we need a some kind of not-umount-accept-mechanism in invalid case >>>> (e.x. fs in unionfsed stack, it must be umounted in correct order) and >>>> to realize that, current vnode reference count is not enough we are >>>> thinking. >>>> >>>> If you have any ideas to realize the same solution with current vnode >>>> reference, would you please tell us your idea :) >>>> >>>>> The problem you described is common for the kernel code, and right way >>>>> to handle it, for now, is to keep refcount _and_ check for the forced >>>>> reclaim. >>> Your patch in essence disables the forced unmount. I would object against >>> such decision. >> Oooooo.... OK. We understand. >> >>> Even if taking this direction, I believe more cleaner solution would be >>> to introduce a counter that disables the (forced) unmount into the >>> struct mount, instead of the struct vnode. Having the counter in the >>> vnode, the unmount -f behaviour is non-deterministic and depended on >>> the presence of the cached vnodes of the upper layer. The mount counter >>> would be incremented by unionfs cover mount. But, as I said above, this >>> looks like a wrong solution. >>> >>> The right way to handle the forced reclaim with the current VFS is to >>> add the explicit checks for the reclaimed vnodes where it is needed. The >>> vnode cannot be reclaimed while the vnode lock is held. 
When obtaining >>> the vnode lock, the reclamation can be detected. For instance, a >>> vget() without LK_RETRY shall be checked for ENOENT. >> Ultimately, we want to check whether a vnode has been released in >> places where unionfs does not know. If we can do that check, our patch is >> not needed to solve that issue. >> >> Would you please show us the way to check whether a target vnode has >> been released before accessing it. > > The basic rules of our VFS are: > 1. You _must_ hold the vnode unless the vnode is locked. The hold count > prevents the vnode memory from being reused and guarantees the > validity of the counters, v_vnlock, v_mount and vop (but please note > that validity != stability). E.g., v_mount may be NULLed and vop > become the deadfs_vop due to reclamation. > 2. The vnode lock is held when the vnode is vgone(9)'ed. In other > words, if you have a pointer to a non-reclaimed vnode that > is locked, the vnode cannot be reclaimed until the lock is freed. > 3. The verbs that lock a vnode (vget() and vn_lock(9)) have two modes > of operation. > - If you specify LK_RETRY in the lock flags, you would get > even a reclaimed vnode locked. > - If you do not specify LK_RETRY, you would get ENOENT for a > reclaimed vnode. > [See #1 for the reason why you must have a vnode held while > calling vget() or vn_lock()]. > 4. A reclaimed vnode has the VI_DOOMED flag set; you must have the vnode > interlock locked to check the contents of v_iflag. Most filesystems, > as opposed to the VFS, use another technique to detect a reclaimed > vnode, if needed. They clear v_data in vop_reclaim, and > verifying that (v_data != NULL) is enough to check for reclamation. > > A very good example of the practical usage of the rules above is the > nullfs routines null_reclaim(), null_lock() and null_nodeget(). Thanks for your explanation!
We'll try to research and find a new solution for this issue :) >>> You said that nullfs is vulnerable to the problem. Could you, >>> please, point me to the corresponding stack trace? At least, the nullfs >>> vop_lock() seems to carefully check the possible problems. -- Daichi GOTO, http://people.freebsd.org/~daichi From owner-freebsd-fs@FreeBSD.ORG Tue Apr 29 01:52:56 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4C8EE106564A for ; Tue, 29 Apr 2008 01:52:56 +0000 (UTC) (envelope-from andrew@thefrog.net) Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.175]) by mx1.freebsd.org (Postfix) with ESMTP id B81DD8FC15 for ; Tue, 29 Apr 2008 01:52:55 +0000 (UTC) (envelope-from andrew@thefrog.net) Received: by ug-out-1314.google.com with SMTP id y2so951853uge.37 for ; Mon, 28 Apr 2008 18:52:54 -0700 (PDT) Received: by 10.67.30.3 with SMTP id h3mr5654698ugj.35.1209432441311; Mon, 28 Apr 2008 18:27:21 -0700 (PDT) Received: by 10.86.36.4 with HTTP; Mon, 28 Apr 2008 18:27:21 -0700 (PDT) Message-ID: <16a6ef710804281827p4b6e1ef3sbec516163ba764a@mail.gmail.com> Date: Tue, 29 Apr 2008 11:27:21 +1000 From: "Andrew Hill" Sender: andrew@thefrog.net To: freebsd-fs@freebsd.org MIME-Version: 1.0 X-Google-Sender-Auth: 9c73f03254ec42d2 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Subject: ZFS docs / info X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 Apr 2008 01:52:56 -0000 Not sure if this is the right list for this (apologies if not) but here goes...
Over the last week I've spent a lot of time getting to know ZFS; starting with basically no knowledge of how the bits and pieces are structured, nor how to use it, and (with a lot of late nights) getting to the point where I feel comfortable using it for a ~2TB raidz server. I've been using FreeBSD for about 8 years, so I'm comfortable using the system; my learning curve was purely with ZFS. So at the suggestion of a friend I made a bunch of notes and wrote an intro to ZFS as I now see it, and made some specific notes on things that I didn't find obvious from the documentation (at least the docs I found... which were the ZFS Tuning Guide on wiki.freebsd.org, the Sun ZFS administrator's guide, the zfs/zpool man pages and a bunch of blogs). Mostly the structure of the different elements of ZFS (zpools, file systems, vdevs, zvols) and how they interact, but also a few limitations of how those can be configured. I figure what I've written may be (hopefully) useful to others with UNIX experience but brand new to ZFS, or, better still, if someone is writing a wiki or documentation for ZFS on BSD, I'm happy for any of what I've written to be used for that kind of thing.
post 1 - basic intro, overview of the structure of ZFS (zpools, zfs, vdevs, zvols and how they all interact) http://blog.thefrog.net/2008/04/zfs-on-freebsd.html post 2 - some notable limitations and features I didn't really get from my reading of the docs (and a bug that I've yet to reproduce in a debug kernel) http://blog.thefrog.net/2008/04/more-zfs-on-freebsd.html I'm providing links because there's a rather large amount of text, which will no doubt have the odd mistake to fix as it's pointed out to me. Anyway, hope it helps someone. From owner-freebsd-fs@FreeBSD.ORG Wed Apr 30 07:35:47 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EA9321065670 for ; Wed, 30 Apr 2008 07:35:47 +0000 (UTC) (envelope-from dudu@dudu.ro) Received: from fk-out-0910.google.com (fk-out-0910.google.com [209.85.128.189]) by mx1.freebsd.org (Postfix) with ESMTP id 920528FC1C for ; Wed, 30 Apr 2008 07:35:47 +0000 (UTC) (envelope-from dudu@dudu.ro) Received: by fk-out-0910.google.com with SMTP id k31so203867fkk.11 for ; Wed, 30 Apr 2008 00:35:46 -0700 (PDT) Received: by 10.82.159.15 with SMTP id h15mr28189bue.29.1209539294954; Wed, 30 Apr 2008 00:08:14 -0700 (PDT) Received: by 10.82.185.8 with HTTP; Wed, 30 Apr 2008 00:08:14 -0700 (PDT) Message-ID: Date: Wed, 30 Apr 2008 10:08:14 +0300 From: "Vlad GALU" To: freebsd-fs@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Subject: [FYI] Unionfs hosed by weekly cronjobs X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Apr 2008 07:35:48 -0000 I added the FYI tag in the subject of this message in order to let you know that apart from noticing the symptom, I don't have any other useful info.
The machine in question runs the latest RELENG_7 and used to have /usr/ports mounted "below" twice in two different jails. Other mount flags are rw and noatime. Whenever the weekly jobs start, the system freezes. Ping still works, however. I couldn't test each weekly script because I don't have physical access to this machine and am currently away from my office. Switching to nullfs for the aforementioned mountpoints worked around the issue, at the cost of eliminating the possibility of building the same port in different jails. When I get back to my office I'll try to reproduce the problem, but if anybody can do it in the meantime, even better. Thanks, Vlad. -- ~/.signature: no such file or directory From owner-freebsd-fs@FreeBSD.ORG Wed Apr 30 13:36:52 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 875B61065685 for ; Wed, 30 Apr 2008 13:36:52 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.freebsd.org (Postfix) with ESMTP id 3A7BF8FC19 for ; Wed, 30 Apr 2008 13:36:52 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from list by ciao.gmane.org with local (Exim 4.43) id 1JrCUN-0001UC-Bb for freebsd-fs@freebsd.org; Wed, 30 Apr 2008 13:36:51 +0000 Received: from lara.cc.fer.hr ([161.53.72.113]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 30 Apr 2008 13:36:51 +0000 Received: from ivoras by lara.cc.fer.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 30 Apr 2008 13:36:51 +0000 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-fs@freebsd.org From: Ivan Voras Date: Wed, 30 Apr 2008 15:36:41 +0200 Lines: 30 Message-ID: References: <16a6ef710804281827p4b6e1ef3sbec516163ba764a@mail.gmail.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; 
boundary="------------enigCF37EA0F7776B7D6006BFC9F" X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: lara.cc.fer.hr User-Agent: Thunderbird 2.0.0.12 (X11/20080227) In-Reply-To: <16a6ef710804281827p4b6e1ef3sbec516163ba764a@mail.gmail.com> X-Enigmail-Version: 0.95.0 Sender: news Subject: Re: ZFS docs / info X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Apr 2008 13:36:52 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enigCF37EA0F7776B7D6006BFC9F Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Andrew Hill wrote: > Not sure if this is the right list for this (apologies if not) but here > goes... > > Over the last week I've spent a lot of time getting to know ZFS; Do you know about http://wiki.freebsd.org/ZFS ? --------------enigCF37EA0F7776B7D6006BFC9F Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFIGHXpldnAQVacBcgRAsbAAJ444uZsQfklgwglyjlMx1Hb4QBhcQCfc8In Zmm888YLi2Sc7Z9UoRPYYN8= =Padf -----END PGP SIGNATURE----- --------------enigCF37EA0F7776B7D6006BFC9F-- From owner-freebsd-fs@FreeBSD.ORG Thu May 1 08:54:40 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E4A501065670 for ; Thu, 1 May 2008 08:54:40 +0000 (UTC) (envelope-from andrew@thefrog.net) Received: from rv-out-0506.google.com (rv-out-0506.google.com [209.85.198.224]) by mx1.freebsd.org (Postfix) with ESMTP id C40CC8FC0A for ; Thu, 1 May 2008 08:54:40 +0000 (UTC) (envelope-from andrew@thefrog.net) Received: by rv-out-0506.google.com with SMTP id
b25so540239rvf.43 for ; Thu, 01 May 2008 01:54:40 -0700 (PDT) Received: by 10.141.212.5 with SMTP id o5mr746197rvq.20.1209632080265; Thu, 01 May 2008 01:54:40 -0700 (PDT) Received: from pc-150.acfr.usyd.edu.au ( [129.78.210.150]) by mx.google.com with ESMTPS id g22sm2341477rvb.7.2008.05.01.01.54.37 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 01 May 2008 01:54:39 -0700 (PDT) Message-Id: From: Andrew Hill To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v919.2) Date: Thu, 1 May 2008 18:54:35 +1000 X-Mailer: Apple Mail (2.919.2) Sender: Andrew Hill Subject: ZFS docs / info X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 May 2008 08:54:41 -0000 Ivan Voras wrote: > Do you know about http://wiki.freebsd.org/ZFS ? Yes, that was my starting point as I learnt about ZFS. I simply wanted to offer documentation aimed at a different level of user. I found that the documentation on that wiki and the docs it links to tended to fit into one of three categories: 1. it provided a very high-level listing of features of the whole system, without talking about specific components, what each one is responsible for and how they fit together (e.g. is the zpool or the zfs responsible for checksumming, compression, redundancy, etc) - great for convincing people of the worth of ZFS 2. it assumed the reader had full knowledge of how the ZFS pieces fit together (i.e. they knew what they wanted to create and when) and was simply there to document the syntax of the zpool and zfs commands - a good quick-reference guide for those familiar with ZFS 3.
it provided very detailed information about commands, which must of course include how to use every single component available to ZFS, a lot of which is far beyond what a typical 'home' BSD user would want, and perhaps confusing due to the level of detail - but perfect for an engineer or administrator. Obviously the right documentation for a specific user really depends on their background knowledge, and I felt that the first category was great for convincing someone to use ZFS, but if they knew nothing of how the pieces fit together then 2 and 3 were a very deep pool to dive into. So I've tried to summarise the info I found from all three into a simpler document aimed somewhere in between high-level-overview and detailed-man-pages, containing what I found most useful from the documentation available. I don't imagine anyone who's actually bothered to sign up to freebsd-fs will want documentation at the level I've written it (they'll be going for #2 or 3 above), but I figured those trying to find out how it fits together might stumble across the archives, or maybe someone involved in documentation will see some utility (for new ZFS users) in what I've written.
Andrew From owner-freebsd-fs@FreeBSD.ORG Fri May 2 20:58:40 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0BB67106566C for ; Fri, 2 May 2008 20:58:40 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from ns.trinitel.com (186.161.36.72.static.reverse.ltdomains.com [72.36.161.186]) by mx1.freebsd.org (Postfix) with ESMTP id CF5A58FC0C for ; Fri, 2 May 2008 20:58:39 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from proton.storspeed.com (209-163-168-124.static.tenantsolutions.com [209.163.168.124] (may be forged)) (authenticated bits=0) by ns.trinitel.com (8.14.1/8.14.1) with ESMTP id m42KeCib098887; Fri, 2 May 2008 15:40:12 -0500 (CDT) (envelope-from anderson@freebsd.org) Message-Id: <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org> From: Eric Anderson To: Attila Nagy In-Reply-To: <48070DCF.9090902@fsn.hu> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v919.2) Date: Fri, 2 May 2008 15:40:11 -0500 References: <48070DCF.9090902@fsn.hu> X-Mailer: Apple Mail (2.919.2) X-Spam-Status: No, score=-2.2 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on ns.trinitel.com Cc: freebsd-fs@freebsd.org Subject: Re: Consistent inodes between distinct machines X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 02 May 2008 20:58:40 -0000 On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote: > Hello, > > I have several NFS servers, where the service must be available > 0-24. 
The servers are mounted read only on the clients and I've > solved the problem of maintaining consistent inodes between them by > rsyncing an UFS image and mounting it via md on the NFS servers. > The machines have a common IP address with CARP, so if one of them > falls out, the other(s) can take over. > > This works nice, but rsyncing multi gigabyte files are becoming more > and more annoying, so I've wondered whether it would be possible to > get constant inodes between machines via alternative ways. Why not avoid syncing multi-gigabyte files by splitting your huge FS image into many smaller say 512MB files, then use md and geom concat/ stripe/etc to make them all one image that you mount? Eric From owner-freebsd-fs@FreeBSD.ORG Sat May 3 12:51:05 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8F270106564A for ; Sat, 3 May 2008 12:51:05 +0000 (UTC) (envelope-from ticso@cicely12.cicely.de) Received: from raven.bwct.de (raven.bwct.de [85.159.14.73]) by mx1.freebsd.org (Postfix) with ESMTP id 4BE898FC19 for ; Sat, 3 May 2008 12:51:04 +0000 (UTC) (envelope-from ticso@cicely12.cicely.de) Received: from cicely5.cicely.de ([10.1.1.7]) by raven.bwct.de (8.13.4/8.13.4) with ESMTP id m43Cp1q9011127 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Sat, 3 May 2008 14:51:02 +0200 (CEST) (envelope-from ticso@cicely12.cicely.de) Received: from cicely12.cicely.de (cicely12.cicely.de [10.1.1.14]) by cicely5.cicely.de (8.13.4/8.13.4) with ESMTP id m43Copdt001744 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 3 May 2008 14:50:52 +0200 (CEST) (envelope-from ticso@cicely12.cicely.de) Received: from cicely12.cicely.de (localhost [127.0.0.1]) by cicely12.cicely.de (8.13.4/8.13.3) with ESMTP id m43CopgF043266; Sat, 3 May 2008 14:50:51 +0200 (CEST) (envelope-from ticso@cicely12.cicely.de) Received: (from ticso@localhost) 
by cicely12.cicely.de (8.13.4/8.13.3/Submit) id m43Copwa043265; Sat, 3 May 2008 14:50:51 +0200 (CEST) (envelope-from ticso) Date: Sat, 3 May 2008 14:50:51 +0200 From: Bernd Walter To: Eric Anderson Message-ID: <20080503125050.GG40730@cicely12.cicely.de> References: <48070DCF.9090902@fsn.hu> <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org> X-Operating-System: FreeBSD cicely12.cicely.de 5.4-STABLE alpha User-Agent: Mutt/1.5.9i X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED=-1.8, BAYES_00=-2.599 autolearn=ham version=3.2.3 X-Spam-Checker-Version: SpamAssassin 3.2.3 (2007-08-08) on cicely12.cicely.de Cc: freebsd-fs@freebsd.org Subject: Re: Consistent inodes between distinct machines X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: ticso@cicely.de List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 May 2008 12:51:05 -0000 On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote: > On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote: > > >Hello, > > > >I have several NFS servers, where the service must be available > >0-24. The servers are mounted read only on the clients and I've > >solved the problem of maintaining consistent inodes between them by > >rsyncing an UFS image and mounting it via md on the NFS servers. > >The machines have a common IP address with CARP, so if one of them > >falls out, the other(s) can take over. > > > >This works nice, but rsyncing multi gigabyte files are becoming more > >and more annoying, so I've wondered whether it would be possible to > >get constant inodes between machines via alternative ways. 
> > > Why not avoid syncing multi-gigabyte files by splitting your huge FS > image into many smaller say 512MB files, then use md and geom concat/ > stripe/etc to make them all one image that you mount? What would be the positive effect of doing this? FFS distributes data over the media, so all of the small files change in almost every case, and you have to checksum-compare the whole virtual disk anyway. With multiple files the syncing is more complex. For example, a normal rsync run can guarantee that you get a complete file synced or none at all, but this doesn't work out of the box with multiple files, so you risk half-updated data. Nevertheless I think that the UFS/NFS combo is not very good for this problem. With ZFS send/receive, however, inode numbers are consistent. Together with the differential stream creation it is quite efficient for syncing large volumes as well.
[75]cicely14# zfs send data/arm-elf@2008-05-03 | zfs receive -v data/test
receiving full stream of data/arm-elf@2008-05-03 into data/test@2008-05-03
received 126Mb stream in 28 seconds (4.50Mb/sec)
0.008u 5.046s 0:27.93 18.0% 53+2246k 0+0io 0pf+0w
[56]cicely14# ls -ali /usr/local/arm-elf/bin/
total 22585
147 drwxr-xr-x 2 root wheel 20 Mar 25 2006 .
3 drwxr-xr-x 11 root wheel 11 Dec 25 04:58 ..
154 -rwxr-xr-x 1 root wheel 1514107 Mar 25 2006 arm-elf-addr2line
150 -rwxr-xr-x 2 root wheel 1495219 Mar 25 2006 arm-elf-ar
159 -rwxr-xr-x 2 root wheel 2275463 Mar 25 2006 arm-elf-as
158 -rwxr-xr-x 1 root wheel 1481234 Mar 25 2006 arm-elf-c++filt
163 -rwxr-xr-x 1 root wheel 300233 Mar 25 2006 arm-elf-cpp
164 -rwxr-xr-x 2 root wheel 296938 Mar 25 2006 arm-elf-gcc
164 -rwxr-xr-x 2 root wheel 296938 Mar 25 2006 arm-elf-gcc-4.1.0
162 -rwxr-xr-x 1 root wheel 15949 Mar 25 2006 arm-elf-gccbug
161 -rwxr-xr-x 1 root wheel 126715 Mar 25 2006 arm-elf-gcov
160 -rwxr-xr-x 2 root wheel 2162285 Mar 25 2006 arm-elf-ld
156 -rwxr-xr-x 2 root wheel 1541809 Mar 25 2006 arm-elf-nm
153 -rwxr-xr-x 1 root wheel 1871104 Mar 25 2006 arm-elf-objcopy
149 -rwxr-xr-x 2 root wheel 2008424 Mar 25 2006 arm-elf-objdump
152 -rwxr-xr-x 2 root wheel 1495214 Mar 25 2006 arm-elf-ranlib
155 -rwxr-xr-x 1 root wheel 389000 Mar 25 2006 arm-elf-readelf
148 -rwxr-xr-x 1 root wheel 1430608 Mar 25 2006 arm-elf-size
151 -rwxr-xr-x 1 root wheel 1412788 Mar 25 2006 arm-elf-strings
157 -rwxr-xr-x 2 root wheel 1871103 Mar 25 2006 arm-elf-strip
[57]cicely14# ls -ali /data/test/bin/
total 22585
147 drwxr-xr-x 2 root wheel 20 Mar 25 2006 .
3 drwxr-xr-x 11 root wheel 11 Dec 25 04:58 ..
154 -rwxr-xr-x 1 root wheel 1514107 Mar 25 2006 arm-elf-addr2line
150 -rwxr-xr-x 2 root wheel 1495219 Mar 25 2006 arm-elf-ar
159 -rwxr-xr-x 2 root wheel 2275463 Mar 25 2006 arm-elf-as
158 -rwxr-xr-x 1 root wheel 1481234 Mar 25 2006 arm-elf-c++filt
163 -rwxr-xr-x 1 root wheel 300233 Mar 25 2006 arm-elf-cpp
164 -rwxr-xr-x 2 root wheel 296938 Mar 25 2006 arm-elf-gcc
164 -rwxr-xr-x 2 root wheel 296938 Mar 25 2006 arm-elf-gcc-4.1.0
162 -rwxr-xr-x 1 root wheel 15949 Mar 25 2006 arm-elf-gccbug
161 -rwxr-xr-x 1 root wheel 126715 Mar 25 2006 arm-elf-gcov
160 -rwxr-xr-x 2 root wheel 2162285 Mar 25 2006 arm-elf-ld
156 -rwxr-xr-x 2 root wheel 1541809 Mar 25 2006 arm-elf-nm
153 -rwxr-xr-x 1 root wheel 1871104 Mar 25 2006 arm-elf-objcopy
149 -rwxr-xr-x 2 root wheel 2008424 Mar 25 2006 arm-elf-objdump
152 -rwxr-xr-x 2 root wheel 1495214 Mar 25 2006 arm-elf-ranlib
155 -rwxr-xr-x 1 root wheel 389000 Mar 25 2006 arm-elf-readelf
148 -rwxr-xr-x 1 root wheel 1430608 Mar 25 2006 arm-elf-size
151 -rwxr-xr-x 1 root wheel 1412788 Mar 25 2006 arm-elf-strings
157 -rwxr-xr-x 2 root wheel 1871103 Mar 25 2006 arm-elf-strip
-- B.Walter http://www.bwct.de Modbus/TCP Ethernet I/O modules, ARM-based FreeBSD machines and more.
From owner-freebsd-fs@FreeBSD.ORG Sat May 3 15:55:45 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EDCFE106566C for ; Sat, 3 May 2008 15:55:45 +0000 (UTC) (envelope-from yalur@mail.ru) Received: from mx39.mail.ru (mx39.mail.ru [194.67.23.35]) by mx1.freebsd.org (Postfix) with ESMTP id A4A488FC18 for ; Sat, 3 May 2008 15:55:45 +0000 (UTC) (envelope-from yalur@mail.ru) Received: from [77.123.105.27] (port=51133 helo=reluctant-operater.volia.net) by mx39.mail.ru with asmtp id 1JsK5Q-000PRP-00; Sat, 03 May 2008 19:55:44 +0400 From: Ruslan Kovtun Organization: Home To: "Daniel Andersson" Date: Sat, 3 May 2008 18:55:43 +0300 User-Agent: KMail/1.9.7 References: <24adbbc00804151529m2a74085ds468eaac55ba94a32@mail.gmail.com> <200804162212.32560.yalur@mail.ru> <24adbbc00804270501t48b9a1c5le2f1d0bce18572cf@mail.gmail.com> In-Reply-To: <24adbbc00804270501t48b9a1c5le2f1d0bce18572cf@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset="koi8-r" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Message-Id: <200805031855.43218.yalur@mail.ru> X-Spam: Not detected Cc: freebsd-fs@freebsd.org Subject: Re: Choppy performance. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: yalur@mail.ru List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 May 2008 15:55:46 -0000 Sorry, maybe I miss something. What "memory allocation errors in rtorrent" do you mean? > But if it isn't really using that much memory how come I get > memory allocation errors in rtorrent if there's more memory > avaliable? 
A week ago I observed a problem with write speed on a ZFS pool with the following configuration on i386: vm.kmem_size_max="1073741824" vm.kmem_size="1073741824" KVA_PAGES=512 Write speed on 8 disks (raidz) is 40 MB/sec and very choppy. If I change to vm.kmem_size_max="999M", write speed increases 4 times (160 MB/sec). I think this is a bug. What is your configuration? -- ________________ Regards, Ruslan Kovtun mailto From owner-freebsd-fs@FreeBSD.ORG Sat May 3 18:09:34 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6FBA0106566B for ; Sat, 3 May 2008 18:09:34 +0000 (UTC) (envelope-from bra@fsn.hu) Received: from people.fsn.hu (people.fsn.hu [195.228.252.137]) by mx1.freebsd.org (Postfix) with ESMTP id 5ADE78FC0A for ; Sat, 3 May 2008 18:09:32 +0000 (UTC) (envelope-from bra@fsn.hu) Received: from [172.27.51.1] (fw.axelero.hu [195.228.243.120]) by people.fsn.hu (Postfix) with ESMTP id 769D8C7653; Sat, 3 May 2008 20:09:27 +0200 (CEST) Message-ID: <481CAA55.2030506@fsn.hu> Date: Sat, 03 May 2008 20:09:25 +0200 From: Attila Nagy User-Agent: Thunderbird 2.0.0.14 (Windows/20080421) MIME-Version: 1.0 To: ticso@cicely.de References: <48070DCF.9090902@fsn.hu> <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org> <20080503125050.GG40730@cicely12.cicely.de> In-Reply-To: <20080503125050.GG40730@cicely12.cicely.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org Subject: Re: Consistent inodes between distinct machines X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 May 2008 18:09:34 -0000 Hello, On 2008.05.03.
14:50, Bernd Walter wrote: > On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote: > >> On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote: >> >> >>> Hello, >>> >>> I have several NFS servers, where the service must be available >>> 0-24. The servers are mounted read only on the clients and I've >>> solved the problem of maintaining consistent inodes between them by >>> rsyncing an UFS image and mounting it via md on the NFS servers. >>> The machines have a common IP address with CARP, so if one of them >>> falls out, the other(s) can take over. >>> >>> This works nice, but rsyncing multi gigabyte files are becoming more >>> and more annoying, so I've wondered whether it would be possible to >>> get constant inodes between machines via alternative ways. >>> >> Why not avoid syncing multi-gigabyte files by splitting your huge FS >> image into many smaller say 512MB files, then use md and geom concat/ >> stripe/etc to make them all one image that you mount? >> > > Where would be the positive effect by doing this? > FFS distributes data over the media, so all the small files changes > in almost every case and you have to checksum-compare the whole virtual > disk anyway. > With multiple files the syncing is more complex. For example a normal > rsync run can garantie that you get a complete file synced or none > at all, but this doesn't work out of the box with multiple files, so > you risk half updated data. > I haven't got Eric's e-mail, but I agree with the above. > Nevertheless I think that the UFS/NFS combo is not very good for this > problem. > I don't think so. I need a stable system and UFS/NFS is in that state in FreeBSD. > With ZFS send/receive however inode numbers are consistent. > Yes, they are, but the filesystem IDs are not, so you cannot have CARP failover for the NFS servers, because all clients will have ESTALE errors on everything. 
I've already tried that, see my e-mails about this topic in the archives (it would be good if we could synchronize the filesystem IDs and therefore the filehandles too). > Together with the differential stream creation it is quite efficient > to sync large volumes as well. > [75]cicely14# zfs send data/arm-elf@2008-05-03 | zfs receive -v data/test > receiving full stream of data/arm-elf@2008-05-03 into data/test@2008-05-03 > received 126Mb stream in 28 seconds (4.50Mb/sec) > 0.008u 5.046s 0:27.93 18.0% 53+2246k 0+0io 0pf+0w > Yes, that's why I thought of this in the first place. But there is another problem, which hits us today (with the loopbacked image mount) as well: you have to unmount the image and restart the NFS server (it can panic the machine otherwise), so we have to flip the active state from one machine to the other during the sync. The exact process looks like this:
- rsync the image to the inactive server
- when it's done, remount the image and restart the nfsd
- flip CARP (this is when the new content will go into production)
- sync the image to the now inactive, previously active server
This is a painful, slow (because of the rsync) and fragile process. And if the active server crashes while the sync is running, you are left with a possibly non-working state. With ZFS, the sync time is much smaller, but you have to flip the active state and restart nfsd as well. Currently I'm experimenting with a silly kernel patch, which replaces the following arc4random()s with a constant value:
./ffs/ffs_alloc.c: ip->i_gen = arc4random() / 2 + 1;
./ffs/ffs_alloc.c: prefcg = arc4random() % fs->fs_ncg;
./ffs/ffs_alloc.c: dp2->di_gen = arc4random() / 2 + 1;
./ffs/ffs_vfsops.c: ip->i_gen = arc4random() / 2 + 1;
It seems that this works when I don't use soft updates on the volumes.
So what I have now:
- all of the machines have the above arc4random()s removed
- all machines run the data file system in async mode (for speed, and because soft updates seem to mess up the constant inodes)
- I have all the data in a Subversion repository (better than a plain "master image", because it's versioned, logged, etc)
- I do updates in this way on the machines: mount -o rw,async /data; svn up; mount -o ro /data
So far it seems to be OK, but I'm not yet finished with the testing. From owner-freebsd-fs@FreeBSD.ORG Sat May 3 18:52:07 2008 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 05D72106567D; Sat, 3 May 2008 18:52:07 +0000 (UTC) (envelope-from ticso@cicely12.cicely.de) Received: from raven.bwct.de (raven.bwct.de [85.159.14.73]) by mx1.freebsd.org (Postfix) with ESMTP id 553558FC1C; Sat, 3 May 2008 18:52:06 +0000 (UTC) (envelope-from ticso@cicely12.cicely.de) Received: from cicely5.cicely.de ([10.1.1.7]) by raven.bwct.de (8.13.4/8.13.4) with ESMTP id m43Iq324022066 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Sat, 3 May 2008 20:52:04 +0200 (CEST) (envelope-from ticso@cicely12.cicely.de) Received: from cicely12.cicely.de (cicely12.cicely.de [10.1.1.14]) by cicely5.cicely.de (8.13.4/8.13.4) with ESMTP id m43IpuwU004876 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 3 May 2008 20:51:56 +0200 (CEST) (envelope-from ticso@cicely12.cicely.de) Received: from cicely12.cicely.de (localhost [127.0.0.1]) by cicely12.cicely.de (8.13.4/8.13.3) with ESMTP id m43IptGn044067; Sat, 3 May 2008 20:51:56 +0200 (CEST) (envelope-from ticso@cicely12.cicely.de) Received: (from ticso@localhost) by cicely12.cicely.de (8.13.4/8.13.3/Submit) id m43IptsD044066; Sat, 3 May 2008 20:51:55 +0200 (CEST) (envelope-from ticso) Date: Sat, 3 May 2008 20:51:55 +0200 From: Bernd Walter To: Attila Nagy Message-ID:
<20080503185155.GA44005@cicely12.cicely.de>
In-Reply-To: <481CAA55.2030506@fsn.hu>
Subject: Re: Consistent inodes between distinct machines

On Sat, May 03, 2008 at 08:09:25PM +0200, Attila Nagy wrote:
> Hello,
>
> On 2008.05.03. 14:50, Bernd Walter wrote:
> > On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
> > > On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
> > Nevertheless I think that the UFS/NFS combo is not very good for this
> > problem.
>
> I don't think so. I need a stable system, and UFS/NFS is in that state in
> FreeBSD.

ZFS is pretty stable as well, although it has some points you need to care about and tune.

> > With ZFS send/receive, however, inode numbers are consistent.
>
> Yes, they are, but the filesystem IDs are not, so you cannot have CARP
> failover for the NFS servers, because all clients will get ESTALE
> errors on everything.

Haven't thought about this. Of course this is a real problem. Have you tried the following: set up Server A with all required ZFS filesystems, then replicate everything to Server B using dd. Then the filesystem ID should be the same on both systems.
This will not work for newly created filesystems, however, and you may need to take extra care about not accidentally swapping disks between the machines, since they have the same disk IDs as well. I admit - not perfect :(

> I've already tried that, see my e-mails about this topic in the archives
> (it would be good if we could synchronize the filesystem IDs and
> therefore the filehandles too).
>
> > Together with the differential stream creation it is quite efficient
> > to sync large volumes as well.
> > [75]cicely14# zfs send data/arm-elf@2008-05-03 | zfs receive -v data/test
> > receiving full stream of data/arm-elf@2008-05-03 into data/test@2008-05-03
> > received 126Mb stream in 28 seconds (4.50Mb/sec)
> > 0.008u 5.046s 0:27.93 18.0% 53+2246k 0+0io 0pf+0w
>
> Yes, that's why I thought of this in the first place. But there is
> another problem, which hits us today (with the loopbacked image mount)
> as well: you have to unmount the image and restart the NFS server (it
> can panic the machine otherwise), so we have to flip the active state
> from one machine to the other during the sync.

Of course you have to do this - a readonly mount means no writing, but it doesn't mean the filesystem stops caching metadata or expects the underlying media to change contents, so to stay in sync you have to remount.

> The exact process looks like this:
> - rsync the image to the inactive server
> - when it's done, remount the image and restart the nfsd

You also have to sync the image to a different file, since you can't pollute the original file with new content while it is mounted. But with proper (IIRC default) options, rsync already writes a new file and then exchanges it with the old one.

> - flip CARP (this is when the new content will go into production)
> - sync the image to the now inactive, previously active server
>
> This is a painful, slow (because of the rsync) and fragile process.
> And if the active server crashes while the sync is going, you are there
> with a possibly non-working state.
>
> With ZFS, the sync time is much smaller, but you have to flip the active
> state and restart nfsd as well.

Sounds plausible to me.

> Currently I'm experimenting with a silly kernel patch, which replaces
> the following arc4random()s with a constant value:
> ./ffs/ffs_alloc.c: ip->i_gen = arc4random() / 2 + 1;
> ./ffs/ffs_alloc.c: prefcg = arc4random() % fs->fs_ncg;
> ./ffs/ffs_alloc.c: dp2->di_gen = arc4random() / 2 + 1;
> ./ffs/ffs_vfsops.c: ip->i_gen = arc4random() / 2 + 1;
>
> It seems that this works when I don't use soft updates on the volumes.

But it is very fragile, and the randomness is there for a good reason: namely to distribute the allocated inodes over the media, and since AFAIK at least small files have their data allocated near the inode, you influence data distribution as well. This will very likely lead to lower speed after some usage.

> So what I have now:
> - all of the machines have the above arc4random()s removed
> - all machines run the data file system in async mode (for speed and
> because soft updates seems to mess up the constant inodes)
> - I have all the data in a subversion repository (better than a plain
> "master image", because it's versioned, logged, etc)
> - I do updates in this way on the machines: mount -o rw,async /data; svn
> up; mount -o ro /data
>
> So far it seems to be OK, but I'm not yet finished with the testing.

Honestly said - I wouldn't trust that very much. Say you use two disk stations with fibre channel, which are connected to two hosts. Run the disk stations off different power supply rails. Then use a solidly constructed single server and have an identical machine as cold, or maybe already booted, standby. Use the disk stations to mirror - one half on each station. If the host dies you can easily take over the service to the other machine by just mounting the disks.
If you do this with ZFS, it even ensures that the original host will not automatically mount the pool, since the pool's host-id has been changed to that of the other host. It is not a hot standby like your solution, but talking about service failures, I would assume this will outperform any hackish solution. I see so many people trying to do freaky failover with additional complexity and additional failure points, instead of simply increasing the quality of their hardware.

--
B.Walter                http://www.bwct.de
Modbus/TCP Ethernet I/O modules, ARM-based FreeBSD machines, and more.

From owner-freebsd-fs@FreeBSD.ORG Sat May 3 19:53:44 2008
Message-ID: <481CC2B8.5080205@fsn.hu>
Date: Sat, 03 May 2008 21:53:28 +0200
From: Attila Nagy
To: ticso@cicely.de
In-Reply-To: <20080503185155.GA44005@cicely12.cicely.de>
Subject: Re: Consistent inodes between distinct machines
On 2008.05.03. 20:51, Bernd Walter wrote:
> On Sat, May 03, 2008 at 08:09:25PM +0200, Attila Nagy wrote:
>> Hello,
>>
>> On 2008.05.03. 14:50, Bernd Walter wrote:
>>> On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
>>>> On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
>>> Nevertheless I think that the UFS/NFS combo is not very good for this
>>> problem.
>>
>> I don't think so. I need a stable system and UFS/NFS is in that state in
>> FreeBSD.
>
> ZFS is pretty stable as well, although it has some points you need
> to care about and tune.

I have (had; I switched one back to UFS) two machines with ZFS, one i386 and one amd64. Both kept crashing or freezing, so I don't consider ZFS pretty stable ATM. :(

> Haven't thought about this.
> Of course this is a real problem.
> Have you tried the following:
> Setup Server A with all required ZFS filesystems.
> Replicate everything to Server B using dd.
> Then the filesystem ID should be the same on both systems.
> This will not work for newly created filesystems, however, and you may
> need to take extra care about not accidentally swapping disks between
> the machines, since they have the same disk IDs as well.
> I admit - not perfect :(

Haven't tried that (though I thought of it), because I would need a bunch of new filesystems for snapshotting and synchronizing, and I would have to dd tens of gigabytes every time to all of the NFS servers over the network.

>> Yes, that's why I thought of this in the first place. But there is
>> another problem, which hits us today (with the loopbacked image mount)
>> as well: you have to unmount the image and restart the NFS server (it
>> can panic the machine otherwise), so we have to flip the active state
>> from one machine to the other during the sync.
> Of course you have to do this - readonly mounts mean not writing, but
> it doesn't mean not caching metadata and expecting the underlying media
> to change contents, so to stay in sync you have to remount.

I am very well aware of that. If it worked, I would choose a geom_gate solution with one RW machine and many RO ones, with a mirror formed from them. Of course that's still not perfect, so ZFS's mirroring would be a better fit (due to incremental updates). But sadly, it's not possible (AFAIK with "standard" methods) to run systems like that.

>> The exact process looks like this:
>> - rsync the image to the inactive server
>> - when it's done, remount the image and restart the nfsd
>
> You also have to sync the image to a different file, since you can't
> pollute the original file with new content while it is mounted.

I have been doing this for years without any ill effects. Of course I don't access the filesystem while it's being synced. I'm just too lazy to umount it, but you are right, that's the correct way.

> But with proper (IIRC default) options, rsync already writes a new
> file and then exchanges it with the old one.

Yes, I use in-place syncing, because I don't have that much space available.

>> Currently I'm experimenting with a silly kernel patch, which replaces
>> the following arc4random()s with a constant value:
>> ./ffs/ffs_alloc.c: ip->i_gen = arc4random() / 2 + 1;
>> ./ffs/ffs_alloc.c: prefcg = arc4random() % fs->fs_ncg;
>> ./ffs/ffs_alloc.c: dp2->di_gen = arc4random() / 2 + 1;
>> ./ffs/ffs_vfsops.c: ip->i_gen = arc4random() / 2 + 1;
>>
>> It seems that this works when I don't use soft updates on the volumes.
>
> But it is very fragile and it is there for a good reason.

For a normal filesystem, yes.

> Namely to distribute the allocated inodes over the media, and since
> AFAIK at least small files have their data allocated near the inode,
> you influence data distribution as well.
> This will very likely lead to lower speed after some usage.

Because these are mostly RO volumes (only RW while updating, which is a slow process anyway), used for serving NFS clients, I don't think it will matter that much. But I'll see. Currently this is the best I could come up with.

> Honestly said - I wouldn't trust that very much.
> Say you use two disk stations with fibre channel, which are connected to
> two hosts.
> Run the disk stations off different power supply rails.
> Then use a solidly constructed single server and have an identical machine
> as cold, or maybe already booted, standby.
> Use the disk stations to mirror - one half on each station.
> If the host dies you can easily take over the service to the other
> machine by just mounting the disks.
> If you do this with ZFS, it even ensures that the original host will
> not automatically mount the pool, since the pool's host-id has been
> changed to that of the other host.
> It is not a hot standby like your solution, but talking about service
> failures, I would assume this will outperform any hackish solution.
> I see so many people trying to do freaky failover with additional
> complexity and additional failure points, instead of simply increasing
> the quality of their hardware.

The servers above provide NFS to FreeBSD and Linux netboot clients (the clients are at many sites, running the real services behind load balancers, BGP anycast routing, whatever you like). The NFS servers here serve the purposes of rapid deployment (put some new machines into server pool X), centralised management (only have to make the configuration and OS changes in one place), etc. So I'm not trying to build a highly available general cluster (with NFS), but a highly available NFS server for netbooted clients. And commercial NASes aren't any better (at least this is what I've seen so far); most of them are not shared-nothing systems with affordable, reliable multisite replication capabilities.