From owner-freebsd-current@FreeBSD.ORG  Thu Dec  2 22:37:11 2004
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 9BC4E16A4CE
	for <current@freebsd.org>; Thu,  2 Dec 2004 22:37:11 +0000 (GMT)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
	by mx1.FreeBSD.org (Postfix) with ESMTP id CAF4943D54
	for <current@freebsd.org>; Thu,  2 Dec 2004 22:37:10 +0000 (GMT)
	(envelope-from andre@freebsd.org)
Received: (qmail 31381 invoked from network); 2 Dec 2004 22:28:08 -0000
Received: from dotat.atdotat.at (HELO [62.48.0.47]) ([62.48.0.47])
          (envelope-sender <andre@freebsd.org>)
          by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
          for <scottl@freebsd.org>; 2 Dec 2004 22:28:08 -0000
Message-ID: <41AF9912.7040701@freebsd.org>
Date: Thu, 02 Dec 2004 23:37:06 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
	rv:1.8a5) Gecko/20041122
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Scott Long <scottl@freebsd.org>
References: <41AE3F80.1000506@freebsd.org>  <41AF29AC.6030401@freebsd.org>
	<1102022838.11465.7735.camel@palm.tree.com> <41AF8C78.8050806@freebsd.org>
In-Reply-To: <41AF8C78.8050806@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
cc: hackers@freebsd.org
cc: "current@freebsd.org" <current@freebsd.org>
cc: Stephan Uphoff <ups@tree.com>
Subject: Re: My project wish-list for the next 12 months
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
X-List-Received-Date: Thu, 02 Dec 2004 22:37:11 -0000

Scott Long wrote:
> Stephan Uphoff wrote:
> 
>> On Thu, 2004-12-02 at 09:41, Andre Oppermann wrote:
>>
>>> The holy grail of course is to mount the same filesystem 'rw' on more
>>> than one box, preferably more than two.  This requires some more
>>> involved synchronization and locking on top of the cache invalidation.
>>> And make sure that the multi-'rw' cluster stays alive if one of the
>>> participants freezes and doesn't respond anymore.
>>>
>>> Scrolling through the UFS/FFS code I think the first one is 2-3 days of
>>> work.  The second 2-4 weeks and the third 2-3 months to get it right.
>>> If someone would throw up the money...
> 
> Although I don't know the specifics of your experience, I can easily
> imagine how hard it would be to make this work on UFS.  Common
> operations like walking a file path to the root are nearly impossible to
> do reliably without an overbearing amount of synchronization.  Then you
> have all of the problems synchronizing buffered data and metadata.

You don't do it the fully synchronized way.  The semantics will be somewhat
like those of NFS.  Reading from the filesystem does not invoke
synchronization.  Only writing, or actually intending to write something,
requires synchronization with all other machines in the filesystem cluster.
Synchronization ensures that writes to a directory or file are properly
serialized and that caches are invalidated at the same time.
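
A minimal sketch of these semantics, as a toy Python model and not UFS code.
The names Cluster, Node, write_lock and sync_rounds are hypothetical, a dict
stands in for the shared disk, and a single lock stands in for whatever
messaging the machines would really use.  Writes go straight to the "disk"
here, which is a simplification.

```python
import threading


class Cluster:
    """Toy stand-in for the shared disk plus a cluster-wide write lock."""

    def __init__(self):
        self.disk = {}                      # path -> contents (shared filesystem)
        self.write_lock = threading.Lock()  # serializes all writers
        self.nodes = []

    def join(self, name):
        node = Node(name, self)
        self.nodes.append(node)
        return node


class Node:
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.cache = {}        # local cache: path -> contents
        self.sync_rounds = 0   # how often this node had to synchronize

    def read(self, path):
        # Reads never talk to other machines; a miss only goes to disk.
        if path not in self.cache:
            self.cache[path] = self.cluster.disk.get(path)
        return self.cache[path]

    def write(self, path, data):
        # Writes are serialized cluster-wide and invalidate every peer cache.
        with self.cluster.write_lock:
            self.sync_rounds += 1
            self.cluster.disk[path] = data
            self.cache[path] = data
            for peer in self.cluster.nodes:
                if peer is not self:
                    peer.cache.pop(path, None)


cluster = Cluster()
a, b = cluster.join("a"), cluster.join("b")
a.write("/etc/motd", "hello")
first = b.read("/etc/motd")     # cache miss, filled from disk
a.write("/etc/motd", "world")   # invalidates b's cached copy
second = b.read("/etc/motd")    # miss again, sees the new contents
```

Node b never synchronizes in this run; only the writer pays that cost.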

> Softupdates would be a nightmare, if not impossible.

On the contrary.  It provides a nice place to hook code into and it helps
to keep performance at a good level.  Any inode change entering the
softdep code would cause a message to all other machines informing them
of the change.  If another machine wants to update the same inode or
something dependent on it, it has to wait until you have written it to
disk.  Its notification back to you would of course move this inode to the
front of your softdep work queue so that the waiting request is not stalled.
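
The queue handling could look roughly like the sketch below.  It is a toy
Python model under stated assumptions, not the softdep code: SoftdepQueue,
enqueue, remote_wants, flush_one and the outbox list are hypothetical names,
and the "message to all other machines" is just an entry appended to a list.

```python
from collections import OrderedDict


class SoftdepQueue:
    """Toy work queue of pending inode changes on one machine."""

    def __init__(self):
        self.pending = OrderedDict()  # inode number -> change, oldest first
        self.outbox = []              # change notices for the other machines

    def enqueue(self, inum, change):
        # A change entering the queue is announced to all other machines.
        self.pending[inum] = change
        self.outbox.append(("changed", inum))

    def remote_wants(self, inum):
        # Another machine wants this inode: move it to the front of the
        # queue and tell the caller whether it has to wait for the flush.
        if inum in self.pending:
            self.pending.move_to_end(inum, last=False)
            return True
        return False

    def flush_one(self):
        # Write the change at the front of the queue to disk (simulated).
        inum, _change = self.pending.popitem(last=False)
        return inum


q = SoftdepQueue()
for inum in (10, 11, 12):
    q.enqueue(inum, "update")
must_wait = q.remote_wants(12)             # inode 12 jumps the queue
order = [q.flush_one() for _ in range(3)]  # [12, 10, 11]
```

Without the remote request the flush order would have been 10, 11, 12.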

As long as you don't want mmap() to work across machines for contents that
are in memory but not yet written to disk, things stay pretty much sane.

-- 
Andre