From owner-freebsd-fs Thu Dec 14 14:57:45 2000
From owner-freebsd-fs@FreeBSD.ORG Thu Dec 14 14:57:42 2000
Return-Path:
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (Postfix) with ESMTP id 6DD0E37B402
	for ; Thu, 14 Dec 2000 14:57:42 -0800 (PST)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.9.3/8.9.3) id PAA29369;
	Thu, 14 Dec 2000 15:53:27 -0700 (MST)
Received: from usr08.primenet.com(206.165.6.208) via SMTP
	by smtp04.primenet.com, id smtpdAAA.kaqn5; Thu Dec 14 15:53:20 2000
Received: (from tlambert@localhost)
	by usr08.primenet.com (8.8.5/8.8.5) id PAA15102;
	Thu, 14 Dec 2000 15:57:28 -0700 (MST)
From: Terry Lambert
Message-Id: <200012142257.PAA15102@usr08.primenet.com>
Subject: Re: Filesystem tuning (minimize seeks)
To: henrich@sigbus.com (Charles Henrich)
Date: Thu, 14 Dec 2000 22:57:26 +0000 (GMT)
Cc: tlambert@primenet.com (Terry Lambert), freebsd-fs@FreeBSD.ORG
In-Reply-To: <20001213130138.A25214@sigbus.com> from "Charles Henrich" at Dec 13, 2000 01:01:38 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: tlambert@usr08.primenet.com
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > > Yes, my test is running about 25-50 machines writing a 20mb file to the
> > > FreeBSD box.  (The clients are FreeBSD as well).  The write is nothing
> > > more than a dd.
>
> I think maybe you've misunderstood my initial question.  What filesystem
> tuning options are there, or any suggestions, to reduce the amount of
> seeking going on when N files are being created and written to at once?
> I have N machines; each one opens a file, writes out a chunk of data,
> then closes the file.  Unfortunately, because all 50 are doing this
> simultaneously, the data is getting written to disk very non-sequentially
> (from a per-file perspective).  Are there any options to UFS (or via
> NFSd?) to delay writes, or anything of that nature to allow the data to
> be serialized more often than not?

The NFS protocol is defined as not returning success unless the write
has been committed to stable storage.  In FreeBSD, this tends to
serialize NFS I/O from a single client, and between multiple clients
once they exceed the number of nfsiod's you are running.

For your large number of clients, increasing the number of nfsiod's
should prevent inter-client contention.

For the write latency and the intra-client contention (e.g. several
writes from a single client), the only thing you can really do at this
time is mount the exported FS async.

SVR4 has an option called "write gathering", in which the server
violates the NFS protocol definition (and makes server failures nearly
impossible to recover from completely) by scheduling the write to
occur after a short delay and lying to the client that the data has
already been committed to stable storage.  If subsequent writes then
fall in the same pages as the previous write, the writes are "gathered
together" and done as a single physical write.

In general, most high performance NFS servers have battery backed RAM
in which they log the writes, so they can tell the client that a write
has been committed to stable storage without lying: if the system
fails, the write log is replayed after reboot to recover any writes
the server had acknowledged but not yet committed.  Network Appliance,
PrestoServ, and similar products use this technique.
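For concreteness, going back to the two FreeBSD-side knobs above (more
NFS daemons and an async mount), here is roughly what they look like.
The daemon counts, device name, and mount point are only placeholders
for your setup; note that the server-side daemon is nfsd, while the
nfsiod's run on each client:

    # /etc/rc.conf on the server: run more nfsd's
    nfs_server_enable="YES"
    nfs_server_flags="-u -t -n 16"

    # /etc/rc.conf on each client: run more nfsiod's
    nfs_client_enable="YES"
    nfs_client_flags="-n 8"

    # /etc/fstab on the server: mount the exported filesystem async
    # (recent writes can be lost in a crash, as described above)
    /dev/da0s1e   /export   ufs   rw,async   2   2

    # ...or switch an already-mounted filesystem without a reboot:
    mount -u -o async /export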
If you end up with a lot of client stalls because a client is stalling
itself (i.e. not inter-client stalls, which can be fixed by upping the
number of nfsiod's), then you might want to consider going to one of
these boxes.

If the data is not critical, an async mount might resolve the problem.
The added risk, the same as with write gathering, is that after a
crash you will have to redo the work done between the last time the
writes were actually committed and the time of the crash.  In practice
that generally means restarting the clients: they believe the server
when it says the data has been written to stable storage, so there is
no way to make them rewrite the missing sections, assuming they even
still have them available.

> I mean, in top, what is the process state "inode" referring to?  What
> is the process blocking on at that point?

An inode allocation into the ihash (inode hash) table.

You should be able to tune the number of inodes upward (on the machine
with the problem; I'm assuming the NFS server) to keep them from being
recycled unnecessarily quickly.  This should actually produce a
significant speedup: when an inode is recycled, the inode/vnode
association is destroyed even though the cache contents hung off the
vnode are still valid, so those contents become unrecoverable and have
to be recreated by rereading them off of disk.  Having enough inodes
to let cached vnodes stay associated keeps that cached data
recoverable.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message