From owner-freebsd-current Mon Aug 26 15:06:07 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id PAA19670 for current-outgoing; Mon, 26 Aug 1996 15:06:07 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id PAA19658; Mon, 26 Aug 1996 15:05:53 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id OAA23328; Mon, 26 Aug 1996 14:55:12 -0700
From: Terry Lambert
Message-Id: <199608262155.OAA23328@phaeton.artisoft.com>
Subject: Re: The VIVA file system (fwd)
To: eric@ms.uky.edu
Date: Mon, 26 Aug 1996 14:55:12 -0700 (MST)
Cc: terry@lambert.org, freebsd-fs@freebsd.org, current@freebsd.org
In-Reply-To: <9608252145.aa12275@t2.t2.mscf.uky.edu> from "eric@ms.uky.edu" at Aug 25, 96 09:45:16 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> I guess I should respond to this thread since I happen to be on
> all these nice FreeBSD mailing lists nowadays.
>
> The Linux version was done by one of Raphael's Master's students.
> It isn't complete, but it does apparently work (I personally have
> not seen it nor have I seen the performance figures).
>
> As a side note, I am currently working on Viva2, which should be a much
> more interesting gadget.  Anyway, it is in active development again
> and FreeBSD is the platform this time.
>
> > The VIVA stuff is, I think, overoptimistic.
> >
> > They have made a number of claims in the University of Kentucky papers
> > that were published about two years ago that seem to rely on overly
> > optimistic assumptions about policy and usage.
>
> You might explain this one, I'm not sure I know what you mean.  The
> paper was written over three years ago.  The work was actually
> performed from late 1991-1992.  The AT&T lawsuit came out, I became
> distracted with making a living, and haven't gotten back to it until
> a couple months ago.

The optimism is in terms of comparative technology.  The FFS you talk
about in the paper is *not* the FFS FreeBSD is running.

Sorry if this seemed like an attack on VIVA; it was intended as an
attack on the idea of replacing FFS with VIVA based on the contents of
the paper, and the fact that "Linux has it, now we need it!".

I know that I saw the paper at least two years and 5 months ago, if not
before that -- I *think* I saw it the week it came out; there was a
presentation by one of the grad students involved to the USL FS gurus:
Art Sabsevitch, Wen Ling Lu, etc., of the code on SVR4.

> For all the discussion below, you must remember that the platforms for
> Viva were 1) AT&T SysV, and 2) BSDI's BSD/386.  We abandoned SysV
> because I wanted to release the code, then came the AT&T lawsuit :-(

I saw the code on #1.  That's part of what made me skeptical; the SVR4
FFS implementation was intentionally (IMO) crippled on a lot of
defaults and tunables so they could make the claims they did about
VXFS.  The VXFS code was the shining golden baby.  Never mind that it
was itself FFS-derived (for example, it used SVR4 UFS directory
management code without modification).

Any comparison against SVR4 UFS as it was will be incredibly biased,
even if the bias was not an intentional result of the testing
conditions, because the UFS code came pre-biased.  8-(.
> > They also seemed to pick "worst case" scenarios for comparison with FFS,
> > and avoided FFS best case.
>
> We did our testing on clean, freshly newfs'd partitions for the graphs.
> I don't see how this is "worst case", but perhaps you mean the types
> of tests we ran.  Obviously, we ran some tests that showed a difference
> between FFS and Viva.

Well, of course.  And without looking at it side-by-side myself, I
really couldn't say whether only positive-for-VIVA comparisons existed,
or whether only the positive ones were presented.  The near-full-disk
scenario seemed a bit contrived, but *could* be justified in real world
usage.

Like I said, I'd like to see someone working on a thesis (or with a
similar incentive for detail) revisit the whole thing in light of the
rewritten FICUS-derived-VFS-based FFS code in FreeBSD, post
cache-unification.  I wonder if all the locality wins would still hold.
My opinion is that at least three of them would not; I'd be happy to be
proven wrong, and find out that they are additive to the VM/buffer
cache architecture improvements in FreeBSD's VFS/VM interface.

> > This is not nearly as bad as the MACH MSDOSFS papers, which intentionally
> > handicapped FFS through parameter setting and cache reduction, while
> > caching the entire DOS FAT in core was seen as being acceptable, to
> > compare their work to FFS.
> >
> > But it is certainly not entirely unbiased reporting.
>
> I'm not sure how to react to this.  Can one write an "entirely
> unbiased" report about one's own work?  Personally, I don't think
> so.  We tried.  I'll leave it at that.

I don't think it's possible, either -- conclusions must always be taken
with a grain of salt.  It is definitely *not* in the same class as the
MACH paper, which exhibited intentional bias.

I guess this would have read better if you had known about the MACH
paper's taking of license before you read my comment.  Sorry about
that -- I tend to assume everyone has the same context, or is willing
to get it.  I wasn't going off half-cocked.

> > The read and rewrite differences are mostly attributable to policy
> > issues in the use of FFS "optimizations" which are inappropriate to
> > the hardware used.
>
> The read and rewrite differences are due to the fact FFS didn't do
> clustering very well at all.  BSDI *still* doesn't do it well, but
> FreeBSD appears to be much better at it.  I'm still running tests
> though and probably will be running tests for some time yet.

Yes.  Like I said, Matt Day's work on this is relevant, but it may
never see the light.  He's made some comments to the effect of what he
has done in brief discussions on the -current list.

Even with the recent work, there's a lot of room for improvement in
FreeBSD clustering and write-gathering, without needing a new disk
layout to get it.  As before, I'm not sure whether this is additive or
parallel to the VIVA development.

> > Finally, I am interested in, but suspicious of, their compression
> > claims, since they also claim that the FFS performance degradation,
> > which Knuth clearly shows to be a hash effect to be expected after
> > an 85% fill (in "Sorting and Searching"), is nonexistent.
>
> Well, the results are in the paper.  This is what we saw, but
> you should look at the table carefully.  There are places where
> the effective clustering of a particular file degrades over 50%,
> but that was (at the time) about as good as FFS ever did anyway.
> The mean effective clustering always remained very high (90%+).
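[A quick aside on that 85% number, since I keep waving Knuth around
without showing the arithmetic.  For open addressing with linear
probing -- I'm quoting the textbook result from memory, and FFS block
allocation is only loosely analogous to a hash table, so treat the
exact constants as illustrative -- the expected number of probes on an
unsuccessful search at load factor alpha is roughly:

	C'(\alpha) \approx \frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)

That works out to about 2.5 probes at 50% fill, about 23 at 85%, and
about 50 at 90%: the search cost blows up non-linearly as the fill
approaches 100%.  That blowup is the reason FFS keeps a free-space
reserve at all (the "minfree" knob in newfs/tunefs, historically 10%),
and it's why I'm suspicious of claims that the degradation past 85%
simply isn't there.]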
Yes; I didn't make the distinction on "effective clustering", the term
which was introduced in the paper.

I'm really not sure about one cache effect being superior to another in
a unified VM.  The FFS does do clustering as well, and it seems that
this has more to do with file I/O clustering mapping effectively onto
disk I/O clustering than with anything that might be a real artifact of
the FFS layout itself.

The address space compression is interesting for vnode-based buffering;
FreeBSD currently uses this method, but... well, I've railed often
enough against vclean that I think everyone knows how I feel.

> I should have some more modern numbers in a few months, Raphael's
> student probably has some for Linux now.  I think he used the
> original algorithms.

Yes... really, it wants a new paper (or 3 or 5 of them).  There is a
lot of room for exciting work in FS development.  I just think that it
would be ill-advised to expect the same gains in a FreeBSD framework,
at least until some basic academic work has taken place that addresses
the new situation.

> > INRE: "where the wins come from", the "Discussion" reverses the
> > claims made earlier in the paper -- we see that the avoidance of
> > indirect blocks is not the primary win (a conclusion we came to
> > on our own from viewing the earlier graphs).
>
> This is correct, the big performance wins came from:
>
> 1) Large block sizes and small frag sizes
> 2) Good clustering
> 3) Multiple read-ahead

Yes; the only one failing in FreeBSD right now is #3.  There is some
room for improvement in #2 for writes, but that won't affect the
read/rewrite speeds (or shouldn't).

> > We also see in "Discussion" that caching file beginnings/ends in the
> > inode itself is not a win as they had hoped.  In fact, compilation
> > times are pessimized by 25% by it.
>
> Yes, we were disappointed by that, but it just confirmed what others
> (Tanenbaum for example) had seen earlier.  You should remember
> that one reason for the degradation was that it threw off all
> the code that tries to read things in FS block-sized chunks.
> We wanted to be able to read headers in files quickly and it *does*
> do that well.  We just thought it would be nice to provide that
> capability (some people have entire file systems dedicated to
> particular tasks).  Some space in the inode can be used for lots
> of things, perhaps it would be most useful at user level and disjoint
> from the bytes in the file.

I was considering this at one time for executable header information,
to cause the pages in the actual binary on disk to be block-aligned for
more efficient paging at the boundary conditions.  This is probably
still a win, but might be better handled by moving to a more generic
attribution mechanism.

I should probably say that I've had experience both with upping the
directory block size (for Unicode and multiple name space support),
and with doubling the inode size (for use in attribution -- NetWare,
Apple, OS/2, and NT file attributes, specifically).  I saw similar null
effects everywhere except in competition with Veritas, which must fault
separate pages for its file attribution, and so pays a serious
performance penalty for the use of attributes, in comparison.  So for
that case, it maintained the status quo instead of being a loss.

I also found it useful to combine the directory lookup and stat
operations into a single system call, saving two protection domain
crossings, and to pre-fault the inode asynchronously when the directory
entry was referenced.
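To make that concrete, here is a rough sketch.  The combined interface
(readdir_stat() and struct dirent_stat) is invented here purely for
illustration -- it is not a system call FreeBSD actually has -- but the
conventional pattern it would replace is real enough:

/*
 * Sketch only: readdir_stat() and struct dirent_stat are hypothetical,
 * made up to illustrate the "lookup + stat in one crossing" idea; they
 * are not part of FreeBSD.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdio.h>

struct dirent_stat {
	struct dirent	ds_entry;	/* name, fileno, type */
	struct stat	ds_stat;	/* attributes for that entry */
};

/* Hypothetical combined call: next entry *and* its attributes. */
int	readdir_stat(DIR *dirp, struct dirent_stat *dsp);

/*
 * The conventional pattern it would replace: each stat() below is a
 * separate kernel crossing and a second name lookup on an entry the
 * kernel has just handed back to us.
 */
void
list_dir(const char *path)
{
	DIR *dirp;
	struct dirent *dp;
	struct stat sb;
	char name[1024];

	if ((dirp = opendir(path)) == NULL)
		return;
	while ((dp = readdir(dirp)) != NULL) {
		snprintf(name, sizeof(name), "%s/%s", path, dp->d_name);
		if (stat(name, &sb) == 0)
			printf("%-20s %8ld bytes\n", dp->d_name,
			    (long)sb.st_size);
	}
	closedir(dirp);
}

The win is that the kernel already has the directory block in the
cache, and with the asynchronous pre-fault it usually has the inode as
well, so returning the attributes with the entry is nearly free; the
loop above, by contrast, pays for a full pathname lookup and a kernel
crossing on every name it has just read.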
The pre-faulting assumed locality of usage, and was probably more
related to the application (an attributed kernel FS for NetWare
services in a UNIX environment) than to a general win.

Like the file rewrite benchmark in the VIVA paper, this was a
specialized usage, and more related to implementation than
architecture... for file rewrites at block-boundary granularity, it is
not strictly necessary to fault in the blocks to be rewritten from the
disk, using the historical read-before-write, and then compare the
(obviously bad) numbers that result.  To my knowledge, partial-page
sequential writes at block or multiple-of-block granularity are
possible, but not currently supported by the FreeBSD VM.  If they were
supported, I expect the win from doing the rewrite that way would dwarf
any small relative win into insignificance.

In any case, the VIVA code is worth pursuing; just not for the reasons
that seemed to be behind the original posting ("Linux has it, we must
have it").  If someone wants to mine it for a thesis, it's far better
than some of the other areas people tend to look to.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.