From owner-freebsd-current Mon Aug 26 15:06:07 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id PAA19670 for current-outgoing; Mon, 26 Aug 1996 15:06:07 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id PAA19658; Mon, 26 Aug 1996 15:05:53 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id OAA23328; Mon, 26 Aug 1996 14:55:12 -0700
From: Terry Lambert
Message-Id: <199608262155.OAA23328@phaeton.artisoft.com>
Subject: Re: The VIVA file system (fwd)
To: eric@ms.uky.edu
Date: Mon, 26 Aug 1996 14:55:12 -0700 (MST)
Cc: terry@lambert.org, freebsd-fs@freebsd.org, current@freebsd.org
In-Reply-To: <9608252145.aa12275@t2.t2.mscf.uky.edu> from "eric@ms.uky.edu" at Aug 25, 96 09:45:16 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> I guess I should respond to this thread since I happen to be on
> all these nice FreeBSD mailing lists nowadays.
>
> The Linux version was done by one of Raphael's Master's students.
> It isn't complete, but it does apparently work (I personally have
> not seen it nor have I seen the performance figures).
>
> As a side note, I am currently working on Viva2, which should be a much
> more interesting gadget.  Anyway, it is in active development again
> and FreeBSD is the platform this time.
>
> > The VIVA stuff is, I think, overoptimistic.
> >
> > They have made a number of claims in the University of Kentucky papers
> > that were published about two years ago that seem to rely on overly
> > optimistic assumptions about policy and usage.
>
> You might explain this one, I'm not sure I know what you mean.  The
> paper was written over three years ago.  The work was actually
> performed from late 1991-1992.  The AT&T lawsuit came out, I became
> distracted with making a living, and haven't gotten back to it until
> a couple months ago.

The optimism is in terms of comparative technology.  The FFS you talk
about in the paper is *not* the FFS FreeBSD is running.

Sorry if this seemed like an attack on VIVA; it was intended as an
attack on the idea of replacing FFS with VIVA based on the contents of
the paper, and the fact that "Linux has it, now we need it!".

I know that I saw the paper at least two years and 5 months ago, if not
before that -- I *think* I saw it the week it came out; there was a
presentation by one of the grad students involved to the USL FS gurus:
Art Sabsevitch, Wen Ling Lu, etc., of the code on SVR4.

> For all the discussion below, you must remember that the platforms for
> Viva were 1) AT&T SysV, and 2) BSDI's BSD/386.  We abandoned SysV
> because I wanted to release the code, then came the AT&T lawsuit :-(

I saw the code on #1.  That's part of what made me skeptical; the SVR4
FFS implementation was intentionally (IMO) crippled on a lot of
defaults and tunables so they could make the claims they did about
VXFS.  The VXFS code was the shining golden baby.  Never mind that it
was itself FFS-derived (for example, it used SVR4 UFS directory
management code without modification).

Any comparison against SVR4 UFS as it was will be incredibly biased,
even if the bias was not an intentional result of the testing
conditions, because the UFS code came pre-biased.  8-(.
> > They also seemed to pick "worst case" scenarios for comparison with FFS,
> > and avoided FFS best case.
>
> We did our testing on clean, freshly newfs'd partitions for the graphs.
> I don't see how this is "worst case", but perhaps you mean the types
> of tests we ran.  Obviously, we ran some tests that showed a difference
> between FFS and Viva.

Well, of course.  And without looking at it side-by-side myself, I
really couldn't say whether only positive-for-VIVA comparisons existed,
or whether only the positive ones were presented.  The near-full-disk
scenario seemed a bit contrived, but *could* be justified in real world
usage.

Like I said, I'd like to see someone working on a thesis (or with a
similar incentive for detail) revisit the whole thing in light of the
rewritten FICUS-derived-VFS-based FFS code in FreeBSD, post
cache-unification.  I wonder if all the locality wins would still hold.
My opinion is that at least three of them would not; I'd be happy to be
proven wrong, and find out that they are additive to the VM/buffer
cache architecture improvements in FreeBSD's VFS/VM interface.

> > This is not nearly as bad as the MACH MSDOSFS papers, which intentionally
> > handicapped FFS through parameter setting and cache reduction, while
> > caching the entire DOS FAT in core was seen as being acceptable, to
> > compare their work to FFS.
> >
> > But it is certainly not entirely unbiased reporting.
>
> I'm not sure how to react to this.  Can one write an "entirely
> unbiased" report about one's own work?  Personally, I don't think
> so.  We tried.  I'll leave it at that.

I don't think it's possible, either -- conclusions must always be taken
with a grain of salt.  It is definitely *not* in the same class as the
MACH paper, which exhibited intentional bias.

I guess this would have read better if you had known about the MACH
paper's taking of license before you read my comment.  Sorry about
that -- I tend to assume everyone has the same context, or is willing
to get it.  I wasn't going off half-cocked.

> > The read and rewrite differences are mostly attributable to policy
> > issues in the use of FFS "optimizations" which are inappropriate to
> > the hardware used.
>
> The read and rewrite differences are due to the fact FFS didn't do
> clustering very well at all.  BSDI *still* doesn't do it well, but
> FreeBSD appears to be much better at it.  I'm still running tests
> though and probably will be running tests for some time yet.

Yes.  Like I said, Matt Day's work on this is relevant, but it may
never see the light.  He's made some comments to the effect of what he
has done in brief discussions on the -current list.

Even with the recent work, there's a lot of room for improvement in
FreeBSD clustering and write-gathering, without needing a new disk
layout to get it.  As before, I'm not sure whether this is additive or
parallel to the VIVA development.

> > Finally, I am interested in, but suspicious of, their compression
> > claims, since they also claim that the FFS performance degradation,
> > which Knuth clearly shows to be a hash effect to be expected after
> > an 85% fill (in "Sorting and Searching"), is nonexistent.
>
> Well, the results are in the paper.  This is what we saw, but
> you should look at the table carefully.  There are places where
> the effective clustering of a particular file degrades over 50%,
> but that was (at the time) about as good as FFS ever did anyway.
> The mean effective clustering always remained very high (90%+).
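[A quick aside on that 85% number, since I keep waving Knuth around
without showing the arithmetic.  For open addressing with linear
probing -- I'm quoting the textbook result from memory, and FFS block
allocation is only loosely analogous to a hash table, so treat the
exact constants as illustrative -- the expected number of probes on an
unsuccessful search at load factor alpha is roughly:

	C'(\alpha) \approx \frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)

That works out to about 2.5 probes at 50% fill, about 23 at 85%, and
about 50 at 90%: the search cost blows up non-linearly as the fill
approaches 100%.  That blowup is the reason FFS keeps a free-space
reserve at all (the "minfree" knob in newfs/tunefs, historically 10%),
and it's why I'm suspicious of claims that the degradation past 85%
simply isn't there.]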
Yes; I didn't make the distinction on "effective clustering", the term
which was introduced in the paper.

I'm really not sure about one cache effect being superior to another in
a unified VM.  The FFS does do clustering as well, and it seems that
this has more to do with file I/O clustering mapping effectively onto
disk I/O clustering than with anything that might be a real artifact of
the FFS layout itself.

The address space compression is interesting for vnode-based buffering;
FreeBSD currently uses this method, but... well, I've railed often
enough against vclean that I think everyone knows how I feel.

> I should have some more modern numbers in a few months, Raphael's
> student probably has some for Linux now.  I think he used the
> original algorithms.

Yes... really, it wants a new paper (or 3 or 5 of them).  There is a
lot of room for exciting work in FS development.  I just think that it
would be ill-advised to expect the same gains in a FreeBSD framework,
at least until some basic academic work has taken place that addresses
the new situation.

> > INRE: "where the wins come from", the "Discussion" reverses the
> > claims made earlier in the paper -- we see that the avoidance of
> > indirect blocks is not the primary win (a conclusion we came to
> > on our own from viewing the earlier graphs).
>
> This is correct, the big performance wins came from:
>
> 1) Large block sizes and small frag sizes
> 2) Good clustering
> 3) Multiple read-ahead

Yes; the only one failing in FreeBSD right now is #3.  There is some
room for improvement in #2 for writes, but that won't affect the
read/rewrite speeds (or shouldn't).

> > We also see in "Discussion" that caching file beginnings/ends in the
> > inode itself is not a win as they had hoped.  In fact, compilation
> > times are pessimized by 25% by it.
>
> Yes, we were disappointed by that, but it just confirmed what others
> (Tanenbaum for example) had seen earlier.  You should remember
> that one reason for the degradation was that it threw off all
> the code that tries to read things in FS block-sized chunks.
> We wanted to be able to read headers in files quickly and it *does*
> do that well.  We just thought it would be nice to provide that
> capability (some people have entire file systems dedicated to
> particular tasks).  Some space in the inode can be used for lots
> of things, perhaps it would be most useful at user level and disjoint
> from the bytes in the file.

I was considering this at one time for executable header information,
to cause the pages in the actual binary on disk to be block-aligned for
more efficient paging at the boundary conditions.  This is probably
still a win, but might be better handled by moving to a more generic
attribution mechanism.

I should probably say that I've had experience both with upping the
directory block size (for Unicode and multiple name space support),
and with doubling the inode size (for use in attribution -- NetWare,
Apple, OS/2, and NT file attributes, specifically).  I saw similar null
effects everywhere except in competition with Veritas, which must fault
separate pages for its file attribution, and so pays a serious
performance penalty for the use of attributes, in comparison.  So for
that case, it maintained the status quo instead of being a loss.

I also found it useful to combine the directory lookup and stat
operations into a single system call, saving two protection domain
crossings, and to pre-fault the inode asynchronously when the directory
entry was referenced.
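To make that concrete, here is a rough sketch.  The combined interface
(readdir_stat() and struct dirent_stat) is invented here purely for
illustration -- it is not a system call FreeBSD actually has -- but the
conventional pattern it would replace is real enough:

/*
 * Sketch only: readdir_stat() and struct dirent_stat are hypothetical,
 * made up to illustrate the "lookup + stat in one crossing" idea; they
 * are not part of FreeBSD.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdio.h>

struct dirent_stat {
	struct dirent	ds_entry;	/* name, fileno, type */
	struct stat	ds_stat;	/* attributes for that entry */
};

/* Hypothetical combined call: next entry *and* its attributes. */
int	readdir_stat(DIR *dirp, struct dirent_stat *dsp);

/*
 * The conventional pattern it would replace: each stat() below is a
 * separate kernel crossing and a second name lookup on an entry the
 * kernel has just handed back to us.
 */
void
list_dir(const char *path)
{
	DIR *dirp;
	struct dirent *dp;
	struct stat sb;
	char name[1024];

	if ((dirp = opendir(path)) == NULL)
		return;
	while ((dp = readdir(dirp)) != NULL) {
		snprintf(name, sizeof(name), "%s/%s", path, dp->d_name);
		if (stat(name, &sb) == 0)
			printf("%-20s %8ld bytes\n", dp->d_name,
			    (long)sb.st_size);
	}
	closedir(dirp);
}

The win is that the kernel already has the directory block in the
cache, and with the asynchronous pre-fault it usually has the inode as
well, so returning the attributes with the entry is nearly free; the
loop above, by contrast, pays for a full pathname lookup and a kernel
crossing on every name it has just read.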
The pre-faulting assumed locality of usage, and was probably more
related to the application (an attributed kernel FS for NetWare
services in a UNIX environment) than to a general win.

Like the file rewrite benchmark in the VIVA paper, this was a
specialized usage, and more related to implementation than
architecture... for file rewrites at block-boundary granularity, it is
not strictly necessary to fault in the blocks to be rewritten from the
disk, using the historical read-before-write, and then compare the
(obviously bad) numbers that result.  To my knowledge, partial-page
sequential writes at block or multiple-of-block granularity are
possible, but not currently supported by the FreeBSD VM.  If they were
supported, I expect the win from doing the rewrite that way would dwarf
any small relative win into insignificance.

In any case, the VIVA code is worth pursuing; just not for the reasons
that seemed to be behind the original posting ("Linux has it, we must
have it").  If someone wants to mine it for a thesis, it's far better
than some of the other areas people tend to look to.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.