From owner-freebsd-fs  Tue Mar 21 16:24:19 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from ns0.netcraft.com (ns0.netcraft.com [195.188.192.4])
	by hub.freebsd.org (Postfix) with ESMTP
	id 2BC4A37BBEB; Tue, 21 Mar 2000 16:23:59 -0800 (PST)
	(envelope-from richard@netcraft.com)
Received: (from richard@localhost)
          by ns0.netcraft.com (8.8.8/8.8.8) id AAA28786;
          Wed, 22 Mar 2000 00:22:42 GMT
          (envelope-from richard)
From: Richard Wendland <richard@netcraft.com>
Message-Id: <200003220022.AAA28786@ns0.netcraft.com>
Subject: FreeBSD random I/O performance issues
In-Reply-To: <38D6BBD7.DA4B950B@originative.co.uk> from Paul Richards at "Mar
 21, 2000 00:01:27 am"
To: Paul Richards <paul@originative.co.uk>
Date: Wed, 22 Mar 2000 00:22:42 +0000 (GMT)
Cc: Alfred Perlstein <bright@wintelcom.net>,
	Poul-Henning Kamp <phk@critter.freebsd.dk>,
	Matthew Dillon <dillon@apollo.backplane.com>, current@FreeBSD.ORG,
	fs@FreeBSD.ORG
X-Mailer: ELM [version 2.4ME+ PL61 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Paul Richards said in "Re: patches for test / review":

> Richard, do you want to post a summary of your tests?

Well I'd best post the working draft of my report on the issues
I've seen, as I'm not going to have time to work on it in the near
future, and it raises serious performance issues that are best
looked at soon.  Note none of these detailed results are from
current, but Paul Richards has checked that these issues are still
present in current.

There are still issues to be explored so this report isn't in a
complete state, and not polished.  It's grown in 3 stages:

- initial Berkeley DB (random I/O) performance problem analysis
- side-issue of ATA outperforming SCSI systems at my synthetic benchmark
- interesting dramatic performance changes from changing seek multiple
  and I/O block size one byte from 8192

Note I've cc'd freebsd-fs, as this raises issues in the filesystem
area.  I've also changed the subject since I think there are broader
issues here than the clustering algorithm, and this email is rather
large to drop into an ongoing discussion.

The benchmark program source code is available and easy to run;
the bottom of the report has links.

I don't have an explanation for the behaviour I have been measuring,
but I hope these quite extensive results will enable someone to
explain and perhaps suggest improvements.

	Richard.


Folks,

I appear to have found a serious performance problem with random
access file I/O in FreeBSD, and have a simple C benchmark program
which reproducibly demonstrates it.  In that the benchmark demonstrates
very poor non-async performance, this touches on the age-old
sync/async filesystem argument, and FreeBSD vs Linux debates.

I originally observed this problem with perl DB_File (Berkeley DB),
and with the help of truss have synthesised this benchmark as a
much simplified model of heavy Berkeley DB update behaviour.  Quite
probably other database-like software will have similar performance
issues.

This issue appears to be related to the traditional BSD behaviour
of immediately scheduling full disc block writes.  I think this
benchmark must be showing up a related bug.  But it is conceivable
that this is intended noasync behaviour, in which case the implications
need to be thought through.

The program does simple random I/O within a 64KB file, which should
I hope be fully cached so hardly any real I/O would be done.  Other
than mtime, this program makes no file meta-data or directory
changes; and the file remains the same size.

The file is used as 8 8KB blocks, and for each block in the order
0,5,2,7,4,1,6,3,0,... 10,000 lseek/read/lseek/write block updates
are done, much like updating 10,000 non-localised Berkeley DB file
records.
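
The access pattern described above can be sketched in C roughly as
follows.  This is not the seekreadwrite.c source linked at the bottom of
the report: the names block_for_step and run_updates are made up for
illustration, and the stride of 5 is an assumption inferred from the
order 0,5,2,7,4,1,6,3.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Block visited at step i: 0,5,2,7,4,1,6,3,0,... when nblocks == 8.
 * The stride of 5 is an assumption inferred from that order. */
long block_for_step(long i, long nblocks)
{
    return (i * 5) % nblocks;
}

/*
 * Perform `count` lseek/read/lseek/write updates on `path`, treating it
 * as `nblocks` blocks of `blocksize` bytes.  Returns 0 on success and
 * -1 on any error.
 */
int run_updates(const char *path, long nblocks, size_t blocksize, long count)
{
    char *buf = calloc(1, blocksize);
    int fd, rc = -1;
    long i;

    if (buf == NULL)
        return -1;
    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    /* Pre-size the file so every read below returns a full block. */
    for (i = 0; i < nblocks; i++)
        if (write(fd, buf, blocksize) != (ssize_t)blocksize)
            goto out;

    for (i = 0; i < count; i++) {
        off_t off = (off_t)block_for_step(i, nblocks) * (off_t)blocksize;

        if (lseek(fd, off, SEEK_SET) < 0)
            goto out;
        if (read(fd, buf, blocksize) != (ssize_t)blocksize)
            goto out;
        buf[0]++;                       /* touch the "record" */
        if (lseek(fd, off, SEEK_SET) < 0)
            goto out;
        if (write(fd, buf, blocksize) != (ssize_t)blocksize)
            goto out;
    }
    rc = 0;
out:
    close(fd);
    free(buf);
    return rc;
}
```

Calling run_updates("xxx", 8, 8192, 10000) would correspond to the
64KB, 10,000-update case described above.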

Using a tiny 64KB file is just to simplify and make a point.  My
original perl performance problems were with multi-megabyte files,
but still small enough to be fully cached.

I ran this on a large range of lightly loaded or idle machines,
which gave reproducible results.  Results and a summary of the
machines, which unless otherwise noted use SCSI 7200 RPM discs and
Adaptec controllers, are given in descending performance order
below.


  OS						Elapse secs, system

  FreeBSD 3.2-RELEASE, async mount		<1  (cheap ATA C433, 5400 RPM)
  Linux 2.2.13					<1  (Dell 1300, PIII 450MHz)
  Linux 2.0.36					3   (old ATA P200, 5400 RPM)
  Linux 2.0.36, sync [meta-data] mount		3   (old ATA P200, 5400 RPM)
  SunOS 5.5.1 (Solaris 2.5.1)			7   (old SS4/110, 5400 RPM)
  FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=5	15  (PII 450MHz, 512MB, 10k RPM)
  FreeBSD 2.2.7-RELEASE+CAM			21  (PII 400MHz, 512MB)
  FreeBSD 2.1.6.1-RELEASE			32  (old P100, 64MB)
  FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=2	39  (PII 400MHz, 512MB)
  FreeBSD 3.4-STABLE, vinum stripe+mirr=4	41  (dual PIII 500MHz, 1GB)
  FreeBSD 3.4-STABLE				41  (dual PIII 500MHz, 1GB)
  FreeBSD 2.1.6.1-RELEASE, ccd stripe=2		52  (old P100, 64MB)
  FreeBSD 3.3-RELEASE, ccd stripe=2		53  (Dell 1300, PIII 450MHz)
  FreeBSD 3.2-RELEASE				55  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.2-RELEASE, noatime mount		55  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.2-RELEASE, noclusterr mount		55  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.2-RELEASE, noclusterw mount		58  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.3-RELEASE				63  (Dell 1300, PIII 450MHz)
  FreeBSD 3.3-RELEASE, softupdates		63  (Dell 1300, PIII 450MHz)
  FreeBSD 3.2-RELEASE, sync mount		105 (cheap ATA C433, 5400 RPM)


I also have a range of results from an ATA (IDE) cheap deskside
Dell system running FreeBSD 3.3-RELEASE, with a range of wd(4)
flags.  This system exhibits much better performance than the SCSI
systems above at this benchmark, perhaps related to better DMA
ability.

ATA being faster than SCSI on this benchmark is a bit of a side-issue
to the thrust of this report, but the performance numbers may give
hints for diagnosing the problem.

    Dell Dimension XPS T450 440BX
    IBM-DPTA-372730 (Deskstar 34GXP, 7200RPM, 2MB buffer)
    default mount options

	wd(4) flags				Elapse secs

	0x0000					19
	0x00ff, multi-sector transfer mode	17
	0x8000, 32bit transfers			13
	0x2000, bus-mastering DMA		4
	0xa0ff, BM-DMA+32bit+multi-sector	4
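
The last row of the table is simply the three individual flag values
ORed together.  A small sketch, using the values from the table; the
macro and function names are hypothetical and not taken from the wd(4)
driver:

```c
/* Hypothetical names for the wd(4) flag values listed in the table. */
#define WD_MULTI_SECTOR 0x00ffu   /* multi-sector transfer mode */
#define WD_BM_DMA       0x2000u   /* bus-mastering DMA */
#define WD_32BIT        0x8000u   /* 32bit transfers */

/* OR the three options together, as in the last row of the table. */
unsigned int wd_flags_combined(void)
{
    return WD_BM_DMA | WD_32BIT | WD_MULTI_SECTOR;
}
```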


Note that Linux performs about the same for [meta-data] sync &
async mounts, which is as I'd expect for this program.  But FreeBSD
performance is hugely affected by async, sync or default (meta-data
sync) filesystem mounts, with noclusterw unsurprisingly making it
somewhat worse.

One interesting observation is that for mounts other than sync, async
or noclusterw, ~8750 I/O operations are done, which is 7/8ths of the
10,000 writes.  If I change the program to use 16 blocks there are ~9375
I/O operations which is 15/16ths of the 10,000 writes.  Guessing,
this is as if writes are forced for all blocks but one.

With async filesystem mounts very little I/O occurs, and with
noclusterw there are ~10,000 operations matching the number of
writes.

With sync it's ~20,000 operations matching the total of reads &
writes.  This demonstrates another aspect of the bug, sync behaviour
should cause 10,000 operations; the reads aren't being cached.

A quick softupdates test suggests this makes no difference, as
would be expected.

Looking at mount output on FreeBSD 3 the substantial part of the
I/O is async in all cases other than sync mounts; as expected.


Another aspect of this issue is the effect of changing the seek
blocksize, and write blocksize, by 1 byte each way from 8192, thus
doing block unaligned I/O.  In some cases this changes the amount
of I/O recorded by getrusage to zero, and drops elapse time from
half a minute or so to less than 1 second.
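
For reference, the I/O counts that csh 'time' prints as "in+outio" can
also be read from inside a program with getrusage(2).  A minimal sketch;
the struct and function names are made up for illustration:

```c
#include <sys/time.h>
#include <sys/resource.h>

/* Snapshot of the block I/O counters reported by getrusage(2). */
struct io_counts {
    long in;    /* ru_inblock: block input operations */
    long out;   /* ru_oublock: block output operations */
};

/* Fill `c` with this process's counters.  Returns 0 on success, -1 on error. */
int read_io_counts(struct io_counts *c)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    c->in = ru.ru_inblock;
    c->out = ru.ru_oublock;
    return 0;
}
```

Taking one snapshot before the update loop and one after, and
subtracting, gives the per-run figure.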

Thanks to Paul Richards for noticing this.  I've not spent much time
researching this, so can only present my small set of measurements.
To do these tests you have to recompile my test program each time, e.g.

	gcc -O4 -DBLOCKSIZE=8191 -DWRITESIZE=8193 seekreadwrite.c

Sorry it's that crude.  These results are from a FreeBSD
2.2.7-RELEASE+CAM, ccd stripe=2 (PII 400MHz, 512MB) system,
though exactly the same pattern is apparent with 3.4-STABLE.
"****" indicate sub-second "zero I/O" results.

BLOCKSIZE   WRITESIZE	csh 'time' output

8191	    8191	0.0u 1.5s 0:34.10 4.6% 5+186k 0+7500io 0pf+0w
8191	    8192	0.0u 1.3s 0:31.52 4.5% 5+178k 0+7500io 0pf+0w
8191	    8193	0.0u 1.4s 0:32.63 4.4% 5+189k 0+7500io 0pf+0w

8192	    8191	0.0u 0.7s 0:01.97 37.5% 8+178k 0+0io 0pf+0w    ****
8192	    8192	0.0u 1.3s 0:39.30 3.4% 7+196k 0+8750io 0pf+0w
8192	    8193	0.0u 1.3s 0:40.09 3.4% 5+187k 0+8750io 0pf+0w

8193	    8191	0.0u 1.4s 0:46.22 3.2% 5+192k 0+8750io 0pf+0w
8193	    8192	0.0u 1.6s 0:40.48 4.0% 5+182k 0+8750io 0pf+0w
8193	    8193	0.0u 1.5s 0:40.57 3.8% 5+175k 0+8750io 0pf+0w


8191	    4095	0.0u 1.2s 0:33.79 3.6% 5+193k 0+7500io 0pf+0w
8191	    4096	0.0u 1.2s 0:34.00 3.8% 5+190k 0+7500io 0pf+0w
8191	    4097	0.0u 1.1s 0:33.58 3.6% 4+165k 0+7500io 0pf+0w

8192	    4095	0.0u 0.5s 0:00.76 75.0% 5+189k 0+0io 0pf+0w    ****
8192	    4096	0.0u 0.5s 0:00.58 100.0% 5+183k 0+0io 0pf+0w   ****
8192	    4097	0.0u 0.5s 0:00.74 78.3% 5+181k 0+0io 0pf+0w    ****

8193	    4095	0.0u 0.6s 0:01.00 67.0% 5+177k 0+0io 0pf+0w    ****
8193	    4096	0.0u 0.6s 0:01.05 63.8% 5+179k 0+0io 0pf+0w    ****
8193	    4097	0.0u 0.6s 0:01.02 66.6% 5+183k 0+0io 0pf+0w    ****


Any views gratefully received.  A fix would be much better :-)

Test program source, including compile & run instructions, is
available at:

	http://www.netcraft.com/freebsd/random-IO/seekreadwrite.c

Detailed notes on the test system configurations are at:

	http://www.netcraft.com/freebsd/random-IO/results-notes.txt

Thanks,
	Richard
-
Richard Wendland				richard@netcraft.com


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Mar 21 16:59:42 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id E105B37BD80; Tue, 21 Mar 2000 16:59:29 -0800 (PST)
	(envelope-from dillon@apollo.backplane.com)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.9.3/8.9.1) id QAA83848;
	Tue, 21 Mar 2000 16:59:25 -0800 (PST)
	(envelope-from dillon)
Date: Tue, 21 Mar 2000 16:59:25 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200003220059.QAA83848@apollo.backplane.com>
To: Richard Wendland <richard@netcraft.com>
Cc: Paul Richards <paul@originative.co.uk>,
	Alfred Perlstein <bright@wintelcom.net>,
	Poul-Henning Kamp <phk@critter.freebsd.dk>, current@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
References:  <200003220022.AAA28786@ns0.netcraft.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


:Paul Richards said in "Re: patches for test / review":
:
:> Richard, do you want to post a summary of your tests?
:
:Well I'd best post the working draft of my report on the issues
:I've seen, as I'm not going to have time to work on it in the near
:future, and it raises serious performance issues that are best
:looked at soon.  Note none of these detailed results are from
:current, but Paul Richards has checked that these issues are still
:present in current.
:
: (lots of good stuff)

    Interesting.  The behavior is probably related closely to the
    write-behind methodology that UFS uses.

    A while back while fixing an O(N^2) degenerate condition in the buffer
    cache queueing code, DG and I had a long discussion of the write_behind
    behavior.  I added a sysctl to 4.x that changes the write_behind
    behavior:

	sysctl vfs.write_behind

	0	Turned off
	1	Normal		(default)
	2	Backed off

    It would be interesting to see how the benchmark performs with 
    write_behind turned off (set to 0).  Note that a setting of 2
    is highly experimental and will probably suffer from the same problem(s)
    that normal mode suffers from.  (see below, I ran the benchmark)

    In general turning off write behind is *NOT* a good idea, because
    it saturates the buffer cache with dirty blocks and can lead to seriously
    degraded performance on a normal system due to write hogging.   On the
    flip side, this was all before I put in the new buffer cache flushing code
    so it is possible that 4.x will not degrade as seriously with write
    behind turned off.  I haven't run saturation tests recently with 
    write_behind turned off.

    A secondary issue -- actually the reason *why* performance is so bad -- is
    that the buffer cache nominally locks the underlying VM pages when issuing
    a write and this is almost certainly the cause of the program stalls.
    When a program writes a piece of data (and I/O is started immediately),
    and then reads it back later on, the read operation may stall even though
    the data is in the cache due to the write not having yet completed.  The
    write operation might also stall if another nearby write is in progress
    (I'm not sure on that last point).

    Kirk has made significant improvements to stalls related to bitmap 
    operations.  I'm not sure if softupdates must be turned on or not to
    get these improvements.  The data blocks can still stall, though, but 
    part of the plan for later this year is to fix that too.

:The benchmark program source code is available, and easy to run,
:the bottom of the report has links.

    test3:/test/tmp# sysctl -w vfs.write_behind=0		(turned off)
    test3:/test/tmp# time ./seekreadwrite xxx 10000
    0.125u 0.807s 0:00.93 98.9%     5+181k 0+0io 0pf+0w

    test3:/test/tmp# sysctl -w vfs.write_behind=1		(normal)
    test3:/test/tmp# time ./seekreadwrite xxx 10000
    0.040u 1.709s 0:32.57 5.3%      4+174k 0+8750io 0pf+0w


:I also have a range of results from an ATA (IDE) cheap deskside
:Dell system running FreeBSD 3.3-RELEASE, with a range of wd(4)
:flags.  This system exhibits much better performance than the SCSI
:systems above at this benchmark, perhaps related to better DMA
:ability.
:
:ATA being faster than SCSI on this benchmark is a bit of a side-issue
:to the thrust of this report, but the performance numbers may give
:hints diagnosing the problem.

    IDE drives sometimes appear to be faster because they fake the 
    write-completion response (they return the response prior to the
    write actually completing).  It could also simply be that the 
    lack of any real mixed I/O (due to the file being so small) is
    a slightly faster operation on an IDE drive.  I wouldn't read much
    into it... where SCSI really shines is in more heavily loaded 
    environments.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

:Thanks,
:	Richard
:-
:Richard Wendland				richard@netcraft.com




From owner-freebsd-fs  Tue Mar 21 18:45:31 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from mailgate.originative.co.uk (mailgate.originative.co.uk [194.217.50.228])
	by hub.freebsd.org (Postfix) with ESMTP
	id 799A437C048; Tue, 21 Mar 2000 18:45:18 -0800 (PST)
	(envelope-from paul@originative.co.uk)
Received: from originative.co.uk (lobster.originative.co.uk [194.217.50.241])
	by mailgate.originative.co.uk (Postfix) with ESMTP
	id 614EA1D131; Wed, 22 Mar 2000 02:45:16 +0000 (GMT)
Message-ID: <38D833BC.A082DF09@originative.co.uk>
Date: Wed, 22 Mar 2000 02:45:16 +0000
From: Paul Richards <paul@originative.co.uk>
Organization: Originative Solutions Ltd
X-Mailer: Mozilla 4.7 [en] (X11; I; FreeBSD 4.0-CURRENT i386)
X-Accept-Language: en-GB, en
MIME-Version: 1.0
To: Richard Wendland <richard@netcraft.com>
Cc: Alfred Perlstein <bright@wintelcom.net>,
	Poul-Henning Kamp <phk@critter.freebsd.dk>,
	Matthew Dillon <dillon@apollo.backplane.com>, current@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
References: <200003220022.AAA28786@ns0.netcraft.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Richard Wendland wrote:
> 

I spent a bit of time analysing these results when I first saw them. I
don't think it has anything to do with the cache, it has to do with how
we write out blocks.

> One interesting observation is that for non sync, async or noclusterw
> mounts ~8750 I/O operations are done, which is 7/8ths of the 10,000
> writes.  If I change the program to use 16 blocks there are ~9375
> I/O operations which is 15/16ths of the 10,000 writes.  Guessing,
> this is as if writes are forced for all blocks but one.

This is due to a quirk of the clustering algorithm. See below or my
previous email.

> With async filesystem mounts very little I/O occurs, and with
> noclusterw there are ~10,000 operations matching the number of
> writes.
> 
> With sync it's ~20,000 operations matching the total of reads &
> writes.  This demonstrates another aspect of the bug, sync behaviour
> should cause 10,000 operations; the reads aren't being cached.

This isn't quite true. It's 20,000 *write* operations. I put this down
to the mtime update for each write doubling the number of actual write
operations. No read operations take place, the data *does* come out of
the cache. There's nothing wrong with reading as far as I can tell.
  
> Another aspect of this issue is the effect of changing the seek
> blocksize, and write blocksize, by 1 byte each way from 8192, thus
> doing block unaligned I/O.  In some cases this changes the amount
> of I/O recorded by getrusage to zero, and drops elapse time from
> half a minute or so to less than 1 second.
> 
> Thanks to Paul Richards for noticing this.  I've not spent much time
> researching this, so can only present my small set of measurements.
> To do these tests you have to recompile my test program each time eg
> 
>         gcc -O4 -DBLOCKSIZE=8191 -DWRITESIZE=8193 seekreadwrite.c

This is because a full filesystem block is written immediately, or
rather the clustering code is called immediately. The rationale is that
a full block isn't likely to be
written to again so it might as well be pushed out to disk. Richard's
program deliberately writes full blocks, which is apparently what db
does, so it always forces a write to take place. Given the behaviour of
db it might be more sensible to remove this feature and just mark full
blocks dirty the same as other blocks since it's likely that they will
be written to again shortly if the db record is written to frequently.
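
One way to read the BLOCKSIZE/WRITESIZE table against this explanation
is to assume that a write is handed to the clustering code whenever it
runs up to the end of an 8192-byte filesystem block, wherever in the
block it started.  That trigger condition is an assumption and is not
confirmed anywhere in this thread; the function names are hypothetical.
Under it, the sketch below appears to reproduce the zero versus
non-zero I/O split in the table.

```c
/*
 * Assumed trigger (not spelled out in this thread): a write is handed
 * to the clustering code when it runs up to the end of a filesystem
 * block.  Returns 1 if a write of `len` bytes at `offset` does so.
 */
int write_reaches_block_end(long offset, long len, long fs_bsize)
{
    return len > 0 && (offset % fs_bsize) + len >= fs_bsize;
}

/*
 * Returns 1 if any of the `nblocks` benchmark writes (at offsets
 * k * blocksize, each `writesize` bytes long) reaches a block end,
 * and 0 if every write would just be marked dirty and cached.
 */
int any_write_pushed(long blocksize, long writesize, long nblocks,
                     long fs_bsize)
{
    long k;

    for (k = 0; k < nblocks; k++)
        if (write_reaches_block_end(k * blocksize, writesize, fs_bsize))
            return 1;
    return 0;
}
```

For example, any_write_pushed(8192, 8191, 8, 8192) is 0, matching the
"0+0io" row, while any_write_pushed(8191, 4095, 8, 8192) is 1 because
the write at offset 8191 crosses the end of the first block.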

The clustering code has a bug in that an old cluster is not pushed out
if the block number is 0, because the code that would do so never gets
reached.

if (lbn == 0)
        vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;


if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
        (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {
        maxclen = vp->v_mount->mnt_iosize_max / lblocksize - 1;
        if (vp->v_clen != 0) {
            /*
             * Next block is not sequential.
             *
             * If we are not writing at end of file, the process
             * seeked to another point in the file since its last
             * write, or we have reached our maximum cluster size,
             * then push the previous cluster. Otherwise try
             * reallocating to make it sequential.
             */

         ............

In Richard's program the next block is never sequential so the previous
cluster is always pushed, *except* that when the program seeks back to
block zero the "if (vp->v_clen != 0)" test fails, because v_clen has just
been zeroed by the lbn == 0 assignment above, and a new cluster is started
without pushing out the previously started one. That dirty block in the
previous cluster then hangs around until it is flushed as dirty blocks
normally would be.

It is the combination of this clustering behaviour and the fact that the
program always writes full blocks that causes the 8750 writes below.
Since the blocks are full filesystem blocks, they are passed to the
clustering code immediately rather than being marked dirty. Because they
are never in sequence, the clustering code always starts a new cluster
and flushes the previous one, except for 1 write in every 8: when block 0
is written the previous cluster is not pushed out but hangs around. The
end result is that 7/8 of the blocks get written immediately, which is
8750 of the 10000 writes.
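
The counting argument can be checked with a toy model.  This is only a
sketch of the behaviour as described in this message, not kernel code;
the function name is made up, and the visiting order (i * 5) % nblocks
is an assumption that matches the 8-block order and is simply carried
over to the 16-block case.

```c
/*
 * Toy model, not kernel code.  Each full-block write starts a new
 * one-block cluster; the previous cluster is pushed to disk unless the
 * new write is to block 0, where the cluster state is reset first and
 * the push is skipped.
 */
long count_pushed_writes(long nblocks, long nwrites)
{
    long pushed = 0;
    int cluster_pending = 0;    /* stands in for vp->v_clen != 0 */
    long i;

    for (i = 0; i < nwrites; i++) {
        long lbn = (i * 5) % nblocks;   /* assumed visiting order */

        if (lbn == 0)
            cluster_pending = 0;    /* state zeroed, old cluster forgotten */
        if (cluster_pending)
            pushed++;               /* non-sequential: push previous cluster */
        cluster_pending = 1;        /* a new cluster now starts at lbn */
    }
    return pushed;
}
```

With 8 blocks and 10000 writes the model gives 8750 pushes, and with 16
blocks it gives 9375, the same figures Richard measured.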

When the write size drops below the filesystem block size then the
clustering code never gets called because the buffers are just marked
dirty and cached.

I think if we fixed the issue of writing out full blocks this behaviour
would stop but I also think the clustering code could do with a fix. It
should at least check to see if there is a cluster being built when the
blockno is 0 and push it out. Possibly though it'd be better to not push
out clusters of only one block and just leave them in the cache.

> 
> Sorry it's that crude.  These results are from a FreeBSD
> 2.2.7-RELEASE+CAM, ccd stripe=2 (PII 400MHz, 512MB) system,
> though exactly the same pattern is apparent with 3.4-STABLE.
> "****" indicate sub-second "zero I/O" results.
> 
> BLOCKSIZE   WRITESIZE   csh 'time' output
> 
> 8191        8191        0.0u 1.5s 0:34.10 4.6% 5+186k 0+7500io 0pf+0w
> 8191        8192        0.0u 1.3s 0:31.52 4.5% 5+178k 0+7500io 0pf+0w
> 8191        8193        0.0u 1.4s 0:32.63 4.4% 5+189k 0+7500io 0pf+0w
> 
> 8192        8191        0.0u 0.7s 0:01.97 37.5% 8+178k 0+0io 0pf+0w    ****
> 8192        8192        0.0u 1.3s 0:39.30 3.4% 7+196k 0+8750io 0pf+0w
> 8192        8193        0.0u 1.3s 0:40.09 3.4% 5+187k 0+8750io 0pf+0w
> 
> 8193        8191        0.0u 1.4s 0:46.22 3.2% 5+192k 0+8750io 0pf+0w
> 8193        8192        0.0u 1.6s 0:40.48 4.0% 5+182k 0+8750io 0pf+0w
> 8193        8193        0.0u 1.5s 0:40.57 3.8% 5+175k 0+8750io 0pf+0w
> 
> 8191        4095        0.0u 1.2s 0:33.79 3.6% 5+193k 0+7500io 0pf+0w
> 8191        4096        0.0u 1.2s 0:34.00 3.8% 5+190k 0+7500io 0pf+0w
> 8191        4097        0.0u 1.1s 0:33.58 3.6% 4+165k 0+7500io 0pf+0w
> 
> 8192        4095        0.0u 0.5s 0:00.76 75.0% 5+189k 0+0io 0pf+0w    ****
> 8192        4096        0.0u 0.5s 0:00.58 100.0% 5+183k 0+0io 0pf+0w   ****
> 8192        4097        0.0u 0.5s 0:00.74 78.3% 5+181k 0+0io 0pf+0w    ****
> 
> 8193        4095        0.0u 0.6s 0:01.00 67.0% 5+177k 0+0io 0pf+0w    ****
> 8193        4096        0.0u 0.6s 0:01.05 63.8% 5+179k 0+0io 0pf+0w    ****
> 8193        4097        0.0u 0.6s 0:01.02 66.6% 5+183k 0+0io 0pf+0w    ****
> 
> Any views gratefully received.  A fix would be much better :-)
> 
> Test program source, including compile & run instructions, is
> available at:
> 
>         http://www.netcraft.com/freebsd/random-IO/seekreadwrite.c
> 
> Detailed notes on the test system configurations are at:
> 
>         http://www.netcraft.com/freebsd/random-IO/results-notes.txt
> 
> Thanks,
>         Richard
> -
> Richard Wendland                                richard@netcraft.com




From owner-freebsd-fs  Tue Mar 21 22:18: 4 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 80A3637BB18; Tue, 21 Mar 2000 22:17:57 -0800 (PST)
	(envelope-from dillon@apollo.backplane.com)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.9.3/8.9.1) id WAA86154;
	Tue, 21 Mar 2000 22:17:52 -0800 (PST)
	(envelope-from dillon)
Date: Tue, 21 Mar 2000 22:17:52 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200003220617.WAA86154@apollo.backplane.com>
To: Paul Richards <paul@originative.co.uk>
Cc: Richard Wendland <richard@netcraft.com>,
	Alfred Perlstein <bright@wintelcom.net>,
	Poul-Henning Kamp <phk@critter.freebsd.dk>, current@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
References: <200003220022.AAA28786@ns0.netcraft.com> <38D833BC.A082DF09@originative.co.uk>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

:written immediately which is 8750/10000 writes.
:
:When the write size drops below the filesystem block size then the
:clustering code never gets called because the buffers are just marked
:dirty and cached.
:
:I think if we fixed the issue of writing out full blocks this behaviour
:would stop but I also think the clustering code could do with a fix. It
:should at least check to see if there is a cluster being built when the
:blockno is 0 and push it out. Possibly though it'd be better to not push
:out clusters of only one block and just leave them in the cache.

    Hmm.  Your analysis is correct but I don't think it's worth
    fixing the block-is-0 case.   It may be worth revisiting the
    write-behind code to try to give it the ability to better discern
    random I/O from sequential I/O (e.g. perhaps it should ignore unaligned
    full blocks).

    It is perfectly ok for dirty blocks to remain in the buffer cache.  In
    fact, it's *optimal* to leave them in the buffer cache as long as the
    buffer cache does not get saturated with them.  The buffer cache is
    perfectly capable of clustering delayed writes.  Also, the filesystem 
    syncer comes along every 30 seconds or so anyway and flushes everything
    out.

    What the write-behind code tries to do is to prevent the buffer cache 
    from being saturated with dirty buffers and to smooth out disk write
    I/O.  It makes the assumption that write-behind data is not typically
    accessed by the program immediately after being written -- an assumption
    that winds up being incorrect in the DBM case you tested, resulting
    in stalls due to the buffer / VM pages being locked during the write I/O.
    The stalls are *not* due to the I/O itself but instead are due to side
    effects of the I/O being in-progress.  If a user program doesn't access
    any of the information it recently wrote the whole mechanism winds up
    operating asynchronously in the background.  If a user program does, 
    then the write behind mechanism breaks down and you get a stall.

    The most common dirty-data case the filesystem has to deal with is 
    appending to a file -- that is, doing piecemeal sequential writes.  There
    are virtually no other cases which have the ability to saturate the
    buffer cache.  This is why the write-behind code only tries to handle
    the piecemeal-write-flush-full-blocks case.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>




From owner-freebsd-fs  Wed Mar 22  7:46: 3 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from ns0.netcraft.com (ns0.netcraft.com [195.188.192.4])
	by hub.freebsd.org (Postfix) with ESMTP
	id 2076B37BD9A; Wed, 22 Mar 2000 07:45:56 -0800 (PST)
	(envelope-from richard@netcraft.com)
Received: (from richard@localhost)
          by ns0.netcraft.com (8.8.8/8.8.8) id PAA08760;
          Wed, 22 Mar 2000 15:44:20 GMT
          (envelope-from richard)
From: Richard Wendland <richard@netcraft.com>
Message-Id: <200003221544.PAA08760@ns0.netcraft.com>
Subject: Re: FreeBSD random I/O performance issues
In-Reply-To: <38D833BC.A082DF09@originative.co.uk> from Paul Richards at "Mar
 22, 2000 02:45:16 am"
To: Paul Richards <paul@originative.co.uk>
Date: Wed, 22 Mar 2000 15:44:20 +0000 (GMT)
Cc: Richard Wendland <richard@netcraft.com>,
	Alfred Perlstein <bright@wintelcom.net>,
	Poul-Henning Kamp <phk@critter.freebsd.dk>,
	Matthew Dillon <dillon@apollo.backplane.com>, current@FreeBSD.ORG,
	fs@FreeBSD.ORG
X-Mailer: ELM [version 2.4ME+ PL61 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > With sync it's ~20,000 operations matching the total of reads &
> > writes.  This demonstrates another aspect of the bug, sync behaviour
> > should cause 10,000 operations; the reads aren't being cached.
> 
> This isn't quite true. It's 20,000 *write* operations. I put this down
> to the mtime update for each write doubling the number of actual write
> operations. No read operations take place, the data *does* come out of
> the cache. There's nothing wrong with reading as far as I can tell.

Yes, you're absolutely right, I should have looked at my own data
more closely.

If I change the test program to call fsync after write, and run on
a default mount filesystem I also see 20,000 I/O operations from
10,000 writes.  This probably impacts more real programs out there
than sync mounts.
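
The change amounts to following each write with fsync(2).  A minimal
sketch of one such update; the function name is hypothetical and this is
not the modified test program itself:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `len` bytes at offset `off` in `path`, then fsync() before
 * returning.  Returns 0 on success, -1 on any error.
 */
int write_then_fsync(const char *path, off_t off, const void *buf, size_t len)
{
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;
    if (lseek(fd, off, SEEK_SET) >= 0 &&
        write(fd, buf, len) == (ssize_t)len &&
        fsync(fd) == 0)
        rc = 0;
    close(fd);
    return rc;
}
```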

If this is mtime updates being done synchronously, that seems a
separate issue to the clustering/VM issue, and seems to me it should
be fixed.  It'll normally double the number of all writes, won't
it, possibly forcing seeks between otherwise localised access.

Can anyone offer an alternative hypothesis to mtime updates being
done synchronously?

Looking at my logs for the sync filesystem test, mount output before
and after shows all ~20,000 operations are writes:

	mount
	/dev/wd0s2e on /var (local, synchronous, writes: sync 182 async 10)

	time ./seekreadwrite xxx 10000
	0.1u 7.8s 1:47.61 7.4% 5+179k 0+20000io 0pf+0w

	mount
	/dev/wd0s2e on /var (local, synchronous, writes: sync 20190 async 15)

But when using fsync on a default mount filesystem, 10000 writes
are sync and 10000 async:

	mount
	/dev/wd0s2e on /var (local, writes: sync 682 async 2764)

	time ./seekreadwrite xxx 10000
	0.0u 1.7s 0:54.34 3.3% 4+171k 0+20000io 0pf+0w

	mount
	/dev/wd0s2e on /var (local, writes: sync 10682 async 12777)
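
The before/after differences can be pulled out of the mount lines
mechanically.  A small sketch using only standard C; the function name
is made up, and the line format is assumed to be exactly as shown above:

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the "writes: sync N async M" counters from one line of mount
 * output.  Returns 0 on success, -1 if the counters are not present.
 */
int parse_mount_writes(const char *line, long *sync_writes, long *async_writes)
{
    const char *p = strstr(line, "writes: sync ");

    if (p == NULL)
        return -1;
    if (sscanf(p, "writes: sync %ld async %ld",
               sync_writes, async_writes) != 2)
        return -1;
    return 0;
}
```

Subtracting the "before" counters from the "after" counters gives 20008
sync and 5 async writes for the sync mount, and 10000 sync and 10013
async writes for the default mount with fsync.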

This is on the ATA machine that could run the test in 4 seconds
without fsync, 54 seconds with fsync, suggesting some head movements
may be being forced, though not 20000 as that would imply 2.7ms
per seek.
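
The 2.7ms figure is just elapsed time divided by operations.  The sketch
below also computes average rotational latency (half a revolution) as a
point of comparison; that comparison is an added illustration, and it
assumes the 7200 RPM drive listed for the ATA machine in the original
report.  Function names are made up.

```c
/* Average time per operation in milliseconds. */
double ms_per_op(double elapsed_secs, long ops)
{
    return elapsed_secs * 1000.0 / (double)ops;
}

/* Average rotational latency (half a revolution) in milliseconds. */
double avg_rotational_latency_ms(double rpm)
{
    return 60000.0 / rpm / 2.0;
}
```

ms_per_op(54.34, 20000) comes to about 2.7ms, while
avg_rotational_latency_ms(7200) is about 4.2ms, so 20000 operations
each waiting on a full mechanical access would not fit in 54 seconds.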

	Richard
-- 
Richard Wendland				richard@netcraft.com




From owner-freebsd-fs  Wed Mar 22 12:38:46 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from mass.cdrom.com (mg134-217.ricochet.net [204.179.134.217])
	by hub.freebsd.org (Postfix) with ESMTP
	id 1EB4A37C24E; Wed, 22 Mar 2000 12:38:32 -0800 (PST)
	(envelope-from msmith@mass.cdrom.com)
Received: from mass.cdrom.com (localhost [127.0.0.1])
	by mass.cdrom.com (8.9.3/8.9.3) with ESMTP id MAA00661;
	Wed, 22 Mar 2000 12:39:46 -0800 (PST)
	(envelope-from msmith@mass.cdrom.com)
Message-Id: <200003222039.MAA00661@mass.cdrom.com>
X-Mailer: exmh version 2.1.1 10/15/1999
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Paul Richards <paul@originative.co.uk>,
	Richard Wendland <richard@netcraft.com>,
	Alfred Perlstein <bright@wintelcom.net>,
	Poul-Henning Kamp <phk@critter.freebsd.dk>, current@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues 
In-reply-to: Your message of "Tue, 21 Mar 2000 22:17:52 PST."
             <200003220617.WAA86154@apollo.backplane.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 22 Mar 2000 12:39:42 -0800
From: Mike Smith <msmith@freebsd.org>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

>     effects of the I/O being in-progress.  If a user program doesn't access
>     any of the information it recently wrote the whole mechanism winds up
>     operating asynchronously in the background.  If a user program does, 
>     then the write behind mechanism breaks down and you get a stall.

What makes no sense is that it should be perfectly ok to _read_ this 
information back.

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  msmith@freebsd.org
\\ and he'll hate you for a lifetime.             \\  msmith@cdrom.com




From owner-freebsd-fs  Wed Mar 22 16:10:44 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 1803B37C259; Wed, 22 Mar 2000 16:10:40 -0800 (PST)
	(envelope-from dillon@apollo.backplane.com)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.9.3/8.9.1) id QAA94351;
	Wed, 22 Mar 2000 16:10:39 -0800 (PST)
	(envelope-from dillon)
Date: Wed, 22 Mar 2000 16:10:39 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200003230010.QAA94351@apollo.backplane.com>
To: Mike Smith <msmith@FreeBSD.ORG>
Cc: Paul Richards <paul@originative.co.uk>,
	Richard Wendland <richard@netcraft.com>,
	Alfred Perlstein <bright@wintelcom.net>,
	Poul-Henning Kamp <phk@critter.freebsd.dk>, current@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues 
References:  <200003222039.MAA00661@mass.cdrom.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


:
:>     effects of the I/O being in-progress.  If a user program doesn't access
:>     any of the information it recently wrote the whole mechanism winds up
:>     operating asynchronously in the background.  If a user program does, 
:>     then the write behind mechanism breaks down and you get a stall.
:
:What makes no sense is that it should be perfectly ok to _read_ this 
:information back.

    When we separate out the read vs write access in the buffer
    cache API we *will* be able to read the information back while a
    write is in progress.  At the moment the buffer cache has no clue
    how a buffer is going to be used, which means the buffer is locked
    exclusively.
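
The difference between an exclusive-only lock and separate read and
write access can be illustrated with a toy model.  This is only a
sketch: struct buf_lock and its functions are hypothetical names, not
the FreeBSD buffer cache API, and the model is single-threaded and
non-blocking (a real lock would sleep instead of returning false).  It
assumes that a write to disk only needs to read the buffer contents
and so could hold the buffer in shared mode.

```c
#include <stdbool.h>

/* Toy model of a buffer lock; not the FreeBSD buffer cache API. */
struct buf_lock {
    int  shared;     /* holders that only read the buffer contents */
    bool exclusive;  /* true while one holder may modify the contents */
};

/* Take the lock for reading the contents; fails if a modifier holds it. */
bool buf_lock_shared(struct buf_lock *l)
{
    if (l->exclusive)
        return false;
    l->shared++;
    return true;
}

/* Take the lock for modifying the contents; fails if anyone holds it. */
bool buf_lock_exclusive(struct buf_lock *l)
{
    if (l->exclusive || l->shared > 0)
        return false;
    l->exclusive = true;
    return true;
}

/* Release one shared hold. */
void buf_unlock_shared(struct buf_lock *l)
{
    l->shared--;
}

/* Release the exclusive hold. */
void buf_unlock_exclusive(struct buf_lock *l)
{
    l->exclusive = false;
}
```

If the in-progress write takes buf_lock_exclusive(), a later
buf_lock_shared() from the reading program fails, which corresponds to
the stall.  If the write takes buf_lock_shared() instead, the read-back
succeeds while the write is still outstanding.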

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>




From owner-freebsd-fs  Wed Mar 22 16:17:37 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from fLuFFy.iNt.tElE.dK (fw1.inet.tele.dk [193.163.158.4])
	by hub.freebsd.org (Postfix) with ESMTP
	id B20CC37C2B9; Wed, 22 Mar 2000 16:17:17 -0800 (PST)
	(envelope-from pedophile@INT.TELE.DK)
Received: from localhost (pedophile@localhost)
	by fLuFFy.iNt.tElE.dK (8.9.3/8.9.3) with SMTP id BAA86413;
	Thu, 23 Mar 2000 01:17:06 +0100 (CET)
	(envelope-from pedophile@INT.TELE.DK)
X-Authentication-Warning: fLuFFy.iNt.tElE.dK: pedophile owned process doing -bs
Date: Thu, 23 Mar 2000 01:17:06 +0100 (CET)
From: FREENIX IS OVERRATED <pedophile@INT.TELE.DK>
Reply-To: FreeBSD-abusers@netscum.dk
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
In-Reply-To: <fa.kiqr0qv.1enaji3@ifi.uio.no>
Message-ID: <Pine.BSF.3.96.1000323001655.85607G-100000@fLuFFy.iNt.tElE.dK>
X-Pedophile: BARRY BOUWSMA IS AN OFFENSIVE USENET PEDOPHILE
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Wed, 2395 Sep 1993, Matthew Dillon wrote:

>     It is perfectly ok for dirty blocks to remain in the buffer cache.  In
>     fact, it's *optimal* to leave them in the buffer cache as long as the
>     buffer cache does not get saturated with them.  The buffer cache is
>     perfectly capable of clustering delayed writes.  Also, the filesystem 
>     syncer comes along every 30 seconds or so anyway and flushes everything
>     out.
> 
>     What the write-behind code tries to do is to prevent the buffer cache 
>     from being saturated with dirty buffers and to smooth out disk write
>     I/O.  It makes the assumption that write-behind data is not typically
>     accessed by the program immediately after being written -- an assumption
>     that winds up being incorrect in the DBM case you tested and resulting
>     in stalls due to the buffer / VM pages being locked during the write I/O.
>     The stalls are *not* due to the I/O itself but instead are due to side
>     effects of the I/O being in-progress.

And that sounds a heck of a lot like what those of us who have been
running INN news swervers with 1,1GB size text history files on 2.whazzit
(now dead, may it rest in pieces widely-scattered) and later have seen.

You should have forgotten that a couple months or so ago, I wrote to
one of these lists to ask why I was getting only about 50-70%
availability as my 1.5+MD5-based-dbz innd was stuck in ufslck2 during
these every-30-seconds syncs.  The .hash and .index files from this,
which are comparable to the dbm (dbz) files, are typically 125MB and
85MB or so; this was under 3.4-STABLE.

Well, I've meant to get around to trying 4.0 on it, and Real Soon Now
I will, but I wanted to relate my experiences in turning traitor, a
heretic who has left the fold, deserving to be ridden out of town on
a rail and stuff, which sounds like a lot of fun.  I tried NetBSD.

NetBSD (at least the development now 1.4V version) has trickle
syncing, which seems to work quite well when having to cope with
these rather large database files, keeping a full 14 days of message
IDs from a full news feed.

Without really tuning anything, after a bit of time, the time needed
to do history lookups drops to microseconds, and as long as a `sync'
isn't needed, innd doesn't get stuck.  Theoretically, a sync, where
you are in fact seeking rather wildly over the disk to update these
files, happens once a day at expiry.  Depending on the speed of the
drive (and I haven't optimized this at all, using a single drive for
OS, logs, history, and part of spool, with a second drive for the rest
of the spool, far from an ideal setup), this seems to mean only a
few minutes of downtime.  Actually building the new .index and .hash
files goes quite a bit faster, like by a factor of 3 to 4, so clearly
the update of these files during the `sync' could stand improved sorting.

I wouldn't complain a bit if you were to steal mercilessly from the
NetBSD k0deZ to incorporate trickle sync (if something comparable is
not already in place) since that seems to make a world of difference
for those of us using long-outdated INN code and who want to have
bigger history file sizes than our shriveling Freenix members.

(What kills me now is that I'm using a single drive to hold the news
spool apart from a small overflow, so while time spent accessing this
history database is way down, the time actually spent hopping around
the disk to write (and read, for our sluggish peers) articles has
skyrocketed.  The box I'll try 4.0 on has a separate disk pack that
is far faster under NetBSD.  Test boxen, eh?)


There.  I've confessed.  It feels really good.  Now have at me.


Naturally, since I haven't followed this discussion closely, you may
be talking about something completely different, but I did want to
mention generally improved (yet not totally perfect) performance
with huge INN database files and NetBSD's trickle syncing.  Now,
go out and steal some k0deZ, okay?


barry bouwsma, tele danMerika internet

-- 

     *** This was posted with the express permission of ***
     ******************************************************
     **  HIS HIGHNESS KAAZMANN LORD AND MASTER OF USENET **
     ******************************************************
     ********* We are simple servants of his will *********




From owner-freebsd-fs  Wed Mar 22 16:48:47 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 413E037B5BD; Wed, 22 Mar 2000 16:48:43 -0800 (PST)
	(envelope-from dillon@apollo.backplane.com)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.9.3/8.9.1) id QAA94830;
	Wed, 22 Mar 2000 16:48:40 -0800 (PST)
	(envelope-from dillon)
Date: Wed, 22 Mar 2000 16:48:40 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200003230048.QAA94830@apollo.backplane.com>
To: FREENIX IS OVERRATED <pedophile@INT.TELE.DK>
Cc: current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
References:  <Pine.BSF.3.96.1000323001655.85607G-100000@fLuFFy.iNt.tElE.dK>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


:>     out.
:> 
:>     What the write-behind code tries to do is to prevent the buffer cache 
:>     from being saturated with dirty buffers and to smooth out disk write
:>     I/O.  It makes the assumption that write-behind data is not typically
:>     accessed by the program immediately after being written -- an assumption
:>     that winds up being incorrect in the DBM case you tested and resulting
:>     in stalls due to the buffer / VM pages being locked during the write I/O.
:>     The stalls are *not* due to the I/O itself but instead are due to side
:>     effects of the I/O being in-progress.
:
:And that sounds a heck of a lot like what those of us who have been
:running INN news swervers with 1,1GB size text history files on 2.whazzit
:(now dead, may it rest in pieces widely-scattered) and later have seen.
:
:You should have forgotten that a couple months or so ago, I wrote to
:one of these lists to ask why I was getting only about 50-70%
:availability as my 1.5+MD5-based-dbz innd was stuck in ufslck2 during
:these every-30-seconds syncs.  The .hash and .index files from this,
:which are comparable to the dbm (dbz) files, are typically 125MB and
:85MB or so; this was under 3.4-STABLE.
:
:Well, I've meant to get around to trying 4.0 on it, and Real Soon Now
:I will, but I wanted to relate my experiences in turning traitor, a
:heretic who has left the fold, deserving to be ridden out of town on
:a rail and stuff, which sounds like a lot of fun.  I tried NetBSD.
:
:NetBSD (at least the development now 1.4V version) has trickle
:syncing, which seems to work quite well when having to cope with
:these rather large database files, keeping a full 14 days of message
:IDs from a full news feed.

    Personally speaking I agree with you in regards to the syncer code.
    I don't have time to fix it, though I suspect it would not be 
    difficult.  Trickle syncing is an inherently easy thing to do.

    Kirk and I have both had serious trouble with the syncer daemon not
    being able to smooth out write I/O's due to it fsync'ing whole files
    all in one go.  The buffer daemon does a much better job which is why
    the speedup_syncer stuff is being slowly deprecated in favor of 
    bd_speedup().

    For INN there are several things you can tune in 4.0.  First and
    foremost you can try turning off the write-behind code, 
    sysctl -w vfs.write_behind=0.  Secondly you can mess around with 
    the vfs.hidirtybuffers sysctl (generally lower it) in order to
    force out dirty pages earlier and thus reduce the number that 
    fsync has to deal with.  I believe that INN also messes around with
    shared/R+W mmap()'s - it may be possible to add MAP_NOSYNC to those
    maps to turn off the 30 second fsync on pages dirtied through the VM
    system (for those maps), though this may increase the amount of stale
    (unwritten) data after a crash.
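
A minimal sketch of the MAP_NOSYNC suggestion, assuming a program that
maps its database file shared and read/write.  The function name
map_shared_nosync is hypothetical and is not taken from INN.
MAP_NOSYNC is FreeBSD-specific, so the sketch defines it to 0 on other
systems, where the mapping then behaves as an ordinary shared map.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* MAP_NOSYNC exists on FreeBSD; elsewhere make it a no-op flag. */
#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0
#endif

/* Map the first len bytes of path shared and read/write, asking the
 * kernel not to flush dirtied pages periodically.  The file is created
 * or grown to len bytes if it is shorter; it is never shrunk.
 * Returns NULL on failure. */
void *map_shared_nosync(const char *path, size_t len)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 ||
        (st.st_size < (off_t)len && ftruncate(fd, (off_t)len) != 0)) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_NOSYNC, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : p;
}
```

Because pages dirtied through such a map are no longer flushed by the
periodic sync, a program using it would presumably call msync() or
fsync() itself at the points where the data must reach the disk, which
is the trade-off with stale data after a crash noted above.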

:There.  I've confessed.  It feels really good.  Now have at me.
:
:Naturally, since I haven't followed this discussion closely, you may
:be talking about something completely different, but I did want to
:mention generally improved (yet not totally perfect) performance
:with huge INN database files and NetBSD's trickle syncing.  Now,
:go out and steal some k0deZ, okay?
:
:
:barry bouwsma, tele danMerika internet

							-Matt




From owner-freebsd-fs  Wed Mar 22 23:35:56 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from muzak.iinet.net.au (muzak.iinet.net.au [203.59.24.237])
	by hub.freebsd.org (Postfix) with ESMTP
	id 7822237B574; Wed, 22 Mar 2000 23:35:39 -0800 (PST)
	(envelope-from julian@elischer.org)
Received: from jules.elischer.org (reggae-09-79.nv.iinet.net.au [203.59.67.79])
	by muzak.iinet.net.au (8.8.5/8.8.5) with SMTP id PAA30777;
	Thu, 23 Mar 2000 15:35:26 +0800
Message-ID: <38D9B306.2781E494@elischer.org>
Date: Wed, 22 Mar 2000 23:34:11 -0800
From: Julian Elischer <julian@elischer.org>
X-Mailer: Mozilla 3.04Gold (X11; I; FreeBSD 5.0-CURRENT i386)
MIME-Version: 1.0
To: Mike Smith <msmith@freebsd.org>
Cc: Matthew Dillon <dillon@apollo.backplane.com>,
	Paul Richards <paul@originative.co.uk>,
	Richard Wendland <richard@netcraft.com>,
	Alfred Perlstein <bright@wintelcom.net>,
	Poul-Henning Kamp <phk@critter.freebsd.dk>, current@freebsd.org,
	fs@freebsd.org
Subject: Re: FreeBSD random I/O performance issues
References: <200003222039.MAA00661@mass.cdrom.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

This is one of the things that made us do so badly
in the benchmarks against NT/Linux last year.
OBVIOUSLY one should be able to re-read this information
without affecting a pending write.

Mike Smith wrote:
> 
> >     effects of the I/O being in-progress.  If a user program doesn't access
> >     any of the information it recently wrote the whole mechanism winds up
> >     operating asynchronously in the background.  If a user program does,
> >     then the write behind mechanism breaks down and you get a stall.
> 
> What makes no sense is that it should be perfectly ok to _read_ this
> information back.
> 


-- 
      __--_|\  Julian Elischer
     /       \ julian@elischer.org
    (   OZ    ) World tour 2000
---> X_.---._/  presently in:  Perth
            v




From owner-freebsd-fs  Thu Mar 23  1:51: 4 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from trinity.skynet.be (trinity.skynet.be [195.238.2.38])
	by hub.freebsd.org (Postfix) with ESMTP
	id A8A2A37C3E6; Thu, 23 Mar 2000 01:50:59 -0800 (PST)
	(envelope-from blk@skynet.be)
Received: from [195.238.1.121] (brad.techos.skynet.be [195.238.1.121])
	by trinity.skynet.be (Postfix) with ESMTP
	id 6DBEC1814B; Thu, 23 Mar 2000 10:50:34 +0100 (MET)
Mime-Version: 1.0
X-Sender: blk@pop.skynet.be
Message-Id: <v04220805b4ff94c7c0ec@[195.238.1.121]>
In-Reply-To:
 <Pine.BSF.3.96.1000323001655.85607G-100000@fLuFFy.iNt.tElE.dK>
References:
 <Pine.BSF.3.96.1000323001655.85607G-100000@fLuFFy.iNt.tElE.dK>
Date: Thu, 23 Mar 2000 10:34:42 +0100
To: FreeBSD-abusers@netscum.dk,
	Matthew Dillon <dillon@apollo.backplane.com>
From: Brad Knowles <blk@skynet.be>
Subject: Re: FreeBSD random I/O performance issues
Cc: current@FreeBSD.ORG, fs@FreeBSD.ORG
Content-Type: text/plain; charset="us-ascii" ; format="flowed"
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

At 1:17 AM +0100 2000/3/23, FREENIX IS OVERRATED wrote:

>  Without really tuning anything, after a bit of time, the time needed
>  to do history lookups drops to microseconds, and as long as a `sync'
>  isn't needed, innd doesn't get stuck.  Theoretically, a sync, where
>  you are in fact seeking rather wildly over the disk to update these
>  files, happens once a day at expiry.  Depending on the speed of the
>  drive (and I haven't optimized this at all, using a single drive for
>  OS, logs, history, and part of spool, with a second drive for the rest
>  of the spool, far from an ideal setup), this seems to mean only a
>  few minutes of downtime.  Actually building the new .index and .hash
>  files goes quite a bit faster, like by a factor of 3 to 4, so clearly
>  the update of these files during the `sync' could stand improved sorting.

	There are those of us running Diablo that solve this sort of 
problem on our main news peering servers by having the entire history 
file stored on a memory-based filesystem, so that we can sustain 
1000-2000 history lookups per second.

	Obviously, this solution is not scalable to news spool servers, 
because you can't afford to lose the history file for a month's worth 
of news, but the current mmap() based solution for the indexes of the 
history database seems to cause many more disk accesses than I would 
like to see.


	Perhaps this would be a good application for md?

--
   These are my opinions -- not to be taken as official Skynet policy
======================================================================
Brad Knowles, <blk@skynet.be>                || Belgacom Skynet SA/NV
Systems Architect, Mail/News/FTP/Proxy Admin || Rue Colonel Bourg, 124
Phone/Fax: +32-2-706.13.11/12.49             || B-1140 Brussels
http://www.skynet.be                         || Belgium




From owner-freebsd-fs  Fri Mar 24 19:11:57 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from io.dreamscape.com (io.dreamscape.com [206.64.128.6])
	by hub.freebsd.org (Postfix) with ESMTP id 2309637B6F6
	for <freebsd-fs@FreeBSD.ORG>; Fri, 24 Mar 2000 19:11:38 -0800 (PST)
	(envelope-from krentel@dreamscape.com)
Received: from dreamscape.com (sA20-p50.dreamscape.com [209.217.200.242])
          by io.dreamscape.com (8.9.3/8.8.4) with ESMTP
	  id WAA25887; Fri, 24 Mar 2000 22:10:45 -0500 (EST)
X-Dreamscape-Track-A: sA20-p50.dreamscape.com [209.217.200.242]
X-Dreamscape-Track-B: Fri, 24 Mar 2000 22:10:45 -0500 (EST)
Received: (from krentel@localhost)
	by dreamscape.com (8.9.3/8.9.3) id WAA00537;
	Fri, 24 Mar 2000 22:10:50 -0500 (EST)
	(envelope-from krentel)
Date: Fri, 24 Mar 2000 22:10:50 -0500 (EST)
From: "Mark W. Krentel" <krentel@dreamscape.com>
Message-Id: <200003250310.WAA00537@dreamscape.com>
To: freebsd-fs@FreeBSD.ORG
Subject: ext2fs optional features
Cc: kwc@world.std.com
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

This question was asked in -stable a couple days ago, but it really
belongs in -fs.

Recently, some changes were made to the ext2fs support that prohibit
R/W mounts for some newer ext2fs partitions with optional features.
I've seen this with Red Hat 6.1 and Slackware 7.  Red Hat 6.0 seems to
use an older format.

This is what Linux's tune2fs reports:

   # tune2fs -l /dev/sdb2 
   tune2fs 1.15, 18-Jul-1999 for EXT2 FS 0.5b, 95/08/09
   Filesystem volume name:   <none>
   Last mounted on:          <not available>
   Filesystem UUID:          38a27662-0012-11d4-8f7a-ead76bc87798
   Filesystem magic number:  0xEF53
   Filesystem revision #:    1 (dynamic)
   Filesystem features:      sparse_super
   Filesystem state:         not clean
   Errors behavior:          Continue
   Filesystem OS type:       Linux
   ...

And this is what appears in the logs:

   Mar 24 21:36:47 blue /kernel: WARNING: R/W mount of dev 0x3040a 
   denied due to unsupported optional features

What are the optional features?  What does "sparse_super" do?
Does Linux actually use these features, or are they for future use?

Is it possible to support R/W mounts with these features?

I remember 3.4-release let me mount the same filesystem R/W.  Was I
unknowingly corrupting the filesystem, or running some risk of a panic?

I noticed that tune2fs also reported:

   Block size:               4096
   Fragment size:            4096

Does Linux really not support fragments??  I was stunned.

Much thanks for any answers.

--Mark Krentel




From owner-freebsd-fs  Sat Mar 25  1:48:20 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from mail.cs.tu-berlin.de (mail.cs.tu-berlin.de [130.149.17.13])
	by hub.freebsd.org (Postfix) with ESMTP id 3F5F537B6D1
	for <freebsd-fs@FreeBSD.ORG>; Sat, 25 Mar 2000 01:48:11 -0800 (PST)
	(envelope-from loewis@cs.tu-berlin.de)
Received: from rubel.cs.tu-berlin.de (loewis@rubel.cs.tu-berlin.de [130.149.20.46])
	by mail.cs.tu-berlin.de (8.9.3/8.9.3) with ESMTP id KAA00572;
	Sat, 25 Mar 2000 10:45:01 +0100 (MET)
Received: (from loewis@localhost)
	by rubel.cs.tu-berlin.de (8.9.3/8.9.3) id KAA29526;
	Sat, 25 Mar 2000 10:44:56 +0100 (MET)
Date: Sat, 25 Mar 2000 10:44:56 +0100 (MET)
Message-Id: <200003250944.KAA29526@rubel.cs.tu-berlin.de>
X-Authentication-Warning: rubel.cs.tu-berlin.de: loewis set sender to loewis@cs.tu-berlin.de using -f
From: "Martin v.Loewis" <loewis@cs.tu-berlin.de>
To: krentel@dreamscape.com
Cc: freebsd-fs@FreeBSD.ORG, kwc@world.std.com
In-reply-to: <200003250310.WAA00537@dreamscape.com> (krentel@dreamscape.com)
Subject: Re: ext2fs optional features
References:  <200003250310.WAA00537@dreamscape.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

   What are the optional features?  What does "sparse_super" do?
   Does Linux actually use these features, or are they for future use?

Ext2 has three feature sets: compatible features, r/o compatible
features, and incompatible features. If an ext2 implementation sees a
volume that has a feature it does not recognize, it should act
accordingly: If the feature is compatible, go ahead and mount the
volume. If the feature is r/o compatible, refuse to mount r/w. If the
feature is incompatible, refuse to mount at all.

Currently (e2fstools 1.18), the following features are defined

#define EXT2_FEATURE_COMPAT_DIR_PREALLOC	0x0001
#define EXT2_FEATURE_COMPAT_IMAGIC_INODES		0x0002
#define EXT3_FEATURE_COMPAT_HAS_JOURNAL		0x0004

#define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
#define EXT2_FEATURE_RO_COMPAT_LARGE_FILE	0x0002
#define EXT2_FEATURE_RO_COMPAT_BTREE_DIR	0x0004

#define EXT2_FEATURE_INCOMPAT_COMPRESSION	0x0001
#define EXT2_FEATURE_INCOMPAT_FILETYPE	0x0002
#define EXT3_FEATURE_INCOMPAT_RECOVER	0x0004

The sparse_super option means that not every block group has a super
block; only block group 0 and the groups whose numbers are powers of
3, 5, or 7 have one. The feature is ro-compatible, since an
implementation unaware of it can still mount the file system when it
finds a valid super block; it is not compatible, since such an
implementation will overwrite data when it attempts to write back the
super blocks into groups where none belong.
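
The group rule just described could be sketched in C as follows.  The
function names are illustrative and not taken from any ext2
implementation.  The sketch treats 1 as a power (3^0), so block group 1
is reported as holding a super block; that reading is an assumption of
the sketch.

```c
#include <stdbool.h>

/* True if n equals base^k for some k >= 0, so 1 counts as a power. */
static bool is_power_of(unsigned long n, unsigned long base)
{
    if (n == 0)
        return false;
    while (n % base == 0)
        n /= base;
    return n == 1;
}

/* True if, under sparse_super, the given block group carries a copy
 * of the super block: group 0 and powers of 3, 5, or 7. */
bool group_has_superblock(unsigned long group)
{
    if (group == 0)
        return true;
    return is_power_of(group, 3) ||
           is_power_of(group, 5) ||
           is_power_of(group, 7);
}
```

With this rule groups 0, 1, 3, 5, 7, 9, 25, 27 and 49 hold a copy,
while groups such as 2, 4, 6 or 15 do not.  An implementation that
writes a super block into every group would overwrite data in the
latter.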

Of the features above, Linux 2.3.99pre2 supports the following ones:

#define EXT2_FEATURE_COMPAT_SUPP	0
#define EXT2_FEATURE_INCOMPAT_SUPP	EXT2_FEATURE_INCOMPAT_FILETYPE
#define EXT2_FEATURE_RO_COMPAT_SUPP	(EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
					 EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
					 EXT2_FEATURE_RO_COMPAT_BTREE_DIR)
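
As a rough sketch, the three-way rule described at the top of this
message can be written as a check of the volume's feature masks
against the masks an implementation supports.  The macro values are
the ones listed in this message; struct ext2_features, enum mount_mode
and ext2_allowed_mode are hypothetical names for illustration, not code
from Linux or FreeBSD.

```c
#include <stdint.h>

#define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
#define EXT2_FEATURE_RO_COMPAT_LARGE_FILE   0x0002
#define EXT2_FEATURE_RO_COMPAT_BTREE_DIR    0x0004
#define EXT2_FEATURE_INCOMPAT_FILETYPE      0x0002

/* Support masks as quoted above for Linux 2.3.99pre2. */
#define EXT2_FEATURE_INCOMPAT_SUPP  EXT2_FEATURE_INCOMPAT_FILETYPE
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER | \
                                     EXT2_FEATURE_RO_COMPAT_LARGE_FILE | \
                                     EXT2_FEATURE_RO_COMPAT_BTREE_DIR)

enum mount_mode { MOUNT_REFUSE, MOUNT_READ_ONLY, MOUNT_READ_WRITE };

/* Feature masks as found on a volume (illustrative layout). */
struct ext2_features {
    uint32_t compat;
    uint32_t ro_compat;
    uint32_t incompat;
};

/* Decide the strongest mount mode allowed for a volume, given the
 * ro-compatible and incompatible features the implementation knows. */
enum mount_mode ext2_allowed_mode(const struct ext2_features *fs,
                                  uint32_t ro_compat_supp,
                                  uint32_t incompat_supp)
{
    if (fs->incompat & ~incompat_supp)
        return MOUNT_REFUSE;        /* unknown incompatible feature */
    if (fs->ro_compat & ~ro_compat_supp)
        return MOUNT_READ_ONLY;     /* unknown r/o compatible feature */
    return MOUNT_READ_WRITE;        /* unknown compat features are harmless */
}
```

For a volume whose only feature is sparse_super, this returns
MOUNT_READ_WRITE with the Linux masks above and MOUNT_READ_ONLY for an
implementation that supports no ro-compatible features at all.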

Whether these features are activated on a certain installation
primarily depends on the default settings that the distributor
(RedHat, Debian, ...) has selected.

Regards,
Martin




From owner-freebsd-fs  Sat Mar 25  2:27:58 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16])
	by hub.freebsd.org (Postfix) with ESMTP id 96E9037B61A
	for <freebsd-fs@FreeBSD.ORG>; Sat, 25 Mar 2000 02:27:54 -0800 (PST)
	(envelope-from bde@zeta.org.au)
Received: from bde.zeta.org.au (bde.zeta.org.au [203.2.228.102])
	by mailman.zeta.org.au (8.8.7/8.8.7) with ESMTP id VAA14626;
	Sat, 25 Mar 2000 21:35:40 +1100
Date: Sat, 25 Mar 2000 21:27:28 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-Sender: bde@alphplex.bde.org
To: "Mark W. Krentel" <krentel@dreamscape.com>
Cc: freebsd-fs@FreeBSD.ORG, kwc@world.std.com
Subject: Re: ext2fs optional features
In-Reply-To: <200003250310.WAA00537@dreamscape.com>
Message-ID: <Pine.BSF.4.21.0003251849470.522-100000@alphplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Fri, 24 Mar 2000, Mark W. Krentel wrote:

> ...
> And this is what appears in the logs:
> 
>    Mar 24 21:36:47 blue /kernel: WARNING: R/W mount of dev 0x3040a 
>    denied due to unsupported optional features
> 
> What are the optional features?  What does "sparse_super" do?

They are extensions that modify the filesystem format.  I don't know
exactly what "sparse_super" does.  FreeBSD's ext2fs knows even less.

> Does Linux actually use these features, or are they for future use?

Linux has supported the ext2fs "filetype" and "sparse_super" features
for several years.  Otherwise, they wouldn't be the default for the
current version of mkfs.ext2fs.

> Is it possible to support R/W mounts with these features?

Everything is possible in software :-).

> I remember 3.4-release let me mount the same filesystem R/W.  Was I

That was a bug in 3.4 :-).

> unknowingly corrupting the filesystem, or running some risk of a panic?

The "filetype" extension caused panics.  I don't know what the "sparse_super"
extension caused.

> I noticed that tune2fs also reported:
> 
>    Block size:               4096
>    Fragment size:            4096
> 
> Does Linux really not support fragments??  I was stunned.

Fragments are a dubious feature.  They were more useful when 100MB disks
were large.

Bruce




From owner-freebsd-fs  Sat Mar 25 15:38:13 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from io.dreamscape.com (io.dreamscape.com [206.64.128.6])
	by hub.freebsd.org (Postfix) with ESMTP id 9E37037B533
	for <freebsd-fs@FreeBSD.ORG>; Sat, 25 Mar 2000 15:38:10 -0800 (PST)
	(envelope-from krentel@dreamscape.com)
Received: from dreamscape.com (sA19-p23.dreamscape.com [209.217.200.86])
          by io.dreamscape.com (8.9.3/8.8.4) with ESMTP
	  id SAA15662; Sat, 25 Mar 2000 18:37:20 -0500 (EST)
X-Dreamscape-Track-A: sA19-p23.dreamscape.com [209.217.200.86]
X-Dreamscape-Track-B: Sat, 25 Mar 2000 18:37:20 -0500 (EST)
Received: (from krentel@localhost)
	by dreamscape.com (8.9.3/8.9.3) id SAA05240;
	Sat, 25 Mar 2000 18:37:23 -0500 (EST)
	(envelope-from krentel)
Date: Sat, 25 Mar 2000 18:37:23 -0500 (EST)
From: "Mark W. Krentel" <krentel@dreamscape.com>
Message-Id: <200003252337.SAA05240@dreamscape.com>
To: freebsd-fs@FreeBSD.ORG
Subject: Re: ext2fs optional features
Cc: bde@zeta.org.au, kwc@world.std.com, loewis@cs.tu-berlin.de
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > Is it possible to support R/W mounts with these features?
> Everything is possible in software :-).

I guess I was really asking if some FreeBSD developer was working
on supporting some of these features so that the mounts can be R/W
legitimately.  I'd offer to help, but it would only slow you down. :-)

> Currently (e2fstools 1.18), the following features are defined

What is e2fstools?  Is this a Linux package?

Lastly, does anyone know what will happen with ext3fs?  Will FreeBSD
be able to read or write it?

--Mark Krentel

