Date:      Wed, 22 Mar 2000 00:22:42 +0000 (GMT)
From:      Richard Wendland <richard@netcraft.com>
To:        Paul Richards <paul@originative.co.uk>
Cc:        Alfred Perlstein <bright@wintelcom.net>, Poul-Henning Kamp <phk@critter.freebsd.dk>, Matthew Dillon <dillon@apollo.backplane.com>, current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject:   FreeBSD random I/O performance issues
Message-ID:  <200003220022.AAA28786@ns0.netcraft.com>
In-Reply-To: <38D6BBD7.DA4B950B@originative.co.uk> from Paul Richards at "Mar 21, 2000 00:01:27 am"

Paul Richards said in "Re: patches for test / review":

> Richard, do you want to post a summary of your tests?

Well, I'd best post the working draft of my report on the issues
I've seen, as I'm not going to have time to work on it in the near
future, and it raises serious performance problems that are best
looked at soon.  Note that none of these detailed results are from
-current, but Paul Richards has checked that the issues are still
present in -current.

There are still issues to be explored, so this report isn't complete
or polished.  It's grown in three stages:

- initial Berkeley DB (random I/O) performance problem analysis
- side-issue of ATA outperforming SCSI systems at my synthetic benchmark
- interesting, dramatic performance changes from moving the seek
  multiple and I/O block size one byte either way from 8192

Note I've cc'd freebsd-fs, as this raises issues in the filesystem
area.  I've also changed the subject since I think there are broader
issues here than the clustering algorithm, and this email is rather
large to drop into an ongoing discussion.

The benchmark program source code is available and easy to run;
links are at the bottom of the report.

I don't have an explanation for the behaviour I have been measuring,
but I hope these quite extensive results will enable someone to
explain and perhaps suggest improvements.

	Richard.


Folks,

I appear to have found a serious performance problem with random
access file I/O in FreeBSD, and have a simple C benchmark program
which reproducibly demonstrates it.  Since the benchmark demonstrates
very poor non-async performance, it touches on the age-old sync/async
filesystem argument and the FreeBSD vs Linux debates.

I originally observed this problem with perl DB_File (Berkeley DB),
and with the help of truss have synthesised this benchmark as a
much simplified model of heavy Berkeley DB update behaviour.  Quite
probably other database-like software will have similar performance
issues.

This issue appears to be related to the traditional BSD behaviour
of immediately scheduling full disc block writes.  I think this
benchmark must be showing up a related bug.  But it is conceivable
that this is intended noasync behaviour, in which case the implications
need to be thought through.

The program does simple random I/O within a 64KB file, which I would
hope is fully cached, so that hardly any real I/O should be done.
Other than mtime, this program makes no file meta-data or directory
changes, and the file remains the same size.

The file is used as eight 8KB blocks, and 10,000 lseek/read/lseek/write
block updates are done, visiting the blocks in the order
0,5,2,7,4,1,6,3,0,...  This is much like updating 10,000 non-localised
Berkeley DB file records.
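
For concreteness, here is a rough sketch of the kind of loop involved.
It is not the actual seekreadwrite.c (linked at the end of this mail),
and the file name, block order table and other internal details are
just illustrative; but the real program does take the same BLOCKSIZE
and WRITESIZE compile-time defines used in the unaligned-I/O tests
further down.

	/*
	 * Rough sketch of the benchmark loop; NOT the real seekreadwrite.c.
	 * File name, defaults and internal details are illustrative only.
	 */
	#include <sys/types.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#ifndef BLOCKSIZE
	#define BLOCKSIZE 8192		/* seek multiple */
	#endif
	#ifndef WRITESIZE
	#define WRITESIZE 8192		/* bytes read/written per update */
	#endif
	#define NBLOCKS   8		/* 8 x 8KB = 64KB file */
	#define NUPDATES  10000

	int
	main(void)
	{
		static const int order[NBLOCKS] = { 0, 5, 2, 7, 4, 1, 6, 3 };
		static char buf[WRITESIZE];
		off_t off;
		int fd, i;

		fd = open("testfile", O_RDWR | O_CREAT, 0644);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		/* Pre-size the file so its length doesn't change during the run. */
		if (ftruncate(fd, (off_t)NBLOCKS * BLOCKSIZE) < 0) {
			perror("ftruncate");
			exit(1);
		}

		for (i = 0; i < NUPDATES; i++) {
			off = (off_t)order[i % NBLOCKS] * BLOCKSIZE;
			lseek(fd, off, SEEK_SET);	/* seek to the block */
			read(fd, buf, WRITESIZE);	/* read it */
			lseek(fd, off, SEEK_SET);	/* seek back */
			write(fd, buf, WRITESIZE);	/* rewrite it in place */
		}
		close(fd);
		return 0;
	}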

Using a tiny 64KB file is just to simplify and make a point.  My
original perl performance problems were with multi-megabyte files,
but still small enough to be fully cached.

I ran this on a large range of lightly loaded or idle machines,
which gave reproducible results.  Results and a summary of the
machines, which unless otherwise noted use SCSI 7200 RPM discs and
Adaptec controllers, are given in descending performance order
below.


  OS						Elapse secs, system

  FreeBSD 3.2-RELEASE, async mount		<1  (cheap ATA C433, 5400 RPM)
  Linux 2.2.13					<1  (Dell 1300, PIII 450MHz)
  Linux 2.0.36					3   (old ATA P200, 5400 RPM)
  Linux 2.0.36, sync [meta-data] mount		3   (old ATA P200, 5400 RPM)
  SunOS 5.5.1 (Solaris 2.5.1)			7   (old SS4/110, 5400 RPM)
  FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=5	15  (PII 450MHz, 512MB, 10k RPM)
  FreeBSD 2.2.7-RELEASE+CAM			21  (PII 400MHz, 512MB)
  FreeBSD 2.1.6.1-RELEASE			32  (old P100, 64MB)
  FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=2	39  (PII 400MHz, 512MB)
  FreeBSD 3.4-STABLE, vinum stripe+mirr=4	41  (dual PIII 500MHz, 1GB)
  FreeBSD 3.4-STABLE				41  (dual PIII 500MHz, 1GB)
  FreeBSD 2.1.6.1-RELEASE, ccd stripe=2		52  (old P100, 64MB)
  FreeBSD 3.3-RELEASE, ccd stripe=2		53  (Dell 1300, PIII 450MHz)
  FreeBSD 3.2-RELEASE				55  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.2-RELEASE, noatime mount		55  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.2-RELEASE, noclusterr mount		55  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.2-RELEASE, noclusterw mount		58  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.3-RELEASE				63  (Dell 1300, PIII 450MHz)
  FreeBSD 3.3-RELEASE, softupdates		63  (Dell 1300, PIII 450MHz)
  FreeBSD 3.2-RELEASE, sync mount		105 (cheap ATA C433, 5400 RPM)


I also have a range of results from a cheap deskside ATA (IDE) Dell
system running FreeBSD 3.3-RELEASE, with a range of wd(4) flags.
This system exhibits much better performance on this benchmark than
the SCSI systems above, perhaps related to better DMA ability.

ATA being faster than SCSI on this benchmark is a bit of a side-issue
to the thrust of this report, but the performance numbers may give
hints for diagnosing the problem.

    Dell Dimension XPS T450 440BX
    IBM-DPTA-372730 (Deskstar 34GXP, 7200RPM, 2MB buffer)
    default mount options

	wd(4) flags				Elapse secs

	0x0000					19
	0x00ff, multi-sector transfer mode	17
	0x8000, 32bit transfers			13
	0x2000, bus-mastering DMA		4
	0xa0ff, BM-DMA+32bit+multi-sector	4


Note that Linux performs about the same for [meta-data] sync &
async mounts, which is as I'd expect for this program.  But FreeBSD
performance is hugely affected by async, sync or default (meta-data
sync) filesystem mounts, with noclusterw unsurprisingly making it
somewhat worse.

One interesting observation is that with mounts that are not sync,
async or noclusterw (i.e. default mounts), ~8750 I/O operations are
done, which is 7/8ths of the 10,000 writes.  If I change the program
to use 16 blocks there are ~9375 I/O operations, which is 15/16ths of
the 10,000 writes.  Guessing, this is as if writes are forced for all
blocks but one.

With async filesystem mounts very little I/O occurs, and with
noclusterw there are ~10,000 operations matching the number of
writes.

With sync it's ~20,000 operations, matching the total of reads &
writes.  This demonstrates another aspect of the bug: a sync mount
should only cause 10,000 operations, because the reads should still
be satisfied from the cache; clearly they aren't.
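
For reference, the I/O-operation counts quoted in this report are the
per-process block input/output counters that csh's time prints, and
they come from getrusage(2).  As a rough sketch, something like the
following could be called just before the benchmark exits to print
them directly; this is only a suggestion, not part of the actual
test program.

	#include <sys/types.h>
	#include <sys/time.h>
	#include <sys/resource.h>
	#include <stdio.h>

	/*
	 * Print this process's block I/O counts: ru_inblock/ru_oublock
	 * are the counters csh's time shows as the "in+out io" figures.
	 */
	void
	report_block_io(void)
	{
		struct rusage ru;

		if (getrusage(RUSAGE_SELF, &ru) == 0)
			printf("block in: %ld  block out: %ld\n",
			    ru.ru_inblock, ru.ru_oublock);
	}

	int
	main(void)
	{
		/* ... the lseek/read/lseek/write loop would run here ... */
		report_block_io();
		return 0;
	}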

A quick test suggests that softupdates makes no difference, as would
be expected.

Looking at mount output on FreeBSD 3, the substantial part of the
I/O is async in all cases other than sync mounts, as expected.


Another aspect of this issue is the effect of changing the seek
blocksize and write blocksize by 1 byte either way from 8192, thus
doing block-unaligned I/O.  In some cases this changes the amount of
I/O recorded by getrusage to zero, and drops elapsed time from half a
minute or so to less than 1 second.

Thanks to Paul Richards for noticing this.  I've not spent much time
researching this, so can only present my small set of measurements.
To do these tests you have to recompile my test program each time, e.g.

	gcc -O4 -DBLOCKSIZE=8191 -DWRITESIZE=8193 seekreadwrite.c

Sorry it's that crude.  These results are from a FreeBSD
2.2.7-RELEASE+CAM, ccd stripe=2 (PII 400MHz, 512MB) system,
though exactly the same pattern is apparent with 3.4-STABLE.
"****" indicate sub-second "zero I/O" results.

BLOCKSIZE   WRITESIZE	csh 'time' output

8191	    8191	0.0u 1.5s 0:34.10 4.6% 5+186k 0+7500io 0pf+0w
8191	    8192	0.0u 1.3s 0:31.52 4.5% 5+178k 0+7500io 0pf+0w
8191	    8193	0.0u 1.4s 0:32.63 4.4% 5+189k 0+7500io 0pf+0w

8192	    8191	0.0u 0.7s 0:01.97 37.5% 8+178k 0+0io 0pf+0w    ****
8192	    8192	0.0u 1.3s 0:39.30 3.4% 7+196k 0+8750io 0pf+0w
8192	    8193	0.0u 1.3s 0:40.09 3.4% 5+187k 0+8750io 0pf+0w

8193	    8191	0.0u 1.4s 0:46.22 3.2% 5+192k 0+8750io 0pf+0w
8193	    8192	0.0u 1.6s 0:40.48 4.0% 5+182k 0+8750io 0pf+0w
8193	    8193	0.0u 1.5s 0:40.57 3.8% 5+175k 0+8750io 0pf+0w


8191	    4095	0.0u 1.2s 0:33.79 3.6% 5+193k 0+7500io 0pf+0w
8191	    4096	0.0u 1.2s 0:34.00 3.8% 5+190k 0+7500io 0pf+0w
8191	    4097	0.0u 1.1s 0:33.58 3.6% 4+165k 0+7500io 0pf+0w

8192	    4095	0.0u 0.5s 0:00.76 75.0% 5+189k 0+0io 0pf+0w    ****
8192	    4096	0.0u 0.5s 0:00.58 100.0% 5+183k 0+0io 0pf+0w   ****
8192	    4097	0.0u 0.5s 0:00.74 78.3% 5+181k 0+0io 0pf+0w    ****

8193	    4095	0.0u 0.6s 0:01.00 67.0% 5+177k 0+0io 0pf+0w    ****
8193	    4096	0.0u 0.6s 0:01.05 63.8% 5+179k 0+0io 0pf+0w    ****
8193	    4097	0.0u 0.6s 0:01.02 66.6% 5+183k 0+0io 0pf+0w    ****



Any views gratefully received.  A fix would be much better :-)

Test program source, including compile & run instructions, is
available at:

	http://www.netcraft.com/freebsd/random-IO/seekreadwrite.c

Detailed notes on the test system configurations are at:

	http://www.netcraft.com/freebsd/random-IO/results-notes.txt

Thanks,
	Richard
-
Richard Wendland				richard@netcraft.com

