Date:      Tue, 11 Feb 1997 15:44:04 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        Shimon@i-Connect.Net (Simon Shapiro)
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Raw I/O Question
Message-ID:  <199702112244.PAA29164@phaeton.artisoft.com>
In-Reply-To: <XFMail.970211141038.Shimon@i-Connect.Net> from "Simon Shapiro" at Feb 11, 97 01:38:03 pm

> Can someone take a moment and describe briefly the execution path of a
> lseek/read/write system call to a raw (character) SCSI partition?

You skipped a specification step: the FS layout on that partition.
I will assume FFS with 8k block size (the default).

I will also assume your lseek is absolute or relative to the start
of the file (no VOP_GETATTR needed to find the current end of the
file).

I will take a gross stab at this; clearly, I can't include everything,
and the Lite2 changes aren't reflected.  I expect I will be corrected
wherever I have erred.

lseek
	-> lseek syscall
	-> set offset in fdesc
	-> return( 0);

	(one could argue that there should be a VOP_LSEEK at
	 this point to allow for predictive read-ahead using
	 the lseek offset -- there is not)
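
	A hedged illustration of the point: the seek itself does no I/O, it
	only records the new offset in the descriptor, so it is essentially
	free until the following read or write.  The helper name is made up
	for the example:

    #include <sys/types.h>
    #include <unistd.h>

    /*
     * Position 'fd' at an absolute byte offset.  No VOP_GETATTR and no
     * disk access happens here; the offset is simply stored in the file
     * descriptor.  (SEEK_CUR is the relative form; SEEK_END would need
     * the file size.)
     */
    int
    seek_abs(int fd, off_t offset)
    {
            return (lseek(fd, offset, SEEK_SET) == (off_t)-1 ? -1 : 0);
    }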

read
	-> read syscall
	-> fill out uio struct
	-> call VOP_READ using bogus fileops struct dereference
	   which is there because named pipes and UNIX domain
	   sockets aren't in the VFS like they should be
	-> ffs_read (/sys/ufs/ufs/ufs_readwrite.c:READ)
	-> (called iteratively)
		bread
		-> getblk
		   (in cache?  No?)
		   -> vfs_busy_pages
		      VOP_STRATEGY
		      -> VCALL strategy routine for device vnode
		      -> spec_strategy (/sys/miscfs/specfs/spec_vnops.c)
		      -> call device strategy through bdevsw[]
		      -> generic scsi (scbus [bus interface]/sd [disk
			 interface])
		      -> actual controller requests
		      biowait
		uiomove
		-> copyout
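
	To stay on the cheap side of that path (one bread per FS block and
	no partial-block handling), issue reads in FS-block-size chunks at
	FS-block-aligned offsets.  A minimal sketch, assuming the default
	8k FFS block size; the helper name is made up:

    #include <sys/types.h>
    #include <unistd.h>

    #define FS_BSIZE        8192    /* assumed FFS block size (default) */

    /*
     * Read FS block number 'blkno' into 'buf' (at least FS_BSIZE bytes).
     * The offset and length match the FS block, so the request maps onto
     * exactly one bread() in the path above.
     */
    ssize_t
    read_fs_block(int fd, off_t blkno, char *buf)
    {
            if (lseek(fd, blkno * (off_t)FS_BSIZE, SEEK_SET) == (off_t)-1)
                    return (-1);
            return (read(fd, buf, FS_BSIZE));
    }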

write
	-> write syscall
	-> fill out uio struct
	-> call VOP_WRITE using bogus fileops struct dereference
	   which is there because named pipes and UNIX domain
	   sockets aren't in the VFS like they should be
	-> ffs_write (/sys/ufs/ufs/ufs_readwrite.c:WRITE)
	-> (called iteratively)
		(partial FS block? !!!CALL READ!!!)
		-> fill in modified areas of partial FS block
		   (uiomove)
		   -> copyin
		bwrite
	...
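
	A small sketch of the partial-block check the WRITE path makes
	above: if either end of the request falls inside an FS block, that
	block must be read (bread) before it can be modified and rewritten.
	The helper is hypothetical, just to show the arithmetic; the
	default 8k block size is assumed:

    #include <sys/types.h>

    #define FS_BSIZE        8192    /* assumed FFS block size (default) */

    /*
     * Returns non-zero if a write of 'len' bytes at byte offset 'offset'
     * covers only part of some FS block, forcing a read-modify-write.
     */
    static int
    needs_rmw(off_t offset, size_t len)
    {
            return ((offset % FS_BSIZE) != 0 || (len % FS_BSIZE) != 0);
    }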


> We are very interested in the most optimal, shortest path to I/O on
> a large number of disks.

o	Write in the FS block size, not the disk block size, to
	avoid causing a read before the write can be done (see
	the sketch after this list)
o	Do all I/O on FS block boundaries
o	Use the largest FS block size you can
o	Use CCD to implement striping
o	Use a controller that supports tagged command queueing
o	Use disk drives with track write caching (they use the
	rotational speed of the disk to power writes after a
	power failure, so writes can be immediately ack'ed even
	though they haven't really been written).
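
A minimal sketch of the first two points, assuming the default 8k FFS
block size; the function is made up for the example:

    #include <sys/types.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define FS_BSIZE        8192    /* assumed FFS block size (default) */

    /*
     * Write 'nblocks' full FS blocks starting at FS block 'startblk'.
     * Every transfer is FS-block-sized and starts on an FS block
     * boundary, so the kernel never has to read a block before
     * overwriting it.
     */
    int
    write_blocks(int fd, off_t startblk, int nblocks, int fill)
    {
            char    *buf;
            int     i;

            if ((buf = malloc(FS_BSIZE)) == NULL)
                    return (-1);
            memset(buf, fill, FS_BSIZE);
            if (lseek(fd, startblk * (off_t)FS_BSIZE, SEEK_SET) == (off_t)-1) {
                    free(buf);
                    return (-1);
            }
            for (i = 0; i < nblocks; i++) {
                    if (write(fd, buf, FS_BSIZE) != FS_BSIZE) {
                            free(buf);
                            return (-1);
                    }
            }
            free(buf);
            return (0);
    }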

> 
> We performed some measurements and see some results we would like to
> understand;
> 
> For example, we did READ and WRITE to random records in a block device.
> The test was run several times, each using a different block size
> (starting at 512 bytes and ending with 128KB).  All our measurements
> are in I/O Transfers/Sec.

DEFINITION:	Random reads/writes: "please remove any cache
		effects from my testing, I believe my app will
		be a cache-killer, so I don't want cache to
		count for anything because I have zero locality
		of reference".


> We see a depression in READ and WRITE performance, until block size
> reaches 2K. At this point performance picks up and levels off until
> block size reaches 8KB.  At this point it starts gradual, linear
> decline.

The FS block size is 8k.  The OS page size is 4k.  Your random access
pattern (zero locality of reference: a hard thing to find in the real
world) prevents the read-ahead from being invoked.

The best speed will be at the FS block size, since all reads and writes
are done in FS-block-size chunks, which are some multiple of the page
size (in general, assuming you want it to be fast).

The smaller your block size, the more data you have to read off of the
disk for your write.

The VM system has an 8 bit bitmap, one bit per 512b (physical disk
block) in a 4k (VM page size) page.  This bitmap is, unfortunately,
not used for read/write, or your aligned 512b blocks would not have
to actually read 4k of data off the disk to write the 512b you want
to write.
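
To make that concrete, here is a hedged sketch of the bookkeeping being
described: one bit per 512b disk block within a 4k page, so a 512b-aligned
write could in principle be recorded without pulling the other seven
sectors off the disk.  This illustrates the idea only; it is not the
kernel's actual data structure:

    #include <sys/types.h>

    #define DEV_BSIZE       512     /* physical disk block size */

    /*
     * Return an 8-bit mask with one bit set for each 512b block of a 4k
     * page touched by a write of 'len' bytes at page-relative offset
     * 'off'.  The request is assumed to lie entirely within the page.
     */
    static u_char
    page_block_mask(off_t off, size_t len)
    {
            int     first = off / DEV_BSIZE;
            int     last = (off + len - 1) / DEV_BSIZE;
            u_char  mask = 0;
            int     b;

            for (b = first; b <= last; b++)
                    mask |= 1 << b;
            return (mask);
    }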

The problem here is that you cannot ensure write fault granularity
to better than your page size.  The funny thing is, the i386 will
not write fault a modification from protected mode (kernel code),
so it has to fake this anyway -- so it's *possible* to fake it,
and it would, in general, be a win for writes on disk block boundaries
for some multiple of the disk block size (usually 512b for disks, 1k
for writeable optical media).

Talk to John Dyson about this optimization.  Don't expect an enthusiastic
response: real-world utilization is seldom well aligned... this is a
"benchmark optimization".

> What we see is a flat WRITE response until 2K.  then it starts a linear
> decline until it reaches 8K block size.  At this point it converges 
> with READ performance.  The initial WRITE performance, for small blocks
> is quite poor compared to READ.  We attribute it to the need to do
> read-modify-write when blocks are smaller than a certain ``natural''
> block size (page?).

Yes.  But the FS block size is 8k, not the page size (4k).

> Another attribute of performance loss, we think, is the
> lack of an O_SYNC option to the write(2) system call.  This forces the
> application to do an fsync after EVERY WRITE.  We have to do that for
> many good reasons.

There is an option O_WRITESYNC.  Use it, or fcntl() it on.  You will
still take a big hit for using it, however; the only overhead you save,
relative to calling fsync() after each write, is the system call
overhead of the extra protection domain crossing for the fsync() call.

Most likely, you do not really need this, or you are poorly implementing
the two-stage commit process typical of most modern database designs.
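
A hedged sketch of turning synchronous writes on for an already-open
descriptor, as suggested above.  In FreeBSD's <fcntl.h> the flag is
spelled O_FSYNC; substitute whatever your headers call the synchronous-
write flag (you can also just pass it to open(2)):

    #include <fcntl.h>

    /*
     * Make every subsequent write() on 'fd' block until the data has
     * been committed, instead of issuing a separate fsync() after each
     * write.
     */
    int
    set_sync_writes(int fd)
    {
            int     flags;

            if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
                    return (-1);
            return (fcntl(fd, F_SETFL, flags | O_FSYNC) == -1 ? -1 : 0);
    }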


> The READ performance is even more peculiar.  It starts higher than
> WRITE, declines rapidly until block size reaches 2K.  It peaks at 4K
> blocks and starts a linear decline from that point on (as block size 
> increases).

This is because of precache effects.  Your "random" reads are not
"random" enough to get rid of cache effects, it seems.  If they were,
the 4k numbers would be worse, and the peak would be the FS block size.


> We intend to use the RAW (character) device with the mpool buffering
> system and would like to understand its behavior without reading the
> WHOLE kernel source :-)

The VM and buffer cache have been unified.  bread/bwrite are, in fact,
page fault operations.  Again, talk to John Dyson about the bitmap
optimization for representing partially resident pages at this level;
otherwise, you *must* fault the data in and out on page boundaries,
and the fault will be in page groups of FS blocksize in size.

> We are very interested in the flow of control and flow of data.

Jorg, Julian, and the specific SCSI driver authors are probably
your best resource below the bdevsw[] layer.

>
> How do synchronous WRITE operations pass through?  We need this to
> guarantee transaction completion (commits)

They are written using a write operation which blocks until the data
has been committed, per the definition of O_WRITESYNC.

> There are several problems here we want to understand:
> 
> How does the system call logic transfer control to the SCSI layer?

See above.

> All we see is the construction of a struct buf and a call to
> scsi_scsi_cmd.  How is the SCSI FLUSH CACHE passed down?  We may need
> to trap it in the HBA driver, so the HBA can flush its buffers too.

I believe this is handled automatically in all but one case (there is
a debug sysctl in the ufs code for this case, actually).

> What block size I/O do we need so that we do not ever do
> read-modify-write?

FS block size -- 8k by default, different if you installed with
non-default values.
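
If you would rather not hard-code the 8k, the system will report the
preferred I/O size at run time.  A minimal sketch using fstat(2); the
helper name is made up.  st_blksize is meaningful for a file living on
the filesystem in question, and fstatfs(2)'s f_iosize reports the same
sort of figure for the mount as a whole:

    #include <sys/types.h>
    #include <sys/stat.h>

    /*
     * Return the preferred I/O size for 'fd', or -1 on error.  For an
     * FFS file this is the filesystem block size, 8k by default.
     */
    long
    fs_block_size(int fd)
    {
            struct stat sb;

            if (fstat(fd, &sb) == -1)
                    return (-1);
            return ((long)sb.st_blksize);
    }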


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


