Date:      Fri, 9 Oct 1998 22:02:44 -0700 (PDT)
From:      dan@math.berkeley.edu (Dan Strick)
To:        tlambert@primenet.com
Cc:        dan@math.berkeley.edu, freebsd-smp@FreeBSD.ORG
Subject:   Re: hw platform Q - what's a good smp choice these days?
Message-ID:  <199810100502.WAA17750@math.berkeley.edu>

> We can argue about whether the FS code should be reading mode page 2
> and acting with the physical geometry in mind in order to minimize
> actual seeks, and that FreeBSD's imaginary 4M cylinder is broken. 

I suspect we would both agree that teaching the FS code and the driver
code to optimize for the actual disk geometry would be so painful and
perhaps computationally expensive as to be not worth the effort.
It is probably adequate to model a disk as a sequence of blocks
with, on average, a much larger latency between nonconsecutive
blocks than between consecutive ones.  It may also be true that on
average the latency increases with the difference in block numbers,
but the actual function is so jagged that this approximation is of
uncertain value.
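
For concreteness, the sort of crude model I have in mind looks like
the little sketch below.  The constants are invented purely for
illustration; they do not describe any real drive.

    /*
     * Crude block-distance latency model.  The constants are invented
     * for illustration only and do not describe any real drive.
     */
    #include <stdio.h>

    /*
     * Estimated latency (microseconds) to move from the end of one
     * request to the start of the next, given the gap in block numbers
     * (gap == 0 means the next request is contiguous).
     */
    static double
    est_latency_us(long block_gap)
    {
        if (block_gap == 0)
            return (0.0);       /* contiguous: no extra latency */
        /*
         * Nonconsecutive: assume a fixed seek-plus-rotation cost that
         * grows weakly with distance.  The real function is far more
         * jagged than this.
         */
        return (2000.0 + 0.01 * (double)block_gap);
    }

    int
    main(void)
    {
        long gaps[] = { 0, 1, 1000, 1000000 };
        int i;

        for (i = 0; i < 4; i++)
            printf("gap %8ld blocks -> ~%.0f us\n",
                gaps[i], est_latency_us(gaps[i]));
        return (0);
    }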

I tend to divide disk activity into several categories which must
be optimized separately.  The first category is randomly located
I/O requests separated by large disk latencies.  In this case,
I/O reordering is useful and the per command SCSI latencies are
so small that they fit unnoticed within the large disk latencies.
There is no practical difference between reads and writes.
Simultaneously executed SCSI commands are not very useful.
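
The reordering I have in mind is nothing fancier than sorting the
pending requests by starting block before issuing them, roughly like
the toy sketch below (the kernel's disksort logic is of course more
careful than this):

    /*
     * Toy request reordering: sort pending requests by starting block
     * so the heads sweep in one direction.  Illustration only.
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct req {
        long blkno;     /* starting block */
        int  nblks;     /* length in blocks */
    };

    static int
    cmp_blkno(const void *a, const void *b)
    {
        const struct req *ra = a, *rb = b;

        return ((ra->blkno > rb->blkno) - (ra->blkno < rb->blkno));
    }

    int
    main(void)
    {
        struct req q[] = { {900, 16}, {10, 8}, {400, 32}, {15, 8} };
        int i, n = sizeof(q) / sizeof(q[0]);

        qsort(q, n, sizeof(q[0]), cmp_blkno);
        for (i = 0; i < n; i++)
            printf("issue block %ld, %d blocks\n", q[i].blkno, q[i].nblks);
        return (0);
    }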

The second category is highly localized disk I/O for mostly
noncontiguous chunks.  The third category is contiguous disk I/O.
Both of these categories have read and write cases.  Since modern
drives do speculative read ahead, the read cases behave similarly,
but the write cases are different.  (Modern drives may also
be capable of write-behind (i.e. cached writes) but they had better
not do it with my valued data.)

> The clustering code and locality of reference will generally ensure
> that data locality for a given process is relatively high; this,
> combined with the fact that most modern SCSI drives inverse-order
> the blocks and start reading immediately after the seek (effectively
> read-caching the high locality data) will definitely shorten the
> increment.
> 
> Also, let me once again emphasize that the improvement is due to
> interleaved I/O, which gets rid of the aggregate latency, replacing
> it with a single latency.

You have lost me here.  I must not understand the "aggregate latency"
to which you refer.  If we execute our I/O commands serially, we can
divide the time into SCSI command latency (command processing overhead),
disk latency (waiting for the heads to reach the data), disk data transfer
time (between the heads and the drive data buffer), and DMA (between the
drive and system main memory through the SCSI and PCI busses).
A smart drive might overlap some of the disk data transfer with DMA
and in the case of highly localized disk reads it might effectively
overlap disk latency with disk data transfer by doing speculative
read ahead.  I don't understand what you mean by "interleaved I/O"
or how this relates to the I/O sub-activities I have listed above.
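
To be concrete about the decomposition: for a single command executed
by itself, the service time is just the sum of those pieces.  A trivial
sketch with placeholder numbers (not measurements of anything):

    /*
     * Serial service time for one command is just the sum of the pieces
     * named above.  The numbers are placeholders, not measurements.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double cmd_overhead_us = 500.0;   /* SCSI command processing */
        double disk_latency_us = 8000.0;  /* seek + rotational wait  */
        double disk_xfer_us = 800.0;      /* heads <-> drive buffer  */
        double dma_us = 70.0;             /* drive <-> main memory   */

        printf("serial service time: %.0f us\n",
            cmd_overhead_us + disk_latency_us + disk_xfer_us + dma_us);
        return (0);
    }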

> > The one big advantage of tagged drivers is the possibility that disk
> > activity could overlap DMA, but this of course depends on the smarts of
> > the particular disk drive and the SCSI host adapter and it only matters
> > if the disk latencies are so small that disk revs would be lost
> > otherwise.  (It is hard to draw a picture of this in ascii.) Even in this
> > case, a smart driver that does 2 simultaneous SCSI commands might do
> > as well as one that does 64.
> 
> If there is an intervening seek, yes.  But in general, the number
> of sectors per cylinder has increased, not decreased, over time.

Actually, I was visualizing disk writes to consecutive or nearly
consecutive sectors with no intervening seeks or head switches at all.
I was also visualizing the disk sectors written in order of increasing
sector number so that the disk could be kept continuously busy
providing that DMA is always completed before the next sector to be
written comes underneath the heads.  In this case, it could be very
useful to begin DMA for the next write command before the current
write command is complete.  Two simultaneous SCSI I/O commands might
be sufficient.
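
In other words, a queue depth of two should be enough to hide the setup
of the next write behind the media transfer of the current one.  A toy
sketch of what I mean (issue_write() and wait_one() are made-up
placeholders, not any real driver interface):

    /*
     * Keep two writes in flight so the drive never waits for the host.
     * issue_write() and wait_one() are made-up placeholders, not any
     * real driver interface.
     */
    #include <stdio.h>

    static void
    issue_write(long blkno)
    {
        printf("issue write at block %ld\n", blkno);
    }

    static void
    wait_one(void)
    {
        printf("one write completed\n");
    }

    int
    main(void)
    {
        long blkno = 0;
        int i, inflight = 0;

        for (i = 0; i < 6; i++) {
            if (inflight == 2) {    /* queue depth 2 is enough here */
                wait_one();
                inflight--;
            }
            issue_write(blkno);
            blkno += 16;            /* next 8 KB chunk (16 blocks)  */
            inflight++;
        }
        while (inflight-- > 0)
            wait_one();
        return (0);
    }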

Reversing the sector order in the track changes everything.  Without
detailed knowledge of the actual disk geometry, the only obvious
tactic is to issue large writes (by merging I/O requests).  It doesn't
much matter if you do this with a single SCSI command or a bunch of
simultaneous SCSI commands.  If you do it with a single command, you
have reason to hope that even a dumb drive will do all the sectors in
a single track in a single rev and you are certain to eliminate some
of the per-command overhead though you will also force all of the
merged I/O requests to wait until the last is done.  If you do it
with multiple SCSI commands, you might benefit from early completion
of some of the commands.  On the other hand, the drive might choose
to do the I/O inefficiently.
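
By "merging I/O requests" I mean nothing more elaborate than collapsing
adjacent requests into one bigger command, along the lines of the sketch
below (the kernel's clustering code naturally does much more than this):

    /*
     * Merge contiguous requests into larger commands.  The requests are
     * assumed to be sorted by block number already.  Illustration only.
     */
    #include <stdio.h>

    struct req {
        long blkno;
        int  nblks;
    };

    int
    main(void)
    {
        struct req q[] = { {100, 16}, {116, 16}, {132, 16}, {200, 16} };
        int i, n = sizeof(q) / sizeof(q[0]);
        long start = q[0].blkno;
        int len = q[0].nblks;

        for (i = 1; i < n; i++) {
            if (q[i].blkno == start + len) {
                len += q[i].nblks;  /* contiguous: extend the command */
            } else {
                printf("write block %ld, %d blocks\n", start, len);
                start = q[i].blkno;
                len = q[i].nblks;
            }
        }
        printf("write block %ld, %d blocks\n", start, len);
        return (0);
    }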

> We can also state that a process waiting for I/O will be involuntarily
> context switched, in favor of another process, and that the pool
> retention time that we are really interested, in terms of determining
> overall data transfer rates, is based on the transfer to user space, 
> not merely the transfer from the controller into system memory.  As
> before, this greatly amplifies the effects of serialized I/O, hence
> my initial steep slope for my "stair-step" diagram.

I think you are saying that the process of transferring data between
wherever the device controller accesses it and the running program's
virtual memory is something else that can be overlapped with the other
I/O activities if only we are doing enough different things at once.
I would guess that this transfer process takes place at least at
main memory speeds, something on the order of 10 times the raw disk
data transfer rate.  I suspect the memory transfer latency can almost
be ignored.  (I also don't understand the significance of the context
switch.  Perhaps I don't understand something important about the
PC I/O system.)
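
A quick back-of-the-envelope comparison, with rates assumed only for
illustration: copying 8 KB at roughly main-memory speed is small next
to the raw disk transfer time.

    /*
     * Rough comparison of the copy to user space against the raw disk
     * transfer for an 8 KB request.  The rates are assumptions, not
     * measurements of any particular machine.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double kb = 8.0;
        double disk_mbs = 10.0;     /* raw disk transfer rate          */
        double copy_mbs = 100.0;    /* memory-to-memory copy, roughly  */

        printf("raw disk transfer: %.0f us\n", kb * 1000.0 / disk_mbs);
        printf("copy to user:      %.0f us\n", kb * 1000.0 / copy_mbs);
        return (0);
    }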

> > This also applies to the special case of doing contiguous disk reads
> > from a drive that does substantial read-ahead.  There is no lost-rev
> > issue, but overlapping DMA with something else is possible.  In this
> > case also, 2 simultaneous SCSI commands are probably as good as 64.
> > Performance improvements over the smart untagged driver cannot possibly
> > exceed a factor of two.
> 
> I can't really parse this, but if (1) the commands are overlapped,
> and (2) operating against read-ahead cache on the disk itself,
> then I can't see how more commands don't equal more performance,

The bottleneck will probably be the raw disk (perhaps 10 MB/sec).
The SCSI bus will probably be much faster and DMA will be
much faster still.  Even executing only one SCSI command at
a time, all this additional activity and miscellaneous SCSI
command overhead, even if serialized, will mostly overlap the
raw disk transfer time.

Example (one 8 KB transfer):

    SCSI bus @ 20 MB/sec:    400 us        raw disk @ 10 MB/sec:   800 us
    PCI DMA @ 120 MB/sec:     70 us
    SCSI command overhead:   500 us
    -------------------------------        ------------------------------
    total:                   970 us                                800 us

Note: "SCSI command overhead" includes time spent in the SCSI driver.
Even so, it may be overstated.  It does not include time which would
overlap the SCSI bus transfer or the DMA (for the same SCSI command).
It is not clear how much if any of the DMA overlaps the SCSI bus transfer.
This overlap is not affected by using tagged SCSI commands.
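
For the record, the arithmetic behind those numbers is just this (the
table rounds to the nearest 10 us, and the 500 us of command overhead
is a guess, as noted):

    /*
     * Arithmetic behind the example above (one 8 KB transfer).  The
     * 500 us of command overhead is a guess, as noted in the text.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double kb = 8.0;
        double scsi_us = kb * 1000.0 / 20.0;    /* SCSI bus @ 20 MB/sec  */
        double dma_us = kb * 1000.0 / 120.0;    /* PCI DMA @ 120 MB/sec  */
        double ovhd_us = 500.0;                 /* SCSI command overhead */
        double disk_us = kb * 1000.0 / 10.0;    /* raw disk @ 10 MB/sec  */

        printf("SCSI bus:          %4.0f us\n", scsi_us);
        printf("PCI DMA:           %4.0f us\n", dma_us);
        printf("command overhead:  %4.0f us\n", ovhd_us);
        printf("total:             %4.0f us  (raw disk: %.0f us)\n",
            scsi_us + dma_us + ovhd_us, disk_us);
        return (0);
    }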

> in terms of linearly scaling.  I don't think that it's likely,
> unless the disk itself contains as many track buffers as some
> high fraction of the number of tagged commands it supports (in
> the ideal, 1:1), to achieve optimal benefit, but it's certainly
> unlikely to be as pessimal as taking a seek hit plus a rotational
> latency, which is what your "2" implies...

I don't think I mentioned seeks or rotational latencies in this case.
My model assumes the disk drive is sucking bits off the disk as fast
as they come and that the disk drive is passing those bits down the
SCSI bus to the host adapter as fast as it asks for them.  The raw
disk activity is basically read-ahead.  It cannot overlap itself.
It happens continuously, even if the SCSI read commands are executed
one at a time.

The only effect of queuing multiple simultaneous SCSI commands is to
possibly overlap the "SCSI command overhead" of one command with the
DMA and SCSI bus data transfer of other commands.  If complete overlap
were achieved, the larger would remain.  (Hence the limiting factor
of "2".)  In this case, the raw disk transfer rate limitation would
also remain, so for these numbers the best improvement would be a
factor of 970/800 (about 1.2).
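
Put as arithmetic, with the same assumed numbers:

    /*
     * The speedup bound as arithmetic: overlapping the host-side work
     * with itself cannot beat a factor of two, and the raw disk time
     * remains in any case, so the bound for these numbers is 970/800.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double serial_total_us = 970.0;     /* from the example above    */
        double raw_disk_us = 800.0;         /* cannot be overlapped away */

        printf("best possible speedup: %.2f\n",
            serial_total_us / raw_disk_us);
        return (0);
    }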

Dan Strick
dan@math.berkeley.edu



