Date:      Thu, 31 Aug 1995 14:10:55 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        rgrimes@gndrsh.aac.dev.com (Rodney W. Grimes)
Cc:        terry@lambert.org, pete@kesa26.kesa.com, jbryant@argus.iadfw.net, freebsd-hackers@FreeBSD.ORG, pete@rahul.net
Subject:   Re: 4GB Drives
Message-ID:  <199508312110.OAA23399@phaeton.artisoft.com>
In-Reply-To: <199508312010.NAA12388@gndrsh.aac.dev.com> from "Rodney W. Grimes" at Aug 31, 95 01:10:34 pm

> You see, in modern workstation disk drives you have something called
> spindle sync.  Well, when you set up spindle sync you have 2 modeselect
> values you tweak.  One bit says who is the sync master and who are
> the sync slaves.  Then for each slave drive you tweak another value
> that is used to offset the spindles from perfect sync so that the I/O
> of block zero of a track on drive 0 of a stripe set has just finished
> the scsi bus transfer when block zero of a track on drive 1 is about to
> come under the heads.

One assumes that stripes will not cross cylinder boundaries in this
case, since doing so would perturb a particular stripe but not all
stripes, then?

One also assumes that the head positioning on both drives is synchronized
so as to induce any seek delays simultaneously?
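
To put rough numbers on the offset described in the quoted text, here is
a minimal sketch. The function name, the example figures, and the
assumption that the offset field counts 1/256ths of a revolution are
mine, not from the thread; check the drive manual before trusting any
of it.

```python
def rotational_offset_units(transfer_time_s, rpm, drive_index, units_per_rev=256):
    """Return the spindle offset for one slave drive, in units_per_rev-ths
    of a revolution (hypothetical helper, for illustration only).

    transfer_time_s: time for one track's worth of data to cross the bus
    rpm:             spindle speed shared by all synced drives
    drive_index:     0 for the sync master, 1..n-1 for the slaves
    """
    rev_time_s = 60.0 / rpm
    # Slave k should lag the master by k bus transfers, modulo one turn,
    # so its block zero arrives just as the previous transfer finishes.
    lag_fraction = (drive_index * transfer_time_s / rev_time_s) % 1.0
    # The final modulo folds a lag that rounds up to a full turn back to 0.
    return round(lag_fraction * units_per_rev) % units_per_rev
```

For example, if one transfer takes a quarter of a revolution, drive 1
would get an offset of a quarter turn, drive 2 a half turn, and so on.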

> I was in no way talking about ``rotdelay'' in the file system since,
> I am still playing with raw devices at the block level, no file systems
> have been built since the slice code kinda screwed me up for getting
> labels on the things.

Rotational delay refers to the location of the head relative to the
sector address within the track, and is thus independent of file
system code unless the file system itself attempts to compensate.

Ideally, you'd want spindle-synced drives with identical geometries
and knowledge of the sector offsets at which a seek will occur so that
it can be avoided on both drives simultaneously.

Finally, you'd want the rotation *advanced* by the stripe length -- a
function of block placement on writes -- given that the advance in the
rotation will force the entire stripe into cache as the drive begins
reading before the end of the stripe with reverse ordered sectors.


In effect, a file system wants to be a variable block store, and have
the driver worry about issues like this and media perfections, etc.
Most file systems are not written this way, even "advanced" file
systems like vxfs (Veritas), hpfs, and ntfs.

The net effect of this is that you cannot guarantee stripes to be
consecutive except for as many drives as you have in the set.
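
As an illustration of why the guarantee runs out after one pass over the
set, here is a sketch of a plain round-robin stripe mapping. This is one
common layout, assumed for the example; the names and parameters are
hypothetical and nothing here is taken from the driver under discussion.

```python
def stripe_location(logical_block, n_drives, blocks_per_stripe_unit):
    """Map a logical block to (drive, block_on_drive) under simple
    round-robin striping (illustrative layout only)."""
    # Which stripe unit the block falls in, and where inside that unit.
    unit, offset = divmod(logical_block, blocks_per_stripe_unit)
    # Units are dealt out to the drives in turn; 'row' counts full passes.
    row, drive = divmod(unit, n_drives)
    return drive, row * blocks_per_stripe_unit + offset
```

With 4 drives and 8-block units, logical blocks 0, 8, 16 and 24 land on
drives 0 through 3, and block 32 wraps back to drive 0. From that point
on, whether the next unit is physically contiguous depends on what that
drive has done in the meantime.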

> Already looking at those factors.  I am given the fact that my drives
> will be SCSI-II, will report the zone pages, etc.  Without that stripe
> sets are pretty stupid and can never be made to go fast.   I have been
> able to get to 85% of theoretical bandwidth, not bad, but want to squeeze
> that on up to 95% before I go looking at laying file systems on this
> thing.

I think that unless you do the logical equivalent of predictive branching
(which it might be possible to precalculate at drive set initialization),
you are going to be limited to an effective hash efficiency with an
exponential fall-off at about 85% (Knuth: Sorting and Searching).  The
predictor you'd use to defeat this would be stripe prescheduling, for
instance by precalculating values for skip lists rather than a pure hash.
The 10% "reserve" in UFS is actually a soft hash-fill limit to keep it
reasonably close to the hash cost/benefit falloff of 85% that was
calculated by Taylor series expansion by Knuth.  Again, UFS is only an
example, as it was in rotdelay, since file systems shouldn't be doing
this type of crap, it should be at the driver level.
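
For a feel of how steep the falloff is, here is a small sketch using
Knuth's well-known approximations for the expected probe count under
linear probing. Treating the fill limit as a linear-probing hash table
is an assumption made for the example; the function names are
hypothetical.

```python
def probes_successful(load):
    """Approximate expected probes for a successful search in a
    linear-probing table at the given load factor (0 <= load < 1)."""
    return 0.5 * (1.0 + 1.0 / (1.0 - load))


def probes_unsuccessful(load):
    """Approximate expected probes for an unsuccessful search (or an
    insert) in a linear-probing table at the given load factor."""
    return 0.5 * (1.0 + 1.0 / (1.0 - load) ** 2)


if __name__ == "__main__":
    # Print the cost at a few fill levels around the 85% figure.
    for load in (0.50, 0.85, 0.90, 0.95):
        print(f"{load:.2f}  hit {probes_successful(load):7.2f}"
              f"  miss {probes_unsuccessful(load):7.2f}")
```

Under these formulas the cost of a miss roughly doubles between 85% and
90% full, and roughly quadruples again by 95%.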

Another thing that you might want to play with is turning *off* SCSI
sector replacement.  This may seem counter-intuitive, but in fact you
might be better off handling your own media perfection issues to
ensure that you don't get an unexpected seek in a stripe set.  You'd
be better off avoiding the bad block entirely than replacing it and
taking the replacement lookup hit.  8-).

How do you deal with thermal variance?  The "AV" drives don't try to
compensate while they are "busy" and so are quite fragile in this
regard.   I haven't looked into what would be required to precompensate
in the driver for recalibration delays, or if it's even something
that's possible at all.  It might be better in the long run to take
the risk to avoid the delay if you really feel "the need for speed".

> > Without the physical seek locations, any benchmarking will be rather
> > arbitrary based on the layout you end up with for a particular test.
> 
> To eliminate this very problem whilst I work on the technological ends
> of things I am simply doing raw disk I/O starting at the same logical
> drive on all spindles.  Those do often end up in the same physical
> location, and when I want the best numbers simply start at logical sector
> 0 which will always be physically the same location on all spindles sans
> whatever value I put into scsi mode page 4:Rotational Offset:.

This would definitely ensure internal consistency; I was thinking more
in terms of the results of particular stripe set builds, not necessarily
the identical build each run.  The results you get with the identical
build will be drive/instance dependent even after spindle sync without
seek optimization of some kind.

I think going to the engineering lengths to implement every possible
optimization is probably not worth it, though it's damn fun to try,
or at least talk about.  8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


