Date:      Wed, 17 Feb 1999 09:54:32 +1030
From:      Greg Lehey <grog@lemis.com>
To:        "Bryn Wm. Moslow" <bryn@spacemonster.org>
Cc:        freebsd-isp@FreeBSD.ORG
Subject:   Re: DPT 3334UW RAID-5 Slowness / Weird FS problems
Message-ID:  <19990217095432.F515@lemis.com>
In-Reply-To: <36C9DE12.78E77F53@spacemonster.org>; from Bryn Wm. Moslow on Tue, Feb 16, 1999 at 01:07:30PM -0800
References:  <36C88CC6.E1621F6F@spacemonster.org> <19990216105959.P2207@lemis.com> <36C9DE12.78E77F53@spacemonster.org>

On Tuesday, 16 February 1999 at 13:07:30 -0800, Bryn Wm. Moslow wrote:
> Greg Lehey wrote:
>
>> I don't know the DPT controllers, but 32 kB stripes are far too small.
>> For better performance, you should increase them to between 256 kB and
>> 512 kB.  Small stripe sizes create many more I/O requests at the drive
>> level.
>
> I was actually hoping to get quicker read times out of the spool by
> potentially having more heads at the disposal of each seek, given that
> the vast majority of files on the filesystem are either directories or 0
> length. Perhaps I came up on the short end of the stick on this one.
> Good suggestion. 32K was the default, for anyone else reading this... you
> have to change it on the DPT in the Storage Manager software.

From an earlier message I sent, in answer to somebody who was
advocating small stripes:

>> A media server (large files) 64 kB or more. On boxes with lots of
>> small files, accessed randomly and rapidly, 8 or 16 kB.
>
> Far too small
>
>> But you don't want to fill up the controller's command queue with
>> too many commands.
>
> That's not the big problem.  The fact is that the block I/O system
> issues requests of between 0.5 kB and 60 kB; a typical mix is somewhere
> around 8 kB.  You can't stop any striping system from breaking a
> request into two physical requests, and if you do it wrong it can be
> broken into several.  This will result in a significant drop in
> performance: any saving in per-disk transfer time is more than offset
> by the order-of-magnitude increase in latency.
>
> With modern disk sizes and the FreeBSD block I/O system, you can
> expect to have a reasonably small number of fragmented requests with a
> stripe size between 256 kB and 512 kB; I can't see any reason not to
> increase the size to 2 or 4 MB on a large disk.
>
> The easiest way to consider the impact of any transfer is the total
> time it takes: since just about everything is cached, the time
> relationship between the request and its completion is not important.
> Consider, then, a typical news article of 24 kB, which will probably
> be read in a single I/O.  Take disks with a transfer rate of 6 MB/s
> and an average positioning time of 8 ms, and a file system with 4 kB
> blocks.  Since it's 24 kB, we don't have to worry about fragments, so
> the file will start on a 4 kB boundary.  The number of transfers
> required depends on where the block starts: it's (S + F - 1) / S,
> where S is the stripe size in file system blocks, and F is the file
> size in file system blocks.
>
> 1: Stripe size of 4 kB.  You'll have 6 transfers.  Total subsystem
>   load: 48 ms latency, 4 ms transfer, 52 ms total.
>
> 2: Stripe size of 8 kB.  On average, you'll have 3.5 transfers.  Total
>   subsystem load: 28 ms latency, 4 ms transfer, 32 ms total.
>
> 3: Stripe size of 16 kB.  On average, you'll have 2.25 transfers.
>   Total subsystem load: 18 ms latency, 4 ms transfer, 22 ms total.
>
> 4: Stripe size of 256 kB.  On average, you'll have 1.08 transfers.
>   Total subsystem load: 8.6 ms latency, 4 ms transfer, 12.6 ms total.
>
> These calculations are borne out in practice.
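
If anyone wants to play with these numbers, the arithmetic boils down
to something like the following sketch (not taken from any existing
tool; it assumes the same 4 kB file system blocks, 8 ms positioning
time and 6 MB/s transfer rate as above):

    #include <stdio.h>

    int
    main(void)
    {
        const double file_kb  = 24.0;           /* typical news article */
        const double block_kb = 4.0;            /* file system block size */
        const double seek_ms  = 8.0;            /* average positioning time */
        const double xfer_ms  = file_kb / 6.0;  /* 6 MB/s is about 6 kB/ms */
        const int stripe_kb[] = { 4, 8, 16, 256 };
        int i;

        for (i = 0; i < 4; i++) {
            double S = stripe_kb[i] / block_kb;  /* stripe size in blocks */
            double F = file_kb / block_kb;       /* file size in blocks */
            double transfers = (S + F - 1) / S;  /* expected transfer count */
            double latency_ms = transfers * seek_ms;

            printf("%3d kB stripe: %.2f transfers, %4.1f ms latency, "
                "%4.1f ms total\n", stripe_kb[i], transfers, latency_ms,
                latency_ms + xfer_ms);
        }
        return 0;
    }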

As I said elsewhere, it looks like you'll end up with most of your
directories in cache anyway, so they're not the issue, and I don't see
any case where it will allow more concurrent accesses to different
files.

>>> My first problem was that I initially tried to do 16K per inode to
>>> speed things up a little bit (also, I didn't need the millions of
>>> inodes that came with a default newfs on 43GB =).)
>>
>> I might be missing something, but I don't see any performance
>> improvement by changing the number of inodes.
>
> I've seen not earth-shaking but significant increases in filesystem
> performance using this technique. It's taken from many a "Unix Gurus
> Down From the Mountain to Ram Knowledge into Your Skull" book and a
> couple of UFS/FFS optimizing guides I've read around the web.
> In my own testing, systems that require large numbers of opens and seeks
> within files (anything where you're moving a pointer) are sped up by
> reducing the number of inodes in a huge filesystem. It's just a little
> tweak but it can really contribute to performance, I've seen it and I
> wasn't hallucinating at that time <g>. Also, try formatting a 43 gig
> filesystem sometime, do a df -i and look how much space you lose at the
> default. I weep like a baby =). It makes me squirm in my sleep at night
> to see 80% usage on a fs and see inode usage at like 1% but I'm high
> strung and need a vacation...

OK, you can save (maybe much) space by reducing the number of inodes.
But this means that your average file size must be 16 kB or more, or
you will run out of inodes before the disk is full.
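
The knob for this is newfs's -i flag, which takes bytes per inode;
something like the following should give roughly one inode per 16 kB.
The device name here is just an example, so check newfs(8) and your
own disk names before pasting it anywhere:

    # newfs -i 16384 /dev/da0s1e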

>> What hung?  The FreeBSD system or the DPT subsystem?
>
> The whole dog and pony show... You know - like Windows 98 =) kinda
> freeze. Solid, solid as a rock. <G>

Hmm.  We should investigate that.  You want to build a kernel with
debugger and all those good things?
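
If you do, the pieces you want in the kernel config file are roughly
these (check LINT on your release for the exact spelling):

    options     DDB             # kernel debugger
    makeoptions DEBUG=-g        # build the kernel with debug symbols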

>>> The user directories for delivery are broken out into 1st letter, 1st
>>> two letters, username (i.e.: /home/u/us/username) to speed up dir
>>> lookups already.
>>
>> I'd guess that these would end up in cache anyway, so you shouldn't
>> see much improvement with this technique.
>
> I actually found this one when I had the mail spool on one disk eons
> ago, and it does indeed help. With 13,000+ entries in /var/mail, a
> directory lookup (anything requiring vnode access - heh, what doesn't?
> Just do a "man -k dir" and start poking) could take up to a second or
> two - especially when getting hardcore spammed or something - an
> eternity when you're firing off mail.local and popper every other
> nanosecond.

Hmm.  13,000 entries in a single directory is somewhat beyond the
design specs.  The problem here isn't disk access--that's only about
160 kB, depending on the length of the names--it's that the search
algorithms are a little primitive.  OK, I'm convinced, keep your
directory structure.
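
For anyone setting up the same layout, the mapping is trivial; here's
a made-up helper (not anything in the tree) just to show the shape of
it:

    #include <stdio.h>

    /* "username" -> "/home/u/us/username" */
    static void
    user_home(const char *user, char *buf, size_t len)
    {
        snprintf(buf, len, "/home/%.1s/%.2s/%s", user, user, user);
    }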

> Most of the system utils (ls, args, etc.) break on dirs this large
> (changed in 3.x?  - dunno)

Ah, you're talking about the length of the argument list.  Yes, that's
a nuisance.  Currently the parameter ARG_MAX is set to 65536, and it's
been like that since 1994.  BSDI uses 256 kB.  You could increase it
if you wanted.

> and you find yourself writing for loops and while loops just to do
> everyday stuff (like ls). Yes, I know, that's why I have the source
> - too bad it doesn't come with time to rewrite that whole part of
> the system <G>.

ARG_MAX is defined in /sys/sys/syslimits.h.  If you change it there
before a `make world' and a kernel build, it will propagate to
everything.
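
The line in question looks like this (from memory, so verify against
your own tree before changing it):

    #define ARG_MAX         65536   /* max bytes for an exec function */

Bumping that to, say, (256 * 1024) and rebuilding world and the kernel
would give you BSDI-sized argument lists.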

> Also, even though you have it in cache, access/modification times and
> such have to be updated, and the ratio of writes to reads is almost
> even on this particular filesystem. Shutting down atime on the fs would
> lose us a first-tier diagnostic tool, and I don't want to run the fs
> async <shudder>.

You could consider soft updates.
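
The general shape of that, once your release supports it, is an

    options     SOFTUPDATES

line in the kernel config, then enabling it per file system with
tunefs, something like:

    # umount /var/mail
    # tunefs -n enable /dev/da0s1e
    # mount /var/mail

The device name is again just an example, and the exact procedure has
varied between releases (3.x needs an extra step to hook the soft
updates code into the tree), so check tunefs(8) and the notes that
come with the source.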

> One last note - the hardware cache on the DPT isn't particularly
> useful for the majority of what goes on with this fs, or at least it
> doesn't seem to be, going by their theory of caching and the
> implementation described in their manual. The disk access pattern is
> just too random. It does help, very much so, but doesn't come close to
> the ideal.... (mmm... momentary solid state disk fantasy... shall we
> all pause?)

I can't comment, since I don't know the DPT.

>> There's a Compaq driver out there.  It seems to have some
>> strangenesses which suggest that it'll need a lot of work before it
>> can be incorporated into the source tree.
>
> Strangenesses, hehe - I like that one, can I use it? All the Compaq
> controllers I've looked at are made by DPT. Is there a new one?

Check out http://www.doc.ic.ac.uk/~md/ida/.  It might be a DPT
controller in disguise.

> Thanks, Greg, I appreciate that you put some time and thought into this.
> Mighty nice of you...

You're welcome.
Greg
--
When replying to this message, please copy the original recipients.
For more information, see http://www.lemis.com/questions.html
See complete headers for address, home page and phone numbers
finger grog@lemis.com for PGP public key





