Date: Sun, 27 Jan 2002 04:05:21 -0800
From: Terry Lambert
To: "Gary W. Swearingen"
Cc: freebsd-chat@FreeBSD.ORG
Subject: Re: Bad disk partitioning policies (was: "Re: FreeBSD Intaller (was "Re: ... RedHat ...")")

"Gary W. Swearingen" wrote:
> That's odd.  Your example there shows relative and I interpret the
> rest of your comments about hashing to imply that it's relative.

I meant "relative to the size".  It's an absolute scaling factor (the
percentage of the disk that should be used for the free reserve is
invariant).

> (Maybe my use of absolute and relative wasn't clear.  Absolute meant
> the reserve space for good defragging (or SA reserve) wasn't (much)
> dependent on partition size, while relative meant the reserve space
> needed was a set fraction of the partition size.)

Yep.  See above.

> Trust me.  It's not easy to understand from this thread so far, and I
> don't expect it to be; I can go to the FFS treatise for understanding.
> I feel bad even seeing you spend your time trying to explain reasons.

Nonsense.  If I can't explain reasons, then they are unsupportable (by
me, at least ;^)).

> But I am asking for statements of how the algorithm behaves which
> would be helpful in knowing whether to twist the -m knob or how far.

The algorithm operates by hashing to select where to write next.  If
it collides, it has to do collision handling, and that inflates the
cost considerably.  The free reserve is intended to keep the disk
empty enough to prevent hash collisions from occurring.  At 85% fill,
this is a probability of 1:1.040 of getting an empty area (for a
perfect hash).

You really need to read the FFS paper and the Knuth book if you want
to understand the math that makes it work, since I am a poor math
teacher (IMO 8-)).  I do much better in person, on a whiteboard, and
waving my hands.

The tweaks are typically to reduce the free reserve, and to reduce the
threshold below which optimization will be for space filling rather
than for speed (the 5% number below which performance becomes a factor
of 3 slower is the space-filling optimization threshold).
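To make that concrete, here is a minimal toy sketch of "hash to pick
where to write next, then handle collisions".  It is not the FFS
allocator (FFS works per cylinder group with block and fragment maps,
not a flat bitmap), and everything in it is made up for illustration;
it only shows why the cost of finding free space climbs as the fill
level rises past where the free reserve would normally stop you:

/*
 * Toy model of hashed free-block selection with collision handling.
 * NOT the real FFS allocator; just an illustration of why allocation
 * cost climbs as the disk fills up.
 */
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 100000                  /* pretend disk, in blocks */

static unsigned char used[NBLOCKS];

/*
 * Hash a preferred location, then probe forward until a free block
 * turns up.  Returns the number of probes: the cost we care about.
 */
static int
toy_alloc(unsigned pref)
{
        unsigned b = pref % NBLOCKS;
        int probes = 1;

        while (used[b]) {               /* collision: keep probing */
                b = (b + 1) % NBLOCKS;
                probes++;
        }
        used[b] = 1;
        return (probes);
}

int
main(void)
{
        long long window_probes = 0;
        int window_allocs = 0;
        int i;

        srandom(12345);
        for (i = 0; i < NBLOCKS; i++) {
                window_probes += toy_alloc((unsigned)random());
                window_allocs++;
                /* Every 5% of fill, report the average cost within
                 * that band, then reset the counters. */
                if ((i + 1) % (NBLOCKS / 20) == 0) {
                        printf("%3d%% full: avg %.1f probes/allocation\n",
                            (i + 1) / (NBLOCKS / 100),
                            (double)window_probes / window_allocs);
                        window_probes = 0;
                        window_allocs = 0;
                }
        }
        return (0);
}

The last couple of bands make the point better than any prose: the
allocations done above ~90% full typically cost orders of magnitude
more probes apiece than the ones done below 85%.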
Dropping the free reserve decreases the required free space, but once
the disk fills to the point where it is more than 85% full, every
fractional percent fuller it gets after that increases the probability
of a collision on an attempt to allocate free space via hashing.

When you get a collision, you get two things: (1) the speed decreases,
both writing and reading, since you take longer to find places to
write, and the writing and reading occur in scattered chunks instead
of clusters of the FS block size, and (2) the fragmentation of the
disk increases for files created or extended during the low free
reserve period.

It's because of the hashing that FFS does not suffer from
fragmentation, and therefore there is no need for a "defragger"; many
people don't understand this, and ask where they can get a defragger
anyway.

> > If you have a friend who is a statistician, you should
> > ask them to explain "The Birthday Paradox" to you.
>
> I've read about it several times, always forgetting the math, but I
> remember you need only about 50 people for a 0.5 match probability.

23.  For 50 people, the probability is about 97%.

Basically, people's birthdays are hashed using modulus 365, and you
are checking for hash collisions after hashing everyone into one of
365 buckets based on their birthday.

There's a really nice statistical explanation at:

	http://www.howstuffworks.com/question261.htm

8-).

365 is a really bad number for a hash, since it is not prime.  It's
far from a "perfect hash"/"Fibonacci hash".

> > You'd probably benefit from reading the original FFS paper.
>
> No doubt.  Though I trust you that the performance of the algorithm
> is not a function of the partition size, but of the reserve relative
> to that size (and the space filled relative to that size), I'll need
> to read more to believe that I care as much about poor performance
> with a relatively full big disk as with a small one.  For example, I
> might accept slow performance to get an extra 5 GB when I wouldn't
> for 50 MB.

People complain about the free reserve on very large disks, even at a
very small free reserve percentage, mostly because they grew up in an
era when "that was a lot of space!".  The reality is that the
algorithm needs a certain percentage of the space to work correctly,
and if you take that away, then it doesn't work correctly.

If you want to use a different algorithm, fine.  But so far, the only
competing one that seems to be worth anything is to extent or log
structure your FS, and then spend CPU cycles, and some percentage of
your disk access cycles, on a "cleaner" process that follows along
behind the users and manually defragments the disk.  This process is
expensive enough that, even on disks filled to under where the free
reserve would be, you are paying a performance penalty on any disk
holding more data than a single "cleaner" relocation block.

In other words, there's a trade-off.  If you assume your disks are
full, or you know your limiting factor is going to be I/O, and never
CPU, then you might be better off using a different approach.  For
general purpose use, though, FFS has served us well for a couple of
decades now.  8-).

> > You know, you could worry about something else... like
> > the fact that a formatted disk has less capacity than an
> > unformatted one.
>
> I probably would, if there was a poorly-documented knob for that too.

8-).
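For anyone who wants to check those birthday numbers above rather than
take them on faith, here is a minimal sketch (plain C, nothing
FreeBSD-specific): the chance of at least one collision among n people
is one minus the chance that all n land in distinct buckets.

/*
 * Probability of at least one collision when n "birthdays" are hashed
 * into 365 equally likely buckets.  P(no collision) is the product
 * (365/365) * (364/365) * ... * ((365 - n + 1) / 365).
 */
#include <stdio.h>

static double
collision_prob(int n)
{
        double p_none = 1.0;
        int i;

        for (i = 0; i < n; i++)
                p_none *= (365.0 - i) / 365.0;
        return (1.0 - p_none);
}

int
main(void)
{
        printf("23 people: %.1f%% chance of a shared birthday\n",
            100.0 * collision_prob(23));
        printf("50 people: %.1f%% chance of a shared birthday\n",
            100.0 * collision_prob(50));
        return (0);
}

It prints about 50.7% for 23 people and about 97.0% for 50, which is
why 23 is the magic number for even odds.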
> But when I read silly recommendations to set the swap/RAM knob to 2,
> regardless of the size of RAM or applications, I find it easy to
> question other recommendations for which the justification is only
> deep in the source or developer archives or even hairy treatises or
> seemingly wrong (as the above tunefs(8) quote).

It used to be that the swap/RAM knob was 1:1.  It became 2:1 by
default when we started doing memory overcommit in UNIX, and it really
hasn't been reexamined much since then.

Really, it'd probably be a good idea to find a reasonable way to let
swap take up disk space until you ran out, on the theory that the
limiting factor will be the limiting factor: whether it's swap space
or disk space doesn't matter, and it's preferable to exceed
administrative limits (at least up to the limits of the available
hardware) than to not do the job you intended the hardware to do.
NeXTStep did this, and Windows does this currently (that's the real
reason for the API to get the physical block list for a file, which we
discussed as a way of putting a FreeBSD disk into an NTFS file, in the
"partitioning" thread).

The problem with doing "swap files" is that accessing swap through an
FS adds another level of indirection (that's what the Windows direct
sector list access API is taking out, but it can't make it as fast as
a raw swap partition, because it can't guarantee physical adjacency of
logically adjacent file blocks ...unless the disk isn't very full).

> Actually, my worry was not really in how something worked or could
> be optimized as much as it was a response to what I find to be a
> poorly documented config setting.  If it just said "leave this to
> experts" I probably wouldn't have brought it up.  But when I read the
> tunefs quote above, I see an implication that I'm quite sure is
> absolutely wrong: It implies that the throughput will always be poor,
> regardless of how full the disk is.  That is misleading and tends to
> make people twist the knob less far than they would if the statement
> expressed the truth better.  Maybe it only needs to change
> "throughput" to "worst-case throughput" or "near-full throughput".
> It's also quite common-sensical to think that the reserve wouldn't be
> as necessary for big disks as it was for small ones.  Better
> documentation would head off many FAQs on this issue.

This issue has been discussed many times before.  It's in the
literature, and it's in the FreeBSD list archives dozens of times, at
least.  8-).

To address your suggestions: wording it that way would imply that you
could get non-worst-case performance on a disk that is full, right up
against a very small, administratively selected free reserve.  The
real answer is that the more data there is on the disk above the
optimal free reserve for the block selection algorithm, the worse the
performance will be, and "worst case" is defined as "the last write
before hitting the free reserve limit".  So disk performance degrades
steadily the fuller the disk gets over the optimal free reserve (which
is ~15%, much higher than the free reserve kept on most disks).

The other misleading thing is that, once written fragged, a file will
remain fragged, even if you drop the system back down below the "space
optimization" limit, or even down below the "free reserve" limit.  If
that happens to an important file, then you are screwed, since there's
no defragger: the system was never designed to be run with the disk
filled over the free reserve.
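To put rough numbers on that steady degradation, here is the analytic
counterpart of the toy allocator sketched earlier in this message.
Under an idealized uniform-hashing model (the kind Knuth analyzes),
finding a free slot takes on the order of 1 / (1 - fill) probes; the
constants for the real FFS allocator differ, so treat this as a sketch
of the shape of the curve, not a measurement of FFS:

/*
 * Expected probes to find a free block under an idealized uniform-
 * hashing model: E[probes] ~= 1 / (1 - fill).  Illustrative only; the
 * point is the shape of the curve as the free reserve gets eaten into.
 */
#include <stdio.h>

int
main(void)
{
        static const double fills[] = { 0.80, 0.85, 0.90, 0.95, 0.99 };
        size_t i;

        for (i = 0; i < sizeof(fills) / sizeof(fills[0]); i++)
                printf("%2.0f%% full: ~%.0f probes expected\n",
                    fills[i] * 100.0, 1.0 / (1.0 - fills[i]));
        return (0);
}

Going from 85% full to 95% full roughly triples the expected work per
allocation, and it only gets worse from there.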
Thus it's a good idea to keep a large free reserve, so that a run-away
user process can't screw up the on-disk access speed of important
files belonging to another, more important process.

BTW: "root" is immune to the free reserve limit; that's why you can
sometimes see disks that are "110% full": the percentage is calculated
from the amount used divided by the space available _under the
reserve_ (for example, with a 10% reserve, a disk that root has filled
completely reports roughly 100/90, i.e. about 111%).

BTWBTW: If you screw up an important file this way, you can fix it by
backing it up, deleting it, and restoring it, once the disk has
dropped back down to the optimal free reserve.  This is known as "the
poor man's defragger".

--
Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message