Date:      Wed, 20 Feb 2002 01:26:40 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Tony Finch <dot@dotat.at>
Cc:        Alfred Perlstein <bright@mu.org>, Dominic Marks <dominic_marks@btinternet.com>, Kip Macy <kmacy@netapp.com>, Peter Wemm <peter@wemm.org>, Mike Silbersack <silby@silby.com>, Hiten Pandya <hiten@uk.FreeBSD.org>, freebsd-hackers@FreeBSD.ORG
Subject:   Re: In-Kernel HTTP Server (name preference)
Message-ID:  <3C736BD0.73513ECB@mindspring.com>
References:  <20020220010113.B17928@chiark.greenend.org.uk>

Tony Finch wrote:

[ ... Terry describes non-blocking I/O on page-not-present
      on SVR4, and how it behaves better than BSD ... ]

> How does it deal with the situation that the machine's
> working set has exceeded memory? If the web server is dealing
> with lots of concurrent connections it may have to block in
> poll() waiting for (say) a thousand pages to be brought in,
> then in the process of dealing with them after the poll()
> returns some of the pages may get re-used for something else
> making read() on a "readable" fd return EWOULDBLOCK again.

It doesn't deal with this at all.  SVR4 doesn't deal with
overcommit of the working set at all well.  I've pointed
this out many times since 1992: the SVR4 linker maps all
the involved object files into its own address space, and
then seeks all over heck (effectively) to do the relocation
and the symbol table fixup.  In this process, it thrashes
out all
the pages for the X server (among other things, but that's
the most noticeable), and you lose control of the system
until the link is done.

FreeBSD has a slightly less difficult time with processes
which are pigs in this way, mostly because it has a unified
VM and buffer cache, so the contention between the VM and
buffer cache allocations is no longer there. However, a
"correctly" written program, designed to exercise the code
path for the degenerate case will, well... exercise the code
path for the degenerate case.

SVR4 dealt with this issue by throwing the CPU at it: they
have modular scheduler classes, and one of the ones they
provide is called "fixed", where a certain percentage of
the CPU is dedicated to that task, whether it needs it or
not.  Thus the X server runs in a "fixed" class, and it
thrashes the pages it needs back in during the time allotted
to it, and the net effect is that when you move the mouse,
the cursor wiggles, just like it's supposed to.  This approach
works for SVR4 _because_ the VM and buffer cache are not
unified, and _cannot_ work for FreeBSD, because they are,
since it can't attribute demand back to the demander.

The correct fix for this problem is to set a high watermark
for the amount of available system memory, and a high
watermark per vnode.  When you hit the high watermark for
the system, then you are in a resource starvation situation;
knowing this, then if you are asked to page a page in on a
vnode, you check its page count, and, if it is over the
second high watermark, then instead of taking an LRU page
from the system, you steal the page from the page list on
the vnode instead.

The net effect of this approach is, in starvation situations
(and _only_ in starvation situations!), you limit the per
vnode working set size.
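A minimal sketch of that two-watermark victim-selection
decision (all names and numbers here are invented for
illustration; this is not the FreeBSD VM code):

```c
#include <stddef.h>

/* Illustrative thresholds; in practice these would be tunables. */
#define SYSTEM_HI_WATERMARK 90   /* percent of physical pages in use */
#define VNODE_HI_WATERMARK  200  /* resident pages allowed per vnode */

struct fake_vnode {
	size_t v_resident_pages;  /* pages this vnode holds in core */
};

enum steal_from { STEAL_GLOBAL_LRU, STEAL_OWN_VNODE };

/*
 * When vp needs another page in core: normally take the global LRU
 * victim, but once the system is past its watermark AND the vnode
 * is past its own, recycle one of the vnode's own pages instead,
 * capping that vnode's working set only under starvation.
 */
enum steal_from
pick_victim(unsigned pages_used_pct, const struct fake_vnode *vp)
{
	if (pages_used_pct >= SYSTEM_HI_WATERMARK &&
	    vp->v_resident_pages >= VNODE_HI_WATERMARK)
		return (STEAL_OWN_VNODE);
	return (STEAL_GLOBAL_LRU);
}
```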

Obviously, the VMS approach of limiting the per process
working set size (via a working set quota) would be better,
if you could enforce it, and if you delayed enforcement
until starvation set in.  But doing this per process is not
possible with a unified VM and buffer cache, unless all file
I/O occurs via mmap(), rather than kernel read/write calls
(not possible to do, because of struct fileops, since not
all vnodes are created equal in FreeBSD; sockets are a
particularly problematic area).

You could make this approach even more complex, in an attempt
to ensure "fairness" by raising the quota on a per reference
basis, but that's exploitable.

If we are talking web traffic here, then the enforcement of
working set size should probably take into account how close
the quota is to the file size: if the quota is 800k, and the
file is 801k, then it probably makes sense to give in to the
process, and load the extra page to avoid thrashing.  This is
probably calculable as a percentage of the remaining system
resources, once the system is over the high watermark, but
below total starvation.
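That "give in near the file size" heuristic could be sketched
like this (hypothetical function and scaling rule; the text
above only suggests scaling by the remaining resources):

```c
#include <stddef.h>

/*
 * Once the system is over its high watermark but below total
 * starvation, let a vnode exceed its page quota when the overage
 * needed to hold the whole file is small, scaled by how much slack
 * remains.  Quota 800k vs. an 801k file: load the one extra page
 * rather than thrash the mapping.
 */
int
allow_overage(size_t quota_pages, size_t file_pages, unsigned slack_pct)
{
	if (file_pages <= quota_pages)
		return (1);		/* fits within quota already */
	size_t overage = file_pages - quota_pages;
	/* permit an overage of up to slack_pct percent of the quota */
	return (overage * 100 <= quota_pages * (size_t)slack_pct);
}
```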

Realize that the normal approach to this problem is to simply
trust LRU and the locality of reference model.

I don't think that anything you can think of (short of packing
in more RAM) can possibly prevent at least _some_ elbow in the
performance at starvation.


> But on the other hand the OS can't lock the pages in memory
> until they are read, since passing the fd to poll() isn't
> a promise to read since this may lead to DOS attacks, or
> alternatively processes being unexpectedly killed for hitting
> RLIMIT_MEMLOCK.

Yes.  I _am_ assuming that your access model used by your
application is relatively uniform.  If everything doesn't
go through the same access path, all bets are off.  This is
going to be true of any asymmetric application under resource
starvation conditions, though, so I view this as a problem
of you shooting yourself in the foot: it's your foot, and you
can do what you want with it, including shooting it.

This should mean that it's robust in the face of a DOS attack
(given that the code path is uniform), but not robust in the
face of being badly implemented.

I'd have to say that there, too, you have no defense, but
since it's your own fault, "as ye sow, so shall ye reap".

8-).


> I can't see a good way of avoiding these semantic problems
> without changing to a completely different kernel API like KSE.

No.  A minor change in the working set management algorithm,
away from a simple global LRU, and then only in the starvation
case, can make a significant positive difference.

Personally, I'd do a simple watermark of 90% using integer
math, and then, for any vnode over its quota, enforce a per
vnode LRU policy rather than using the global LRU table.
There's another advantage to this, in that enforcing on a
per vnode basis avoids taking the global LRU lock.  This
would add three sysctls:

1)	The global watermark level
2)	The per vnode weighting algorithm selection for
	deriving the per vnode watermark
3)	Enable/disable

It would be relatively simple to make the per vnode watermark
dynamic, based on the number of vnodes in use in the system,
or the number of "eligible vnodes", if you want to make
special exception for executables, since they must be in core
in order to create demand in the first place (but by that
token, the current global LRU policy is broken, since it takes
no notice of executables as special entities).
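A dynamic per vnode watermark along those lines might simply
divide the reclaimable pages among the eligible vnodes, with
a floor so no vnode gets squeezed to nothing (the names and
the floor are my invention, purely to illustrate the idea):

```c
#include <stddef.h>

/*
 * Derive the per-vnode watermark from the number of "eligible"
 * vnodes (data files; executables excluded, since they must be in
 * core to create demand in the first place).
 */
size_t
per_vnode_watermark(size_t reclaimable_pages, size_t eligible_vnodes,
    size_t floor_pages)
{
	if (eligible_vnodes == 0)
		return (reclaimable_pages);
	size_t share = reclaimable_pages / eligible_vnodes;
	/* never squeeze a vnode below a minimum working set */
	return (share > floor_pages ? share : floor_pages);
}
```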

-- Terry
