From owner-freebsd-hackers Thu Oct 30 15:03:38 1997
Return-Path:
Received: (from root@localhost) by hub.freebsd.org (8.8.7/8.8.7) id PAA18477 for hackers-outgoing; Thu, 30 Oct 1997 15:03:38 -0800 (PST) (envelope-from owner-freebsd-hackers)
Received: from usr03.primenet.com (tlambert@usr03.primenet.com [206.165.6.203]) by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id PAA18466 for ; Thu, 30 Oct 1997 15:03:34 -0800 (PST) (envelope-from tlambert@usr03.primenet.com)
Received: (from tlambert@localhost) by usr03.primenet.com (8.8.5/8.8.5) id QAA04930; Thu, 30 Oct 1997 16:01:14 -0700 (MST)
From: Terry Lambert
Message-Id: <199710302301.QAA04930@usr03.primenet.com>
Subject: Re: help with fstat?
To: karpen@ocean.campus.luth.se (Mikael Karpberg)
Date: Thu, 30 Oct 1997 23:01:12 +0000 (GMT)
Cc: tlambert@primenet.com, freebsd-hackers@FreeBSD.ORG
In-Reply-To: <199710301129.MAA10740@ocean.campus.luth.se> from "Mikael Karpberg" at Oct 30, 97 12:29:30 pm
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Sender: owner-freebsd-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

> Well, it's not for sure that the pages used in a MADV_SEQUENTIAL reading
> in a process will not be used again, is it?  I might back up a few bytes
> in parsing text, for example, but ALMOST be sequential, and then it might
> be a good idea to hint the system anyway.  That would easily be solved
> with three pages, though, if one page is enough read ahead.

You will maintain a lookahead buffer in your code to do this, or you
will ensure that you *never* back over a page boundary (at least not one
that's not in a read-ahead page chain, which you probably can't know),
or... you won't lie to the system via madvise.

Alternately, you agree to pay heinous paging overhead each time you go
back on your promise to the VM system.  8-) 8-).

> But the real case of where it will be reused is, actually, if many
> processes access the file after each other, or almost simultaneously.
Well, if you look at the code, there are reference instances which are
divorced, so I think this will not be a problem.

> That might be the case for something like a loaded webserver where the
> speed of a read might matter a lot.

If you can only make the flag apply to the shared object instead of the
referencing object, then you'd be right.

Most likely, you would not use MADV_SEQUENTIAL in that case: you'd save
the flagging for a case like "cp" (which currently does not use mmap()
because of a legacy "fix", and does not call madvise() to flag it
MADV_SEQUENTIAL anyway).

I.e., you mark things sequential only if you promise they will be
accessed that way, and you don't make promises you can't keep (promises
you can't keep is what INN did before the msync() fixes took place).

> It might be mmaping and writing a whole bunch of index.html copies
> a second, accessing them sequentially, in which case it is likely to use
> MADV_SEQUENTIAL, no?

No... at least not if the system is loaded above the amount of physical
RAM.  And if it's loaded above the amount of physical RAM + swap, you
are utterly screwed.

> It's a very good thing if it doesn't trash those pages right away,
> then.

Do you want the pages cached behind you, or are you promising to access
them sequentially?  You can have one or the other.

Either you say "I will not use these pages again, and, oh yes, I want
read-ahead from the get-go even though I have not triggered slow-start
sequential access recognition" (which should *also* set OBJ_SEQUENTIAL,
btw!), OR you say "I may need these later".

The whole issue here is process vs. system locality of reference.  The
whole issue with per-vnode working set quotas is to prevent fast process
locality from stomping slow system locality to death.

If I'm running 5 xterms, each with a copy of /bin/sh, I should favor the
executable images used by 5 processes over the data images used by one
when I'm deciding whose page gets stolen to satisfy an "I want a page"
request.
> But less accessed pages will be very happily discarded right
> away.  They will not be moved back in the free-queue all the
> time, because they are not accessed again.  So they WILL be
> truly discarded.
>
> Now, this might not be completely correct, but don't I have a point, Terry?

I think there is still a need for a quota.  The need is *NOT* the result
of the MADV_SEQUENTIAL case (which is specific enough that it can be
tweaked to be "sort of optimal" relatively easily).

In reality, when ld or some other program randomly accesses a working
set larger than physical RAM, it does so quickly enough (it's an I/O
bound process -- its soft priority will be kicked up) that it will
basically force everyone else's clean-but-going-to-be-reused pages (oh,
like the text images backing a running program) out of core to back the
faulted pages.

You can demonstrate this by using an mmap'ing ld to link a kernel while
you are running from an xterm, and trying to select another window.  You
have to page:

o	The X server's mouse code
o	The mouse cursor bitmap
o	The xterm you are moving from for LeaveNotify
o	The window manager for EnterNotify
o	The xterm again for FocusChanged
o	The xterm's cursor change code
o	The window manager again for FocusChanged (window manager window)
o	The window manager for LeaveNotify (out of one xterm frame)
o	The window manager for EnterNotify (into another)
o	The window manager for LeaveNotify (out of the second xterm frame)
o	The new xterm for EnterNotify
o	The window manager and new xterm for FocusChanged
o	The new xterm's cursor change code

Now you are ready to type:

o	The xterm's keyboard handling code
o	The shell on the other end of the pty
o	The xterm's display handling code
o	The X server's font for that xterm, plus the GC, plus the
	colormap, plus...

Etc.

Each one of these event boundary transitions is a full transit of the
run queue by the scheduler.
Each page involved (after the ld has thrashed them all out of core and
swap) is a disk access (tsleep() -- another run queue transition) for
however many code pages are involved (X itself is 8-10M -- how big is
Motif?).

The interactive response basically goes in the toilet when a process is
allowed to create a large virtual address space and basically displace
all other clean pages to the end of the LRU, and discard them from
there.

Such processes need to be whacked on the knuckles.  I'm up for any
suggestions you have to do the whacking, if you think it's possible
without a working set quota...


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.