Date:      Tue, 14 Aug 2018 22:26:39 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        bob prohaska <fbsd@www.zefox.net>
Cc:        Mark Millard <marklmi@yahoo.com>, freebsd-arm <freebsd-arm@freebsd.org>,  Mark Johnston <markj@freebsd.org>
Subject:   Re: RPI3 swap experiments (grace under pressure)
Message-ID:  <CANCZdfoB_AcidFpKT_ZmZWUFnmC4Bw55krK+MqEmmj=f9KMQ2Q@mail.gmail.com>
In-Reply-To: <20180815013612.GB51051@www.zefox.net>
References:  <20180812173248.GA81324@phouka1.phouka.net> <20180812224021.GA46372@www.zefox.net> <B81E53A9-459E-4489-883B-24175B87D049@yahoo.com> <20180813021226.GA46750@www.zefox.net> <0D8B9A29-DD95-4FA3-8F7D-4B85A3BB54D7@yahoo.com> <FC0798A1-C805-4096-9EB1-15E3F854F729@yahoo.com> <20180813185350.GA47132@www.zefox.net> <FA3B8541-73E0-4796-B2AB-D55CE40B9654@yahoo.com> <20180814014226.GA50013@www.zefox.net> <CANCZdfqFKY3Woa+9pVS5hika_JUAUCxAvLznSS4gaLq2kKoWtQ@mail.gmail.com> <20180815013612.GB51051@www.zefox.net>

On Tue, Aug 14, 2018 at 7:36 PM, bob prohaska <fbsd@www.zefox.net> wrote:

> On Tue, Aug 14, 2018 at 05:50:11PM -0600, Warner Losh wrote:
> [big snip]
> >
> > So, philosophically, I agree that the system shouldn't suck. Making it
> > robust against suckage for extreme events that don't match the historic
> > usage of BSD, though, is going to take some work.
> >
>
> You've taught me a lot in the snippage above, but you skipped a key
> question:
>
> What do modern sysadmins in datacenter environments want their machines
> to do when overloaded? The overloads could be malign or benign, they
> might even be profitable. In the old days the rule seemed to be "slow
> down if you must, but don't stop". Page first, swap second, kill third.
>
> Has that changed? Perhaps the jobs aborted by OOMA can be restarted by
> another machine in the cloud? Then OOMA makes a great deal more sense.


No. That's not changed. That's the order we do things in FreeBSD still. The
question is always when do you give up on each level? When do you start to
swap? Ideally never, but you should swap when your current rate of page
cleaning can't keep up with demand and there are dirty pages whose memory
you could reclaim by swapping them out. When do you OOMA? Ideally, never, which is
why we try to avoid it. The problem is that the heuristic we use to avoid
it (give up after 12 tries) is tuned for systems whose memory demand is well
matched to the I/O system: dirty pages are created more slowly than the disk
can swap them out, and there's rarely a backlog in the disk system. There's no knowledge
in the upper layers of how much we're loading the disks (apart from a few
kludges like runningbuf), so it has to guess how it can best put load onto
the disks to get the most out of them. All the tunables in the kernel for
the VM system try to address these balance points. We want to have enough
free pages on hand to give to processes so they don't have to sleep. We want
to keep that count above a minimum, which is basically the response time of
the page daemon to new or changing demand. The extra pages act as a shock
absorber for changes in load. Beyond that, the PID controller in the page
daemon does just enough work to ensure that we push out the pages we need to
keep up with demand, yet not so much that we do unnecessary work. It knows how many new
pages will be dirtied (estimated based on recent events), how many clean
ones will show up (also based on recent history), so it can guess fairly
well that to keep above the low water mark, it needs to launder so many
pages in the next interval. The PID keeps the oscillations down, and allows
it to respond more quickly to trouble than a simple 'steer out the error
(P)' loop. However, tuning the PID loop can be tricky in some applications.
From the data I've seen so far, FreeBSD isn't one of the tricky
applications: there's a broad range of Kp, Kd and Ki values that give good
results.
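
To make that concrete, here's a toy, user-space sketch of the idea (the
structure, function names, gains and page counts are all made up for
illustration; this is not the kernel's actual pageout code): each interval
the controller compares the free-page count to its target and decides how
many pages to launder.

    #include <stdio.h>

    /*
     * Toy PID controller: steer the free-page count toward a setpoint by
     * deciding how many pages to launder each interval.
     */
    struct pidctl {
            double kp, ki, kd;      /* proportional, integral, derivative gains */
            double setpoint;        /* free pages we want to keep available */
            double integral;        /* accumulated error (slow correction) */
            double prev_error;      /* last error, for the derivative term */
    };

    static long
    pages_to_launder(struct pidctl *p, double free_pages)
    {
            double error = p->setpoint - free_pages;    /* >0 means we're short */
            double deriv = error - p->prev_error;       /* how fast it's changing */

            p->integral += error;
            p->prev_error = error;

            double out = p->kp * error + p->ki * p->integral + p->kd * deriv;
            return (out > 0 ? (long)out : 0);           /* never "unlaunder" pages */
    }

    int
    main(void)
    {
            struct pidctl pid = { .kp = 0.5, .ki = 0.05, .kd = 0.2,
                                  .setpoint = 4096 };
            /* Pretend free memory sags over a few intervals. */
            double free_samples[] = { 4096, 3500, 2800, 2600, 3900 };

            for (int i = 0; i < 5; i++)
                    printf("interval %d: launder %ld pages\n",
                        i, pages_to_launder(&pid, free_samples[i]));
            return (0);
    }

The derivative term is what lets it react to a sudden sag before the backlog
gets large, and the integral term is what keeps it from chronically
undershooting; the P-only version has neither.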

So I think we may be seeing several problems here. One is that the normal
write speed of thumb drives isn't that great, so the ability to push pages
out is diminished (think of it this way: if you had zero-cost page out, you
would only be limited by the architecture's VM limits; real disks take time,
so the practical limits are somewhat less than that). In the past, read and
write speeds have stayed within the same order of magnitude (more or less:
median may be 5ms and P99 may be 40ms with max somewhere near 60ms, for
example, and the numbers are similar for read and write), but with some
flash that's no longer true. So even when there are no bugs or trouble in
the I/O stack, the odds are stacked against you. Next, you have the problem
that thumb drives have an 'erase size' that's more like 64k or 128k or so,
not 4k, so the traditional behavior of the swapper is to write in chunks
somewhat smaller than that, which can make these drives perform even worse
due to rewriting (the good ones have a true log device behind the scenes,
so this doesn't matter; the bad ones cut costs and don't have enough RAM
for the LUTs needed to do this, so they make tradeoffs, one of which can be
read-modify-write). Next, there are issues with something in the system:
either the drive stops responding (so we get timeouts) or the USB stack
hiccups (so we get timeouts) or something. This problem comes and goes and
confounds efforts to make the first problems better...
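
To put rough numbers on the erase-size point (illustrative, not measured
from any particular drive): if the erase block is 64k and a swap write is
4k, a cheap controller doing read-modify-write can end up rewriting the
whole 64k block for each 4k it accepts, on the order of 16x write
amplification in the worst case, on media whose sustained write speed was
already the weak spot.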

So I think what's well tuned for the gear that's in a server doing
traditional database and/or compute workloads may not be so well tuned for
the RPi3 when you pair it with NAND that can vary a lot in performance and
that has fast reads but slow writes, even when the write share of the mix
isn't that high. The system can be tuned to cope, but isn't tuned that way
out of the box.

tl;dr: these systems are different enough from the normal system that
additional tuning is needed, where the normal systems work great out of the
box. Plus some code tuneups may help the algorithms be more dynamic than
they are today.
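
On the OOMA side specifically, the "12 tries" heuristic above is exposed as
a sysctl (vm.pageout_oom_seq, default 12, if I recall the name right), so
one experiment worth trying on these boards is to make the page daemon far
more patient before it starts killing things, e.g.

    sysctl vm.pageout_oom_seq=120

That doesn't make the swap device any faster; it just trades a longer slog
through the shortage for fewer surprise kills.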

Warner


