Date:      Sun, 12 Aug 2018 20:05:06 -0700
From:      John Kennedy <warlock@phouka.net>
To:        bob prohaska <fbsd@www.zefox.net>
Cc:        freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: RPI3 swap experiments ["was killed: out of swap space" with: "v_free_count: 5439, v_inactive_count: 1"]
Message-ID:  <20180813030506.GC81324@phouka1.phouka.net>
In-Reply-To: <20180812224021.GA46372@www.zefox.net>
References:  <20180802015135.GC99523@www.zefox.net> <EC74A5A6-0DF4-48EB-88DA-543FD70FEA07@yahoo.com> <20180806155837.GA6277@raichu> <20180808153800.GF26133@www.zefox.net> <20180808204841.GA19379@raichu> <2DC1A479-92A0-48E6-9245-3FF5CFD89DEF@yahoo.com> <20180809033735.GJ30738@phouka1.phouka.net> <20180809175802.GA32974@www.zefox.net> <20180812173248.GA81324@phouka1.phouka.net> <20180812224021.GA46372@www.zefox.net>

On Sun, Aug 12, 2018 at 03:40:21PM -0700, bob prohaska wrote:
> The closest thing to clever is the logging script, started by Mark M., ...

I was thinking more of the heavily distilled contents of top.

> The script surely isn't "lightweight", but in my case the crashes came before the script
> and haven't changed much since it arrived. Still, you make a good point and I should do
> a test occasionally to see if the script contributes to the crashes. I don't think the
> script has ever been killed by OOMA. 

I think we're probably chasing this the wrong way around.  Going OOM is to be
expected in some situations.  I think we're mostly saying that the
buildworld/buildkernel process shouldn't be one of those places, and for most
of us (at least the vocal ones) it presumably isn't.  Bob P has an interesting
situation that triggers it when it arguably shouldn't, which perhaps reveals a
real problem: OOMing when swap is unresponsive?  And then we need to decide
what counts as reasonably responsive (with possibly a "tweak this tunable
knob" note if you have some hardware that isn't tall enough to ride the
rollercoaster).

In my case, I didn't have my normal resources available, so I was basically
watching it swap a lot more than run.  If I were having issues because my swap
wasn't fast enough (assuming it's a swap-speed issue), seeing that would have
been helpful, and I should have left it as-is.

To that end, I've applied the patches that tell me more about what was going on
when things were going OOM, without necessarily trying to avoid it.  Once I
can get things to fail reliably, figuring out how to fix it reliably can start.


So for my part, I can't guarantee that some arbitrary process kicked off on my
box during build* didn't use up all the swap and kick off an OOM massacre.  The
solution there is to not do that (or to re-engineer it).

The build* process seems like a pretty constant load, but I bet you that if
you looked at it from the scheduler's or the swap system's point of view, it
isn't.
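
One cheap way to see that, rather than eyeballing top, is to sample swap and
paging activity alongside the build.  Something like this crude loop (just a
sketch of the idea, not Mark's script; adjust the interval and log path to
taste) would do:

    # sample swap usage and swap pager activity every 30 seconds
    while :; do
        date
        swapinfo -k                       # swap devices, used/free in KB
        vmstat -s | grep -i 'swap pager'  # cumulative swap pager in/out counts
        sleep 30
    done >> /var/log/swap-sample.log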


For you, CAM_IOSCHED_DYNAMIC seems to hurt.  That looks like it tweaks the
balance of read-vs-write traffic.  They were worrying about SSDs; I can only
imagine how much worse SD cards or USB2 devices must seem.  I guess if you're
cutting corners on price, you might fine-tune the suck that far down the line.
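
For anyone who wants to flip that option and compare, it's a kernel config
option rather than a loader tunable, so it means a kernel rebuild; roughly
(my sketch; double-check the sysctl names it grows on your own kernel):

    # add to (or remove from) your kernel config, then rebuild/reinstall
    options CAM_IOSCHED_DYNAMIC

    # once it's in, the per-device knobs show up under the CAM periph's
    # iosched subtree; list them rather than guessing names:
    sysctl kern.cam | grep -i iosched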

Tuning vm.pageout_oom_seq increases the number of back-to-back passes the
pagedaemon makes while waiting for usable pages before it declares OOM.  That
sounds like it lets us dig a deeper hole, which is fine as long as we can dig
ourselves out of it.  You might just be (un)lucky, which isn't something that
reproduces reliably.

(https://lists.freebsd.org/pipermail/svn-src-head/2015-November/078968.html)
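
For reference, that one is just a read/write sysctl (default 12, if I'm
reading that commit right), so it's easy to experiment with:

    # see the current value
    sysctl vm.pageout_oom_seq
    # let the pagedaemon make ten times as many passes before OOM-killing;
    # put the same line in /etc/sysctl.conf to keep it across reboots
    sysctl vm.pageout_oom_seq=120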


I'm not sure what Bob's ultimate problem is.  My gut feeling is a slow disk,
but I had the impression that he's tried similar hardware.  I've got a RPI3B+
in a 77-degree-F room, a Sandisk Extreme Plus (V30-rated) SD card with the
swap on it and a heatsink + pi-fan case-mod to keep my system cool.  That
would seem easy enough to reproduce.

Counterfeit hardware?  Bad sectors that cause unpredictable delays when
wear-leveling moves data over them?  Dodgy hardware doing the same?  Thermal
throttling that gets him closer to some invisible performance dropoff?
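
The thermal angle at least is cheap to rule out while a build is running,
assuming the cpufreq driver attaches and exposes these OIDs on his kernel
(a sketch, not something I've verified on his exact setup):

    # watch the current and available CPU frequencies; a clock stuck below
    # the top freq_levels entry while under load suggests throttling
    sysctl dev.cpu.0.freq dev.cpu.0.freq_levels
    # look for whatever temperature sensors the kernel exposes
    sysctl -a | grep -i temperature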

How do we divide and conquer this problem?  What can we do to split it in half
so we can figure out which of the two halves has the problem (and then rinse
and repeat)?


