Date:      Wed, 4 Apr 2018 12:39:24 +0200
From:      Alban Hertroys <haramrae@gmail.com>
To:        Peter <pmc@citylink.dinoex.sub.org>
Cc:        freebsd-stable@FreeBSD.ORG
Subject:   Re: kern.sched.quantum: Creepy, sadistic scheduler
Message-ID:  <9FDC510B-49D0-4722-B695-6CD38CA20D4A@gmail.com>
In-Reply-To: <pa17m7$82t$1@oper.dinoex.de>
References:  <pa17m7$82t$1@oper.dinoex.de>

> On 4 Apr 2018, at 2:52, Peter <pmc@citylink.dinoex.sub.org> wrote:
>
> Occasionally I noticed that the system would not quickly process the
> tasks I need done, but instead prefer other, long-running tasks. I
> figured it must be related to the scheduler, and decided it hates me.

If it hated you, it would behave much worse.

> A closer look shows the behaviour as follows (single CPU):

A single CPU? That's becoming rare! Is that a VM? Old hardware?
Something really specific?

> Let's run an I/O-active task, e.g. postgres VACUUM, that would

And you're running a multi-process database server on it, no less. That
is going to hurt, no matter how well the scheduler works.

> continuously read from big files (while doing compute as well [1]):
> >pool        alloc   free   read  write   read  write
> >cache           -      -      -      -      -      -
> >  ada1s4    7.08G  10.9G  1.58K      0  12.9M      0
>
> Now start an endless loop:
> # while true; do :; done
>
> And the effect is:
> >pool        alloc   free   read  write   read  write
> >cache           -      -      -      -      -      -
> >  ada1s4    7.08G  10.9G      9      0  76.8K      0
>
> The VACUUM gets almost stuck! This figures with WCPU in "top":
>
> >  PID USERNAME   PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
> >85583 root        99    0  7044K  1944K RUN      1:06  92.21% bash
> >53005 pgsql       52    0   620M 91856K RUN      5:47   0.50% postgres
>
> Hacking on kern.sched.quantum makes it quite a bit better:
> # sysctl kern.sched.quantum=1
> kern.sched.quantum: 94488 -> 7874
>
> >pool        alloc   free   read  write   read  write
> >cache           -      -      -      -      -      -
> >  ada1s4    7.08G  10.9G    395      0  3.12M      0
>
> >  PID USERNAME   PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
> >85583 root        94    0  7044K  1944K RUN      4:13  70.80% bash
> >53005 pgsql       52    0   276M 91856K RUN      5:52  11.83% postgres
>
> Now, as usual, the "root-cause" questions arise: What exactly does
> this "quantum" do? Is this solution a workaround, i.e. is something
> else actually wrong, and does it have tradeoffs in other situations?
> Or otherwise, why was such a default value chosen, which appears to
> be ill-conceived?
>
> The docs for the quantum parameter are a bit unsatisfying - they say
> it's the max number of ticks a process gets - and what happens when
> they're exhausted? If by default the endless loop is actually allowed
> to continue running for 94k ticks (or 94ms, more likely)
> uninterrupted, then that explains the perceived behaviour - but
> that's certainly not what a scheduler should do when other procs are
> ready to run.

I can answer this from the operating systems course I took recently.
This does not apply to FreeBSD specifically; it is general job
scheduling theory. I still need to read up on SCHED_ULE to see how the
details were implemented there. Or are you using the older SCHED_4BSD?
Anyway...

Jobs that are ready to run are collected on a ready queue. Since you
have a single CPU, there can only be a single job active on the CPU.
When that job is finished, the scheduler takes the next job in the ready
queue and assigns it to the CPU, etc.

Now, on its own, that would cause a much worse situation in your example
case. The endless loop would keep running once it gets the CPU and would
never release it. No other process would ever get a turn again. You
wouldn't even be able to get into a system in that state over ssh.

That is why the scheduler has this "quantum", which limits the maximum
time the CPU will be assigned to a specific job. Once the quantum has
expired (with the job unfinished), the scheduler removes the job from
the CPU, puts it back on the ready queue and assigns the next job from
that queue to the CPU.
That's why you seem to get better performance with a smaller value for
the quantum; the endless loop gets forcibly interrupted more often.
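
To put rough numbers on that, here is a toy round-robin model in Python.
It is only a sketch of the textbook scheme described above, not of
SCHED_ULE or SCHED_4BSD: there are no priorities, context switches are
free, and the 500 us compute burst and 100 us I/O wait of the
VACUUM-like job are made-up values chosen for illustration. The only
figures taken from the thread are the two quantum values, read here as
microseconds.

```python
def io_job_cpu_share(quantum_us, burst_us=500, io_wait_us=100,
                     total_us=10_000_000):
    """Fraction of CPU time an I/O-bound job gets when it shares a single
    CPU with an endless loop under plain round-robin scheduling.

    burst_us and io_wait_us are hypothetical workload parameters.
    """
    now = 0                 # simulated clock, in microseconds
    io_cpu = 0              # CPU time consumed by the I/O-bound job
    io_ready_at = 0         # time at which the I/O-bound job is runnable
    remaining = burst_us    # CPU still needed in its current burst
    while now < total_us:
        if io_ready_at <= now:
            # The I/O-bound job runs until its burst ends or its quantum
            # expires, whichever comes first.
            run = min(remaining, quantum_us)
            now += run
            io_cpu += run
            remaining -= run
            if remaining == 0:
                # Burst finished: block on I/O, then start a new burst.
                io_ready_at = now + io_wait_us
                remaining = burst_us
        # The endless loop is always runnable and uses its full quantum.
        now += quantum_us
    return io_cpu / now


if __name__ == "__main__":
    for quantum_us in (94488, 7874):
        share = io_job_cpu_share(quantum_us)
        print(f"quantum {quantum_us:>6} us: I/O job gets {share:.2%} of the CPU")
```

In this model the I/O-bound job has to wait out the loop's whole quantum
after every short burst, so its share comes out at roughly 0.5% with the
larger quantum and roughly 6% with the smaller one. Those are model
outputs under the assumptions above, not measurements of a real system.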

This changing of the active job, however, involves a context switch for
the CPU. Memory mappings, registers, etc. that were required by the
previous job need to be put aside and replaced by the corresponding
state of the new job to be run. That uses up time and does nothing to
progress the jobs that are waiting for the CPU. Hence, you don't want
the quantum to be too small either, or you'll end up spending
significant time switching contexts. That gets worse when the job
involves system calls, which are handled by the kernel and require
switching into and out of kernel mode (and Meltdown made that worse,
because the mitigations require more rigorous clean-up to prevent peeks
into sections of memory that belong to the kernel).
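
The trade-off can be sketched with a one-line formula: if every quantum
is followed by one context switch, the fraction of CPU time spent on
useful work is quantum / (quantum + switch cost). The Python snippet
below evaluates it for a few quantum values. The 5 us switch cost is a
hypothetical figure picked for illustration, not a measured FreeBSD
number, and the model ignores indirect costs such as cold caches.

```python
def useful_cpu_fraction(quantum_us, switch_cost_us):
    """Share of CPU time left for jobs when each quantum of quantum_us
    is followed by one context switch costing switch_cost_us."""
    return quantum_us / (quantum_us + switch_cost_us)


if __name__ == "__main__":
    SWITCH_COST_US = 5  # hypothetical cost of one context switch
    for quantum_us in (94488, 7874, 100, 10):
        fraction = useful_cpu_fraction(quantum_us, SWITCH_COST_US)
        print(f"quantum {quantum_us:>6} us: {fraction:.2%} useful work")
```

Under these assumptions the overhead is negligible for quanta in the
millisecond range and only becomes significant once the quantum
approaches the cost of the switch itself.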

The "correct" value for the quantum depends on your type of workload.
PostgreSQL's auto-vacuum is a typical background process that will
probably (I didn't verify) request to be run at a lower priority, giving
other, more important, jobs a better chance to get picked from the ready
queue (provided that the OS implements priority for the ready queue).
That is probably why your endless loop gets much more CPU time than the
VACUUM process. It may be that FreeBSD's default value for the quantum
is not suitable for your workload. Finding the one best suited to you is
not particularly easy though - perhaps FreeBSD allows access to average
job run times (below the quantum) from which a reasonable value can be
calculated.

That said, SCHED_ULE (the default scheduler for quite a while now) was
designed with multi-CPU configurations in mind and there are claims that
SCHED_4BSD works better for single-CPU configurations. You may give that
a try, if you're not already on SCHED_4BSD.

A much better option in your case would be to put the database on a
multi-core machine.

> [1]
> A pure-I/O job without compute load, like "dd", does not show
> this behaviour. Also, when other tasks are running, the unjust
> behaviour is not so strongly pronounced.

That is probably because dd has the decency to give the reins back to
the scheduler at regular intervals.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
