From: Stefan Andritoiu <stefan.andritoiu@gmail.com>
To: freebsd-virtualization@freebsd.org
Date: Wed, 10 Jun 2015 23:14:47 +0300
Subject: Gang scheduling implementation in the ULE scheduler
Hello,

I am currently working on a gang scheduling implementation for the
bhyve vCPU threads on FreeBSD 10.1. I have added a new field
"int gang" to the thread structure to specify the gang a thread is
part of (0 for no gang), and have modified the bhyve code to
initialize this field when a vCPU is created. I will post these
modifications in another message.

When I start a virtual machine, IPIs are sent and received correctly
between CPUs during the guest's boot, but after a few seconds I get:

  spin lock 0xffffffff8164c290 (smp rendezvous) held by 0xfffff8000296c000 (tid 100009) too long
  panic: spin lock held too long

If I limit the number of IPIs that are sent, I do not have this
problem, which leads me to believe that (because of the constant
context switching while the guest boots) the high number of IPIs
starves the system.

Does anyone know what is happening, and perhaps a possible solution?

Thank you,
Stefan

======================================================================================

Here are the modifications to the sched_ule.c file, with a brief
explanation.

In struct tdq, I have added two new fields:

- int scheduled_gang; /* Set to a non-zero value when this CPU is
  required to schedule a thread belonging to a gang, the value being
  the ID of the gang we want scheduled. For now I have considered only
  one running guest, so the value is 0 or 1. */

- int gang_leader; /* Set if this CPU is the one that initiated gang
  scheduling, zero otherwise. Not relevant to the final code and will
  be removed; it is just for debugging purposes. */

I created a new function "static void schedule_gang(void *arg)" that
will be called by each processor when it receives an IPI from the gang
leader. It:
- sets scheduled_gang = 1;
- informs the system that it needs to reschedule (not yet implemented).

In "struct thread *tdq_choose(struct tdq *tdq)":
- if (tdq->scheduled_gang) - checks whether a thread belonging to a
  gang must be scheduled. If so, it calls functions that search the
  runqs and return a gang thread. I have yet to implement these
  functions.

In sched_choose():
- if (td->gang) - checks whether the chosen thread is part of a gang.
  If so, it signals all other CPUs to run schedule_gang().
- if (tdq->scheduled_gang) - if scheduled_gang is set, the scheduler
  is being called after the code in schedule_gang() has run, and it
  bypasses sending IPIs to the other CPUs. Without this check, a CPU
  would receive an IPI, set scheduled_gang = 1, call the scheduler,
  choose a thread that is part of a gang, and send an IPI to all other
  CPUs, creating a constant back-and-forth of IPIs between the CPUs.

The CPU that initiates gang scheduling does not receive an IPI and
never calls schedule_gang(); it simply continues with the gang thread
it selected, the one that started the gang scheduling round.

===================================================================

--- sched_ule.c	(revision 24)
+++ sched_ule.c	(revision 26)
@@ -247,6 +247,9 @@
 	struct runq	tdq_timeshare;		/* timeshare run queue. */
 	struct runq	tdq_idle;		/* Queue of IDLE threads. */
 	char		tdq_name[TDQ_NAME_LEN];
+
+	int		gang_leader;
+	int		scheduled_gang;
 #ifdef KTR
 	char		tdq_loadname[TDQ_LOADNAME_LEN];
 #endif
@@ -1308,6 +1311,20 @@
 	struct thread *td;

 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+
+	/* Pick gang thread to run. */
+	if (tdq->scheduled_gang) {
+		/* Basically the normal choosing of threads, but with regard to scheduled_gang:
+		td = runq_choose_gang(&tdq->tdq_realtime);
+		if (td != NULL)
+			return (td);
+
+		td = runq_choose_from_gang(&tdq->tdq_timeshare, tdq->tdq_ridx);
+		if (td != NULL)
+			return (td);
+		*/
+	}
+
 	td = runq_choose(&tdq->tdq_realtime);
 	if (td != NULL)
 		return (td);
@@ -2295,6 +2312,22 @@
 	return (load);
 }

+static void
+schedule_gang(void *arg)
+{
+	struct tdq *tdq;
+	struct tdq *from_tdq = arg;
+	tdq = TDQ_SELF();
+
+	if (tdq == from_tdq) {
+		/* Just for testing the IPI; this path is never reached, and should never be. */
+		tdq->scheduled_gang = 1;
+//		printf("[schedule_gang] received IPI from itself\n");
+	} else {
+		tdq->scheduled_gang = 1;
+//		printf("[schedule_gang] received on cpu: %s\n", tdq->tdq_name);
+	}
+}
 /*
  * Choose the highest priority thread to run. The thread is removed from
  * the run-queue while running however the load remains. For SMP we set
@@ -2305,11 +2338,26 @@
 {
 	struct thread *td;
 	struct tdq *tdq;
+	cpuset_t map;

 	tdq = TDQ_SELF();
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	td = tdq_choose(tdq);
 	if (td) {
+		if (tdq->scheduled_gang) {
+			/* Scheduler called after the IPI;
+			   jump over the rendezvous. */
+			tdq->scheduled_gang = 0;
+		} else {
+			if (td->gang) {
+				map = all_cpus;
+				CPU_CLR(curcpu, &map);
+
+				smp_rendezvous_cpus(map, NULL, schedule_gang, NULL, tdq);
+			}
+		}
+
 		tdq_runq_rem(tdq, td);
 		tdq->tdq_lowpri = td->td_priority;
 		return (td);
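To make the IPI-gating idea above easier to discuss, here is a minimal
userspace sketch (not kernel code) of the sched_choose() decision. The
struct and function names are hypothetical stand-ins for the patched
kernel structures; the point is only to show why consuming
scheduled_gang instead of re-signalling stops the IPI ping-pong:

```c
#include <assert.h>

/* Hypothetical userspace models of the patched per-CPU and per-thread
 * state; these are illustrative only, not the kernel definitions. */
struct model_tdq {
	int scheduled_gang;	/* non-zero: an IPI asked this CPU to run gang N */
};

struct model_thread {
	int gang;		/* 0 = ordinary thread, else gang ID */
};

/*
 * Model one pass of the patched sched_choose() logic.  Returns 1 if
 * this CPU would broadcast the schedule_gang rendezvous IPI, 0 if it
 * would just run the chosen thread.  A CPU that was *told* to schedule
 * a gang thread consumes the flag instead of signalling everyone else,
 * which is what breaks the back-and-forth of IPIs.
 */
static int
model_sched_choose(struct model_tdq *tdq, struct model_thread *td)
{
	if (tdq->scheduled_gang) {
		/* Scheduler ran after the IPI: clear the flag, no new IPIs. */
		tdq->scheduled_gang = 0;
		return (0);
	}
	if (td->gang)
		return (1);	/* leader path: IPI the other CPUs */
	return (0);
}
```

Running the model: the initiating CPU (flag clear, gang thread chosen)
broadcasts once; a CPU that received the IPI picks a gang thread
without re-broadcasting, and only a later, independent pick of a gang
thread starts a new round.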
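The rendezvous map in the patch is built by copying all_cpus and
clearing the current CPU, so the initiating CPU never IPIs itself. A
tiny hedged model of that map construction, using a plain 64-bit mask
in place of cpuset_t (the function name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace analogue of "map = all_cpus; CPU_CLR(curcpu, &map);":
 * start from the mask of all CPUs and drop the initiating CPU, which
 * keeps running the gang thread it already selected. */
static uint64_t
gang_ipi_map(uint64_t all_cpus_mask, int curcpu)
{
	return (all_cpus_mask & ~(UINT64_C(1) << curcpu));
}
```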