From: Stefan Andritoiu <stefan.andritoiu@gmail.com>
To: freebsd-virtualization@freebsd.org
Date: Wed, 10 Jun 2015 23:14:47 +0300
Subject: Gang scheduling implementation in the ULE scheduler
Hello,

I am currently working on a gang scheduling implementation for the
bhyve vCPU threads on FreeBSD 10.1. I have added a new field
"int gang" to the thread structure to specify the gang a thread is
part of (0 for no gang), and have modified the bhyve code to
initialize this field when a vCPU is created. I will post these
modifications in another message.

When I start a virtual machine, IPIs are sent and received correctly
between CPUs during the guest's boot, but after a few seconds I get:

  spin lock 0xffffffff8164c290 (smp rendezvous) held by 0xfffff8000296c000 (tid 100009) too long
  panic: spin lock held too long

If I limit the number of IPIs that are sent, I do not have this
problem, which leads me to believe that (because of the constant
context switching while the guest boots) the high number of IPIs
starves the system.

Does anyone know what is happening, and perhaps a possible solution?

Thank you,
Stefan

======================================================================================

Here are the modifications to the sched_ule.c file, with a brief
explanation.

In struct tdq, I have added two new fields:

- int scheduled_gang; /* Set to a non-zero value when this CPU is
  required to schedule a thread belonging to a gang, the value being
  the ID of the gang we want scheduled. For now I have considered only
  one running guest, so the value is 0 or 1. */

- int gang_leader; /* Set if this CPU is the one that initiated gang
  scheduling, zero otherwise. Not relevant to the final code and will
  be removed; it is just for debugging purposes. */

I created a new function "static void schedule_gang(void *arg)" that
will be called by each processor when it receives an IPI from the gang
leader. It:
- sets scheduled_gang = 1;
- informs the system that it needs to reschedule (not yet implemented).

In "struct thread *tdq_choose(struct tdq *tdq)":
- if (tdq->scheduled_gang) - checks whether a thread belonging to a
  gang must be scheduled. If so, it calls functions that search the
  runqs and return a gang thread. I have yet to implement these
  functions.

In sched_choose():
- if (td->gang) - checks whether the chosen thread is part of a gang.
  If so, it signals all other CPUs to run schedule_gang().
- if (tdq->scheduled_gang) - if scheduled_gang is set, the scheduler
  is being called after the code in schedule_gang() has run, and it
  bypasses sending IPIs to the other CPUs. Without this check, a CPU
  would receive an IPI, set scheduled_gang = 1, call the scheduler,
  choose a thread that is part of a gang, and send an IPI to all other
  CPUs, creating a constant back-and-forth of IPIs between the CPUs.

The CPU that initiates gang scheduling does not receive an IPI and
never calls schedule_gang(); it simply continues with the gang thread
it selected, the one that started the gang scheduling round.

===================================================================

--- sched_ule.c	(revision 24)
+++ sched_ule.c	(revision 26)
@@ -247,6 +247,9 @@
 	struct runq	tdq_timeshare;		/* timeshare run queue. */
 	struct runq	tdq_idle;		/* Queue of IDLE threads. */
 	char		tdq_name[TDQ_NAME_LEN];
+
+	int		gang_leader;
+	int		scheduled_gang;
 #ifdef KTR
 	char		tdq_loadname[TDQ_LOADNAME_LEN];
 #endif
@@ -1308,6 +1311,20 @@
 	struct thread *td;

 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+
+	/* Pick gang thread to run. */
+	if (tdq->scheduled_gang) {
+		/* Basically the normal choosing of threads, but with regard to scheduled_gang:
+		td = runq_choose_gang(&tdq->tdq_realtime);
+		if (td != NULL)
+			return (td);
+
+		td = runq_choose_from_gang(&tdq->tdq_timeshare, tdq->tdq_ridx);
+		if (td != NULL)
+			return (td);
+		*/
+	}
+
 	td = runq_choose(&tdq->tdq_realtime);
 	if (td != NULL)
 		return (td);
@@ -2295,6 +2312,22 @@
 	return (load);
 }

+static void
+schedule_gang(void *arg)
+{
+	struct tdq *tdq;
+	struct tdq *from_tdq = arg;
+	tdq = TDQ_SELF();
+
+	if (tdq == from_tdq) {
+		/* Just for testing the IPI; this path is never reached, and should never be. */
+		tdq->scheduled_gang = 1;
+//		printf("[schedule_gang] received IPI from itself\n");
+	} else {
+		tdq->scheduled_gang = 1;
+//		printf("[schedule_gang] received on cpu: %s\n", tdq->tdq_name);
+	}
+}
 /*
  * Choose the highest priority thread to run. The thread is removed from
  * the run-queue while running however the load remains. For SMP we set
@@ -2305,11 +2338,26 @@
 {
 	struct thread *td;
 	struct tdq *tdq;
+	cpuset_t map;

 	tdq = TDQ_SELF();
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	td = tdq_choose(tdq);
 	if (td) {
+		if (tdq->scheduled_gang) {
+			/* Scheduler called after the IPI;
+			   jump over the rendezvous. */
+			tdq->scheduled_gang = 0;
+		} else {
+			if (td->gang) {
+				map = all_cpus;
+				CPU_CLR(curcpu, &map);
+
+				smp_rendezvous_cpus(map, NULL, schedule_gang, NULL, tdq);
+			}
+		}
+
 		tdq_runq_rem(tdq, td);
 		tdq->tdq_lowpri = td->td_priority;
 		return (td);
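To make the IPI-gating idea above easier to discuss, here is a minimal
userspace sketch (not kernel code) of the sched_choose() decision. The
struct and function names are hypothetical stand-ins for the patched
kernel structures; the point is only to show why consuming
scheduled_gang instead of re-signalling stops the IPI ping-pong:

```c
#include <assert.h>

/* Hypothetical userspace models of the patched per-CPU and per-thread
 * state; these are illustrative only, not the kernel definitions. */
struct model_tdq {
	int scheduled_gang;	/* non-zero: an IPI asked this CPU to run gang N */
};

struct model_thread {
	int gang;		/* 0 = ordinary thread, else gang ID */
};

/*
 * Model one pass of the patched sched_choose() logic.  Returns 1 if
 * this CPU would broadcast the schedule_gang rendezvous IPI, 0 if it
 * would just run the chosen thread.  A CPU that was *told* to schedule
 * a gang thread consumes the flag instead of signalling everyone else,
 * which is what breaks the back-and-forth of IPIs.
 */
static int
model_sched_choose(struct model_tdq *tdq, struct model_thread *td)
{
	if (tdq->scheduled_gang) {
		/* Scheduler ran after the IPI: clear the flag, no new IPIs. */
		tdq->scheduled_gang = 0;
		return (0);
	}
	if (td->gang)
		return (1);	/* leader path: IPI the other CPUs */
	return (0);
}
```

Running the model: the initiating CPU (flag clear, gang thread chosen)
broadcasts once; a CPU that received the IPI picks a gang thread
without re-broadcasting, and only a later, independent pick of a gang
thread starts a new round.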
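The rendezvous map in the patch is built by copying all_cpus and
clearing the current CPU, so the initiating CPU never IPIs itself. A
tiny hedged model of that map construction, using a plain 64-bit mask
in place of cpuset_t (the function name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace analogue of "map = all_cpus; CPU_CLR(curcpu, &map);":
 * start from the mask of all CPUs and drop the initiating CPU, which
 * keeps running the gang thread it already selected. */
static uint64_t
gang_ipi_map(uint64_t all_cpus_mask, int curcpu)
{
	return (all_cpus_mask & ~(UINT64_C(1) << curcpu));
}
```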