From: Neel Natu <neelnatu@gmail.com>
Date: Wed, 10 Jun 2015 17:02:22 -0700
Subject: Re: Gang scheduling implementation in the ULE scheduler
To: Stefan Andritoiu
Cc: freebsd-virtualization@freebsd.org
Hi Stefan,

On Wed, Jun 10, 2015 at 1:14 PM, Stefan Andritoiu wrote:
> Hello,
>
> I am currently working on a gang scheduling implementation for the
> bhyve VCPU threads on FreeBSD 10.1.
> I have added a new field "int gang" to the thread structure to specify
> the gang it is part of (0 for no gang), and have modified the bhyve
> code to initialize this field when a VCPU is created. I will post
> these modifications in another message.
>
> When I start a virtual machine, IPIs are sent and received correctly
> between CPUs during the guest's boot, but after a few seconds I get:
>
>     spin lock 0xffffffff8164c290 (smp rendezvous) held by
>     0xfffff8000296c000 (tid 100009) too long
>     panic: spin lock held too long
>
> If I limit the number of IPIs that are sent, I do not have this
> problem, which leads me to believe that (because of the constant
> context switching while the guest boots) the high number of IPIs
> sent starves the system.
>
> Does anyone know what is happening, and perhaps a possible solution?

In your patch, 'smp_rendezvous()' is being called with the TDQ locked.
There are a few code paths in ULE where it will want to lock two TDQs
at the same time (see tdq_lock_pair()). This has the potential to cause
a deadlock if the second TDQ in tdq_lock_pair() is the one that was
locked before calling 'smp_rendezvous()'.

To verify this theory, can you set the following sysctls and repeat the
test?
$ sysctl kern.sched.steal_idle=0
$ sysctl kern.sched.rebalance=0

best
Neel

> Thank you,
> Stefan
>
>
> ======================================================================
> I have added here the modifications to the sched_ule.c file and a
> brief explanation of them:
>
> In struct tdq, I have added two new fields:
> - int scheduled_gang;
>   /* Set to a non-zero value if the respective CPU is required to
>   schedule a thread belonging to a gang, the value of scheduled_gang
>   also being the ID of the gang that we want scheduled. For now I
>   have considered only one running guest, so the value is 0 or 1. */
> - int gang_leader;
>   /* Set if the respective CPU is the one that initialized gang
>   scheduling, zero otherwise. Not relevant to the final code and will
>   be removed; just for debugging purposes. */
>
> I created a new function "static void schedule_gang(void *arg)" that
> will be called by each processor when it receives an IPI from the
> gang leader. It:
> - sets scheduled_gang = 1;
> - informs the system that it needs to reschedule (not yet
>   implemented).
>
> In function "struct thread *tdq_choose(struct tdq *tdq)":
>     if (tdq->scheduled_gang) - checks whether a thread belonging to
> a gang must be scheduled. If so, it calls functions that check the
> runqs and return a gang thread. I have yet to implement these
> functions.
>
> In function "sched_choose()":
>     if (td->gang) - checks if the chosen thread is part of a gang.
> If so, it signals all other CPUs to run the function
> "schedule_gang(void *gang)".
>     if (tdq->scheduled_gang) - if scheduled_gang is set, it means
> the scheduler was called after the code in schedule_gang() has run,
> and it bypasses sending IPIs to the other CPUs. If not for this
> check, a CPU would receive an IPI, set scheduled_gang = 1, the
> scheduler would be called and would choose a thread to run, that
> thread would be part of a gang, and an IPI would be sent to all
> other CPUs.
> A constant back-and-forth of IPIs between the CPUs would be created.
>
> The CPU that initiates gang scheduling does not receive an IPI and
> does not even call the "schedule_gang(void *gang)" function. It
> continues scheduling the gang thread it selected, the one that
> started the gang scheduling process.
>
>
> ===================================================================
> --- sched_ule.c (revision 24)
> +++ sched_ule.c (revision 26)
> @@ -247,6 +247,9 @@
>  	struct runq	tdq_timeshare;	/* timeshare run queue. */
>  	struct runq	tdq_idle;	/* Queue of IDLE threads. */
>  	char		tdq_name[TDQ_NAME_LEN];
> +
> +	int gang_leader;
> +	int scheduled_gang;
>  #ifdef KTR
>  	char		tdq_loadname[TDQ_LOADNAME_LEN];
>  #endif
> @@ -1308,6 +1311,20 @@
>  	struct thread *td;
>
>  	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
> +
> +	/* Pick gang thread to run */
> +	if (tdq->scheduled_gang) {
> +		/* basically the normal choosing of threads, but with
> +		   regard to scheduled_gang:
> +		td = runq_choose_gang(&tdq->tdq_realtime);
> +		if (td != NULL)
> +			return (td);
> +
> +		td = runq_choose_from_gang(&tdq->tdq_timeshare, tdq->tdq_ridx);
> +		if (td != NULL)
> +			return (td);
> +		*/
> +	}
> +
>  	td = runq_choose(&tdq->tdq_realtime);
>  	if (td != NULL)
>  		return (td);
> @@ -2295,6 +2312,22 @@
>  	return (load);
>  }
>
> +static void
> +schedule_gang(void *arg)
> +{
> +	struct tdq *tdq;
> +	struct tdq *from_tdq = arg;
> +
> +	tdq = TDQ_SELF();
> +
> +	if (tdq == from_tdq) {
> +		/* Just for testing the IPI. Code is never reached,
> +		   and should never be. */
> +		tdq->scheduled_gang = 1;
> +//		printf("[schedule_gang] received IPI from himself\n");
> +	} else {
> +		tdq->scheduled_gang = 1;
> +//		printf("[schedule_gang] received on cpu: %s\n", tdq->tdq_name);
> +	}
> +}
>  /*
>   * Choose the highest priority thread to run. The thread is removed from
>   * the run-queue while running however the load remains. For SMP we set
> @@ -2305,11 +2338,26 @@
>  {
>  	struct thread *td;
>  	struct tdq *tdq;
> +	cpuset_t map;
>
>  	tdq = TDQ_SELF();
>  	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>  	td = tdq_choose(tdq);
>  	if (td) {
> +		if (tdq->scheduled_gang) {
> +			/* Scheduler called after IPI;
> +			   jump over the rendezvous. */
> +			tdq->scheduled_gang = 0;
> +		} else {
> +			if (td->gang) {
> +				map = all_cpus;
> +				CPU_CLR(curcpu, &map);
> +
> +				smp_rendezvous_cpus(map, NULL,
> +				    schedule_gang, NULL, tdq);
> +			}
> +		}
> +
>  		tdq_runq_rem(tdq, td);
>  		tdq->tdq_lowpri = td->td_priority;
>  		return (td);
> _______________________________________________
> freebsd-virtualization@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-virtualization