Date: Thu, 11 Jun 2015 14:24:30 +0300
From: Stefan Andritoiu <stefan.andritoiu@gmail.com>
To: Neel Natu <neelnatu@gmail.com>
Cc: "freebsd-virtualization@freebsd.org" <freebsd-virtualization@freebsd.org>
Subject: Re: Gang scheduling implementation in the ULE scheduler
Message-ID: <CAO3d8=aKAqoMGgdBFDNL_E=0M4n8=4DRPaWQpx8h_toy2fGpNA@mail.gmail.com>
In-Reply-To: <CAFgRE9EuaSCaMekyatumC56o-uza8ZoYM0zR-fjtcG8U5cSoew@mail.gmail.com>
References: <CAO3d8=ZyPjH0Yrntw9t=5v9sC8scSDP+mOEzPg2Crd_qZeBVVQ@mail.gmail.com>
 <CAFgRE9EuaSCaMekyatumC56o-uza8ZoYM0zR-fjtcG8U5cSoew@mail.gmail.com>
On Thu, Jun 11, 2015 at 3:02 AM, Neel Natu <neelnatu@gmail.com> wrote:
> Hi Stefan,
>
> On Wed, Jun 10, 2015 at 1:14 PM, Stefan Andritoiu
> <stefan.andritoiu@gmail.com> wrote:
>> Hello,
>>
>> I am currently working on a gang scheduling implementation for the
>> bhyve VCPU threads on FreeBSD 10.1.
>> I have added a new field "int gang" to the thread structure to
>> specify the gang a thread is part of (0 for no gang), and have
>> modified the bhyve code to initialize this field when a VCPU is
>> created. I will post these modifications in another message.
>>
>> When I start a virtual machine, IPIs are sent and received correctly
>> between CPUs during the guest's boot, but after a few seconds I get:
>>
>> spin lock 0xffffffff8164c290 (smp rendezvous) held by
>> 0xfffff8000296c000 (tid 100009) too long
>> panic: spin lock held too long
>>
>> If I limit the number of IPIs that are sent, I do not have this
>> problem. This leads me to believe that, because of the constant
>> context switching while the guest boots, the high number of IPIs
>> sent starves the system.
>>
>> Does anyone know what is happening? And maybe know of a possible
>> solution?
>>
>
> In your patch 'smp_rendezvous()' is being called with the TDQ locked.
>
> There are a few code paths in ULE where it will want to lock two TDQs
> at the same time (see tdq_lock_pair()). This has the potential to
> cause a deadlock if the 2nd TDQ in tdq_lock_pair() is the one that was
> locked before calling 'smp_rendezvous()'.
>
> To verify this theory can you set the following sysctls and repeat the test?
> $ sysctl kern.sched.steal_idle=0
> $ sysctl kern.sched.rebalance=0
>
> best
> Neel

Hi Neel,

I do not seem to have a kern.sched.rebalance variable; I do have
kern.sched.balance. I have tested with both

sysctl kern.sched.steal_idle=0
sysctl kern.sched.balance=0

and

sysctl kern.sched.steal_idle=0
sysctl kern.sched.balance=1

Unfortunately, in both cases the result is the same as before:
panic: spin lock held too long

best
Stefan

>> Thank you,
>> Stefan
>>
>>
>> ======================================================================================
>> I have added here the modifications to the sched_ule.c file and a
>> brief explanation of them:
>>
>> In struct tdq, I have added two new fields:
>> - int scheduled_gang;
>>   /* Set to a non-zero value if the respective CPU is required to
>>   schedule a thread belonging to a gang, the value of scheduled_gang
>>   also being the ID of the gang we want scheduled. For now I have
>>   considered only one running guest, so the value is 0 or 1. */
>> - int gang_leader;
>>   /* Set if the respective CPU is the one that initiated gang
>>   scheduling, zero otherwise. Not relevant to the final code and
>>   will be removed; it is just for debugging purposes. */
>>
>> I created a new function "static void schedule_gang(void *arg)" that
>> will be called by each processor when it receives an IPI from the
>> gang leader:
>> - it sets scheduled_gang = 1;
>> - it informs the system that it needs to reschedule (not yet
>> implemented).
>>
>> In "struct thread *tdq_choose(struct tdq *tdq)":
>> if (tdq->scheduled_gang) - checks whether a thread belonging to a
>> gang must be scheduled. If so, it calls functions that check the
>> runqs and return a gang thread. I have yet to implement these
>> functions (see the sketch below).
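A rough, untested sketch of what I have in mind for runq_choose_gang();
the linear scan and the extra "gang" parameter are my own guesses for
now, and the function would live next to runq_choose() in
sys/kern/kern_switch.c, where the runq internals it touches are used.
"td->gang" is the thread field from my earlier message:

static struct thread *
runq_choose_gang(struct runq *rq, int gang)
{
	struct rqhead *rqh;
	struct thread *td;
	int pri;

	/*
	 * Scan from the highest priority (lowest index) queue down.
	 * runq_findbit() could skip the empty queues, as runq_choose()
	 * does, but a plain loop keeps the sketch obvious.
	 */
	for (pri = 0; pri < RQ_NQS; pri++) {
		rqh = &rq->rq_queues[pri];
		TAILQ_FOREACH(td, rqh, td_runq) {
			if (td->gang == gang)
				return (td);	/* first matching gang thread wins */
		}
	}
	return (NULL);	/* caller falls back to the normal runq_choose() */
}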
>> In "sched_choose()":
>> if (td->gang) - checks whether the chosen thread is part of a gang.
>> If so, it signals all other CPUs to run the function
>> "schedule_gang(void *arg)".
>> if (tdq->scheduled_gang) - if scheduled_gang is set, the scheduler
>> is being called after the code in schedule_gang() has run, and it
>> skips sending IPIs to the other CPUs. Without this check, a CPU
>> would receive an IPI, set scheduled_gang = 1, call the scheduler,
>> choose a thread to run, find that thread to be part of a gang, and
>> send an IPI to all other CPUs: a constant back-and-forth of IPIs
>> between the CPUs would be created.
>>
>> The CPU that initiates gang scheduling does not receive an IPI and
>> does not even call schedule_gang(). It simply continues by
>> scheduling the gang thread it selected, the one that started the
>> gang scheduling process.
>>
>>
>> ===================================================================
>> --- sched_ule.c	(revision 24)
>> +++ sched_ule.c	(revision 26)
>> @@ -247,6 +247,9 @@
>>  	struct runq	tdq_timeshare;	/* timeshare run queue. */
>>  	struct runq	tdq_idle;	/* Queue of IDLE threads. */
>>  	char		tdq_name[TDQ_NAME_LEN];
>> +
>> +	int		gang_leader;
>> +	int		scheduled_gang;
>>  #ifdef KTR
>>  	char		tdq_loadname[TDQ_LOADNAME_LEN];
>>  #endif
>> @@ -1308,6 +1311,20 @@
>>  	struct thread *td;
>>
>>  	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>> +
>> +	/* Pick a gang thread to run. */
>> +	if (tdq->scheduled_gang) {
>> +		/* Basically the normal choosing of threads, restricted to the gang:
>> +		td = runq_choose_gang(&tdq->tdq_realtime, tdq->scheduled_gang);
>> +		if (td != NULL)
>> +			return (td);
>> +
>> +		td = runq_choose_from_gang(&tdq->tdq_timeshare, tdq->tdq_ridx, tdq->scheduled_gang);
>> +		if (td != NULL)
>> +			return (td);
>> +		*/
>> +	}
>> +
>>  	td = runq_choose(&tdq->tdq_realtime);
>>  	if (td != NULL)
>>  		return (td);
>> @@ -2295,6 +2312,22 @@
>>  	return (load);
>>  }
>>
>> +static void
>> +schedule_gang(void *arg)
>> +{
>> +	struct tdq *tdq, *from_tdq = arg;
>> +
>> +	tdq = TDQ_SELF();
>> +	if (tdq == from_tdq) {
>> +		/* Just for testing IPIs; this branch is not (and must not be) reached. */
>> +		tdq->scheduled_gang = 1;
>> +		/* printf("[schedule_gang] received IPI from itself\n"); */
>> +	}
>> +	else {
>> +		tdq->scheduled_gang = 1;
>> +		/* printf("[schedule_gang] received on cpu: %s\n", tdq->tdq_name); */
>> +	}
>> +}
>>  /*
>>   * Choose the highest priority thread to run. The thread is removed from
>>   * the run-queue while running however the load remains. For SMP we set
>> @@ -2305,11 +2338,26 @@
>>  {
>>  	struct thread *td;
>>  	struct tdq *tdq;
>> +	cpuset_t map;
>>
>>  	tdq = TDQ_SELF();
>>  	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>>  	td = tdq_choose(tdq);
>>  	if (td) {
>> +		if (tdq->scheduled_gang) {
>> +			/* Scheduler called after an IPI;
>> +			   skip the rendezvous. */
>> +			tdq->scheduled_gang = 0;
>> +		}
>> +		else {
>> +			if (td->gang) {
>> +				map = all_cpus;
>> +				CPU_CLR(curcpu, &map);
>> +
>> +				smp_rendezvous_cpus(map, NULL, schedule_gang, NULL, tdq);
>> +			}
>> +		}
>> +
>>  		tdq_runq_rem(tdq, td);
>>  		tdq->tdq_lowpri = td->td_priority;
>>  		return (td);
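P.S. If the rendezvous-under-TDQ-lock theory is right, the underlying
hazard is that sched_choose() spins in smp_rendezvous_cpus() while
holding its own TDQ lock. One alternative I am considering (untested,
and only a sketch: the unlocked stores to the remote tdqs stand in for
proper synchronization) is to mark the remote tdqs directly and post a
non-blocking IPI_PREEMPT with ipi_cpu(), the way tdq_notify() does, so
the sender never waits for the other CPUs:

	if (td->gang && !tdq->scheduled_gang) {
		int cpu;

		CPU_FOREACH(cpu) {
			if (cpu == curcpu)
				continue;
			/* XXX needs the remote TDQ lock or an atomic. */
			TDQ_CPU(cpu)->scheduled_gang = td->gang;
			/*
			 * Just post the IPI and move on; the remote CPU
			 * reschedules and its tdq_choose() then sees
			 * scheduled_gang.
			 */
			ipi_cpu(cpu, IPI_PREEMPT);
		}
	}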