From: Stefan Andritoiu
To: Neel Natu
Cc: "freebsd-virtualization@freebsd.org"
Date: Thu, 11 Jun 2015 14:24:30 +0300
Subject: Re: Gang scheduling implementation in the ULE scheduler

On Thu, Jun 11, 2015 at 3:02 AM, Neel Natu wrote:
> Hi Stefan,
>
> On Wed, Jun 10, 2015 at 1:14 PM, Stefan Andritoiu wrote:
>> Hello,
>>
>> I am currently working on a gang scheduling implementation for the
>> bhyve VCPU-threads on FreeBSD 10.1.
>> I have added a new field "int gang" to the thread structure to specify
>> the gang it is part of (0 for no gang), and have modified the bhyve
>> code to initialize this field when a VCPU is created. I will post
>> these modifications in another message.
>>
>> When I start a Virtual Machine, during the guest's boot, IPIs are sent
>> and received correctly between CPUs, but after a few seconds I get:
>> spin lock 0xffffffff8164c290 (smp rendezvous) held by
>> 0xfffff8000296c000 (tid 100009) too long
>> panic: spin lock held too long
>>
>> If I limit the number of IPIs that are sent, I do not have this
>> problem, which leads me to believe that (because of the constant
>> context-switch when the guest boots) the high number of IPIs sent
>> starves the system.
>>
>> Does anyone know what is happening? And maybe know of a possible solution?
>>
>
> In your patch 'smp_rendezvous()' is being called with the TDQ locked.
>
> There are a few code paths in ULE where it will want to lock two TDQs
> at the same time (see tdq_lock_pair()). This has the potential to
> cause a deadlock if the 2nd TDQ in tdq_lock_pair() is the one that was
> locked before calling 'smp_rendezvous()'.
>
> To verify this theory can you set the following sysctls and repeat the test?
> $ sysctl kern.sched.steal_idle=0
> $ sysctl kern.sched.rebalance=0
>
> best
> Neel
>

Hi Neel

I do not seem to have a kern.sched.rebalance variable.
I do have kern.sched.balance.

I have tested with both
sysctl kern.sched.steal_idle=0
sysctl kern.sched.balance=0
and
sysctl kern.sched.steal_idle=0
sysctl kern.sched.balance=1

Unfortunately, in both cases the result is the same as before:
panic: spin lock held too long

best
Stefan

>> Thank you,
>> Stefan
>>
>>
>> ======================================================================================
>> I have added here the modifications to the sched_ule.c file and a
>> brief explanation of it:
>>
>> In struct tdq, I have added two new fields:
>> - int scheduled_gang;
>> /* Set to a non-zero value if the respective CPU is required to
>> schedule a thread belonging to a gang. The value of scheduled_gang
>> also being the ID of the gang that we want scheduled. For now I have
>> considered only one running guest, so the value is 0 or 1 */
>> - int gang_leader;
>> /* Set if the respective CPU is the one who has initialized gang
>> scheduling. Zero otherwise. Not relevant to the final code and will be
>> removed. Just for debugging purposes. */
>>
>> Created a new function "static void schedule_gang(void * arg)" that
>> will be called by each processor when it receives an IPI from the gang
>> leader:
>> - sets scheduled_gang = 1
>> - informs the system that it needs to reschedule. Not yet implemented
>>
>> In function "struct thread* tdq_choose (struct tdq * tdq)":
>> if (tdq->scheduled_gang) - checks to see if a thread belonging to
>> a gang must be scheduled. If so, calls functions that check the runqs
>> and return a gang thread. I have yet to implement these functions.
>>
>> In function "sched_choose()":
>> if (td->gang) - checks if the chosen thread is part of a gang. If
>> so it signals all other CPUs to run function "schedule_gang(void *
>> gang)".
>> if (tdq->scheduled_gang) - if scheduled_gang is set it means that
>> the scheduler is called after the code in schedule_gang() has run,
>> and bypasses sending IPIs to the other CPUs.
>> If not for this check, a CPU would receive an IPI; set
>> scheduled_gang=1; the scheduler would be called and would choose a
>> thread to run; that thread would be part of a gang; an IPI would be
>> sent to all other CPUs. A constant back-and-forth of IPIs between the
>> CPUs would be created.
>>
>> The CPU that initializes gang scheduling does not receive an IPI, and
>> does not even call the "schedule_gang(void * gang)" function. It
>> continues in scheduling the gang-thread it selected, the one that
>> started the gang scheduling process.
>>
>>
>> ===================================================================
>> --- sched_ule.c (revision 24)
>> +++ sched_ule.c (revision 26)
>> @@ -247,6 +247,9 @@
>>  	struct runq tdq_timeshare; /* timeshare run queue. */
>>  	struct runq tdq_idle; /* Queue of IDLE threads. */
>>  	char tdq_name[TDQ_NAME_LEN];
>> +
>> +	int gang_leader;
>> +	int scheduled_gang;
>>  #ifdef KTR
>>  	char tdq_loadname[TDQ_LOADNAME_LEN];
>>  #endif
>> @@ -1308,6 +1311,20 @@
>>  	struct thread *td;
>>
>>  	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>> +
>> +	/* Pick gang thread to run */
>> +	if (tdq->scheduled_gang){
>> +		/* basically the normal choosing of threads but with regards to scheduled_gang
>> +		tdq = runq_choose_gang(&tdq->realtime);
>> +		if (td != NULL)
>> +			return (td);
>> +
>> +		td = runq_choose_from_gang(&tdq->tdq_timeshare, tdq->tdq_ridx);
>> +		if (td != NULL)
>> +			return (td);
>> +		*/
>> +	}
>> +
>>  	td = runq_choose(&tdq->tdq_realtime);
>>  	if (td != NULL)
>>  		return (td);
>> @@ -2295,6 +2312,22 @@
>>  	return (load);
>>  }
>>
>> +static void
>> +schedule_gang(void * arg){
>> +	struct tdq *tdq;
>> +	struct tdq *from_tdq = arg;
>> +	tdq = TDQ_SELF();
>> +
>> +	if(tdq == from_tdq){
>> +		/* Just for testing IPI.
>> +		   Code is never reached, and should never be */
>> +		tdq->scheduled_gang = 1;
>> +//		printf("[schedule_gang] received IPI from himself\n");
>> +	}
>> +	else{
>> +		tdq->scheduled_gang = 1;
>> +//		printf("[schedule_gang] received on cpu: %s \n", tdq->tdq_name);
>> +	}
>> +}
>>  /*
>>   * Choose the highest priority thread to run. The thread is removed from
>>   * the run-queue while running however the load remains. For SMP we set
>> @@ -2305,11 +2338,26 @@
>>  {
>>  	struct thread *td;
>>  	struct tdq *tdq;
>> +	cpuset_t map;
>>
>>  	tdq = TDQ_SELF();
>>  	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>>  	td = tdq_choose(tdq);
>>  	if (td) {
>> +		if(tdq->scheduled_gang){
>> +			/* Scheduler called after IPI
>> +			   jump over rendezvous*/
>> +			tdq->scheduled_gang = 0;
>> +		}
>> +		else{
>> +			if(td->gang){
>> +				map = all_cpus;
>> +				CPU_CLR(curcpu, &map);
>> +
>> +				smp_rendezvous_cpus(map, NULL, schedule_gang, NULL, tdq);
>> +			}
>> +		}
>> +
>>  		tdq_runq_rem(tdq, td);
>>  		tdq->tdq_lowpri = td->td_priority;
>>  		return (td);
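
The scheduled_gang check described in the quoted explanation (a CPU that was just signalled clears the flag and skips sending IPIs) can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not kernel code and not the patch itself: the sim_* names, the need_resched field, the fixed count of 4 CPUs, and the premise that every CPU's next chosen thread belongs to the gang are all hypothetical. It models only the IPI back-and-forth, not the spin lock panic or the TDQ locking discussed above.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NCPU 4	/* hypothetical CPU count for the simulation */

struct sim_cpu {
	int scheduled_gang;	/* mirrors the field added to struct tdq */
	int need_resched;	/* hypothetical: set when an "IPI" arrives */
	int picks_gang;		/* hypothetical: next chosen thread is in a gang */
};

struct sim_state {
	struct sim_cpu cpu[SIM_NCPU];
	int ipis_sent;		/* running total of simulated IPIs */
};

/* Stand-in for schedule_gang(): runs on each CPU that receives the IPI. */
static void
sim_schedule_gang(struct sim_cpu *c)
{
	c->scheduled_gang = 1;
	c->need_resched = 1;
}

/*
 * Stand-in for the modified sched_choose(). Returns the number of IPIs
 * this call sent. With use_guard set, a CPU that was just signalled
 * clears scheduled_gang and sends nothing.
 */
static int
sim_sched_choose(struct sim_state *s, int self, bool use_guard)
{
	struct sim_cpu *c;
	int i, sent = 0;

	assert(self >= 0 && self < SIM_NCPU);
	c = &s->cpu[self];

	if (use_guard && c->scheduled_gang) {
		c->scheduled_gang = 0;	/* scheduler called after IPI */
		return (0);
	}
	if (c->picks_gang) {
		for (i = 0; i < SIM_NCPU; i++) {
			if (i == self)
				continue;	/* like CPU_CLR(curcpu, &map) */
			sim_schedule_gang(&s->cpu[i]);
			sent++;
		}
	}
	s->ipis_sent += sent;
	return (sent);
}

/*
 * CPU 0 starts as gang leader. In each round, every CPU that was asked
 * to reschedule in the previous round runs the scheduler once.
 * Returns the total number of IPIs sent over all rounds.
 */
static int
sim_run(bool use_guard, int rounds)
{
	struct sim_state s = {0};
	bool pending[SIM_NCPU];
	int i, r;

	for (i = 0; i < SIM_NCPU; i++)
		s.cpu[i].picks_gang = 1;
	s.cpu[0].need_resched = 1;

	for (r = 0; r < rounds; r++) {
		/* Snapshot who must reschedule, then clear the requests. */
		for (i = 0; i < SIM_NCPU; i++) {
			pending[i] = s.cpu[i].need_resched != 0;
			s.cpu[i].need_resched = 0;
		}
		for (i = 0; i < SIM_NCPU; i++)
			if (pending[i])
				sim_sched_choose(&s, i, use_guard);
	}
	return (s.ipis_sent);
}
```

Under these assumptions, sim_run(true, 3) counts 3 IPIs (only the leader's), while sim_run(false, 3) counts 24, because without the check each signalled CPU signals all the others again in every round.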