Date: Wed, 14 Dec 2011 02:36:29 +0200
From: Ivan Klymenko <fidaj@ukr.net>
To: mdf@FreeBSD.org
Cc: Doug Barton, Jilles Tjoelker, "O. Hartmann", Current FreeBSD,
 freebsd-stable@freebsd.org, freebsd-performance@freebsd.org
Subject: Re: SCHED_ULE should not be the default

On Tue, 13 Dec 2011 16:01:56 -0800
mdf@FreeBSD.org wrote:

> On Tue, Dec 13, 2011 at 3:39 PM, Ivan Klymenko wrote:
> > On Wed, 14 Dec 2011 00:04:42 +0100
> > Jilles Tjoelker wrote:
> >
> >> On Tue, Dec 13, 2011 at 10:40:48AM +0200, Ivan Klymenko wrote:
> >> > If the ULE algorithm itself is not the problem, then the problem
> >> > is either the Core2Duo or a piece of code that the ULE scheduler
> >> > uses. I already wrote on the mailing list that in my particular
> >> > case (Core2Duo) the following patch partially helps:
> >> > --- sched_ule.c.orig        2011-11-24 18:11:48.000000000 +0200
> >> > +++ sched_ule.c     2011-12-10 22:47:08.000000000 +0200
> >> > @@ -794,7 +794,8 @@
> >> >      * 1.5 * balance_interval.
> >> >      */
> >> >     balance_ticks = max(balance_interval / 2, 1);
> >> > -   balance_ticks += random() % balance_interval;
> >> > +// balance_ticks += random() % balance_interval;
> >> > +   balance_ticks += ((int)random()) % balance_interval;
> >> >     if (smp_started == 0 || rebalance == 0)
> >> >             return;
> >> >     tdq = TDQ_SELF();
> >>
> >> This avoids a 64-bit division on 64-bit platforms but seems to
> >> have no effect otherwise. Because this function is not called very
> >> often, the change seems unlikely to help.
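
As a side note, the hunk above only changes integer widths: random()
returns long, which is 64 bits on amd64, so the unpatched expression
performs a 64-bit remainder. A minimal userland sketch of that type
promotion (the balance_interval value below is an assumption for
illustration; this is not the kernel code itself):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	int balance_interval = 128;	/* assumed value, illustration only */

	/* long % int promotes to long: a 64-bit remainder on LP64. */
	long r64 = random() % balance_interval;

	/* Casting the result of random() to int keeps the operation 32-bit. */
	int r32 = (int)random() % balance_interval;

	printf("64-bit path: %ld, 32-bit path: %d\n", r64, r32);
	return (0);
}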
> >
> > Yes, this section is not relevant to this problem :)
> > I just posted the latest version of the patch, the one I am using now...
> >
> >>
> >> > @@ -2118,13 +2119,21 @@
> >> >     struct td_sched *ts;
> >> >
> >> >     THREAD_LOCK_ASSERT(td, MA_OWNED);
> >> > +   if (td->td_pri_class & PRI_FIFO_BIT)
> >> > +           return;
> >> > +   ts = td->td_sched;
> >> > +   /*
> >> > +    * We used up one time slice.
> >> > +    */
> >> > +   if (--ts->ts_slice > 0)
> >> > +           return;
> >>
> >> This skips most of the periodic functionality (long term load
> >> balancer, saving switch count (?), insert index (?), interactivity
> >> score update for long running thread) if the thread is not going to
> >> be rescheduled right now.
> >>
> >> It looks wrong but it is a data point if it helps your workload.
> >
> > Yes, I did that to delay for as long as possible the execution of the
> > code in this section:
> > ...
> > #ifdef SMP
> >        /*
> >         * We run the long term load balancer infrequently on the
> >         * first cpu.
> >         */
> >        if (balance_tdq == tdq) {
> >                if (balance_ticks && --balance_ticks == 0)
> >                        sched_balance();
> >        }
> > #endif
> > ...
> >
> >>
> >> >     tdq = TDQ_SELF();
> >> >  #ifdef SMP
> >> >     /*
> >> >      * We run the long term load balancer infrequently on the
> >> >      * first cpu.
> >> >      */
> >> > -   if (balance_tdq == tdq) {
> >> > -           if (balance_ticks && --balance_ticks == 0)
> >> > +   if (balance_ticks && --balance_ticks == 0) {
> >> > +           if (balance_tdq == tdq)
> >> >                     sched_balance();
> >> >     }
> >> >  #endif
> >>
> >> The main effect of this appears to be to disable the long term load
> >> balancer completely after some time. At some point, a CPU other
> >> than the first CPU (which uses balance_tdq) will set balance_ticks
> >> = 0, and sched_balance() will never be called again.
> >>
> >
> > That is, for the same reason as above...
> >
> >> It also introduces a hypothetical race condition because the
> >> access to balance_ticks is no longer restricted to one CPU under a
> >> spinlock.
> >>
> >> If the long term load balancer may be causing trouble, try setting
> >> kern.sched.balance_interval to a higher value with unpatched code.
> >
> > I checked that first of all, but it did not fix the situation...
> >
> > My impression is that the rebalancing itself is malfunctioning...
> > It seems that a thread gets handed to the same core that is already
> > loaded, and so on... Perhaps this is a consequence of an incorrect
> > detection of the CPU topology?
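
To check the topology suspicion without patching anything, the topology
that ULE actually detected can be read back from the running kernel.
A rough sketch, assuming an SMP kernel built with SCHED_ULE (where
kern.sched.topology_spec and kern.sched.balance_interval exist);
"sysctl kern.sched.topology_spec" from the shell shows the same thing:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char *spec;
	size_t len;
	int interval;

	/* kern.sched.topology_spec is a read-only XML string. */
	if (sysctlbyname("kern.sched.topology_spec", NULL, &len, NULL, 0) == -1)
		return (1);
	if ((spec = malloc(len)) == NULL)
		return (1);
	if (sysctlbyname("kern.sched.topology_spec", spec, &len, NULL, 0) == -1)
		return (1);
	printf("%s\n", spec);
	free(spec);

	/* Current long term load balancer interval. */
	len = sizeof(interval);
	if (sysctlbyname("kern.sched.balance_interval", &interval, &len,
	    NULL, 0) == 0)
		printf("kern.sched.balance_interval: %d\n", interval);

	return (0);
}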
> >
> >>
> >> > @@ -2144,9 +2153,6 @@
> >> >             if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
> >> >                     tdq->tdq_ridx = tdq->tdq_idx;
> >> >     }
> >> > -   ts = td->td_sched;
> >> > -   if (td->td_pri_class & PRI_FIFO_BIT)
> >> > -           return;
> >> >     if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
> >> >             /*
> >> >              * We used a tick; charge it to the thread so
> >> > @@ -2157,11 +2163,6 @@
> >> >             sched_priority(td);
> >> >     }
> >> >     /*
> >> > -    * We used up one time slice.
> >> > -    */
> >> > -   if (--ts->ts_slice > 0)
> >> > -           return;
> >> > -   /*
> >> >      * We're out of time, force a requeue at userret().
> >> >      */
> >> >     ts->ts_slice = sched_slice;
> >>
> >> > and not using options FULL_PREEMPTION.
> >> > But nobody has yet replied to my letter to say whether my patch
> >> > helps or not in the Core2Duo case...
> >> > I suspect that the problems stem from the sections of code
> >> > associated with SMP...
> >> > Maybe I am wrong about something, but I want to help in solving
> >> > this problem...
>
>
> Has anyone experiencing problems tried to set sysctl
> kern.sched.steal_thresh=1 ?
>

In my case kern.sched.steal_thresh already has the value 1.

> I don't remember what our specific problem at $WORK was, perhaps it
> was just interrupt threads not getting serviced fast enough, but we've
> hard-coded this to 1 and removed the code that sets it in
> sched_initticks().  The same effect should be had by setting the
> sysctl after a box is up.
>
> Thanks,
> matthew
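
For completeness, a sketch of the "set it after the box is up" route
matthew describes, equivalent to running "sysctl kern.sched.steal_thresh=1"
as root (assumes a SCHED_ULE kernel; this is not the sched_initticks()
change itself):

#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>

int
main(void)
{
	int old, new = 1;
	size_t oldlen = sizeof(old);

	/* Read the previous value and write 1 in a single call. */
	if (sysctlbyname("kern.sched.steal_thresh", &old, &oldlen,
	    &new, sizeof(new)) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	printf("kern.sched.steal_thresh: %d -> %d\n", old, new);
	return (0);
}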