From: mdf@FreeBSD.org
Date: Tue, 13 Dec 2011 16:01:56 -0800
To: Ivan Klymenko
Cc: Doug Barton, freebsd-stable@freebsd.org, Jilles Tjoelker, "O. Hartmann", Current FreeBSD, freebsd-performance@freebsd.org
Subject: Re: SCHED_ULE should not be the default
List-Id: Production branch of FreeBSD source code

On Tue, Dec 13, 2011 at 3:39 PM, Ivan Klymenko wrote:
> On Wed, 14 Dec 2011 00:04:42 +0100
> Jilles Tjoelker wrote:
>
>> On Tue, Dec 13, 2011 at 10:40:48AM +0200, Ivan Klymenko wrote:
>> > If the algorithm ULE does not contain problems - it means the
>> > problem has Core2Duo, or in a piece of code that uses the ULE
>> > scheduler. I already wrote in a mailing list that specifically in
>> > my case (Core2Duo) partially helps the following patch:
>> > --- sched_ule.c.orig        2011-11-24 18:11:48.000000000 +0200
>> > +++ sched_ule.c     2011-12-10 22:47:08.000000000 +0200
>> > @@ -794,7 +794,8 @@
>> >      * 1.5 * balance_interval.
>> >      */
>> >     balance_ticks = max(balance_interval / 2, 1);
>> > -   balance_ticks += random() % balance_interval;
>> > +// balance_ticks += random() % balance_interval;
>> > +   balance_ticks += ((int)random()) % balance_interval;
>> >     if (smp_started == 0 || rebalance == 0)
>> >             return;
>> >     tdq = TDQ_SELF();
>>
>> This avoids a 64-bit division on 64-bit platforms but seems to have no
>> effect otherwise. Because this function is not called very often, the
>> change seems unlikely to help.
>
> Yes, this section does not apply to this problem :)
> Just I posted the latest patch which i using now...
>
>>
>> > @@ -2118,13 +2119,21 @@
>> >     struct td_sched *ts;
>> >
>> >     THREAD_LOCK_ASSERT(td, MA_OWNED);
>> > +   if (td->td_pri_class & PRI_FIFO_BIT)
>> > +           return;
>> > +   ts = td->td_sched;
>> > +   /*
>> > +    * We used up one time slice.
>> > +    */
>> > +   if (--ts->ts_slice > 0)
>> > +           return;
>>
>> This skips most of the periodic functionality (long term load
>> balancer, saving switch count (?), insert index (?), interactivity
>> score update for long running thread) if the thread is not going to
>> be rescheduled right now.
>>
>> It looks wrong but it is a data point if it helps your workload.
>
> Yes, I did it for as long as possible to delay the execution of the code in section:
> ...
> #ifdef SMP
>        /*
>         * We run the long term load balancer infrequently on the first cpu.
>         */
>        if (balance_tdq == tdq) {
>                if (balance_ticks && --balance_ticks == 0)
>                        sched_balance();
>        }
> #endif
> ...
>
>>
>> >     tdq = TDQ_SELF();
>> >  #ifdef SMP
>> >     /*
>> >      * We run the long term load balancer infrequently on the first cpu.
>> >      */
>> > -   if (balance_tdq == tdq) {
>> > -           if (balance_ticks && --balance_ticks == 0)
>> > +   if (balance_ticks && --balance_ticks == 0) {
>> > +           if (balance_tdq == tdq)
>> >                     sched_balance();
>> >     }
>> >  #endif
>>
>> The main effect of this appears to be to disable the long term load
>> balancer completely after some time. At some point, a CPU other than
>> the first CPU (which uses balance_tdq) will set balance_ticks = 0, and
>> sched_balance() will never be called again.
>>
>
> That is, for the same reason as above in the text...
>
>> It also introduces a hypothetical race condition because the access to
>> balance_ticks is no longer restricted to one CPU under a spinlock.
>>
>> If the long term load balancer may be causing trouble, try setting
>> kern.sched.balance_interval to a higher value with unpatched code.
>
> I checked it in the first place - but it did not help fix the situation...
>
> The impression of malfunction rebalancing...
> It seems that the thread is passed on to the same core that is loaded and so...
> Perhaps this is a consequence of an incorrect definition of the topology CPU?
>
>>
>> > @@ -2144,9 +2153,6 @@
>> >             if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
>> >                     tdq->tdq_ridx = tdq->tdq_idx;
>> >     }
>> > -   ts = td->td_sched;
>> > -   if (td->td_pri_class & PRI_FIFO_BIT)
>> > -           return;
>> >     if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
>> >             /*
>> >              * We used a tick; charge it to the thread so
>> > @@ -2157,11 +2163,6 @@
>> >             sched_priority(td);
>> >     }
>> >     /*
>> > -    * We used up one time slice.
>> > -    */
>> > -   if (--ts->ts_slice > 0)
>> > -           return;
>> > -   /*
>> >      * We're out of time, force a requeue at userret().
>> >      */
>> >     ts->ts_slice = sched_slice;
>>
>> > and refusal to use options FULL_PREEMPTION
>> > But no one has unsubscribed to my letter, my patch helps or not in
>> > the case of Core2Duo...
>> > There is a suspicion that the problems stem from the sections of
>> > code associated with the SMP...
>> > Maybe I'm in something wrong, but I want to help in solving this
>> > problem ...

Has anyone experiencing problems tried to set sysctl
kern.sched.steal_thresh=1?

I don't remember what our specific problem at $WORK was, perhaps it
was just interrupt threads not getting serviced fast enough, but we've
hard-coded this to 1 and removed the code that sets it in
sched_initticks(). The same effect should be had by setting the
sysctl after a box is up.

Thanks,
matthew