Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 28 Mar 2024 14:00:58 +0000
From:      Nuno Teixeira <eduardo@freebsd.org>
To:        Drew Gallatin <gallatin@freebsd.org>
Cc:        Konstantin Belousov <kib@freebsd.org>, rrs <rrs@lakerest.net>, Mike Karels <mike@karels.net>,  tuexen <tuexen@freebsd.org>, garyj@gmx.de, current@freebsd.org, net@freebsd.org,  Randall Stewart <rrs@freebsd.org>
Subject:   Re: Request for Testing: TCP RACK
Message-ID:  <CAFDf7U%2Bsd0Hx6scDOvMuxOQaqVb8wi1ODrwT=uKDCqMn_81QEw@mail.gmail.com>
In-Reply-To: <fc160c79-1884-4f68-8310-35e7ac0b9dd6@app.fastmail.com>
References:  <6e795e9c-8de4-4e02-9a96-8fabfaa4e66f@app.fastmail.com> <CAFDf7UKDWSnhm%2BTwP=ZZ9dkk0jmAgjGKPLpkX-CKuw3yH233gQ@mail.gmail.com> <CAFDf7UJq9SCnU-QYmS3t6EknP369w2LR0dNkQAc-NaRLvwVfoQ@mail.gmail.com> <A3F1FC0C-C199-4565-8E07-B233ED9E7B2E@freebsd.org> <6047C8EF-B1B0-4286-93FA-AA38F8A18656@karels.net> <ZfiI7GcbTwSG8kkO@kib.kiev.ua> <8031cd99-ded8-4b06-93b3-11cc729a8b2c@app.fastmail.com> <ZfiY-xUUM3wrBEz_@kib.kiev.ua> <38c54399-6c96-44d8-a3a2-3cc1bfbe50c2@app.fastmail.com> <27d8144f-0658-46f6-b8f3-35eb60061644@lakerest.net> <Zft8odA0s49eLhvk@kib.kiev.ua> <fc160c79-1884-4f68-8310-35e7ac0b9dd6@app.fastmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
--000000000000986d170614b8f4a6
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hello all!

Running rack @b7b78c1c169 "Optimize HPTS..." very happy on my laptop
(amd64)!

Thanks all!

Drew Gallatin <gallatin@freebsd.org> escreveu (quinta, 21/03/2024 =C3=A0(s)
12:58):

> The entire point is to *NOT* go through the overhead of scheduling
> something asynchronously, but to take advantage of the fact that a
> user/kernel transition is going to trash the cache anyway.
>
> In the common case of a system which has less than the threshold  number
> of connections , we access the tcp_hpts_softclock function pointer, make
> one function call, and access hpts_that_need_softclock, and then return.
> So that's 2 variables and a function call.
>
> I think it would be preferable to avoid that call, and to move the
> declaration of tcp_hpts_softclock and hpts_that_need_softclock so that th=
ey
> are in the same cacheline.  Then we'd be hitting just a single line in th=
e
> common case.  (I've made comments on the review to that effect).
>
> Also, I wonder if the threshold could get higher by default, so that hpts
> is never called in this context unless we're to the point where we're
> scheduling thousands of runs of the hpts thread (and taking all those clo=
ck
> interrupts).
>
> Drew
>
> On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Belousov wrote:
>
> On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote:
> > Ok I have created
> >
> > https://reviews.freebsd.org/D44420
> >
> >
> > To address the issue. I also attach a short version of the patch that
> Nuno
> > can try and validate
> >
> > it works. Drew you may want to try this and validate the optimization
> does
> > kick in since I can
> >
> > only now test that it does not on my local box :)
> The patch still causes access to all cpu's cachelines on each userret.
> It would be much better to inc/check the threshold and only schedule the
> call when exceeded.  Then the call can occur in some dedicated context,
> like per-CPU thread, instead of userret.
>
> >
> >
> > R
> >
> >
> >
> > On 3/18/24 3:42 PM, Drew Gallatin wrote:
> > > No.  The goal is to run on every return to userspace for every thread=
.
> > >
> > > Drew
> > >
> > > On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belousov wrote:
> > > > On Mon, Mar 18, 2024 at 03:13:11PM -0400, Drew Gallatin wrote:
> > > > > I got the idea from
> > > > >
> https://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf
> > > > > The gist is that the TCP pacing stuff needs to run frequently, an=
d
> > > > > rather than run it out of a clock interrupt, its more efficient t=
o
> run
> > > > > it out of a system call context at just the point where we return
> to
> > > > > userspace and the cache is trashed anyway. The current
> implementation
> > > > > is fine for our workload, but probably not idea for a generic
> system.
> > > > > Especially one where something is banging on system calls.
> > > > >
> > > > > Ast's could be the right tool for this, but I'm super unfamiliar
> with
> > > > > them, and I can't find any docs on them.
> > > > >
> > > > > Would ast_register(0, ASTR_UNCOND, 0, func) be roughly equivalent
> to
> > > > > what's happening here?
> > > > This call would need some AST number added, and then it registers t=
he
> > > > ast to run on next return to userspace, for the current thread.
> > > >
> > > > Is it enough?
> > > > >
> > > > > Drew
> > > >
> > > > >
> > > > > On Mon, Mar 18, 2024, at 2:33 PM, Konstantin Belousov wrote:
> > > > > > On Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike Karels wrote:
> > > > > > > On 18 Mar 2024, at 7:04, tuexen@freebsd.org wrote:
> > > > > > >
> > > > > > > >> On 18. Mar 2024, at 12:42, Nuno Teixeira
> > > > <eduardo@freebsd.org> wrote:
> > > > > > > >>
> > > > > > > >> Hello all!
> > > > > > > >>
> > > > > > > >> It works just fine!
> > > > > > > >> System performance is OK.
> > > > > > > >> Using patch on main-n268841-b0aaf8beb126(-dirty).
> > > > > > > >>
> > > > > > > >> ---
> > > > > > > >> net.inet.tcp.functions_available:
> > > > > > > >> Stack                           D
> > > > Alias                            PCB count
> > > > > > > >> freebsd freebsd                          0
> > > > > > > >> rack                            *
> > > > rack                             38
> > > > > > > >> ---
> > > > > > > >>
> > > > > > > >> It would be so nice that we can have a sysctl tunnable for
> > > > this patch
> > > > > > > >> so we could do more tests without recompiling kernel.
> > > > > > > > Thanks for testing!
> > > > > > > >
> > > > > > > > @gallatin: can you come up with a patch that is acceptable
> > > > for Netflix
> > > > > > > > and allows to mitigate the performance regression.
> > > > > > >
> > > > > > > Ideally, tcphpts could enable this automatically when it
> > > > starts to be
> > > > > > > used (enough?), but a sysctl could select auto/on/off.
> > > > > > There is already a well-known mechanism to request execution of
> the
> > > > > > specific function on return to userspace, namely AST.  The
> difference
> > > > > > with the current hack is that the execution is requested for on=
e
> > > > callback
> > > > > > in the context of the specific thread.
> > > > > >
> > > > > > Still, it might be worth a try to use it; what is the reason to
> > > > hit a thread
> > > > > > that does not do networking, with TCP processing?
> > > > > >
> > > > > > >
> > > > > > > Mike
> > > > > > >
> > > > > > > > Best regards
> > > > > > > > Michael
> > > > > > > >>
> > > > > > > >> Thanks all!
> > > > > > > >> Really happy here :)
> > > > > > > >>
> > > > > > > >> Cheers,
> > > > > > > >>
> > > > > > > >> Nuno Teixeira <eduardo@freebsd.org> escreveu (domingo,
> > > > 17/03/2024 =C3=A0(s) 20:26):
> > > > > > > >>>
> > > > > > > >>> Hello,
> > > > > > > >>>
> > > > > > > >>>> I don't have the full context, but it seems like the
> > > > complaint is a performance regression in bonnie++ and perhaps other
> > > > things when tcp_hpts is loaded, even when it is not used.  Is that
> > > > correct?
> > > > > > > >>>>
> > > > > > > >>>> If so, I suspect its because we drive the
> > > > tcp_hpts_softclock() routine from userret(), in order to avoid tons
> > > > of timer interrupts and context switches.  To test this theory,  yo=
u
> > > > could apply a patch like:
> > > > > > > >>>
> > > > > > > >>> It's affecting overall system performance, bonnie was jus=
t
> > > > a way to
> > > > > > > >>> get some numbers to compare.
> > > > > > > >>>
> > > > > > > >>> Tomorrow I will test patch.
> > > > > > > >>>
> > > > > > > >>> Thanks!
> > > > > > > >>>
> > > > > > > >>> --
> > > > > > > >>> Nuno Teixeira
> > > > > > > >>> FreeBSD Committer (ports)
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Nuno Teixeira
> > > > > > > >> FreeBSD Committer (ports)
> > > > > > >
> > > > > >
> > > >
> > >
>
> > diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts.c
> > index 8c4d2d41a3eb..eadbee19f69c 100644
> > --- a/sys/netinet/tcp_hpts.c
> > +++ b/sys/netinet/tcp_hpts.c
> > @@ -216,6 +216,7 @@ struct tcp_hpts_entry {
> >  void *ie_cookie;
> >  uint16_t p_num; /* The hpts number one per cpu */
> >  uint16_t p_cpu; /* The hpts CPU */
> > + uint8_t hit_callout_thresh;
> >  /* There is extra space in here */
> >  /* Cache line 0x100 */
> >  struct callout co __aligned(CACHE_LINE_SIZE);
> > @@ -269,6 +270,11 @@ static struct hpts_domain_info {
> >  int cpu[MAXCPU];
> >  } hpts_domains[MAXMEMDOM];
> >
> > +counter_u64_t hpts_that_need_softclock;
> > +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needsoftclock,
> CTLFLAG_RD,
> > +    &hpts_that_need_softclock,
> > +    "Number of hpts threads that need softclock");
> > +
> >  counter_u64_t hpts_hopelessly_behind;
> >
> >  SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, hopeless,
> CTLFLAG_RD,
> > @@ -334,7 +340,7 @@ SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, precision,
> CTLFLAG_RW,
> >      &tcp_hpts_precision, 120,
> >      "Value for PRE() precision of callout");
> >  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, cnt_thresh, CTLFLAG_RW,
> > -    &conn_cnt_thresh, 0,
> > +    &conn_cnt_thresh, DEFAULT_CONNECTION_THESHOLD,
> >      "How many connections (below) make us use the callout based
> mechanism");
> >  SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, logging, CTLFLAG_RW,
> >      &hpts_does_tp_logging, 0,
> > @@ -1548,6 +1554,9 @@ __tcp_run_hpts(void)
> >  struct tcp_hpts_entry *hpts;
> >  int ticks_ran;
> >
> > + if (counter_u64_fetch(hpts_that_need_softclock) =3D=3D 0)
> > + return;
> > +
> >  hpts =3D tcp_choose_hpts_to_run();
> >
> >  if (hpts->p_hpts_active) {
> > @@ -1683,6 +1692,13 @@ tcp_hpts_thread(void *ctx)
> >  ticks_ran =3D tcp_hptsi(hpts, 1);
> >  tv.tv_sec =3D 0;
> >  tv.tv_usec =3D hpts->p_hpts_sleep_time * HPTS_TICKS_PER_SLOT;
> > + if ((hpts->p_on_queue_cnt > conn_cnt_thresh) &&
> (hpts->hit_callout_thresh =3D=3D 0)) {
> > + hpts->hit_callout_thresh =3D 1;
> > + counter_u64_add(hpts_that_need_softclock, 1);
> > + } else if ((hpts->p_on_queue_cnt <=3D conn_cnt_thresh) &&
> (hpts->hit_callout_thresh =3D=3D 1)) {
> > + hpts->hit_callout_thresh =3D 0;
> > + counter_u64_add(hpts_that_need_softclock, -1);
> > + }
> >  if (hpts->p_on_queue_cnt >=3D conn_cnt_thresh) {
> >  if(hpts->p_direct_wake =3D=3D 0) {
> >  /*
> > @@ -1818,6 +1834,7 @@ tcp_hpts_mod_load(void)
> >  cpu_top =3D NULL;
> >  #endif
> >  tcp_pace.rp_num_hptss =3D ncpus;
> > + hpts_that_need_softclock =3D counter_u64_alloc(M_WAITOK);
> >  hpts_hopelessly_behind =3D counter_u64_alloc(M_WAITOK);
> >  hpts_loops =3D counter_u64_alloc(M_WAITOK);
> >  back_tosleep =3D counter_u64_alloc(M_WAITOK);
> > @@ -2042,6 +2059,7 @@ tcp_hpts_mod_unload(void)
> >  free(tcp_pace.grps, M_TCPHPTS);
> >  #endif
> >
> > + counter_u64_free(hpts_that_need_softclock);
> >  counter_u64_free(hpts_hopelessly_behind);
> >  counter_u64_free(hpts_loops);
> >  counter_u64_free(back_tosleep);
>
>
>
>

--=20
Nuno Teixeira
FreeBSD Committer (ports)

--000000000000986d170614b8f4a6
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div>Hello all!</div><div><br></div><div>Running rack @b7b=
78c1c169 &quot;Optimize HPTS...&quot; very happy on my laptop (amd64)!</div=
><div><br></div><div>Thanks all!<br></div></div><br><div class=3D"gmail_quo=
te"><div dir=3D"ltr" class=3D"gmail_attr">Drew Gallatin &lt;<a href=3D"mail=
to:gallatin@freebsd.org">gallatin@freebsd.org</a>&gt; escreveu (quinta, 21/=
03/2024 =C3=A0(s) 12:58):<br></div><blockquote class=3D"gmail_quote" style=
=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding=
-left:1ex"><div class=3D"msg3886066866197344175"><u></u><div><div>The entir=
e point is to *NOT* go through the overhead of scheduling something asynchr=
onously, but to take advantage of the fact that a user/kernel transition is=
 going to trash the cache anyway.<br></div><div><br></div><div>In the commo=
n case of a system which has less than the threshold=C2=A0 number of connec=
tions , we access the tcp_hpts_softclock function pointer, make one functio=
n call, and access hpts_that_need_softclock, and then return.=C2=A0 So that=
&#39;s 2 variables and a function call.<br></div><div><br></div><div>I thin=
k it would be preferable to avoid that call, and to move the declaration of=
 tcp_hpts_softclock and hpts_that_need_softclock so that they are in the sa=
me cacheline.=C2=A0 Then we&#39;d be hitting just a single line in the comm=
on case.=C2=A0 (I&#39;ve made comments on the review to that effect).<br></=
div><div><br></div><div>Also, I wonder if the threshold could get higher by=
 default, so that hpts is never called in this context unless we&#39;re to =
the point where we&#39;re scheduling thousands of runs of the hpts thread (=
and taking all those clock interrupts).<br></div><div><br></div><div>Drew<b=
r></div><div><br></div><div>On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Be=
lousov wrote:<br></div><blockquote type=3D"cite" id=3D"m_388606686619734417=
5qt"><div>On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote:<br></div><di=
v>&gt; Ok I have created<br></div><div>&gt;=C2=A0<br></div><div>&gt;=C2=A0<=
a href=3D"https://reviews.freebsd.org/D44420" target=3D"_blank">https://rev=
iews.freebsd.org/D44420</a><br></div><div>&gt;=C2=A0<br></div><div>&gt;=C2=
=A0<br></div><div>&gt; To address the issue. I also attach a short version =
of the patch that Nuno<br></div><div>&gt; can try and validate<br></div><di=
v>&gt;=C2=A0<br></div><div>&gt; it works. Drew you may want to try this and=
 validate the optimization does<br></div><div>&gt; kick in since I can<br><=
/div><div>&gt;=C2=A0<br></div><div>&gt; only now test that it does not on m=
y local box :)<br></div><div>The patch still causes access to all cpu&#39;s=
 cachelines on each userret.<br></div><div>It would be much better to inc/c=
heck the threshold and only schedule the<br></div><div>call when exceeded.=
=C2=A0 Then the call can occur in some dedicated context,<br></div><div>lik=
e per-CPU thread, instead of userret.<br></div><div><br></div><div>&gt;=C2=
=A0<br></div><div>&gt;=C2=A0<br></div><div>&gt; R<br></div><div>&gt;=C2=A0<=
br></div><div>&gt;=C2=A0<br></div><div>&gt;=C2=A0<br></div><div>&gt; On 3/1=
8/24 3:42 PM, Drew Gallatin wrote:<br></div><div>&gt; &gt; No.=C2=A0 The go=
al is to run on every return to userspace for every thread.<br></div><div>&=
gt; &gt;=C2=A0<br></div><div>&gt; &gt; Drew<br></div><div>&gt; &gt;=C2=A0<b=
r></div><div>&gt; &gt; On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belouso=
v wrote:<br></div><div>&gt; &gt; &gt; On Mon, Mar 18, 2024 at 03:13:11PM -0=
400, Drew Gallatin wrote:<br></div><div>&gt; &gt; &gt; &gt; I got the idea =
from<br></div><div>&gt; &gt; &gt; &gt;=C2=A0<a href=3D"https://people.mpi-s=
ws.org/~druschel/publications/soft-timers-tocs.pdf" target=3D"_blank">https=
://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf</a><br></=
div><div>&gt; &gt; &gt; &gt; The gist is that the TCP pacing stuff needs to=
 run frequently, and<br></div><div>&gt; &gt; &gt; &gt; rather than run it o=
ut of a clock interrupt, its more efficient to run<br></div><div>&gt; &gt; =
&gt; &gt; it out of a system call context at just the point where we return=
 to<br></div><div>&gt; &gt; &gt; &gt; userspace and the cache is trashed an=
yway. The current implementation<br></div><div>&gt; &gt; &gt; &gt; is fine =
for our workload, but probably not idea for a generic system.<br></div><div=
>&gt; &gt; &gt; &gt; Especially one where something is banging on system ca=
lls.<br></div><div>&gt; &gt; &gt; &gt;<br></div><div>&gt; &gt; &gt; &gt; As=
t&#39;s could be the right tool for this, but I&#39;m super unfamiliar with=
<br></div><div>&gt; &gt; &gt; &gt; them, and I can&#39;t find any docs on t=
hem.<br></div><div>&gt; &gt; &gt; &gt;<br></div><div>&gt; &gt; &gt; &gt; Wo=
uld ast_register(0, ASTR_UNCOND, 0, func) be roughly equivalent to<br></div=
><div>&gt; &gt; &gt; &gt; what&#39;s happening here?<br></div><div>&gt; &gt=
; &gt; This call would need some AST number added, and then it registers th=
e<br></div><div>&gt; &gt; &gt; ast to run on next return to userspace, for =
the current thread.<br></div><div>&gt; &gt; &gt;=C2=A0<br></div><div>&gt; &=
gt; &gt; Is it enough?<br></div><div>&gt; &gt; &gt; &gt;<br></div><div>&gt;=
 &gt; &gt; &gt; Drew<br></div><div>&gt; &gt; &gt;=C2=A0<br></div><div>&gt; =
&gt; &gt; &gt;<br></div><div>&gt; &gt; &gt; &gt; On Mon, Mar 18, 2024, at 2=
:33 PM, Konstantin Belousov wrote:<br></div><div>&gt; &gt; &gt; &gt; &gt; O=
n Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike Karels wrote:<br></div><div>&=
gt; &gt; &gt; &gt; &gt; &gt; On 18 Mar 2024, at 7:04,=C2=A0<a href=3D"mailt=
o:tuexen@freebsd.org" target=3D"_blank">tuexen@freebsd.org</a> wrote:<br></=
div><div>&gt; &gt; &gt; &gt; &gt; &gt;<br></div><div>&gt; &gt; &gt; &gt; &g=
t; &gt; &gt;&gt; On 18. Mar 2024, at 12:42, Nuno Teixeira<br></div><div>&gt=
; &gt; &gt; &lt;<a href=3D"mailto:eduardo@freebsd.org" target=3D"_blank">ed=
uardo@freebsd.org</a>&gt; wrote:<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt=
; &gt;&gt;<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt; Hello all!<=
br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;<br></div><div>&gt; &gt=
; &gt; &gt; &gt; &gt; &gt;&gt; It works just fine!<br></div><div>&gt; &gt; =
&gt; &gt; &gt; &gt; &gt;&gt; System performance is OK.<br></div><div>&gt; &=
gt; &gt; &gt; &gt; &gt; &gt;&gt; Using patch on main-n268841-b0aaf8beb126(-=
dirty).<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;<br></div><div>=
&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt; ---<br></div><div>&gt; &gt; &gt; &gt=
; &gt; &gt; &gt;&gt; net.inet.tcp.functions_available:<br></div><div>&gt; &=
gt; &gt; &gt; &gt; &gt; &gt;&gt; Stack=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 D<br></div><div>&gt; &gt; &gt=
; Alias=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0 PCB count<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt;=
 &gt;&gt; freebsd freebsd=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0 0<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &=
gt;&gt; rack=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0 *<br></div><div>&gt; &gt; &gt; rack=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0 38<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt; ---<br></div><d=
iv>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;<br></div><div>&gt; &gt; &gt; &gt;=
 &gt; &gt; &gt;&gt; It would be so nice that we can have a sysctl tunnable =
for<br></div><div>&gt; &gt; &gt; this patch<br></div><div>&gt; &gt; &gt; &g=
t; &gt; &gt; &gt;&gt; so we could do more tests without recompiling kernel.=
<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt; Thanks for testing!<br></=
div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;<br></div><div>&gt; &gt; &gt; &g=
t; &gt; &gt; &gt; @gallatin: can you come up with a patch that is acceptabl=
e<br></div><div>&gt; &gt; &gt; for Netflix<br></div><div>&gt; &gt; &gt; &gt=
; &gt; &gt; &gt; and allows to mitigate the performance regression.<br></di=
v><div>&gt; &gt; &gt; &gt; &gt; &gt;<br></div><div>&gt; &gt; &gt; &gt; &gt;=
 &gt; Ideally, tcphpts could enable this automatically when it<br></div><di=
v>&gt; &gt; &gt; starts to be<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; u=
sed (enough?), but a sysctl could select auto/on/off.<br></div><div>&gt; &g=
t; &gt; &gt; &gt; There is already a well-known mechanism to request execut=
ion of the<br></div><div>&gt; &gt; &gt; &gt; &gt; specific function on retu=
rn to userspace, namely AST.=C2=A0 The difference<br></div><div>&gt; &gt; &=
gt; &gt; &gt; with the current hack is that the execution is requested for =
one<br></div><div>&gt; &gt; &gt; callback<br></div><div>&gt; &gt; &gt; &gt;=
 &gt; in the context of the specific thread.<br></div><div>&gt; &gt; &gt; &=
gt; &gt;<br></div><div>&gt; &gt; &gt; &gt; &gt; Still, it might be worth a =
try to use it; what is the reason to<br></div><div>&gt; &gt; &gt; hit a thr=
ead<br></div><div>&gt; &gt; &gt; &gt; &gt; that does not do networking, wit=
h TCP processing?<br></div><div>&gt; &gt; &gt; &gt; &gt;<br></div><div>&gt;=
 &gt; &gt; &gt; &gt; &gt;<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; Mike<=
br></div><div>&gt; &gt; &gt; &gt; &gt; &gt;<br></div><div>&gt; &gt; &gt; &g=
t; &gt; &gt; &gt; Best regards<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; =
&gt; Michael<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;<br></div>=
<div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt; Thanks all!<br></div><div>&gt; =
&gt; &gt; &gt; &gt; &gt; &gt;&gt; Really happy here :)<br></div><div>&gt; &=
gt; &gt; &gt; &gt; &gt; &gt;&gt;<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt=
; &gt;&gt; Cheers,<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;<br>=
</div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt; Nuno Teixeira &lt;<a href=
=3D"mailto:eduardo@freebsd.org" target=3D"_blank">eduardo@freebsd.org</a>&g=
t; escreveu (domingo,<br></div><div>&gt; &gt; &gt; 17/03/2024 =C3=A0(s) 20:=
26):<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt;<br></div><div=
>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt; Hello,<br></div><div>&gt; &gt; =
&gt; &gt; &gt; &gt; &gt;&gt;&gt;<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt=
; &gt;&gt;&gt;&gt; I don&#39;t have the full context, but it seems like the=
<br></div><div>&gt; &gt; &gt; complaint is a performance regression in bonn=
ie++ and perhaps other<br></div><div>&gt; &gt; &gt; things when tcp_hpts is=
 loaded, even when it is not used.=C2=A0 Is that<br></div><div>&gt; &gt; &g=
t; correct?<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt;&gt;<br=
></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt;&gt; If so, I suspect=
 its because we drive the<br></div><div>&gt; &gt; &gt; tcp_hpts_softclock()=
 routine from userret(), in order to avoid tons<br></div><div>&gt; &gt; &gt=
; of timer interrupts and context switches.=C2=A0 To test this theory,=C2=
=A0 you<br></div><div>&gt; &gt; &gt; could apply a patch like:<br></div><di=
v>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt;<br></div><div>&gt; &gt; &gt; &=
gt; &gt; &gt; &gt;&gt;&gt; It&#39;s affecting overall system performance, b=
onnie was just<br></div><div>&gt; &gt; &gt; a way to<br></div><div>&gt; &gt=
; &gt; &gt; &gt; &gt; &gt;&gt;&gt; get some numbers to compare.<br></div><d=
iv>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt;<br></div><div>&gt; &gt; &gt; =
&gt; &gt; &gt; &gt;&gt;&gt; Tomorrow I will test patch.<br></div><div>&gt; =
&gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt;<br></div><div>&gt; &gt; &gt; &gt; &gt=
; &gt; &gt;&gt;&gt; Thanks!<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt=
;&gt;&gt;<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt; --<br></=
div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt; Nuno Teixeira<br></div>=
<div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;&gt; FreeBSD Committer (ports)<b=
r></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt;<br></div><div>&gt; &gt;=
 &gt; &gt; &gt; &gt; &gt;&gt;<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &=
gt;&gt;<br></div><div>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt; --<br></div><d=
iv>&gt; &gt; &gt; &gt; &gt; &gt; &gt;&gt; Nuno Teixeira<br></div><div>&gt; =
&gt; &gt; &gt; &gt; &gt; &gt;&gt; FreeBSD Committer (ports)<br></div><div>&=
gt; &gt; &gt; &gt; &gt; &gt;<br></div><div>&gt; &gt; &gt; &gt; &gt;<br></di=
v><div>&gt; &gt; &gt;=C2=A0<br></div><div>&gt; &gt;=C2=A0<br></div><div><br=
></div><div>&gt; diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts=
.c<br></div><div>&gt; index 8c4d2d41a3eb..eadbee19f69c 100644<br></div><div=
>&gt; --- a/sys/netinet/tcp_hpts.c<br></div><div>&gt; +++ b/sys/netinet/tcp=
_hpts.c<br></div><div>&gt; @@ -216,6 +216,7 @@ struct tcp_hpts_entry {<br><=
/div><div>&gt;=C2=A0 	void *ie_cookie;<br></div><div>&gt;=C2=A0 	uint16_t p=
_num;		/* The hpts number one per cpu */<br></div><div>&gt;=C2=A0 	uint16_t=
 p_cpu;		/* The hpts CPU */<br></div><div>&gt; +	uint8_t hit_callout_thresh=
;<br></div><div>&gt;=C2=A0 	/* There is extra space in here */<br></div><di=
v>&gt;=C2=A0 	/* Cache line 0x100 */<br></div><div>&gt;=C2=A0 	struct callo=
ut co __aligned(CACHE_LINE_SIZE);<br></div><div>&gt; @@ -269,6 +270,11 @@ s=
tatic struct hpts_domain_info {<br></div><div>&gt;=C2=A0 	int cpu[MAXCPU];<=
br></div><div>&gt;=C2=A0 } hpts_domains[MAXMEMDOM];<br></div><div>&gt;=C2=
=A0=C2=A0<br></div><div>&gt; +counter_u64_t hpts_that_need_softclock;<br></=
div><div>&gt; +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needs=
oftclock, CTLFLAG_RD,<br></div><div>&gt; +=C2=A0=C2=A0=C2=A0 &amp;hpts_that=
_need_softclock,<br></div><div>&gt; +=C2=A0=C2=A0=C2=A0 &quot;Number of hpt=
s threads that need softclock&quot;);<br></div><div>&gt; +<br></div><div>&g=
t;=C2=A0 counter_u64_t hpts_hopelessly_behind;<br></div><div>&gt;=C2=A0=C2=
=A0<br></div><div>&gt;=C2=A0 SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, O=
ID_AUTO, hopeless, CTLFLAG_RD,<br></div><div>&gt; @@ -334,7 +340,7 @@ SYSCT=
L_INT(_net_inet_tcp_hpts, OID_AUTO, precision, CTLFLAG_RW,<br></div><div>&g=
t;=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 &amp;tcp_hpts_precision, 120,<br></div><di=
v>&gt;=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 &quot;Value for PRE() precision of cal=
lout&quot;);<br></div><div>&gt;=C2=A0 SYSCTL_INT(_net_inet_tcp_hpts, OID_AU=
TO, cnt_thresh, CTLFLAG_RW,<br></div><div>&gt; -=C2=A0=C2=A0=C2=A0 &amp;con=
n_cnt_thresh, 0,<br></div><div>&gt; +=C2=A0=C2=A0=C2=A0 &amp;conn_cnt_thres=
h, DEFAULT_CONNECTION_THESHOLD,<br></div><div>&gt;=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0 &quot;How many connections (below) make us use the callout based mec=
hanism&quot;);<br></div><div>&gt;=C2=A0 SYSCTL_INT(_net_inet_tcp_hpts, OID_=
AUTO, logging, CTLFLAG_RW,<br></div><div>&gt;=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
 &amp;hpts_does_tp_logging, 0,<br></div><div>&gt; @@ -1548,6 +1554,9 @@ __t=
cp_run_hpts(void)<br></div><div>&gt;=C2=A0 	struct tcp_hpts_entry *hpts;<br=
></div><div>&gt;=C2=A0 	int ticks_ran;<br></div><div>&gt;=C2=A0=C2=A0<br></=
div><div>&gt; +	if (counter_u64_fetch(hpts_that_need_softclock) =3D=3D 0)<b=
r></div><div>&gt; +		return;<br></div><div>&gt; +<br></div><div>&gt;=C2=A0 =
	hpts =3D tcp_choose_hpts_to_run();<br></div><div>&gt;=C2=A0=C2=A0<br></div=
><div>&gt;=C2=A0 	if (hpts-&gt;p_hpts_active) {<br></div><div>&gt; @@ -1683=
,6 +1692,13 @@ tcp_hpts_thread(void *ctx)<br></div><div>&gt;=C2=A0 	ticks_r=
an =3D tcp_hptsi(hpts, 1);<br></div><div>&gt;=C2=A0 	tv.tv_sec =3D 0;<br></=
div><div>&gt;=C2=A0 	tv.tv_usec =3D hpts-&gt;p_hpts_sleep_time * HPTS_TICKS=
_PER_SLOT;<br></div><div>&gt; +	if ((hpts-&gt;p_on_queue_cnt &gt; conn_cnt_=
thresh) &amp;&amp; (hpts-&gt;hit_callout_thresh =3D=3D 0)) {<br></div><div>=
&gt; +		hpts-&gt;hit_callout_thresh =3D 1;<br></div><div>&gt; +		counter_u6=
4_add(hpts_that_need_softclock, 1);<br></div><div>&gt; +	} else if ((hpts-&=
gt;p_on_queue_cnt &lt;=3D conn_cnt_thresh) &amp;&amp; (hpts-&gt;hit_callout=
_thresh =3D=3D 1)) {<br></div><div>&gt; +		hpts-&gt;hit_callout_thresh =3D =
0;<br></div><div>&gt; +		counter_u64_add(hpts_that_need_softclock, -1);<br>=
</div><div>&gt; +	}<br></div><div>&gt;=C2=A0 	if (hpts-&gt;p_on_queue_cnt &=
gt;=3D conn_cnt_thresh) {<br></div><div>&gt;=C2=A0 		if(hpts-&gt;p_direct_w=
ake =3D=3D 0) {<br></div><div>&gt;=C2=A0 			/*<br></div><div>&gt; @@ -1818,=
6 +1834,7 @@ tcp_hpts_mod_load(void)<br></div><div>&gt;=C2=A0 	cpu_top =3D =
NULL;<br></div><div>&gt;=C2=A0 #endif<br></div><div>&gt;=C2=A0 	tcp_pace.rp=
_num_hptss =3D ncpus;<br></div><div>&gt; +	hpts_that_need_softclock =3D cou=
nter_u64_alloc(M_WAITOK);<br></div><div>&gt;=C2=A0 	hpts_hopelessly_behind =
=3D counter_u64_alloc(M_WAITOK);<br></div><div>&gt;=C2=A0 	hpts_loops =3D c=
ounter_u64_alloc(M_WAITOK);<br></div><div>&gt;=C2=A0 	back_tosleep =3D coun=
ter_u64_alloc(M_WAITOK);<br></div><div>&gt; @@ -2042,6 +2059,7 @@ tcp_hpts_=
mod_unload(void)<br></div><div>&gt;=C2=A0 	free(tcp_pace.grps, M_TCPHPTS);<=
br></div><div>&gt;=C2=A0 #endif<br></div><div>&gt;=C2=A0=C2=A0<br></div><di=
v>&gt; +	counter_u64_free(hpts_that_need_softclock);<br></div><div>&gt;=C2=
=A0 	counter_u64_free(hpts_hopelessly_behind);<br></div><div>&gt;=C2=A0 	co=
unter_u64_free(hpts_loops);<br></div><div>&gt;=C2=A0 	counter_u64_free(back=
_tosleep);<br></div><div><br></div><div><br></div></blockquote><div><br></d=
iv></div></div></blockquote></div><br clear=3D"all"><br><span class=3D"gmai=
l_signature_prefix">-- </span><br><div dir=3D"ltr" class=3D"gmail_signature=
"><div dir=3D"ltr"><span style=3D"color:rgb(102,102,102)">Nuno Teixeira<br>=
FreeBSD Committer (ports)</span></div></div>

--000000000000986d170614b8f4a6--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAFDf7U%2Bsd0Hx6scDOvMuxOQaqVb8wi1ODrwT=uKDCqMn_81QEw>