Date:      Fri, 11 Sep 2020 00:05:54 -0400
From:      Liang Tian <l.tian.email@gmail.com>
To:        Randall Stewart <rrs@netflix.com>
Cc:        FreeBSD Transport <freebsd-transport@freebsd.org>
Subject:   Re: Fast recovery ssthresh value
Message-ID:  <CAJhigrhy1JeBvmUduvnyfGFd9cTgYSfgcP4kwR3RtMqEUdOhsQ@mail.gmail.com>
In-Reply-To: <A982EE58-1F2F-400B-B8AA-9B3B5523826B@netflix.com>
References:  <CAJhigrhbguXQzeYGfMtPRK03fp6KR65q8gjB9e9L-5tGGsuyzQ@mail.gmail.com> <SN4PR0601MB3728D1F8ABC9C86972B6C53886590@SN4PR0601MB3728.namprd06.prod.outlook.com> <CAJhigrjdRzK5fKpE9jTQM5p-wzKUBALK7Cc34_Qbi7HCZ_NCXw@mail.gmail.com> <SN4PR0601MB372817A4C0D80D981B1CE52586270@SN4PR0601MB3728.namprd06.prod.outlook.com> <A982EE58-1F2F-400B-B8AA-9B3B5523826B@netflix.com>

Hi Randall,

Yes, rack is definitely the next thing I would experiment with. We are
using the networking code in user space, and I was able to integrate the
default and bbr stacks. I still need to integrate rack (something is
off) and also solve some problems with timer granularity.
Thanks, and I'll probably come back with questions on rack soon :)

Regards,
Liang

On Thu, Sep 10, 2020 at 9:35 AM Randall Stewart <rrs@netflix.com> wrote:
>
> Liang:
>
> Or if you are on head, you can use rack, which not only
> has PRR built into it, but also has RACK and TLP as
> well.
>
> Of course it's only in head unless you want to go to the effort
> of back-porting it :)
>
> Note that NF uses this stack for all of its TCP connections in
> the Big-I (but of course we use Head too) :)
>
> R
>
> > On Sep 10, 2020, at 5:49 AM, Scheffenegger, Richard <Richard.Scheffenegger@netapp.com> wrote:
> >
> > Hi Liang,
> >
> > Yes, you are absolutely correct about this observation. The SACK loss
> > recovery will only send one MSS per received ACK right now - and when
> > there is ACK thinning present, it will fail to recover all the missing
> > packets in time, eventually receiving no more ACKs to clock out further
> > retransmissions...
> >
> > I have a Diff in review, to implement Proportional Rate Reduction:
> >
> > https://reviews.freebsd.org/D18892
> >
> > It should address not only that issue with ACK thinning, but also the
> > issue that the current SACK loss recovery has to wait until pipe drops
> > below ssthresh before the retransmissions are clocked out. And then
> > they would actually be clocked out at the same rate as the incoming
> > ACKs. This would be the same rate as when the overload happened
> > (barring any ACK thinning), and as a secondary effect, it was observed
> > that this behavior too can lead to self-inflicted loss - of
> > retransmissions.
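> >
> > For reference, here is a rough sketch of the per-ACK send-quota
> > computation from RFC 6937 (the SSRB variant); all names below are
> > illustrative only and do not match the actual code in D18892:
> >
> > #include <sys/param.h>   /* howmany(), MAX(), MIN() */
> >
> > /*
> >  * pipe, ssthresh, maxseg: as usual; recover_fs: FlightSize at the start
> >  * of recovery; prr_delivered / prr_out: bytes delivered to the receiver
> >  * resp. sent by us since recovery started; delivered_this_ack: bytes
> >  * newly acked or SACKed by the ACK being processed.
> >  */
> > static int
> > prr_sndcnt(int pipe, int ssthresh, int recover_fs, int prr_delivered,
> >     int prr_out, int delivered_this_ack, int maxseg)
> > {
> >     int limit, sndcnt;
> >
> >     if (pipe > ssthresh) {
> >         /* Reduction phase: pace sends proportionally to delivered
> >          * data, instead of one segment per ACK. */
> >         sndcnt = howmany(prr_delivered * ssthresh, recover_fs) - prr_out;
> >     } else {
> >         /* Catch-up phase: grow back toward ssthresh, bounded by what
> >          * has actually left the network, plus one MSS (SSRB). */
> >         limit = MAX(prr_delivered - prr_out, delivered_this_ack) + maxseg;
> >         sndcnt = MIN(ssthresh - pipe, limit);
> >     }
> >     return (MAX(sndcnt, 0));
> > }
> >
> > The point being that the amount clocked out per ACK tracks the bytes
> > reported as delivered (ACKed or SACKed), so ACK thinning reduces the
> > number of ACKs but not the total retransmission budget.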
> >
> > If you have the ability to patch your kernel with D18892 and observe
> > how it reacts in your dramatic ACK-thinning scenario, that would be
> > good to know! The assumption of the patch was that - as per TCP RFC
> > requirements - there is one ACK for each received out-of-sequence data
> > segment, and that ACK drops / thinning do not happen on the massive
> > scale you describe.
> >
> > Best regards,
> >
> > Richard Scheffenegger
> >
> > -----Original Message-----
> > From: owner-freebsd-transport@freebsd.org <owner-freebsd-transport@freebsd.org> On Behalf Of Liang Tian
> > Sent: Wednesday, 9 September 2020 19:16
> > To: Scheffenegger, Richard <Richard.Scheffenegger@netapp.com>
> > Cc: FreeBSD Transport <freebsd-transport@freebsd.org>
> > Subject: Re: Fast recovery ssthresh value
> >
> > Hi Richard,
> >
> > Thanks for the explanation and sorry for the late reply.
> > I've been investigating SACK loss recovery and I think I'm seeing an
> > issue similar to the ABC L value issue that I reported previously
> > (https://reviews.freebsd.org/D26120), and I do believe there is a
> > deviation from RFC3517:
> > The issue happens when a DupAck is received during SACK loss recovery
> > in the presence of ACK thinning or a receiver enabling LRO, which means
> > the SACK block edges could expand by more than 1 SMSS (we've seen
> > 30*SMSS), i.e. a single DupAck could decrement `pipe` by more than
> > 1 SMSS.
> > In RFC3517:
> >     (C) If cwnd - pipe >= 1 SMSS, the sender SHOULD transmit one or
> >         more segments...
> >         (C.5) If cwnd - pipe >= 1 SMSS, return to (C.1)
> > So based on the RFC, the sender should be able to send more segments
> > when such a DupAck is received, because of the big change to `pipe`.
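> >
> > In other words (an illustrative helper, not RFC or kernel code), rule
> > (C) keeps clocking out segments as long as the cwnd - pipe budget
> > allows:
> >
> > /* How many segments rule (C)/(C.5) would permit on a given ACK. */
> > static int
> > rule_c_segments(int cwnd, int pipe, int smss)
> > {
> >     int n = 0;
> >
> >     while (cwnd - pipe >= smss) {
> >         n++;            /* (C.1)-(C.3): pick and transmit one segment   */
> >         pipe += smss;   /* (C.4): the transmission adds one SMSS to pipe */
> >     }                   /* (C.5): repeat while cwnd - pipe >= 1 SMSS     */
> >     return (n);
> > }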
> >
> > In the current implementation, the cwin variable, which controls the
> > amount of data that can be transmitted based on the new information, is
> > dictated by snd_cwnd. snd_cwnd is incremented by 1 SMSS for each DupAck
> > received. I believe this effectively limits the retransmission
> > triggered by each DupAck to 1 SMSS - a deviation.
> > 307         cwin =
> > 308             imax(min(tp->snd_wnd, tp->snd_cwnd) - sack_bytes_rxmt, 0);
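> >
> > To make the asymmetry concrete, a toy calculation (the numbers and
> > names are made up for illustration; this is not the kernel code):
> >
> > #include <stdio.h>
> >
> > int
> > main(void)
> > {
> >     int maxseg = 1460;
> >     int sacked_by_this_ack = 30 * maxseg;   /* SACK edges jumped 30 SMSS */
> >
> >     /* RFC3517: pipe drops by what was newly SACKed, so the cwnd - pipe
> >      * budget opens up by the same amount. */
> >     int rfc3517_delta = sacked_by_this_ack;
> >
> >     /* Current code: snd_cwnd += maxseg per DupAck, and cwin is derived
> >      * from snd_cwnd, so roughly one SMSS of retransmission is unlocked. */
> >     int cwin_delta = maxseg;
> >
> >     printf("newly allowed per RFC3517: %d bytes (%d segments)\n",
> >         rfc3517_delta, rfc3517_delta / maxseg);
> >     printf("newly allowed via cwin:    %d bytes (%d segment)\n",
> >         cwin_delta, cwin_delta / maxseg);
> >     return (0);
> > }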
> >
> > As a result, SACK is not doing enough recovery in this scenario and
> > loss has to be recovered by RTO.
> > Again, I'd appreciate feedback from the community.
> >
> > Regards,
> > Liang Tian
> >
> >
> >
> >
> > On Sun, Aug 23, 2020 at 3:56 PM Scheffenegger, Richard <Richard.Scheffenegger@netapp.com> wrote:
> >>
> >> Hi Liang,
> >>
> >> In SACK loss recovery, you can recover up to ssthresh (prior cwnd/2 [or
> >> 70% in case of cubic]) lost bytes - at least in theory.
> >>
> >> In comparison, (New)Reno can only recover one lost packet per window,
> >> and then keeps on transmitting new segments (ack + cwnd), even before
> >> the receipt of the retransmitted packet is acked.
> >>
> >> For historic reasons, the semantics of the variable cwnd are overloaded
> >> during loss recovery, and it doesn't "really" indicate cwnd, but rather
> >> indicates if/when retransmissions can happen.
> >>
> >>
> >> In both cases (also the simple one, with only one packet loss), cwnd
> >> should be equal (or nearly equal) to ssthresh by the time loss recovery
> >> is finished - but NOT before! While it may appear like slow-start, the
> >> value of the cwnd variable really increases by acked_bytes only per ACK
> >> (not acked_bytes + SMSS), since the left edge (snd_una) doesn't move
> >> right - unlike during slow-start. But numerically, these different
> >> phases (slow-start / SACK loss recovery) may appear very similar.
> >>
> >> You could check this using the (loadable) SIFTR module, which captures
> >> t_flags (indicating if cong/loss recovery is active), ssthresh, cwnd,
> >> and other parameters.
> >>
> >> That is at least how things are supposed to work; or have you
> >> investigated the timing and behavior of SACK loss recovery and found a
> >> deviation from RFC3517? Note that FBSD currently has not fully
> >> implemented RFC6675 support (which deviates slightly from 3517 under
> >> specific circumstances; I have a patch pending to implement 6675 rescue
> >> retransmissions, but haven't tweaked the other aspects of 6675 vs. 3517).
> >>
> >> BTW: While freebsd-net is not the wrong DL per se, TCP, UDP, SCTP
> >> specific questions can also be posted to freebsd-transport, which is
> >> more narrowly focused.
> >>
> >> Best regards,
> >>
> >> Richard Scheffenegger
> >>
> >> -----Original Message-----
> >> From: owner-freebsd-net@freebsd.org <owner-freebsd-net@freebsd.org> On
> >> Behalf Of Liang Tian
> >> Sent: Sunday, 23 August 2020 00:14
> >> To: freebsd-net <freebsd-net@freebsd.org>
> >> Subject: Fast recovery ssthresh value
> >>
> >> Hi all,
> >>
> >> When 3 dupacks are received and TCP enters fast recovery, if SACK is
> >> used, the CWND is set to maxseg:
> >>
> >> 2593                     if (tp->t_flags & TF_SACK_PERMIT) {
> >> 2594                         TCPSTAT_INC(
> >> 2595                             tcps_sack_recovery_episode);
> >> 2596                         tp->snd_recover = tp->snd_nxt;
> >> 2597                         tp->snd_cwnd = maxseg;
> >> 2598                         (void) tp->t_fb->tfb_tcp_output(tp);
> >> 2599                         goto drop;
> >> 2600                     }
> >>
> >> Otherwise (SACK is not in use), CWND is set to maxseg before
> >> tcp_output() and then set back to snd_ssthresh+inflation
> >> 2601                     tp->snd_nxt = th->th_ack;
> >> 2602                     tp->snd_cwnd = maxseg;
> >> 2603                     (void) tp->t_fb->tfb_tcp_output(tp);
> >> 2604                     KASSERT(tp->snd_limited <= 2,
> >> 2605                         ("%s: tp->snd_limited too big",
> >> 2606                         __func__));
> >> 2607                     tp->snd_cwnd = tp->snd_ssthresh +
> >> 2608                          maxseg *
> >> 2609                          (tp->t_dupacks - tp->snd_limited);
> >> 2610                     if (SEQ_GT(onxt, tp->snd_nxt))
> >> 2611                         tp->snd_nxt = onxt;
> >> 2612                     goto drop;
> >>
> >> I'm wondering, in the SACK case, should CWND be set back to ssthresh
> >> (which has been slashed in cc_cong_signal() a few lines above) before
> >> line 2599, like in the non-SACK case, instead of doing slow start from
> >> maxseg?
> >> I read rfc6675 and a few others, and it looks like that's the case. I
> >> appreciate your opinion, again.
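> >>
> >> Just to illustrate what I mean (not a tested patch; the line number
> >> refers to the snippet above), the SACK branch would then look roughly
> >> like:
> >>
> >> 2597                         tp->snd_cwnd = tp->snd_ssthresh;
> >>
> >> i.e. starting the SACK retransmissions from ssthresh rather than
> >> growing snd_cwnd from a single maxseg.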
> >>
> >> Thanks,
> >> Liang
> >> _______________________________________________
> >> freebsd-net@freebsd.org mailing list
> >> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> >> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
> > _______________________________________________
> > freebsd-transport@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-transport
> > To unsubscribe, send any mail to "freebsd-transport-unsubscribe@freebsd.org"
>
> ------
> Randall Stewart
> rrs@netflix.com
>
>
>


