Date: Thu, 10 Sep 2020 09:35:43 -0400
From: Randall Stewart <rrs@netflix.com>
To: l.tian.email@gmail.com
Cc: FreeBSD Transport <freebsd-transport@freebsd.org>
Subject: Re: Fast recovery ssthresh value
Message-ID: <A982EE58-1F2F-400B-B8AA-9B3B5523826B@netflix.com>
In-Reply-To: <SN4PR0601MB372817A4C0D80D981B1CE52586270@SN4PR0601MB3728.namprd06.prod.outlook.com>
References: <CAJhigrhbguXQzeYGfMtPRK03fp6KR65q8gjB9e9L-5tGGsuyzQ@mail.gmail.com>
 <SN4PR0601MB3728D1F8ABC9C86972B6C53886590@SN4PR0601MB3728.namprd06.prod.outlook.com>
 <CAJhigrjdRzK5fKpE9jTQM5p-wzKUBALK7Cc34_Qbi7HCZ_NCXw@mail.gmail.com>
 <SN4PR0601MB372817A4C0D80D981B1CE52586270@SN4PR0601MB3728.namprd06.prod.outlook.com>
Liang:

Or if you are on head, you can use rack, which not only has PRR built into it, but also has Rack and TLP as well. Of course it's only in head unless you want to go to the effort of back-porting it :)

Note that NF uses this stack for all of its TCP connections in the Big-I (but of course we use head too) :)

R

> On Sep 10, 2020, at 5:49 AM, Scheffenegger, Richard <Richard.Scheffenegger@netapp.com> wrote:
>
> Hi Liang,
>
> Yes, you are absolutely correct about this observation. The SACK loss recovery will only send one MSS per received ACK right now - and when there is ACK thinning present, will fail to recover all the missing packets in time, eventually receiving no more ACKs to clock out more retransmissions...
>
> I have a Diff in review, to implement Proportional Rate Reduction:
>
> https://reviews.freebsd.org/D18892
>
> Which should address not only that issue about ACK thinning, but also the issue that current SACK loss recovery has to wait until pipe drops below ssthresh, before the retransmissions are clocked out. And then, they would actually be clocked out at the same rate as the incoming ACKs. This would be the same rate as when the overload happened (barring any ACK thinning), and as a secondary effect, it was observed that this behavior too can lead to self-inflicted loss - of retransmissions.
>
> If you have the ability to patch your kernel with D18892 and observe how the reaction is in your dramatic ACK thinning scenario, that would be good to know! The assumption of the patch was that - as per TCP RFC requirements - there is one ACK for each received out-of-sequence data segment, and ACK drops / thinning are not happening on such a massive scale as you describe.
>
> Best regards,
>
> Richard Scheffenegger
>
> -----Original Message-----
> From: owner-freebsd-transport@freebsd.org <owner-freebsd-transport@freebsd.org> On Behalf Of Liang Tian
> Sent: Wednesday, September 9, 2020 19:16
> To: Scheffenegger, Richard <Richard.Scheffenegger@netapp.com>
> Cc: FreeBSD Transport <freebsd-transport@freebsd.org>
> Subject: Re: Fast recovery ssthresh value
>
> Hi Richard,
>
> Thanks for the explanation and sorry for the late reply.
> I've been investigating SACK loss recovery and I think I'm seeing an issue similar to the ABC L value issue that I reported previously (https://reviews.freebsd.org/D26120) and I do believe there is a deviation from RFC3517:
> The issue happens when a DupAck is received during SACK loss recovery in the presence of ACK thinning or a receiver enabling LRO, which means the SACK block edges could expand by more than 1 SMSS (we've seen 30*SMSS), i.e. a single DupAck could decrement `pipe` by more than 1 SMSS.
> In RFC3517,
> (C) If cwnd - pipe >= 1 SMSS, the sender SHOULD transmit one or more segments...
> (C.5) If cwnd - pipe >= 1 SMSS, return to (C.1)
> So based on the RFC, the sender should be able to send more segments if such a DupAck is received, because of the big change to `pipe`.
>
> In the current implementation, the cwin variable, which controls the amount of data that can be transmitted based on the new information, is dictated by snd_cwnd. The snd_cwnd is incremented by 1 SMSS for each DupAck received. I believe this effectively limits the retransmission triggered by each DupAck to 1 SMSS - a deviation.
>     cwin =
>         imax(min(tp->snd_wnd, tp->snd_cwnd) - sack_bytes_rxmt, 0);
>
> As a result, SACK is not doing enough recovery in this scenario and loss has to be recovered by RTO.
> Again, I'd appreciate feedback from the community.
>
> Regards,
> Liang Tian
>
> On Sun, Aug 23, 2020 at 3:56 PM Scheffenegger, Richard <Richard.Scheffenegger@netapp.com> wrote:
>>
>> Hi Liang,
>>
>> In SACK loss recovery, you can recover up to ssthresh (prior cwnd/2 [or 70% in case of cubic]) lost bytes - at least in theory.
>>
>> In comparison, (New)Reno can only recover one lost packet per window, and then keeps on transmitting new segments (ack + cwnd), even before the receipt of the retransmitted packet is acked.
>>
>> For historic reasons, the semantics of the variable cwnd are overloaded during loss recovery, and it doesn't "really" indicate cwnd, but rather indicates if/when retransmissions can happen.
>>
>> In both cases (also the simple one, with only one packet loss), cwnd should be equal (or near equal) to ssthresh by the time loss recovery is finished - but NOT before! While it may appear like slow-start, the value of the cwnd variable really increases by acked_bytes only per ACK (not acked_bytes + SMSS), since the left edge (snd_una) doesn't move right - unlike during slow-start. But numerically, these different phases (slow-start / SACK loss recovery) may appear very similar.
>>
>> You could check this using the (loadable) SIFTR module, which captures t_flags (indicating if cong/loss recovery is active), ssthresh, cwnd, and other parameters.
>>
>> That is at least how things are supposed to work; or have you investigated the timing and behavior of SACK loss recovery and found a deviation from RFC3517? Note that FBSD currently has not fully implemented RFC6675 support (which deviates slightly from 3517 under specific circumstances; I have a patch pending to implement 6675 rescue retransmissions, but haven't tweaked the other aspects of 6675 vs. 3517).
>>
>> BTW: While freebsd-net is not the wrong DL per se, TCP, UDP, SCTP specific questions can also be posted to freebsd-transport, which is more narrowly focused.
>>
>> Best regards,
>>
>> Richard Scheffenegger
>>
>> -----Original Message-----
>> From: owner-freebsd-net@freebsd.org <owner-freebsd-net@freebsd.org> On Behalf Of Liang Tian
>> Sent: Sunday, August 23, 2020 00:14
>> To: freebsd-net <freebsd-net@freebsd.org>
>> Subject: Fast recovery ssthresh value
>>
>> Hi all,
>>
>> When 3 dupacks are received and TCP enters fast recovery, if SACK is used, the CWND is set to maxseg:
>>
>> 2593     if (tp->t_flags & TF_SACK_PERMIT) {
>> 2594         TCPSTAT_INC(
>> 2595             tcps_sack_recovery_episode);
>> 2596         tp->snd_recover = tp->snd_nxt;
>> 2597         tp->snd_cwnd = maxseg;
>> 2598         (void) tp->t_fb->tfb_tcp_output(tp);
>> 2599         goto drop;
>> 2600     }
>>
>> Otherwise (SACK is not in use), CWND is set to maxseg before tcp_output() and then set back to snd_ssthresh+inflation:
>>
>> 2601     tp->snd_nxt = th->th_ack;
>> 2602     tp->snd_cwnd = maxseg;
>> 2603     (void) tp->t_fb->tfb_tcp_output(tp);
>> 2604     KASSERT(tp->snd_limited <= 2,
>> 2605         ("%s: tp->snd_limited too big",
>> 2606         __func__));
>> 2607     tp->snd_cwnd = tp->snd_ssthresh +
>> 2608         maxseg *
>> 2609         (tp->t_dupacks - tp->snd_limited);
>> 2610     if (SEQ_GT(onxt, tp->snd_nxt))
>> 2611         tp->snd_nxt = onxt;
>> 2612     goto drop;
>>
>> I'm wondering, in the SACK case, should CWND be set back to ssthresh (which has been slashed in cc_cong_signal() a few lines above) before line 2599, like the non-SACK case, instead of doing slow start from maxseg?
>> I read rfc6675 and a few others, and it looks like that's the case. I appreciate your opinion, again.
>>
>> Thanks,
>> Liang
>> _______________________________________________
>> freebsd-net@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
> _______________________________________________
> freebsd-transport@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-transport
> To unsubscribe, send any mail to "freebsd-transport-unsubscribe@freebsd.org"

------
Randall Stewart
rrs@netflix.com
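
The gap Liang describes in the quoted thread, between the cwin expression and RFC 3517 step (C), can be made concrete with a small standalone C sketch. This is not FreeBSD code: the helper names cwin_budget() and rfc3517_segments(), the SMSS value of 1460, and the scenario numbers in the note below are assumptions chosen for illustration. Only the cwin expression and the "cwnd - pipe >= 1 SMSS" condition are taken from the thread.

```c
#include <assert.h>

#define SMSS 1460	/* assumed segment size, for illustration only */

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/*
 * Send budget in bytes, following the expression quoted in the thread:
 *   cwin = imax(min(snd_wnd, snd_cwnd) - sack_bytes_rxmt, 0)
 * The budget depends on snd_cwnd, not on how far pipe has dropped.
 */
static int
cwin_budget(int snd_wnd, int snd_cwnd, int sack_bytes_rxmt)
{
	return imax(imin(snd_wnd, snd_cwnd) - sack_bytes_rxmt, 0);
}

/*
 * RFC 3517 step (C) as quoted in the thread: keep sending one segment
 * at a time while cwnd - pipe >= 1 SMSS. Returns the segment count.
 * (Hypothetical helper; the real step also selects which sequence
 * range to send, which is omitted here.)
 */
static int
rfc3517_segments(int cwnd, int pipe, int smss)
{
	int sent = 0;

	assert(smss > 0);	/* guard against a non-terminating loop */
	while (cwnd - pipe >= smss) {
		pipe += smss;	/* each transmitted segment adds to pipe */
		sent++;
	}
	return sent;
}
```

As a hypothetical scenario: recovery starts with snd_cwnd at one segment, one segment has been retransmitted, and one DupAck has raised snd_cwnd to 2 * SMSS. Then cwin_budget(65535, 2 * SMSS, SMSS) allows exactly one more segment, no matter how much data that DupAck newly SACKed. If the same DupAck lets pipe fall to 10 * SMSS against a cwnd of 20 * SMSS, rfc3517_segments(20 * SMSS, 10 * SMSS, SMSS) allows 10 segments.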