Date:      Wed, 7 Oct 2015 12:17:40 -0400
From:      Randall Stewart <rrs@netflix.com>
To:        FreeBSD Transports <transport@FreeBSD.org>
Subject:   The trouble with sack..
Message-ID:  <DA8A5844-8F11-42D5-B923-3F329203B867@netflix.com>

Greetings all:

Hiren and I have been poking a little bit at the TCP SACK implementation
in FreeBSD, and I think we have pretty much determined it's sub-optimal, to
phrase it nicely :-)

All the SACK scoreboard stuff works, but what we do with the scoreboard and
how we handle SACKs really does not match what the TCP RFCs say we should.

Here are a few examples (there are probably more that we will yet discover):

1) When we finally recognize it's time to Fast Retransmit, we shut the cwnd
    to 1 MTU. The SACK RFCs tell us to go to 1/2 of the previous cwnd (which
    is also stored in ssthresh).

2) When we see a dup-ack we *will not* recognize it as one if, for example,
    the rwnd changes, even if new SACK information is reported in the SACK
    blocks. This is due to the fact that in non-SACK you (on purpose) don't
    recognize ACKs where the window changed (since you can't really tell if
    it's a plain window update or a dup-ack). This means we occasionally miss
    out on stroking the dup-ack counter and getting out of recovery.

3) When we have more than one hole, the goal of SACK was to retransmit every
    time a hole had 3 dup-acks, so that one could recover multiple blocks that
    were lost. We just plain don't track dup-acks per hole. We do continue to
    count, but we will wait to retransmit anything until after we have drained
    at least 1/2 the data in flight from the network. And only then do we
    start incrementing cwnd (remember we crashed it to 1 MTU) so that we can
    retransmit. There may be some other twists in the code that we are
    missing, but this is what we believe (this code could probably win the C
    obfuscation contest if someone were willing to enter it :-D)

4) The way we calculate what is in flight with SACK is wrong; basically we
    don't arrive at what's really in flight, which with SACK you can know if
    you have a properly maintained scoreboard (which we do have).
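
To make points 1, 3 and 4 concrete, here is a minimal sketch of the kind of
bookkeeping the SACK RFCs describe. It is loosely modeled on the IsLost() and
SetPipe() routines of RFC 6675 and on the ssthresh reduction of RFC 5681
(which is expressed in terms of FlightSize rather than cwnd). It is not the
FreeBSD code: the struct layouts and every function name below are
illustrative assumptions, the scoreboard is assumed to be a sorted,
non-overlapping array of SACKed ranges above snd_una, and sequence-number
wraparound is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DUP_THRESH 3u  /* DupThresh in RFC 6675 terms */

/* One range [start, end) of sequence space that the peer has SACKed. */
struct sack_block {
    uint32_t start;
    uint32_t end;
};

/* Hypothetical scoreboard: sorted, non-overlapping blocks above snd_una. */
struct scoreboard {
    const struct sack_block *blocks;
    size_t nblocks;
};

/* Hypothetical sender state; field names are assumptions for this sketch. */
struct sender {
    uint32_t snd_una;   /* lowest unacknowledged byte */
    uint32_t snd_max;   /* one past the highest byte sent */
    uint32_t high_rxt;  /* one past the highest byte retransmitted,
                           equal to snd_una if nothing was retransmitted */
    uint32_t smss;      /* sender maximum segment size */
};

/*
 * Per-hole loss test: a hole ending at hole_end is considered lost once
 * DUP_THRESH SACK blocks, or more than (DUP_THRESH - 1) * SMSS SACKed
 * bytes, sit above it. Every byte of one hole sees the same blocks above
 * it, so evaluating once per hole is enough.
 */
static bool hole_is_lost(const struct scoreboard *sb, uint32_t hole_end,
                         uint32_t smss)
{
    size_t blocks_above = 0;
    uint32_t bytes_above = 0;

    for (size_t i = 0; i < sb->nblocks; i++) {
        if (sb->blocks[i].start >= hole_end) {
            blocks_above++;
            bytes_above += sb->blocks[i].end - sb->blocks[i].start;
        }
    }
    return blocks_above >= DUP_THRESH ||
           bytes_above > (DUP_THRESH - 1) * smss;
}

/* Contribution of one un-SACKed range [start, end) to the pipe estimate. */
static uint32_t hole_pipe(const struct scoreboard *sb, const struct sender *s,
                          uint32_t start, uint32_t end)
{
    uint32_t pipe = 0;

    if (start >= end)
        return 0;
    if (!hole_is_lost(sb, end, s->smss))
        pipe += end - start;  /* original copy presumed still in flight */
    if (s->high_rxt > start)  /* retransmitted copies are in flight too */
        pipe += (s->high_rxt < end ? s->high_rxt : end) - start;
    return pipe;
}

/* Estimate of bytes actually in the network, walking hole by hole. */
uint32_t sack_pipe(const struct scoreboard *sb, const struct sender *s)
{
    uint32_t pipe = 0;
    uint32_t cursor = s->snd_una;

    for (size_t i = 0; i < sb->nblocks; i++) {
        pipe += hole_pipe(sb, s, cursor, sb->blocks[i].start);
        cursor = sb->blocks[i].end;
    }
    return pipe + hole_pipe(sb, s, cursor, s->snd_max);
}

/* ssthresh on loss: max(FlightSize / 2, 2 * SMSS), instead of 1 MTU. */
uint32_t ssthresh_on_loss(uint32_t flight_size, uint32_t smss)
{
    uint32_t half = flight_size / 2;

    return half > 2 * smss ? half : 2 * smss;
}
```

With 10000 bytes outstanding, an SMSS of 1000, and SACK blocks covering
[1000,2000), [3000,4000) and [5000,6000), only the first hole has three
blocks above it, so it is the only one treated as lost and the estimate is
6000 bytes; once that hole has been retransmitted the estimate rises to 7000.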

Hiren and I have a few ideas on how to fix some of these, but I think we may
want to discuss first what Gleb talked about doing at BSD-Canada, at least so
I am told, which is to have each inpcb have a set of function pointers so we
can create "new" versions of, say, tcp_do_segment and tcp_output without
changing the original ones.

This way, as we develop fixes and improvements, we can keep the old code in
place without disrupting everyone, and then after everyone has vetted and
played with the "new" code we can switch things out :-)
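
As a rough illustration of the function-pointer idea, the sketch below gives
each connection a pointer to a table of entry points, so an "old" and a "new"
implementation can coexist and be selected per connection. Everything here is
hypothetical: struct conn stands in for the real control block, the
tcp_stack_ops table and its members are invented names, and the handler
bodies are reduced to the cwnd behavior from point 1 so the difference
between the two stacks is visible.

```c
#include <stddef.h>

struct conn;

/* Hypothetical table of replaceable entry points for one TCP "stack". */
struct tcp_stack_ops {
    const char *name;
    /* Returns how many bytes the stack would currently allow out. */
    unsigned int (*output)(struct conn *);
    /* Processes one inbound event; enter_recovery is a stand-in flag. */
    void (*do_segment)(struct conn *, int enter_recovery);
};

/* Stand-in for the per-connection control block. */
struct conn {
    const struct tcp_stack_ops *ops;  /* implementation used by this conn */
    unsigned int mss;
    unsigned int cwnd;
    unsigned int ssthresh;
};

static unsigned int common_output(struct conn *c)
{
    return c->cwnd;
}

/* "Old" behavior described in point 1: collapse cwnd to one segment. */
static void old_do_segment(struct conn *c, int enter_recovery)
{
    if (enter_recovery)
        c->cwnd = c->mss;
}

/* "New" behavior: drop cwnd to ssthresh (half the previous window). */
static void new_do_segment(struct conn *c, int enter_recovery)
{
    if (enter_recovery)
        c->cwnd = c->ssthresh;
}

static const struct tcp_stack_ops old_stack = {
    "default", common_output, old_do_segment
};
static const struct tcp_stack_ops new_stack = {
    "newsack", common_output, new_do_segment
};

/* Callers dispatch through the table and never name an implementation. */
void conn_input(struct conn *c, int enter_recovery)
{
    c->ops->do_segment(c, enter_recovery);
}

unsigned int conn_output(struct conn *c)
{
    return c->ops->output(c);
}

/* Switching a connection to another stack is a single pointer store. */
void conn_set_stack(struct conn *c, const struct tcp_stack_ops *ops)
{
    c->ops = ops;
}
```

The attraction of this layout is that the original functions stay untouched:
a connection keeps using old_stack unless something explicitly calls
conn_set_stack() with the new table.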

By the way, just looking around at NF and doing some quick surveys of SACK,
about 99% of NF connections seem to have SACK enabled, so it's pretty much
widely deployed now, and it's rare that we are *not* using the SACK cases in
our TCP stack.

Best wishes

R
--------
Randall Stewart
rrs@netflix.com
803-317-4952