Date:      Sun, 15 Jul 2012 14:23:30 -0700
From:      Kevin Oberman <kob6558@gmail.com>
To:        Lawrence Stewart <lstewart@freebsd.org>
Cc:        freebsd-net@freebsd.org, Andrew Gallatin <gallatin@cs.duke.edu>, Andrew Gallatin <gallatin@myri.com>
Subject:   Re: Major performance hit with ToS setting
Message-ID:  <CAN6yY1txqK18Z0xgFmVvNzy1d9R7oGwGcZfOFjAiWOoZwS%2Br9Q@mail.gmail.com>
In-Reply-To: <4FCBFFC8.8000402@freebsd.org>
References:  <CAN6yY1sLxFJ18ANO7nQqLetnJiT-K6pHC-X3yT1dWuWGa0VLUg@mail.gmail.com> <4FBF88CE.20209@cs.duke.edu> <CAN6yY1v%2Bvf=SW7WDGHxCkJtOdj8K3f450jNxFWK_Jc%2B-pFg0nA@mail.gmail.com> <4FC82D6C.4050309@freebsd.org> <CAN6yY1v08qk2VhXFg0Qiz-pMM6md2c_E_kEvA-oqbxuvSN1JDg@mail.gmail.com> <4FCBFFC8.8000402@freebsd.org>

On Sun, Jun 3, 2012 at 5:22 PM, Lawrence Stewart <lstewart@freebsd.org> wrote:
> On 06/03/12 15:18, Kevin Oberman wrote:
>>
>> On Fri, Jun 1, 2012 at 2:48 AM, Lawrence Stewart <lstewart@freebsd.org>
>> wrote:
>>>
>>> On 05/31/12 13:33, Kevin Oberman wrote:
>>> [snip]
>>>>
>>>>
>>>> I used SIFTR at the suggestion of Lawrence Stewart, who headed the
>>>> project to bring pluggable congestion algorithms to FreeBSD, and found
>>>> really odd congestion behavior. First, I do see a triple ACK, but the
>>>> congestion window suddenly drops from 73K to 8K. If I understand
>>>> CUBIC, it should halve the congestion window, which is not what is
>>>> happening. It then increases slowly (in slow start) to 82K. While the
>>>> slow-start bytes are INCREASING, the congestion window again goes to
>>>> 8K while the SS size moves from 36K up to 52K. It just continues to
>>>> bounce wildly between 8K (always the low point) and between 64K and
>>>> 82K. The swings start at 83K and, over the first few seconds, the
>>>> peaks drop to about 64K.
>>>
>>>
>>>
>>> Oh, and a comment about this behaviour. Dropping back to 8k (1MSS) is
>>> only nasty if the TF_{CONG|FAST}RECOVERY flags are *not* set, i.e. if
>>> you see cwnd grow, drop to 8k with those flags set, and then when the
>>> flags are unset, cwnd starts at the value of ssthresh, then that is
>>> perfectly normal recovery behaviour. What *is* nasty is if an RTO
>>> fires, which will reset cwnd to 8k, ssthresh to 2*MSS and make the
>>> connection effectively start from scratch again.
>>>
>>> There is evidence of RTOs in your siftr output, which is bad news, e.g.
>>> here's one example of 2 side-by-side log lines from your trace:
>>>
>>> # Direction,time,ssthresh,cwnd,flags
>>> i,1338319593.574706,27044,27044,1630544864
>>> o,1338319593.831482,16384,8192,1092625377
>>>
>>> Note the 300ms gap, and how cwnd resets to 1MSS and flags go from
>>> 1630544864
>>> (TF_WASCRECOVERY|TF_CONGRECOVERY|TF_WASFRECOVERY|TF_FASTRECOVERY) to
>>> 1092625377 (TF_WASCRECOVERY|TF_WASFRECOVERY).
>>
>>
>> What can I say but that you are right. When I looked at the interface
>> stats I found that the link overflow drops were through the roof! This
>> confuses me a bit since the traffic is outbound and I would assume
>> from the description on the Myricom web page that these are input
>> drops. A problem with that card? On systems that are
>> working "normally", I still see a sharp drop with the ToS bits set,
>> but nothing nearly as drastic. Now it is a drop from 4.5G to 728M on a
>> cross-country (US) circuit.
>>
>> I am now looking for issues on the route that might explain the
>> performance, but the question of why the drop-off only shows up in
>> FreeBSD 8 means something odd is still going on. It is even possible
>> that the problem is with 7 and the losses are due to the policy for
>> ToS 32 on the path. ToS 32 is less than best effort in our network.
>> Maybe the marking was getting lost on 7. Not likely, but possible.
>
>
> The receiver is FreeBSD 7? If so, have you tuned your reassembly queue on
> that machine? If not, that could explain the RTOs you're seeing. Send
> through the output of "sysctl net.inet.tcp.reass" and "netstat -sp tcp"
> obtained from the receiver immediately before and after running a short
> ToS=32 test.
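
As a rough illustration of the flag decoding in the quoted SIFTR lines above, the following minimal Python sketch tests the two logged flag values against the four recovery bits. The bit values are an assumption recalled from FreeBSD's sys/netinet/tcp_var.h and should be checked against the source tree in use; the helper name recovery_flags is hypothetical.

```python
# Recovery-related TCP flag bits.
# ASSUMPTION: values as recalled from FreeBSD's sys/netinet/tcp_var.h;
# verify them against the header for the release you are running.
TF_FLAGS = {
    "TF_FASTRECOVERY": 0x00100000,
    "TF_WASFRECOVERY": 0x00200000,
    "TF_CONGRECOVERY": 0x20000000,
    "TF_WASCRECOVERY": 0x40000000,
}


def recovery_flags(flags):
    """Return the sorted names of the recovery bits set in a SIFTR flags field."""
    return sorted(name for name, bit in TF_FLAGS.items() if flags & bit)


# The two flag values from the quoted trace lines.
for value in (1630544864, 1092625377):
    print(value, hex(value), recovery_flags(value))
```

Only the recovery bits are decoded here; the other bits set in both values are ignored by this sketch.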

I just wanted to let those kind enough to help with this know that I
have analyzed the problem and pretty much understand what is happening.

I've done a lot of testing to understand what is going on. First,
the problem is clearly tied to FreeBSD 8, but it is not anything wrong
with FreeBSD. Instead, it is a real fluke problem with the handling of
the DSCP and TOS bits by a single Juniper router when TSO is used. V7
did not support TOS, so v7 does not show the problem.

I have done packet captures on both ends, and something really strange
happens with TSO. I see a couple of large segments move normally.
Then things start getting weird. As soon as slow start allows things
to speed up just a bit, the second segment of a transfer is discarded
and TCP tries to recover. But with the long pipe (RTT is around 50 ms)
at 10G, there is a lot of data in the pipe when the problem is
detected and the duplicate ACK is sent. Actually, 7 or 8 are sent
before the transmitting system receives one and can start to recover.
Then it just happens again and again.
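
To put a number on "a lot of data in the pipe", here is a minimal sketch of the bandwidth-delay product for the figures mentioned above (10 Gb/s, roughly 50 ms RTT). It assumes the pipe is full at line rate, so it is an upper bound rather than a measurement; the function name bdp_bytes is hypothetical.

```python
def bdp_bytes(rate_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes, using integer arithmetic.

    bits in flight = rate * rtt; divide by 8 for bytes and by 1000
    because the RTT is given in milliseconds.
    """
    return rate_bits_per_s * rtt_ms // 8000


# ASSUMPTION: line rate of 10 Gb/s and an RTT of 50 ms, as quoted in the text.
in_flight = bdp_bytes(10_000_000_000, 50)
print(f"{in_flight} bytes (~{in_flight / 1e6:.1f} MB) can be in flight")
```

At these figures, about 62.5 MB can be outstanding before the sender learns of a loss.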

The root problem is a router that seems to be re-marking the ToS bits
from 0x20 to 0x24, which adds the "loss priority" bit. Even though
the circuit is not busy, TSO results in all segments being sent
"back-to-back" and, with the change in the IP Precedence bits, the
second packet gets dropped if ANY other traffic is present.
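
For reference, a minimal sketch that splits the two ToS byte values mentioned above into their standard fields: the 3-bit IP Precedence of RFC 791, and the 6-bit DSCP plus 2-bit ECN of RFC 2474 and RFC 3168. The helper name decode_tos is hypothetical. How the router interprets the extra bit as "loss priority" is vendor-specific and is not modeled here.

```python
def decode_tos(tos):
    """Split an IPv4 ToS byte into precedence, DSCP and ECN fields."""
    return {
        "precedence": (tos >> 5) & 0x7,  # top 3 bits (RFC 791)
        "dscp": (tos >> 2) & 0x3F,       # top 6 bits (RFC 2474)
        "ecn": tos & 0x3,                # low 2 bits (RFC 3168)
    }


before, after = 0x20, 0x24   # values reported in the message
changed_bits = before ^ after

print(f"0x{before:02x} -> {decode_tos(before)}")
print(f"0x{after:02x} -> {decode_tos(after)}")
print(f"bits changed by the re-marking: 0x{changed_bits:02x}")
```

Under this layout the two values differ only in bit 0x04, the lowest bit of the DSCP field.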

We have a ticket open with the router vendor and I hope that we can
get this resolved quickly, but I would not bet on it. In any case, it
is not a FreeBSD issue, though some of what I see makes me suspect that
our stack may not be responding well to this situation of massive loss
in large segments. But the losses are so severe that I am far from
certain and really can't expect anything but terrible results.

Again, thanks to Lawrence, Bjorn, and Andrew for their efforts to
look at this, and to the Wireshark folks, without whom I would
probably still be trying to understand what is going on.
-- 
R. Kevin Oberman, Network Engineer
E-mail: kob6558@gmail.com


