Date:      Sat, 13 Mar 2010 04:18:30 -0800
From:      Steven Noonan <steven@uplinklabs.net>
To:        yongari@freebsd.org
Cc:        freebsd-net@freebsd.org
Subject:   Re: kern/144689: [re] TCP transfer corruption using if_re
Message-ID:  <f488382f1003130418s116e9c1frfd210db4127b4a9@mail.gmail.com>
In-Reply-To: <f488382f1003122157i12968043h31c8020007f7e8a1@mail.gmail.com>
References:  <201003121754.o2CHsH7V065932@freefall.freebsd.org> <f488382f1003121619y17780ed9x52765b9a9133fb2@mail.gmail.com> <f488382f1003121624j34a8aee8kc127e82c08c3fe37@mail.gmail.com> <f488382f1003122157i12968043h31c8020007f7e8a1@mail.gmail.com>

On Fri, Mar 12, 2010 at 9:57 PM, Steven Noonan <steven@uplinklabs.net> wrote:
> On Fri, Mar 12, 2010 at 4:24 PM, Steven Noonan <steven@uplinklabs.net> wrote:
>> On Fri, Mar 12, 2010 at 4:19 PM, Steven Noonan <steven@uplinklabs.net> wrote:
>>> On Fri, Mar 12, 2010 at 9:54 AM, <yongari@freebsd.org> wrote:
>>>> Synopsis: [re] TCP transfer corruption using if_re
>>>>
>>>> State-Changed-From-To: open->feedback
>>>> State-Changed-By: yongari
>>>> State-Changed-When: Fri Mar 12 17:53:37 UTC 2010
>>>> State-Changed-Why:
>>>> This looks like Rx checksum offloading issue. Would you try
>>>> disabling Rx checksum offloading and test it again?
>>>> #ifconfig re0 -rxcsum
>>>> Also show me dmesg output(re(4) related part).
>>>>
>>>>
>>>> Responsible-Changed-From-To: freebsd-net->yongari
>>>> Responsible-Changed-By: yongari
>>>> Responsible-Changed-When: Fri Mar 12 17:53:37 UTC 2010
>>>> Responsible-Changed-Why:
>>>> Mine.
>>>>
>>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=144689
>>>>
>>>
>>> Hmm. Disabling Rx checksum offloading helped for one clone process,
>>> but then this showed up in dmesg during my second test (it seems to be
>>> doing this regularly for some reason):
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>>
>>> And no, the cable isn't loose or something. It just decides to take
>>> the interface down and put it back up.
>>>
>>> Here's the rest of 'dmesg | grep re0':
>>>
>>> firewire0: <IEEE1394(FireWire) bus> on fwohci0
>>> dcons_crom0: <dcons configuration ROM> on firewire0
>>> fwe0: <Ethernet over FireWire> on firewire0
>>> fwip0: <IP over FireWire> on firewire0
>>> firewire0: 1 nodes, maxhop <= 0 cable IRM irm(0)  (me)
>>> firewire0: bus manager 0
>>> re0: <RealTek 8169/8169S/8169SB(L)/8110S/8110SB(L) Gigabit Ethernet>
>>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
>>> cardbus0
>>> re0: Chip rev. 0x10000000
>>> re0: MAC rev. 0x00000000
>>> miibus1: <MII bus> on re0
>>> re0: Ethernet address: 00:18:4d:6e:c0:29
>>> re0: [FILTER]
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: detached
>>> re0: <RealTek 8169/8169S/8169SB(L)/8110S/8110SB(L) Gigabit Ethernet>
>>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
>>> cardbus0
>>> re0: Chip rev. 0x10000000
>>> re0: MAC rev. 0x00000000
>>> miibus1: <MII bus> on re0
>>> re0: Ethernet address: 00:18:4d:6e:c0:29
>>> re0: [FILTER]
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>>
>>> - Steven
>>>
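To put numbers on how often the interface flaps, the dmesg lines above can be tallied with a short script. This is only a sketch: the function name `tally_re0_events` and the counter keys are made up for illustration, and the matching assumes the exact message strings shown in the log above.

```python
from collections import Counter


def tally_re0_events(lines, ifname="re0"):
    """Count link flaps and PHY read failures for one interface.

    `lines` is any iterable of dmesg lines; leading e-mail quote
    markers ('>') and spaces are stripped so quoted logs also work.
    """
    counts = Counter()
    prefix = ifname + ": "
    for raw in lines:
        line = raw.lstrip("> ").strip()
        if not line.startswith(prefix):
            continue  # some other device (miibus1, firewire0, ...)
        msg = line[len(prefix):]
        if msg == "link state changed to DOWN":
            counts["link_down"] += 1
        elif msg == "link state changed to UP":
            counts["link_up"] += 1
        elif msg == "PHY read failed":
            counts["phy_read_failed"] += 1
        elif msg == "detached":
            counts["detached"] += 1
    return counts
```

Feeding it the output of 'dmesg' (for example by passing `sys.stdin` as `lines`) before and after a test run gives a rough measure of whether a change such as '-rxcsum' affects the flapping.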
>>
>> I should note that the connection was _lost_ during the second test
>> above.
>>
>> I also tested again, and it looks like it added another "re0: PHY read
>> failed" before silently dropping the connection.
>>
>> - Steven
>>
>
> I did a couple captures with Wireshark on the client end. One is with
> rxcsum enabled on the machine running git-daemon, one is without
> rxcsum.
>
> http://www.uplinklabs.net/~tycho/files/git-cap-norxcsum.bz2
> http://www.uplinklabs.net/~tycho/files/git-cap.bz2
>
> Obviously, you can look at the data yourself and make more sense of
> it, but here are things I noticed in the captures:
>
> With rxcsum:
> - There are some silent problems that occur in the middle of the
> capture. Client-to-server: 'TCP ACKed lost segment' a few times, then
> 'TCP previous segment lost'. This happens multiple times during the
> capture (before 'git-upload-pack' starts sending data).
> - Occasional 'TCP window update's. These are highlighted in black for
> reasons unknown to me. It seems like this would be normal.
> - The server calls 'git-upload-pack' and we start seeing a large
> number of client-to-server TCP RST flags being sent and then the
> connection gets closed due to some detected data corruption in the
> transfer.
>
> Without rxcsum:
> - About the same number of client-to-server 'TCP ACKed lost segment's.
> - 'git-upload-pack' kicks in and things get _really_ hairy. 'TCP Dup
> ACK' detected by the client many many times.
> - Finally, a series of 'TCP retransmission's from server to client
> happen (which is where the connection hangs).
> - I closed the connection which triggered the final two 'TCP RST's.
>
> Also, I forgot to note in my original report that I checked if there
> was packet loss by using a ping flood, and one packet in the 1.5
> million packets sent was lost. But I'm not sure whether it's
> checksumming the data of these packets, so they could be coming back
> with perfectly valid ICMP headers but corrupted data.
>
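On the ping question: the ICMP checksum is defined over the whole ICMP message, starting at the type field, so it does include the echo data and not just the header. Whether a corrupted reply gets dropped or reported on this particular setup is something I have not verified. Below is a minimal sketch of the RFC 1071 Internet checksum applied to an echo request; the helper names (`internet_checksum`, `build_echo_request`, `checksum_ok`) are mine and purely illustrative.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement of the one's-complement
    sum of the data taken as 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """Build an ICMP echo request (type 8, code 0) with a valid checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    # The checksum is computed over header AND payload.
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload


def checksum_ok(message: bytes) -> bool:
    """A message with a correct checksum field sums to zero."""
    return internet_checksum(message) == 0
```

A single flipped bit anywhere in the payload makes `checksum_ok` fail. The checksum is weak, though: swapping two aligned 16-bit words goes unnoticed, so a clean ping flood does not prove the data path is intact.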

Also, hilariously horrible hack:

- On the server machine, start git-daemon listening on 127.0.0.1:9418.
- On the server machine, run 'ssh -L <public IP>:9418:127.0.0.1:9418
user@localhost'.

Then remote git clones work as expected. Very strange. It will have to
do until I get a less insane solution.
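For anyone wondering what the hack changes: the remote client's TCP connection now ends at the forwarding process, and git-daemon only ever talks over loopback. The sketch below is a plain TCP relay with the same topology. It is not how ssh is implemented (no encryption, no authentication), and the names `start_relay`, `serve`, `handle`, and `pump` are made up for illustration.

```python
import socket
import threading


def pump(src, dst):
    """Copy bytes from src to dst until src hits EOF, then half-close dst."""
    try:
        while True:
            chunk = src.recv(65536)
            if not chunk:
                break
            dst.sendall(chunk)
    except OSError:
        pass  # peer went away; fall through to the half-close
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def handle(client, target):
    """Relay one accepted connection to the target address."""
    try:
        upstream = socket.create_connection(target)
    except OSError:
        client.close()
        return
    t = threading.Thread(target=pump, args=(client, upstream), daemon=True)
    t.start()
    pump(upstream, client)
    t.join()
    client.close()
    upstream.close()


def serve(listener, target):
    """Accept connections until the listening socket is closed."""
    while True:
        try:
            client, _ = listener.accept()
        except OSError:
            return
        threading.Thread(target=handle, args=(client, target),
                         daemon=True).start()


def start_relay(listen_addr, target):
    """Listen on listen_addr and forward each connection to target."""
    listener = socket.create_server(listen_addr)
    threading.Thread(target=serve, args=(listener, target),
                     daemon=True).start()
    return listener
```

Something like `start_relay(("<public IP>", 9418), ("127.0.0.1", 9418))` would mirror the ssh command above. If clones also succeed through a relay like this, that would suggest the difference lies in how the process facing re0 uses its socket (write sizes, options, timing) and not in anything ssh-specific, but that is a guess until tested.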

I don't understand why it makes a difference. Is git-daemon using TCP
socket options that cause this network interface driver to
malfunction?

- Steven


