Date:      Tue, 16 Mar 2010 12:31:22 -0700
From:      Steven Noonan <steven@uplinklabs.net>
To:        pyunyh@gmail.com
Cc:        freebsd-net@freebsd.org, bug-followup@freebsd.org, yongari@freebsd.org
Subject:   Re: kern/144689: [re] TCP transfer corruption using if_re
Message-ID:  <f488382f1003161231s2fbd7d39yf615941d028c18e8@mail.gmail.com>
In-Reply-To: <20100316182322.GF2001@michelle.cdnetworks.com>
References:  <201003121754.o2CHsH7V065932@freefall.freebsd.org> <f488382f1003121619y17780ed9x52765b9a9133fb2@mail.gmail.com> <f488382f1003121624j34a8aee8kc127e82c08c3fe37@mail.gmail.com> <f488382f1003122157i12968043h31c8020007f7e8a1@mail.gmail.com> <f488382f1003130418s116e9c1frfd210db4127b4a9@mail.gmail.com> <20100316182322.GF2001@michelle.cdnetworks.com>

On Tue, Mar 16, 2010 at 11:23 AM, Pyun YongHyeon <pyunyh@gmail.com> wrote:
> On Sat, Mar 13, 2010 at 04:18:30AM -0800, Steven Noonan wrote:
>> On Fri, Mar 12, 2010 at 9:57 PM, Steven Noonan <steven@uplinklabs.net> wrote:
>> > On Fri, Mar 12, 2010 at 4:24 PM, Steven Noonan <steven@uplinklabs.net> wrote:
>> >> On Fri, Mar 12, 2010 at 4:19 PM, Steven Noonan <steven@uplinklabs.net> wrote:
>> >>> On Fri, Mar 12, 2010 at 9:54 AM, ??<yongari@freebsd.org> wrote:
>> >>>> Synopsis: [re] TCP transfer corruption using if_re
>> >>>>
>> >>>> State-Changed-From-To: open->feedback
>> >>>> State-Changed-By: yongari
>> >>>> State-Changed-When: Fri Mar 12 17:53:37 UTC 2010
>> >>>> State-Changed-Why:
>> >>>> This looks like an Rx checksum offloading issue. Would you try
>> >>>> disabling Rx checksum offloading and testing it again?
>> >>>> # ifconfig re0 -rxcsum
>> >>>> Also show me the dmesg output (the re(4)-related part).
>> >>>>
>> >>>>
>> >>>> Responsible-Changed-From-To: freebsd-net->yongari
>> >>>> Responsible-Changed-By: yongari
>> >>>> Responsible-Changed-When: Fri Mar 12 17:53:37 UTC 2010
>> >>>> Responsible-Changed-Why:
>> >>>> Mine.
>> >>>>
>> >>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=144689
>> >>>>
>> >>>
>> >>> Hmm. Disabling Rx checksum offloading helped for one clone process,
>> >>> but then this showed up in dmesg during my second test (it seems to be
>> >>> doing this regularly for some reason):
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>>
>> >>> And no, the cable isn't loose or something. It just decides to take
>> >>> the interface down and put it back up.
>> >>>
>> >>> Here's the rest of 'dmesg | grep re0':
>> >>>
>> >>> firewire0: <IEEE1394(FireWire) bus> on fwohci0
>> >>> dcons_crom0: <dcons configuration ROM> on firewire0
>> >>> fwe0: <Ethernet over FireWire> on firewire0
>> >>> fwip0: <IP over FireWire> on firewire0
>> >>> firewire0: 1 nodes, maxhop <= 0 cable IRM irm(0) ??(me)
>> >>> firewire0: bus manager 0
>> >>> re0: <RealTek 8169/8169S/8169SB(L)/8110S/8110SB(L) Gigabit Ethernet>
>> >>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
>> >>> cardbus0
>> >>> re0: Chip rev. 0x10000000
>> >>> re0: MAC rev. 0x00000000
>> >>> miibus1: <MII bus> on re0
>> >>> re0: Ethernet address: 00:18:4d:6e:c0:29
>> >>> re0: [FILTER]
>> >>> re0: link state changed to UP
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: detached
>> >>> re0: <RealTek 8169/8169S/8169SB(L)/8110S/8110SB(L) Gigabit Ethernet>
>> >>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
>> >>> cardbus0
>> >>> re0: Chip rev. 0x10000000
>> >>> re0: MAC rev. 0x00000000
>> >>> miibus1: <MII bus> on re0
>> >>> re0: Ethernet address: 00:18:4d:6e:c0:29
>> >>> re0: [FILTER]
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: PHY read failed
>> >>> re0: link state changed to DOWN
>> >>> re0: link state changed to UP
>> >>> re0: PHY read failed
>> >>>
>> >>> - Steven
>> >>>
>> >>
>> >> I should note that the connection was _lost_ during the second test above.
>> >>
>> >> I also tested again, and it looks like it added another "re0: PHY read
>> >> failed" before silently dropping the connection.
>> >>
>> >> - Steven
>> >>
>> >
>> > I did a couple captures with Wireshark on the client end. One is with
>> > rxcsum enabled on the machine running git-daemon, one is without
>> > rxcsum.
>> >
>> > http://www.uplinklabs.net/~tycho/files/git-cap-norxcsum.bz2
>> > http://www.uplinklabs.net/~tycho/files/git-cap.bz2
>> >
>> > Obviously, you can look at the data yourself and make more sense of
>> > it, but here are things I noticed in the captures:
>> >
>> > With rxcsum:
>> > - There are some silent problems that occur in the middle of the
>> > capture. Client-to-server: 'TCP ACKed lost segment' a few times, then
>> > 'TCP previous segment lost'. This happens multiple times during the
>> > capture (before 'git-upload-pack' starts sending data).
>> > - Occasional 'TCP window update's. These are highlighted in black for
>> > reasons unknown to me. It seems like this would be normal.
>> > - The server calls 'git-upload-pack' and we start seeing a large
>> > number of client-to-server TCP RST flags being sent and then the
>> > connection gets closed due to some detected data corruption in the
>> > transfer.
>> >
>> > Without rxcsum:
>> > - About the same amount of client-to-server 'TCP ACKed lost segment's.
>> > - 'git-upload-pack' kicks in and things get _really_ hairy. 'TCP Dup
>> > ACK' detected by the client many many times.
>> > - Finally, a series of 'TCP retransmission's from server to client
>> > happen (which is where the connection hangs).
>> > - I closed the connection which triggered the final two 'TCP RST's.
>> >
>> > Also, I forgot to note in my original report that I checked if there
>> > was packet loss by using a ping flood, and one packet in the 1.5
>> > million packets sent was lost. But I'm not sure whether it's
>> > checksumming the data of these packets, so they could be coming back
>> > with perfectly valid ICMP headers but corrupted data.
>> >
>>
>> Also, hilariously horrible hack:
>>
>> - On the server machine, start git-daemon listening on 127.0.0.1:9418.
>> - On the server machine, run 'ssh -L <public IP>:9418:127.0.0.1:9418
>> user@localhost'.
>>
>> Then remote git clones work as expected. Very strange. It will have to
>> do until I get a less insane solution.
>>
>
> The real issue looks like a PHY read failure, which can result in
> unexpected behavior. I don't see any rgephy(4)-related message here;
> would you show me the output of "devinfo -rv | grep phy"?
> By any chance, are you using a PCMCIA Ethernet controller?

I am. It's a Netgear GA511. I believe I mentioned in my original report
that it is connected via CardBus.

xerxes ~ # devinfo -rv | grep phy
                    rgephy0 pnpinfo oui=0x732 model=0x11 rev=0x3 at phyno=1
                inphy0 pnpinfo oui=0xaa00 model=0x33 rev=0x0 at phyno=1
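
To put numbers on the link flapping and PHY errors in the dmesg output
quoted above, the log can be tallied with a short script. This is only a
rough sketch: the function name summarize_re0 is made up for illustration,
and it assumes the three message strings appear exactly as they do in the
log above.

```python
import re
from collections import Counter


def summarize_re0(dmesg_text, ifname="re0"):
    """Count link flaps and PHY read failures for one interface."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        # Strip any mail quoting ("> >>> ") before matching.
        line = re.sub(r"^[>\s]+", "", line)
        if not line.startswith(ifname + ":"):
            continue  # message from another device
        msg = line[len(ifname) + 1:].strip()
        if msg == "link state changed to DOWN":
            counts["link_down"] += 1
        elif msg == "link state changed to UP":
            counts["link_up"] += 1
        elif msg == "PHY read failed":
            counts["phy_read_failed"] += 1
    return dict(counts)


# Small excerpt in the same format as the log above.
sample = """\
re0: link state changed to UP
re0: link state changed to DOWN
re0: link state changed to UP
re0: PHY read failed
fwe0: <Ethernet over FireWire> on firewire0
"""
print(summarize_re0(sample))
```

Feeding it the full 'dmesg' text before and after a test run would show
whether the PHY read failures keep growing during a transfer.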


>
>> I don't understand why it makes a difference. Is git-daemon using TCP
>> socket options that cause this network interface driver to
>> malfunction?
>>
>
> No, I don't think so. It would be a bug in the driver.
>
>> - Steven
>
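
On the open question from the ping flood quoted earlier (whether payload
data is silently corrupted even when headers look fine), one way to check
is to transfer a known file and compare it block by block with the
original. The following is a minimal sketch under assumptions: the helper
name corrupt_blocks and the 1024-byte block size are arbitrary, and it
assumes both copies are available on one machine (otherwise the per-block
digests could be exchanged instead).

```python
import hashlib


def corrupt_blocks(sent, received, block_size=1024):
    """Return the offsets of blocks whose SHA-256 digests differ."""
    bad = []
    length = max(len(sent), len(received))
    for off in range(0, length, block_size):
        a = hashlib.sha256(sent[off:off + block_size]).digest()
        b = hashlib.sha256(received[off:off + block_size]).digest()
        if a != b:
            bad.append(off)
    return bad


# Simulate a transfer that flips one byte at offset 3000.
original = bytes(range(256)) * 32  # 8192 bytes of known data
damaged = bytearray(original)
damaged[3000] ^= 0xFF
print(corrupt_blocks(original, bytes(damaged)))
```

An empty list means the two copies match; any offsets reported narrow down
where in the stream the damage occurred.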


