Date:      Mon, 21 Sep 1998 22:31:13 -0400 (EDT)
From:      Bill Paul <wpaul@skynet.ctr.columbia.edu>
To:        current@FreeBSD.ORG, freebsd-net@FreeBSD.ORG
Cc:        wollman@FreeBSD.ORG
Subject:   Strange behavior with ARP and IP fragmentation
Message-ID:  <199809220231.WAA25016@skynet.ctr.columbia.edu>

Hello:

For those who don't know, I've been working on yet another fast ethernet
driver lately for the RealTek 8139 chip. This chip sucks, but that's not
why I'm writing. Today, while running some tests, I noticed some odd
IP fragmentation behavior which I thought was due to a bug in my driver 
code, but I've since been able to duplicate the problem on another 
machine with a 3c509 card using the ep driver. This has me a little
confused.

Here's the deal: one of the tests I do involves sending ICMP datagrams
with ping using various payload sizes (using the -s flag). By using a
packet size larger than 1500 bytes, I can get the system to queue up
a small number of ethernet frames fairly quickly and observe the result.
This lets me see if the driver is transmitting rapidly queued 
sequences of frames correctly. I use the -c flag with ping to limit the 
number of packets so that I can check short bursts of frames rather than 
a huge stream. (Watching a massive bunch of frames fly through tcpdump at 
100Mbps makes it hard to spot glitches.)

One thing I do a lot is this:

# ifconfig <interface> 10.0.0.2 netmask 0xffffff00 up
# ping -c 1 -s 4096 10.0.0.1

10.0.0.1 is another machine attached to the interface under test using
a crossover cable. I run tcpdump on this host to monitor traffic from
the first machine so I can see what the NIC is sending. Assuming the
system has just been booted, the 10.0.0.2 host will not yet have an
ARP entry for the 10.0.0.1 host, so the sequence should go something
like this:

10.0.0.2: sends an ARP request for 10.0.0.1
10.0.0.1: sends an ARP reply to 10.0.0.2
10.0.0.2: sends the first fragment of an ICMP echo request which should
          be about 1514 bytes long. The ICMP packet is fragmented since
          4096 bytes is larger than the interface MTU of 1500 bytes.
10.0.0.2: sends the next fragment, also of 1514 bytes
10.0.0.2: sends the last fragment, somewhere in the neighborhood of
          1178 bytes
10.0.0.1: sends the first fragment of an ICMP echo reply. Again, the
          fragmentation occurs because the reply is also 4096 bytes.
10.0.0.1: sends the next frag
10.0.0.1: sends the last frag
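
The fragment sizes in the sequence above can be worked out by hand. The
following is a minimal sketch of that arithmetic, not anything taken from
the kernel. It assumes a 20-byte IP header with no options, an 8-byte
ICMP echo header, and a 14-byte ethernet header as counted by tcpdump -e;
the function name fragment_plan is made up for illustration.

```python
# Sketch: how an ICMP echo request sent with "ping -s <size>" splits into
# IPv4 fragments on an ethernet link. Header sizes are assumptions stated
# in the text above (no IP options, no VLAN tag, FCS not counted).

ETHER_HDR = 14   # ethernet header bytes as reported by tcpdump -e
IP_HDR = 20      # IPv4 header without options
ICMP_HDR = 8     # ICMP echo header


def fragment_plan(ping_size, mtu=1500):
    """Return a list of (offset, data_len, more_fragments, frame_len)."""
    total = ping_size + ICMP_HDR          # bytes carried after the IP header
    per_frag = (mtu - IP_HDR) // 8 * 8    # fragment data is a multiple of 8
    plan = []
    offset = 0
    while offset < total:
        data_len = min(per_frag, total - offset)
        more = offset + data_len < total  # "more fragments" flag
        plan.append((offset, data_len, more, data_len + IP_HDR + ETHER_HDR))
        offset += data_len
    return plan


if __name__ == "__main__":
    # Print each fragment in tcpdump's len@offset notation.
    for offset, data_len, more, frame_len in fragment_plan(4096):
        flag = "+" if more else ""
        print(f"{data_len}@{offset}{flag} -> {frame_len}-byte frame")
```

For a 4096-byte ping this gives two full 1514-byte frames followed by a
shorter final frame; calling fragment_plan(8100) gives six fragments.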

At this point, ping reports that the reply was received and all is
happy and there is much rejoicing.

Not.

What I observed is that the ARP request and ARP reply proceed as expected,
but the first portion of the ICMP packet transmitted is in fact the last
fragment. The first two fragments have vanished into the void. Since 
the ICMP echo request is contained in the first fragment, the host on the 
other side discards the fragment and never sends a reply. The result is 
that 'ping -c 1 -s 4096 10.0.0.1' just sits there and no reply is ever 
received.

On the other hand, sending a second ICMP request immediately after the
first does work.

Below is a tcpdump capture of an actual exchange between two machines.
Harpsichord is a Micron Pentium Pro 200MHz machine with a 3Com 3c509
ethernet adapter running FreeBSD 2.2.6. Sax is an IBM RS/6000 model 390 
running AIX 4.1.4.

First, I run tcpdump on harpsichord to capture the session:

[/homes/rwpaul]:harpsichord{1}#	tcpdump -n -e -i ep0 host sax and harpsichord
tcpdump: listening on ep0

Now I type 'ping -c 1 -s 4096 sax' on harpsichord. Note: there is no
ARP entry for sax on harpsichord at this point. The resulting exchange is 
shown below:

21:41:03.105011 0:60:97:6c:6f:b0 ff:ff:ff:ff:ff:ff 0806 42: arp who-has 
	128.59.68.56 tell 128.59.68.72
21:41:03.105338 10:0:5a:fa:4e:9e 0:60:97:6c:6f:b0 0806 60: arp reply 
	128.59.68.56 is-at 10:0:5a:fa:4e:9e
21:41:03.105970 0:60:97:6c:6f:b0 10:0:5a:fa:4e:9e 0800 1178:
	128.59.68.72 > 128.59.68.56: (frag 15401:1144@2960)

Note that the only part of the ICMP datagram to make it out the door
is the final fragment. This fails to elicit a response from the RS/6000,
so the ping times out.

Now I issue the same ping command to send another 4096 byte ICMP request.
Now an ARP entry for sax exists on harpsichord, so no ARP packets
are sent, and everything looks normal:

21:41:19.647643 0:60:97:6c:6f:b0 10:0:5a:fa:4e:9e 0800 1514: 
	128.59.68.72 > 128.59.68.56: icmp: echo request (frag 15424:1480@0+)
21:41:19.648423 0:60:97:6c:6f:b0 10:0:5a:fa:4e:9e 0800 1514: 
	128.59.68.72 > 128.59.68.56: (frag 15424:1480@1480+)
21:41:19.649053 0:60:97:6c:6f:b0 10:0:5a:fa:4e:9e 0800 1178: 
	128.59.68.72 > 128.59.68.56: (frag 15424:1144@2960)
21:41:19.652758 10:0:5a:fa:4e:9e 0:60:97:6c:6f:b0 0800 1514: 
	128.59.68.56 > 128.59.68.72: icmp: echo reply (frag 12732:1480@0+)
21:41:19.654060 10:0:5a:fa:4e:9e 0:60:97:6c:6f:b0 0800 1514: 
	128.59.68.56 > 128.59.68.72: (frag 12732:1480@1480+)
21:41:19.655099 10:0:5a:fa:4e:9e 0:60:97:6c:6f:b0 0800 1178: 
	128.59.68.56 > 128.59.68.72: (frag 12732:1144@2960)
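
The two captures above can also be checked mechanically. The sketch
below parses tcpdump's "(frag id:len@offset+)" notation and reports which
byte ranges of each datagram never showed up. It is only an illustration:
the function name missing_ranges and the sample strings are assumptions,
with the samples copied from the harpsichord-to-sax lines above.

```python
import re

# Matches tcpdump's fragment notation, e.g. "(frag 15424:1480@0+)".
# A trailing "+" means more fragments follow.
FRAG_RE = re.compile(r"\(frag (\d+):(\d+)@(\d+)(\+?)\)")


def missing_ranges(capture_text):
    """Map each IP ID to a list of (start, end) byte ranges not captured.

    An end of None means the final fragment was never seen, so the total
    datagram length is unknown.
    """
    frags = {}
    for ident, length, offset, more in FRAG_RE.findall(capture_text):
        frags.setdefault(int(ident), []).append(
            (int(offset), int(length), more == "+"))

    report = {}
    for ident, parts in frags.items():
        gaps = []
        expected = 0          # next byte offset we expect to see
        saw_last = False
        for offset, length, more in sorted(parts):
            if offset > expected:
                gaps.append((expected, offset))
            expected = max(expected, offset + length)
            if not more:
                saw_last = True
        if not saw_last:
            gaps.append((expected, None))
        report[ident] = gaps
    return report


# Sample input: the outgoing fragments from the two exchanges above.
FIRST_PING = "128.59.68.72 > 128.59.68.56: (frag 15401:1144@2960)"
SECOND_PING = """
128.59.68.72 > 128.59.68.56: icmp: echo request (frag 15424:1480@0+)
128.59.68.72 > 128.59.68.56: (frag 15424:1480@1480+)
128.59.68.72 > 128.59.68.56: (frag 15424:1144@2960)
"""

if __name__ == "__main__":
    print(missing_ranges(FIRST_PING))    # bytes 0-2960 are missing
    print(missing_ranges(SECOND_PING))   # no gaps
```

Run against the first exchange it flags bytes 0 through 2960 of
datagram 15401 as missing; the second exchange comes back with no gaps.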


I originally observed this behavior on a 3.0CAM snapshot with my
not quite complete (but largely functional) RealTek driver; however,
it appears to manifest itself on 2.2.x too. I'm at a loss to explain
what's going on here, but something's clearly wrong. For a while I was
convinced that my driver was at fault, but after adding some debug
code I realized that the transmit start routine was only being called
with one fragment, so the other fragments weren't even making it to
the device driver stage. This is further evidenced by the fact that
I can reproduce the problem on 2.2.6 with a totally different driver.
I have no idea if this behavior goes all the way back to 2.1.x.

Note that larger ICMP datagram sizes will also trigger the behavior:
on FreeBSD 3.0, I was able to specify a size of 8100 bytes without
ping complaining, but again only the last fragment of the first
datagram gets transmitted (subsequent datagrams sent after the ARP
request/reply exchange are sent properly).

If anybody has any insights on this, I'd love to hear them. I really
don't want to wade through TCP/IP Illustrated Vol.II trying to track
this down.

-Bill

-- 
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home:  wpaul@skynet.ctr.columbia.edu | Columbia University, New York City
=============================================================================
 "It is not I who am crazy; it is I who am mad!" - Ren Hoek, "Space Madness"
=============================================================================



