Date:      Sat, 22 Dec 2007 13:02:12 -0500
From:      Mark Fullmer <maf@eng.oar.net>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Kostik Belousov <kostikbel@gmail.com>, freebsd-net@FreeBSD.org, freebsd-stable@freebsd.org
Subject:   Re: Packet loss every 30.999 seconds
Message-ID:  <985A3F99-B3F4-451E-BD77-E2EB4351E323@eng.oar.net>
In-Reply-To: <20071223032944.G48303@delplex.bde.org>
References:  <20071221234347.GS25053@tnn.dglawrence.com> <MDEHLPKNGKAHNMBLJOLKMEKLJAAC.davids@webmaster.com> <20071222050743.GP57756@deviant.kiev.zoral.com.ua> <20071223032944.G48303@delplex.bde.org>


On Dec 22, 2007, at 12:08 PM, Bruce Evans wrote:
>
> I still don't understand the original problem, that the kernel is not
> even preemptible enough for network interrupts to work (except in 5.2
> where Giant breaks things).  Perhaps I misread the problem, and it is
> actually that networking works but userland is unable to run in time
> to avoid packet loss.
>

The test is done with UDP packets between two servers.  The em
driver is incrementing the received packet count correctly, but
the packets are not making it up the network stack.  If
the application were not servicing the socket fast enough, I would
expect to see the "dropped due to full socket buffers" (udps_fullsock)
counter incrementing, as shown by netstat -s.

I grab a copy of netstat -s, netstat -i, and netstat -m
before and after testing.  Apart from the link packets counter,
I haven't seen any indication of where the packets are getting
lost.  The em driver has a debugging stats option, which does not
indicate receive side overflows.
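
For anyone who wants to repeat the before/after comparison, here is a
rough sketch of how two saved copies of netstat -s could be diffed.  It
assumes every counter line has the form "<integer> <description>" under
a "proto:" heading; the function names are made up for illustration and
this is not the script used for the test above.  Descriptions that
change between singular and plural across snapshots will not be matched.

```python
import re

# Assumption: a counter line is optional whitespace, an integer, then a
# free-text description (e.g. "0 dropped due to full socket buffers").
COUNTER_LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def parse_counters(text):
    """Map 'section: description' to its integer value."""
    counters = {}
    section = ""
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[f"{section}: {match.group(2)}"] = int(match.group(1))
        elif line.strip().endswith(":"):
            # A heading such as "udp:" starts a new section.
            section = line.strip().rstrip(":")
    return counters


def changed_counters(before_text, after_text):
    """Return {counter: after - before} for counters whose value changed."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {
        key: after[key] - before[key]
        for key in after
        if key in before and after[key] != before[key]
    }
```

A counter such as udps_fullsock that stays at the same value in both
snapshots simply does not appear in the result, which matches what is
described above: no drop counter moves during the test.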

I'm fairly certain this same behavior can be seen with the fxp
driver, but I'll need to double check.

These are results I sent a few days ago after setting up a
test without an ethernet switch between the sender and receiver.

The switch was originally used to verify the sender was actually
transmitting.  With spanning tree, ethernet keepalives, and CDP
(Cisco's proprietary neighbor discovery protocol) disabled, and static
ARP entries on the sender and receiver, I can account for all packets
making it to the receiver.

##

> Back to back test with no ethernet switch between two em interfaces,
> same result.  The receiving side has been up > 1 day and exhibits
> the problem.  These are also two different servers.  The small
> gettimeofday() syscall tester also shows the same ~30
> second pattern of high latency between syscalls.
>
> Receiver test application reports 3699 missed packets
>
> Sender netstat -i:
>
> (before test)
> em1    1500 <Link#2>      00:04:23:cf:51:b7       20     0  15975785     0     0
> em1    1500 10.1/24       10.1.0.2                37     -  15975801     -     -
>
> (after test)
> em1    1500 <Link#2>      00:04:23:cf:51:b7       22     0  25975822     0     0
> em1    1500 10.1/24       10.1.0.2                39     -  25975838     -     -
>
> total IP packets sent during test = end - start
> 25975838-15975801 = 10000037 (expected: 10,000,000 test packets +
> overhead)
>
> Receiver netstat -i:
>
> (before test)
> em1    1500 <Link#2>      00:04:23:c4:cc:89 15975785     0        21     0     0
> em1    1500 10.1/24       10.1.0.1          15969626     -        19     -     -
>
> (after test)
> em1    1500 <Link#2>      00:04:23:c4:cc:89 25975822     0        23     0     0
> em1    1500 10.1/24       10.1.0.1          25965964     -        21     -     -
>
> total ethernet frames received during test = end - start
> 25975822-15975785 = 10000037 (as expected)
>
> total IP packets processed during test = end - start
> 25965964-15969626 = 9996338 (expecting 10000037)
>
> Missed packets = expected - received
> 10000037-9996338 = 3699
>
> netstat -i accounts for the 3699 missed packets also reported by the
> application
>
> Looking closer at the tester output again shows the periodic
> ~30 second windows of packet loss.
>
> There's a second problem here in that packets are just disappearing
> before they make it to ip_input(), or there's a dropped packets
> counter I've not found yet.
>
> I can provide remote access to anyone who wants to take a look; this
> is very easy to duplicate.  The ~1 day of uptime required before the
> behavior surfaces is not making this easy to isolate.
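
The counter arithmetic in the quoted results can be re-checked with a
few lines of Python.  This is only a sketch: the constants are the
receiver's netstat -i values copied from above, and the variable and
function names are my own.

```python
# Receiver-side counters copied from the quoted netstat -i output.
RX_LINK_BEFORE, RX_LINK_AFTER = 15975785, 25975822  # <Link#2> row, received
RX_IP_BEFORE, RX_IP_AFTER = 15969626, 25965964      # 10.1/24 row, received


def delta(before, after):
    """Packets counted during the test: end minus start."""
    return after - before


frames_received = delta(RX_LINK_BEFORE, RX_LINK_AFTER)  # seen by the em driver
ip_processed = delta(RX_IP_BEFORE, RX_IP_AFTER)         # counted at the IP layer
missed = frames_received - ip_processed                 # lost in between
loss_fraction = missed / frames_received

print(f"ethernet frames received: {frames_received}")
print(f"IP packets processed:     {ip_processed}")
print(f"missed:                   {missed} ({loss_fraction:.4%})")
```

The difference between the link-level and IP-level deltas is the 3699
packets the test application reports as missed.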




