Date:      Tue, 22 Nov 2011 22:01:45 +0000
From:      Matthew Seaman <>
Subject:   Re: Diagnosing packet loss
Message-ID:  <>
In-Reply-To: <>
References:  <>


On 22/11/2011 20:33, Kees Jan Koster wrote:
> I am stuck with a machine that shows serious packet loss (about 1% of
> all traffic is dropped). I tried the obvious (new network cable,
> different switch port, different ethernet interface on the machine),
> but the problems remain.
> Another machine that sits in the same rack and is hooked up to the
> same switch shows no such packet loss issues. The problematic machine
> is a dual Opteron with FreeBSD 8.2-STABLE from Thu Aug 11 14:05:47
> CEST 2011.
> The machine is lightly loaded. A MySQL slave is running, but the
> machine is not serving queries. Plus a Munin server process.
> I am at a loss where to start diagnosing this. Can you advise me
> where to look? Are there network buffers that may be overflowing?

You say "lightly loaded," but how much does that actually equate to in
kb/s or Mb/s?  I'd call anything less than about 1Mb/s on a gigabit ethernet
link pretty light, but other people have different ideas.
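
To put a number on "lightly loaded", here is a rough sketch that turns
two readings of an interface byte counter into Mb/s.  I'm assuming you
take the readings yourself (for example from the byte columns of
netstat -ib) a known number of seconds apart; the function name and the
1000 Mb/s default are just my choices for illustration, and counter
wraparound is ignored.

```python
def link_utilisation(bytes_start, bytes_end, seconds, link_mbps=1000.0):
    """Return (Mb/s, percent of link) between two byte-counter samples.

    bytes_start / bytes_end are cumulative byte counts read by hand;
    link_mbps is the nominal link speed (assumed gigabit by default).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    megabits = (bytes_end - bytes_start) * 8 / 1_000_000
    rate = megabits / seconds
    return rate, 100.0 * rate / link_mbps


# 7.5 MB moved in one minute works out to 1 Mb/s.
rate, pct = link_utilisation(1_000_000, 8_500_000, 60)
print(f"{rate:.2f} Mb/s ({pct:.2f}% of link)")
```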

Check for duplex mismatch -- normally everything just works allowing the
NIC and the switch to autonegotiate, but every so often some bright
spark gets the idea that wiring down the speed setting is a good idea.
Trouble is you have to set *both* ends of the ethernet link to the same
settings -- if one end is set to autonegotiate and the other is fixed,
the auto end cannot negotiate duplex and typically falls back to
half-duplex (often 100baseTX half-duplex), and performance will suck,
and suck increasingly hard as network load goes up.  Amazing how often
that 'set both ends the same' thing leads to grief.
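
If you want to check this on more than one box, something like the
following could pick the duplex setting out of saved ifconfig output.
It's only a sketch: it assumes the "media:" line looks like the sample
strings below, which I've written by hand for illustration, so adjust
the matching if your ifconfig prints something different.

```python
def parse_media(ifconfig_text):
    """Return autoselect/duplex info from the first 'media:' line, or None."""
    for line in ifconfig_text.splitlines():
        line = line.strip()
        if not line.startswith("media:"):
            continue
        if "half-duplex" in line:
            duplex = "half"
        elif "full-duplex" in line:
            duplex = "full"
        else:
            duplex = "unknown"
        return {"autoselect": "autoselect" in line, "duplex": duplex}
    return None


# Hand-written sample lines, not captured from a real machine.
good = "\tmedia: Ethernet autoselect (1000baseT <full-duplex>)\n\tstatus: active"
bad = "\tmedia: Ethernet autoselect (100baseTX <half-duplex>)\n\tstatus: active"

for name, text in (("good", good), ("bad", bad)):
    info = parse_media(text)
    flag = "check the switch port!" if info["duplex"] == "half" else "ok"
    print(name, info, flag)
```

An autoselecting interface that reports half-duplex on a modern switch
is the classic sign that the other end has been wired down.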

Another hideously embarrassing error would be to spend ages debugging
before finding out you had a duplicate IP number on your network.  Can
you definitely rule that out?
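
One way to rule it out is to collect (IP, MAC) pairs as seen from
another host over a period of time and look for any IP that turns up
with more than one MAC.  A minimal sketch, assuming you've already
extracted the pairs by hand or from a capture; the addresses below are
made-up documentation values:

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Map each IP seen with more than one MAC to its sorted MAC list."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items()
            if len(macs) > 1}


# Hypothetical observations: 192.0.2.10 answers from two different MACs.
seen = [
    ("192.0.2.10", "00:00:5E:00:53:01"),
    ("192.0.2.11", "00:00:5e:00:53:0a"),
    ("192.0.2.10", "00:00:5e:00:53:02"),
    ("192.0.2.10", "00:00:5e:00:53:01"),
]
print(find_duplicate_ips(seen))
```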

A third networking problem that also has the potential to make you the
butt of a few jokes is if your network cables are kinked, crushed,
overstretched or simply cable-tied too tightly.  Anything like that can
cause signal leakage between the pairs of conductors in the cable which
can be enough to disrupt packet transmission.  Simply snipping through a
too-tight cable tie can have a magical-seeming effect.

What sort of NICs are there in your machine?  It's well known that re(4)
interfaces simply cannot keep up with the throughput of a good server
NIC like em(4) or bge(4).  [But re(4)'s are cheap and good enough for
most home systems...]  If you can, try swapping in a reasonably good NIC
-- beg, borrow or steal from another machine just for a few hours to use
for testing -- and see if that cures the problem.

Other considerations: are you doing anything beyond just plain ethernet
networking?  Any VLANs?  What about IPsec or other
tunnelled/encapsulated traffic?  Are you using RSTP or lagg to make your
networking resilient to failures?  If the answer to any of these is
"yes" -- does temporarily disabling that feature and doing it simple and
stupid help with the packet loss?

Do you get the same sort of packet loss if you take the switch away and
just run a cable direct between two machines?  (Nb. If your NICs don't
support auto MDI-X you'll need a crossover cable.)
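
To compare the loss with and without the switch you want the same
measurement both times.  A small sketch, assuming you send numbered
probes (ping sequence numbers would do) and note which ones came back;
the function name and the sample numbers are mine, purely for
illustration:

```python
def loss_report(sent, received_seqs):
    """Return (percent lost, sorted missing seqs) for probes 0..sent-1."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    expected = set(range(sent))
    got = set(received_seqs) & expected   # ignore stray sequence numbers
    missing = sorted(expected - got)
    return 100.0 * len(missing) / sent, missing


# Illustrative run: 200 probes sent, two never answered.
answered = [s for s in range(200) if s not in (17, 130)]
pct, missing = loss_report(200, answered)
print(f"{pct:.1f}% loss, missing seq {missing}")
# prints "1.0% loss, missing seq [17, 130]"
```

Whether the missing probes are scattered or bunched together is worth
noting as well as the percentage.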

On another host on your net, can you use wireshark to capture and
examine the traffic from your failing machine?  For best results, either
wire the two machines directly together or configure the switch port
your wireshark box is connected to as a /monitor/ port so it sees all
the traffic coming out of your problem box.

Does your NIC have hardware checksumming?  If so, does disabling that
help with the error rate? (see ifconfig(8) and the man page for your NIC
in section 4 for how.)  There have been a number of instances of buggy
checksumming causing problems in the past.  Nb. with hardware
checksumming, the checksum field is calculated and inserted in packets
very late; after any way of examining the packets as they leave your
machine has ceased to be possible.  Makes it look like the checksums are
all wrong if you sample the traffic on the originating machine.  This is
why you need to use another, external machine to watch for this sort of
problem.
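
For reference, the checksum in question is the standard Internet
checksum (RFC 1071).  Here is a minimal sketch of it, applied to an IPv4
header since that is the simplest case; TCP and UDP use the same
arithmetic over a pseudo-header plus the segment, which I've left out.
The sample header is a commonly quoted textbook example, not taken from
your traffic.

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement of the one's-complement sum."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                  # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header_ok(header: bytes) -> bool:
    """A header with a correct checksum field sums to zero."""
    return internet_checksum(header) == 0


header = bytes.fromhex("4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7")
zeroed = header[:10] + b"\x00\x00" + header[12:]

print(hex(internet_checksum(zeroed)))   # 0xb861, matching the header field
print(ipv4_header_ok(header))           # True
print(ipv4_header_ok(zeroed))           # False
```

The last case is roughly what a capture on the sending machine looks
like with offload enabled: the field hasn't been filled in yet, so the
check fails even though the packet on the wire is fine.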



Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP:                                              Ramsgate
JID:                                              Kent, CT11 9PW



