Date: Fri, 28 Dec 2007 17:49:14 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans
Cc: Kostik Belousov, freebsd-stable@FreeBSD.org, freebsd-net@FreeBSD.org
Subject: Re: Packet loss every 30.999 seconds
In-Reply-To: <20071228155323.X3858@besplex.bde.org>
Message-ID: <20071228170151.C4166@besplex.bde.org>

On Fri, 28 Dec 2007, Bruce Evans wrote:

> On Fri, 28 Dec 2007, Bruce Evans wrote:
>
>> In previous mail, you (Mark) wrote:
>>
>> # With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
>> # kern.ipc.maxsockbuf=20480000, then use setsockopt() with SO_RCVBUF
>> # in the application.  If packets were dropped they would show up
>> # with netstat -s as "dropped due to full socket buffers".
>> #
>> # Since the packet never makes it to ip_input() I no longer have
>> # any way to count drops.  There will always be corner cases where
>> # interrupts are lost and drops not accounted for if the adapter
>> # hardware can't report them, but right now I've got no way to
>> # estimate any loss.

I found where drops are recorded for the net.isr.direct=0 case.  They are
counted in net.inet.ip.intr_queue_drops.  The netisr subsystem just calls
IF_HANDOFF(), and IF_HANDOFF() calls _IF_DROP() if the queue fills up.
_IF_DROP(ifq) just increments ifq->ifq_drops.  The usual case for netisrs
is for the queue to be ipintrq for NETISR_IP.
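The accounting is tiny.  Here is an untested sketch, paraphrased from the
_IF_QFULL()/_IF_DROP() macros in sys/net/if_var.h; the real IF_HANDOFF()
also takes the queue mutex, frees the mbuf on failure and may start the
interface, and "struct pkt" just stands in for struct mbuf:

#include <stddef.h>

/* Stand-in for struct mbuf; only the queue linkage matters here. */
struct pkt {
        struct pkt *p_nextpkt;
};

struct ifqueue_sketch {
        struct pkt *ifq_head;
        struct pkt *ifq_tail;
        int ifq_len;            /* current queue length */
        int ifq_maxlen;         /* ipintrq: IPQ_MAXLEN = 50 by default */
        int ifq_drops;          /* what net.inet.ip.intr_queue_drops reports */
};

#define _IF_QFULL(ifq)  ((ifq)->ifq_len >= (ifq)->ifq_maxlen)
#define _IF_DROP(ifq)   ((ifq)->ifq_drops++)

/* Roughly what handing a packet off to a netisr queue such as ipintrq does. */
static int
handoff_sketch(struct ifqueue_sketch *ifq, struct pkt *p)
{
        if (_IF_QFULL(ifq)) {
                _IF_DROP(ifq);  /* the only record of the lost packet */
                return (0);     /* caller frees the packet */
        }
        p->p_nextpkt = NULL;
        if (ifq->ifq_tail == NULL)
                ifq->ifq_head = p;
        else
                ifq->ifq_tail->p_nextpkt = p;
        ifq->ifq_tail = p;
        ifq->ifq_len++;
        return (1);
}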
The following details don't help:
- drops for input queues don't seem to be displayed by any utilities
  (except that the ones for ipintrq are displayed primitively by sysctl
  net.inet.ip.intr_queue_drops).  netstat and systat only display drops
  for send queues and ip frags.
- the netisr subsystem's drop count doesn't seem to be displayed by any
  utilities except sysctl.  It only counts drops due to there not being
  a queue; other drops are counted by _IF_DROP() in the per-queue
  counter.  Users have a hard time integrating all these primitively
  displayed drop counts with other error counters.
- the length of ipintrq defaults to the default ifq length of
  ipqmaxlen = IPQ_MAXLEN = 50.  This is inadequate if there is just one
  NIC in the system with an rx ring size of slightly less than 50 or
  more.  But 1 Gbps NICs should have an rx ring size of 256 or 512
  (I think the size is 256 for em; it is 256 for bge due to bogus
  configuration of hardware that can handle 512).  If the larger
  hardware rx ring is actually used, then ipintrq drops are almost
  ensured in the direct=0 case, so using the larger h/w ring is worse
  than useless (it also increases cache misses).  This is for just one
  NIC.  The problem is often limited in practice by handling rx packets
  in small bursts, at a cost of extra overhead; interrupt moderation
  makes it worse by increasing burst sizes.

This contrasts with the handling of send queues.  Send queues are
per-interface, and most drivers increase the default length from 50 to
their ring size (minus 1, for bogus reasons).  I think this is only an
optimization, while a similar change for rx queues is important for
avoiding packet loss.  For send queues, the ifq acts mainly as a
primitive implementation of watermarks.  I have found that tx queue
lengths need to be more like 5000 than 50 or 500 to provide enough
buffering when applications are delayed by other applications or just
by sleeping until the next clock tick, and I use tx queues of length
~20000 (a couple of clock ticks' worth at HZ = 100).  However, I now
think queue lengths should be restricted to more like 50, since long
queues cannot fit in L2 caches (not to mention that they are bad for
latency).

The length of ipintrq can be changed using sysctl
net.inet.ip.intr_queue_maxlen.  Changing it from 50 to 1024 turns most
or all ipintrq drops into "socket buffer full" drops (640 kpps input
packets and 434 kpps socket buffer fulls with direct=0; 640 kpps input
packets and 324 kpps socket buffer fulls with direct=1).

Bruce
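P.S.  Since nothing except sysctl displays these counters, here is an
untested userland sketch that polls the two sysctls named above via
sysctlbyname(3):

#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>
#include <unistd.h>

/* Read a single int-valued sysctl or exit with an error. */
static int
read_int_sysctl(const char *name)
{
        int val;
        size_t len = sizeof(val);

        if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
                err(1, "sysctlbyname(%s)", name);
        return (val);
}

int
main(void)
{
        int cur, prev;

        printf("ipintrq maxlen: %d\n",
            read_int_sysctl("net.inet.ip.intr_queue_maxlen"));
        prev = read_int_sysctl("net.inet.ip.intr_queue_drops");
        for (;;) {
                sleep(1);
                cur = read_int_sysctl("net.inet.ip.intr_queue_drops");
                printf("ipintrq drops: %d (+%d/s)\n", cur, cur - prev);
                prev = cur;
        }
}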