Date: Tue, 3 May 2011 09:00:42 +0200
From: Daniel Hartmeier
To: Jeremy Chadwick
Cc: freebsd-stable@freebsd.org, freebsd-pf@freebsd.org
Subject: Re: RELENG_8 pf stack issue (state count spiraling out of control)
Message-ID: <20110503070042.GA9657@insomnia.benzedrine.cx>
In-Reply-To: <20110503015854.GA31444@icarus.home.lan>
References: <20110503015854.GA31444@icarus.home.lan>

I read those graphs differently: the problem doesn't arise slowly, but
rather seems to start suddenly at 13:00.

Right after 13:00, traffic on em0 drops, i.e. the firewall seems to
stop forwarding packets completely.

Yet, at the same time, the states start to increase, almost linearly,
at about one state every two seconds, until the limit of 10,000 is
reached.

Reaching the limit seems to be only a side-effect of a problem that
started at 13:00.

> Here's one piece of core.0.txt which makes no sense to me -- the "rate"
> column.  I have a very hard time believing that was the interrupt rate
> of all the relevant devices at the time (way too high).  Maybe this
> data becomes wrong only during a coredump?  The total column I could
> believe.
>
> ------------------------------------------------------------------------
> vmstat -i
>
> interrupt                          total       rate
> irq4: uart0                        54768        912
> irq6: fdc0                             1          0
> irq17: uhci1+                        172          2
> irq23: uhci3 ehci1+                 2367         39
> cpu0: timer                  13183882632  219731377
> irq256: em0                    260491055    4341517
> irq257: em1                    127555036    2125917
> irq258: ahci0                  225923164    3765386
> cpu2: timer                  13183881837  219731363
> cpu1: timer                  13002196469  216703274
> cpu3: timer                  13183881783  219731363
> Total                        53167869284  886131154
> ------------------------------------------------------------------------

I find this suspect as well, but I don't have an explanation yet.

Are you using anything non-GENERIC related to timers, like changing HZ
or enabling polling?

Are you sure the problem didn't start right at 13:00 and cause complete
packet loss for the entire period, rather than growing gradually worse?

Daniel
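
P.S. One detail that may or may not be relevant -- this is just
arithmetic on the output quoted above, I haven't checked how vmstat
actually derives the column when run against a core:

	13183882632 / 219731377 = 60.0   (cpu0: timer)
	  260491055 /   4341517 = 60.0   (irq256: em0)
	  225923164 /   3765386 = 60.0   (irq258: ahci0)
	53167869284 / 886131154 = 60.0   (Total)

Every rate works out to almost exactly total/60, as if the counters had
been divided by an uptime of only about 60 seconds. If so, the rate
column would be bogus without the totals themselves being wrong.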