Date: Tue, 3 May 2011 09:00:42 +0200
From: Daniel Hartmeier
To: Jeremy Chadwick
Cc: freebsd-stable@freebsd.org, freebsd-pf@freebsd.org
Subject: Re: RELENG_8 pf stack issue (state count spiraling out of control)
Message-ID: <20110503070042.GA9657@insomnia.benzedrine.cx>
In-Reply-To: <20110503015854.GA31444@icarus.home.lan>
References: <20110503015854.GA31444@icarus.home.lan>

I read those graphs differently: the problem doesn't arise slowly, but
rather seems to start suddenly at 13:00.

Right after 13:00, traffic on em0 drops, i.e. the firewall seems to
stop forwarding packets completely.

Yet, at the same time, the states start to increase, almost linearly,
at about one state every two seconds, until the limit of 10,000 is
reached.

Reaching the limit seems to be only a side-effect of a problem that
started at 13:00.

> Here's one piece of core.0.txt which makes no sense to me -- the "rate"
> column.  I have a very hard time believing that was the interrupt rate
> of all the relevant devices at the time (way too high).  Maybe this
> data becomes wrong only during a coredump?  The total column I could
> believe.
>
> ------------------------------------------------------------------------
> vmstat -i
>
> interrupt                          total       rate
> irq4: uart0                        54768        912
> irq6: fdc0                             1          0
> irq17: uhci1+                        172          2
> irq23: uhci3 ehci1+                 2367         39
> cpu0: timer                  13183882632  219731377
> irq256: em0                    260491055    4341517
> irq257: em1                    127555036    2125917
> irq258: ahci0                  225923164    3765386
> cpu2: timer                  13183881837  219731363
> cpu1: timer                  13002196469  216703274
> cpu3: timer                  13183881783  219731363
> Total                        53167869284  886131154
> ------------------------------------------------------------------------

I find this suspect as well, but I don't have an explanation yet.

Are you using anything non-GENERIC related to timers, like changing HZ
or enabling polling?

Are you sure the problem didn't start right at 13:00 and cause complete
packet loss for the entire period, rather than growing gradually worse?

Daniel
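
P.S. One detail that may or may not be relevant -- this is just
arithmetic on the output quoted above, I haven't checked how vmstat
actually derives the column when run against a core:

	13183882632 / 219731377 = 60.0   (cpu0: timer)
	  260491055 /   4341517 = 60.0   (irq256: em0)
	  225923164 /   3765386 = 60.0   (irq258: ahci0)
	53167869284 / 886131154 = 60.0   (Total)

Every rate works out to almost exactly total/60, as if the counters had
been divided by an uptime of only about 60 seconds. If so, the rate
column would be bogus without the totals themselves being wrong.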