Date:      Tue, 3 May 2011 09:00:42 +0200
From:      Daniel Hartmeier <daniel@benzedrine.cx>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        freebsd-stable@freebsd.org, freebsd-pf@freebsd.org
Subject:   Re: RELENG_8 pf stack issue (state count spiraling out of control)
Message-ID:  <20110503070042.GA9657@insomnia.benzedrine.cx>
In-Reply-To: <20110503015854.GA31444@icarus.home.lan>
References:  <20110503015854.GA31444@icarus.home.lan>

I read those graphs differently: the problem doesn't arise slowly,
but rather seems to start suddenly at 13:00.

Right after 13:00, traffic on em0 drops, i.e. the firewall seems
to stop forwarding packets completely.

Yet, at the same time, the states start to increase, almost linearly
at about one state every two seconds, until the limit of 10,000 is
reached. Reaching the limit seems to be only a side-effect of a
problem that started at 13:00.
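The linear growth makes the timing easy to estimate. Below is a small sketch that assumes a constant rate of 0.5 states per second and the 10,000-state limit mentioned above. The starting state count at 13:00 is not known from the graphs, so `start_states` is a hypothetical input:

```python
from datetime import timedelta

STATE_LIMIT = 10_000       # state limit mentioned in the thread
STATES_PER_SECOND = 0.5    # "about one state every two seconds"


def time_to_limit(start_states: int,
                  limit: int = STATE_LIMIT,
                  rate: float = STATES_PER_SECOND) -> timedelta:
    """Time for a linearly growing state count to go from start_states to limit.

    start_states is hypothetical: the actual count at 13:00 is not known here.
    """
    if start_states >= limit:
        return timedelta(0)
    return timedelta(seconds=(limit - start_states) / rate)


if __name__ == "__main__":
    # Upper bound: starting from an empty state table.
    print(time_to_limit(0))
```

Starting from an empty table, the limit would be reached 20,000 seconds (about 5.5 hours) after 13:00. Any states already present at 13:00 shorten that time proportionally.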

> Here's one piece of core.0.txt which makes no sense to me -- the "rate"
> column.  I have a very hard time believing that was the interrupt rate
> of all the relevant devices at the time (way too high).  Maybe this data
> becomes wrong only during a coredump?  The total column I could believe.
> 
> ------------------------------------------------------------------------
> vmstat -i
> 
> interrupt                          total       rate
> irq4: uart0                        54768        912
> irq6: fdc0                             1          0
> irq17: uhci1+                        172          2
> irq23: uhci3 ehci1+                 2367         39
> cpu0: timer                  13183882632  219731377
> irq256: em0                    260491055    4341517
> irq257: em1                    127555036    2125917
> irq258: ahci0                  225923164    3765386
> cpu2: timer                  13183881837  219731363
> cpu1: timer                  13002196469  216703274
> cpu3: timer                  13183881783  219731363
> Total                        53167869284  886131154
> ------------------------------------------------------------------------

I find this suspect as well, but I don't have an explanation yet.
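A quick arithmetic check on the quoted numbers: if the rate column is the total divided by uptime in seconds (assumed here as the way `vmstat -i` computes it), every row's rate equals its total floor-divided by 60. The column therefore looks as if it had been computed against an uptime of about one minute instead of the uptime the totals imply. This does not explain why, and the 60-second divisor is only the hypothesis being checked, not a value read from the dump:

```python
# Rows copied from the quoted vmstat -i output: (name, total, rate).
ROWS = [
    ("irq4: uart0", 54768, 912),
    ("irq6: fdc0", 1, 0),
    ("irq17: uhci1+", 172, 2),
    ("irq23: uhci3 ehci1+", 2367, 39),
    ("cpu0: timer", 13183882632, 219731377),
    ("irq256: em0", 260491055, 4341517),
    ("irq257: em1", 127555036, 2125917),
    ("irq258: ahci0", 225923164, 3765386),
    ("cpu2: timer", 13183881837, 219731363),
    ("cpu1: timer", 13002196469, 216703274),
    ("cpu3: timer", 13183881783, 219731363),
    ("Total", 53167869284, 886131154),
]


def implied_divisor(total: int, rate: int):
    """Seconds of uptime that would turn `total` into `rate` (None if rate is 0)."""
    return total / rate if rate else None


def matches_floor_division(rows, seconds: int) -> bool:
    """True if every rate equals total // seconds (integer division)."""
    return all(total // seconds == rate for _, total, rate in rows)


if __name__ == "__main__":
    for name, total, rate in ROWS:
        print(f"{name:22} implied divisor: {implied_divisor(total, rate)}")
    print("rate == total // 60 for all rows:", matches_floor_division(ROWS, 60))
```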

Are you using anything non-GENERIC related to timers, like changing
HZ or enabling polling?

Are you sure the problem grew gradually worse, rather than starting
abruptly at 13:00 and causing complete packet loss for the entire
period?

Daniel


