From owner-freebsd-pf@FreeBSD.ORG Tue May 3 02:12:14 2011 Return-Path: Delivered-To: freebsd-pf@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1B2D5106564A for ; Tue, 3 May 2011 02:12:14 +0000 (UTC) (envelope-from jdc@koitsu.dyndns.org) Received: from qmta15.westchester.pa.mail.comcast.net (qmta15.westchester.pa.mail.comcast.net [76.96.59.228]) by mx1.freebsd.org (Postfix) with ESMTP id B8BE28FC13 for ; Tue, 3 May 2011 02:12:13 +0000 (UTC) Received: from omta21.westchester.pa.mail.comcast.net ([76.96.62.72]) by qmta15.westchester.pa.mail.comcast.net with comcast id ept61g0061ZXKqc5Fpyyge; Tue, 03 May 2011 01:58:58 +0000 Received: from koitsu.dyndns.org ([67.180.84.87]) by omta21.westchester.pa.mail.comcast.net with comcast id epyw1g00d1t3BNj3hpyxTl; Tue, 03 May 2011 01:58:58 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id E58B49B418; Mon, 2 May 2011 18:58:54 -0700 (PDT) Date: Mon, 2 May 2011 18:58:54 -0700 From: Jeremy Chadwick To: freebsd-pf@freebsd.org Message-ID: <20110503015854.GA31444@icarus.home.lan> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-stable@freebsd.org Subject: RELENG_8 pf stack issue (state count spiraling out of control) X-BeenThere: freebsd-pf@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Technical discussion and general questions about packet filter \(pf\)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 May 2011 02:12:14 -0000 (Please keep me CC'd as I'm not subscribed to freebsd-pf. And apologies for cross-posting, but the issue is severe enough that I wanted to make it known on -stable) The below issue I'm describing is from a machine running 8.2-PRERELEASE (RELENG_8) using src dated Tue Feb 15 05:46:02 PST 2011. Please read the story in full, as I have taken the time to describe everything I did, plus log output, as well as induce a panic via "call doadump" from ddb so I have a capture of the system at the time. I also have a theory as to what caused the problem, but how to trigger it is unknown; it may be a rare race condition. This morning I woke up to find a report from one of our users that he could not connect to a specific TCP port (not SSH) on one of our servers. I also found that I couldn't SSH into the same box. Serial console was working fine, and the serial console log showed no sign of any problems. I started to debug the issue of me not being able to SSH into the machine and within a few minutes became immediately concerned: pfctl indicated we had reached the maximum number permitted state table entries (10,000). ============================================================ # pfctl -s info Status: Enabled for 76 days 06:49:10 Debug: Urgent Interface Stats for em0 IPv4 IPv6 Bytes In 8969748840 0 Bytes Out 8296135477 0 Packets In Passed 128211763 0 Blocked 621379 0 Packets Out Passed 138483868 0 Blocked 2579 0 State Table Total Rate current entries 10000 searches 267316807 40.6/s inserts 4440553 0.7/s removals 4430553 0.7/s Counters match 5067474 0.8/s bad-offset 0 0.0/s fragment 324 0.0/s short 0 0.0/s normalize 32 0.0/s memory 336946 0.1/s bad-timestamp 0 0.0/s congestion 0 0.0/s ip-option 0 0.0/s proto-cksum 1611 0.0/s state-mismatch 509 0.0/s state-insert 0 0.0/s state-limit 0 0.0/s src-limit 0 0.0/s synproxy 0 0.0/s # pfctl -s memory states hard limit 10000 src-nodes hard limit 10000 frags hard limit 5000 tables hard limit 1000 table-entries hard limit 100000 ============================================================ The above is mainly for em0 (our WAN interface); our LAN interface (em1) was not impacted because we use "set skip on em1". And it's a good thing too: we have lots of LAN-based services that this machine provides that would have been impacted. We also use "set skip on lo0". I immediately went to look at our monitoring graphs, which monitor pf state (specifically state table entries), polled via bsnmpd(8). This data is even more frightening: http://jdc.parodius.com/freebsd/pf-issue/pf_states-day.png http://jdc.parodius.com/freebsd/pf-issue/pf_states-week.png Literally something was spiraling out of control, starting at approx. 2011/05/01 (Sun) at 12:30 PDT. The situation became dire at approx. 19:45 PDT the same day, but I wasn't aware of it until said user brought an issue to my attention. You can see from the network I/O graphs (taken from SNMP polling our switch, NOT from the host/box itself) that there was no DoS attack or anything like that occurring -- this was something within FreeBSD itself. More evidence of that will become apparent. http://jdc.parodius.com/freebsd/pf-issue/port_03-day.png http://jdc.parodius.com/freebsd/pf-issue/port_03-week.png The first thing I did was "/etc/rc.d/pf reload". This command hung. Any attempt to send Ctrl-C/SIGINT did nothing. I was able to Ctrl-Z/SIGSTOP it, then use kill %1, but the actual reload process did not truly die (despite csh stating "Terminated"). The only way to kill it was to kill -9. Attempts to shut down any daemons which utilised the network -- including things that only used em1 -- would not shut down. This included things like postfix, mysqld, and some inet-based services. I was forced to kill -9 them. Things like bsnmpd, however, did shut down. Equally as uncomfortable, "shutdown -r now" did not reboot the system. That is to say, wall(1)'s announcement was shown, but the actual stopping of services did not begin. The next thing I tried was "/etc/rc.d/pf stop", which worked. Then I did "/etc/rc.d/pf start", which also worked. However, what I saw next surely indicated a bug in the pf layer somewhere -- "pfctl -s states" and "pfctl -s info" disagreed on the state count: ============================================================ # pfctl -s info Status: Enabled for 0 days 00:00:16 Debug: Urgent Interface Stats for em0 IPv4 IPv6 Bytes In 3459 0 Bytes Out 0 0 Packets In Passed 0 0 Blocked 29 0 Packets Out Passed 0 0 Blocked 0 0 State Table Total Rate current entries 10000 searches 29 1.8/s inserts 0 0.0/s removals 0 0.0/s Counters match 29 1.8/s bad-offset 0 0.0/s fragment 0 0.0/s short 0 0.0/s normalize 0 0.0/s memory 18 1.1/s bad-timestamp 0 0.0/s congestion 0 0.0/s ip-option 0 0.0/s proto-cksum 0 0.0/s state-mismatch 0 0.0/s state-insert 0 0.0/s state-limit 0 0.0/s src-limit 0 0.0/s synproxy 0 0.0/s # pfctl -s state | wc -l 0 ============================================================ The "pf uptime" shown above, by the way, matches system uptime. I then attempted "pfctl -F state", but nothing changed (looked the same as above). Since I could not reboot the box, I was forced to drop to ddb via serial console. I did some commands like "ps" and the like, and then "call doadump" to induce a kernel panic, and then "reboot" (which worked). Once the machine came back up, savecore(8) ran, wrote the data out, and everything has been fine since. /var/crash/core.txt.0 is ~68KBytes and I do not feel comfortable sharing its content publicly, but will be happy to hand it to developer(s) who are interested. Relevant tidbits I can discern: ------------------------------------------------------------------------ ps -axl UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND 0 422 0 0 -16 0 0 0 pftm DL ?? 1362773081:04.00 [pfpurge] ------------------------------------------------------------------------ ------------------------------------------------------------------------ vmstat -z ITEM SIZE LIMIT USED FREE REQUESTS FAILURES pfsrctrpl: 152, 10000, 0, 0, 0, 0 pfrulepl: 912, 0, 40, 88, 806, 0 pfstatepl: 392, 10000, 10000, 0, 4440553, 341638 pfaltqpl: 240, 0, 0, 0, 0, 0 pfpooladdrpl: 88, 0, 0, 0, 0, 0 pfrktable: 1296, 1002, 4, 20, 112, 0 pfrkentry: 216, 100008, 603, 891, 15384, 0 pfrkentry2: 216, 0, 0, 0, 0, 0 pffrent: 32, 5050, 0, 303, 1620, 0 pffrag: 80, 0, 0, 135, 807, 0 pffrcache: 80, 10035, 0, 0, 0, 0 pffrcent: 24, 50022, 0, 0, 0, 0 pfstatescrub: 40, 0, 0, 0, 0, 0 pfiaddrpl: 120, 0, 0, 0, 0, 0 pfospfen: 112, 0, 696, 30, 18096, 0 pfosfp: 40, 0, 407, 97, 10582, 0 ------------------------------------------------------------------------ You can see evidence of processes not exiting/doing what they should do here: ------------------------------------------------------------------------ fstat USER CMD PID FD MOUNT INUM MODE SZ|DV R/W root shutdown 91155 root / 2 drwxr-xr-x 512 r root shutdown 91155 wd / 2 drwxr-xr-x 512 r root shutdown 91155 text / 47195 -r-sr-x--- 15912 r root shutdown 91155 0 /dev 38 crw------- ttyu0 rw root shutdown 91155 1 /dev 38 crw------- ttyu0 rw root shutdown 91155 2 /dev 38 crw------- ttyu0 rw root sh 91129 root / 2 drwxr-xr-x 512 r root sh 91129 wd / 2 drwxr-xr-x 512 r root sh 91129 text / 44 -r-xr-xr-x 134848 r root sh 91129 0 /dev 38 crw------- ttyu0 rw root sh 91129 1* pipe ffffff01e78fc9e0 <-> ffffff01e78fc888 0 rw root sh 91129 2 /dev 20 crw-rw-rw- null w root shutdown 91115 root / 2 drwxr-xr-x 512 r root shutdown 91115 wd /storage 5 drwx------ 37 r root shutdown 91115 text / 47195 -r-sr-x--- 15912 r root shutdown 91115 0 /dev 38 crw------- ttyu0 rw root shutdown 91115 1 /dev 38 crw------- ttyu0 rw root shutdown 91115 2 /dev 38 crw------- ttyu0 rw root shutdown 91115 3* local dgram ffffff008ff92960 root sh 90818 root / 2 drwxr-xr-x 512 r root sh 90818 wd / 70659 drwxr-xr-x 512 r root sh 90818 text / 44 -r-xr-xr-x 134848 r root sh 90818 0 /dev 38 crw------- ttyu0 rw root sh 90818 1* pipe ffffff0043f1ecb8 <-> ffffff0043f1eb60 0 rw root sh 90818 2 /dev 20 crw-rw-rw- null w root csh 90802 root / 2 drwxr-xr-x 512 r root csh 90802 wd / 2 drwxr-xr-x 512 r root csh 90802 text / 51 -r-xr-xr-x 358752 r root csh 90802 15 /dev 38 crw------- ttyu0 rw root csh 90802 16 /dev 38 crw------- ttyu0 rw root csh 90802 17 /dev 38 crw------- ttyu0 rw root csh 90802 18 /dev 38 crw------- ttyu0 rw root csh 90802 19 /dev 38 crw------- ttyu0 rw ------------------------------------------------------------------------ No indication of mbuf exhaustion, putting further focus on the pf stack: ------------------------------------------------------------------------ netstat -m 2054/1786/3840 mbufs in use (current/cache/total) 2048/1414/3462/25600 mbuf clusters in use (current/cache/total/max) 2048/896 mbuf+clusters out of packet secondary zone in use (current/cache) 0/320/320/12800 4k (page size) jumbo clusters in use (current/cache/total/max) 0/0/0/19200 9k jumbo clusters in use (current/cache/total/max) 0/0/0/12800 16k jumbo clusters in use (current/cache/total/max) 4609K/4554K/9164K bytes allocated to network (current/cache/total) 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) 0/0/0 requests for jumbo clusters denied (4k/9k/16k) 0 requests for sfbufs denied 0 requests for sfbufs delayed 0 requests for I/O initiated by sendfile 0 calls to protocol drain routines ------------------------------------------------------------------------ Here's one piece of core.0.txt which makes no sense to me -- the "rate" column. I have a very hard time believing that was the interrupt rate of all the relevant devices at the time (way too high). Maybe this data becomes wrong only during a coredump? The total column I could believe. ------------------------------------------------------------------------ vmstat -i interrupt total rate irq4: uart0 54768 912 irq6: fdc0 1 0 irq17: uhci1+ 172 2 irq23: uhci3 ehci1+ 2367 39 cpu0: timer 13183882632 219731377 irq256: em0 260491055 4341517 irq257: em1 127555036 2125917 irq258: ahci0 225923164 3765386 cpu2: timer 13183881837 219731363 cpu1: timer 13002196469 216703274 cpu3: timer 13183881783 219731363 Total 53167869284 886131154 ------------------------------------------------------------------------ Here's what a normal "vmstat -i" shows from the command-line: # vmstat -i interrupt total rate irq4: uart0 518 0 irq6: fdc0 1 0 irq23: uhci3 ehci1+ 145 0 cpu0: timer 19041199 1999 irq256: em0 614280 64 irq257: em1 168529 17 irq258: ahci0 355536 37 cpu2: timer 19040462 1999 cpu1: timer 19040458 1999 cpu3: timer 19040454 1999 Total 77301582 8119 We graph many aspects of this box, including CPU load, memory/swap usage, etc. and none show any sign that the interrupt rate on all of those devices was even remotely out of control. (I would expect to see CPU through the roof given the above data) I have since rebuilt/reinstalled world/kernel on the machine with the latest RELENG_8 code (box is now 8.2-STABLE #0: Mon May 2 14:44:18 PDT 2011), hoping whatever this was may have been fixed. As for what I think may have triggered it, but I have no hard evidence of such: on April 29th, I changed our pf.conf and did "/etc/rc.d/pf reload". The pf.conf change was a single line: Old: scrub on em0 all New: scrub in on em0 all Why it took the problem approximately 3 days to start is unknown. It's the only change we've made to the system (truly/honestly), and it was a change to pf.conf. If anyone has advice (or has seen the above problem), or is interested in debugging it -- as I said, I have a vmcore -- I'm happy to assist in any way I can. I would hate for someone else to get bit by this, and really am hoping its something that has been fixed between February and now. -- | Jeremy Chadwick jdc@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP 4BD6C0CB |