From owner-freebsd-pf@FreeBSD.ORG Tue May 3 05:22:51 2011 Return-Path: Delivered-To: freebsd-pf@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A4E4F106566C for ; Tue, 3 May 2011 05:22:51 +0000 (UTC) (envelope-from dudu@dudu.ro) Received: from mail-qw0-f54.google.com (mail-qw0-f54.google.com [209.85.216.54]) by mx1.freebsd.org (Postfix) with ESMTP id 34D2A8FC16 for ; Tue, 3 May 2011 05:22:50 +0000 (UTC) Received: by qwc9 with SMTP id 9so3730732qwc.13 for ; Mon, 02 May 2011 22:22:50 -0700 (PDT) Received: by 10.229.114.77 with SMTP id d13mr6803064qcq.219.1304400170184; Mon, 02 May 2011 22:22:50 -0700 (PDT) MIME-Version: 1.0 Received: by 10.229.105.18 with HTTP; Mon, 2 May 2011 22:22:10 -0700 (PDT) In-Reply-To: <20110503015854.GA31444@icarus.home.lan> References: <20110503015854.GA31444@icarus.home.lan> From: Vlad Galu Date: Tue, 3 May 2011 07:22:10 +0200 Message-ID: To: Jeremy Chadwick Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: freebsd-stable@freebsd.org, freebsd-pf@freebsd.org Subject: Re: RELENG_8 pf stack issue (state count spiraling out of control) X-BeenThere: freebsd-pf@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Technical discussion and general questions about packet filter \(pf\)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 May 2011 05:22:51 -0000 On Tue, May 3, 2011 at 3:58 AM, Jeremy Chadwick wrote: > (Please keep me CC'd as I'm not subscribed to freebsd-pf. And apologies > for cross-posting, but the issue is severe enough that I wanted to make > it known on -stable) > > The below issue I'm describing is from a machine running 8.2-PRERELEASE > (RELENG_8) using src dated Tue Feb 15 05:46:02 PST 2011. > > Please read the story in full, as I have taken the time to describe > everything I did, plus log output, as well as induce a panic via "call > doadump" from ddb so I have a capture of the system at the time. I also > have a theory as to what caused the problem, but how to trigger it is > unknown; it may be a rare race condition. > > > This morning I woke up to find a report from one of our users that he > could not connect to a specific TCP port (not SSH) on one of our > servers. I also found that I couldn't SSH into the same box. Serial > console was working fine, and the serial console log showed no sign of > any problems. > > I started to debug the issue of me not being able to SSH into the > machine and within a few minutes became immediately concerned: pfctl > indicated we had reached the maximum number permitted state table > entries (10,000). > > ============================================================ > # pfctl -s info > Status: Enabled for 76 days 06:49:10 Debug: Urgent > > Interface Stats for em0 IPv4 IPv6 > Bytes In 8969748840 0 > Bytes Out 8296135477 0 > Packets In > Passed 128211763 0 > Blocked 621379 0 > Packets Out > Passed 138483868 0 > Blocked 2579 0 > > State Table Total Rate > current entries 10000 > searches 267316807 40.6/s > inserts 4440553 0.7/s > removals 4430553 0.7/s > Counters > match 5067474 0.8/s > bad-offset 0 0.0/s > fragment 324 0.0/s > short 0 0.0/s > normalize 32 0.0/s > memory 336946 0.1/s > bad-timestamp 0 0.0/s > congestion 0 0.0/s > ip-option 0 0.0/s > proto-cksum 1611 0.0/s > state-mismatch 509 0.0/s > state-insert 0 0.0/s > state-limit 0 0.0/s > src-limit 0 0.0/s > synproxy 0 0.0/s > > # pfctl -s memory > states hard limit 10000 > src-nodes hard limit 10000 > frags hard limit 5000 > tables hard limit 1000 > table-entries hard limit 100000 > ============================================================ > > The above is mainly for em0 (our WAN interface); our LAN interface (em1) > was not impacted because we use "set skip on em1". And it's a good > thing too: we have lots of LAN-based services that this machine provides > that would have been impacted. We also use "set skip on lo0". > > I immediately went to look at our monitoring graphs, which monitor pf > state (specifically state table entries), polled via bsnmpd(8). This > data is even more frightening: > > http://jdc.parodius.com/freebsd/pf-issue/pf_states-day.png > http://jdc.parodius.com/freebsd/pf-issue/pf_states-week.png > > Literally something was spiraling out of control, starting at approx. > 2011/05/01 (Sun) at 12:30 PDT. The situation became dire at approx. > 19:45 PDT the same day, but I wasn't aware of it until said user brought > an issue to my attention. > > You can see from the network I/O graphs (taken from SNMP polling our > switch, NOT from the host/box itself) that there was no DoS attack or > anything like that occurring -- this was something within FreeBSD > itself. More evidence of that will become apparent. > > http://jdc.parodius.com/freebsd/pf-issue/port_03-day.png > http://jdc.parodius.com/freebsd/pf-issue/port_03-week.png > > The first thing I did was "/etc/rc.d/pf reload". This command hung. > Any attempt to send Ctrl-C/SIGINT did nothing. I was able to > Ctrl-Z/SIGSTOP it, then use kill %1, but the actual reload process did > not truly die (despite csh stating "Terminated"). The only way to kill > it was to kill -9. > > Attempts to shut down any daemons which utilised the network -- > including things that only used em1 -- would not shut down. This > included things like postfix, mysqld, and some inet-based services. I > was forced to kill -9 them. Things like bsnmpd, however, did shut down. > > Equally as uncomfortable, "shutdown -r now" did not reboot the system. > That is to say, wall(1)'s announcement was shown, but the actual > stopping of services did not begin. > > The next thing I tried was "/etc/rc.d/pf stop", which worked. Then I > did "/etc/rc.d/pf start", which also worked. However, what I saw next > surely indicated a bug in the pf layer somewhere -- "pfctl -s states" > and "pfctl -s info" disagreed on the state count: > > ============================================================ > # pfctl -s info > Status: Enabled for 0 days 00:00:16 Debug: Urgent > > Interface Stats for em0 IPv4 IPv6 > Bytes In 3459 0 > Bytes Out 0 0 > Packets In > Passed 0 0 > Blocked 29 0 > Packets Out > Passed 0 0 > Blocked 0 0 > > State Table Total Rate > current entries 10000 > searches 29 1.8/s > inserts 0 0.0/s > removals 0 0.0/s > Counters > match 29 1.8/s > bad-offset 0 0.0/s > fragment 0 0.0/s > short 0 0.0/s > normalize 0 0.0/s > memory 18 1.1/s > bad-timestamp 0 0.0/s > congestion 0 0.0/s > ip-option 0 0.0/s > proto-cksum 0 0.0/s > state-mismatch 0 0.0/s > state-insert 0 0.0/s > state-limit 0 0.0/s > src-limit 0 0.0/s > synproxy 0 0.0/s > > # pfctl -s state | wc -l > 0 > ============================================================ > > The "pf uptime" shown above, by the way, matches system uptime. > > I then attempted "pfctl -F state", but nothing changed (looked the same > as above). > > Since I could not reboot the box, I was forced to drop to ddb via serial > console. I did some commands like "ps" and the like, and then "call > doadump" to induce a kernel panic, and then "reboot" (which worked). > > Once the machine came back up, savecore(8) ran, wrote the data out, and > everything has been fine since. /var/crash/core.txt.0 is ~68KBytes and > I do not feel comfortable sharing its content publicly, but will be > happy to hand it to developer(s) who are interested. Relevant tidbits I > can discern: > > ------------------------------------------------------------------------ > ps -axl > > UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND > 0 422 0 0 -16 0 0 0 pftm DL ?? 1362773081:04.00 > [pfpurge] > ------------------------------------------------------------------------ > > ------------------------------------------------------------------------ > vmstat -z > > ITEM SIZE LIMIT USED FREE REQUESTS > FAILURES > pfsrctrpl: 152, 10000, 0, 0, 0, > 0 > pfrulepl: 912, 0, 40, 88, 806, > 0 > pfstatepl: 392, 10000, 10000, 0, 4440553, > 341638 > pfaltqpl: 240, 0, 0, 0, 0, > 0 > pfpooladdrpl: 88, 0, 0, 0, 0, > 0 > pfrktable: 1296, 1002, 4, 20, 112, > 0 > pfrkentry: 216, 100008, 603, 891, 15384, > 0 > pfrkentry2: 216, 0, 0, 0, 0, > 0 > pffrent: 32, 5050, 0, 303, 1620, > 0 > pffrag: 80, 0, 0, 135, 807, > 0 > pffrcache: 80, 10035, 0, 0, 0, > 0 > pffrcent: 24, 50022, 0, 0, 0, > 0 > pfstatescrub: 40, 0, 0, 0, 0, > 0 > pfiaddrpl: 120, 0, 0, 0, 0, > 0 > pfospfen: 112, 0, 696, 30, 18096, > 0 > pfosfp: 40, 0, 407, 97, 10582, > 0 > ------------------------------------------------------------------------ > > You can see evidence of processes not exiting/doing what they should do > here: > > ------------------------------------------------------------------------ > fstat > > USER CMD PID FD MOUNT INUM MODE SZ|DV R/W > root shutdown 91155 root / 2 drwxr-xr-x 512 r > root shutdown 91155 wd / 2 drwxr-xr-x 512 r > root shutdown 91155 text / 47195 -r-sr-x--- 15912 r > root shutdown 91155 0 /dev 38 crw------- ttyu0 rw > root shutdown 91155 1 /dev 38 crw------- ttyu0 rw > root shutdown 91155 2 /dev 38 crw------- ttyu0 rw > root sh 91129 root / 2 drwxr-xr-x 512 r > root sh 91129 wd / 2 drwxr-xr-x 512 r > root sh 91129 text / 44 -r-xr-xr-x 134848 r > root sh 91129 0 /dev 38 crw------- ttyu0 rw > root sh 91129 1* pipe ffffff01e78fc9e0 <-> ffffff01e78fc888 > 0 rw > root sh 91129 2 /dev 20 crw-rw-rw- null w > root shutdown 91115 root / 2 drwxr-xr-x 512 r > root shutdown 91115 wd /storage 5 drwx------ 37 r > root shutdown 91115 text / 47195 -r-sr-x--- 15912 r > root shutdown 91115 0 /dev 38 crw------- ttyu0 rw > root shutdown 91115 1 /dev 38 crw------- ttyu0 rw > root shutdown 91115 2 /dev 38 crw------- ttyu0 rw > root shutdown 91115 3* local dgram ffffff008ff92960 > root sh 90818 root / 2 drwxr-xr-x 512 r > root sh 90818 wd / 70659 drwxr-xr-x 512 r > root sh 90818 text / 44 -r-xr-xr-x 134848 r > root sh 90818 0 /dev 38 crw------- ttyu0 rw > root sh 90818 1* pipe ffffff0043f1ecb8 <-> ffffff0043f1eb60 > 0 rw > root sh 90818 2 /dev 20 crw-rw-rw- null w > root csh 90802 root / 2 drwxr-xr-x 512 r > root csh 90802 wd / 2 drwxr-xr-x 512 r > root csh 90802 text / 51 -r-xr-xr-x 358752 r > root csh 90802 15 /dev 38 crw------- ttyu0 rw > root csh 90802 16 /dev 38 crw------- ttyu0 rw > root csh 90802 17 /dev 38 crw------- ttyu0 rw > root csh 90802 18 /dev 38 crw------- ttyu0 rw > root csh 90802 19 /dev 38 crw------- ttyu0 rw > ------------------------------------------------------------------------ > > No indication of mbuf exhaustion, putting further focus on the pf stack: > > ------------------------------------------------------------------------ > netstat -m > > 2054/1786/3840 mbufs in use (current/cache/total) > 2048/1414/3462/25600 mbuf clusters in use (current/cache/total/max) > 2048/896 mbuf+clusters out of packet secondary zone in use (current/cache) > 0/320/320/12800 4k (page size) jumbo clusters in use > (current/cache/total/max) > 0/0/0/19200 9k jumbo clusters in use (current/cache/total/max) > 0/0/0/12800 16k jumbo clusters in use (current/cache/total/max) > 4609K/4554K/9164K bytes allocated to network (current/cache/total) > 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) > 0/0/0 requests for jumbo clusters denied (4k/9k/16k) > 0 requests for sfbufs denied > 0 requests for sfbufs delayed > 0 requests for I/O initiated by sendfile > 0 calls to protocol drain routines > ------------------------------------------------------------------------ > > Here's one piece of core.0.txt which makes no sense to me -- the "rate" > column. I have a very hard time believing that was the interrupt rate > of all the relevant devices at the time (way too high). Maybe this data > becomes wrong only during a coredump? The total column I could believe. > > ------------------------------------------------------------------------ > vmstat -i > > interrupt total rate > irq4: uart0 54768 912 > irq6: fdc0 1 0 > irq17: uhci1+ 172 2 > irq23: uhci3 ehci1+ 2367 39 > cpu0: timer 13183882632 219731377 > irq256: em0 260491055 4341517 > irq257: em1 127555036 2125917 > irq258: ahci0 225923164 3765386 > cpu2: timer 13183881837 219731363 > cpu1: timer 13002196469 216703274 > cpu3: timer 13183881783 219731363 > Total 53167869284 886131154 > ------------------------------------------------------------------------ > > Here's what a normal "vmstat -i" shows from the command-line: > > # vmstat -i > interrupt total rate > irq4: uart0 518 0 > irq6: fdc0 1 0 > irq23: uhci3 ehci1+ 145 0 > cpu0: timer 19041199 1999 > irq256: em0 614280 64 > irq257: em1 168529 17 > irq258: ahci0 355536 37 > cpu2: timer 19040462 1999 > cpu1: timer 19040458 1999 > cpu3: timer 19040454 1999 > Total 77301582 8119 > > We graph many aspects of this box, including CPU load, memory/swap > usage, etc. and none show any sign that the interrupt rate on all of > those devices was even remotely out of control. (I would expect to see > CPU through the roof given the above data) > > I have since rebuilt/reinstalled world/kernel on the machine with the > latest RELENG_8 code (box is now 8.2-STABLE #0: Mon May 2 14:44:18 PDT > 2011), hoping whatever this was may have been fixed. > > As for what I think may have triggered it, but I have no hard evidence > of such: on April 29th, I changed our pf.conf and did "/etc/rc.d/pf > reload". The pf.conf change was a single line: > > Old: scrub on em0 all > New: scrub in on em0 all > > Why it took the problem approximately 3 days to start is unknown. It's > the only change we've made to the system (truly/honestly), and it was a > change to pf.conf. > > If anyone has advice (or has seen the above problem), or is interested > in debugging it -- as I said, I have a vmcore -- I'm happy to assist in > any way I can. I would hate for someone else to get bit by this, and > really am hoping its something that has been fixed between February and > now. > > I'm seeing this as well. You could change your scrub rules so that you specifically avoid TCP reassembly (that creates states). -- Good, fast & cheap. Pick any two.