From owner-freebsd-stable@FreeBSD.ORG Tue May 3 05:45:33 2011
Date: Mon, 2 May 2011 22:45:29 -0700
From: Jeremy Chadwick
To: Jason Hellenthal
Cc: freebsd-stable@freebsd.org, freebsd-pf@freebsd.org
Subject: Re: RELENG_8 pf stack issue (state count spiraling out of control)
Message-ID: <20110503054529.GA35688@icarus.home.lan>
In-Reply-To: <20110503050634.GB53414@DataIX.net>
References: <20110503015854.GA31444@icarus.home.lan> <20110503050634.GB53414@DataIX.net>

On Tue, May 03, 2011 at 01:06:34AM -0400, Jason Hellenthal wrote:
> On Mon, May 02, 2011 at 06:58:54PM -0700, Jeremy Chadwick wrote:
> >(Please keep me CC'd as I'm not subscribed to freebsd-pf. And apologies
> >for cross-posting, but the issue is severe enough that I wanted to make
> >it known on -stable.)
> >
> >The issue I'm describing below is from a machine running 8.2-PRERELEASE
> >(RELENG_8) using src dated Tue Feb 15 05:46:02 PST 2011.
> >
> >Please read the story in full, as I have taken the time to describe
> >everything I did, plus log output, and to induce a panic via "call
> >doadump" from ddb so I have a capture of the system at the time. I also
> >have a theory as to what caused the problem, but how to trigger it is
> >unknown; it may be a rare race condition.
> >
> >This morning I woke up to find a report from one of our users that he
> >could not connect to a specific TCP port (not SSH) on one of our
> >servers. I also found that I couldn't SSH into the same box. The serial
> >console was working fine, and the serial console log showed no sign of
> >any problems.
> >
> >I started to debug why I couldn't SSH into the machine and within a few
> >minutes became immediately concerned: pfctl indicated we had reached
> >the maximum number of permitted state table entries (10,000).
> >
> >============================================================
> ># pfctl -s info
> >Status: Enabled for 76 days 06:49:10           Debug: Urgent
> >
> >Interface Stats for em0               IPv4             IPv6
> >  Bytes In                      8969748840                0
> >  Bytes Out                     8296135477                0
> >  Packets In
> >    Passed                       128211763                0
> >    Blocked                         621379                0
> >  Packets Out
> >    Passed                       138483868                0
> >    Blocked                           2579                0
> >
> >State Table                          Total             Rate
> >  current entries                    10000
> >  searches                       267316807           40.6/s
> >  inserts                          4440553            0.7/s
> >  removals                         4430553            0.7/s
> >Counters
> >  match                            5067474            0.8/s
> >  bad-offset                             0            0.0/s
> >  fragment                             324            0.0/s
> >  short                                  0            0.0/s
> >  normalize                             32            0.0/s
> >  memory                            336946            0.1/s
> >  bad-timestamp                          0            0.0/s
> >  congestion                             0            0.0/s
> >  ip-option                              0            0.0/s
> >  proto-cksum                         1611            0.0/s
> >  state-mismatch                       509            0.0/s
> >  state-insert                           0            0.0/s
> >  state-limit                            0            0.0/s
> >  src-limit                              0            0.0/s
> >  synproxy                               0            0.0/s
> >
> ># pfctl -s memory
> >states        hard limit    10000
> >src-nodes     hard limit    10000
> >frags         hard limit     5000
> >tables        hard limit     1000
> >table-entries hard limit   100000
> >============================================================
> >
> >The above is mainly for em0 (our WAN interface); our LAN interface
> >(em1) was not impacted because we use "set skip on em1". And it's a
> >good thing too: this machine provides lots of LAN-based services that
> >would have been impacted. We also use "set skip on lo0".
> >
> >I immediately went to look at our monitoring graphs, which track pf
> >state (specifically state table entries), polled via bsnmpd(8). This
> >data is even more frightening:
> >
> >http://jdc.parodius.com/freebsd/pf-issue/pf_states-day.png
> >http://jdc.parodius.com/freebsd/pf-issue/pf_states-week.png
> >
> >Literally something was spiraling out of control, starting at
> >approximately 2011/05/01 (Sun) at 12:30 PDT. The situation became dire
> >at approximately 19:45 PDT the same day, but I wasn't aware of it until
> >said user brought the issue to my attention.
> >
> >You can see from the network I/O graphs (taken by polling our switch
> >via SNMP, NOT from the host/box itself) that there was no DoS attack or
> >anything like that occurring -- this was something within FreeBSD
> >itself. More evidence of that will become apparent.
> >
> >http://jdc.parodius.com/freebsd/pf-issue/port_03-day.png
> >http://jdc.parodius.com/freebsd/pf-issue/port_03-week.png
> >
> >The first thing I did was "/etc/rc.d/pf reload". This command hung.
> >Any attempt to send Ctrl-C/SIGINT did nothing. I was able to
> >Ctrl-Z/SIGSTOP it, then use kill %1, but the actual reload process did
> >not truly die (despite csh stating "Terminated"). The only way to kill
> >it was kill -9.
> >
> >Attempts to shut down any daemons which used the network -- including
> >things that only used em1 -- would not shut down. This included
> >postfix, mysqld, and some inet-based services. I was forced to kill -9
> >them. Things like bsnmpd, however, did shut down.
> >
> >Equally uncomfortable, "shutdown -r now" did not reboot the system.
> >That is to say, wall(1)'s announcement was shown, but the actual
> >stopping of services did not begin.
> >
> >The next thing I tried was "/etc/rc.d/pf stop", which worked. Then I
> >did "/etc/rc.d/pf start", which also worked. However, what I saw next
> >surely indicated a bug in the pf layer somewhere -- "pfctl -s states"
> >and "pfctl -s info" disagreed on the state count:
> >
> >============================================================
> ># pfctl -s info
> >Status: Enabled for 0 days 00:00:16            Debug: Urgent
> >
> >Interface Stats for em0               IPv4             IPv6
> >  Bytes In                            3459                0
> >  Bytes Out                              0                0
> >  Packets In
> >    Passed                               0                0
> >    Blocked                             29                0
> >  Packets Out
> >    Passed                               0                0
> >    Blocked                              0                0
> >
> >State Table                          Total             Rate
> >  current entries                    10000
> >  searches                              29            1.8/s
> >  inserts                                0            0.0/s
> >  removals                               0            0.0/s
> >Counters
> >  match                                 29            1.8/s
> >  bad-offset                             0            0.0/s
> >  fragment                               0            0.0/s
> >  short                                  0            0.0/s
> >  normalize                              0            0.0/s
> >  memory                                18            1.1/s
> >  bad-timestamp                          0            0.0/s
> >  congestion                             0            0.0/s
> >  ip-option                              0            0.0/s
> >  proto-cksum                            0            0.0/s
> >  state-mismatch                         0            0.0/s
> >  state-insert                           0            0.0/s
> >  state-limit                            0            0.0/s
> >  src-limit                              0            0.0/s
> >  synproxy                               0            0.0/s
> >
> ># pfctl -s state | wc -l
> >       0
> >============================================================
> >
> >The "pf uptime" shown above, by the way, matches the system uptime.
> >
> >I then attempted "pfctl -F state", but nothing changed (the output
> >looked the same as above).
> >
> >Since I could not reboot the box, I was forced to drop to ddb via the
> >serial console. I ran some commands like "ps" and the like, then "call
> >doadump" to induce a kernel panic, and then "reboot" (which worked).
> >
> >Once the machine came back up, savecore(8) ran, wrote the data out, and
> >everything has been fine since. /var/crash/core.txt.0 is ~68KBytes and
> >I do not feel comfortable sharing its contents publicly, but I will be
> >happy to hand it to developer(s) who are interested. Relevant tidbits I
> >can discern:
> >
> >------------------------------------------------------------------------
> >ps -axl
> >
> > UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT              TIME COMMAND
> >   0 422    0   0 -16  0   0   0 pftm   DL   ?? 1362773081:04.00 [pfpurge]
> >------------------------------------------------------------------------
> >
> >------------------------------------------------------------------------
> >vmstat -z
> >
> >ITEM            SIZE    LIMIT    USED   FREE  REQUESTS  FAILURES
> >pfsrctrpl:       152,   10000,      0,     0,        0,        0
> >pfrulepl:        912,       0,     40,    88,      806,        0
> >pfstatepl:       392,   10000,  10000,     0,  4440553,   341638
> >pfaltqpl:        240,       0,      0,     0,        0,        0
> >pfpooladdrpl:     88,       0,      0,     0,        0,        0
> >pfrktable:      1296,    1002,      4,    20,      112,        0
> >pfrkentry:       216,  100008,    603,   891,    15384,        0
> >pfrkentry2:      216,       0,      0,     0,        0,        0
> >pffrent:          32,    5050,      0,   303,     1620,        0
> >pffrag:           80,       0,      0,   135,      807,        0
> >pffrcache:        80,   10035,      0,     0,        0,        0
> >pffrcent:         24,   50022,      0,     0,        0,        0
> >pfstatescrub:     40,       0,      0,     0,        0,        0
> >pfiaddrpl:       120,       0,      0,     0,        0,        0
> >pfospfen:        112,       0,    696,    30,    18096,        0
> >pfosfp:           40,       0,    407,    97,    10582,        0
> >------------------------------------------------------------------------
> >
> >You can see evidence of processes not exiting/doing what they should do
> >here:
> >
> >------------------------------------------------------------------------
> >fstat
> >
> >USER CMD      PID   FD   MOUNT     INUM MODE       SZ|DV  R/W
> >root shutdown 91155 root /            2 drwxr-xr-x    512 r
> >root shutdown 91155 wd   /            2 drwxr-xr-x    512 r
> >root shutdown 91155 text /        47195 -r-sr-x---  15912 r
> >root shutdown 91155    0 /dev        38 crw-------  ttyu0 rw
> >root shutdown 91155    1 /dev        38 crw-------  ttyu0 rw
> >root shutdown 91155    2 /dev        38 crw-------  ttyu0 rw
> >root sh       91129 root /            2 drwxr-xr-x    512 r
> >root sh       91129 wd   /            2 drwxr-xr-x    512 r
> >root sh       91129 text /           44 -r-xr-xr-x 134848 r
> >root sh       91129    0 /dev        38 crw-------  ttyu0 rw
> >root sh       91129    1* pipe ffffff01e78fc9e0 <-> ffffff01e78fc888 0 rw
> >root sh       91129    2 /dev        20 crw-rw-rw-   null w
> >root shutdown 91115 root /            2 drwxr-xr-x    512 r
> >root shutdown 91115 wd   /storage     5 drwx------     37 r
> >root shutdown 91115 text /        47195 -r-sr-x---  15912 r
> >root shutdown 91115    0 /dev        38 crw-------  ttyu0 rw
> >root shutdown 91115    1 /dev        38 crw-------  ttyu0 rw
> >root shutdown 91115    2 /dev        38 crw-------  ttyu0 rw
> >root shutdown 91115    3* local dgram ffffff008ff92960
> >root sh       90818 root /            2 drwxr-xr-x    512 r
> >root sh       90818 wd   /        70659 drwxr-xr-x    512 r
> >root sh       90818 text /           44 -r-xr-xr-x 134848 r
> >root sh       90818    0 /dev        38 crw-------  ttyu0 rw
> >root sh       90818    1* pipe ffffff0043f1ecb8 <-> ffffff0043f1eb60 0 rw
> >root sh       90818    2 /dev        20 crw-rw-rw-   null w
> >root csh      90802 root /            2 drwxr-xr-x    512 r
> >root csh      90802 wd   /            2 drwxr-xr-x    512 r
> >root csh      90802 text /           51 -r-xr-xr-x 358752 r
> >root csh      90802   15 /dev        38 crw-------  ttyu0 rw
> >root csh      90802   16 /dev        38 crw-------  ttyu0 rw
> >root csh      90802   17 /dev        38 crw-------  ttyu0 rw
> >root csh      90802   18 /dev        38 crw-------  ttyu0 rw
> >root csh      90802   19 /dev        38 crw-------  ttyu0 rw
> >------------------------------------------------------------------------
> >
> >No indication of mbuf exhaustion, putting further focus on the pf stack:
> >
> >------------------------------------------------------------------------
> >netstat -m
> >
> >2054/1786/3840 mbufs in use (current/cache/total)
> >2048/1414/3462/25600 mbuf clusters in use (current/cache/total/max)
> >2048/896 mbuf+clusters out of packet secondary zone in use (current/cache)
> >0/320/320/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
> >0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
> >0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
> >4609K/4554K/9164K bytes allocated to network (current/cache/total)
> >0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> >0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> >0 requests for sfbufs denied
> >0 requests for sfbufs delayed
> >0 requests for I/O initiated by sendfile
> >0 calls to protocol drain routines
> >------------------------------------------------------------------------
> >
> >Here's one piece of core.0.txt which makes no sense to me -- the "rate"
> >column. I have a very hard time believing that was the interrupt rate
> >of all the relevant devices at the time (way too high). Maybe this data
> >becomes wrong only during a coredump? The total column I could believe.
> >
> >------------------------------------------------------------------------
> >vmstat -i
> >
> >interrupt                          total       rate
> >irq4: uart0                        54768        912
> >irq6: fdc0                             1          0
> >irq17: uhci1+                        172          2
> >irq23: uhci3 ehci1+                 2367         39
> >cpu0: timer                  13183882632  219731377
> >irq256: em0                    260491055    4341517
> >irq257: em1                    127555036    2125917
> >irq258: ahci0                  225923164    3765386
> >cpu2: timer                  13183881837  219731363
> >cpu1: timer                  13002196469  216703274
> >cpu3: timer                  13183881783  219731363
> >Total                        53167869284  886131154
> >------------------------------------------------------------------------
> >
> >Here's what a normal "vmstat -i" shows from the command line:
> >
> ># vmstat -i
> >interrupt                          total       rate
> >irq4: uart0                          518          0
> >irq6: fdc0                             1          0
> >irq23: uhci3 ehci1+                  145          0
> >cpu0: timer                     19041199       1999
> >irq256: em0                       614280         64
> >irq257: em1                       168529         17
> >irq258: ahci0                     355536         37
> >cpu2: timer                     19040462       1999
> >cpu1: timer                     19040458       1999
> >cpu3: timer                     19040454       1999
> >Total                           77301582       8119
> >
> >We graph many aspects of this box, including CPU load, memory/swap
> >usage, etc., and none show any sign that the interrupt rate on those
> >devices was even remotely out of control. (I would expect to see CPU
> >usage through the roof given the above data.)
> >
> >I have since rebuilt/reinstalled world/kernel on the machine with the
> >latest RELENG_8 code (the box is now 8.2-STABLE #0: Mon May 2 14:44:18
> >PDT 2011), hoping whatever this was may have been fixed.
> >
> >As for what I think may have triggered it -- though I have no hard
> >evidence of this -- on April 29th I changed our pf.conf and did
> >"/etc/rc.d/pf reload". The pf.conf change was a single line:
> >
> >Old: scrub on em0 all
> >New: scrub in on em0 all
> >
> >Why it took approximately 3 days for the problem to start is unknown.
> >It's the only change we've made to the system (truly/honestly), and it
> >was a change to pf.conf.
> >
> >If anyone has advice (or has seen the above problem), or is interested
> >in debugging it -- as I said, I have a vmcore -- I'm happy to assist in
> >any way I can. I would hate for someone else to get bitten by this, and
> >I really am hoping it's something that has been fixed between February
> >and now.
>
> That's quite the deduction there. I've noticed recently that you were
> also experimenting with the new NFS server, recompiling the kernel, and
> so on. Seeing as weird things can happen with DNS, NFS, and mountpoints,
> is this the same machine that you were doing that on?

Completely different machine. The machine which exhibited the problem
does mount a single NFS filesystem from the other box you're referring
to, but it does not use NFSv4. Most importantly: the NFS mounts are also
done across the em1 interface exclusively, which as mentioned falls
under "set skip".

I realise my analysis results in a very ballsy claim, but the graphs are
quite damning, as is the pfctl output and machine behaviour.
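To make the configuration easier to picture, here is a minimal pf.conf
sketch along the lines of what I've described. This is illustrative only
-- the interface roles and the 10,000-state hard limit come from the
pfctl output above (10,000 is, I believe, also pf's default), and the
filter rules are placeholders rather than our actual ruleset:

============================================================
# Illustrative pf.conf fragment -- NOT the real ruleset
set skip on { lo0 em1 }       # LAN (em1) and loopback bypass pf entirely
set limit states 10000        # matches the "states hard limit" shown above

# The single line changed on April 29th:
# was:  scrub on em0 all           (normalize both directions)
scrub in on em0 all                # now: normalize inbound only

# Placeholder filter policy; the real rules are similarly small
block in on em0 all
pass in on em0 inet proto tcp from any to any port ssh keep state
pass out on em0 all keep state
============================================================

Again, that is only meant to show the shape of the ruleset involved;
there is nothing exotic in the real one.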
I wish I knew why the machine couldn't be rebooted cleanly / why
processes were "sort of" wedging. That seems to indicate something weird
going on in the kernel or the pf stack; not sure which.

> If so, can you check how many NFS operation requests were made to/from
> that box, as well as which names would be getting resolved and whether
> that machine can resolve them?

Each machine on our network runs named. Each named instance slaves a
copy of three zones from the masters: reverse in-addr.arpa for the LAN,
reverse in-addr.arpa for the WAN, and the forward zone associated with
the domain name (parodius.com). So, literally every machine on the
network avoids having to go out and ask the Internet for lookups under
our domain or our IP blocks (public or private). Our resolv.conf is
tuned appropriately for this too ("search parodius.com private.network"
and a nameserver of 127.0.0.1). The slaving does happen over the public
interface, for a lot of reasons I'd rather not get into here.

Regarding NFS statistics -- no NFS operations were occurring during the
24-hour window where pf's state count went through the roof, except at
approximately 04:00 PDT when our automated backups ran. The issue did
not start then, as indicated by my analysis. Regardless, below are the
NFS statistics which were provided by "call doadump" and are in
core.0.txt:

------------------------------------------------------------------------
nfsstat

Client Info:
Rpc Counts:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
      172         0        33         0         0         0         0         0
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
        0         0         0         0         0         3         0         3
    Mknod    Fsstat    Fsinfo  PathConf    Commit
        0        56         1         0         0
Rpc Info:
 TimedOut   Invalid X Replies   Retries  Requests
        0         0         0         0       268
Cache Info:
Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
       95       172        26        33         0         0         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses
        0         0         3         3         3         0
Server Info:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
        0         0         0         0         0         0         0         0
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
        0         0         0         0         0         0         0         0
    Mknod    Fsstat    Fsinfo  PathConf    Commit
        0         0         0         0         0
Server Ret-Failed
                0
Server Faults
            0
Server Cache Stats:
   Inprog      Idem  Non-idem    Misses
        0         0         0         0
Server Write Gathering:
 WriteOps  WriteRPC   Opsaved
        0         0         0
------------------------------------------------------------------------

These numbers should confirm that the problem is almost certainly not
NFS-related; they are very small.

For fun, here's the "netstat -s" output from the core.

------------------------------------------------------------------------
netstat -s

tcp:
        70963757 packets sent
                56716136 data packets (93106798714 bytes)
                20493 data packets (14888440 bytes) retransmitted
                3258 data packets unnecessarily retransmitted
                0 resends initiated by MTU discovery
                7640376 ack-only packets (433685 delayed)
                0 URG only packets
                0 window probe packets
                3629755 window update packets
                2956596 control packets
        109589225 packets received
                81346171 acks (for 93113064503 bytes)
                2970456 duplicate acks
                0 acks for unsent data
                56067087 packets (8649172006 bytes) received in-sequence
                5950 completely duplicate packets (5254577 bytes)
                26 old duplicate packets
                152 packets with some dup. data (35886 bytes duped)
                7412 out-of-order packets (10192893 bytes)
                0 packets (0 bytes) of data after window
                0 window probes
                10001480 window update packets
                16 packets received after close
                1561 discarded for bad checksums
                0 discarded for bad header offset fields
                0 discarded because packet too short
                6 discarded due to memory problems
        2159 connection requests
        2954080 connection accepts
        0 bad connection attempts
        0 listen queue overflows
        491 ignored RSTs in the windows
        2955571 connections established (including accepts)
        3066783 connections closed (including 1401 drops)
                2952708 connections updated cached RTT on close
                2952890 connections updated cached RTT variance on close
                2928379 connections updated cached ssthresh on close
        220 embryonic connections dropped
        81149655 segments updated rtt (of 58223063 attempts)
        17684 retransmit timeouts
                169 connections dropped by rexmit timeout
        0 persist timeouts
                0 connections dropped by persist timeout
        0 Connections (fin_wait_2) dropped because of timeout
        76 keepalive timeouts
                76 keepalive probes sent
                0 connections dropped by keepalive
        222190 correct ACK header predictions
        6584196 correct data packet header predictions
        2955062 syncache entries added
                603 retransmitted
                188 dupsyn
                0 dropped
                2954080 completed
                0 bucket overflow
                0 cache overflow
                869 reset
                113 stale
                0 aborted
                0 badack
                0 unreach
                0 zone failures
        2955062 cookies sent
        0 cookies received
        126 SACK recovery episodes
        78 segment rexmits in SACK recovery episodes
        78254 byte rexmits in SACK recovery episodes
        2962 SACK options (SACK blocks) received
        8870 SACK options (SACK blocks) sent
        0 SACK scoreboard overflow
        0 packets with ECN CE bit set
        0 packets with ECN ECT(0) bit set
        0 packets with ECN ECT(1) bit set
        0 successful ECN handshakes
        0 times ECN reduced the congestion window
udp:
        6218432 datagrams received
        0 with incomplete header
        0 with bad data length field
        29 with bad checksum
        15493 with no checksum
        4896 dropped due to no socket
        0 broadcast/multicast datagrams undelivered
        0 dropped due to full socket buffers
        0 not for hashed pcb
        6213507 delivered
        6405573 datagrams output
        0 times multicast source filter matched
ip:
        236156523 total packets received
        0 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with ip length > max ip packet size
        0 with header length < data size
        0 with data length < header length
        0 with bad options
        0 with incorrect version number
        0 fragments received
        0 fragments dropped (dup or out of space)
        0 fragments dropped after timeout
        0 packets reassembled ok
        235463206 packets for this host
        66789 packets for unknown/unsupported protocol
        0 packets forwarded (0 packets fast forwarded)
        0 packets not forwardable
        0 packets received for unknown multicast group
        0 redirects sent
        81178132 packets sent from this host
        126684324 packets sent with fabricated ip header
        0 output packets dropped due to no bufs, etc.
        0 output packets discarded due to no route
        0 output datagrams fragmented
        0 fragments created
        0 datagrams that can't be fragmented
        0 tunneling packets that can't find gif
        0 datagrams with bad address in header
icmp:
        4896 calls to icmp_error
        0 errors not generated in response to an icmp message
        Output histogram:
                echo reply: 3801761
                destination unreachable: 4896
        0 messages with bad code fields
        0 messages less than the minimum length
        21 messages with bad checksum
        0 messages with bad length
        0 multicast echo requests ignored
        0 multicast timestamp requests ignored
        Input histogram:
                echo reply: 7552810
                destination unreachable: 58965
                source quench: 10
                routing redirect: 75
                echo: 3801761
                time exceeded: 108309026
        3801761 message responses generated
        0 invalid return addresses
        0 no return routes
igmp:
        0 messages received
        0 messages received with too few bytes
        0 messages received with wrong TTL
        0 messages received with bad checksum
        0 V1/V2 membership queries received
        0 V3 membership queries received
        0 membership queries received with invalid field(s)
        0 general queries received
        0 group queries received
        0 group-source queries received
        0 group-source queries dropped
        0 membership reports received
        0 membership reports received with invalid field(s)
        0 membership reports received for groups to which we belong
        0 V3 reports received without Router Alert
        0 membership reports sent
arp:
        5537 ARP requests sent
        26179 ARP replies sent
        1794911 ARP requests received
        5531 ARP replies received
        1800442 ARP packets received
        18 total packets dropped due to no ARP entry
        9834 ARP entrys timed out
        0 Duplicate IPs seen
------------------------------------------------------------------------

> Also, I believe you're using tables in your pf.conf; if so, do any of
> those tables contain an FQDN that cannot be resolved from that machine?
>
> I think you probably see what I am getting at here: it could be some
> sort of concurrent recursive DNS failure that can only be seen from the
> machine, possibly caused by the new NFS backend or a change in one of
> the tables that pf would use.

I absolutely see where you're going with this, but as I understand it,
DNS resolution of pf.conf rules happens at pf load time (though I have
no idea how TTLs would be honoured). However/regardless:

There are no FQDNs or hostnames used in the pf.conf on the machine which
exhibited the above problem; everything uses hard-coded IP addresses.
The same goes for the contents of the (extremely few; only 4) tables
referenced in pf.conf. The only non-numeric data used for resolution of
any kind is a couple of service names (e.g. "ssh" instead of 22).

The pf.conf on the machine which witnessed the issue is extremely small
and very simple. It's also the second-to-lowest machine on our network
in terms of network traffic received/sent.

How can I get a dump of the contents of the pf state table --
specifically, whatever it thinks still contained 10,000 state entries
despite "pfctl -s state" showing 0? There is obviously a mismatch there,
so I'm wondering how the counter could stay at 10,000 (especially after
doing a full stop/start of the pf stack) when "pfctl -s state" shows 0
(ditto with "pfctl -F state").

I really, *really* want to debug this. It's to the point where I'm
happily willing to pay for a kernel developer's time, hourly, to figure
it out (if at all possible). It might be easier to figure it out in
real time rather than post-mortem, but post-mortem is what I've got
right now.
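For what it's worth, the next time this happens I intend to capture the
three counters that disagree side by side, using nothing more than the
tools already shown above:

# pfctl -s info | grep "current entries"    (the counter pf itself maintains)
# pfctl -s state | wc -l                    (states actually present in the state list)
# vmstat -z | grep pfstatepl                (state structures still held by the zone allocator)

In the dump above, the first and third both report 10,000 while the
second reports 0, which is exactly the mismatch I can't explain.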
I need to get off my butt and get an actual dev/testbed box in our
production co-lo where I can bang on this stuff for days and see if I
can reproduce it.

All I can conclude on my own is that "something bad" can happen when
reloading pf rules and changing "scrub" from both directions to incoming
only. I don't even know if the scrub change is what triggered it; maybe
just the reload did. Maybe the problem can only happen when a packet of
a certain type is inbound or outbound and in a certain internal state.
Again, it's speculation, but there's some pretty strong evidence that
something very bad happened, almost like an infinite loop that
eventually exhausted all stateful memory/entries. A DoS across em0 is
what it sounds like, but the network graphs on the switch confirm that
isn't the case.

--
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.             PGP 4BD6C0CB   |