From owner-freebsd-net@FreeBSD.ORG Wed Aug 31 03:53:19 2011 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 350E2106564A for ; Wed, 31 Aug 2011 03:53:19 +0000 (UTC) (envelope-from ask@develooper.com) Received: from mbox1.develooper.com (mbox1.develooper.com [207.171.7.178]) by mx1.freebsd.org (Postfix) with ESMTP id 17A798FC17 for ; Wed, 31 Aug 2011 03:53:18 +0000 (UTC) Received: (qmail 15690 invoked from network); 31 Aug 2011 03:53:18 -0000 Received: from gw.develooper.com (HELO embla.bn.dev) (ask@mail.dev@64.81.84.140) by smtp.develooper.com with ESMTPA; 31 Aug 2011 03:53:18 -0000 Mime-Version: 1.0 (Apple Message framework v1244.3) Content-Type: text/plain; charset=iso-8859-1 From: =?iso-8859-1?Q?Ask_Bj=F8rn_Hansen?= In-Reply-To: <4E4B3267.8050707@freebsd.org> Date: Tue, 30 Aug 2011 20:53:17 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <8D7CB918-F7E0-4BC6-B9FC-624B1E438692@develooper.com> References: <4E4AB3BE.4090603@sentex.net> <9255C71C-BB78-417E-A900-85140FC2050C@develooper.com> <4E4B3267.8050707@freebsd.org> To: Julian Elischer X-Mailer: Apple Mail (2.1244.3) Cc: Mike Tancsa , net@freebsd.org Subject: system locks up when pf is enabled? (was: system locks up with vr driver on alix board) X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 31 Aug 2011 03:53:19 -0000 On Aug 16, 2011, at 20:15, Julian Elischer wrote: > from your description it doesn't sound like a vr problem. > I suggest you hook up teh serial console (I am guessing you already = have) > and set the config options to allow break-to-debugger or = alt-break-to-debugger on it > when it happens next, drop into the debugger.. >=20 > in fact, drop in, and do a ps to see what processes are runnable, > 'tr [pid] (or thread id)' to get a stack trace of anything that looks > interesting, and then cont and do it again a few times to get a feel > for where the processor is hanging out (a straight 'tr' will give you > the interrupt of the com port which is not intresting..) Alright, I thought I had locked this down to ZERO_COPY_SOCKETS being = enabled in the kernel (as unlikely as it seems that it'd break things = without being explicitly used -- I wanted to believe!). A couple days ago I figured out how to consistently make it happen (or = not). For about a week we'd been running with no trouble; until I = realized that a hostname added to our pf rules made them not load on = startup. When running without our pf rules loaded, everything is fine. = WIth the pf rules loaded the system will hang within 2-4 hours. Our ruleset was about 240 rules. We cut it down to ~140 rules today to = see if that made a difference; it didn't. We also turned off as much = IPv6 traffic as we could in case that was what was disturbing things = (adding IPv6 traffic is relatively new). Flipping the backup system to be the master for the active IP addresses = doesn't (as far as I can tell) make the old master recover. I tried looking for interesting information with the kernel debugger; = but honestly nothing springs out as interesting -- though that may be = due to my lack of skills more than anything else. I am including a = snapshot below. I ran ps and tr on 100024, 100025 (vr0 and vr1 = interrupt threads) and 100022 (thread taskq) a few more times but it = seemed to me to give basically the same thread trace information. For what it's worth, if I let 'cont' run for long enough for the system = to send out carp packets and get it's IPs back, 'ps' seemed more likely = to have 100024 or 100025 (vr0/vr1) in 'Run' state. If I interrupted the = system again relatively quickly after continuing, it was more likely to = have 'thread tasks' be the busy process. Any tips for what I can do to extract something more useful would be = greatly appreciated. As a last note -- earlier today I noticed that "Searches" in 'pfctl -si' = seemed to go up significantly just before the console locks up (which is = 15-30 minutes before the box stops routing packets). My theory was = that we were getting some sort of tiny DoS attack (doesn't take much to = take out a computer the size of a CD case); but that doesn't really make = sense as the system 1) never recovers and 2) the backup is fine as soon = as the master has been turned off / changed priorities. - ask db> ps pid ppid pgrp uid state wmesg wchan cmd 3300 2371 2371 0 R sshd 3299 3298 3298 0 S+ sbwait 0xc2ddb58c ssh 3298 3100 3298 0 S+ piperd 0xc2624620 scp 3100 3098 3100 0 Ss+ pause 0xc2e95850 csh 3098 2371 3098 0 Rs sshd 2904 2371 2904 0 Ss select 0xc379a924 sshd 2476 2474 2476 0 Ss+ ttyin 0xc2e44070 csh 2474 2371 2474 0 Rs sshd 2440 1 2440 0 Rs+ getty 2433 1 2433 0 Rs cron 2426 1 2426 62 Rs ftp-proxy 2397 1 2397 0 Ss select 0xc2d71964 inetd 2371 1 2371 0 Rs sshd 2349 1 2349 0 Rs bird 2302 1 2302 0 Rs radvd 2298 2292 2285 65534 S piperd 0xc2624188 multilog 2297 2291 2285 0 R openvpn 2292 2287 2285 0 S select 0xc2d713a4 supervise 2291 2287 2285 0 S select 0xc2d71a64 supervise 2288 1 2285 0 S piperd 0xc2624c40 readproctitle 2287 1 2285 0 R svscan 2244 1 2244 0 Rs ntpd 2129 1 2129 0 Rs syslogd 50 0 0 0 SL mdwait 0xc2717000 [md1] 40 0 0 0 SL mdwait 0xc24ef000 [md0] 22 0 0 0 RL [flowcleaner] 21 0 0 0 SL sdflush 0xc0b24d80 [softdepflush] 20 0 0 0 SL vlruwt 0xc25f52a8 [vnlru] 19 0 0 0 SL syncer 0xc0b14774 [syncer] 18 0 0 0 RL [bufdaemon] 17 0 0 0 RL [pagezero] 16 0 0 0 SL psleep 0xc0b2567c [vmdaemon] 15 0 0 0 SL psleep 0xc0b25644 [pagedaemon] 9 0 0 0 SL ccb_scan 0xc0b006d4 [xpt_thrd] 8 0 0 0 RL [pfpurge] 7 0 0 0 SL waiting_ 0xc0b181d8 [sctp_iterator] 14 0 0 0 SL (threaded) usb 100036 D - 0xc250ed34 [usbus1] 100035 D - 0xc250ed04 [usbus1] 100034 D - 0xc250ecd4 [usbus1] 100033 D - 0xc250eca4 [usbus1] 100032 D - 0xc2504b5c [usbus0] 100031 D - 0xc2504b2c [usbus0] 100030 D - 0xc2504afc [usbus0] 100029 D - 0xc2504acc [usbus0] 13 0 0 0 SL - 0xc0b0df64 [yarrow] 6 0 0 0 SL crypto_r 0xc0b2432c [crypto returns] 5 0 0 0 SL crypto_w 0xc0b24308 [crypto] 4 0 0 0 SL - 0xc0b0bc44 [g_down] 3 0 0 0 SL - 0xc0b0bc40 [g_up] 2 0 0 0 SL - 0xc0b0bc38 [g_event] 12 0 0 0 RL (threaded) intr 100037 I [swi0: uart uart] 100028 I [irq12: ohci0 = ehci0] 100027 I [irq14: ata0] 100026 I [irq15: vr2 ata1] 100025 Run CPU 0 [irq11: vr1] 100024 RunQ [irq10: vr0] 100021 I [swi5: +] 100016 I [swi2: cambio] 100015 I [swi6: task queue] 100014 I [swi6: Giant taskq] 100006 I [swi1: netisr 0] 100005 I [swi4: clock] 100004 I [swi3: vm] 11 0 0 0 RL [idle] 1 0 1 0 SLs wait 0xc2433d48 [init] 10 0 0 0 SL audit_wo 0xc0b246a0 [audit] 0 0 0 0 RLs (threaded) kernel 100040 D - 0xc2527880 [dummynet] 100023 D - 0xc24d6580 [glxsb0 taskq] 100022 RunQ [thread taskq] 100020 D - 0xc24afd40 [acpi_task_2] 100019 D - 0xc24afd40 [acpi_task_1] 100018 D - 0xc24afd40 [acpi_task_0] 100017 D - 0xc24afd80 [kqueue taskq] 100010 D - 0xc24512c0 [firmware taskq] 100000 D sched 0xc0b0bd20 [swapper] db> tr 100025 Tracing pid 12 tid 100025 td 0xc24d42d0 kdb_enter(c0a48fbe,c0a5d00a,1,c2522280,0,...) at 0xc0733b3a =3D = kdb_enter+0x3a uart_intr(c2522200,c24d42d0,c0afdc90,c2430980,4,...) at 0xc0632206 =3D = uart_intr+0x126 intr_event_handle(c2430980,cc918ac4,cc918aa4,cc918b40,4,...) at = 0xc06dff2c =3D intr_event_handle+0x5c intr_execute_handlers(c0afdc90,cc918ac4,cc918b40,c2ef8a00,cc918b68,...) = at 0xc09d4929 =3D intr_execute_handlers+0x49 atpic_handle_intr(4,cc918ac4) at 0xc09ee047 =3D atpic_handle_intr+0x67 Xatpic_intr4() at 0xc09cfa52 =3D Xatpic_intr4+0x22 --- interrupt, eip =3D 0xc09e0f58, esp =3D 0xcc918b04, ebp =3D = 0xcc918b68 --- bzero(c2ef8a00,0,12,0,0,...) at 0xc09e0f58 =3D bzero ip_input(c2ef8a00,246,c24a77c0,cc918bd0,c0622c21,...) at 0xc083ff8d =3D = ip_input+0x6cd netisr_dispatch_src(1,0,c2ef8a00,cc918c08,c07bb331,...) at 0xc07c9109 =3D = netisr_dispatch_src+0x89 netisr_dispatch(1,c2ef8a00,c24d8c00,c24d8c00,c2f16008,...) at 0xc07c93a0 = =3D netisr_dispatch+0x20 ether_demux(c24d8c00,c2ef8a00,3,0,3,...) at 0xc07bb331 =3D = ether_demux+0x161 ether_input(c24d8c00,c2ef8a00,c09d6ad0,c0b11780,c24d42d0,...) at = 0xc07bb8af =3D ether_input+0x33f vr_intr(c2499000,c24d42d0,0,109,dc13a3e2,...) at 0xc0690034 =3D = vr_intr+0x4d4 intr_event_execute_handlers(c24337f8,c2430600,c0a6b180,52d,c2430670,...) = at 0xc06def8b =3D intr_event_execute_handlers+0x13b ithread_loop(c24de360,cc918d28,0,0,0,...) at 0xc06e05ea =3D = ithread_loop+0x6a fork_exit(c06e0580,c24de360,cc918d28) at 0xc06dc830 =3D fork_exit+0x90 fork_trampoline() at 0xc09cf964 =3D fork_trampoline+0x8 --- trap 0, eip =3D 0, esp =3D 0xcc918d60, ebp =3D 0 --- db> tr 100024 Tracing pid 12 tid 100024 td 0xc24d45a0 sched_switch(c24d45a0,0,109,dc1435c0,660,...) at 0xc07270bf =3D = sched_switch+0x1df mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a =3D = mi_switch+0x11a ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df =3D = ithread_loop+0x15f fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 =3D fork_exit+0x90 fork_trampoline() at 0xc09cf964 =3D fork_trampoline+0x8 --- trap 0, eip =3D 0, esp =3D 0xcc894d60, ebp =3D 0 --- db> cont KDB: enter: Line break on console [thread pid 0 tid 100022 ] Stopped at 0xc0733b3a =3D kdb_enter+0x3a: movl $0,0xc0b124b4 = =3D kdb_why db> ps=20 pid ppid pgrp uid state wmesg wchan cmd 3300 2371 2371 0 R sshd 3299 3298 3298 0 S+ sbwait 0xc2ddb58c ssh 3298 3100 3298 0 S+ piperd 0xc2624620 scp 3100 3098 3100 0 Ss+ pause 0xc2e95850 csh 3098 2371 3098 0 Rs sshd 2904 2371 2904 0 Ss select 0xc379a924 sshd 2476 2474 2476 0 Ss+ ttyin 0xc2e44070 csh 2474 2371 2474 0 Rs sshd 2440 1 2440 0 Rs+ getty 2433 1 2433 0 Rs cron 2426 1 2426 62 Rs ftp-proxy 2397 1 2397 0 Ss select 0xc2d71964 inetd 2371 1 2371 0 Rs sshd 2349 1 2349 0 Rs bird 2302 1 2302 0 Rs radvd 2298 2292 2285 65534 S piperd 0xc2624188 multilog 2297 2291 2285 0 R openvpn 2292 2287 2285 0 S select 0xc2d713a4 supervise 2291 2287 2285 0 S select 0xc2d71a64 supervise 2288 1 2285 0 S piperd 0xc2624c40 readproctitle 2287 1 2285 0 R svscan 2244 1 2244 0 Rs ntpd 2129 1 2129 0 Rs syslogd 50 0 0 0 SL mdwait 0xc2717000 [md1] 40 0 0 0 SL mdwait 0xc24ef000 [md0] 22 0 0 0 RL [flowcleaner] 21 0 0 0 SL sdflush 0xc0b24d80 [softdepflush] 20 0 0 0 SL vlruwt 0xc25f52a8 [vnlru] 19 0 0 0 SL syncer 0xc0b14774 [syncer] 18 0 0 0 RL [bufdaemon] 17 0 0 0 RL [pagezero] 16 0 0 0 SL psleep 0xc0b2567c [vmdaemon] 15 0 0 0 SL psleep 0xc0b25644 [pagedaemon] 9 0 0 0 SL ccb_scan 0xc0b006d4 [xpt_thrd] 8 0 0 0 RL [pfpurge] 7 0 0 0 SL waiting_ 0xc0b181d8 [sctp_iterator] 14 0 0 0 SL (threaded) usb 100036 D - 0xc250ed34 [usbus1] 100035 D - 0xc250ed04 [usbus1] 100034 D - 0xc250ecd4 [usbus1] 100033 D - 0xc250eca4 [usbus1] 100032 D - 0xc2504b5c [usbus0] 100031 D - 0xc2504b2c [usbus0] 100030 D - 0xc2504afc [usbus0] 100029 D - 0xc2504acc [usbus0] 13 0 0 0 SL - 0xc0b0df64 [yarrow] 6 0 0 0 SL crypto_r 0xc0b2432c [crypto returns] 5 0 0 0 SL crypto_w 0xc0b24308 [crypto] 4 0 0 0 SL - 0xc0b0bc44 [g_down] 3 0 0 0 SL - 0xc0b0bc40 [g_up] 2 0 0 0 SL - 0xc0b0bc38 [g_event] 12 0 0 0 WL (threaded) intr 100037 I [swi0: uart uart] 100028 I [irq12: ohci0 = ehci0] 100027 I [irq14: ata0] 100026 I [irq15: vr2 ata1] 100025 I [irq11: vr1] 100024 I [irq10: vr0] 100021 I [swi5: +] 100016 I [swi2: cambio] 100015 I [swi6: task queue] 100014 I [swi6: Giant taskq] 100006 I [swi1: netisr 0] 100005 I [swi4: clock] 100004 I [swi3: vm] 11 0 0 0 RL [idle] 1 0 1 0 SLs wait 0xc2433d48 [init] 10 0 0 0 SL audit_wo 0xc0b246a0 [audit] 0 0 0 0 RLs (threaded) kernel 100040 D - 0xc2527880 [dummynet] 100023 D - 0xc24d6580 [glxsb0 taskq] 100022 Run CPU 0 [thread taskq] 100020 D - 0xc24afd40 [acpi_task_2] 100019 D - 0xc24afd40 [acpi_task_1] 100018 D - 0xc24afd40 [acpi_task_0] 100017 D - 0xc24afd80 [kqueue taskq] 100010 D - 0xc24512c0 [firmware taskq] 100000 D sched 0xc0b0bd20 [swapper] db> tr 100024 Tracing pid 12 tid 100024 td 0xc24d45a0 sched_switch(c24d45a0,0,109,b67db121,673,...) at 0xc07270bf =3D = sched_switch+0x1df mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a =3D = mi_switch+0x11a ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df =3D = ithread_loop+0x15f fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 =3D fork_exit+0x90 fork_trampoline() at 0xc09cf964 =3D fork_trampoline+0x8 --- trap 0, eip =3D 0, esp =3D 0xcc894d60, ebp =3D 0 --- db> tr 100022 Tracing pid 0 tid 100022 td 0xc24d4b40 kdb_enter(c0a48fbe,c0a5d00a,0,c2522280,0,...) at 0xc0733b3a =3D = kdb_enter+0x3a uart_intr(c2522200,c24d4b40,c0afdc90,c2430980,4,...) at 0xc0632206 =3D = uart_intr+0x126 intr_event_handle(c2430980,c23f5c20,0,c25db2bc,4,...) at 0xc06dff2c =3D = intr_event_handle+0x5c intr_execute_handlers(c0afdc90,c23f5c20,c25db2bc,0,c23f5c60,...) at = 0xc09d4929 =3D intr_execute_handlers+0x49 atpic_handle_intr(4,c23f5c20) at 0xc09ee047 =3D atpic_handle_intr+0x67 Xatpic_intr4() at 0xc09cfa52 =3D Xatpic_intr4+0x22 --- interrupt, eip =3D 0xc073cf01, esp =3D 0xc23f5c60, ebp =3D = 0xc23f5c60 --- sleepq_release(c25db2bc,0,0,0,c24afc80,...) at 0xc073cf01 =3D = sleepq_release+0x31 wakeup(c25db2bc,0,0,0,0,...) at 0xc07109f1 =3D wakeup+0x41 taskqueue_run_locked(c24afc80,c24afc9c,0,c0a66f59,0,...) at 0xc073e8e2 =3D= taskqueue_run_locked+0x142 taskqueue_thread_loop(c0b134ec,c23f5d28,0,0,0,...) at 0xc073ef27 =3D = taskqueue_thread_loop+0x87 fork_exit(c073eea0,c0b134ec,c23f5d28) at 0xc06dc830 =3D fork_exit+0x90 fork_trampoline() at 0xc09cf964 =3D fork_trampoline+0x8 --- trap 0, eip =3D 0, esp =3D 0xc23f5d60, ebp =3D 0 --- ... cont dance again ... db> tr 100022 Tracing pid 0 tid 100022 td 0xc24d4b40 kdb_enter(c0a48fbe,c0a5d00a,0,c2522280,0,...) at 0xc0733b3a =3D = kdb_enter+0x3a uart_intr(c2522200,c24d4b40,c0afdc90,c2430980,4,...) at 0xc0632206 =3D = uart_intr+0x126 intr_event_handle(c2430980,c23f5c48,c23f5c14,c24afc80,4,...) at = 0xc06dff2c =3D intr_event_handle+0x5c intr_execute_handlers(c0afdc90,c23f5c48,c24afc80,c25db2bc,c23f5cbc,...) = at 0xc09d4929 =3D intr_execute_handlers+0x49 atpic_handle_intr(4,c23f5c48) at 0xc09ee047 =3D atpic_handle_intr+0x67 Xatpic_intr4() at 0xc09cfa52 =3D Xatpic_intr4+0x22 --- interrupt, eip =3D 0xc073e7e3, esp =3D 0xc23f5c88, ebp =3D = 0xc23f5cbc --- taskqueue_run_locked(c24afc80,c24afc9c,0,c0a66f59,0,...) at 0xc073e7e3 =3D= taskqueue_run_locked+0x43 taskqueue_thread_loop(c0b134ec,c23f5d28,0,0,0,...) at 0xc073ef27 =3D = taskqueue_thread_loop+0x87 fork_exit(c073eea0,c0b134ec,c23f5d28) at 0xc06dc830 =3D fork_exit+0x90 fork_trampoline() at 0xc09cf964 =3D fork_trampoline+0x8 --- trap 0, eip =3D 0, esp =3D 0xc23f5d60, ebp =3D 0 --- db> tr 100024 Tracing pid 12 tid 100024 td 0xc24d45a0 sched_switch(c24d45a0,0,109,4b19c082,684,...) at 0xc07270bf =3D = sched_switch+0x1df mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a =3D = mi_switch+0x11a ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df =3D = ithread_loop+0x15f fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 =3D fork_exit+0x90 fork_trampoline() at 0xc09cf964 =3D fork_trampoline+0x8 --- trap 0, eip =3D 0, esp =3D 0xcc894d60, ebp =3D 0 --- db> tr 100025 Tracing pid 12 tid 100025 td 0xc24d42d0 kdb_enter(c0a48fbe,c0a5d00a,58,c2522280,0,...) at 0xc0733b3a =3D = kdb_enter+0x3a uart_intr(c2522200,c24d42d0,c0afdc90,c2430980,4,...) at 0xc0632206 =3D = uart_intr+0x126 intr_event_handle(c2430980,cc918ac8,0,0,4,...) at 0xc06dff2c =3D = intr_event_handle+0x5c intr_execute_handlers(c0afdc90,cc918ac8,0,c2efa300,cc918b68,...) at = 0xc09d4929 =3D intr_execute_handlers+0x49 atpic_handle_intr(4,cc918ac8) at 0xc09ee047 =3D atpic_handle_intr+0x67 Xatpic_intr4() at 0xc09cfa52 =3D Xatpic_intr4+0x22 --- interrupt, eip =3D 0xc083e822, esp =3D 0xcc918b08, ebp =3D = 0xcc918b68 --- ip_forward(c2efa300,0,12,0,0,...) at 0xc083e822 =3D ip_forward+0x3b2 ip_input(c2efa300,246,c24a1580,cc918bd0,c0622c21,...) at 0xc083ff8d =3D = ip_input+0x6cd netisr_dispatch_src(1,0,c2efa300,cc918c08,c07bb331,...) at 0xc07c9109 =3D = netisr_dispatch_src+0x89 netisr_dispatch(1,c2efa300,c24d8c00,c24d8c00,c2f1a808,...) at 0xc07c93a0 = =3D netisr_dispatch+0x20 ether_demux(c24d8c00,c2efa300,3,0,3,...) at 0xc07bb331 =3D = ether_demux+0x161 ether_input(c24d8c00,c2efa300,c09d6ad0,c0b11780,c24d42d0,...) at = 0xc07bb8af =3D ether_input+0x33f vr_intr(c2499000,c24d42d0,0,109,48d02b08,...) at 0xc0690034 =3D = vr_intr+0x4d4 intr_event_execute_handlers(c24337f8,c2430600,c0a6b180,52d,c2430670,...) = at 0xc06def8b =3D intr_event_execute_handlers+0x13b ithread_loop(c24de360,cc918d28,0,0,0,...) at 0xc06e05ea =3D = ithread_loop+0x6a fork_exit(c06e0580,c24de360,cc918d28) at 0xc06dc830 =3D fork_exit+0x90 fork_trampoline() at 0xc09cf964 =3D fork_trampoline+0x8 --- trap 0, eip =3D 0, esp =3D 0xcc918d60, ebp =3D 0 ---