Date: Fri, 21 Aug 2009 21:05:17 +0200
From: Ian FREISLICH <ianf@clue.co.za>
To: Max Laier
Cc: freebsd-current@freebsd.org
Subject: Re: panic: in pf_reassemble() ?
In-Reply-To: <200908211727.11400.max@love2party.net>

Max Laier wrote:
> On Friday 21 August 2009 17:01:14 Ian Freislich wrote:
> > 0xffffff81ccae4710 ---
> > pf_reassemble() at pf_reassemble+0xb1
> > pf_normalize_ip() at pf_normalize_ip+0x694
>
> Can you get me line numbers for these two?

How?

> > pf_test() at pf_test+0x78e
> > pf_check_in() at pf_check_in+0x39
> > pfil_run_hooks() at pfil_run_hooks+0x9c
> > ip_fastforward() at ip_fastforward+0x319
>
> Does switching fast forward off change the situation - not that it
> should, but it might help with finding the culprit.

I'll test and let you know.  I'm also running Watson's:

net.isr.maxthreads=8
net.isr.direct=0

But these didn't seem to make a difference either.

> > ether_demux() at ether_demux+0x131
> > ether_input() at ether_input+0x1e0
> > ether_demux() at ether_demux+0x6f
> > ether_input() at ether_input+0x1e0
> > bce_intr() at bce_intr+0x398
> > intr_event_execute_handlers() at intr_event_execute_handlers+0x100
> > ithread_loop() at ithread_loop+0x8e
> > fork_exit() at fork_exit+0x117
> > fork_trampoline() at fork_trampoline+0xe
> > --- trap 0, rip = 0, rsp = 0xffffff81ccae4d30, rbp = 0 ---
> >
> > I can setup remote GDB and set this panic off again if there's
> > something specific someone would like me to look at.
>
> From a very first glance this could be a byte order mismatch in ip_len
> or the like, so if you could take a look at the ip header in the
> involved mbufs.  Anything that looks like swapped bytes.  Are you
> using jumbo frames?

We're not using jumbo frames.  It'll take me a while to set up a
test-bed.  This happened on my live routers: it panicked the first
one, CARP failed over, and then it panicked the second one.  I'm
away for a week now, so if I don't manage to get this done before
we leave, it'll be the first thing I do when I get back.

It's easily reproducible.  Mine's a 16-core amd64 with 4x bce(4)
interfaces in a lagg(4) and vlans using the lagg as the parent (but
I doubt that's involved).  At the time there were about 180000
states + 40000 NAT states on the firewall.  Limits are as high as
they are because we've been up to about 430000 states in the past.
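For reference, the running state count and the configured hard limits
quoted above can be read from a live pf; a minimal check, assuming the
stock pfctl(8):

    pfctl -s info      # current state-table entries and counters
    pfctl -s memory    # hard limits for states, frags and src-nodes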
pf.conf:

# Options
# ~~~~~~~
set timeout { \
    adaptive.start 480000, \
    adaptive.end 960000 \
}
set state-policy if-bound
set optimization normal
set ruleset-optimization basic
set limit states 800000
set limit frags 40000
set limit src-nodes 150000

# Normalization
# ~~~~~~~~~~~~~
scrub on tun0 all fragment reassemble
scrub on vlan2 all fragment reassemble
scrub on vlan3 all fragment reassemble
scrub on vlan4 all fragment reassemble
scrub on vlan5 all fragment reassemble
scrub on vlan7 all fragment reassemble
scrub on vlan8 all fragment reassemble
scrub on vlan9 all fragment reassemble
scrub on vlan14 all fragment reassemble
scrub on vlan15 all fragment reassemble
scrub on vlan20 all fragment reassemble
scrub on vlan21 all fragment reassemble
scrub on vlan22 all fragment reassemble
scrub on vlan23 all fragment reassemble
scrub on vlan25 all fragment reassemble
scrub on vlan26 all fragment reassemble

# Queueing
# ~~~~~~~~
altq on lagg0 cbq bandwidth 8Gb queue { vlan2_m_in, vlan2_m_out, default }
queue vlan2_m_in bandwidth 100Mb { vlan2_in, vlan15_in }
queue vlan2_m_out bandwidth 100Mb { vlan2_out, vlan15_out }
queue vlan2_in bandwidth 80Mb borrow
queue vlan2_out bandwidth 80Mb borrow
queue vlan15_in bandwidth 20Mb
queue vlan15_out bandwidth 20Mb
queue default bandwidth 7176Mb cbq(default)

Then use netperf from a machine on one vlan to a machine on another
vlan:

    netperf -tUDP_STREAM -i2 -H othermachine

It takes about 3 seconds to panic.

Ian

--
Ian Freislich
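A note on why the UDP_STREAM test exercises the crashing path: netperf's
UDP_STREAM send size defaults to the send socket buffer size, typically
well above a 1500-byte MTU, so each datagram goes out as a train of IP
fragments and has to pass the scrub "fragment reassemble" rules, i.e. the
pf_normalize_ip() -> pf_reassemble() path in the backtrace above.  As a
sketch, the datagram size can also be pinned explicitly with netperf's
test-specific -m option (the 8192-byte size here is only an example):

    netperf -t UDP_STREAM -i 2 -H othermachine -- -m 8192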