Date:      Sun, 15 Jun 2014 19:09:20 +0430
From:      Hooman Fazaeli <hoomanfazaeli@gmail.com>
To:        Mark van der Meulen <mark@fivenynes.com>
Cc:        freebsd-net@freebsd.org, freebsd-bugs@freebsd.org
Subject:   Re: FreeBSD 9 w/ MPD5 crashes as LNS with 300+ tunnels. Netgraph issue?
Message-ID:  <539DB018.5020702@gmail.com>
In-Reply-To: <CFC3BC24.1CB4A%mark@fivenynes.com>
References:  <CFC3BC24.1CB4A%mark@fivenynes.com>

On 6/15/2014 3:39 PM, Mark van der Meulen wrote:
> Hi List,
>
> I'm wondering if anyone can help me with this problem, or at least point
> me in the direction of where to start looking. I have FreeBSD 9 based
> servers which are crashing every 4-10 days and producing crash dumps
> similar to this one: http://pastebin.com/F82Jc08C
>
> All crash dumps seem to involve the netgraph code, and the current process
> is always ng_queueX.
>
> In summary, we have 4 x FreeBSD servers running as LNS (MPD5) for around
> 2000 subscribers. 3 of the servers run a modified version of BSDRP; the
> fourth runs a FreeBSD 9 install with what I thought was the latest stable
> kernel source, because I fetched it from stable/9, however it shows up as
> 9.3-BETA in uname (the linked crash dump is from that server).
>
> 3 x LNS running modified BSDRP: DELL PowerEdge 2950, 2 x Xeon E5320, 4GB
> RAM, igb Quad Port NIC in LAGG, Quagga, MPD5, IPFW for Host Access
> Control, NTPD, BSNMPD
> 1 x LNS running latest FreeBSD 9 code: HP ProLiant DL380, 2 x Xeon X5465,
> 36GB RAM, em Quad Port NIC in LAGG, BIRD, MPD5, IPFW for Host Access
> Control, NTPD, BSNMPD
>
> The reason I built the fresh server on FreeBSD 9 is because I cannot save
> crash dumps for BSDRP easily. In short the problem is this: servers with
> 10-50 clients will run indefinitely (as long as we have had them, which is
> probably about 1.5 years) without errors and serve clients fine, however
> any with over 300 clients appear to only stay online for 4-10 days at most
> before crashing and rebooting. I have attached the crash file from the
> latest crash on the LNS running the latest FreeBSD 9 code, however I am
> unsure what to do with it and where to look.
>
> When these devices crash they are often pushing in excess of 200Mbps
> (anywhere between 200Mbps and 450Mbps) with very little load (3-4.5 on
> the first 3, less than 2 on the fourth).
>
> Things I've done to attempt resolution:
>
> - Replaced bce network cards with em network cards. This produced far
> fewer errors on the interfaces (many before, now none) and I think let the
> machines stay up longer between reboots, as before it would happen up to
> once a day.
> - Replaced em network cards with igb network cards. All this did was lower
> load and give us a little more time between reboots.
> - Tried an implementation using FreeBSD 10 (this lasted less than 4 hours
> before rebooting when under load)
> - Replaced memory
> - Increased memory on LNS4 to 36GB.
> - Various kernel rebuilds
> - Tweaked various kernel settings. This appears to have helped a little
> and given us more time between reboots.
> - Disabled IPv6
> - Disabled IPFW
> - Disabled BSNMPD
> - Disabled Netflow
> - Versions 5.6 and 5.7 of MPD5
>
> Anyone able to help me work out what the crash dump means? It only happens
> on servers running MPD5 (e.g. the exact same boxes with the exact same
> code, pushing 800Mbps+ of routing, do not crash) and I can see the crash
> relates to netgraph, however I am unsure where to go from there...
>
> Thanks,
>
> Mark
>
>
> Relevant Current Settings:
>
> net.inet.ip.fastforwarding=1
> net.inet.ip.fw.default_to_accept=1
> net.bpf.zerocopy_enable=1
> net.inet.raw.maxdgram=16384
> net.inet.raw.recvspace=16384
> hw.intr_storm_threshold=64000
> net.inet.ip.intr_queue_maxlen=10240
> net.inet.ip.redirect=0
> net.inet.ip.sourceroute=0
> net.inet.ip.rtexpire=2
> net.inet.ip.rtminexpire=2
> net.inet.ip.rtmaxcache=256
> net.inet.ip.accept_sourceroute=0
> net.inet.ip.process_options=0
> net.inet.icmp.log_redirect=0
> net.inet.icmp.drop_redirect=1
> net.inet.tcp.drop_synfin=1
> net.inet.tcp.blackhole=2
> net.inet.tcp.sendbuf_max=16777216
> net.inet.tcp.recvbuf_max=16777216
> net.inet.tcp.sendbuf_auto=1
> net.inet.tcp.recvbuf_auto=1
> net.inet.udp.recvspace=262144
> net.inet.udp.blackhole=0
> net.inet.udp.maxdgram=57344
> net.route.netisr_maxqlen=4096
> net.local.stream.recvspace=65536
> net.local.stream.sendspace=65536
> net.graph.maxdata=65536
> net.graph.maxalloc=65536
> net.graph.maxdgram=2096000
> net.graph.recvspace=2096000
> kern.ipc.somaxconn=32768
> kern.ipc.nmbclusters=524288
> kern.ipc.maxsockbuf=26214400
> kern.ipc.shmmax="2147483648"
> kern.ipc.nmbjumbop="53200"
> kern.ipc.maxpipekva="536870912"
> kern.random.sys.harvest.ethernet="0"
> kern.random.sys.harvest.interrupt="0"
> vm.kmem_size="4096M" # Only on box with over 12G RAM. Otherwise 2G.
>
>
> vm.kmem_size_max="8192M" # Only on box with over 12G RAM.
> hw.igb.rxd="4096"
> hw.igb.txd="4096"
> hw.em.rxd="4096"
> hw.em.txd="4096"
> hw.igb.max_interrupt_rate="32000"
>
> hw.igb.rx_process_limit="4096"
> hw.em.rx_process_limit="500"
> net.link.ifqmaxlen="20480"
> net.isr.dispatch="direct"
> net.isr.direct_force="1"
> net.isr.direct="1"
> net.isr.maxthreads="8"
> net.isr.numthreads="4"
> net.isr.bindthreads="1"
> net.isr.maxqlimit="20480"
> net.isr.defaultqlimit="8192"
>
>

The following workarounds have worked for some people.
They may not solve your problem, but they are worth a try:

1. Increase the netgraph limits:
net.graph.maxdata=262140   # /boot/loader.conf
net.graph.maxalloc=262140  # /boot/loader.conf
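
After a reboot you can confirm the new values took effect and watch how close
the netgraph zones get to the limits. A rough sketch of the usual checks, not
specific to your setup:

# confirm the loader tunables were applied
sysctl net.graph.maxdata net.graph.maxalloc

# watch the NetGraph item zones (compare USED against LIMIT)
vmstat -z | grep -i netgraph

# count live netgraph nodes; mpd5 creates several nodes per session,
# so the count grows quickly with the number of tunnels
ngctl list | wc -l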

2. Remove the FLOWTABLE kernel option.
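
Removing it means rebuilding the kernel from a config that does not contain
the option. A minimal sketch, assuming your kernel config file is named LNS
(substitute your own KERNCONF and path):

# if FLOWTABLE is compiled into the running kernel there should be a
# net.flowtable sysctl tree; if the OID does not exist it is likely not built in
sysctl net.flowtable

# in /usr/src/sys/amd64/conf/LNS delete or comment out the line:
#   options FLOWTABLE
cd /usr/src
make buildkernel KERNCONF=LNS
make installkernel KERNCONF=LNS
# then reboot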

It would also help if you put your kernel and core dump somewhere for download so we can have a closer look at the panic trace.
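
If no dump is being saved at all, check that a dump device is configured,
e.g. dumpdev="AUTO" in /etc/rc.conf, so savecore can write the vmcore after
the panic. To pull a first backtrace out of an existing dump yourself, a
typical kgdb session looks roughly like this (the paths are the defaults and
may differ on your machine):

kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) bt
(kgdb) quit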

-- 

Best regards.
Hooman Fazaeli



