Date:      Tue, 13 Dec 2016 10:18:09 -0500
From:      Ian FREISLICH <ian.freislich@capeaugusta.com>
To:        freebsd-pf@freebsd.org
Subject:   Re: Poor PF performance with 2.5k rdr's
Message-ID:  <5d8f9f65-bb3a-ef25-0fbe-bfc28b9025df@capeaugusta.com>
In-Reply-To: <CANFey=-4bEtrBatWAQdUWQofTHUy2XseKyokpc3UmpxiUu-GMA@mail.gmail.com>
References:  <CANFey=-4bEtrBatWAQdUWQofTHUy2XseKyokpc3UmpxiUu-GMA@mail.gmail.com>

Chris,

It's been a fairly long time since I ran a FreeBSD router in a
production environment (10-CURRENT at the time).  tcp.sendspace/recvspace
will have no effect on forwarding performance.  Digging around in my
email, I've found some of my config:

--- /etc/pf.conf ---
# Options
# ~~~~~~~
set timeout { \
        adaptive.start  900000, \
        adaptive.end    1800000 \
        }
set block-policy return
set state-policy if-bound
set optimization normal
set ruleset-optimization basic
set limit states 1500000
set limit frags 40000
set limit src-nodes 150000
---

--- /etc/sysctl.conf ---
net.inet.ip.fastforwarding=1
net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1
net.inet.carp.preempt=1
net.inet.icmp.icmplim_output=0
net.inet.icmp.icmplim=0
kern.random.sys.harvest.interrupt=0
kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
net.route.netisr_maxqlen=8192
---
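As an aside, you shouldn't need a reboot to test these: on reasonably
recent FreeBSD, sysctl(8) can load a file directly (verify that the -f
flag exists on your release; otherwise restart the sysctl rc script):

```sh
# Load the settings from /etc/sysctl.conf without rebooting
sysctl -f /etc/sysctl.conf        # or: service sysctl restart

# Spot-check a couple of the values afterwards
sysctl net.inet.ip.fastforwarding net.route.netisr_maxqlen
```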

--- /boot/loader.conf ---
console="comconsole"
net.isr.maxthreads="8"
net.isr.defaultqlimit="4096"
net.isr.maxqlimit="81920"
net.isr.direct="0"
net.isr.direct_force="0"
kern.ipc.nmbclusters="262144"
kern.maxusers="1024"
hw.bce.rx_pages="8"
hw.bce.tx_pages="8"
---

Our pfctl -s info output at the time:

State Table                          Total             Rate
  current entries                   330022              
  searches                       516720212        91910.4/s
  inserts                         24545254         4365.9/s
  removals                        24215232         4307.2/s
Counters
  match                           66166232        11769.2/s

We were using a different NIC from yours and eventually moved to ixgb(4)
and bxe(4) NICs to handle the traffic, but the principle is the same:
tune the queues.  We didn't have as many rdr rules as you do, but the
rule set is only searched linearly when there is no matching state in
the state table.  This means the rules are searched linearly only for
the first packet of each flow.
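To make that concrete, here's a toy model (Python, purely illustrative;
not pf's actual code or data structures): the first packet of a flow
pays a walk over all ~2500 rules, and every later packet is a single
state-table lookup.

```python
# Toy model of pf-style packet evaluation (illustrative only):
# a flow's first packet walks the rules linearly; once a state
# entry exists, later packets hit the state table directly.

def handle_packet(flow_key, rules, state_table, evals):
    """Return the matching rule, counting linear rule evaluations in evals[0]."""
    if flow_key in state_table:        # established flow: hash lookup, no rule walk
        return state_table[flow_key]
    for rule in rules:                 # first packet: linear search over the ruleset
        evals[0] += 1
        if rule(flow_key):
            state_table[flow_key] = rule
            return rule
    return None

# ~2500 rdr-like rules, each matching exactly one destination "port"
rules = [(lambda p: (lambda fk: fk == p))(p) for p in range(2500)]

state_table, evals = {}, [0]
handle_packet(2499, rules, state_table, evals)  # worst case: last rule matches
first_packet_evals = evals[0]                   # 2500 evaluations
for _ in range(1000):                           # rest of the flow: zero evaluations
    handle_packet(2499, rules, state_table, evals)
print(first_packet_evals, evals[0])             # 2500 2500
```

Which is why the hit only shows up on connection setup rates, not on
established flows.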

In my testing, the other large contributor to forwarding rate is L1
cache size.  Intel CPUs have traditionally had very small L1 caches,
ranging from 12K to 32K, and they're almost never quoted in marketing
or comparison material.  Your CPU has 32K of L1 data and 32K of L1
instruction cache per core.  You may want to try disabling HT, if
that's possible these days, to reduce L1 contention between the two HT
threads on each core.  I may be talking total rubbish regarding HT and
cache architecture, but I think it's worth a try.
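If the BIOS doesn't offer an HT switch, older FreeBSD releases exposed
a loader tunable for it; I'm not certain it's still honoured on current
kernels, so treat this as a sketch and verify against your release:

```
--- /boot/loader.conf ---
machdep.hyperthreading_allowed="0"
---
```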

Ian

-- 
Ian Freislich

On 12/11/16 11:22, chris g wrote:
> Hello,
>
> I've decided to write here, as we've had no luck troubleshooting PF's
> poor performance on a 1GE interface.
>
> Network scheme, given as simplest as possible is:
>
> ISP <-> BGP ROUTER <-> PF ROUTER with many rdr rules <-> LAN
>
> Problem is reproducible on any PF ROUTER's connection - to LAN and to BGP ROUTER
>
>
> Both BGP and PF routers' OS versions and tunables, hardware:
>
> Hardware: E3-1230 V2 with HT on, 8GB RAM, ASUS P8B-E, NICs: Intel I350 on PCIe
>
> FreeBSD versions tested: 9.2-RELEASE amd64 with a custom kernel,
> 10.3-STABLE (compiled 4 Dec 2016) amd64 with the GENERIC kernel.
>
> Basic tunables (for 9.2-RELEASE):
> net.inet.ip.forwarding=1
> net.inet.ip.fastforwarding=1
> kern.ipc.somaxconn=65535
> net.inet.tcp.sendspace=65536
> net.inet.tcp.recvspace=65536
> net.inet.udp.recvspace=65536
> kern.random.sys.harvest.ethernet=0
> kern.random.sys.harvest.point_to_point=0
> kern.random.sys.harvest.interrupt=0
> kern.polling.idle_poll=1
>
> BGP router doesn't have any firewall.
>
> PF options of PF router are:
> set state-policy floating
> set limit { states 2048000, frags 2000, src-nodes 384000 }
> set optimization normal
>
>
> Problem description:
> We are experiencing low throughput when PF is enabled with all the
> rdr's.  If 'skip' is set on the benchmarked interface, or the rdr
> rules are commented out (not present), the bandwidth is flawless.
> There is no scrubbing done in PF; most of the roughly 2500 rdr rules
> look like this (note that no interface is specified, intentionally):
>
> rdr pass inet proto tcp from any to 1.2.3.4 port 1235 -> 192.168.0.100 port 1235
>
> All measurements were taken using iperf 2.0.5 with options "-c <IP>"
> or "-c <IP> -m -t 60 -P 8" on client side and "-s" on server side. We
> changed directions too.
> Please note that this is a production environment and there was some
> other traffic on the benchmarked interfaces (say 20-100Mbps) during
> both tests, thus iperf won't show the full Gigabit.  There is no
> networking equipment between 'client' and 'server' - just 2 NICs
> directly connected with a Cat6 cable.
>
> Without further ado, here are benchmark results:
>
> server's PF enabled with fw rules but without rdr rules:
>   root@client:~ # iperf -c server
> ------------------------------------------------------------
> Client connecting to server, TCP port 5001
> TCP window size: 65.0 KByte (default)
> ------------------------------------------------------------
> [  3] local clients_ip port 51361 connected with server port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  1.09 GBytes   936 Mbits/sec
>
>
>
> server's PF enabled with fw rules and around 2500 redirects present:
> root@client:~ # iperf -c server
> ------------------------------------------------------------
> Client connecting to server, TCP port 5001
> TCP window size: 65.0 KByte (default)
> ------------------------------------------------------------
> [  3] local clients_ip port 45671 connected with server port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec   402 MBytes   337 Mbits/sec
>
>
> That much of a difference is 100% reproducible in the production environment.
>
> Performance depends on the time of day; the result is 160-400Mbps
> with the rdr rules present and always above 900Mbps with the rdr
> rules disabled.
>
>
> Some additional information:
>
> # pfctl -s info
> Status: Enabled for 267 days 10:25:22         Debug: Urgent
>
> State Table                          Total             Rate
>   current entries                   132810
>   searches                      5863318875          253.8/s
>   inserts                        140051669            6.1/s
>   removals                       139918859            6.1/s
> Counters
>   match                         1777051606           76.9/s
>   bad-offset                             0            0.0/s
>   fragment                             191            0.0/s
>   short                                518            0.0/s
>   normalize                              0            0.0/s
>   memory                                 0            0.0/s
>   bad-timestamp                          0            0.0/s
>   congestion                             0            0.0/s
>   ip-option                           4383            0.0/s
>   proto-cksum                            0            0.0/s
>   state-mismatch                     52574            0.0/s
>   state-insert                         172            0.0/s
>   state-limit                            0            0.0/s
>   src-limit                              0            0.0/s
>   synproxy                               0            0.0/s
>
> # pfctl -s states | wc -l
>   113705
>
> # pfctl  -s memory
> states        hard limit  2048000
> src-nodes     hard limit   384000
> frags         hard limit     2000
> tables        hard limit     1000
> table-entries hard limit   200000
>
> # pfctl -s Interfaces|wc -l
>       75
>
> # pfctl -s rules | wc -l
>     1226
>
>
> In our opinion the hardware is not too weak, as we see only 10-30%
> CPU usage, and during the benchmark it doesn't reach 100%.  Not even
> a single vcore is at 100% CPU usage.
>
>
> I would be really grateful if someone could point me in the right direction.
>
>
> Thank you,
> Chris
> _______________________________________________
> freebsd-pf@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-pf
> To unsubscribe, send any mail to "freebsd-pf-unsubscribe@freebsd.org"


-- 
Cape Augusta Digital Properties, LLC a Cape Augusta Company


