Date:      Wed, 17 Apr 2019 13:09:50 -0500
From:      Jim Thompson <jim@netgate.com>
To:        Wojciech Puchar <wojtek@puchar.net>
Cc:        Miroslav Lachman <000.fbsd@quip.cz>, Mark Millard via freebsd-hackers <freebsd-hackers@freebsd.org>
Subject:   Re: openvpn and system overhead
Message-ID:  <94EA4F3F-4D78-4E08-9AF8-441B957A4749@netgate.com>
In-Reply-To: <alpine.BSF.2.20.1904171753480.98262@puchar.net>
References:  <alpine.BSF.2.20.1904171707030.87502@puchar.net> <8648d069-2172-2c09-8e59-d66a8265a120@quip.cz> <alpine.BSF.2.20.1904171753480.98262@puchar.net>



> On Apr 17, 2019, at 10:54 AM, Wojciech Puchar <wojtek@puchar.net> wrote:
>
> On Wed, 17 Apr 2019, Miroslav Lachman wrote:
>
>> Wojciech Puchar wrote on 2019/04/17 17:08:
>>> i'm running openvpn server on Xeon E5 2620 server.
>>> when receiving 100Mbit/s traffic over VPN it uses 20% of single core.
>>> At least 75% of it is system time.
>>> Seems like 500Mbit/s is a max for a single openvpn process.
>>> can anything be done about that to improve performance?
>>
>> You can play with ciphers, AES-NI etc.
>> https://community.openvpn.net/openvpn/wiki/Gigabit_Networks_Linux
>>
>> Miroslav Lachman
>>
> again. it's system time mostly not user time.

Yup.  I’ve looked at this a bunch over the years for pfSense.

The tun/tap device can be viewed as a simple point-to-point IP or
Ethernet device which, instead of receiving packets from physical
media, receives them from a user space program and, instead of sending
packets via physical media, sends them to the user space program.

Let's say that you configured IP on tap0. Then whenever the kernel
sends an IP packet to tap0, it is passed to the application (OpenVPN,
for example). OpenVPN optionally compresses, encrypts, and
authenticates this packet, encapsulates it, and sends it to the other
side over TCP or (preferably) UDP.

The application on the other side receives the packet, decrypts and
decompresses the data received, and writes the packet to its TAP
device. The kernel on the other side then handles the packet as if it
came from a real physical device.
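
As a rough illustration of the path described above, here is a minimal
Python sketch of one direction of such a relay. It is not OpenVPN's
code: `pump_once`, `encapsulate`, `TUN_PATH` and `MTU` are hypothetical
names, `encapsulate` is an identity placeholder standing in for the
compress/encrypt/encapsulate step, and the device path assumes a
FreeBSD-style /dev/tun0 (Linux would need /dev/net/tun plus an ioctl).

```python
import os
import socket

TUN_PATH = "/dev/tun0"   # assumption: FreeBSD-style tun device node
MTU = 1500               # assumption: typical Ethernet MTU


def encapsulate(ip_packet: bytes) -> bytes:
    """Placeholder for compress/encrypt/authenticate/encapsulate.

    Identity function here; a real VPN would transform the packet.
    """
    return ip_packet


def pump_once(tun_fd: int, sock: socket.socket) -> int:
    """Relay one packet from the tun descriptor to the tunnel socket.

    Returns the number of bytes read from the tun side.
    """
    packet = os.read(tun_fd, MTU)    # syscall 1: kernel -> user copy
    sock.send(encapsulate(packet))   # syscall 2: user -> kernel copy
    return len(packet)
```

With root privileges one might open the device with
`os.open(TUN_PATH, os.O_RDWR)` and call `pump_once` in a loop with a
connected UDP socket. Every packet then costs at least two system calls
and two copies on this side alone.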

Each time you copy data from user to kernel or kernel to user space,
you also incur a context switch with all the associated overheads.

Using a tun/tap device incurs an additional context switch in each
direction, as you’re basically running one program to send data (say,
‘ping’ or ‘ssh’), and another program is used to encrypt and
encapsulate the packet before it leaves the machine.  The process is
roughly the same on the other side.  So you get twice the copies, and
twice the number of context switches.  Making things worse, the “IP
stack” inside OpenVPN is single-threaded, and processes one packet at a
time, so all the overheads accrue to each packet, rather than being
amortized across several packets.
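
The amortization point can be put as a toy cost model. This is only a
sketch: `cycles_per_packet` and its parameters are hypothetical names,
and the numbers in the usage note are invented for illustration, not
measurements.

```python
def cycles_per_packet(fixed_per_call: float,
                      work_per_packet: float,
                      batch: int) -> float:
    """Average cost per packet when one kernel crossing carries `batch` packets.

    fixed_per_call  -- cost of the crossing itself (syscall, switch, wakeup)
    work_per_packet -- cost that scales with each packet (copy, crypto)
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    # The fixed cost is shared by every packet in the batch.
    return work_per_packet + fixed_per_call / batch
```

With made-up values of 4000 cycles per crossing and 500 cycles of
per-packet work, `cycles_per_packet(4000, 500, 1)` gives 4500 while
`cycles_per_packet(4000, 500, 8)` gives 1000. Processing one packet at
a time corresponds to the `batch=1` case.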

Net-net, OpenVPN won’t do close to 1 Mpps.  There is a decent-enough
write-up of recent actual benchmarking in a master’s thesis that
compares IPsec, OpenVPN and WireGuard on Linux here:

https://www.net.in.tum.de/fileadmin/bibtex/publications/theses/2018-pudelko-vpn-performance.pdf

Section 5.5 if you want to skip to the substance.  Basically, with *no*
encryption overheads, OpenVPN still has a static overhead of around
8500 cycles/packet on the setup they used (Xeon E5-2620 v4), which
seems quite similar to yours.  Given all this, they show that OpenVPN
enters an overload condition at around 120 Kpps.
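
To put those figures in more familiar units, here is a small
back-of-the-envelope calculation. It is a sketch only: the function
names are hypothetical, the 2.1 GHz clock is an assumption (the nominal
base clock of an E5-2620 v4, ignoring turbo), and packet sizes are
example values.

```python
def max_pps(clock_hz: float, cycles_per_packet: float) -> float:
    """Upper bound on packets/second for one core at a fixed per-packet cost."""
    return clock_hz / cycles_per_packet


def throughput_mbit(pps: float, packet_bytes: int) -> float:
    """Convert a packet rate to Mbit/s for a given packet size."""
    return pps * packet_bytes * 8 / 1e6


CLOCK_HZ = 2.1e9        # assumption: nominal base clock, no turbo
STATIC_CYCLES = 8500    # static per-packet overhead quoted above

ceiling = max_pps(CLOCK_HZ, STATIC_CYCLES)   # roughly 247 Kpps
big = throughput_mbit(120_000, 1500)         # 120 Kpps of 1500-byte packets
small = throughput_mbit(120_000, 64)         # 120 Kpps of 64-byte packets
```

Under these assumptions the static overhead alone caps a single core at
roughly 247 Kpps, and that is before any crypto.  At the reported
120 Kpps overload point, 1500-byte packets work out to about 1.44 Gbit/s
and 64-byte packets to about 61 Mbit/s, so the achievable bit rate
depends heavily on packet size.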

There is some hope if you really have to have a lower-overhead OpenVPN.
An OpenVPN session has two channels, multiplexed on the same
connection.  One is a control channel, the other is a data channel.
The control channel and associated configuration code in OpenVPN is …
complex.  It has close to 10 trillion configuration options, and any
re-write of this code would be a huge, huge undertaking.  Nearly
unthinkable, really.  The data channel, otoh, is relatively
straightforward, especially if you don’t need all the crypto options
provided, and, instead, limit yourself to, say, AES-GCM or another AEAD
(ChaCha20-Poly1305) transform.  (Here, if your CPU has AES-NI or
similar (e.g. ARMv8 has AES acceleration instructions) AES-GCM will
always be faster.)

But, if you’re willing to limit yourself to one, or a few transforms,
in theory, it’s possible to make a specialized tun / tap device such
that the data channel is kept in-kernel, with encryption/decryption and
encapsulation/decapsulation of data packets occurring in the kernel,
but control packets passed up and down to/from the associated user
space process.
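
The split described above hinges on telling data packets from control
packets cheaply.  A minimal sketch of that classification follows.  It
assumes OpenVPN's UDP framing, where the first byte carries a 5-bit
opcode and a 3-bit key id, and the opcode values 6 (P_DATA_V1) and
9 (P_DATA_V2); these are my understanding of the protocol and should be
checked against the OpenVPN source.  `is_data_packet` and `dispatch`
are hypothetical names, and a real implementation would live in the
kernel, not in Python.

```python
P_OPCODE_SHIFT = 3   # assumption: opcode is the high 5 bits of byte 0
P_DATA_V1 = 6        # assumption: data-channel opcode, version 1
P_DATA_V2 = 9        # assumption: data-channel opcode, version 2


def is_data_packet(datagram: bytes) -> bool:
    """True if the datagram's opcode marks it as data-channel traffic."""
    if not datagram:
        return False
    opcode = datagram[0] >> P_OPCODE_SHIFT
    return opcode in (P_DATA_V1, P_DATA_V2)


def dispatch(datagram: bytes) -> str:
    """Pick a path: fast path for data, user space for everything else."""
    if is_data_packet(datagram):
        return "kernel-data-path"
    return "userspace-control"
```

Anything that is not recognized as data, including malformed or empty
datagrams, falls through to the user space process, which keeps the
fast path small and leaves the complicated handling where it already
lives.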

A partial attempt at this idea (for Linux) can be found here:
https://github.com/marywangran/OpenVPN-Linux-kernel  It looks
abandoned, so maybe it didn’t pan out, or maybe the work just got
asymptotic.

There is a bunch of work to get this right (keeping the OpenVPN user
process happy, counters up to date, etc), but, at the end of the day,
it’s all software.  Netflix got enough of OpenSSL's AES-GCM
implementation into the kernel to run the transmit side.  They didn’t
care about the receive side, and just let nginx deal with the
relatively light rx flows in their deployment, but it does show that
it’s possible with enough work.

Even with all that work, it will probably never be as fast as a decent
IPsec implementation.

Jim




