From: "Alexander V. Chernikov"
Date: Fri, 30 May 2014 15:08:21 +0400
To: Nadav Har'El
Cc: freebsd-net@freebsd.org, osv-dev@googlegroups.com
Subject: Re: Route caching
Message-ID: <538866A5.9050901@FreeBSD.org>
References: <20140529123306.GA16644@fermat.math.technion.ac.il>
In-Reply-To: <20140529123306.GA16644@fermat.math.technion.ac.il>
List-Id: Networking and TCP/IP with FreeBSD

On 29.05.2014 16:33, Nadav Har'El wrote:
> Hi,

Hello!

> I'm working on the OSv project (http://osv.io/), a new BSD-licensed
> operating system for virtual machines. OSv's networking code is based
> on that of FreeBSD.
>
> I recently noticed an inefficiency that I believe exists also in
> FreeBSD's networking code, and I was wondering why this was done,
> and whether FreeBSD can also be improved in the same way by fixing
> this problem.
>
> My issue is that, for example, when running a UDP server answering
> hundreds of thousands of requests per second, I get the same number of
> calls to the routing table lookup function (rtalloc_ign_fib(), etc.).
> These calls are relatively slow: Each involves several mutex locks and
> unlocks (a rwlock for the radix tree, and a mutex for the individual
> route), which are relatively slow in the uncontended case, but even worse
> when several CPUs start to access the network heavily, and we start to see
> context switches hurting the performance of the server even further.

Yes, that's true.

> Looking at FreeBSD's udp_output(), I see it does the following:
>
>     error = ip_output(m, inp->inp_options, NULL, ipflags,
>         inp->inp_moptions, inp)
>
> Note how NULL is passed as the third parameter. This tells ip_output
> that it can't cache the previously found route, and needs to look for
> it again and again on every packet output - even in the common case
> where a socket will only ever send packets on one interface.
>
> It seems that this change was done around FreeBSD 5.4. In the original
> UCB code (4.4Lite), I see this:
>
>     error = ip_output(m, inp->inp_options, &inp->inp_route,
>         inp->inp_socket->so_options & (SO_DONTROUTE | SO_BROADCAST),
>         inp->inp_moptions);
>
> So the last-found route was cached in inp->inp_route, and possibly
> reused on the next packet to be sent.
>
> Does anyone have any idea why inp->inp_route was removed in FreeBSD?
> Doesn't this also hurt FreeBSD's network performance?

Well, there are two problems:

First, using cached routes makes it more complex to change the routing
stack to be more efficient.
The second one is basically the fact that using cached routes is simply
incorrect: no one protects/notifies you on interface removal. That's why
we've removed cached route support from various tunneling schemes.

There is some ongoing work to change the rte_* API and eliminate the need
for the per-rte mutex (and use different, more efficient lookup mechanisms).

There is also another alternative which you can currently use: flowtable
(not included in GENERIC). It has been fixed recently and should work
better in your case.

> Thanks,
> Nadav.
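
To make the trade-off discussed above concrete, here is a minimal userland sketch in C. It is not FreeBSD code: every toy_* name is hypothetical, the table holds a single route for brevity, and the generation counter is an assumed way of detecting a stale cache, picked only for illustration (the thread does not describe such a mechanism). Passing NULL as the cache mirrors the per-packet lookup in udp_output(); passing a cache mirrors the 4.4Lite style of reusing the last-found route.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types; NOT struct route / struct rtentry from FreeBSD. */
struct toy_route {
    int ifindex;            /* outgoing interface index */
};

struct toy_table {
    struct toy_route entry; /* a single route, for brevity */
    int valid;              /* 0 once the interface has been removed */
    unsigned gen;           /* bumped on every table change (assumed scheme) */
    unsigned long lookups;  /* number of slow-path lookups performed */
};

struct toy_cache {
    struct toy_route route; /* a copy, so it can never dangle */
    unsigned gen;           /* table generation the copy was taken at */
    int filled;
};

/* Slow path: stands in for the locked routing table lookup. */
static int toy_lookup(struct toy_table *t, struct toy_route *out)
{
    t->lookups++;
    if (!t->valid)
        return -1;
    *out = t->entry;
    return 0;
}

/*
 * Returns the interface index to send on, or -1 if there is no route.
 * With c == NULL every call takes the slow path. With a cache, the slow
 * path is taken only when the cache is empty or its generation is stale.
 */
static int toy_output(struct toy_table *t, struct toy_cache *c)
{
    struct toy_route r;

    assert(t != NULL);
    if (c != NULL && c->filled && c->gen == t->gen)
        return c->route.ifindex;        /* fast path: reuse cached route */

    if (toy_lookup(t, &r) != 0) {
        if (c != NULL)
            c->filled = 0;              /* drop the stale entry */
        return -1;
    }
    if (c != NULL) {
        c->route = r;
        c->gen = t->gen;
        c->filled = 1;
    }
    return r.ifindex;
}

/* Table changes bump the generation so cached copies stop matching. */
static void toy_set_route(struct toy_table *t, int ifindex)
{
    t->entry.ifindex = ifindex;
    t->valid = 1;
    t->gen++;
}

static void toy_remove_interface(struct toy_table *t)
{
    t->valid = 0;
    t->gen++;
}
```

In this toy model, repeated toy_output() calls with a cache perform one lookup until the table changes, while calls with NULL perform one lookup per packet. Without the generation check, the cached entry would keep naming the removed interface, which is the correctness problem described in the reply.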