Date:      Wed, 8 Mar 2017 16:03:46 +0100
From:      Mateusz Guzik <mjguzik@gmail.com>
To:        Slawa Olhovchenkov <slw@zxy.spb.ru>
Cc:        Kevin Bowling <kevin.bowling@kev009.com>, freebsd-net <freebsd-net@freebsd.org>, "Eugene M. Zheganin" <emz@norma.perm.ru>
Subject:   Re: about that DFBSD performance test
Message-ID:  <20170308150346.GA32269@dft-labs.eu>
In-Reply-To: <20170308125710.GS15630@zxy.spb.ru>
References:  <b91a6e40-9956-1ad9-ac59-41a281846147@norma.perm.ru> <CAK7dMtDiT-PKyy5LkT1WEg5g-nwqv501F=Ap4dNCdwzwr_1dqA@mail.gmail.com> <20170308125710.GS15630@zxy.spb.ru>

On Wed, Mar 08, 2017 at 03:57:10PM +0300, Slawa Olhovchenkov wrote:
> On Wed, Mar 08, 2017 at 05:25:57AM -0700, Kevin Bowling wrote:
> 
> > Right off the bat, FreeBSD doesn't really understand NUMA in any sufficient
> > capacity.  Unfortunately at companies like the one I work at, we take that
> > to mean "OK buy a high bin CPU and only populate one socket" which serves
> 
> NUMA is applicable only to tasks with high compute locality.
> http/https/any network-related serving does not fall into that category.
> Indeed, on modern CPUs it is not important to bind NIC irq handlers to
> the same CPU/socket as the NIC.
> 

Well, for both benchmarks this is both true and false.

First and foremost there is general kernel scalability. Certain counters
and most locks are managed purely with atomic operations. An atomic
operation grabs the entire cacheline containing the particular variable
(64 bytes in total) in exclusive mode.

If you have to do an atomic operation, you are somewhat slower than you
would be otherwise.

If you have to do an atomic operation and another CPU holds the cacheline,
you are visibly slower. And if the cacheline travels a lot between CPUs
(e.g. because the lock is contended), performance degrades rapidly.

NUMA increases the cost of cacheline bounces, making the already bad
situation even worse.

Locking primitives are affected by NUMA significantly more than they
have to be (I'm working on that), but any fixes in the area are just
bandaids.

For instance, I reproduced the http benchmark and indeed I see about
75k req/s on a 2 * 10 * 2 box, although I'm only using one client.

Profiling shows excessive contention on the 'accept lock' and on something
else in the socket layer. The latter comes from kqueue being extremely
inefficient: it acquires and releases the same lock about 4 times per
call on average (if it took it *once*, that would significantly reduce
the lock bouncing around, including across the socket to a different
node). But even taking it once is likely too costly - no matter how fast
this can realistically get, if all nginx processes serialize on this
lock, it is not going to scale.

That said, the end result would be significantly higher if lock
granularity were better, and I suspect NUMA-awareness would not be a
significant factor in the http benchmark - provided locks are granular
enough, they would travel across the socket only when pushed out of the
cache (which would be rare), and there would be no contention.

This is a small excerpt from a reply I intend to write to the other
thread where the 'solisten' patch is discussed. That patch gets rid of
the accept lock contention, but this increases the load on another lock
and that temporarily slows things down.

-- 
Mateusz Guzik <mjguzik gmail.com>


