Date:      Fri, 22 Jul 2016 12:23:35 -0700
From:      Adrian Chadd <adrian@freebsd.org>
To:        Sepherosa Ziehau <sepherosa@gmail.com>
Cc:        Andrew Gallatin <gallatin@cs.duke.edu>, FreeBSD Net <freebsd-net@freebsd.org>
Subject:   Re: proposal: splitting NIC RSS up from stack RSS
Message-ID:  <CAJ-Vmomj5XjtmqbTukmxqdiF_A-Ga1jFMA5r24=CXcG0gueYsg@mail.gmail.com>
In-Reply-To: <CAMOc5cxEWqWOMPSXFe3=N5S93bs8RO-XX22QghtHd8vC5xuNjA@mail.gmail.com>
References:  <CAJ-Vmo=Wj3ZuC6mnVCxonQ74nfEmH7CE=TP3xhLzWifdBxxfBQ@mail.gmail.com> <306af514-70ff-f3bf-5b4f-da7ac1ec6580@cs.duke.edu> <CAJ-VmomHYVCknVkDLF+b8Gc5wBWxkddEMY3dhvbxJihLZHyTLg@mail.gmail.com> <CAMOc5cxEWqWOMPSXFe3=N5S93bs8RO-XX22QghtHd8vC5xuNjA@mail.gmail.com>

On 21 July 2016 at 18:54, Sepherosa Ziehau <sepherosa@gmail.com> wrote:
> On Fri, Jul 22, 2016 at 6:39 AM, Adrian Chadd <adrian@freebsd.org> wrote:
>> hi,
>>
>> Cool! Yeah, the RSS bits thing can be removed, as it's just doing a
>> bitmask instead of a % operator to do mapping. I think we can just go
>> to % and if people need the extra speed from a power-of-two operation,
>> they can reintroduce it.
>
> I thought about it a while ago (the most popular E5-2560v{1,2,3} only
> has 6 cores, but E5-2560v4 has 8 cores! :).  Since the raw RSS hash
> value is masked ('& 0x7f' for the 128-entry indirection table that I
> believe most NICs implement, as defined by the MS RSS spec) to select
> an entry in the indirection table, simply applying '%' to the raw RSS
> hash value probably won't work properly; you'll need at least
> (hash & 0x7f) % mp_ncpus.  And since the indirection table's size is
> 128, you'll still get somewhat uneven CPU load for a non-power-of-2
> CPU count (e.g. 128 entries over 6 CPUs leaves two CPUs with 22
> buckets each and the other four with 21).  And if you take CPU
> affinity into consideration, the situation gets even more complex ...

Hi,

Sure. The biggest annoyance is that a lot of the kernel
infrastructure for queueing packets (netisr) and scheduling stack work
(callouts) is indexed by CPU, not by "thing". If it were indexed by
"thing" then we could do a two-stage work redistribution method that
scales O(1):

* packets get plonked into a "thing" via some mapping table - eg, map
128 or 256 buckets to queues that do the work / schedule callouts /
netisr; and
* the queues aren't tied to a CPU at that point, so they can get
shuffled around between CPUs using cpumasks (rough sketch below).
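
Something like this, say - entirely hypothetical names, none of this
exists in the tree today, it's just to show the shape of the two
stages:

    #include <sys/param.h>
    #include <sys/cpuset.h>

    #define RSS_NBUCKETS    128     /* power of two, matches NIC table */

    struct work_queue {
            cpuset_t        wq_cpumask;     /* CPUs allowed to run this */
            /* ... packet queue, netisr/callout state, etc ... */
    };

    static struct work_queue work_queues[8];
    /* stage 1: bucket -> queue; entries must index work_queues[] */
    static uint8_t bucket_to_queue[RSS_NBUCKETS];

    static inline struct work_queue *
    rss_hash_to_queue(uint32_t hash)
    {
            return (&work_queues[bucket_to_queue[hash &
                (RSS_NBUCKETS - 1)]]);
    }

The point being that a flow's hash always lands in the same bucket and
hence the same queue, so moving load around is purely a matter of
rewriting wq_cpumask - nothing per-packet has to be rescheduled.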

It'd be really, really nice IMHO if netisr and callouts were
"thing" based rather than "cpu" based, so we could shift work just by
changing the CPU mask - then we wouldn't have to worry about
rescheduling packets or work onto the new CPU whenever we want to move
load around. That avoids the risk of out-of-order packet handling, and
it means we could (in theory!) have a given RSS bucket serviced by more
than one CPU, for things like TCP processing.
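
Rebalancing then looks like nothing more than rewriting a queue's mask
(again hypothetical, continuing the sketch above):

    /*
     * Repoint the queue at a different set of CPUs.  bucket_to_queue[]
     * is untouched, so flows stay in their queue and nothing gets
     * reordered; the worker threads just migrate.
     */
    static void
    wq_rebalance(struct work_queue *q, const cpuset_t *newmask)
    {
            CPU_COPY(newmask, &q->wq_cpumask);
            /* wake the queue's worker(s) so they notice the new mask */
    }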

Trouble is, this is somewhat contentious. I could do the netisr change
without upsetting people, but the callout code honestly makes me want
to set everything (in sys/kern) on fire and start again. After all of
the current issues with the callout subsystem I kind of just want to
see hps finish his work and land it into head, complete with more
sensible lock semantics, before I look at making it not per-CPU based
and instead allowing subsystems to create their own worker pools for
callouts. I'm sure NFS and CAM would like this kind of thing too.
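
To make that concrete: today a timer gets pinned to one CPU with
callout_reset_on() (that part is the real callout(9) API); what I'd
like is for a subsystem to hand its callouts to a pool it owns. The
callout_pool_* names below are completely made up:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/callout.h>

    struct callout_pool;                            /* hypothetical */
    void callout_pool_reset(struct callout_pool *, struct callout *,
        int, void (*)(void *), void *);             /* hypothetical */

    static void
    timer_fn(void *arg)
    {
            /* ... periodic subsystem work ... */
    }

    /* Today: the real API pins the callout to a single CPU. */
    static void
    arm_today(struct callout *c, void *arg, int cpu)
    {
            callout_reset_on(c, hz, timer_fn, arg, cpu);
    }

    /* Wished-for: the pool's cpuset decides where this runs. */
    static void
    arm_pooled(struct callout_pool *cp, struct callout *c, void *arg)
    {
            callout_pool_reset(cp, c, hz, timer_fn, arg);
    }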

Since people have asked me about this in the past: the side effect of
supporting dynamic hash mapping (even in software) is that for any
given flow, once you change the hash mapping, you'll have some of that
flow's packets in the old queue and some in the new queue. For things
like stack TCP/UDP using pcbgroups, the result varies from merely slow
lookups to (eventually, when the global list goes away) packets plainly
not making it to the right pcb/socket - which is okay for some
workloads and not for others. It may be a fun project to work on once
the general stack / driver tidy-ups are done, but I'm going to resist
doing it myself for a while, because it introduces exactly that
uncertainty, and the resulting out-of-order behaviour would likely
generate more problem reports than I want to handle.
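
In code terms the naive remap is just a one-line store (using the
hypothetical table from the earlier sketch), and the window between
that store and the old queue draining is where the reordering comes
from:

    /*
     * Naively remap one RSS bucket.  Packets for this bucket already
     * queued on the old queue race the packets the new mapping steers
     * to the new queue, so one flow is briefly live on two queues and
     * can be delivered out of order.
     */
    static void
    rss_remap_bucket(u_int bucket, uint8_t newq)
    {
            bucket_to_queue[bucket] = newq;
            /*
             * A correct version would drain (or forward) the old
             * queue's packets for this bucket before switching -
             * which is exactly the redirect trick described below.
             */
    }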

(Read: since I'm doing this for free, I'm not going to do anything
risky, as I'm not getting paid to wade through the repercussions just
right now.)

FWIW, we had this same problem in ye olde past with squid and WCCP
and its hash-based system. Squid's WCCP implementation was simple and
static. The commercial solutions (read: cisco, etc) handled the cache
set / hash traffic map changing by having the caches redirect traffic
to the /old/ cache whenever the hash or cache set changed. Squid
didn't do this out of the box, so if the cache topology changed it
would send traffic to the wrong box and existing connections would
break.



-adrian


