Date:      Mon, 06 Apr 2009 14:35:33 +0200
From:      Ivan Voras <ivoras@freebsd.org>
To:        freebsd-net@freebsd.org
Subject:   Re: Advice on a multithreaded netisr  patch?
Message-ID:  <grcsus$9vh$1@ger.gmane.org>
In-Reply-To: <alpine.BSF.2.00.0904061238250.34905@fledge.watson.org>
References:  <gra7mq$ei8$1@ger.gmane.org>	<alpine.BSF.2.00.0904051422280.12639@fledge.watson.org>	<grac1s$p56$1@ger.gmane.org>	<alpine.BSF.2.00.0904051440460.12639@fledge.watson.org>	<grappq$tsg$1@ger.gmane.org>	<alpine.BSF.2.00.0904052243250.34905@fledge.watson.org>	<grbcfg$poe$1@ger.gmane.org> <alpine.BSF.2.00.0904061238250.34905@fledge.watson.org>

Robert Watson wrote:
> On Mon, 6 Apr 2009, Ivan Voras wrote:

>> So, a mbuf can reference data not yet copied from the NIC hardware?
>> I'm specifically trying to understand what m_pullup() does.
>
> I think we're talking slightly at cross purposes.  There are two
> transfers of interest:
>
> (1) DMA of the packet data to main memory from the NIC
> (2) Servicing of CPU cache misses to access data in main memory
>
> By the time you receive an interrupt, the DMA is complete, so once you

OK, this was what was confusing me - for a moment I thought you meant
it's not so.

> believe a packet referenced by the descriptor ring is done, you don't
> have to wait for DMA.  However, the packet data is in main memory rather
> than your CPU cache, so you'll need to take a cache miss in order to
> retrieve it.  You don't want to prefetch before you know the packet data
> is there, or you may prefetch stale data from the previous packet sent
> or received from the cluster.
>
> m_pullup() has to do with mbuf chain memory contiguity during packet
> processing.  The usual usage is something along the following lines:
>
>     struct whatever *w;
>
>     m = m_pullup(m, sizeof(*w));
>     if (m == NULL)
>         return;
>     w =3D mtod(m, struct whatever *);
>
> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data are
> contiguously stored so that the cast of w to m's data will point at a

So, m_pullup() can resize / realloc() the mbuf? (not that it matters for
this purpose)
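
For reference, a toy userspace model of what a pullup does. This is only a
sketch, not the kernel code: struct toy_buf, toy_alloc(), toy_freechain() and
toy_pullup() are made-up names, and every node is assumed to have a fixed
64-byte store. The real m_pullup() may also hand back a different mbuf than
the one passed in, which is why the idiom reassigns m; the toy version always
has room in the head node and only copies bytes forward. Like the real
function, it frees the whole chain and returns NULL on failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an mbuf: one node in a chain of small byte buffers. */
struct toy_buf {
    struct toy_buf *next;
    size_t len;               /* valid bytes at the start of data[] */
    unsigned char data[64];   /* assumed fixed per-node storage */
};

/* Allocate a node holding a copy of src[0..len); NULL if it cannot fit. */
struct toy_buf *toy_alloc(const void *src, size_t len)
{
    struct toy_buf *b;

    if (len > sizeof(b->data))
        return NULL;
    b = calloc(1, sizeof(*b));
    if (b == NULL)
        return NULL;
    memcpy(b->data, src, len);
    b->len = len;
    return b;
}

/* Free every node in the chain; a NULL chain is fine. */
void toy_freechain(struct toy_buf *b)
{
    while (b != NULL) {
        struct toy_buf *n = b->next;
        free(b);
        b = n;
    }
}

/*
 * Make the first `want` bytes of the chain contiguous in the head node by
 * moving bytes forward from the following nodes.  On failure the chain is
 * freed and NULL is returned, so the caller must not touch it afterwards.
 */
struct toy_buf *toy_pullup(struct toy_buf *head, size_t want)
{
    if (head == NULL || want > sizeof(head->data)) {
        toy_freechain(head);
        return NULL;
    }
    while (head->len < want) {
        struct toy_buf *n = head->next;
        size_t take;

        if (n == NULL) {              /* chain holds fewer than `want` bytes */
            toy_freechain(head);
            return NULL;
        }
        take = want - head->len;
        if (take > n->len)
            take = n->len;
        assert(head->len + take <= sizeof(head->data));
        memcpy(head->data + head->len, n->data, take);
        head->len += take;
        memmove(n->data, n->data + take, n->len - take);
        n->len -= take;
        if (n->len == 0) {            /* drained node: unlink and free it */
            head->next = n->next;
            free(n);
        }
    }
    return head;
}
```

After a successful toy_pullup(head, sizeof(struct whatever)), casting
head->data to a header struct is safe in the same sense that mtod() is safe
after m_pullup(): the bytes are guaranteed to sit in one buffer.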

> Is this for the loopback workload?  If so, remember that there may be
> some other things going on:

Both loopback and physical.

> - Every packet is processed at least two times: once when sent, and then
>   again when it's received.
>
> - A TCP segment will need to be ACK'd, so if you're sending data in chunks in
>   one direction, the ACKs will not be piggy-backed on existing data transfers,
>   and instead be sent independently, hitting the network stack two more
>   times.

No combination of these can account for the difference between 1,000
and 250,000 pps. I must be hitting something very bad here.

> - Remember that TCP works to expand its window, and then maintains the highest
>   performance it can by bumping up against the top of available bandwidth
>   continuously.  This involves detecting buffer limits by generating packets
>   that can't be sent, adding to the packet count.  With loopback traffic, the
>   drop point occurs when you exceed the size of the netisr's queue for IP, so
>   you might try bumping that from the default to something much larger.

My messages are approx. 100 +/- 10 bytes. No practical way they will
even span multiple mbufs. TCP_NODELAY is on.

> No.  x++ is massively slow if executed in parallel across many cores on
> a variable in a single cache line.  See my recent commit to kern_tc.c
> for an example: the updating of trivial statistics for the kernel time
> calls reduced 30m syscalls/second to 3m syscalls/second due to heavy
> contention on the cache line holding the statistic.  One of my goals for

I don't get it:
http://svn.freebsd.org/viewvc/base/stable/7/sys/kern/kern_tc.c?r1=189891&r2=189890&pathrev=189891

you replaced x++ with no-ops if TC_COUNTER is defined? Aren't the
timecounters actually needed somewhere?

> 8.0 is to fix this problem for IP and TCP layers, and ideally also ifnet
> but we'll see.  We should be maintaining those stats per-CPU and then
> aggregating to report them to userspace.  This is what we already do for
> a number of system stats -- UMA and kernel malloc, syscall and trap
> counters, etc.

How magic is this? Is it just a matter of declaring mystat[NCPU]
and updating mystat[current_cpu], or (probably) does the spacing between
array elements have to be arranged so that two elements don't share a
cache line?
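
A minimal sketch of the padded per-CPU array idea, under stated assumptions: a
64-byte cache line, a fixed number of slots, and a caller that supplies its
own CPU index. CACHE_LINE, NSLOTS, struct padded_counter, mystat_inc() and
mystat_total() are names invented for illustration, not kernel interfaces.
Aligning each element to the cache line size is what keeps two CPUs from
writing to the same line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */
#define NSLOTS     8    /* assumed upper bound on CPUs for this sketch */

/* One counter per CPU; alignas pads the struct out to a full cache line. */
struct padded_counter {
    alignas(CACHE_LINE) uint64_t value;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
    "each counter must occupy exactly one cache line");

static struct padded_counter mystat[NSLOTS];

/*
 * Bump the counter owned by `cpu`.  The caller is assumed to stay on that
 * CPU for the duration of the call; nothing here enforces that.
 */
void mystat_inc(unsigned cpu)
{
    mystat[cpu % NSLOTS].value++;
}

/*
 * Sum all slots for reporting.  No locking: the total is a snapshot that
 * may miss increments happening while the loop runs.
 */
uint64_t mystat_total(void)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < NSLOTS; i++)
        sum += mystat[i].value;
    return sum;
}
```

The cost is memory (64 bytes per counter per CPU instead of 8) and a slower,
slightly imprecise read; the gain is that the hot path writes only to a line
no other CPU is writing.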

>>> - Use cpuset to pin ithreads, the netisr, and whatever else, to specific cores
>>>   so that they don't migrate, and if your system uses HTT, experiment with
>>>   pinning the ithread and the netisr on different threads on the same core, or
>>>   at least, different cores on the same die.
>>
>> I'm using em hardware; I still think there's a possibility I'm
>> fighting the driver in some cases but this has priority #2.
>
> Have you tried LOCK_PROFILING?  It would quickly tell you if driver
> locks were a source of significant contention.  It works quite well...

I don't think I'm fighting against locking artifacts, it looks more like
some kind of overly smart hardware thing, like interrupt moderation (but
not exactly interrupt moderation since the number of IRQs/s remains
approx. the same).

>>> - If your card supports RSS, pass the flowid up the stack in the mbuf packet
>>>   header flowid field, and use that instead of the hash for work placement.
>>
>> Don't know about em. Don't really want to touch it if I don't have to :)
>
> if_em doesn't support it, but if_igb does.  If this saves you a minimum
> of one and possibly two cache misses per packet, it could be a huge
> performance improvement.

If I had the funds to upgrade hardware, I wouldn't be so interested in
solving it in software :)
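
The flowid suggestion quoted above can be sketched roughly as follows. This
is an illustration under assumptions, not the netisr code: struct toy_pkt,
its has_flowid flag, toy_hash() and toy_pick_worker() are hypothetical, and
FNV-1a merely stands in for whatever software hash the patch uses. The point
is that when the hardware supplies a flow identifier, the header bytes are
not read at all for placement, which is where the saved cache misses would
come from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet descriptor for the sketch. */
struct toy_pkt {
    bool has_flowid;            /* true if the NIC supplied a flow id */
    uint32_t flowid;            /* valid only when has_flowid is set */
    const unsigned char *hdr;   /* header bytes, likely not yet in cache */
    size_t hdrlen;
};

/* Software fallback: 32-bit FNV-1a over the header bytes. */
static uint32_t toy_hash(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Pick a worker index in [0, nworkers).  Packets of one flow always map to
 * the same worker, and the header is only read when no flow id is present.
 */
unsigned toy_pick_worker(const struct toy_pkt *p, unsigned nworkers)
{
    uint32_t key;

    assert(nworkers > 0);
    key = p->has_flowid ? p->flowid : toy_hash(p->hdr, p->hdrlen);
    return key % nworkers;
}
```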

