Date:      Sun, 29 Jul 2018 21:38:20 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Adrian Chadd <adrian.chadd@gmail.com>, "ryan@ixsystems.com" <ryan@ixsystems.com>, FreeBSD Net <freebsd-net@freebsd.org>
Subject:   Re: 9k jumbo clusters
Message-ID:  <YTOPR0101MB0953AE665C73D96D0B1E6BBADD280@YTOPR0101MB0953.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CAJ-VmomHQ+zcJ+HXAjMg9aS1RPZsdHy0tYjdKzjpwrUY+05NiQ@mail.gmail.com>
References:  <EBDE6EDD-D875-43D8-8D65-1F1344A6B817@ixsystems.com> <20180727221843.GZ2884@funkthat.com> <CAJ-VmomHQ+zcJ+HXAjMg9aS1RPZsdHy0tYjdKzjpwrUY+05NiQ@mail.gmail.com>

Adrian Chadd wrote:
>John-Mark Gurney wrote:
[stuff snipped]
>>
>> Drivers need to be fixed to use 4k pages instead of clusters.  I really
>> hope no one is using a card that can't do 4k pages, or if they are, then
>> they should get a real card that can do scatter/gather on 4k pages for
>> jumbo frames.
>
>Yeah, but it's 2018 and your server has, at minimum, a dozen million 4k
>pages.
>
>So if you're doing stuff like lots of network packet kerchunking why not
>have specialised allocator paths that can do things like "hey, always give
>me 64k physical contig pages for storage/mbufs because you know what?
>they're going to be allocated/freed together always."
>
>There was always a race between bus bandwidth, memory bandwidth and
>bus/memory latencies. I'm not currently on the disk/packet pushing side of
>things, but the last couple of times I was, it was at different points in
>that 4d space, and almost every single time there was a benefit from having
>a couple of specialised allocators so you didn't have to try and manage a
>few dozen million 4k pages based on your changing workload.
>
>I enjoy the 4k page size management stuff for my 128MB routers. Your 128G
>server has a lot of 4k pages. It's a bit silly.
Here's my NFS guy perspective.
I do think 9K mbuf clusters should go away. I'll note that I once coded NFS
so it would use 4K mbuf clusters for the big RPCs (write requests and read
replies) and I actually could get the mbuf cluster pool fragmented to the
point it stopped working on a small machine, so it is possible (although
not likely) to fragment even a 2K/4K mix.

For me, send and receive are two very different cases:
- For sending a large NFS RPC (let's say a reply to a 64K read), the NFS
  code will generate a list of 33 2K mbuf clusters. If the net interface
  doesn't do TSO, this is probably fine, since tcp_output() will end up
  busting this up into a bunch of TCP segments using the list of mbuf
  clusters, with TCP/IP headers added for each segment, etc...
  - If the net interface does TSO, this long list goes down to the net
    driver and uses 34->35 ring entries to send it (it typically adds at
    least one segment for the MAC header). If the driver isn't buggy and
    the net chip supports lots of transmit ring entries, this works ok
    but...
 - If there were a 64K supercluster, the NFS code could easily use that for
   the 64K of data and the TSO-enabled net interface would use 2 transmit
   ring entries (one for the MAC/TCP/NFS header and one for the 64K of
   data). If the net interface can't handle a TSO segment over 65535 bytes,
   it will end up getting 2 TSO segments from tcp_output(), but that is
   still a lot fewer than 35.
I don't know enough about net hardware to know when/if this will help
perf., but it seems that it might, at least for some chipsets? (A toy
sketch of the arithmetic follows.)
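
To make the counts concrete, here is a toy userland sketch (illustrative
only, not kernel code) of the ring-entry arithmetic, assuming one transmit
ring entry per mbuf in the chain:

	/* Ring entries for a 64K TSO send: 2K clusters vs. one
	 * hypothetical 64K supercluster. */
	#include <stdio.h>

	#define RPC_LEN	(64 * 1024)	/* 64K of read-reply data */
	#define CL2K	2048		/* standard 2K mbuf cluster */

	int
	main(void)
	{
		/* 32 data clusters + 1 RPC-header mbuf = 33 mbufs in the
		 * chain; the driver adds one more entry for the MAC header. */
		int small = RPC_LEN / CL2K + 1 + 1;
		/* One 64K supercluster + one MAC/TCP/NFS-header mbuf. */
		int big = 1 + 1;

		printf("2K clusters:      ~%d ring entries\n", small); /* ~34 */
		printf("64K supercluster:  %d ring entries\n", big);   /* 2 */
		return (0);
	}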

For receive, it seems that a 64K mbuf cluster is overkill for jumbo
packets, but as others have noted, they won't be allocated for long unless
packets arrive out of order, at least for NFS. (For other apps., the
receiver might not read the socket for a while to get the data, so the
clusters might sit in the socket rcv queue for a while.)

I chose 64K, since that is what most net interfaces can handle for TSO
these days. (If it will soon be larger, I think this should be even larger,
but all of them the same size to avoid fragmentation.) For the send case
for NFS, it wouldn't even need to be a very large pool, since the clusters
get freed as soon as the net interface transmits the TSO segment.
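
For what it's worth, a rough sketch of what such a small, separate pool
might look like, using the existing uma_zcreate()/uma_zone_set_max() KPIs
(the zone name, the 1024-item limit, and the glossing-over of physical
contiguity are all just illustration):

	#include <sys/param.h>
	#include <vm/uma.h>

	#define MSUPERCLBYTES	(64 * 1024)	/* hypothetical supercluster */

	static uma_zone_t zone_supercl;

	static void
	supercl_zone_init(void)
	{
		/* A dedicated zone keeps 64K allocations from fragmenting
		 * the 2K/4K cluster zones. */
		zone_supercl = uma_zcreate("mbuf_supercl", MSUPERCLBYTES,
		    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
		/* Keep the pool small; send-side buffers are freed as soon
		 * as the interface transmits the TSO segment. */
		uma_zone_set_max(zone_supercl, 1024);
	}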

For NFS, it could easily call mget_supercl() and then fall back on the
current code using 2K mbuf clusters if mget_supercl() failed, so a small
pool would be fine for the NFS send side.
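
A minimal sketch of that fallback, assuming the hypothetical mget_supercl()
allocator above plus the existing m_getm2() helper that builds a chain of
standard clusters:

	#include <sys/param.h>
	#include <sys/mbuf.h>

	/* Try the (hypothetical) 64K supercluster pool first; if it is
	 * empty, fall back to the current 2K-cluster chain. */
	static struct mbuf *
	nfs_alloc_sendbuf(int len)
	{
		struct mbuf *m;

		m = mget_supercl(M_NOWAIT);	/* hypothetical allocator */
		if (m != NULL)
			return (m);
		/* Small pool exhausted; existing code path. */
		return (m_getm2(NULL, len, M_WAITOK, MT_DATA, 0));
	}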

I'd like to see a pool for 64K or larger mbuf clusters for the send side.
For the receive side, I'll let others figure out the best solution (4K or
larger for jumbo clusters). I do think anything larger than 4K needs a
separate allocation pool to avoid fragmentation.
(I don't know, but I'd guess iSCSI could use them as well?)
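
On the receive side, the driver fix John-Mark describes would look roughly
like this in a ring-refill path (illustrative only; a real driver also has
to deal with busdma maps, refill batching, etc.):

	#include <sys/param.h>
	#include <sys/mbuf.h>

	/* Refill an rx slot with a page-sized (4K) cluster via the
	 * existing m_getjcl(), instead of a 9K cluster; the chip's
	 * scatter/gather reassembles jumbo frames from multiple 4K
	 * buffers. */
	static struct mbuf *
	rxq_refill(void)
	{
		return (m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR,
		    MJUMPAGESIZE));
	}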

rick