Date:      Thu, 20 Mar 2014 23:14:18 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Alexander Motin <mav@FreeBSD.org>
Cc:        FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject:   Re: review/test: NFS patch to use pagesize mbuf clusters
Message-ID:  <2106150833.655954.1395371658421.JavaMail.root@uoguelph.ca>
In-Reply-To: <532947C9.9010607@FreeBSD.org>

Alexander Motin wrote:
> On 19.03.2014 01:57, Rick Macklem wrote:
> > Alexander Motin wrote:
> >> I ran several profiles on an em NIC with and without the patch. I can
> >> confirm that without the patch m_defrag() is indeed called, while with
> >> the patch it is not any more. But the profiler shows that only a very
> >> small amount of time (a few percent, or even fractions of a percent)
> >> is spent there. I can't measure the effect (my Core-i7 desktop test
> >> system has only about 5% CPU load while serving full 1Gbps NFS over
> >> the em), though I can't say for sure that the effect isn't there on
> >> some low-end system.
> >>
> > Well, since m_defrag() creates a new list and bcopy()s the data, there
> > is some overhead, although I'm not surprised it isn't that easy to
> > measure. (I thought your server built entirely of SSDs might show a
> > difference.)
> 
> I did my test even from TMPFS, not SSD, but the mentioned em NIC is
> only 1Gbps, which is too slow to reasonably load the system.
> 
> > I am more concerned with the possibility of m_defrag() failing and the
> > driver dropping the reply, forcing the client to do a fresh TCP
> > connection and a retry of the RPC after a long timeout (1 minute or
> > more). This will show up as "terrible performance" for users.
> >
> > Also, some drivers use m_collapse() instead of m_defrag() and these
> > will probably be "train wrecks". I get cases where reports of serious
> > NFS problems get "fixed" by disabling TSO, and I was hoping this would
> > work around that.
> 
> Yes, I accept that argument. I don't see much reason to cut continuous
> data into small chunks.
> 
> >> I am also not very sure about replacing M_WAITOK with M_NOWAIT.
> >> Instead of waiting a bit while the VM finds a cluster, NFSMCLGET()
> >> will return a single mbuf; as a result, the chain of 2K clusters gets
> >> replaced with a chain of 256-byte mbufs instead of 4K clusters.
> >>
> > I hoped the comment in the patch would explain this.
> >
> > When I was testing (on a small i386 system), I succeeded in getting
> > threads stuck sleeping on "btalloc" a couple of times when I used
> > M_WAITOK for m_getjcl(). As far as I could see, this indicated that
> > it had run out of kernel address space, but I'm not sure.
> > --> That is why I used M_NOWAIT for m_getjcl().
> >
> > As for using MCLGET(..M_NOWAIT), the main reason for doing that was
> > that I noticed the code does a drain on zone_mcluster if this
> > allocation attempt for a cluster fails. For some reason, m_getcl()
> > and m_getjcl() do not do this drain of the zone?
> > I thought the drain might help memory-constrained cases.
> > To be honest, I've never been able to get a MCLGET(..M_NOWAIT)
> > to fail during testing.
> 
> If that is true, I think it should be handled inside the allocation
> code, not worked around here. Passing M_NOWAIT means that you agree to
> get NULL back, but IMO you don't really want to cut 64K of data into
> ~200 byte pieces in any case, even if the system is in a low-memory
> condition, since at least most NICs won't be able to send it without
> defragging, which will also be problematic in the low-memory case.
> 
Yep. It looks like calling m_getjcl(..M_NOWAIT..) is worse than
m_getjcl(..M_WAITOK..). Using M_NOWAIT does avoid getting stuck
looping and sleeping on "btalloc", however...
I thought it would result in m_getjcl() returning NULL. What actually
happens is that it now loops in the "R" state. Unfortunately, before I
could get much info on it, the machine wedged pretty badly.

I'm now trying to make it happen again so I can poke at it some more,
but it seems that this needs to be resolved before the patch can
go into head.
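
(For reference, the allocation strategy being discussed looks roughly
like the sketch below. This is only an illustration, not the actual
patch; the function name is made up and the exact fallback order is my
own guess, though the MCLGET() drain behaviour is as described above.)

/*
 * Sketch: try for a page-size (MJUMPAGESIZE) cluster without
 * sleeping, then fall back to an ordinary mbuf plus a 2K cluster.
 * If MCLGET() also fails, the caller is left with a plain mbuf
 * and only its small internal data area, as Alexander noted.
 */
#include <sys/param.h>
#include <sys/mbuf.h>

static struct mbuf *
nfsm_getpgcl(void)
{
	struct mbuf *m;

	/* Page-size cluster, no sleeping (avoids the "btalloc" hang). */
	m = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
	if (m != NULL)
		return (m);

	/* Fall back to a plain mbuf plus a 2K cluster. */
	m = m_get(M_NOWAIT, MT_DATA);
	if (m == NULL)
		return (NULL);
	/* MCLGET() drains zone_mcluster if the cluster allocation fails. */
	MCLGET(m, M_NOWAIT);
	return (m);
}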

As a complete aside, it looks like the loop in tcp_output() may be
broken and generating TSO segments > 65535 bytes, which might explain
the headaches w.r.t. TSO-enabled interfaces. (See the ixgbe thread
over on freebsd-net@.)
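
(To illustrate, below is a rough sketch of the kind of clamp I would
expect somewhere in the tcp_output() loop. The variable names follow
tcp_output(), but this is only a guess at the shape of a fix, not an
actual patch.)

	/*
	 * Sketch: when doing TSO, never hand the driver a segment
	 * whose payload plus headers exceeds IP_MAXPACKET (65535),
	 * since the length fields can't represent more than that.
	 */
	if (tso && len + hdrlen > IP_MAXPACKET) {
		len = IP_MAXPACKET - hdrlen;
		sendalot = 1;	/* loop again for the remainder */
	}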

I'll post if/when I have more on how UMA(9) behaves when the
boundary tag zone can't seem to do an allocation. (It seems it
results in an allocation request for the mbuf page cluster zone
looping instead of returning NULL, but I'm not sure yet.)

Anyone familiar with UMA(9) and what the boundary tags are for,
feel free to jump in here and explain it, because I don't know
diddly about it at this point.

Thanks for testing it and stay tuned, rick

> --
> Alexander Motin
> 


