Date:      Sat, 18 Dec 2010 14:48:09 -0800
From:      Pyun YongHyeon <pyunyh@gmail.com>
To:        abcde abcde <abcde0@yahoo.com>
Cc:        freebsd-net@freebsd.org
Subject:   Re: nfe_defrag() routine in nividia ethernet driver
Message-ID:  <20101218224809.GA22768@michelle.cdnetworks.com>
In-Reply-To: <808782.86181.qm@web53807.mail.re2.yahoo.com>
References:  <808782.86181.qm@web53807.mail.re2.yahoo.com>

On Thu, Dec 16, 2010 at 07:53:16PM -0800, abcde abcde wrote:
> Hi, we ported the nvidia ethernet driver to our product. It had been OK until
> recently, when we ran into an error condition where packets would get dropped quietly.
> The root cause resides in the nfe_encap() routine, where we call nfe_defrag() to
> try to reduce the length of the mbuf chain to 32 if it's longer than 32. In the
> event the 32 mbufs need more than 32 segments, the subsequent call to
> bus_dmamap_load_mbuf_sg() returns an error and the packet is
> subsequently dropped.
> 
> 
> My questions are,
> 
> 1. there appears to be a generic m_defrag() routine available, which doesn't 
> stop at 32 and is used by a couple of other drivers (Intel, Broadcom, to name a 
> few). What was the need for an nvidia-specific version of the defrag routine?
> 

As John said, m_defrag(9) is an expensive operation. Since all nfe(4)
controllers support multiple TX buffers, use m_collapse(9) instead.
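
For reference, here is a minimal sketch of that pattern (my own sketch,
not the exact nfe(4) source; "tag", "map", encap_sketch() and the error
handling are placeholders for the driver's own DMA tag, map and cleanup):
try the DMA mapping first and only collapse the chain when the mapping
fails with EFBIG.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

#define NFE_MAX_SCATTER	32	/* controller TX segment limit */

static int
encap_sketch(bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **m_head)
{
	bus_dma_segment_t segs[NFE_MAX_SCATTER];
	struct mbuf *m, *n;
	int error, nsegs;

	m = *m_head;
	error = bus_dmamap_load_mbuf_sg(tag, map, m, segs, &nsegs,
	    BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/*
		 * m_collapse(9) only coalesces adjacent mbufs until the
		 * chain fits in NFE_MAX_SCATTER segments; m_defrag(9)
		 * would copy the whole packet into fresh clusters.
		 */
		n = m_collapse(m, M_NOWAIT, NFE_MAX_SCATTER);
		if (n == NULL) {
			m_freem(m);
			*m_head = NULL;
			return (ENOBUFS);
		}
		*m_head = m = n;
		error = bus_dmamap_load_mbuf_sg(tag, map, m, segs, &nsegs,
		    BUS_DMA_NOWAIT);
	}
	if (error != 0) {
		m_freem(m);
		*m_head = NULL;
		return (error);
	}
	/* ... fill TX descriptors from segs[0..nsegs - 1] ... */
	return (0);
}

This way the copy cost is only paid for the rare over-long chains
instead of on every transmit.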

> 2. The NFE_MAX_SCATTER constant, which limits how many segments can be used, is 
> defined to be 32, while the corresponding constants for other drivers are 100 or 
> 64 (again Intel or Broadcom). How was the value 32 picked? Does anybody know the
> reasoning behind it?
> 

I think all nfe(4) controllers have no limitation on the number of
segments that can be used. However, most ethernet controllers targeted
at non-server systems are not good at supporting multiple outstanding
DMA read operations on the PCIe bus. Even when a controller supports
multiple DMA read operations, it takes more time to fetch a TX frame
that is split into a long mbuf chain than a short or single contiguous
TX frame. The CPU is much faster than the controller's DMA engine.
The magic number 32 was chosen to balance performance and resource
usage. 32 should be large enough for TSO to send a full 64KB TCP
segment. If the controller had no TSO capability I would have used 16.
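
As a rough sanity check (my arithmetic, assuming the TSO payload is
handed down in standard 2KB mbuf clusters, i.e. MCLBYTES):

    64KB maximum TSO payload / 2KB per cluster = 65536 / 2048 = 32 segments

so 32 is about the smallest limit that lets a maximal TSO request be
mapped without having to collapse the chain first.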


