Date:      Thu, 17 Apr 2008 09:43:57 -0400
From:      "Alexander Sack" <pisymbol@gmail.com>
To:        "Bruce Evans" <brde@optusnet.com.au>
Cc:        freebsd-net@freebsd.org, Dieter <freebsd@sopwith.solgatos.com>, Jung-uk Kim <jkim@freebsd.org>
Subject:   Re: bge dropping packets issue
Message-ID:  <3c0b01820804170643w6b771ce9jdfc2dc5b240922b@mail.gmail.com>
In-Reply-To: <20080417112329.G47027@delplex.bde.org>
References:  <3c0b01820804160929i76cc04fdy975929e2a04c0368@mail.gmail.com> <200804161456.20823.jkim@FreeBSD.org> <3c0b01820804161328m77704ca0g43077a9718d446d4@mail.gmail.com> <200804161654.22452.jkim@FreeBSD.org> <3c0b01820804161402u3aac4425n41172294ad33a667@mail.gmail.com> <20080417112329.G47027@delplex.bde.org>

First off, thanks for the detailed reply, Bruce.  I have some follow-up
questions in my quest to learn more about BGE, networking, etc.

On Wed, Apr 16, 2008 at 10:53 PM, Bruce Evans <brde@optusnet.com.au> wrote:
> On Wed, 16 Apr 2008, Alexander Sack wrote:
> > On Wed, Apr 16, 2008 at 4:54 PM, Jung-uk Kim <jkim@freebsd.org> wrote:
>
>  First stop using the DEVICE_POLLING packet lossage service...
>

For my own edification, when would you want to use DEVICE_POLLING versus
interrupt-driven network I/O?  With all questions like these I suppose
the answer depends on the workload and the interrupt bandwidth of the
machine (which depends on the type of hardware)...

But why was it added to begin with if standard interrupt-driven I/O is
faster?  (Was it because, historically, hardware didn't do interrupt
coalescing?)

This was my next step, so it sounds like I'm headed in the right direction.

> > You are correct, I got the names confused (this problem really stinks)!
> >
> > However, my point still stands:
> >
> > #define TG3_RX_RCB_RING_SIZE(tp) \
> >     ((tp->tg3_flags2 & TG3_FLG2_5705_PLUS) ? 512 : 1024)
> >
> > Even the Linux driver uses higher number of RX descriptors than
> > FreeBSD's static 256.  I think minimally making this tunable is a fair
> > approach.
> >
> > If not, no biggie, but I think its worth it.
> >
>
>  I use a fixed value of 512 (jkim gave a pointer to old mail containing
>  a fairly up to date version of my patches for this and more important
>  things).  This should only make a difference with DEVICE_POLLING.

Then, at a minimum, 512 could be used when DEVICE_POLLING is enabled, and
my point still stands.  Though in light of the other statistics you cited,
I understand now that this may not make that big an impact.
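For the sake of discussion, here is roughly what I had in mind -- just a
sketch, not a patch: the tunable name hw.bge.rxd, the bge_rxd_count
variable, and the clamping helper below are all invented, and the 512
ceiling is the standard-ring limit from the data sheet that you mention
further down.

/*
 * Sketch only: a hypothetical loader tunable for the std RX ring size.
 * "hw.bge.rxd" and bge_rxd_count are made-up names; the stock driver
 * uses a compile-time constant of 256.
 */
#include <sys/param.h>
#include <sys/kernel.h>

static int bge_rxd_count = 256;		/* current FreeBSD default */
TUNABLE_INT("hw.bge.rxd", &bge_rxd_count);

static int
bge_rxd_ring_size(void)
{
	int n = bge_rxd_count;

	if (n < 32)
		n = 32;
	else if (n > 512)
		n = 512;	/* data-sheet limit for the std RX ring */
	return (n);
}

A box running DEVICE_POLLING could then just set hw.bge.rxd=512 in
loader.conf.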

>  Without DEVICE_POLLING, device interrupts normally occur every 150 usec

Is that the coal ticks value you are referring to?  Sorry, this is my
first time looking at this driver!

>  or even more frequently (too frequently if the average is much lower
>  than 150 usec), so 512 descriptors is more than enough for 1Gbps ethernet
>  (the minimum possible inter-descriptor time for tiny packets is about 0.6
>  usec,

How do you measure this number?

I'm assuming that when you say "inter-descriptor time" you mean the time
it takes the card to fill an RX descriptor on receipt of a packet (really
the firmware latency?).

>  so the minimum number of descriptors is 150 / 0.6 = 250.  The
>  minimum possible inter-descriptor time is rarely achieved, so 250 is
>  enough even with some additional latency.  See below for more numbers).
>  With DEVICE_POLLING at HZ = 1000, the corresponding minimum number of
>  descriptors is 1000 / 0.6 = 1667.  No tuning can give this number.  You
>  can increase HZ to 4000 and then the fixed value of 512 is large enough.
>  However, HZ = 4000 is wasteful, and might not actually deliver 4000 polls
>  per second -- missed polls are more likely at higher frequencies.

Understood, makes sense.
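To convince myself of the arithmetic, I wrote it out as a throwaway
userland program (0.608 usec being the exact minimum tiny-frame time
behind the ~0.6 usec figure you quoted):

/*
 * Back-of-the-envelope check of the ring-size arithmetic above:
 * descriptors needed = worst-case service interval / minimum time per
 * tiny frame.  0.608 usec = 64-byte frame + inter-frame gap at 1Gbps.
 */
#include <stdio.h>

int
main(void)
{
	const double min_pkt_us = 0.608;

	printf("interrupts every 150 usec: %4.0f descriptors\n",
	    150.0 / min_pkt_us);
	printf("polling at HZ=1000:        %4.0f descriptors\n",
	    1000.0 / min_pkt_us);
	printf("polling at HZ=4000:        %4.0f descriptors\n",
	    (1000000.0 / 4000.0) / min_pkt_us);
	return (0);
}

That prints roughly 247, 1645, and 411, which lines up with your 250 /
1667 / "512 is enough at HZ=4000" figures.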

>  For timeouts instead of device polls, at least on old systems it was
>  quite common for timeouts at a frequency of HZ not actually being
>  delivered, even when HZ was only 100, because some timeouts run for
>  too long (several msec each, possibly combining to > 10 msec occasionally).
>  Device polls are at a lower level, so they have a better chance of
>  actually keeping up with HZ.  Now the main source of timeouts that run
>  for too long is probably mii tick routines.  These won't combine, at
>  least for MPSAFE drivers, but they will block both interrupts and
>  device polls for their own device.  So the rx ring size needs to be
>  large enough to cover max(150 usec or whatever interrupt moderation time,
>  mii tick time) of latency plus any other latencies due to interrupt
>  handling or polling of other devices.  Latencies due to interrupts on
>  other devices are only certain to be significant if the other devices
>  have higher or the same priority.

You described what I'm seeing.  Couple this with the fact that the
driver uses one mtx for everything, and it doesn't help.  I'm pretty sure
I'm running into RX descriptor starvation despite the fact that,
statistically speaking, 256 descriptors is enough for 1Gbps (I'm
talking 100MBps and the firmware is dropping packets).  The problem gets
worse if I add some kind of I/O workload on the system (my load
happens to be a gzip of a large log file in /tmp).

I noticed that if I put ANY kind of debugging messages in bge_tick()
the drop gets much worse (for example, just printing out the number of
dropped packets read from bge_stats_update() when a drop occurs causes
EVEN more drops to occur; if I had to guess, the printf just uses up
more cycles, which delays the drain of the RX chain and causes a longer
time to recover -- this is a constant stream from a traffic generator).
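(If I do end up keeping some diagnostic there, I'll probably rate-limit
it instead of printing on every drop -- something like the fragment
below, where the counter fields on the softc are my own invention:)

	/*
	 * Sketch: limit the drop diagnostic in bge_tick() to one line
	 * per second via ppsratecheck(), so the printf itself doesn't
	 * hold things up.  bge_drop_lasttime and bge_drop_curpps are
	 * hypothetical fields added to the softc; sc->bge_dev is
	 * assumed to be the softc's device_t.
	 */
	if (new_drops != 0 &&
	    ppsratecheck(&sc->bge_drop_lasttime, &sc->bge_drop_curpps, 1))
		printf("%s: %lu packets dropped by firmware\n",
		    device_get_nameunit(sc->bge_dev),
		    (unsigned long)new_drops);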

>  I don't understand how 1024 can be useful for 5705_PLUS.  PLUS really
>  means MINUS -- at least my plain 5705 device has about 1/4 of the
>  performance of my older 5701 device.  The data sheet gives a limit of
>  512 normal rx descriptors for both.  Non-MINUS devices also support
>  jumbo buffers.  These are in a separate ring and not affected by the
>  limit of 256 in FreeBSD.

Well, it seems that the 5714, 5715, and 5780, when configured to use
jumbo frames and extended buffer descriptors, fall back to 256, so it
seems 256 is generally a safe number for all cards (but I still submit
that the fact that you yourself bump it to 512 for standard frames means
this should be optionally tunable).

>  Some numbers for [1 Gbps] ethernet:
>
>  minimum frame size = 64 bytes =    512 bits
>  minimum inter-frame gap =           96 bits
>  minimum total frame time =         608 nsec (may be off by 64)
>  bge descriptors per tiny frame   = 1 (1 for mbuf)
>  buffering provided by 256 descriptors = 256 * 608 = 155.648 usec (marginal)

So as I read this, it takes ~155 usec to fill up the entire RX chain of
rx_bd's if it's just small packets, correct?

>  normal frame size = 1518 bytes = 12144 bits
>  normal total frame time =        12240 nsec
>  bge descriptors per normal frame = 2 (1 for mbuf and 1 for mbuf cluster)
>  buffering provided by 256 descriptors = 256/2 * 12240 = 1566.720 usec (plenty)

Is this again based on your own instrumentation from the last patch?
(Just curious; I believe you, I just wanted to know if this was an
artifact of you doing some tuning research or something else.)
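(Working your table from the other direction, again just as a sanity
check on my reading of it:)

/* How long 256 RX descriptors last at 1Gbps line rate. */
#include <stdio.h>

int
main(void)
{
	/* tiny frames: 608 ns each, 1 descriptor per frame */
	printf("tiny frames:   %8.3f usec\n", 256 * 608 / 1000.0);
	/* normal frames: 12240 ns each, 2 descriptors per frame */
	printf("normal frames: %8.3f usec\n", 256 / 2 * 12240 / 1000.0);
	return (0);
}

That gives 155.648 usec and 1566.720 usec, matching your numbers.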

>  The only tuning that I've found useful since writing my old mail (with
>  the patches) is:
>
>     sysctl kern.random.sys.harvest.ethernet=0
>     sysctl kern.random.sys.harvest.interrupt=0
>     sysctl net.inet.ip.intr_queue_maxlen=1024
>
>  Killing the entropy harvester isn't very important, and IIRC the default
>  intr_queue_maxlen of 50 is only a problem with net.isr.direct=0 (not the
>  default in -current).  With an rx ring size of just 256 and small packets
>  so that there is only 1 descriptor per packet, a single bge interrupt or
>  poll can produce 5 times as many packets as will fit in the default
>  intr_queue.  IIRC, net.isr.direct=1 has the side effect of completely
>  bypassing the intr_queue_maxlen limit, so floods cause extra damage before
>  they are detected.  However, the old limit of 50 is nonsense for all
>  not so old NICs.  Another problem with the intr_queue_maxlen limit is that
>  packets dropped because of it don't show up in any networking statistics
>  utility.  They only show up in sysctl output and cannot be associated with
>  individual interfaces.

Interesting, though the packet loss I'm describing is indeed reported
by the card and not drops due to hitting intr_queue_maxlen (honestly I
haven't looked at this aspect of the system yet, so I can't comment).
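(When I do get to it, I'll probably start by watching
net.inet.ip.intr_queue_drops, which, if I remember right, is where those
silently dropped packets end up being counted.)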

Bruce, again thank you for your in depth email.

So the million dollar question:  Do you believe that if I disable
DEVICE_POLLING and use interrupt-driven I/O, I could achieve zero
packet loss over a 1Gbps link?  This is the main issue I need to solve
(where "solve" means either no, it's not really achievable without a
heavy rewrite of the driver, OR yes, it is with some tuning).  If the
answer is yes, then I have to understand the impact on the system in
general.  I just want to be sure I'm on a viable path through the BGE
maze!  :D

Thanks!

-aps


