Date:      Sun, 21 Mar 1999 19:07:52 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        dillon@apollo.backplane.com (Matthew Dillon)
Cc:        tlambert@primenet.com, hasty@rah.star-gate.com, wes@softweyr.com, ckempf@enigami.com, wpaul@skynet.ctr.columbia.edu, freebsd-hackers@FreeBSD.ORG
Subject:   Re: Gigabit ethernet -- what am I doing wrong?
Message-ID:  <199903211907.MAA16206@usr06.primenet.com>
In-Reply-To: <199903211835.KAA13904@apollo.backplane.com> from "Matthew Dillon" at Mar 21, 99 10:35:14 am

> :You mean "most recent network cards".  Modern network cards have memory
> :that can be DMA'ed into by other modern network cards.
> :
> :Moral:  being of later manufacture makes you more recent, but being
> :    capable of higher data rates is what makes you modern.
> 
>     It's a nice idea, but there are lots of problems with card-to-card
>     DMA.  If you have only two network ports in your system (note: I said
>     ports, not cards), I suppose you could get away with it.  Otherwise you
>     need something significantly more sophisticated.
> 
>     The problem is that you hit one of the most common situations that occurred
>     in early routers:  DMA blockages to one destination screwing over others.
> 
>     For example, say you have four network ports and you are receiving packets
>     which must be distributed to the other ports.  Let's say network port #1
>     receives packets A, B, C, D, and E.  Packet A must be distributed to
>     port #2, and packets B-D must be distributed to port #3, and packet E
>     must be distributed to port #4.
> 
>     What happens when the DMA to port #2 blocks due to a temporary overcommit
>     of packets being sent to port #2?  Or due to a collision/retry situation
>     occurring on port #1?  What happens is that the packets B-E stick around
>     in port #1's input queue and don't get sent to ports 3 and 4 even if
>     ports 3 and 4 are idle.

My argument would be that you should be looking at this as an IP-switch
level thing, on the order of a Kalpana EtherSwitch, or whatever the
modern Gigabit Ethernet equivalent manufacturer par excellence du jour
happens to be.

Personally, I have no problem whatsoever with doing the MAE-West trick
of invoking the "leaky bucket" algorithm.  It was most noticeable when
they bought only one, not two, gigaswitches, and failed to dedicate 50%
of the ports to communication with other gigaswitches, which is what
the Fibonacci sequence would suggest is the correct mechanism for node
expansion of that type of topology.
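
In case "leaky bucket" is unfamiliar, here is a minimal sketch of the
idea in C; the rates, the bucket depth, and the names are purely
illustrative, not taken from any real switch or driver:

/* A minimal leaky-bucket sketch; nothing here is from real hardware. */
#include <stddef.h>
#include <sys/time.h>

struct leaky_bucket {
    double drain_rate;          /* bytes drained per second */
    double depth;               /* maximum bytes the bucket may hold */
    double level;               /* current fill level, in bytes */
    struct timeval last;        /* time of the last update */
};

/* Returns 1 if the packet may be forwarded, 0 if it must be dropped. */
static int
lb_admit(struct leaky_bucket *lb, size_t pktlen, const struct timeval *now)
{
    double elapsed = (now->tv_sec - lb->last.tv_sec) +
        (now->tv_usec - lb->last.tv_usec) / 1e6;

    /* Drain the bucket for the time that has passed. */
    lb->level -= elapsed * lb->drain_rate;
    if (lb->level < 0)
        lb->level = 0;
    lb->last = *now;

    /* If the packet would overflow the bucket, drop it and let the
     * origin retransmit; otherwise account for it and pass it on.
     */
    if (lb->level + (double)pktlen > lb->depth)
        return (0);
    lb->level += (double)pktlen;
    return (1);
}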

>     Even worse, what happens to poor packet E which can't be sent to port 4
>     until all the mess from packets A-D is dealt with?  Major latency occurs
>     at best, packet loss occurs at worst.

The way to deal with the "mess" is to allow the origin to retransmit,
instead of taking it upon yourself to ensure reliable end-to-end
delivery of IP datagrams.  Why do it for IP if you aren't
going to do it for UDP/IP?

If you really have a problem with the idea that packets which collide
should be dropped, well, then stop thinking IP and start thinking ATM
instead (personally, I hate ATM, but if the tool fits...).
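
Concretely, "drop it and let the origin retransmit" is nothing more
than a bounded per-port output queue with tail drop.  A sketch, with
made-up structure names and an arbitrary queue bound:

/* Bounded per-port output queue with tail drop; a sketch only. */
#include <stddef.h>

#define OUTQ_MAXLEN 256                 /* arbitrary illustrative bound */

struct pkt {
    struct pkt *next;
    /* ... packet data would live here ... */
};

struct outq {
    struct pkt *head;
    struct pkt *tail;
    int len;
    unsigned long drops;                /* packets we refused to hold */
};

/*
 * Returns 1 if queued, 0 if dropped.  The input path is never blocked
 * waiting on a congested output port; recovery is the sender's problem
 * (TCP, or the application sitting on top of UDP).
 */
static int
outq_enqueue(struct outq *q, struct pkt *p)
{
    if (q->len >= OUTQ_MAXLEN) {
        q->drops++;
        return (0);                     /* caller frees the packet */
    }
    p->next = NULL;
    if (q->tail != NULL)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->len++;
    return (1);
}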


>     For each port in your system, you need a *minimum* per-port buffer size
>     to handle the maximum latency you wish to allow times the number of ports
>     in the router.  If you have four 1-Gigabit ports and wish to allow latencies
>     of up to 20 ms, each port would require 8 MBytes of buffer space and you
>     *still* don't solve the problem that occurs if one port backs up, short
>     of throwing away the packets destined to other ports even if the other
>     ports are idle.

Yes, that's true.  This is the "minimum pool retention time" problem,
which is the maximum allowable latency before a discard occurs.
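
For what it's worth, the arithmetic behind the 8 MByte figure is easy
to reproduce (ballpark only, ignoring framing overhead; the numbers are
the ones from the example above, nothing measured):

#include <stdio.h>

int
main(void)
{
    double line_rate = 1e9 / 8;         /* 1 Gbit/s, in bytes per second */
    double latency = 0.020;             /* 20 ms of allowed latency */
    int ports = 4;                      /* ports in the hypothetical router */

    double per_source = line_rate * latency;        /* ~2.5 MB */
    double worst_case = per_source * (ports - 1);   /* ~7.5 MB */

    printf("buffer per contending source: %.1f MB\n", per_source / 1e6);
    printf("worst case for one port:      %.1f MB\n", worst_case / 1e6);
    return (0);
}

Three 1 Gbit/s sources held for 20 ms is about 7.5 MBytes, which is
roughly the 8 MBytes quoted once you round up.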


>     Backups can also introduce additional latencies that are not the fault 
>     of the destination port.

It doesn't matter whose fault it is.


>     DEC Gigaswitch switches suffered from exactly this problem -- MAE-WEST
>     had serious problems for several years, in fact, due to overcommits on
>     a single port out of dozens.

See above.  This was a management-budget-driven topology error, having
really nothing at all to do with the capabilities of the hardware.  It
was the fact that in a topology having more than one gigaswitch, each
gigaswitch must dedicate 50% of its capability to talking to the other
gigaswitches.  This means you must increase the number of switches in
a Fibonacci sequence, not a linear additive sequence.

Consider: if I have one gigaswitch with all of its ports in use, and I
dedicate 50% of its ports to inter-switch communication and then add
only one other switch where I do the same, then I end up with exactly
the same number of usable ports as before.

The MAE-West problem was that (much!) fewer than 50% of the ports on
the two switches were dedicated to interconnect.
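
To put numbers on the two-switch case (P here is just an arbitrary
per-switch port count, not any particular gigaswitch model):

#include <stdio.h>

int
main(void)
{
    int P = 36;                         /* ports per switch; illustrative */
    int one_switch = P;                 /* every port faces a host */
    int two_switches = 2 * (P / 2);     /* half of each switch is interconnect */

    printf("one switch:   %d host ports\n", one_switch);
    printf("two switches: %d host ports\n", two_switches);  /* the same number */
    return (0);
}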

For topologies of 5 or more switches, the number of ports which must
be dedicated to direct interconnect goes up again... one progression
is Fibonacci, the other geometric; you have diminishing returns for
the next X switches in the progression, but the returns are always
positive; you merely have to pay progressively more for each node in
the progression.

The solution to this is to have geographically separate clusters
of the things, and ensure that the interlinks between them are
significantly faster than their interlinks to equipment that isn't
them.


>     There are solutions to this sort of problem, but all such solutions
>     require truly significant on-card buffer memory... 8 MBytes minimum
>     with my example above.  In order to handle card-to-card DMA, cards
>     must be able to handle sophisticated DMA scheduling to prevent
>     blockages from interfering with other cards.

With respect, I think that you are barking up the wrong tree.  If the
latency is such that the pool retention time becomes infinite for any
packet, then you are screwed, blued, and tattooed.  You *must* never
fill the pool faster than you can drain it... period.  The retention
time is dictated not by the traffic rate itself, but by the relative
rate between the traffic and the rate at which you can make delivery
decisions based on traffic content.

Increasing the pool size can only compensate for a slow decision process,
not resource overcommit by a budget-happy management.  The laws of physics
and mathematics bend to the will of no middle-manager.
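
A toy way to see why a bigger pool only buys time (all of the rates
below are invented for illustration):

#include <stdio.h>

int
main(void)
{
    double arrival_pps = 1500000.0;     /* packets arriving per second */
    double decision_pps = 1400000.0;    /* forwarding decisions per second */
    double pool_pkts = 65536.0;         /* pool size, in packets */

    if (arrival_pps <= decision_pps) {
        printf("pool never fills; retention time stays bounded\n");
    } else {
        /*
         * The backlog grows at (arrival - decision) packets per second,
         * so the pool overflows after pool / (arrival - decision)
         * seconds; growing the pool only delays the overflow.
         */
        double t = pool_pkts / (arrival_pps - decision_pps);
        printf("pool overflows after %.2f seconds, regardless of size\n", t);
    }
    return (0);
}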


>     Industrial strength routers that implement cross bars or other high 
>     speed switch matrices have to solve the ripple-effect-blockage problem.
>     It is not a trivial problem to solve.  It *IS* a necessary problem to
>     solve since direct card-card transfers are much more efficient than
>     transfers to common shared-memory stores.

It seems to me that this is a wholly artificial problem based upon an
unreasonable desire to overcommit resources, said desire resulting in
an asymmetry in the routing capability between supposed peers, such
that the interaction of two peers can negatively impact a third.

I blame this on the phone company's historical use of circuit switching
technologies, and the baggage that comes with that mindset.

Asymmetry is bad.  All of the memory in the world would not have fixed
the MAE-West topology problem (well, OK, but only if they got the
memory from the machines that were trying to send packets through
MAE-West, such that they were incapable of generating traffic for
lack of the memory needed to boot and run ;-)).


>     It is *NOT* a problem that PC architectures can deal with well,
>     though.

On this, we heartily agree!

>     It is definitely *NOT* a problem that PCI cards are usually able
>     to deal with due to the lack of DMA channels and the lack of a
>     system-wide scheduling protocol.

I still think that it's very interesting to ask "what's the absolute,
total balls-to-the-wall best that such limited hardware can do?", not
to mention fun.  8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.





