Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 22 Mar 1999 05:45:54 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        dillon@apollo.backplane.com (Matthew Dillon)
Cc:        tlambert@primenet.com, hasty@rah.star-gate.com, wes@softweyr.com, ckempf@enigami.com, wpaul@skynet.ctr.columbia.edu, freebsd-hackers@FreeBSD.ORG
Subject:   Re: Gigabit ethernet -- what am I doing wrong?
Message-ID:  <199903220545.WAA10719@usr01.primenet.com>
In-Reply-To: <199903211951.LAA14338@apollo.backplane.com> from "Matthew Dillon" at Mar 21, 99 11:51:28 am

next in thread | previous in thread | raw e-mail | index | archive | help
> :If you really have a problem with the idea that packets which collide
> :should be dropped, well, then stop thinking IP and start thinking ATM
> :instead (personnaly, I hate ATM, but if the tool fits...).
> 
>     Terry, this may sound great on paper, but nobody in their right mind
>     drops a packet inside a router unless they absolutely have to.  This
>     lesson was learned long ago, and has only become more important now as 
>     the number of hops increase.   There is no simple solution that doesn't
>     have terrible boundry conditions on the load curve.

I think running out of buffers such that a single VC can implement
implicit denial of service attacks against another VC counts as an
"absolutely has to" situation.

This is really no different than th SYN-flooding soloutions.


> :Yes, that's true.  This is the "minumum pool retention time" problem,
> :which is the maximum allowable latency before a discard occurs.
> 
>     No, it's worse then that.  Supplying sufficient buffer space only
>     partially solves the problem.  If the buffer space is not properly
>     scheduled, an overload on one port or even just the statistical
>     possibility of packets being ordered to different destinations
>     badly can result in a multiplication of the latency to other ports.

This is incorporated into the retention time calculation.  Scheduling
issues are complex, agreed, but real-world soloutions, like dropping
oldest packets first (on the theory that they will be retransmitted
by the sender first -- and may already have been) frankly tend to have
reasonable effects.

What we're talking about here is overloading the equipment, and then
having it fail in such a way that everyone who is loading it takes
the hit "fairly".


>     Adding even more buffer space to try to brute-force a solution
>     doesn't work well

I'd go farther:  I say it doesn't work at all.  The idea that by adding
a larger buffer (the better to increase your unhandlable backlog?) flies
in the face of queueing theory.


> :See above.  This was a management budget driven topology error, having
> :really nothing at all to do with the capabilities of the hardware.  It
> :was the fact that in a topology having more than one gigaswitch, each
> :gigaswitch must dedicate 50% of its capability to talking to the other
> :gigaswitches.  This means you must increase the number of switches in
> :a Fibbonacci sequence, not a linear additive sequence.
> 
>     It had nothing to do with any of that.  I *LIVED* the crap at MAE-WEST 
>     because BEST had a T3 there and we had to deal with it every day, even
>     though we were only using 5 MBits out of our 45 MBit T3.  The problem
>     was that a number of small 'backbones' were selling transit bandwidth
>     at MAE-WEST and overcommitting their bandwidth.  The moment any one of
>     their ports exceeded 45 MBits, the Gigaswitch went poof due to
>     head-of-queue blocking, a combination of software on the gigaswitch
>     and the way the gigabit switch's hardware queues packets for transfer
>     between cards. 

Sounds like they failed to implement QOS mechanisms and source quench
properly.  My general response to technology failures is that there is
a responsible human, somewhere.  I know that they had two gigaswitches
at one point in time, and it's obvious from a technical point of view
that two gigaswitches are worse than one gigaswitch.


> :The MAE-West problem was that (much!) fewer than 50% of the ports on
> :the two switches were dedicated to interconnect.
> 
>     The problem at MAE-WEST had absolutely nothing to do with this.  The
>     problem occured *inside* a *single* switch.  If one port overloaded on
>     that switch, all the ports started having problems due to head-of-line
>     blocking.

Look.  You can only shove as many bits down a pipe as the pipe will
take.  If it's one port that's killing you, then you start dropping
packets to and from that port, and punish the port.

While there were humans engaged in overcommit involved, I really have
a hard time understanding a design that would allow humans doing what
humans would obvious do, given the circumstances, to cause problems.

If the thing can't handle N/2 ports running at some speed X on each
port, then the ports shouldn't be run at speed X.



>     BEST had to pull peering with a number of overcommitted nodes for 
>     precisely this reason, but that didn't help nodes that were talking to
>     us who *were* peering with the overcommitted nodes.   The moment any of
>     these unloaded nodes tried to send a packet to an overcommitted node,
>     it blocked the head of the queue and created massive latencies for 
>     packets destined to other unloaded nodes.  Eventually enough pressure
>     was placed on the idiot backbones to clean up their act, but it took 2+
>     years for them to get to that point.

Ugh.  What a bad design that could allow such things to happen.  Better
to drop the packet to the overcommitted node, and deal with the traffic
that doesn't involve the overcommit.  To hell with the overcommitted
node.


>     The solution to this at MAE-WEST was to clamp down on the idiots who
>     were selling transit at MAE-WEST and overcommitting their ports, plus
>     numerous software upgrades none of which really solved the problem
>     completely.

With respect, technology should operate in the absence of human
imposition of policy.  It should have been technically impossible
for the idiots to successfully engage in the behaviour in the first
place, and if it wasn't, then that's a design problem with the
gigaswitches.



>     'you must never fill the pool faster then you can drain it' is a 
>     cop-out.  In the best case scenario, that is exactly what happens.
>     Unfortunately, the best case requires a level of sophistication and
>     scheduling that only a few people have gotten right.  Even Cisco
>     has blown it numerous times.

That's why the people who sell the hardware get the big money.  They
are being paid to resolve these issues, and paid well.

I disagree that it is a "cop out" to expect hardware to function at
its rated capacity, but not higher.  It should be impossible to drive
the hardware above capacity, IMO.


> :Increasing the pool size can only compensate for a slow decision process,
> :not resource overcommit by a budget-happy management.  The laws of physics
> :and mathematics bend to the will of no middle-manager.
> 
>     This is not a resource overcommit issue.  This is a resource scheduling
>     issue.  You can always overcommit a resource -- the router must deal
>     with that situation no matter what your topology.  It is *HOW* you deal
>     with the overcommit that matters.  It is not possible to avoid a resource
>     overcommit in a router or a switch because ports, even ports with the
>     same physical speed, have mismatched loads.

CPU cycles, like RAM, are a resource.  If you can't schedule as fast
as the data comes in, then your scheduler is overcommitted.  I wish
the people working on SMP would look at CPU cycles this way.

By definition, something is not overcommitted only when it can meet
its commitments.  If it can't, then it's broken.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199903220545.WAA10719>