Date:      Fri, 18 Jan 2002 00:49:55 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Michal Mertl <mime@traveller.cz>
Cc:        arch@FreeBSD.ORG
Subject:   Re: 64 bit counters again
Message-ID:  <3C47E1B2.6938136@mindspring.com>
References:  <Pine.BSF.4.41.0201180033210.82507-100000@prg.traveller.cz>

Michal Mertl wrote:
> > 4)    Measure CPU overhead as well as I/O overhead.
> 
> I don't know what you mean by I/O overhead here.

Say you could flood a gigabit interface, and it was 6% of the
CPU on average.  Now, after your patches, suppose that it's 10%
of the CPU.  The limiting factor is the interface... but that's
only for your application, which is not doing CPU intensive
processing.  Something that did a lot of CPU work (like SSL),
would have a different profile, and you would be limiting the
application by causing it to become CPU bound earlier.

> > 6)    Use an SMP system, make sure that you have a sender
> >       on both CPUs, and measure TLB shootdown and page
> >       mapping turnover to ensure you get that overhead in
> >       there, too (plus the lock overhead).
> 
> I'm afraid I don't understand. I don't see that deep into the kernel,
> unfortunately. If you tell me what to look at and how...

The additional locks required for i386 64 bit atomicity will,
if the counter is accessed by more than one CPU, result in
bus contention for inter-CPU coherency.
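
To make that concrete, here is roughly what a 64 bit atomic add
built on lock;cmpxchg8b looks like (an illustrative GCC inline asm
sketch, not the actual kernel code; counter_add64 is just a name
made up for the example):

    #include <stdint.h>

    /*
     * Illustrative only: add 'v' to a 64 bit counter atomically on
     * i386.  The "lock" prefix forces exclusive ownership of the
     * cache line, and that is the inter-CPU coherency traffic
     * referred to above when both CPUs touch the same counter.
     */
    static inline void
    counter_add64(volatile uint64_t *p, uint64_t v)
    {
            uint64_t old, prev, new;

            old = *p;               /* unlocked snapshot; verified below */
            for (;;) {
                    new = old + v;
                    __asm__ __volatile__(
                        "lock; cmpxchg8b %1"
                        : "=A" (prev), "+m" (*p)
                        : "0" (old),
                          "b" ((uint32_t)new),
                          "c" ((uint32_t)(new >> 32))
                        : "cc", "memory");
                    if (prev == old)        /* compare matched, store happened */
                            break;
                    old = prev;             /* lost the race; retry with fresh value */
            }
    }

Every successful exchange, and every retry, is another locked bus
cycle that the other CPU has to wait out.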

> > 7)    Make sure you are sending data already in the kernel,
> >       so you aren't including copy overhead in the CPU cost,
> >       since practically no one implements servers with copy
> >       overhead these days.
> 
> What do you mean by that? Zero-copy operation? Like sendfile? Is Apache
> 1.x zero-copy?

Yes, zero copy.  Sendfile isn't ideal, but it works.  Apache is
not zero copy.  The idea is to avoid including a lot of CPU work
on copies between user space and the kernel, copies which aren't
going to happen in an extremely optimized application.
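
For reference, the "data already in the kernel" case looks
something like this with FreeBSD's sendfile(2) (a minimal sketch;
push_file, fd and s are hypothetical names, with fd an open file
and s a connected socket):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /*
     * Minimal sketch: the file's pages go from the buffer/VM cache
     * to the socket without a copy through user space, so the
     * benchmark measures the network path rather than copyin/copyout
     * overhead.
     */
    static int
    push_file(int fd, int s, off_t len)
    {
            off_t sent = 0;

            /* Send 'len' bytes from offset 0; 'sent' gets the byte count. */
            return (sendfile(fd, s, 0, (size_t)len, NULL, &sent, 0));
    }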


> > If you push data at 100Mbit, and not even at full throttle at
> > that, you can't reasonably expect to see a slowdown when you
> > have other bottlenecks between you and the changes.
> >
> > In particular, you're not going to see things like the pool
> > size go up because of increased pool retention time, etc.,
> > due to the overhead of doing the calculations.
> 
> That's probably correct, even though I again don't fully understand what
> you're talking about :-).

Look at the max number of mbufs allocated.  They form a pool
of type-stable memory from which mbufs are allocated (things
that get allocated get freed back to the pool instead of freed
to the system).  You can see this in the zone counts by dumping
the zone information with vmstat, and in the mbuf counts in the
netstat -m output.

Basically, if you run without the 64 bit stuff, and get one
number, and then run with it, and get a larger number, then
this means that the time you are spending doing the stats is
increasing the amount of time it takes in the code path, and
so the mbufs don't get processed out as quickly.

The implication, IFF this is the case, is that the additional
processing overhead has increased the amount of time a buffer
remains in transit -- the pool retention time -- and thus it
increases the overall total pool size for a given throughput.

The upshot of this happening is that you now require more memory
for the same amount of work, or, if your machine is "maxed out",
then the high end amount of work you can do is reduced by the
changes.
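
A quick back-of-the-envelope version of that (hypothetical
numbers, not measurements): the steady-state pool size is roughly
throughput times retention time, so even a small bump in
per-packet processing time shows up directly as a larger high
watermark at the same offered load.

    #include <stdio.h>

    /* Hypothetical numbers, purely for illustration. */
    int
    main(void)
    {
            double pps = 80000.0;           /* packets per second offered */
            double t_before = 0.0020;       /* seconds an mbuf stays "in transit" */
            double t_after = 0.0022;        /* same path with the extra counter cost */

            printf("pool before: ~%.0f mbufs\n", pps * t_before);  /* ~160 */
            printf("pool after:  ~%.0f mbufs\n", pps * t_after);   /* ~176 */
            return (0);
    }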


> > Also, realize that even though simply pushing data doesn't
> > use up a lot of CPU on FreeBSD if you do it right, even 2%
> > or 4% increase in CPU overhead overall is enough to cause
> > problems for already CPU-bound applications (i.e. that's
> > ~40 less SSL connections per server).
> 
> You're right with that too. Of course I know that at full CPU load the
> clocks will be missing and maybe other things (memory bandwidth with
> locked operations?) will suffer.

Yes.  It's important to know whether it is significant for
the bottleneck figure of merit for a particular application.

For SSL, this is CPU cycles.  For an NFS server, this is how
much data it can push in a given period of time (overall
throughput).  For some other application, it's some other
number.

For example, the thing that limits the top end speed of SQUID
is how fast it can log, and the number one factor there is
actually the rate at which gettimeofday() can be called while
still maintaining the exhaustive log records that users have come
to expect (these are basically UI "eye candy", except when the
logs are digested and used for billing purposes, at which point
they are absolutely critical).  Because network processing for
almost all packets in or out of current FreeBSD occurs at NETISR,
this basically means that the closer the per-packet processing
time gets to a full quantum, the closer you are to a condition
called "receiver livelock".  That alone can drop your top end by
up to 15%, and can stop your server in its tracks if you aren't
very careful (RED queueing, weighted fair share queue scheduling,
etc., to ensure you don't spend all your time in the kernel and
none in user space processing requests).
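
If you want to see where that logging ceiling is on a given box,
the measurement is trivial (a rough sketch; the one second window
is arbitrary):

    #include <stdio.h>
    #include <sys/time.h>

    /*
     * Count how many gettimeofday() calls a CPU can make per second;
     * that bounds how fast a logger that timestamps every entry can run.
     */
    int
    main(void)
    {
            struct timeval start, now;
            unsigned long calls = 0;
            double elapsed;

            gettimeofday(&start, NULL);
            do {
                    gettimeofday(&now, NULL);
                    calls++;
                    elapsed = (now.tv_sec - start.tv_sec) +
                        (now.tv_usec - start.tv_usec) / 1e6;
            } while (elapsed < 1.0);

            printf("~%.0f gettimeofday() calls/sec\n", calls / elapsed);
            return (0);
    }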

> > But we can wait for your effects on the mbuf count high
> > watermark and CPU utilization values before jumping to any
> > conclusions...
> 
> I'm afraid I can't provide any measurements with faster interfaces. I can
> try to use a real server to send me some data so it's executing on both
> processors, but I would probably become limited by 100Mbit sooner than
> I'd notice the processors have less time to do their job :(.

Well, you probably should collect *all* statistics you can,
in the most "this is the only thing I'm doing with the box"
way you can, before and after the code change, and then plot
the ones that get worse (or better) as a result of the change.


[ ... ]
> THE MOST IMPORTANT QUESTION, to which lots of you probably know the
> answer, is: DO WE NEED ATOMIC OPERATIONS FOR ACCESSING DIFFERENT COUNTERS
> (e.g. network-device (modified in ISR? - YES/NO) or network-protocol or
> filesystem ...)? NO MATTER WHAT THE SIZE OF THE COUNTER IS.
> 
> If we need atomic, we need atomic 32 bit as much as 64 bit. If we don't,
> we can have cheaper 64 bit counters. My API allows for different
> treatment of different classes of counters (if a simple answer to my
> question exists) or places in the kernel (you know whether you're calling
> where an interrupt can occur, or where another CPU may modify the same
> counter...). I ran the SMP kernel with the same test using a "simple
> 64 bit add (addl,adcl)" without noticing anything going wrong, and that
> sure isn't anywhere near as expensive as lock;cmpxchg8b.

I think the answer is "yes, we need atomic counters".  Whether they
need to be 64 bit or just 32 bit is really application dependent
(we have all agreed to that, I think).
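
For contrast with the lock;cmpxchg8b sketch above, the cheap
variant you tested is essentially this (again just an
illustration; whether it is good enough depends on who else can
touch the counter):

    #include <stdint.h>

    /*
     * Plain 64 bit add: gcc emits an addl/adcl pair (or a
     * load/add/store sequence) on i386.  There is no lock prefix,
     * so the read-modify-write can interleave with another CPU (or
     * an interrupt handler hitting the same counter) and lose an
     * update, and a concurrent 64 bit read can see a torn value
     * across the carry into the high word.
     */
    static inline void
    counter_add64_cheap(volatile uint64_t *p, uint64_t v)
    {
            *p += v;
    }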

See Bruce's posting about atomicity; I think it speaks very
eloquently on the issue (and much more briefly than what I'd write
to say the same thing ;^)).

-- Terry
