From owner-freebsd-smp@FreeBSD.ORG Mon Nov 17 21:13:57 2008
From: John Baldwin
To: "Archimedes Gaviola"
Cc: freebsd-smp@freebsd.org
Date: Mon, 17 Nov 2008 16:09:54 -0500
Subject: Re: CPU affinity with ULE scheduler
Message-Id: <200811171609.54527.jhb@freebsd.org>
In-Reply-To: <42e3d810811170336rf0a0357sf32035e8bd1489e9@mail.gmail.com>
References: <42e3d810811100033w172e90dbl209ecbab640cc24f@mail.gmail.com>
	<42e3d810811170311uddc77daj176bc285722a0c8@mail.gmail.com>
	<42e3d810811170336rf0a0357sf32035e8bd1489e9@mail.gmail.com>

On Monday 17 November 2008 06:36:40 am Archimedes Gaviola wrote:
> On Mon, Nov 17, 2008 at 7:11 PM, Archimedes Gaviola wrote:
> > On Fri, Nov 14, 2008 at 12:28 AM, John Baldwin wrote:
> >> On Thursday 13 November 2008 06:55:01 am Archimedes Gaviola wrote:
> >>> On Wed, Nov 12, 2008 at 1:16 AM, John Baldwin wrote:
> >>> > On Monday 10 November 2008 11:32:55 pm Archimedes Gaviola wrote:
> >>> >> On Tue, Nov 11, 2008 at 6:33 AM, John Baldwin wrote:
> >>> >> > On Monday 10 November 2008 03:33:23 am Archimedes Gaviola wrote:
> >>> >> >> To Whom It May Concern:
> >>> >> >>
> >>> >> >> Can someone explain or share something about the ULE scheduler (the
> >>> >> >> latest version 2, if I'm not mistaken) and how it deals with CPU
> >>> >> >> affinity? Are there any existing benchmarks of this on FreeBSD? I am
> >>> >> >> currently using the 4BSD scheduler, and what I have observed,
> >>> >> >> especially when processing high network load on multiple CPU cores,
> >>> >> >> is that only one CPU is being stressed with network interrupts while
> >>> >> >> the rest are mostly idle. This is an AMD-64 (4x) dual-core IBM system
> >>> >> >> with GigE Broadcom network interface cards (bce0 and bce1). Below is a
> >>> >> >> snapshot of the case.
> >>> >> >
> >>> >> > Interrupts are routed to a single CPU.  Since bce0 and bce1 are both on
> >>> >> > the same interrupt (irq 23), the CPU that interrupt is routed to is
> >>> >> > going to end up handling all the interrupts for bce0 and bce1.  This is
> >>> >> > not something ULE or 4BSD have any control over.
> >>> >> >
> >>> >> > --
> >>> >> > John Baldwin
> >>> >> >
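(For reference: a quick way to confirm the shared interrupt line from
userland is to look at the per-IRQ counters.  A rough sketch, assuming a
stock 6.x/7.x userland, with the counter columns left as placeholders:

  # vmstat -i | grep bce
  irq23: bce0 bce1           <total count>        <rate>

If both NICs show up on the same irqNN line there, they share a single
interrupt thread (the "irq23: bce0 bce1" entry in the top(1) snapshot
below), and that one thread can only run on one CPU at a time.)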
> >>> >>
> >>> >> Hi John,
> >>> >>
> >>> >> I'm sorry for the wrong snapshot.  Here's the right one for my concern.
> >>> >>
> >>> >>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> >>> >>    17 root        1 171   52     0K    16K CPU0   0  54:28 95.17% idle: cpu0
> >>> >>    15 root        1 171   52     0K    16K CPU2   2  55:55 93.65% idle: cpu2
> >>> >>    14 root        1 171   52     0K    16K CPU3   3  58:53 93.55% idle: cpu3
> >>> >>    13 root        1 171   52     0K    16K RUN    4  59:14 82.47% idle: cpu4
> >>> >>    12 root        1 171   52     0K    16K RUN    5  55:42 82.23% idle: cpu5
> >>> >>    16 root        1 171   52     0K    16K CPU1   1  58:13 77.78% idle: cpu1
> >>> >>    11 root        1 171   52     0K    16K CPU6   6  54:08 76.17% idle: cpu6
> >>> >>    36 root        1 -68 -187     0K    16K WAIT   7   8:50 65.53% irq23: bce0 bce1
> >>> >>    10 root        1 171   52     0K    16K CPU7   7  48:19 29.79% idle: cpu7
> >>> >>    43 root        1 171   52     0K    16K pgzero 2   0:35  1.51% pagezero
> >>> >>  1372 root       10  20    0 16716K  5764K kserel 6  58:42  0.00% kmd
> >>> >>  4488 root        1  96    0 30676K  4236K select 2   1:51  0.00% sshd
> >>> >>    18 root        1 -32 -151     0K    16K WAIT   0   1:14  0.00% swi4: clock s
> >>> >>    20 root        1 -44 -163     0K    16K WAIT   0   0:30  0.00% swi1: net
> >>> >>   218 root        1  96    0  3852K  1376K select 0   0:23  0.00% syslogd
> >>> >>  2171 root        1  96    0 30676K  4224K select 6   0:19  0.00% sshd
> >>> >>
> >>> >> Actually, I was doing network performance testing on this system with
> >>> >> FreeBSD-6.2 RELEASE and its default 4BSD scheduler.  I used a tool to
> >>> >> generate a large amount of traffic, around 600-700 Mbps, traversing the
> >>> >> FreeBSD system in both directions, meaning both network interfaces are
> >>> >> receiving traffic.  What happened was that the CPU (cpu7) handling
> >>> >> irq 23 for both interfaces reached a high utilization, around 65.53%,
> >>> >> which affects other running applications and services like sshd and
> >>> >> httpd.  They are no longer accessible while the traffic is being pushed
> >>> >> through.  With only one CPU being stressed, I was thinking of moving to
> >>> >> FreeBSD-7.0 RELEASE with the ULE scheduler, because I thought my
> >>> >> problem had something to do with how the scheduler distributes load
> >>> >> across multiple CPU cores, especially when processing network load.
> >>> >> So, if this is more about interrupt handling and not the scheduler, is
> >>> >> there a way we can optimize it?  If everything is still routed to only
> >>> >> one CPU, then to me it is still inefficient.  What handles interrupt
> >>> >> scheduling and CPU binding, so that shared IRQs can be avoided?  Are
> >>> >> there any improvements in FreeBSD-7.0 with regard to interrupt
> >>> >> handling?
> >>> >
> >>> > It depends.  In all likelihood, the interrupts from bce0 and bce1 are both
> >>> > hardwired to the same interrupt pin and so they will always share the same
> >>> > ithread when using the legacy INTx interrupts.  However, bce(4) parts do
> >>> > support MSI, and if you try a newer OS snap (6.3 or later) these devices
> >>> > should use MSI, in which case each NIC would be assigned to a separate
> >>> > CPU.  I would suggest trying 7.0 or a 7.1 release candidate and seeing if
> >>> > it does better.
> >>> >
> >>> > --
> >>> > John Baldwin
> >>> >
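(A note on verifying this after the upgrade: on 7.x, MSI vectors show up as
irq256 and above, while legacy INTx keeps the low irq numbers.  A rough
sketch of how to check and, if needed, force the behaviour; hw.bce.msi_enable
is from memory and worth double-checking against bce(4):

  # vmstat -i | grep bce      # irq256/irq257 here means MSI is in use
  # pciconf -lc               # newer pciconf can list the MSI capability per device

  # in /boot/loader.conf, to fall back to INTx for comparison:
  hw.pci.enable_msi="0"
  # per-driver knob, if bce(4) provides one:
  hw.bce.msi_enable="0"

Each MSI vector gets its own interrupt thread, which is why bce0 and bce1
land on separate CPUs in the snapshot below.)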
> >>>
> >>> Hi John,
> >>>
> >>> I tried the 7.0 release, and each network interface is now allocated
> >>> separately to a different CPU.  Here, MSI is already working.
> >>>
> >>>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME    WCPU COMMAND
> >>>    12 root        1 171 ki31     0K    16K CPU6   6 123:55 100.00% idle: cpu6
> >>>    15 root        1 171 ki31     0K    16K CPU3   3 123:54 100.00% idle: cpu3
> >>>    14 root        1 171 ki31     0K    16K CPU4   4 123:26 100.00% idle: cpu4
> >>>    16 root        1 171 ki31     0K    16K CPU2   2 123:15 100.00% idle: cpu2
> >>>    17 root        1 171 ki31     0K    16K CPU1   1 123:15 100.00% idle: cpu1
> >>>    37 root        1 -68    -     0K    16K CPU7   7   9:09 100.00% irq256: bce0
> >>>    13 root        1 171 ki31     0K    16K CPU5   5 123:49  99.07% idle: cpu5
> >>>    40 root        1 -68    -     0K    16K WAIT   0   4:40  51.17% irq257: bce1
> >>>    18 root        1 171 ki31     0K    16K RUN    0 117:48  49.37% idle: cpu0
> >>>    11 root        1 171 ki31     0K    16K RUN    7 115:25   0.00% idle: cpu7
> >>>    19 root        1 -32    -     0K    16K WAIT   0   0:39   0.00% swi4: clock s
> >>> 14367 root        1  44    0  5176K  3104K select 2   0:01   0.00% dhcpd
> >>>    22 root        1 -16    -     0K    16K -      3   0:01   0.00% yarrow
> >>>    25 root        1 -24    -     0K    16K WAIT   0   0:00   0.00% swi6: Giant t
> >>> 11658 root        1  44    0 32936K  4540K select 1   0:00   0.00% sshd
> >>> 14224 root        1  44    0 32936K  4540K select 5   0:00   0.00% sshd
> >>>    41 root        1 -60    -     0K    16K WAIT   0   0:00   0.00% irq1: atkbd0
> >>>     4 root        1  -8    -     0K    16K -      2   0:00   0.00% g_down
> >>>
> >>> The bce0 interface interrupt (irq256) is stressed out, already taking
> >>> 100% of CPU7, while CPU0 is at around 51.17%.  Any more recommendations?
> >>> Is there anything we can do about optimization with MSI?
> >>
> >> Well, on 7.x you can try turning net.isr.direct off (sysctl).  However, it
> >> seems you are hammering your bce0 interface.  You might want to try using
> >> polling on bce0 and seeing if it keeps up with the traffic better.
> >>
> >> --
> >> John Baldwin
> >>
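(For anyone following along, both suggestions are plain knobs.  A rough
sketch, assuming a 7.x box; whether bce(4) actually supports polling should
be checked against its man page:

  # sysctl net.isr.direct=0   # queue inbound packets to the netisr thread
                              # (swi1: net) instead of running the whole
                              # stack from the NIC's interrupt thread

  # kernel config, then rebuild and reboot:
  options DEVICE_POLLING

  # ifconfig bce0 polling     # enable polling per interface

With net.isr.direct=1, as in the 7.0 snapshot above, inbound packets are
processed to completion in the irq256/irq257 ithreads, which is why those
threads accumulate the CPU time; with it set to 0 that work moves to the
swi1: net thread, as in the next snapshot.)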
> >
> > With net.isr.direct=0, my IBM system shows lower CPU utilization per
> > interface (bce0 and bce1), but swi1: net increases its utilization.
> > Can you explain what is happening here?  What does net.isr.direct do
> > that decreases the CPU utilization of the interfaces?  I really want to
> > know what happens internally from the time packets are received by the
> > interfaces, through the device interrupt, up to the software interrupt
> > level, because I am confused by enabling/disabling net.isr.direct in
> > sysctl.  Is there a tool that can be used to trace this, just to see
> > which part of the kernel internals is the bottleneck, especially when
> > net.isr.direct=1?  By the way, with device polling enabled the system
> > experienced packet errors and the interface throughput was worse, so I
> > avoid using it.
> >
> >   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> >    16 root        1 171 ki31     0K    16K CPU10  a  86:06 89.06% idle: cpu10
> >    27 root        1 -44    -     0K    16K CPU1   1  34:37 82.67% swi1: net
> >    52 root        1 -68    -     0K    16K WAIT   b  51:59 59.77% irq32: bce1
> >    15 root        1 171 ki31     0K    16K RUN    b  69:28 43.16% idle: cpu11
> >    25 root        1 171 ki31     0K    16K RUN    1 115:35 24.27% idle: cpu1
> >    51 root        1 -68    -     0K    16K CPU10  a  35:21 13.48% irq31: bce0
> >
> >
> > Regards,
> > Archimedes
> >
>
> One more thing: I observed that when net.isr.direct=1, bce0 is using
> irq256 and bce1 is using irq257, while with net.isr.direct=0, bce0 is now
> using irq31 and bce1 is using irq32.  What makes the difference?

That is not from net.isr.direct.  irq256/257 is when the bce devices are
using MSI.  irq31/32 is when the bce devices are using INTx.

--
John Baldwin
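(A follow-up note for the archives: the irq number itself is the easiest
tell, e.g.

  # vmstat -i | grep bce
  irq256: bce0 ...            # 256 and up: the device attached with an MSI vector
  irq31: bce0 ...             # legacy range: the device is using INTx

Whether MSI is used is decided when the driver attaches, based on driver and
chipset support and the hw.pci.enable_msi loader tunable, not by
net.isr.direct, so if the irq numbers changed between runs, something else
must have changed across a reboot as well.)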