From owner-freebsd-smp@FreeBSD.ORG Mon Nov 17 11:11:01 2008
Date: Mon, 17 Nov 2008 19:11:00 +0800
From: "Archimedes Gaviola" <archimedes.gaviola@gmail.com>
To: "John Baldwin"
Cc: freebsd-smp@freebsd.org
Subject: Re: CPU affinity with ULE scheduler
Message-ID: <42e3d810811170311uddc77daj176bc285722a0c8@mail.gmail.com>
In-Reply-To: <200811131128.55220.jhb@freebsd.org>
References: <42e3d810811100033w172e90dbl209ecbab640cc24f@mail.gmail.com>
	<200811111216.37462.jhb@freebsd.org>
	<42e3d810811130355x3857bceap447e134b18eee04b@mail.gmail.com>
	<200811131128.55220.jhb@freebsd.org>

On Fri, Nov 14, 2008 at 12:28 AM, John Baldwin wrote:
> On Thursday 13 November 2008 06:55:01 am Archimedes Gaviola wrote:
>> On Wed, Nov 12, 2008 at 1:16 AM, John Baldwin wrote:
>> > On Monday 10 November 2008 11:32:55 pm Archimedes Gaviola wrote:
>> >> On Tue, Nov 11, 2008 at 6:33 AM, John Baldwin wrote:
>> >> > On Monday 10 November 2008 03:33:23 am Archimedes Gaviola wrote:
>> >> >> To Whom It May Concern:
>> >> >>
>> >> >> Can someone explain or share how the ULE scheduler (the latest,
>> >> >> version 2, if I'm not mistaken) deals with CPU affinity? Are there
>> >> >> any existing benchmarks of this on FreeBSD? I am currently using
>> >> >> the 4BSD scheduler, and what I have observed, especially when
>> >> >> processing high network load across multiple CPU cores, is that
>> >> >> only one CPU is stressed by the network interrupt while the rest
>> >> >> sit mostly idle. This is an AMD64 (4x dual-core) IBM system with
>> >> >> GigE Broadcom network interface cards (bce0 and bce1). Below is a
>> >> >> snapshot of the case.
>> >> >
>> >> > Interrupts are routed to a single CPU.  Since bce0 and bce1 are both
>> >> > on the same interrupt (irq 23), the CPU that interrupt is routed to
>> >> > is going to end up handling all the interrupts for bce0 and bce1.
>> >> > This is not something ULE or 4BSD have any control over.
>> >> >
>> >> > --
>> >> > John Baldwin
>> >> >
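By the way, the sharing shows up directly from userland; this is roughly how
I look at it on this box (commands only, output omitted, on a stock 6.x/7.x
install):

    # vmstat -i          # bce0 and bce1 are counted together on one "irq23: bce0 bce1" row
    # systat -vmstat 1   # live interrupt rates per source; that shared row carries all the NIC load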
>> >>
>> >> Hi John,
>> >>
>> >> I'm sorry for the wrong snapshot. Here's the right one with my concern.
>> >>
>> >>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>> >>    17 root       1 171   52     0K    16K CPU0   0  54:28 95.17% idle: cpu0
>> >>    15 root       1 171   52     0K    16K CPU2   2  55:55 93.65% idle: cpu2
>> >>    14 root       1 171   52     0K    16K CPU3   3  58:53 93.55% idle: cpu3
>> >>    13 root       1 171   52     0K    16K RUN    4  59:14 82.47% idle: cpu4
>> >>    12 root       1 171   52     0K    16K RUN    5  55:42 82.23% idle: cpu5
>> >>    16 root       1 171   52     0K    16K CPU1   1  58:13 77.78% idle: cpu1
>> >>    11 root       1 171   52     0K    16K CPU6   6  54:08 76.17% idle: cpu6
>> >>    36 root       1 -68 -187     0K    16K WAIT   7   8:50 65.53% irq23: bce0 bce1
>> >>    10 root       1 171   52     0K    16K CPU7   7  48:19 29.79% idle: cpu7
>> >>    43 root       1 171   52     0K    16K pgzero 2   0:35  1.51% pagezero
>> >>  1372 root      10  20    0 16716K  5764K kserel 6  58:42  0.00% kmd
>> >>  4488 root       1  96    0 30676K  4236K select 2   1:51  0.00% sshd
>> >>    18 root       1 -32 -151     0K    16K WAIT   0   1:14  0.00% swi4: clock s
>> >>    20 root       1 -44 -163     0K    16K WAIT   0   0:30  0.00% swi1: net
>> >>   218 root       1  96    0  3852K  1376K select 0   0:23  0.00% syslogd
>> >>  2171 root       1  96    0 30676K  4224K select 6   0:19  0.00% sshd
>> >>
>> >> Actually, I was doing network performance testing on this system with
>> >> FreeBSD 6.2-RELEASE using its default scheduler, 4BSD, and a tool that
>> >> generates a large amount of traffic, around 600-700 Mbps, traversing
>> >> the FreeBSD system in both directions, meaning both network interfaces
>> >> were receiving traffic. What happened was that the CPU handling irq 23
>> >> for both interfaces (cpu7) used a large share of its time, around
>> >> 65.53%, which affected other running applications and services like
>> >> sshd and httpd; the system was no longer accessible while the traffic
>> >> was being generated. With only one CPU being stressed, I was thinking
>> >> of moving to FreeBSD 7.0-RELEASE with the ULE scheduler, because I
>> >> thought the problem had to do with how the scheduler distributes load
>> >> across multiple CPU cores, especially when processing network load.
>> >> So, if this is more about interrupt handling than about the scheduler,
>> >> is there a way to optimize it? If everything is still routed to a
>> >> single CPU, then to me it is still inefficient. What handles interrupt
>> >> scheduling and CPU binding, so that a shared IRQ can be avoided? Are
>> >> there any improvements in FreeBSD 7.0 with regard to interrupt
>> >> handling?
>> >
>> > It depends.  In all likelihood, the interrupts from bce0 and bce1 are both
>> > hardwired to the same interrupt pin and so they will always share the same
>> > ithread when using the legacy INTx interrupts.  However, bce(4) parts do
>> > support MSI, and if you try a newer OS snap (6.3 or later) these devices
>> > should use MSI, in which case each NIC would be assigned to a separate CPU.
>> > I would suggest trying 7.0 or a 7.1 release candidate and see if it does
>> > better.
>> >
>> > --
>> > John Baldwin
>> >
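For what it's worth, this is roughly how I checked that MSI was really in use
after the upgrade (a sketch from my notes; the loader tunable below is the
global MSI switch as I understand it, so treat the name as an assumption):

    # pciconf -lc            # the bce entries should show an "MSI" capability
    # vmstat -i | grep bce   # with MSI each NIC gets its own vector, irq256 and up

    # in /boot/loader.conf, MSI can be switched off globally for comparison:
    hw.pci.enable_msi="0"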
>>
>> Hi John,
>>
>> I tried the 7.0 release, and each network interface is now allocated
>> its own interrupt on a separate CPU. MSI is already working here.
>>
>>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C    TIME    WCPU COMMAND
>>    12 root       1 171 ki31     0K    16K CPU6   6  123:55 100.00% idle: cpu6
>>    15 root       1 171 ki31     0K    16K CPU3   3  123:54 100.00% idle: cpu3
>>    14 root       1 171 ki31     0K    16K CPU4   4  123:26 100.00% idle: cpu4
>>    16 root       1 171 ki31     0K    16K CPU2   2  123:15 100.00% idle: cpu2
>>    17 root       1 171 ki31     0K    16K CPU1   1  123:15 100.00% idle: cpu1
>>    37 root       1 -68    -     0K    16K CPU7   7    9:09 100.00% irq256: bce0
>>    13 root       1 171 ki31     0K    16K CPU5   5  123:49  99.07% idle: cpu5
>>    40 root       1 -68    -     0K    16K WAIT   0    4:40  51.17% irq257: bce1
>>    18 root       1 171 ki31     0K    16K RUN    0  117:48  49.37% idle: cpu0
>>    11 root       1 171 ki31     0K    16K RUN    7  115:25   0.00% idle: cpu7
>>    19 root       1 -32    -     0K    16K WAIT   0    0:39   0.00% swi4: clock s
>> 14367 root       1  44    0  5176K  3104K select 2    0:01   0.00% dhcpd
>>    22 root       1 -16    -     0K    16K -      3    0:01   0.00% yarrow
>>    25 root       1 -24    -     0K    16K WAIT   0    0:00   0.00% swi6: Giant t
>> 11658 root       1  44    0 32936K  4540K select 1    0:00   0.00% sshd
>> 14224 root       1  44    0 32936K  4540K select 5    0:00   0.00% sshd
>>    41 root       1 -60    -     0K    16K WAIT   0    0:00   0.00% irq1: atkbd0
>>     4 root       1  -8    -     0K    16K -      2    0:00   0.00% g_down
>>
>> The bce0 interface interrupt (irq256) is getting stressed: it is already
>> at 100% of CPU7, while CPU0 is at around 51.17%. Any more
>> recommendations? Is there anything we can do to optimize MSI further?
>
> Well, on 7.x you can try turning net.isr.direct off (sysctl).  However, it
> seems you are hammering your bce0 interface.  You might want to try using
> polling on bce0 and seeing if it keeps up with the traffic better.
>
> --
> John Baldwin
>

With net.isr.direct=0, my IBM system shows lower CPU utilization per
interface (bce0 and bce1), but swi1: net increases its utilization. Can you
explain what is happening here? What does net.isr.direct do that decreases
the CPU utilization on the interfaces? I would really like to understand
what happens internally as packets are received by the interfaces and move
from the device interrupt up to the software interrupt level, because I am
confused by enabling/disabling net.isr.direct in sysctl. Is there a tool we
can use to trace this path, just to see which part of the kernel is the
bottleneck, especially when net.isr.direct=1? By the way, with device
polling enabled the system experienced packet errors and the interface
throughput was worse, so I am avoiding it.

  PID USERNAME THR PRI NICE   SIZE    RES STATE  C    TIME   WCPU COMMAND
   16 root       1 171 ki31     0K    16K CPU10  a   86:06 89.06% idle: cpu10
   27 root       1 -44    -     0K    16K CPU1   1   34:37 82.67% swi1: net
   52 root       1 -68    -     0K    16K WAIT   b   51:59 59.77% irq32: bce1
   15 root       1 171 ki31     0K    16K RUN    b   69:28 43.16% idle: cpu11
   25 root       1 171 ki31     0K    16K RUN    1  115:35 24.27% idle: cpu1
   51 root       1 -68    -     0K    16K CPU10  a   35:21 13.48% irq31: bce0

Regards,
Archimedes
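P.S. For completeness, these are the knobs I was switching between the runs
described above. The polling part assumes a kernel built with
"options DEVICE_POLLING" (polling is usually paired with a higher HZ as
well); the two net.isr.direct values are the ones I compared:

    # netisr dispatch policy
    sysctl net.isr.direct=0   # inbound work is queued to the swi1: net thread (as in the last snapshot)
    sysctl net.isr.direct=1   # inbound work runs in the NIC's interrupt thread (the default John suggested turning off)

    # per-interface polling test
    ifconfig bce0 polling
    ifconfig bce1 polling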