From: Jeffrey Racine <jracine@maxwell.syr.edu>
To: Roland Wells
Cc: freebsd-cluster@freebsd.org, freebsd-amd64@freebsd.org
Date: Mon, 12 Apr 2004 09:04:24 -0400
Subject: RE: LAM MPI on dual processor opteron box sees only one cpu...

Hi Roland.

I do get "SMP: AP CPU #1 Launched!" in dmesg, so that is not the problem.
The problem appears to be with the way that -CURRENT is scheduling. With
mpirun -np 2 I get the job running on CPU 0 only (two instances on one
proc). However, it turns out that with -np 4 I do get the job running on
CPUs 0 and 1, though with 4 instances (and the associated overhead).

Here is top for -np 4... notice in the C column that it is using both
procs.

  PID USERNAME PRI NICE  SIZE   RES STATE C  TIME  WCPU    CPU  COMMAND
96090 jracine  131    0 7148K 2172K CPU1  1  0:19 44.53% 44.53% n_lam
96088 jracine  125    0 7148K 2172K RUN   0  0:18 43.75% 43.75% n_lam
96089 jracine  136    0 7148K 2172K RUN   1  0:19 42.19% 42.19% n_lam
96087 jracine  135    0 7188K 2248K RUN   0  0:19 41.41% 41.41% n_lam

One run (just after I rebooted LAM) did allocate the job correctly with
-np 2, but this is not the case in general. Other systems I use, however,
correctly farm out -np 2 to CPUs 0 and 1...
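If it helps anyone reproduce this, a trivial CPU-bound MPI test along the
following lines is enough to watch the placement in top's C column (a
minimal sketch only; burn.c is an illustrative stand-in, not the actual
n_lam source):

/*
 * burn.c -- minimal CPU-bound MPI test.  Each rank spins on arithmetic
 * with no communication, so any serialization seen in top comes from
 * the scheduler, not from MPI.
 *
 * Build and run with LAM's wrappers:
 *   mpicc burn.c -o burn
 *   mpirun -np 2 burn
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    long i;
    volatile double x = 0.0;   /* volatile: keep the loop from being optimized away */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* burn CPU long enough to observe placement in top's C column */
    for (i = 0; i < 500000000L; i++)
        x += (double)i * 0.5;

    printf("rank %d of %d finished (x = %f)\n", rank, size, x);
    MPI_Finalize();
    return 0;
}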
Thanks, and any suggestions welcome.

-- Jeff

On Sun, 2004-04-11 at 14:20 -0500, Roland Wells wrote:
> Jeffrey,
>
> I am not familiar with the LAM MPI issue, but on a dual proc box you
> should also get an additional line towards the bottom of your dmesg,
> similar to:
>
>   SMP: AP CPU #1 Launched!
>
> -Roland
>
> -----Original Message-----
> From: owner-freebsd-cluster@freebsd.org
> [mailto:owner-freebsd-cluster@freebsd.org] On Behalf Of Jeffrey Racine
> Sent: Saturday, April 10, 2004 5:22 PM
> To: freebsd-amd64@freebsd.org; freebsd-cluster@freebsd.org
> Subject: LAM MPI on dual processor opteron box sees only one cpu...
>
> Hi.
>
> I am converging on getting a new dual Opteron box running. Now I am
> setting up and testing LAM MPI; however, the OS is not farming out the
> job as expected and only sees one processor.
>
> This runs fine on RH 7.3 and RH 9.0, both on a cluster and on a
> dual-processor PIV desktop. I am running 5-CURRENT. Basically,
> mpirun -np 1 binaryfile has the same runtime as mpirun -np 2
> binaryfile, while on the dual PIV box it runs in half the time. When I
> check top with mpirun -np 2, both instances run on CPU 0... here is
> the relevant portion from top with -np 2:
>
> 29306 jracine   4 0 7188K 2448K sbwait 0 0:03 19.53% 19.53% n_lam
> 29307 jracine 119 0 7148K 2372K CPU0   0 0:03 19.53% 19.53% n_lam
>
> I include output from laminfo, dmesg (cpu-relevant info), and
> lamboot -d bhost.lam... any suggestions most appreciated, and thanks
> in advance!
>
> -- laminfo
>
>            LAM/MPI: 7.0.4
>             Prefix: /usr/local
>       Architecture: amd64-unknown-freebsd5.2
>      Configured by: root
>      Configured on: Sat Apr 10 11:22:02 EDT 2004
>     Configure host: jracine.maxwell.syr.edu
>         C bindings: yes
>       C++ bindings: yes
>   Fortran bindings: yes
>        C profiling: yes
>      C++ profiling: yes
>  Fortran profiling: yes
>      ROMIO support: yes
>       IMPI support: no
>      Debug support: no
>       Purify clean: no
>           SSI boot: globus (Module v0.5)
>           SSI boot: rsh (Module v1.0)
>           SSI coll: lam_basic (Module v7.0)
>           SSI coll: smp (Module v1.0)
>            SSI rpi: crtcp (Module v1.0.1)
>            SSI rpi: lamd (Module v7.0)
>            SSI rpi: sysv (Module v7.0)
>            SSI rpi: tcp (Module v7.0)
>            SSI rpi: usysv (Module v7.0)
>
> -- dmesg sees two cpus...
>
> CPU: AMD Opteron(tm) Processor 248 (2205.02-MHz K8-class CPU)
>   Origin = "AuthenticAMD"  Id = 0xf58  Stepping = 8
>   Features=0x78bfbff<...,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
>   AMD Features=0xe0500800
> real memory  = 3623813120 (3455 MB)
> avail memory = 3494363136 (3332 MB)
> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
>  cpu0 (BSP): APIC ID:  0
>  cpu1 (AP):  APIC ID:  1
>
> -- bhost has the requisite information
>
> 128.230.130.10 cpu=2 user=jracine
>
> -- Here are the results from lamboot -d bhost.lam
>
> -bash-2.05b$ lamboot -d ~/bhost.lam
> n0<29283> ssi:boot: Opening
> n0<29283> ssi:boot: opening module globus
> n0<29283> ssi:boot: initializing module globus
> n0<29283> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n0<29283> ssi:boot: module not available: globus
> n0<29283> ssi:boot: opening module rsh
> n0<29283> ssi:boot: initializing module rsh
> n0<29283> ssi:boot:rsh: module initializing
> n0<29283> ssi:boot:rsh:agent: rsh
> n0<29283> ssi:boot:rsh:username:
> n0<29283> ssi:boot:rsh:verbose: 1000
> n0<29283> ssi:boot:rsh:algorithm: linear
> n0<29283> ssi:boot:rsh:priority: 10
> n0<29283> ssi:boot: module available: rsh, priority: 10
> n0<29283> ssi:boot: finalizing module globus
> n0<29283> ssi:boot:globus: finalizing
> n0<29283> ssi:boot: closing module globus
> n0<29283> ssi:boot: Selected boot module rsh
>
> LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University
>
> n0<29283> ssi:boot:base: looking for boot schema in following directories:
> n0<29283> ssi:boot:base:
> n0<29283> ssi:boot:base:   $TROLLIUSHOME/etc
> n0<29283> ssi:boot:base:   $LAMHOME/etc
> n0<29283> ssi:boot:base:   /usr/local/etc
> n0<29283> ssi:boot:base: looking for boot schema file:
> n0<29283> ssi:boot:base:   /home/jracine/bhost.lam
> n0<29283> ssi:boot:base: found boot schema: /home/jracine/bhost.lam
> n0<29283> ssi:boot:rsh: found the following hosts:
> n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu (cpu=2)
> n0<29283> ssi:boot:rsh: resolved hosts:
> n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu --> 128.230.130.10 (origin)
> n0<29283> ssi:boot:rsh: starting RTE procs
> n0<29283> ssi:boot:base:linear: starting
> n0<29283> ssi:boot:base:server: opening server TCP socket
> n0<29283> ssi:boot:base:server: opened port 49832
> n0<29283> ssi:boot:base:linear: booting n0 (jracine.maxwell.syr.edu)
> n0<29283> ssi:boot:rsh: starting lamd on (jracine.maxwell.syr.edu)
> n0<29283> ssi:boot:rsh: starting on n0 (jracine.maxwell.syr.edu): hboot -t -c lam-conf.lamd -d -I -H 128.230.130.10 -P 49832 -n 0 -o 0
> n0<29283> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-io-socket
> tkill: f_kill = "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
> hboot: booting...
> hboot: fork /usr/local/bin/lamd
> [1] 29286 lamd -H 128.230.130.10 -P 49832 -n 0 -o 0 -d
> n0<29283> ssi:boot:rsh: successfully launched on n0 (jracine.maxwell.syr.edu)
> n0<29283> ssi:boot:base:server: expecting connection from finite list
> hboot: attempting to execute
> n-1<29286> ssi:boot: Opening
> n-1<29286> ssi:boot: opening module globus
> n-1<29286> ssi:boot: initializing module globus
> n-1<29286> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<29286> ssi:boot: module not available: globus
> n-1<29286> ssi:boot: opening module rsh
> n-1<29286> ssi:boot: initializing module rsh
> n-1<29286> ssi:boot:rsh: module initializing
> n-1<29286> ssi:boot:rsh:agent: rsh
> n-1<29286> ssi:boot:rsh:username:
> n-1<29286> ssi:boot:rsh:verbose: 1000
> n-1<29286> ssi:boot:rsh:algorithm: linear
> n-1<29286> ssi:boot:rsh:priority: 10
> n-1<29286> ssi:boot: module available: rsh, priority: 10
> n-1<29286> ssi:boot: finalizing module globus
> n-1<29286> ssi:boot:globus: finalizing
> n-1<29286> ssi:boot: closing module globus
> n-1<29286> ssi:boot: Selected boot module rsh
> n0<29283> ssi:boot:base:server: got connection from 128.230.130.10
> n0<29283> ssi:boot:base:server: this connection is expected (n0)
> n0<29283> ssi:boot:base:server: remote lamd is at 128.230.130.10:50206
> n0<29283> ssi:boot:base:server: closing server socket
> n0<29283> ssi:boot:base:server: connecting to lamd at 128.230.130.10:49833
> n0<29283> ssi:boot:base:server: connected
> n0<29283> ssi:boot:base:server: sending number of links (1)
> n0<29283> ssi:boot:base:server: sending info: n0 (jracine.maxwell.syr.edu)
> n0<29283> ssi:boot:base:server: finished sending
> n0<29283> ssi:boot:base:server: disconnected from 128.230.130.10:49833
> n0<29283> ssi:boot:base:linear: finished
> n0<29283> ssi:boot:rsh: all RTE procs started
> n0<29283> ssi:boot:rsh: finalizing
> n0<29283> ssi:boot: Closing
> n-1<29286> ssi:boot:rsh: finalizing
> n-1<29286> ssi:boot: Closing