From: Jeffrey Racine <jracine@maxwell.syr.edu>
To: Roland Wells
Cc: freebsd-cluster@freebsd.org, freebsd-amd64@freebsd.org
Date: Mon, 12 Apr 2004 09:04:24 -0400
Subject: RE: LAM MPI on dual processor opteron box sees only one cpu...

Hi Roland.

I do get "SMP: AP CPU #1 Launched!" in dmesg, so that is not the problem.
The problem appears to be with the way that -CURRENT is scheduling. With
mpirun -np 2 I get the job running on CPU 0 only (two instances on one
proc). However, it turns out that with -np 4 I do get the job running on
CPUs 0 and 1, though with 4 instances (and the associated overhead).

Here is top for -np 4... notice in the C column that it is using both
procs.

  PID USERNAME PRI NICE  SIZE   RES STATE C  TIME  WCPU    CPU  COMMAND
96090 jracine  131    0 7148K 2172K CPU1  1  0:19 44.53% 44.53% n_lam
96088 jracine  125    0 7148K 2172K RUN   0  0:18 43.75% 43.75% n_lam
96089 jracine  136    0 7148K 2172K RUN   1  0:19 42.19% 42.19% n_lam
96087 jracine  135    0 7188K 2248K RUN   0  0:19 41.41% 41.41% n_lam

One run (just after I rebooted LAM) did allocate the job correctly with
-np 2, but this is not the case in general. Other systems I use, however,
correctly farm out -np 2 to CPUs 0 and 1...
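If it helps anyone reproduce this, a trivial CPU-bound MPI test along the
following lines is enough to watch the placement in top's C column (a
minimal sketch only; burn.c is an illustrative stand-in, not the actual
n_lam source):

/*
 * burn.c -- minimal CPU-bound MPI test.  Each rank spins on arithmetic
 * with no communication, so any serialization seen in top comes from
 * the scheduler, not from MPI.
 *
 * Build and run with LAM's wrappers:
 *   mpicc burn.c -o burn
 *   mpirun -np 2 burn
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    long i;
    volatile double x = 0.0;   /* volatile: keep the loop from being optimized away */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* burn CPU long enough to observe placement in top's C column */
    for (i = 0; i < 500000000L; i++)
        x += (double)i * 0.5;

    printf("rank %d of %d finished (x = %f)\n", rank, size, x);
    MPI_Finalize();
    return 0;
}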
Thanks, and any suggestions welcome.

-- Jeff

On Sun, 2004-04-11 at 14:20 -0500, Roland Wells wrote:
> Jeffrey,
>
> I am not familiar with the LAM MPI issue, but on a dual proc box you
> should also get an additional line towards the bottom of your dmesg,
> similar to:
>
>   SMP: AP CPU #1 Launched!
>
> -Roland
>
> -----Original Message-----
> From: owner-freebsd-cluster@freebsd.org
> [mailto:owner-freebsd-cluster@freebsd.org] On Behalf Of Jeffrey Racine
> Sent: Saturday, April 10, 2004 5:22 PM
> To: freebsd-amd64@freebsd.org; freebsd-cluster@freebsd.org
> Subject: LAM MPI on dual processor opteron box sees only one cpu...
>
> Hi.
>
> I am converging on getting a new dual Opteron box running. Now I am
> setting up and testing LAM MPI; however, the OS is not farming out the
> job as expected and only sees one processor.
>
> This runs fine on RH 7.3 and RH 9.0, both on a cluster and on a
> dual-processor PIV desktop. I am running 5-CURRENT. Basically,
> mpirun -np 1 binaryfile has the same runtime as mpirun -np 2
> binaryfile, while on the dual PIV box it runs in half the time. When I
> check top with mpirun -np 2, both instances run on CPU 0... here is
> the relevant portion from top with -np 2:
>
> 29306 jracine   4 0 7188K 2448K sbwait 0 0:03 19.53% 19.53% n_lam
> 29307 jracine 119 0 7148K 2372K CPU0   0 0:03 19.53% 19.53% n_lam
>
> I include output from laminfo, dmesg (cpu-relevant info), and
> lamboot -d bhost.lam... any suggestions most appreciated, and thanks
> in advance!
>
> -- laminfo
>
>            LAM/MPI: 7.0.4
>             Prefix: /usr/local
>       Architecture: amd64-unknown-freebsd5.2
>      Configured by: root
>      Configured on: Sat Apr 10 11:22:02 EDT 2004
>     Configure host: jracine.maxwell.syr.edu
>         C bindings: yes
>       C++ bindings: yes
>   Fortran bindings: yes
>        C profiling: yes
>      C++ profiling: yes
>  Fortran profiling: yes
>      ROMIO support: yes
>       IMPI support: no
>      Debug support: no
>       Purify clean: no
>           SSI boot: globus (Module v0.5)
>           SSI boot: rsh (Module v1.0)
>           SSI coll: lam_basic (Module v7.0)
>           SSI coll: smp (Module v1.0)
>            SSI rpi: crtcp (Module v1.0.1)
>            SSI rpi: lamd (Module v7.0)
>            SSI rpi: sysv (Module v7.0)
>            SSI rpi: tcp (Module v7.0)
>            SSI rpi: usysv (Module v7.0)
>
> -- dmesg sees two cpus...
>
> CPU: AMD Opteron(tm) Processor 248 (2205.02-MHz K8-class CPU)
>   Origin = "AuthenticAMD"  Id = 0xf58  Stepping = 8
>   Features=0x78bfbff<...,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
>   AMD Features=0xe0500800
> real memory  = 3623813120 (3455 MB)
> avail memory = 3494363136 (3332 MB)
> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
>  cpu0 (BSP): APIC ID:  0
>  cpu1 (AP):  APIC ID:  1
>
> -- bhost has the requisite information
>
> 128.230.130.10 cpu=2 user=jracine
>
> -- Here are the results from lamboot -d bhost.lam
>
> -bash-2.05b$ lamboot -d ~/bhost.lam
> n0<29283> ssi:boot: Opening
> n0<29283> ssi:boot: opening module globus
> n0<29283> ssi:boot: initializing module globus
> n0<29283> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n0<29283> ssi:boot: module not available: globus
> n0<29283> ssi:boot: opening module rsh
> n0<29283> ssi:boot: initializing module rsh
> n0<29283> ssi:boot:rsh: module initializing
> n0<29283> ssi:boot:rsh:agent: rsh
> n0<29283> ssi:boot:rsh:username:
> n0<29283> ssi:boot:rsh:verbose: 1000
> n0<29283> ssi:boot:rsh:algorithm: linear
> n0<29283> ssi:boot:rsh:priority: 10
> n0<29283> ssi:boot: module available: rsh, priority: 10
> n0<29283> ssi:boot: finalizing module globus
> n0<29283> ssi:boot:globus: finalizing
> n0<29283> ssi:boot: closing module globus
> n0<29283> ssi:boot: Selected boot module rsh
>
> LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University
>
> n0<29283> ssi:boot:base: looking for boot schema in following directories:
> n0<29283> ssi:boot:base:
> n0<29283> ssi:boot:base:   $TROLLIUSHOME/etc
> n0<29283> ssi:boot:base:   $LAMHOME/etc
> n0<29283> ssi:boot:base:   /usr/local/etc
> n0<29283> ssi:boot:base: looking for boot schema file:
> n0<29283> ssi:boot:base:   /home/jracine/bhost.lam
> n0<29283> ssi:boot:base: found boot schema: /home/jracine/bhost.lam
> n0<29283> ssi:boot:rsh: found the following hosts:
> n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu (cpu=2)
> n0<29283> ssi:boot:rsh: resolved hosts:
> n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu --> 128.230.130.10 (origin)
> n0<29283> ssi:boot:rsh: starting RTE procs
> n0<29283> ssi:boot:base:linear: starting
> n0<29283> ssi:boot:base:server: opening server TCP socket
> n0<29283> ssi:boot:base:server: opened port 49832
> n0<29283> ssi:boot:base:linear: booting n0 (jracine.maxwell.syr.edu)
> n0<29283> ssi:boot:rsh: starting lamd on (jracine.maxwell.syr.edu)
> n0<29283> ssi:boot:rsh: starting on n0 (jracine.maxwell.syr.edu): hboot -t -c lam-conf.lamd -d -I -H 128.230.130.10 -P 49832 -n 0 -o 0
> n0<29283> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-io-socket
> tkill: f_kill = "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
> hboot: booting...
> hboot: fork /usr/local/bin/lamd
> [1] 29286 lamd -H 128.230.130.10 -P 49832 -n 0 -o 0 -d
> n0<29283> ssi:boot:rsh: successfully launched on n0 (jracine.maxwell.syr.edu)
> n0<29283> ssi:boot:base:server: expecting connection from finite list
> hboot: attempting to execute
> n-1<29286> ssi:boot: Opening
> n-1<29286> ssi:boot: opening module globus
> n-1<29286> ssi:boot: initializing module globus
> n-1<29286> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<29286> ssi:boot: module not available: globus
> n-1<29286> ssi:boot: opening module rsh
> n-1<29286> ssi:boot: initializing module rsh
> n-1<29286> ssi:boot:rsh: module initializing
> n-1<29286> ssi:boot:rsh:agent: rsh
> n-1<29286> ssi:boot:rsh:username:
> n-1<29286> ssi:boot:rsh:verbose: 1000
> n-1<29286> ssi:boot:rsh:algorithm: linear
> n-1<29286> ssi:boot:rsh:priority: 10
> n-1<29286> ssi:boot: module available: rsh, priority: 10
> n-1<29286> ssi:boot: finalizing module globus
> n-1<29286> ssi:boot:globus: finalizing
> n-1<29286> ssi:boot: closing module globus
> n-1<29286> ssi:boot: Selected boot module rsh
> n0<29283> ssi:boot:base:server: got connection from 128.230.130.10
> n0<29283> ssi:boot:base:server: this connection is expected (n0)
> n0<29283> ssi:boot:base:server: remote lamd is at 128.230.130.10:50206
> n0<29283> ssi:boot:base:server: closing server socket
> n0<29283> ssi:boot:base:server: connecting to lamd at 128.230.130.10:49833
> n0<29283> ssi:boot:base:server: connected
> n0<29283> ssi:boot:base:server: sending number of links (1)
> n0<29283> ssi:boot:base:server: sending info: n0 (jracine.maxwell.syr.edu)
> n0<29283> ssi:boot:base:server: finished sending
> n0<29283> ssi:boot:base:server: disconnected from 128.230.130.10:49833
> n0<29283> ssi:boot:base:linear: finished
> n0<29283> ssi:boot:rsh: all RTE procs started
> n0<29283> ssi:boot:rsh: finalizing
> n0<29283> ssi:boot: Closing
> n-1<29286> ssi:boot:rsh: finalizing
> n-1<29286> ssi:boot: Closing