From: Jeffrey Racine
To: freebsd-amd64@freebsd.org, freebsd-cluster@freebsd.org
Organization: Syracuse University
Date: Sat, 10 Apr 2004 18:21:46 -0400
Subject: LAM MPI on dual processor opteron box sees only one cpu...

Hi.

I am converging on getting a new dual Opteron box running. I am now setting up and testing LAM MPI; however, the OS is not farming out the job as expected, and sees only one processor. The same job runs fine on RH 7.3 and RH 9.0, both on a cluster and on a dual-processor PIV desktop. I am running 5-current.

Basically, mpirun -np 1 binaryfile has the same runtime as mpirun -np 2 binaryfile, while on the dual PIV box the -np 2 run takes half the time. When I check top, both processes started by mpirun -np 2 run on CPU 0. Here is the relevant portion from top with -np 2...

9306 jracine 4 0 7188K 2448K sbwait 0 0:03 19.53% 19.53% n_lam
29307 jracine 119 0 7148K 2372K CPU0 0 0:03 19.53% 19.53% n_lam

I include output from laminfo, dmesg (CPU-relevant info), and lamboot -d bhost.lam. Any suggestions are most appreciated, and thanks in advance!

-- laminfo

LAM/MPI: 7.0.4
Prefix: /usr/local
Architecture: amd64-unknown-freebsd5.2
Configured by: root
Configured on: Sat Apr 10 11:22:02 EDT 2004
Configure host: jracine.maxwell.syr.edu
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (Module v0.5)
SSI boot: rsh (Module v1.0)
SSI coll: lam_basic (Module v7.0)
SSI coll: smp (Module v1.0)
SSI rpi: crtcp (Module v1.0.1)
SSI rpi: lamd (Module v7.0)
SSI rpi: sysv (Module v7.0)
SSI rpi: tcp (Module v7.0)
SSI rpi: usysv (Module v7.0)

-- dmesg sees two cpus...

CPU: AMD Opteron(tm) Processor 248 (2205.02-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0xf58 Stepping = 8
Features=0x78bfbff
AMD Features=0xe0500800
real memory = 3623813120 (3455 MB)
avail memory = 3494363136 (3332 MB)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1

-- bhost has the requisite information

128.230.130.10 cpu=2 user=jracine

-- Here are the results from lamboot -d bhost.lam

-bash-2.05b$ lamboot -d ~/bhost.lam
n0<29283> ssi:boot: Opening
n0<29283> ssi:boot: opening module globus
n0<29283> ssi:boot: initializing module globus
n0<29283> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<29283> ssi:boot: module not available: globus
n0<29283> ssi:boot: opening module rsh
n0<29283> ssi:boot: initializing module rsh
n0<29283> ssi:boot:rsh: module initializing
n0<29283> ssi:boot:rsh:agent: rsh
n0<29283> ssi:boot:rsh:username:
n0<29283> ssi:boot:rsh:verbose: 1000
n0<29283> ssi:boot:rsh:algorithm: linear
n0<29283> ssi:boot:rsh:priority: 10
n0<29283> ssi:boot: module available: rsh, priority: 10
n0<29283> ssi:boot: finalizing module globus
n0<29283> ssi:boot:globus: finalizing
n0<29283> ssi:boot: closing module globus
n0<29283> ssi:boot: Selected boot module rsh

LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University

n0<29283> ssi:boot:base: looking for boot schema in following directories:
n0<29283> ssi:boot:base:
n0<29283> ssi:boot:base: $TROLLIUSHOME/etc
n0<29283> ssi:boot:base: $LAMHOME/etc
n0<29283> ssi:boot:base: /usr/local/etc
n0<29283> ssi:boot:base: looking for boot schema file:
n0<29283> ssi:boot:base: /home/jracine/bhost.lam
n0<29283> ssi:boot:base: found boot schema: /home/jracine/bhost.lam
n0<29283> ssi:boot:rsh: found the following hosts:
n0<29283> ssi:boot:rsh: n0 jracine.maxwell.syr.edu (cpu=2)
n0<29283> ssi:boot:rsh: resolved hosts:
n0<29283> ssi:boot:rsh: n0 jracine.maxwell.syr.edu --> 128.230.130.10 (origin)
n0<29283> ssi:boot:rsh: starting RTE procs
n0<29283> ssi:boot:base:linear: starting
n0<29283> ssi:boot:base:server: opening server TCP socket
n0<29283> ssi:boot:base:server: opened port 49832
n0<29283> ssi:boot:base:linear: booting n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:rsh: starting lamd on (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:rsh: starting on n0 (jracine.maxwell.syr.edu): hboot -t -c lam-conf.lamd -d -I -H 128.230.130.10 -P 49832 -n 0 -o 0
n0<29283> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-io-socket
tkill: f_kill = "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
tkill: nothing to kill: "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1] 29286 lamd -H 128.230.130.10 -P 49832 -n 0 -o 0 -d
n0<29283> ssi:boot:rsh: successfully launched on n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
n-1<29286> ssi:boot: Opening
n-1<29286> ssi:boot: opening module globus
n-1<29286> ssi:boot: initializing module globus
n-1<29286> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<29286> ssi:boot: module not available: globus
n-1<29286> ssi:boot: opening module rsh
n-1<29286> ssi:boot: initializing module rsh
n-1<29286> ssi:boot:rsh: module initializing
n-1<29286> ssi:boot:rsh:agent: rsh
n-1<29286> ssi:boot:rsh:username:
n-1<29286> ssi:boot:rsh:verbose: 1000
n-1<29286> ssi:boot:rsh:algorithm: linear
n-1<29286> ssi:boot:rsh:priority: 10
n-1<29286> ssi:boot: module available: rsh, priority: 10
n-1<29286> ssi:boot: finalizing module globus
n-1<29286> ssi:boot:globus: finalizing
n-1<29286> ssi:boot: closing module globus
n-1<29286> ssi:boot: Selected boot module rsh
n0<29283> ssi:boot:base:server: got connection from 128.230.130.10
n0<29283> ssi:boot:base:server: this connection is expected (n0)
n0<29283> ssi:boot:base:server: remote lamd is at 128.230.130.10:50206
n0<29283> ssi:boot:base:server: closing server socket
n0<29283> ssi:boot:base:server: connecting to lamd at 128.230.130.10:49833
n0<29283> ssi:boot:base:server: connected
n0<29283> ssi:boot:base:server: sending number of links (1)
n0<29283> ssi:boot:base:server: sending info: n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:base:server: finished sending
n0<29283> ssi:boot:base:server: disconnected from 128.230.130.10:49833
n0<29283> ssi:boot:base:linear: finished
n0<29283> ssi:boot:rsh: all RTE procs started
n0<29283> ssi:boot:rsh: finalizing
n0<29283> ssi:boot: Closing
n-1<29286> ssi:boot:rsh: finalizing
n-1<29286> ssi:boot: Closing
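
For anyone wanting to repeat the top check on a longer run, here is a rough Python sketch that reads top process lines and reports which CPU numbers the n_lam processes were on. It is only an illustration: the 12-column layout (PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND) is assumed from the FreeBSD top SMP display, the names TopRow, parse_top_line and cpus_in_use are made up for this example, and the sample text is the two lines quoted from top in this message, copied as they appear.

```python
from collections import namedtuple

# Assumed column layout (FreeBSD top, SMP display):
# PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
TopRow = namedtuple("TopRow", "pid user state cpu wcpu command")

# The two process lines quoted from top in this message, copied as they appear.
SAMPLE = """\
9306 jracine 4 0 7188K 2448K sbwait 0 0:03 19.53% 19.53% n_lam
29307 jracine 119 0 7148K 2372K CPU0 0 0:03 19.53% 19.53% n_lam
"""


def parse_top_line(line):
    """Split one top process line into the fields used below."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError("unexpected number of columns: %r" % line)
    return TopRow(
        pid=int(fields[0]),
        user=fields[1],
        state=fields[6],
        cpu=int(fields[7]),                 # the "C" column: CPU number
        wcpu=float(fields[9].rstrip("%")),  # weighted CPU percentage
        command=fields[11],
    )


def cpus_in_use(text, command):
    """Return the sorted set of CPU numbers seen for the given command."""
    rows = [parse_top_line(l) for l in text.splitlines() if l.strip()]
    return sorted({row.cpu for row in rows if row.command == command})


if __name__ == "__main__":
    # With -np 2 on a dual-CPU box one would hope to see two CPU numbers here.
    print(cpus_in_use(SAMPLE, "n_lam"))
```

On the two lines above this reports a single CPU number, 0, which matches what is described at the top of this message. A single top snapshot is weak evidence on its own, since the scheduler can move processes between samples, so feeding it several snapshots would be more convincing.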