| User Support & Documentation | |||||||||||||||||||||||
Running Jobs with MPICH-G2 (for Globus 2)On this page
Related Links
Need Help?MPICH-G2 is "a Grid-enabled implementation of the popular MPI (Message Passing Interface) framework. MPICH-G2 uses Grid capabilities for starting processes on remote systems, for staging executables and data to remote systems, and for security. MPICH-G2 also selects the appropriate communication method (high-performance local interconnect, Internet protocol, etc.) depending on whether messages are being passed between local or remote systems." The instructions below for running MPICH-G2 jobs on TeraGrid assume that the user already has acquired their Globus security credentials and has their DN entered in the gridmap files across the TeraGrid. 1. Setting up your environment with your ~/.softNOTE: These instructions are for use with CTSS 3. See the yellow callout box below for use with CTSS 2. For Globus Toolkit and MPICH-G2 built with Intel compilers your ~/.soft file should look like this: +mpich-g2-intel For Globus Toolkit and MPICH-G2 built with gcc compilers your ~/.soft file should look like this: +mpich-g2-gcc Setting up your MPICH-G2 environment in CTSS 2 If you want to use the Globus Toolkit and MPICH-G2 built with Intel compilers your ~/.soft file should look like this: @remove +globus If you want to use the Globus Toolkit and MPICH-G2 built with Gnu compilers your ~/.soft file should look like this: @remove +globus 2. Compiling your MPICH-G2 applicationCompile your application at each TeraGrid site. There are four MPICH-G2 compilers; for C programs use mpicc, for C++ use mpicxx, for Fortran77 use mpif77, and for Fortran90 use mpif90. These compilers should already be in your path based on the ~/.soft specfications in step 1. 3. Write your Globus RSL fileYour MPICH-G2 application can be executed on one or more TeraGrid sites in a single run. From your application's point of view it will appear as a single MPI_COMM_WORLD that spans the TeraGrid sites you're running on. You describe the job you want to run using the Globus Resource Specification Language (RSL). Click here for a full description of the RSL syntax. Here is an example RSL file to run the "allall_g2" program as a single 6-proc job across NCSA, SDSC, and ANL (2 processes at each site). Please note that for "count", ranks are packed onto clusters such that if a compute host has 2 processors, both will be used. Also note the use of queue= and project= RSL parameters where appropriate. (The project parameter is currently required at IU, Purdue, and UC/ANL.) CTSS 3 Users NoteFor NCSA users interested in testing MPICH-G2 within the CTSS 3 environment, please substitute the resource manager line with the following: (resourceManagerContact="grid-hg.ncsa.teragrid.org/jobmanager-pbs")
+
( &(resourceManagerContact="tg-login1.ncsa.teragrid.org/jobmanager-pbs_intel")
(count=2)
(queue=debug)
(jobtype=mpi)
(project="TG-STA040001N")
(maxwalltime=5)
(label="subjob 0")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
(GLOBUS_TCP_PORT_RANGE ""))
(directory="/home/ncsa/arnoldg/mpi")
(executable="/home/ncsa/arnoldg/mpi/allall_g2")
(arguments="10" "700" "2000")
)
( &(resourceManagerContact="tg-login1.sdsc.teragrid.org/jobmanager-pbs_intel")
(count=2)
(jobtype=mpi)
(project="TG-STA040001N")
(maxwalltime=5)
(label="subjob 1")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
(GLOBUS_TCP_PORT_RANGE ""))
(directory="/users/ux453915/mpi")
(executable="/users/ux453915/mpi/allall_g2")
(arguments="10" "700" "2000")
)
( &(resourceManagerContact="tg-grid1.uc.teragrid.org/jobmanager-pbs_intel")
(count=2)
(host_count=1:ia64-compute)
(jobtype=mpi)
(project="TG-STA040001N")
(maxwalltime=5)
(label="subjob 2")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 2)
(GLOBUS_TCP_PORT_RANGE "")
)
(directory="/home/arnoldg/mpi")
(executable="/home/arnoldg/mpi/allall_g2")
(arguments="10" "700" "2000")
)
Then run the command mpirun -globusrsl allall.rsl RSL Parameters by Site
Things to note about the RSL file:a. The resourceManagerContact is different for each site. Here are the values that you should specify for TG sites: UC/ANL Gnu: tg-grid1.uc.teragrid.org/jobmanager-pbs_gcc NCSA Gnu: tg-login1.ncsa.teragrid.org/jobmanager-pbs_gcc SDSC Gnu: tg-login1.sdsc.teragrid.org/jobmanager-pbs_gcc b. what you specify for count should match what you specify for host_count c. the host_count syntax for each site is slightly different. Here are examples when count=2. Note that the resourceManagerContact you specify to run at UC/ANL is determined by your choice of either Gnu- or Intel-built versions of Globus and MPICH-G2 and that the compute nodes you are assigned at UC/ANL (i.e., IA64 or IA32) is based on the host_count suffix. in other words, the resourceManagerContact you specify for UC/ANL is the same for IA64 and IA32 jobs. d. each "subjob" has a label and sets an env var GLOBUS_DUROC_SUBJOB_INDEX both of which are counters that start at 0 and increment with each subjob e. the entire RSL must start with a + f. executable full pathname is different for each subjob ... it is the location of the executable at that site. g. each subjob must have (jobtype=mpi). 4. Run your applicationLike any Globus job you must have a valid proxy to run a job. To create one use the grid-proxy-init command. With a valid proxy and an RSL file you are ready to run your program using MPICH-G2's mpirun command like this: mpirun -globusrsl myfile.rsl NOTE: The ~/.soft file that you defined in step 1 must look as you created it in step 1 when you launch the job. In other words, if you are launching your job from some TG site A where some of your processes will run on TG site B, your ~/.soft at site B must look like the one defined in step 1 even though you are not explicitly logging on to site B to run the job. Extras1. If you're running on both the IA-64 and IA-32 bit systems you can get improved performance in your MPI collective operations (e.g., MPI_Bcast) if you "tell" MPICH-G2 that the IA-64 and IA-32 nodes are both located at the same site. You do this by specifying specifying identical values for the environment variable GLOBUS_LAN_ID for the IA-32 and IA-64 subjobs. For example,
+
(
&(resourceManagerContact="tg-login1.sdsc.teragrid.org/jobmanager-
pbs_intel")
(count=2)
(host_count="2")
(jobtype=mpi)
(label="subjob 0")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
(executable="/users/ux421134/Apps/Intel/ring")
)
(
&(resourceManagerContact="tg-grid1.uc.teragrid.org/jobmanager-
pbs_intel")
(count=2)
(hostcount="2:ia64-compute")
(jobtype=mpi)
(label="subjob 1")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
(GLOBUS_LAN_ID foo))
(executable="/home/karonis/64bit/Apps/Intel/ring")
)
(
&(resourceManagerContact="tg-grid1.uc.teragrid.org/jobmanager-
pbs_intel")
(count=2)
(hostcount="2:ia32-compute")
(jobtype=mpi)
(label="subjob 2")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 2)
(GLOBUS_LAN_ID foo))
(executable="/home/karonis/32bit/Apps/Intel/ring")
)
2. Toplogy discovery mechanismNote: The topology discovery mechanism described in this section works directly only from C/C++. It does not work directly from F77/F90. Fortran applications that want to make use of MPICH-G2's topology discovery mechanism should write a C function that calls the MPICH-G2 routines described in this section, call that C function from their Fortran application, and have their C function return the topology information to the Fortran application. MPICH-G2 communicates over TCP or a vendor-supplied MPI (vMPI). However, some of MPI's collective operations are implemented in MPICH-G2 by making a distinction between WAN-TCP, LAN-TCP, and intra-machine TCP. Some MPI applications could make use of this TCP stratification by creating communicators (e.g., MPI_Comm_split) that cluster processes based on this topology information. To that end, MPICH-G2 has added two attributes associated with every communicator, MPICHX_TOPOLOGY_DEPTHS and MPICHX_TOPOLOGY_COLORS. Described briefly, MPICH-G2 processes communicate using either TCP or, where available, the preferred vMPI. Associated with this multi-protocol design is something we call topology depth. Processes that communicate using TCP only have a topology depth=3 (lvl0=WAN-TCP, lvl1=LAN-TCP, and lvl2=intra-machine TCP) while processes that can communicate using vMPI have a topology depth=4 (lvl3=vMPI). MPICHX_TOPOLOGY_DEPTHS is a vector (length = communicator size) of integers in which the i-th element is the topology depth of the i-th ranked process in the communicator. Using these ordered lists of multi-method communication we can ask the question "Can process A communicate with process B at multi-protocol level i?" For example, any two processes can communicate with each other at lvl0=WAN-TCP. Processes can communicate at lvl1=LAN-TCP if and only if they are in the same LAN cluster, they can communicate at lvl2=intra-machine TCP if and only if they are in the same RSL subjob, and they can communicate at lvl3=vMPI if and only if they are in the same RSL subjob and that subjob specifies (jobtype=mpi). We capture this notion of "being able to communicate at a particular level" by defining something that we call color. At any given level two processes have the same color (an integer value always >=0) if and only if they can communicate at that level. MPICHX_TOPOLOGY_COLORS is a vector (again, length = communicator size) of integer pointers in which the i-th element is, in turn, a pointer to a vector of integers (length = MPICHX_TOPOLOGY_DEPTHS[i]) and MPICHX_TOPOLOGY_COLORS[i][j] is the color of the i-th ranked process at level j (note that those processes that cannot communicate over vMPI have a topology-depth=3, and therefore, do not have a color defined at MPICHX_TOPOLOGY_COLORS[i][3]). To illustrate the use of MPICHX_TOPOLOGY_DEPTHS and MPICHX_TOPOLOGY_COLORS we provide an example MPI application, report_colors.c.
#include <mpi.h>
#include <stdio.h>
void print_topology(int me,
int size,
int *depths,
int **colors)
{
int i, j, max = 0;
FILE *fp;
char fname[100];
sprintf(fname, "colors.%d", me);
if (!(fp = fopen(fname, "w")))
{
fprintf(stderr,
"ERROR: could not open fname %s\n",
fname);
MPI_Abort(MPI_COMM_WORLD, 1);
} /* endif */
fprintf(fp, "proc\t");
for (i = 0; i < size; i++)
fprintf(fp, "% 3d", i);
fprintf(fp, "\nDepths\t");
for (i = 0; i < size; i++)
{
fprintf(fp, "% 3d", depths[i]);
if ( max < depths[i] )
max = depths[i];
} /* endfor */
for (j = 0; j < max; j++)
{
fprintf(fp, "\nlvl %d\t", j);
for (i = 0; i < size; i++)
if ( j < depths[i] )
fprintf(fp, "% 3d", colors[i][j]);
else
fprintf(fp, " ");
} /* endfor */
fprintf(fp, "\n");
fclose(fp);
return;
} /* end print_topology() */
int main (int argc, char *argv[])
{
int me, nprocs, flag, rv;
int *depths;
int **colors;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &me);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
rv = MPI_Attr_get(MPI_COMM_WORLD,
MPICHX_TOPOLOGY_DEPTHS,
&depths,
&flag);
if ( rv != MPI_SUCCESS )
{
printf("MPI_Attr_get(depths) failed, aborting\n");
MPI_Abort(MPI_COMM_WORLD, 1);
} /* endif */
if ( flag == 0 )
{
printf("MPI_Attr_get(depths): depths not available...\n");
MPI_Abort(MPI_COMM_WORLD, 1);
} /* endif */
rv = MPI_Attr_get(MPI_COMM_WORLD,
MPICHX_TOPOLOGY_COLORS,
&colors,
&flag);
if ( rv != MPI_SUCCESS )
{
printf("MPI_Attr_get(colors) failed, aborting\n");
MPI_Abort(MPI_COMM_WORLD, 1);
} /* endif */
if ( flag == 0 )
{
printf("MPI_Attr_get(colors): depths not available...\n");
MPI_Abort(MPI_COMM_WORLD, 1);
} /* endif */
print_topology(me, nprocs, depths, colors);
MPI_Finalize();
return 0;
} /* end main() */
The first thing to note is that the vectors depth and color returned by MPI_Attr_get should not be freed by the MPI application (i.e., it is erroneous to do so). The second thing to note is that, for a given communicator, every process in that communicator gets identical values for both MPICHX_TOPOLOGY_DEPTHS and MPICHX_TOPOLOGY_COLORS. This will always be the case and provides a convenient mechanism for MPI applications to create new grid-aware communicators that cluster processes based on process topology. For example, one could partition MPI_COMM_WORLD into LAN-TCP clusters adding the following to report_colors.c:
MPI_Comm LANcomm;
MPI_Comm_split(MPI_COMM_WORLD,
colors[me][1],
0,
&LANcomm);
or perhaps partitioning MPI_COMM_WORLD into vMPI clusters by adding:
MPI_Comm Vcomm;
MPI_Comm_split(MPI_COMM_WORLD,
(depths[me] == 4 ? colors[me][3] : -1),
0,
&Vcomm);
which has the added benefit(?) of placing all processes that cannot communicate over vMPI into a single group (recall that colors are defined to be >= 0). Alternatively, one could partition MPI_COMM_WORLD into vMPI clusters by adding:
MPI_Comm Vcomm;
MPI_Comm_split(MPI_COMM_WORLD,
(depths[me] == 4 ? colors[me][3] :
MPI_UNDEFINED),
0,
&Vcomm);
in which case Vcomm is set to MPI_COMM_NULL for those processes that cannot communicate over vMPI. |
|||||||||||||||||||||||
![]() |
![]() |
|
The TeraGrid project is funded by the National Science Foundation
and includes 11 partners: Please email help@teragrid.org with questions or comments. |
||
![]() |
![]() |