Jobs: MyCluster
Introduction

The MyCluster tool is an environment that behaves like an additional resource control plane, constructing a personal cluster composed of CPUs from multiple systems on the TeraGrid. MyCluster adapts and migrates jobs transparently between authorized resources to optimize the overall throughput of the entire workflow computation. First, the user defines the allocated resources to use. Then, the user specifies the initial number of parallel job proxies with maximum run-time durations. The system then submits the job proxies to the individual resources. The user submits jobs as if using a local host running a Condor pool or Condor DAGMan. Using DAGMan in a pipeline is particularly useful for distributed parallel computing with dependencies. As MyCluster makes CPU resources available, DAGMan can schedule jobs on them.

Overview

MyCluster creates virtual Condor [1] pools across TeraGrid clusters connected through Globus [4]. The tool provides a single Condor submission interface to the entire heterogeneous collection of computational resources on the TeraGrid. MyCluster uses GridShell [2] to provide a transparent execution framework for submitting and maintaining Condor starter jobs (known as daemons) in local batch queuing systems on the TeraGrid clusters. These Condor starter jobs can either
A sample MyCluster login session is shown in Figure 1 (screenshot formatted slightly for display).
Figure 1 - A sample MyCluster login session from a client workstation incorporating compute nodes at NCSA and SDSC. This screenshot highlights several important points.
To add TeraGrid resources to a pre-existing Condor pool, different options need to be specified when the vo-login command is invoked. This is described later in the User Guide section. Also, when a session is detached, the user can reattach to it with the vo-attach command from a different terminal or workstation. This command lists the existing sessions, from which the user can select one to reattach and resume interaction with the virtual Condor pool. This is also described in greater detail in the User Guide section.

Settings

You can also configure additional options in MyCluster by setting environment variables in your login script (.cshrc or .bashrc). These options include setting the project account to which the Condor starter jobs will be charged, limiting the maximum number of starter jobs at each site to enable "pending" job load-balancing, and enabling periodic checker scripts that detect faulty site conditions and disable job dispatch to those sites. The following examples are in TCSH syntax (convert accordingly for a BASH environment).

Project Accounting

If you wish to have your Condor starter jobs submitted and charged to a non-default account, add the following optional line to your login script:

    setenv _GRID_PROJECT_NAME <LSF/PBS project account>

Your jobs will now be charged to the specified account when the starter jobs are submitted by MyCluster on your behalf.

Load-Balancing "Pending" Jobs across Clusters

MyCluster will migrate "pending" Condor starter jobs between cluster sites if you specify a limit on the maximum number of Condor starter jobs in your login script:

    setenv _GRID_MAX_THROTTLE 10

This causes pending jobs to be migrated to sites that have shorter queue wait times, up to the number of jobs set in this configuration variable. If it is not set, only the specified number of Condor starter jobs for the site will be submitted.
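Since the settings in this section are given in TCSH syntax, the following is a minimal sketch of how a simple `setenv NAME VALUE` line maps to its BASH form, `export NAME=VALUE`. The helper name `setenv_to_export` and the project name `my_project` are hypothetical and used only for illustration; the sketch handles plain single-value lines like the ones shown here and is not a general TCSH translator.

```python
import shlex


def setenv_to_export(line: str) -> str:
    """Convert one tcsh 'setenv NAME VALUE' line to a bash 'export NAME=VALUE' line.

    Trailing '# ...' comments are dropped. Only the simple three-word form
    is supported (an assumption made for this sketch).
    """
    parts = shlex.split(line, comments=True)
    if len(parts) != 3 or parts[0] != "setenv":
        raise ValueError(f"not a simple setenv line: {line!r}")
    _, name, value = parts
    # shlex.quote leaves plain values untouched and quotes anything unsafe.
    return f"export {name}={shlex.quote(value)}"


if __name__ == "__main__":
    tcsh_lines = [
        "setenv _GRID_PROJECT_NAME my_project",  # hypothetical account name
        "setenv _GRID_MAX_THROTTLE 10",
    ]
    for tcsh_line in tcsh_lines:
        print(setenv_to_export(tcsh_line))
```

The resulting `export` lines would go in .bashrc in place of the `setenv` lines in .cshrc; the same mapping applies to the other `_GRID_*` variables in this section.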
The default interval at which MyCluster migrates "pending" Condor starter jobs is 600 seconds. You can change this by setting the _GRID_GETJOB_INTERVAL environment variable:

    setenv _GRID_GETJOB_INTERVAL 300 # every 300 seconds

Site Checker Scripts

You can also specify the location of a checker script for MyCluster to run periodically on the login node with the _GRID_USER_CHECKER environment variable. An example entry in your login script could be:

    setenv _GRID_USER_CHECKER /home/ewalker/checker.sh

If the script, checker.sh, exits with a value other than "0", MyCluster will not submit any more jobs to the site during your login session. This is useful if you wish to prevent jobs from running at a site when commonly occurring error conditions, such as a full scratch space or missing libraries, are detected.

The default interval at which this user checker script is invoked is 600 seconds. You can change this by setting _GRID_USERCHECK_INTERVAL:

    setenv _GRID_USERCHECK_INTERVAL 60 # every 60 seconds

User Guide

This section describes how the vo-login command is used, the format of the VO configuration file, and how the configuration of the virtual Condor pool can be modified. It ends with some simple examples of typical TeraGrid user login sessions.

The vo-login command

Name

    vo-login - virtual login command for creating a MyCluster session

Synopsis

    vo-login [-d] [-h] [-n <jobs:size>] [-m] [-H <conf_file>] [-W <mins>] [-T] [-M <host>] [-J]
    vosub.pl [-d] [-h] [-n <jobs:size>] [-m] [-H <conf_file>] [-W <mins>] [-T] [-M <host>] [-J]

Description

The options are as follows:

    -d
    -h
    -n <jobs:size>
    -m
    -H <conf_file>
        The vo-login configuration file is an ASCII file listing the clusters contributing to a virtual organization. Comments in the file are introduced with "#", and the _GRID_THROTTLE and _GRID_JOB_SIZE configuration variables are used to specify site-specific Condor starter job submission requirements; e.g. tg-login.tacc.utexas.edu%_GRID_THROTTLE=2:_GRID_JOB_SIZE=32 specifies 2 Condor starter jobs of size 32 processors each submitted to the cluster tg-login.tacc.utexas.edu. Note that if the _GRID_MAX_THROTTLE specification is set at the participating sites, the starter submission requirements are treated as initial "hints" by the system. MyCluster will attempt to move pending Condor starter jobs between sites to minimize their queue wait times.
    -W <mins>
    -M <host>
    -T
    -J
    -u

The configuration file

The vo-login configuration file is an ASCII file listing the clusters contributing to a virtual organization. Comments in the file are introduced with "#", and the _GRID_THROTTLE and _GRID_JOB_SIZE configuration variables are used to specify site-specific Condor starter job submission requirements. Note that if the _GRID_MAX_THROTTLE specification is set at the participating sites, these Condor starter submission requirements are initially treated only as "hints" by the system. The system will attempt to move pending Condor starter jobs between sites to minimize their queue wait times.

You can also specify the remote access protocol for each participating cluster in your virtual organization. The currently supported protocols are Globus-GRAM, GSI-SSH, and SSH. Use the following URL format to specify the protocol that MyCluster uses to start remote agents at a remote cluster site:

    [proto://]<site-address>

For example, the URL gsissh://tg-login.tacc.utexas.edu specifies that the GSI-SSH protocol should be used by MyCluster to start agents at the TACC cluster. The keywords for the supported protocols are gram, gsissh, and ssh for Globus-GRAM, GSI-SSH, and SSH respectively. The default protocol is always Globus-GRAM.

A sample configuration file is shown below:

    # vo-login configuration file - comment

The example configuration file indicates that the clusters at NCSA and SDSC are part of the vo-login virtual organization.
The configuration explicitly specifies that the system should submit 5 jobs of 32 processors each at NCSA, overriding what is specified in the -n option when the vo-login command is invoked.

Configuration Template Files

MyCluster uses template files for creating the actual configuration files for the Condor processes. The template files for the master processes (i.e. Collector, Negotiator, Schedd) and the starter process (i.e. Startd) are respectively located at:
You may also provide your own template files, overriding some of the default options set in the virtual Condor pool instantiation. To do this, copy both condor_config.master.template and condor_config.glidein.template to your own private directory. Then, in your login script (.cshrc or .bashrc), set the environment variable _GRID_TEMPLATE_DIR to point to this directory. You may now modify the template files to reset some of the default options in your private Condor pool. Note that configuration values marked with 'XXX' are replaced by the MyCluster tool. You may also modify these values as appropriate to enable different pool setups.

Examples

Sample Condor Submit File

    # File: mysub.file
    # Running 100 instances of "a.out -i <rank>"
    #
    # Consult http://www.cs.wisc.edu/condor for general info on Condor and
    # http://www.cs.wisc.edu/condor/manual/v6.6/condor_submit.html
    # for info about condor_submit
    # IMPORTANT: TimeToLive is the expected execution time of your application
    # in secs, and this _must_ be set with the job Requirements.
    Executable = a.out
    Arguments = -i $(PROCESS)
    Universe = vanilla
    Requirements = (TimeToLive > 300) && (FileSystemDomain != "dummy") && (Arch != "dummy") && (OpSys != "dummy")
    Should_transfer_files = true
    When_to_transfer_output = on_exit
    output = log/out.$(CLUSTER).$(PROCESS)
    error = log/err.$(CLUSTER).$(PROCESS)
    queue 100

Example Cross-Site Runs on TeraGrid

An example MyCluster run on the TeraGrid is started with the following command line invocation:

    %> vo-login -H ~/vo.conf -n 5:32 -W 280

This submits 5 Condor starter jobs, with 32 processors and a wall-clock limit of 280 minutes each, on each of the sites specified in the vo.conf configuration file. The content of the configuration file is as follows:

    # file contains remote sites in the cross-site run

The local site (where the vo-login command is invoked) does not need to be listed in the file.
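To make the site-entry format concrete, here is a minimal sketch of how one line of a vo.conf file could be interpreted, with the `-n <jobs:size>` values acting as defaults that per-site `_GRID_THROTTLE` and `_GRID_JOB_SIZE` settings override. This is not MyCluster's own parser: the names `SiteEntry` and `parse_site_line` are hypothetical, and the sketch assumes that a `proto://` prefix and `%` settings may appear on the same line and that "#" may start a comment anywhere on a line.

```python
from dataclasses import dataclass
from typing import Optional

# Protocol keywords listed in the configuration file description.
SUPPORTED_PROTOCOLS = {"gram", "gsissh", "ssh"}


@dataclass
class SiteEntry:
    protocol: str  # gram, gsissh, or ssh
    host: str      # site address
    throttle: int  # number of Condor starter jobs to submit
    job_size: int  # processors per starter job


def parse_site_line(line: str, default_jobs: int, default_size: int) -> Optional[SiteEntry]:
    """Parse '[proto://]<site-address>[%VAR=value[:VAR=value]]'.

    Returns None for blank or comment-only lines.
    """
    text = line.split("#", 1)[0].strip()
    if not text:
        return None
    address, _, settings = text.partition("%")
    protocol, sep, host = address.partition("://")
    if not sep:
        # No explicit protocol: Globus GRAM is the documented default.
        protocol, host = "gram", address
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    values = {"_GRID_THROTTLE": default_jobs, "_GRID_JOB_SIZE": default_size}
    if settings:
        for item in settings.split(":"):
            name, _, value = item.partition("=")
            if name not in values:
                raise ValueError(f"unknown setting: {name!r}")
            values[name] = int(value)
    return SiteEntry(protocol, host.strip(), values["_GRID_THROTTLE"], values["_GRID_JOB_SIZE"])


if __name__ == "__main__":
    # Defaults correspond to 'vo-login -n 5:32'.
    for conf_line in [
        "# file contains remote sites in the cross-site run",
        "gsissh://tg-login.tacc.utexas.edu",
        "tg-login.tacc.utexas.edu%_GRID_THROTTLE=2:_GRID_JOB_SIZE=32",
    ]:
        print(parse_site_line(conf_line, default_jobs=5, default_size=32))
```

Under these assumptions, a site listed without settings picks up the `-n` values, while a site with explicit settings keeps its own starter count and job size.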
If you do not want jobs to be submitted to your local client workstation when you invoke a cross-site MyCluster run, use the -m option as shown below:

    %> vo-login -H ~/vo.conf -n 5:32 -W 280 -m

The vo.conf configuration file can be used to specify a different number of Condor starters to be submitted for each site in your virtual organization. This is done with the configuration variables _GRID_JOB_SIZE and _GRID_THROTTLE. An example configuration file with individual settings for each site is as follows:

    tg-login.ncsa.teragrid.org%_GRID_THROTTLE=3:_GRID_JOB_SIZE=8
    tg-login.tacc.teragrid.org%_GRID_THROTTLE=5:_GRID_JOB_SIZE=32

The configuration above starts 3 Condor starter jobs of 8 processors each at NCSA, and 5 Condor starter jobs of 32 processors each at TACC.

To add TeraGrid compute resources to your local Condor pool, where the master host is tejas.utexas.edu, you can issue the following command:

    %> vo-login -H ~/vo.conf -M tejas.utexas.edu -u -J

Note that the local Condor pool needs to have the HOSTALLOW_READ and HOSTALLOW_WRITE configuration set appropriately to allow TeraGrid resources to join this pool.

Adding TeraGrid resources to an existing Condor pool

Users with existing departmental Condor pools can continue submitting jobs to their departmental pool, and have MyCluster add TeraGrid resources to their local pool during peak job submission periods. Users thus continue to use the default Condor commands in their departmental pool to manage and control their job submissions, whilst MyCluster manages the submission (and re-submission) of Condor starter daemons through the different resource managers on the TeraGrid clusters.

There are multiple ways of configuring MyCluster to do this; we highlight one example of how it may be achieved. In our scenario, we assume the user wants his/her jobs to run on any local or TeraGrid-contributed CPU resource.
There are three steps that need to be performed in order to do this:

Step 1: Configure your local Condor pool to give TeraGrid permission to add resources to it. This can be achieved by adding the following configuration line to your local Condor installation:

    HOSTALLOW_READ = 129.114.*, *.teragrid.org

Step 2: Configure the local MyCluster template configuration file on each cluster to start jobs only with a specified job classAd, e.g. Tg_Resource. To do this, create a personal template directory at each cluster and copy the Condor starter configuration file (condor_config.glidein.template) to this directory, remembering to set _GRID_TEMPLATE_DIR in your login environment as well. Then modify the START configuration as follows:

    START = (TARGET.Project =?= "Tg_Resource")

Step 3: Submit your jobs advertising a TeraGrid project classAd. You can do this by adding the following line to your Condor submit file:

    +Project = "Tg_Resource"

This ensures that only your jobs will be run on TeraGrid resources contributed to your local Condor pool.

References

[1] The Condor Workload Management System, http://www.cs.wisc.edu/condor
The TeraGrid project is funded by the National Science Foundation
and includes 11 partners:

Please email help@teragrid.org with questions or comments.