|
TERAGRID Data Primer
TeraGrid Data Working Group
January 2004

1. What is a “data service” in the TeraGrid?
2. What data services are available on the TeraGrid?
3. What is the core set of services supported at each site of the TeraGrid?
4. Are the same services available at every site in the TeraGrid?
5. How do I get started using the TeraGrid Data Services?
6. Should one make an explicit reservation request for data services?
7. What is the most useful way by which to communicate application data needs to the TeraGrid?
8. Is there User Services support available to address the data needs of an application?
9. Are there good examples of data-oriented applications in the TeraGrid?
10. How does a user find out local environment variable settings and policies?

1. What is a “data service” in the TeraGrid?

The term “data service” refers to a TeraGrid capability that provides support for data management (storage, access, organization, analysis, manipulation) or data transfer activities. Emphasis is placed on capabilities that are required by data-intensive applications.

2. What
data services are available on the TeraGrid?

The TeraGrid provides a set of services that are accessible in the distributed TeraGrid environment and provide capabilities that are essential to a variety of scientific applications. These can be broadly classified into data management and data transfer services, as listed below:

a. Data management services:
   i. Filesystems: Disk-based filesystem capabilities to enable storing of data and program files, including parallel filesystems. Data is immediately accessible, and filesystems offer a range of backup and purge policies to suit a variety of needs.
   ii. Archives: Disk and tape-based hierarchical storage management systems for long-term storage of large files.
   iii. Collection management: The Storage Resource Broker (SRB) enables users to store and access collections of files/data sets, along with associated user-specified metadata, using a variety of backend storage systems. Detailed information on the SRB can be found at http://www.npaci.edu/Research/DI/srb/ [1].
   iv. Federated data sources: Database management system (DBMS) capability that enables federation of multiple relational databases and/or structured files.

b. Remote data transfer services:
   i. gridFTP: Transfers between storage resources at different sites in the TeraGrid are supported using the gridFTP protocol and associated client software. In particular, globus-url-copy is provided for data transfers.
   ii. SRB: Remote transfers can also be performed using the SRB middleware, either between two SRB servers within the TeraGrid, or from a non-SRB source to SRB (e.g., using the Sput or Sbload commands), or from SRB to a non-SRB destination (e.g., using the Sget or Sbunload commands).
   iii. gsiscp: Transfer of small amounts of data between remote storage systems can also be done using gsiscp, which is a GSI-enabled version of scp.

Additional information on data transfers is available at: www.teragrid.org/userinfo/data/transfer.php

3. What
is the core set of services supported at each site of the TeraGrid?

Each site in the TeraGrid provides a home filesystem, node-local scratch, and a parallel filesystem.

The MPI-IO parallel I/O routines are available at each site via the mpich-gm library and compilers, which are part of the Common TeraGrid Software Stack (CTSS). See the I/O chapter of the MPI-2 standard, found at http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html, for information on the MPI-IO routines. These routines will also be supported by mpich-vmi2, which will be part of CTSS Version 2. (Note: beta releases of VMI Version 2 are available now at NCSA and SDSC and may be accessed by modifying your softenv (soft add +mpich-vmi-2.0b3-gcc-r1 or soft add +mpich-vmi-2.0b3-intel-r1) before compiling or using mpirun.)

The Hierarchical Data Format (HDF) provides structured file formats optimized for scientific data management. More information about HDF can be found at http://hdf.ncsa.uiuc.edu. Both HDF4 and HDF5 libraries and tools are supported on TeraGrid as part of CTSS. The Softenv keys HDF4_HOME and HDF5_HOME provide the paths to the installation directories at each site. In addition, a parallel version of HDF5 (PHDF5) that is optimized to work with parallel filesystems is available on some systems and will be included in future versions of CTSS. For the location of PHDF5 and the most recent releases of HDF4 and HDF5, see http://hdf.ncsa.uiuc.edu/ncsa_support/Teragrid_Clusters.htm.

Also, from each TeraGrid site, applications are able to access the same set of TeraGrid data services using associated client software that is part of the CTSS. These services include:

- ability to transfer data using gridFTP (with globus-url-copy)
- ability to access SRB data collections using the SRB client
- ability to access the DB2 database using the DB2 client (in Version 2 of CTSS).

4. Are
the same services available at every site in the TeraGrid?

Yes, as mentioned above, the same services are accessible from any site in the TeraGrid. However, it is not expected that every TeraGrid site will implement all services. In other words, some services may be remotely invoked across the Grid. Each site will implement services based on its local expertise and capabilities. Indeed, even if a given service such as the GPFS parallel filesystem is implemented at multiple sites, the characteristics of the service, such as the storage capacity, may vary among sites.

The TeraGrid contains multiple parallel filesystem implementations across sites. Some sites have deployed GPFS, while other sites have deployed PVFS and the HP PFS.

Not all TeraGrid sites provide an archival storage (Hierarchical Storage Management, or HSM) server, and different HSMs are in use at different sites. NCSA provides the UniTree system, SDSC provides the HPSS and SAM-QFS systems, and PSC provides a SLASH / SGI DMF facility. Caltech provides HPSS as well. The different archival storage systems are accessible via GSI-enabled FTP or gridFTP.

SRB servers are provided at SDSC and PSC. At SDSC, the SRB serves some of the storage allocated from the SAM-QFS filesystem. The SRB server is hosted on a Sun 15K system, which also hosts the associated SRB Metadata Catalog (MCAT). At PSC, SRB and GridFTP have been integrated with the SLASH / DMF system.

The TeraGrid database server is implemented at SDSC using IBM’s DB2 Version 8 and DB2 Information Integrator, which provides the data federation service. The DB2 server is implemented on two 32-node IBM Regatta systems, one with 128GB of main memory and another with 256GB of memory. Both systems have access to about 10TB of SAN-attached storage. Within each SMP node, DB2 is configured using multiple logical partitions, which provides for better resource utilization and scaling of databases and database workloads. Further, the setup also provides for physical partitioning of databases across the two SMP nodes.

The per-site storage capacities are shown in Table I.

5. How do I get started using the TeraGrid Data
Services?

The first step is to acquire a TeraGrid account via the grants process. The online application for adding users to an existing TeraGrid account is found at http://accounts.teragrid.org/add_user.html. Once you have an account, you may use the mounted filesystems (TG_CLUSTER_HOME, TG_CLUSTER_PFS, TG_CLUSTER_SCRATCH, TG_NODE_SCRATCH, etc.) at any time.

Using some of the advanced data services requires a few additional steps:

- To utilize GridFTP or other GSI-enabled services, you will need to acquire a certificate. Refer to http://www.teragrid.org/userinfo/access/index.php for an overview of authentication methods. Authenticate against your certificate before initiating actions involving those services, and your authentication will be propagated where necessary.
- To utilize SRB, you will need to acquire an SRB account by filling out the form at http://www.npaci.edu/DICE/SRB/install/SRBUserRegister.html. When you receive your SRB account you will be given a default environment, which you must put in the file $HOME/.srb/.MdasEnv on any UNIX system where you plan to invoke the SRB commands interactively. You must also create the file $HOME/.srb/.MdasAuth to reflect either a password-based authentication scheme or GSI-enabled authentication, according to your preference.
- To utilize the TeraGrid database service, you will need to contact the TeraGrid User Services group and be prepared to provide information on database size, growth rate, user community, anticipated workload, and length of time that the database is expected to remain online.

6. Should one make an explicit reservation request
for data services?

This depends on the size of the need. The current proposal process for acquiring time on the TeraGrid requires only information on the number of CPU units being requested. For the moment, there are sufficient data resources across the TeraGrid to satisfy the anticipated near-term user needs. Thus, the application process does not require explicit information related to the data capacity or functionality needs of a given application.

Typically, users whose applications have more demanding needs, in terms of data size and/or I/O rates, work closely with the TeraGrid User Services group to ensure that the application requirements can be met. Anticipating that such applications will become routine in the future, the TeraGrid and the PACI centers are working towards an allocation policy that will cover data requirements as well.

7. What is the most useful way by which to
communicate application data needs to the TeraGrid?

A typical data-intensive application consists of some or all of the following sequence of steps:

1. Stage data: prime the pump, i.e., get the data needed for startup
2. Pre-process: generate initial conditions and perhaps the first time step
3. Run computation: execute the core code and write out data per time step
4. Post-process: generate derived elements
5. Drain and archive: off-load data and move it to long-term storage

The outline presents a high-level and generic characterization of data applications. The data volume, request rate, and storage accessed at each step will vary depending on the application. For example, in one application Step 1 (Stage data) may involve the transfer of a large data set or collection from archival storage to a parallel filesystem. Subsequently, in Step 4 (Post-process), the output may be sent to a visualization engine. Another application may generate large output which, in Step 5 (Drain and archive), may be moved to long-term storage and deleted from the parallel filesystem where it was originally written. In yet another application, Step 1 may involve the bulk loading of a database, which remains on disk for a period of time, say, until multiple applications are executed.

Each step may require access to different types of data services, and data may have to be transferred from one location to another between steps. When analyzing the data needs of your application, it is useful to think in some detail about the types of data services required in each step and the amount of data that is being handled.

There are different modalities in which the data may be handled. For some applications (or some steps of a given application), the data may be in “flat” files, thus requiring access to a filesystem. There may be a need for parallel access to the data and thus for MPI-IO capability. In other cases, the file may need to be created or accessed using HDF libraries. Furthermore, input files may be local to or remote from the site of the application. Similarly, the output of an application may need to be written to local storage or remote storage. If an application analyzes many small files (e.g. a set of image files in a digital sky collection), or generates many files (e.g. the ENZO application), it may need to use the SRB collection management software to manage the large numbers of files.
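One way to organize this kind of analysis is to write the per-step needs down as a small table before talking to User Services. The following Python sketch is only an illustration: the class and function names are hypothetical, and every volume and time window in the example plan is an invented placeholder, not a measurement from any TeraGrid application. It records, for each of the five steps, the data services involved and the data volume, and derives the average bandwidth each step would need.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageNeed:
    """Estimated data needs of one step of a run (all values are user estimates)."""
    name: str
    services: List[str]   # data services the step relies on
    volume_gb: float      # data read or written during the step
    window_s: float       # time available to move that data, in seconds

    def bandwidth_mb_s(self) -> float:
        # Average bandwidth needed, taking 1 GB = 1000 MB.
        return self.volume_gb * 1000.0 / self.window_s


def summarize(stages: List[StageNeed]) -> List[str]:
    """Return one human-readable line per step."""
    return [
        f"Step {i} ({s.name}): {s.volume_gb:g} GB via "
        f"{', '.join(s.services)}, ~{s.bandwidth_mb_s():.0f} MB/s"
        for i, s in enumerate(stages, start=1)
    ]


def peak_stage(stages: List[StageNeed]) -> StageNeed:
    """Return the step with the highest average bandwidth requirement."""
    return max(stages, key=lambda s: s.bandwidth_mb_s())


# Hypothetical plan; every number below is a placeholder.
PLAN = [
    StageNeed("Stage data", ["archive", "gridFTP", "parallel filesystem"], 500, 3600),
    StageNeed("Pre-process", ["parallel filesystem"], 50, 600),
    StageNeed("Run computation", ["parallel filesystem", "MPI-IO"], 2000, 7200),
    StageNeed("Post-process", ["parallel filesystem", "HDF5"], 200, 1800),
    StageNeed("Drain and archive", ["gridFTP", "archive"], 2000, 14400),
]

if __name__ == "__main__":
    for line in summarize(PLAN):
        print(line)
    print("Most demanding step:", peak_stage(PLAN).name)
```

A summary of this kind makes it easier to see which step drives the I/O requirement and which services need to be available at the site where the application runs.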
Applications may also read data directly from and write output directly to a database system (e.g. the EOL application described below). In each case, different data services are required to serve the application needs.

Based on preliminary test runs and/or some experience with an application, it may also be possible to estimate the required I/O bandwidth for different stages of a given application. This is also extremely useful information to consider when planning an application run on the TeraGrid.

8. Is there User Services support available to
address the data needs of an application?

Yes, TeraGrid sites will typically have consultants available who can help with planning an application and also provide detailed information on how to use the various TeraGrid data services such as parallel filesystems, archives, HDF libraries, SRB, and DBMS. Specific questions can also be sent to help@teragrid.org.

9. Are there good examples of data-oriented
applications in the TeraGrid?

Yes, there are several “flagship applications” / early adopters that demonstrate the power of the TeraGrid Data framework. They are:

ENZO
- Engages more than a few hundred CPUs writing independent files at 10s of MB/file over the space of a couple of minutes.
- Necessary data element: a parallel filesystem capable of sustaining 800-900 MB/s.
ENZO is a relatively long-standing science application that is being adapted to run on the TeraGrid and take advantage of the TeraGrid capabilities.

EOL (Encyclopedia of Life)
An application that spawns many processes (~100s), each of which reads data from a large database (~2-5TB) and writes data back into the database (~100MB to a few GB per process). EOL is a recently developed bioinformatics application based on an analysis pipeline. It is also being adapted to take advantage of the database services available in the TeraGrid.

TeraBridge
This is a new application (based on a recent NSF ITR grant related to monitoring the health of civil infrastructure). Continuous data streams are obtained from sensors in the field (e.g. on bridges) and loaded into a TeraGrid database. One of the objectives is data assimilation: incorporating the observed sensor data into analytical models in order to improve the quality of the models. The application generates a continuous workload of about 400,000 rows/day, which need to be inserted into a database.

Viz - ASCI Flash White Dwarf Simulation

10. How does a user find out local environment
variable settings and policies?

A user can discover the local settings of environment variables by issuing the env command from the TeraGrid login nodes (viz., tg-login). Here is a list of storage-related variables that are common to all sites:

$TG_CLUSTER_HOME - specifies the path to a user’s home directory on a TG cluster. This space is visible to all the nodes in the cluster, including the login nodes.
$TG_CLUSTER_PFS - specifies the path to the default parallel filesystem space for the cluster. This space is accessible to all nodes in a TG cluster, including the login nodes.
$TG_CLUSTER_SCRATCH - specifies the path to scratch space that is accessible to all nodes in a TG cluster, including the login nodes.
$TG_NODE_SCRATCH - specifies the path to scratch space that is local to a TG compute or login node, for use by an application while it is running on the node.

Each site may have additional variables. For more, please see: http://www.teragrid.org/userinfo/jobs/environment.php.

Also, each TeraGrid site is implementing a “policy” command. Typing “policy” at a site will provide detailed information on the local policy settings, for example, how scratch space is handled at that site. We are planning on including a “policy” command in CTSSv2.

TABLE I: Per-Site Storage Details
1,2. At ANL and Caltech, cluster scratch and parallel are the same file-system(s).

[1] Throughout this document, “SRB” refers to the SDSC implementation of the Storage Resource Broker. |
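As a supplement to question 10, the storage-related variables can also be checked from a script rather than by reading the output of env. The sketch below is a minimal illustration in Python; the function names are hypothetical, and it assumes only the four variable names listed in question 10. It reports which of them are set in the current environment.

```python
import os

# The four storage-related variables described in question 10.
TG_STORAGE_VARS = (
    "TG_CLUSTER_HOME",
    "TG_CLUSTER_PFS",
    "TG_CLUSTER_SCRATCH",
    "TG_NODE_SCRATCH",
)


def storage_paths(environ=None):
    """Map each storage variable to its value, or None if it is not set.

    `environ` defaults to the process environment; passing a dict makes the
    function easy to try out away from a TeraGrid login node.
    """
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in TG_STORAGE_VARS}


def missing(paths):
    """Return the names of variables that are unset or empty."""
    return [name for name, value in paths.items() if not value]


if __name__ == "__main__":
    paths = storage_paths()
    for name, value in paths.items():
        print(f"${name} = {value if value else '(not set)'}")
    if missing(paths):
        print("Not set on this system:", ", ".join(missing(paths)))
```

Site-specific variables beyond these four are not covered; the env command remains the way to see everything a given site defines.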