User Support & Documentation
Data: Glossary

This glossary includes commands and terms that are helpful for understanding data transfer and storage. Supported commands are highlighted for ease of reference.

archival storage: term used to refer to mass storage in systems maintained at specific TeraGrid sites for permanent preservation of data. SRB, UniTree, and HPSS are archival storage systems available to TeraGrid users. (See the Data Storage page.)

cross-site data transfer: term used to refer to the movement of very large files or large numbers of files from one TeraGrid site to another; usually best accomplished with third-party transfers between gridFTP servers, using a gridFTP client with optimization parameters set for the highest transfer rates.

data collection: permanent data storage that is organized, searchable, and available to a wide audience, either a collaborative group or the scientific public in general. Data collections usually have a Web interface or portal for displaying and retrieving data.

fast file system: term used to refer to the parallel file systems available for temporary storage on all TeraGrid nodes. Transfer rates will be greater when moving data between fast file systems. $TG_CLUSTER_PFS refers to the default parallel file system space. Sites may have more than one parallel file system available (PVFS, GPFS, SAMQFS). File reaping should be turned on at all sites, and purge policies and quotas will vary.

get: uberftp command to retrieve a single file from the remote service

globus-url-copy: a GridFTP client for transferring files from the command line. It is not an interactive command; as a command-line client, it provides no feedback after the command is executed. It is the client of choice for embedding transfers in job scripts. It is part of the CTSS; that is, it is available at all TeraGrid sites.
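As an illustration of embedding a transfer in a job script, the following sketch wraps a globus-url-copy call in a shell function. The two gsiftp:// URLs and the function name run_transfer are hypothetical placeholders. The buffer size and stream count reuse the values this glossary lists for the uberftp tcpbuf and parallel parameters, on the assumption that the same values suit globus-url-copy at a given site. The sketch relies on the globus-url-copy options -tcp-bs (TCP buffer size in bytes) and -p (number of parallel streams).

```shell
#!/bin/sh
# Sketch of a transfer step for a job script (not a site-endorsed recipe).

# Hypothetical endpoints: replace with real gsiftp:// URLs for your sites.
SRC_URL="gsiftp://gridftp.source.example.org/scratch/user/input.dat"
DST_URL="gsiftp://gridftp.dest.example.org/scratch/user/input.dat"

# Assumed settings, taken from the tcpbuf and parallel entries in this glossary.
TCP_BUF=8388608   # TCP buffer size in bytes
STREAMS=1         # number of parallel data streams

# run_transfer SRC DST
# Runs globus-url-copy with the settings above.
# With DRY_RUN=1 the command line is printed instead of executed.
run_transfer() {
    set -- globus-url-copy -tcp-bs "$TCP_BUF" -p "$STREAMS" "$1" "$2"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Dry run: show the command without starting a transfer.
# Set DRY_RUN=0 in a real job script.
DRY_RUN=1
run_transfer "$SRC_URL" "$DST_URL"
```

Keeping the settings in variables at the top of the script makes it easy to adjust them per site without touching the transfer call itself.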
GPFS: IBM's General Parallel File System (GPFS) is a high-performance shared-disk file system that can provide fast data access from all nodes in a homogeneous or heterogeneous cluster. GPFS allows parallel applications simultaneous access to the same files, or different files, from any node that has the GPFS file system mounted, while maintaining a high level of control over all file system operations.

gridFTP: a high-performance, secure, reliable data transfer protocol, based on FTP and optimized for high-bandwidth wide-area networks. Two gridFTP clients are available in the software stack: uberFTP and globus-url-copy.

gsincftp: an interactive FTP client, built on the NcFTP client, that uses a proxy for authentication. Its tools (gsincftpls, gsincftpget, gsincftpput) are not part of the Globus Toolkit. Each command displays usage information with the -h flag. Due to a data corruption bug, this method is no longer supported on the TeraGrid as of Apr. 15, 2004. Use uberftp instead for similar interactive access, including directory listing.

gsi-openssh: a version of OpenSSH that supports grid authentication, allowing people who already have grid certificates to use these tools without establishing another set of keys. Services include remote login (gsi-ssh) and remote copy (gsi-scp).

gsiscp: a grid-authentication-enabled form of scp. The syntax of gsiscp is the same as that of scp.

home: refers to the user's home directory on each TeraGrid cluster (see $TG_CLUSTER_HOME below)

HPSS: archival storage at SDSC; requires the HSI interface and its commands for retrieval and storage of data

HSI: client interface for HPSS.
Uses its own commands for moving data rather than UNIX commands, globus-url-copy commands, or SRB commands

mass storage: term used to refer to archival storage in systems maintained at specific TeraGrid sites for large, long-term preservation of data

mget: uberftp command to retrieve multiple files that match a given expression

mput: uberftp command to send multiple files that match a given expression

optimization parameters: arguments passed to transfer commands in gridFTP clients to maximize transfer speeds. TCP buffer size and number of parallel streams are the two user-controllable parameters that have the largest effect on the speed of a transfer using uberFTP or globus-url-copy.

parallel file system:

parallel N: uberftp optimization parameter that sets the number of parallel data streams to N; set N to 1 for best transfer performance (used with tcpbuf = 8388608). The default is 1. For high network traffic or for very large files (>X GB), N can be set to 2 (used with tcpbuf= )

put: uberftp command to send a single file to the remote service

PVFS: the Parallel Virtual File System (PVFS) is a high-performance and scalable parallel file system for clusters. PVFS is open source and released under the GNU General Public License. It requires no special hardware or modifications to the kernel. PVFS provides four important capabilities in one package:
Reliable Transfer Service (RFT): an OGSA-based service that provides interfaces for controlling and monitoring third-party file transfers using GridFTP servers. It periodically stores its state in a database so that transfers can be recovered after a failure. RFT uses standard grid security mechanisms for authorization and authentication of users.

RFT: see Reliable Transfer Service

Sget

Sput

SRB: see Storage Resource Broker

scp: a program for copying files securely over the network. It uses ssh for data transfer, with the same authentication and the same security as ssh. Use of scp requires an SSH2 client.

Storage Resource Broker (SRB): a data management tool that may be used for storage, replication, archiving, third-party copying, and movement of large TeraGrid data sets across distributed, heterogeneous storage systems. It uses its own set of commands. Any TeraGrid user can use an SRB client to download data that is available in public collections. To create collections using SRB, TeraGrid users need to request an SRB account via the Help Desk. For more information on usage and availability, please see TeraGrid Archival and Data Services.

tcpbuf: uberftp optimization parameter used to change the TCP buffer size; best setting to increase transfer performance: tcpbuf = 8388608

$TG_CLUSTER_GPFS: environment variable that, along with $TG_CLUSTER_PVFS and $TG_CLUSTER_LUSTRE, refers to the paths to the various types of parallel filesystems available to all nodes on a TeraGrid cluster; one of these will likely point to the same directory as $TG_CLUSTER_PFS. Some sites may support multiple filesystems; for example, SDSC has both GPFS and PVFS. If a site does not have a particular type of parallel filesystem, the corresponding variable will not be defined.
(See Envi

$TG_CLUSTER_HOME: environment variable that refers to the user's home directory on each TeraGrid cluster; this space is visible to all the nodes in the cluster, including the login nodes. Data in this area is never purged; users are responsible for their own backups.

$TG_CLUSTER_LUSTRE: environment variable that, along with $TG_CLUSTER_PVFS and $TG_CLUSTER_GPFS, refers to the paths to the various types of parallel filesystems available to all nodes on a TeraGrid cluster; one of these will likely point to the same directory as $TG_CLUSTER_PFS. Some sites may support multiple filesystems; for example, SDSC has both GPFS and PVFS. If a site does not have a particular type of parallel filesystem, the corresponding variable will not be defined.

$TG_CLUSTER_SCRATCH: environment variable that refers to the path for scratch space that is accessible to all nodes in a TeraGrid cluster, including the login nodes; this storage is shared with other users who may be running on the cluster at the same time, and may be physically co-located with other logical storage areas; purge policies and quotas may vary between sites

$TG_CLUSTER_PFS: environment variable that refers to the path to the default parallel filesystem space for the cluster. This space is accessible to all nodes in a TeraGrid cluster, including the login nodes. This storage may be physically co-located with other logical storage areas (e.g. $TG_CLUSTER_SCRATCH).

$TG_CLUSTER_PVFS: environment variable that, along with $TG_CLUSTER_GPFS and $TG_CLUSTER_LUSTRE, refers to the paths to the various types of parallel filesystems available to all nodes on a TeraGrid cluster; one of these will likely point to the same directory as $TG_CLUSTER_PFS. Some sites may support multiple filesystems; for example, SDSC has both GPFS and PVFS. If a site does not have a particular type of parallel filesystem, the corresponding variable will not be defined.
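Because a variable for a parallel filesystem type is not defined at sites that lack that type, a portable job script may want to check which variables are set before using them. The following is a minimal sketch of that check. The function name pick_pfs_dir and the preference order (the default $TG_CLUSTER_PFS first, then GPFS, PVFS, and Lustre) are assumptions made for illustration, not a TeraGrid convention.

```shell
#!/bin/sh
# pick_pfs_dir
# Prints the first parallel filesystem path that is defined and non-empty,
# trying the default space first. Returns 1 if none of the variables is set.
# The order of preference is an assumption; adjust it to suit your site.
pick_pfs_dir() {
    for pfs_dir in "${TG_CLUSTER_PFS:-}" "${TG_CLUSTER_GPFS:-}" \
                   "${TG_CLUSTER_PVFS:-}" "${TG_CLUSTER_LUSTRE:-}"; do
        if [ -n "$pfs_dir" ]; then
            printf '%s\n' "$pfs_dir"
            return 0
        fi
    done
    return 1
}

# Report the result, or warn when no variable is defined at this site.
if workdir=$(pick_pfs_dir); then
    echo "parallel filesystem space: $workdir"
else
    echo "no TG_CLUSTER_* parallel filesystem variable is defined" >&2
fi
```

The ${VAR:-} form expands to an empty string when a variable is unset, so the check also works in scripts that enable set -u.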
tgcp: a command-line user tool that simplifies efficient copying of files and directories between and within gridFTP-enabled clusters. tgcp is a wrapper for globus-url-copy, RFT, and cp, providing options such as '-big', for a striped globus-url-copy, and '-rft', to call RFT. Because an administrator maintains configuration files describing optimal TCP buffer sizes for specific TeraGrid host/domain source-destination pairs, users do not need to enter these values.

uberftp: an interactive GridFTP file transfer client. In addition to standard FTP client mechanics, UberFTP supports GSI authentication, parallel data channels, and striping.
UniTree: mass storage system at NCSA; has a gridFTP server front end for facilitating large data transfers

UNIX commands: the native UNIX utilities cp and mv work on all compute nodes for local movement of files between home, work, or scratch directories at one site
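As an example of local file movement with cp, the sketch below stages a file from the home directory into scratch space before a job and copies a result back afterwards. The function names stage_in and stage_out and the file names in the usage note are hypothetical; the sketch assumes that $TG_CLUSTER_HOME and $TG_CLUSTER_SCRATCH are defined on the cluster.

```shell
#!/bin/sh
# stage_in FILE
# Copies FILE from the home directory to scratch space.
# The ${VAR:?message} form stops the script with an error message
# if the variable is not defined, instead of copying to a wrong path.
stage_in() {
    cp "${TG_CLUSTER_HOME:?is not defined}/$1" \
       "${TG_CLUSTER_SCRATCH:?is not defined}/"
}

# stage_out FILE
# Copies FILE from scratch space back to the home directory,
# which this glossary describes as never purged.
stage_out() {
    cp "${TG_CLUSTER_SCRATCH:?is not defined}/$1" \
       "${TG_CLUSTER_HOME:?is not defined}/"
}
```

A job script might call stage_in input.dat before the computation and stage_out output.dat once it finishes, so that results do not depend on the purge policy of the scratch space.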
The TeraGrid project is funded by the National Science Foundation and includes 11 partners.
Please email help@teragrid.org with questions or comments.