AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis
I. Raicu, I. Foster, A. Szalay, G. Turcu
The creation of large digital sky surveys presents the astronomy community with tremendous scientific opportunities. However, these astronomy datasets are generally terabytes in size and contain hundreds of millions of objects separated into millions of files—factors that make many analyses impractical to perform on small computers. To address this problem, we have developed a Web Services-based system, AstroPortal, that uses grid computing to federate large computing and storage resources for dynamic analysis of large datasets. Building on the Globus Toolkit 4, we have built an AstroPortal prototype and implemented a first analysis, "stacking," that sums multiple regions of the sky, a function that can help both identify variable sources and detect faint objects. We have deployed AstroPortal on the TeraGrid distributed infrastructure and applied the stacking function to the Sloan Digital Sky Survey (SDSS), DR4, which comprises about 300 million objects dispersed over 1.3 million files, a total of 3 terabytes of compressed data, with promising results. AstroPortal gives the astronomy community a new tool to advance their research and to open new doors to opportunities never before possible on such a large scale.
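As a rough illustration of the stacking idea described above, the sketch below sums equally sized pixel cutouts element by element. The function name `stack_cutouts` and the nested-list image representation are assumptions made for this example and are not taken from AstroPortal; a real implementation would also need to handle calibration, alignment, and file access.

```python
from typing import List, Sequence

Image = List[List[float]]


def stack_cutouts(cutouts: Sequence[Image]) -> Image:
    """Sum equally sized 2D cutouts pixel by pixel (hypothetical helper)."""
    if not cutouts:
        raise ValueError("need at least one cutout")
    rows, cols = len(cutouts[0]), len(cutouts[0][0])
    stacked = [[0.0] * cols for _ in range(rows)]
    for cutout in cutouts:
        # Every cutout must cover the same pixel grid to be summed.
        if len(cutout) != rows or any(len(row) != cols for row in cutout):
            raise ValueError("all cutouts must have the same shape")
        for i in range(rows):
            for j in range(cols):
                stacked[i][j] += cutout[i][j]
    return stacked


if __name__ == "__main__":
    # Three toy 2x2 cutouts (made-up values); the top-left pixel
    # carries a weak signal in each of them.
    toy_cutouts = [
        [[1.0, 0.0], [0.0, 0.0]],
        [[1.5, 0.5], [0.0, 0.0]],
        [[0.5, 0.0], [0.0, 1.0]],
    ]
    print(stack_cutouts(toy_cutouts))
```

Summing many cutouts centered on the same sky position lets a signal that is too weak in any single image accumulate, which is the motivation given above for detecting faint objects.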
VisEnable: A User Friendly and Extensible System for Remote Visualization
D. Guzman, G. Johnson, P. Shanmugam
With the ever increasing size and complexity of scientific datasets, researchers are finding that they need a simple, intuitive and efficient means of visualizing data. However, most do not possess the visualization or programming backgrounds to make full use of the currently available tools. In addition, many current remote visualization solutions support a limited set of visualization techniques and dataset types. To make visualization accessible to all researchers that could benefit from it, a system is required that makes the process as straightforward and simple as possible while providing a rich set of visualization algorithms. Users should not be required to install additional software, have direct access to specialized hardware, or need any knowledge of scripting or programming languages. Furthermore, the system should feature an extensible set of visualization methods that can be applied to a wide range of scientific data.
VisEnable is a remote visualization system that attempts to address these needs by providing a web-based interface to TACC's advanced visualization hardware while providing a powerful yet easy-to-use set of visualization algorithms based on VTK (the Visualization Toolkit). At the core of our approach is a knowledge-based system that frees the researcher from needing to convert data to a particular file format, to know which specific visualization algorithms can be applied to it, or to know which TACC machines support the necessary graphics processing operations.
The nanoHUB—Online Simulations and a Community for Nanoscience and Nanotechnology
M. McLennan, S. Goasguen, K. Madhavan, D. Kearney, J. Cychosz
The "Network for Computational Nanotechnology (NCN)" is a multi-university, NSF-funded initiative with a vision to be the place where theory, experiment, and computation meet and move nanoscience to nanotechnology.
The NCN is a leader in modeling and simulation research through connection to center-level efforts and targeted research. http://nanoHUB.org is the primary NCN outreach vehicle; it currently provides interactive online simulation and educational resources such as tutorials, seminars, and online courses. The nanoHUB's resources are used annually by over 10,000 users, and raw web-page hits exceed 6.1 million per year. Over 2,100 users perform over 57,000 interactive simulations using over 40 simulation tools ranging from toy models to sophisticated simulation engines. The NCN provides resources for models, simulation, and computation via web delivery, without any software installation for users.
Many of the nanoHUB applications are intended for educational use. For research use, several industrial-strength simulation tools have been rolled out, and additional tools are available to nanoHUB researchers. Some of these codes require significant resources and run on national resources such as the TeraGrid, without the end user needing to be aware of how the simulation results are actually delivered.
In this demonstration, we will give an overview of the various capabilities available for free at the nanoHUB website, including simulation tools, teaching materials, seminar presentations, and interactive workspaces. We will run a variety of simulation tools on devices ranging from MOSFETs to carbon nanotubes, silicon nanowires, and quantum dots. Finally we will give an overview of the Rappture (rapid application infrastructure) toolkit, which is our enabling technology for giving research codes a modern user interface.
Using Shared Collections to Manage TeraGrid Simulation Output
R. Moore, S. Meier, G. Kremenek, W. Schroeder, L. Brieger
TeraGrid applications are capable of generating tens of terabytes of simulation output comprising up to a million files. The management of simulation output is simplified through the use of collections. Output files can be organized from multiple simulations, attributes can be defined to support search, and access controls can be set to enable sharing. We will describe how the SDSC Storage Resource Broker applies data grid technology to build shared collections that span multiple TeraGrid resources.
Managing Storm Simulation Workflows using LEAD Gateway
S. Marru, M. Christie
An atmospheric scientist who wants to predict a storm using a secured grid system has to deal with security mechanisms that require learning certain computational skills otherwise irrelevant to their science goals. Constructing, configuring, scheduling, executing, and monitoring a storm simulation workflow on a grid system involves multiple interactions between resources, executing and monitoring services, and the like. The Linked Environments for Atmospheric Discovery (LEAD; http://leadproject.org) gateway, an NSF-funded ITR project, is building cyberinfrastructure to enable scientists to predict mesoscale weather events like tornadoes. To accomplish these goals, LEAD is developing an adaptive, on-demand grid infrastructure that responds to complex weather-driven events. The LEAD Gateway will incorporate the next generation of TeraGrid services and schedulers with a focus on using on-demand compute resources for ensemble weather simulations and TeraGrid storage resources for data mining tasks and services. This paper will focus on components of the LEAD Service Oriented Architecture that will allow users to compose a storm simulation workflow by connecting Fortran applications wrapped as web services. The paper also discusses how a user can launch the composed workflow on TeraGrid resources and monitor its progress. Methods to visualize the storm simulation workflow output are presented. Interfaces dealing with metadata of simulations, observational data, and forecasted data outputs are discussed. The paper also presents information on various other LEAD Gateway capabilities including an authorization, authentication, and auditing framework that will allow different classes of users to seamlessly access and use TeraGrid resources.
Building A Numerical Modeling Application Portal on TeraGrid Environment
B. Kim, N. Kim, J. Cho, K. Cho, Y. Kim
Using a portal as a unified interface to grid resources appears to be the most appropriate way for an existing scientific application to take advantage of available grid technology. Although many grid portals exist, customized service layers that connect scientific applications to the portal framework and the grid resources are hard to find. In addition, each scientific application has its own requirements, so the service layers should be able to accommodate application-specific requests within the grid portal framework. Developing a customized, application-specific portal addresses this need. At NCSA, in a collaborative research effort with KISTI, an application portal has been developed. A computational fluid dynamics numerical model has been integrated into the portal framework, along with several customized service modules. The CoaxSim Grid portal developed in this project offers application-specific services in a web-based problem-solving environment. The demo will show an alternative way of achieving a community-oriented environment, enabling a greater level of communication than ever before between researchers in the specific community.
Developing the Modular Information Provider (MIP) to Support Interoperable Production-Level Grid Information Services
S. Wang, E. Shook, A. Padmanabhan, R. Briggs, L. Pearlman
The Modular Information Provider (MIP) has been developed to systematically aggregate multiple information sources for establishing Grid information services. MIP tackles the challenge of mapping information from a large number of sources to information services with minimal human intervention. MIP addresses such mappings using a modular approach through which it can be deployed to achieve interoperability among several Grid environments. MIP can be customized in a straightforward way to a specific Grid environment, and it also supports flexible information schemas. The current MIP implementation is based on Globus MDS4 and the XML version of GLUE Schema 1.2. The design of the MIP aims to address the shortcomings that exist in Grid information providers (e.g., the Generic Information Providers) as well as to support web service-based Grid information services. Our modular approach was developed to ease maintenance and management of Grid information systems by automatically pulling information sources together and filling appropriate pieces of information into Grid information services based on information schemas. This approach minimizes memory requirements because it dynamically loads only the necessary modules, which can be customized to the requirements of a particular Grid resource.
GAMA—Grid Account Management Architecture 2.0
K. Mueller, S. Chandra, K. Bhatia
GAMA 2.0 is the newest version of the Grid Account Management Architecture software package developed at SDSC for simplifying Grid user account creation, credential management, authentication, and authorization. GAMA 1, released in March 2005, provided a portal-based system for users to request Grid accounts and for administrators to review account requests and trigger account and credential creation through a familiar web GUI. The GAMA portal components communicated via web services with a dedicated GAMA server machine that was easily installed using the Rocks build system and incorporated CACL Credential Authority (CA) software and a MyProxy credential repository. Building on the proven foundation of GAMA 1, we have significantly expanded the capabilities of the GAMA system in version 2.0. The GAMA server now uses a modular plug-in workflow system for performing tasks such as authentication and user account creation. This enables the GAMA administrator to configure a complex sequence of tasks that are carried out for each GAMA server function, and also enables easier integration with existing infrastructure components such as LDAP servers, SRB systems, and other resources. GAMA now supports multiple sites on a single GAMA server, with each site having a local site administrator with limited administrative rights on the GAMA server in order to manage users from their site only. Finally, GAMA now supports synchronization of cluster accounts, whereby the GAMA administrator specifies that a set of users should get accounts on a set of resources and the resources automatically poll the GAMA server and create accounts.
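The plug-in workflow idea described above can be sketched generically as an ordered list of callables that is configured per server function. None of the names below (`Workflow`, `validate_request`, `issue_credential`, `add_directory_entry`) come from GAMA; they are hypothetical stand-ins, and the steps only record that they ran rather than contacting a real CA or LDAP server.

```python
from typing import Callable, Dict, List

# A plug-in takes a request context and returns an updated copy of it.
Context = Dict[str, object]
Plugin = Callable[[Context], Context]


class Workflow:
    """Runs a configured sequence of plug-ins for one server function."""

    def __init__(self, plugins: List[Plugin]) -> None:
        self.plugins = list(plugins)

    def run(self, context: Context) -> Context:
        for plugin in self.plugins:
            context = plugin(context)
        return context


def validate_request(ctx: Context) -> Context:
    if not ctx.get("username"):
        raise ValueError("username is required")
    return {**ctx, "validated": True}


def issue_credential(ctx: Context) -> Context:
    # Hypothetical stand-in for asking a CA to issue a credential.
    return {**ctx, "credential": "credential-for-" + str(ctx["username"])}


def add_directory_entry(ctx: Context) -> Context:
    # Hypothetical stand-in for registering the user in a directory service.
    return {**ctx, "directory_entry": "uid=" + str(ctx["username"])}


if __name__ == "__main__":
    # An administrator would choose which steps run, and in what order.
    create_account = Workflow(
        [validate_request, issue_credential, add_directory_entry]
    )
    print(create_account.run({"username": "alice"}))
```

Keeping each step behind the same small interface is what would let a site swap in or reorder steps, for example adding a directory step only where a directory server exists.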
SCEC Earthworks Science Gateway: Widening SCEC Community Access to the TeraGrid
J. Muench, P. Maechling, H. Francoeur, D. Okaya, Y. Cui, E. Deelman, G. Mehta, T. Jordan
The SCEC Earthworks Science Gateway is designed to allow members of the SCEC geoscience community to perform sophisticated, computationally intensive, geophysical research using TeraGrid resources, even if they have no prior experience with high performance computing. The SCEC Earthworks Science Gateway allows users to configure and execute earthquake wave propagation simulations using well-validated geophysical models and high performance simulation software. The SCEC Earthworks system generates a series of data sets including surface seismograms and ground motion maps. It also interfaces with the Incorporated Research Institutions in Seismology (IRIS) Data Handling Interface (DHI), which provides the system with access to observed data including earthquake catalogs and seismograms.
Users access the SCEC Earthworks system through a web-based portal built using the GridSphere Portlets engine. Using a portlet-based interface, users can configure, submit, and monitor wave propagation simulations. They can also access the resulting simulation data products. The portlets allow users to browse simulation data products, save configurations, and share simulation results with other users. All steps in the wave propagation simulations, including mesh generation, wave propagation, and post processing, are run using a grid-based workflow system based on the Virtual Data System (VDS), the Pegasus meta-scheduler system, and the Globus toolkit. These workflow tools perform the backend steps of registering data with an RLS (Replica Location Service) and building, submitting, and monitoring workflows. The metadata for the resulting data products are registered within an MCS (Metadata Catalog Service).
Demonstration of the Computational Science and Engineering Online (CSE-Online)
M. Nayak, T. Cook, P. Mahajan, W. Duncan, T. Truong
The demonstration will focus on the use of Computational Science and Engineering Online (CSE-Online) for research and teaching in computational chemistry. CSE-Online supports a number of application tools for research and education in quantum chemistry, chemical kinetics, and bio-molecular modeling and simulation. We will demonstrate CSE-Online's capabilities for concurrent access to data, tools, and resources from multiple remote servers as well as the computing grid, and for sharing data with others in a collaborative environment.
NSTG portal demonstration
J. Cobb, S. Miller, G. Pike, S. Vazhkudai, M. Hagen, M. Chen, G. Granroth, J. Kohl, V. Lynch
The Neutron Science TeraGrid gateway (NSTG) is being developed at the Oak Ridge National Laboratory to create linkages between TeraGrid cyberinfrastructure and neutron scattering facilities in general, and specifically the Spallation Neutron Source (SNS), also located at Oak Ridge. The SNS's seven-year, $1.4-billion construction phase ended less than a month before the TeraGrid '06 conference, and the facility is now moving into operations. This demonstration will show NSTG prototyped and (alpha) production services for SNS and neutron science facilities in general, as well as early experience with the SNS's advanced software development group and preliminary data from SNS commissioning experience. It will also discuss future plans to support user operations for an increasing number of beamline user programs. Demonstrated features will include data management, metadata browsing, Monte-Carlo instrument simulation, raw and reduced data visualization, and data analysis, all within a portal presentation environment.
Remote, real-time visualization of multidimensional biological images
C. Gilpin, L. Katherine, K. Gaither
Modern biological light and electron microscopy can be used to produce multidimensional images of organisms, cells, organelles and molecules. Confocal microscopy is used to collect Z-stacks of three-channel data, often as a time series. In electron tomography a series of images is collected over a large range of tilt angles at small increments. With appropriate software, raw 2D data are rendered into 3D volumes. In order to extract essential information from the data, 3D volumes need to be displayed, segmented, rendered, and freely rotated and zoomed in real time at full resolution. Most visualization software available to biological imaging laboratories operates on a single workstation and is limited by processor speed, memory capacity, and graphics hardware. We present a potential solution by using the TeraGrid to access remote multi-processor computation and remote visualization capabilities. Our test platform is ParaView, an open source application that can be run on distributed- and shared-memory systems. ParaView can be run in "server mode," where rendering is computed remotely and the 3D volume data are transferred over the network to the host workstation for local viewing. This has proved unsatisfactory due to bandwidth limitations. Our alternative approach is to render and display the volume on a remote system and transfer the screen image via VNC to the local system. Thus only screen pixels need to be transferred over a network connection. Example data will be shown, and the possibilities for file size and rendering complexity will be explored.
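The bandwidth argument above can be made concrete with a back-of-envelope calculation. Every number in the sketch below (volume dimensions, channel count, sample size, screen resolution) is an illustrative assumption rather than a figure from this work, and the comparison ignores compression as well as the fact that interactive viewing sends many frames.

```python
def volume_bytes(nx: int, ny: int, nz: int,
                 channels: int, bytes_per_sample: int) -> int:
    """Uncompressed size of a multi-channel 3D volume."""
    return nx * ny * nz * channels * bytes_per_sample


def frame_bytes(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Uncompressed size of one rendered screen image."""
    return width * height * bytes_per_pixel


if __name__ == "__main__":
    # Assumed example: a 1024 x 1024 x 256 stack, 3 channels, 8-bit samples.
    vol = volume_bytes(1024, 1024, 256, channels=3, bytes_per_sample=1)
    # Assumed example: one 1280 x 1024 RGB screen image.
    frame = frame_bytes(1280, 1024)
    print(f"volume: {vol / 2**20:.0f} MiB")
    print(f"frame:  {frame / 2**20:.2f} MiB")
    print(f"volume is about {vol / frame:.0f}x one frame")
```

Under these assumptions, a single time point of the volume is a couple of hundred times larger than one screen image, which is the kind of gap that makes shipping pixels attractive when the volume changes or a time series is involved.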
Lustre WAN Demo
D. Balog, J. Huffman, G. Pike, S. Simms
TeraGrid users desire a globally accessible high performance filesystem. PSC, NCSA, IU, and ORNL have been experimenting with the Lustre parallel filesystem across the ETF. Lustre has generally been used within a cluster of machines that have a high-bandwidth, low-latency interconnect. We will demonstrate Lustre across the ETF, a high-bandwidth, high-latency network.
From Simulation to Visualization: Large-Scale Parallel CFD Application on The TeraGrid
R. Payli, E. Yilmaz, H. Akay, A. Ecer
Performing large-scale simulations with Computational Fluid Dynamics (CFD) codes and visualizing the results are challenging tasks requiring much processing power, fast communication, and large storage systems. The TeraGrid makes such simulations and visualizations possible.
We have demonstrated applications on the TeraGrid for medium-scale problems in the past. In this paper, we will present larger-scale simulations and visualizations on various TeraGrid resources. Our in-house 3D unstructured CFD code, PACER3D, is used for simulations; the open source ParaView is used to visualize the results. A single-program, multiple-data (SPMD) parallel model with the Message Passing Interface (MPI) is employed. A domain decomposition algorithm is used, in which the solution domain is partitioned into multiple pieces of data called solution blocks. In this model, all processors run the same program but each has its own data. After each time step of the computations, the processors exchange results with their neighbor processors to update the solution at the block boundaries.
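The exchange-then-update pattern described above can be illustrated with a toy 1D problem. This is a minimal serial sketch, not PACER3D and not MPI: a simple three-point smoothing update stands in for the flow solver, a Python loop stands in for the processors, and the function names are made up for this example. It checks that updating blocks with exchanged boundary values reproduces the single-domain result.

```python
from typing import List, Optional


def step_serial(u: List[float]) -> List[float]:
    """One smoothing step on the whole domain; the two end values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def split(u: List[float], nblocks: int) -> List[List[float]]:
    """Partition the domain into contiguous blocks of near-equal size."""
    base, extra = divmod(len(u), nblocks)
    blocks, start = [], 0
    for b in range(nblocks):
        size = base + (1 if b < extra else 0)
        blocks.append(u[start:start + size])
        start += size
    return blocks


def step_blocks(blocks: List[List[float]]) -> List[List[float]]:
    """Apply the same update to every block after a boundary exchange."""
    # "Exchange": each block receives its neighbors' edge values as ghost cells.
    left: List[Optional[float]] = [None] + [blk[-1] for blk in blocks[:-1]]
    right: List[Optional[float]] = [blk[0] for blk in blocks[1:]] + [None]
    updated = []
    for blk, lg, rg in zip(blocks, left, right):
        padded = ([lg] if lg is not None else []) + blk \
            + ([rg] if rg is not None else [])
        new = padded[:]
        for i in range(1, len(padded) - 1):
            new[i] = (padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
        lo = 1 if lg is not None else 0
        hi = len(new) - (1 if rg is not None else 0)
        updated.append(new[lo:hi])  # drop ghost cells, keep owned cells
    return updated


if __name__ == "__main__":
    u = [float((i * i) % 7) for i in range(10)]
    blocks = split(u, 3)
    for _ in range(3):
        u = step_serial(u)
        blocks = step_blocks(blocks)
    flat = [x for blk in blocks for x in blk]
    print(flat == u)
```

In an MPI code the two ghost-value lists would be filled by messages between neighboring ranks; the per-block update itself would be unchanged.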
To show the performance of the parallel solver, flow around an aircraft configuration is considered. This geometry contains 18 million tetrahedral elements and more than 3 million grid points. The geometry is partitioned into up to 1024 blocks with our General Divider (GD) unstructured grid partitioning software, which uses a graph partitioning library to partition the domain and prepare the interface information for the flow solver. The parallel performance results for up to 512 blocks on the different TeraGrid resources will be discussed. We will also show how to use the distributed and parallel rendering mode of ParaView to render the results of our simulations on the TeraGrid and display the results on the local machine without bringing large amounts of data from the simulation sites.
Accelerating 3D Volume Visualization for nanoHUB.org
M. McLennan, W. Qiao, R. Kennell, G. Klimeck
The Network for Computational Nanotechnology (NCN) is funded by the National Science Foundation (NSF) to create a science gateway called nanoHUB.org for nanotechnology research. Last year, more than 12,000 researchers visited the nanoHUB, where they viewed online seminars, tutorials, courses and web-based instructional modules, and downloaded other resources, such as simulation codes and homework assignments. Among those users, more than 2,500 also ran simulations online. They not only launched simulations, but also tweaked parameters and visualized the results, right from their desktop via an ordinary web browser. The nanoHUB has a powerful middleware for launching jobs, simulating, and visualizing the results. Many simulators produce 3D scalar and vector fields, which are automatically sent off to a small rendering farm for visualization and interactive exploration. In this demonstration, we will show how simulations are launched and how results are visualized on nanoHUB.org.
The Data Capacitor Project
S. Simms, B. Hammond, M. Link, C. Stewart
Advances in technology have dramatically increased the rate at which scientific data can be created. In this new data-centric computing environment, researchers will be looking to information technology providers for easy ways to store, manipulate, and organize these large data sets.
To address this need, as part of the Major Research Instrumentation (MRI) Program, the National Science Foundation awarded Indiana University $1.7 million to architect and deploy the Data Capacitor, a massive high speed storage system for the temporary storage of large datasets.
Learn about some of the challenges facing researchers who use large datasets and discover how the Data Capacitor project will help meet those challenges and extend both the capacity and capabilities of the TeraGrid.
Demonstration of Interactive Simulation, Analysis, and Visualization of Fluid Turbulence Using Distributed TeraGrid Resources
D. Porter
Abstract not available.
iShare—Bringing the TeraGrid to the User's Desktop
A. Basumallik, X. Ren, R. Eigenmann, S. Goasguen
iShare is an Internet-sharing system that supports end-users and providers of computing resources (applications, data and hardware). iShare allows providers to disseminate resources and users to access them in a way that allows open participation. A fully decentralized organization for resource dissemination is enabled via the integration of a peer-to-peer (P2P) system and web standards such as XML and RDF. iShare has an open extensible architecture that allows different access mechanisms and protocols to be plugged in. It delivers a desktop-based environment for publishing and using remote resources, which decouples the computing environment perceived by end users from the underlying physical platforms. This paper describes the iShare plug-ins, implemented using the Java Commodity Grid (CoG) Kit, that enable the sharing and use of TeraGrid resources. iShare allows certificate based authentication, remote job submission using GRAM and file transfers using GridFTP. An SRB plug-in also allows users to access data collections across distributed and heterogeneous platforms. Together, these plug-ins enable the end-user to effectively discover and use TeraGrid resources from a desktop.
Predicting Bounds on the Batch Queuing Delay Experienced by Individual TeraGrid User Jobs in Real Time
R. Wolski, R. Garver, D. Nurmi, J. Brevik
In this talk we present a new method for providing TeraGrid end-users with real-time predictions of the bounds on queuing delay individual jobs will experience when waiting to be scheduled to a machine partition. Predicting the delay users will experience while waiting for their jobs to be scheduled is a problem that has been studied by the academic and commercial HPC communities for some time. Our approach, based on a new statistical methodology, predicts bounds on the waiting time (upper or lower) that individual jobs will experience with quantified confidence measures. Thus the predictions made by this system constitute a statistical guarantee of best-case and worst-case waiting delay where the confidence measure quantifies the quality of the guarantee.
We have implemented this new methodology as part of the Network Weather Service and deployed it on TeraGrid, where it currently provides real-time bounds predictions. In the talk we will report on the effectiveness of the system, which has been in operation as a prototype for approximately 8 months. We will discuss the methodology and its evaluation using batch-queue logs spanning 10 years at the NSF and open DOE supercomputer centers. We will also demonstrate the web interface to the system and make "live" predictions of TeraGrid delay bounds during the presentation from the web page located at http://nws.cs.ucsb.edu/batchq, and we will detail the operation of a set of command-line tools that are portable among all ETF architectures.
Our results show that it is possible to predict delay bounds with specified confidence levels for individual jobs in different queues, and for jobs requesting different ranges of processor counts and different maximum execution delays. Using these predictions, users with roaming allocations or with allocations at multiple TeraGrid sites can choose the machine that is most likely to minimize turn-around time. Users can also determine the probability that a job will meet a specified deadline in a particular queue. Finally, the system is portable to all ETF architectures, making it possible for users to consider the use of heterogeneous resources, and to predict which is most likely to impose the shortest waiting time for their jobs.
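To give a flavor of what a bound with a quantified confidence measure can look like, the sketch below uses a textbook nonparametric technique: pick an order statistic of past wait times using the binomial distribution. The abstract does not spell out the authors' statistical methodology, so this should be read as an assumed stand-in, not as their method. It also assumes the historical waits are independent draws from one unchanging distribution, which real queue logs may violate. All function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(k + 1))


def upper_bound_rank(n: int, quantile: float,
                     confidence: float) -> Optional[int]:
    """Smallest 1-based rank k such that the k-th smallest of n samples is
    at least the given quantile with at least the given confidence.
    Returns None when n is too small for any rank to qualify."""
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, quantile) >= confidence:
            return k
    return None


def predict_upper_bound(waits: Sequence[float], quantile: float = 0.95,
                        confidence: float = 0.95) -> Optional[float]:
    """Upper bound on the chosen wait-time quantile, or None if too few samples."""
    k = upper_bound_rank(len(waits), quantile, confidence)
    return None if k is None else sorted(waits)[k - 1]


if __name__ == "__main__":
    # Made-up history: 59 past waits of 1..59 minutes.
    history = list(range(1, 60))
    print(predict_upper_bound(history))
    # With fewer samples no bound can be stated at this confidence level.
    print(predict_upper_bound(history[:58]))
```

The rank follows from the fact that the k-th smallest sample lies at or above the quantile exactly when at most k - 1 samples fall below it, and that count is binomially distributed under the independence assumption.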
vGrid—On-demand Virtual Supercomputing
T. Stef-Praun, S. Goasguen, K. Madhavan
One of the key problems facing educators, students, and beginning scientists is the high barrier of entry that supercomputing and advanced cyberinfrastructure like the TeraGrid represent. In this paper, we discuss the implementation details of a middleware tool funded by the NSF NMI effort that attempts to simplify access and learning of advanced cyberinfrastructure and supercomputing. vGrid—an infrastructure for on-demand virtual supercomputing—was successfully used by over 140 participants as part of the Supercomputing 2005 Education Program. vGrid is a grid system built on virtual resources. This implementation addresses issues of resource allocation to novice users, and provides several benefits such as dedicated grids, with isolation, QoS, and simple control and management.
Computational cycles, storage and bandwidth are in high demand, and that demand increases with the introduction of supercomputing to novice users and to the general public through informal science efforts. The vGrid solution integrates disparate heterogeneous resources into a loosely coupled grid system. Such a computational grid presents standard access interfaces to the user (including authentication, access, execution, and storage), creating the illusion of a supercomputer or the TeraGrid.
The main problem that large, complex systems with many users have to address is the fair and efficient allocation and sharing of resources. In the case of grids, users and their jobs need to be managed in a way that maximizes both their experience in the system and the system's resource allocation efficiency. This is generally a very complex problem: the requirements for each job submitted by a user vary greatly in terms of resource and timing needs, and the impact of a set of jobs sharing the same hardware resource makes it almost impossible to guarantee desired levels of quality of service.
Because the allocation problem is NP-hard, computing optimal allocations for dynamic systems, in which users and resources can arrive and leave at any time, can incur overhead that exceeds the utility of the allocation. There have been several efforts to address this problem, and the classic solution is to accept reservations for the resources. Several more efficient and advanced solutions are suggested by economics and build markets where users can compete through bidding to acquire resources. While the outcome of such an implementation is the closest to the ideal, making a market functional implies handling payment strategies and currencies, which poses a high barrier in the case of automated systems and for unsophisticated users. vGrid is a first attempt to address these issues. This paper will discuss them and provide a real-time demonstration of vGrid.
The Application Hosting Environment: Lightweight Middleware for Grid Based Computational Science
P. V. Coveney, S. K. Sadiq, R. Saksena, M. McKeown, S. Pickles, S. J. Zasada
Current grid computing technologies have often been seen as too heavyweight and unwieldy from a client perspective, requiring complicated installation and configuration steps that are too time consuming for most end users. This has led many of the people who would benefit most from grid technology, namely computational scientists, to avoid using it. In response, we have developed the Application Hosting Environment, a lightweight, easily deployable environment designed to allow the scientist to quickly and easily run unmodified applications on distributed grid resources. We do this by building a layer of middleware on top of existing technologies such as Globus and exposing the functionality as web services using the WSRF::Lite toolkit. The scientist can start and manage an application via these services, with the extra layer of middleware abstracting the details of the particular underlying grid resource in use. We show how the flexibility of this design has allowed us to create complex workflows when using grid infrastructure to investigate the molecular dynamics of HIV-1 protease.