A Checkpoint and Recovery System for the Pittsburgh Supercomputing Center Terascale Computing System

Nathan Stone, John Kochmar, Raghurama Reddy, J. Ray Scott, Jason Sommerfield, Chad Vizino
Pittsburgh Supercomputing Center, Pittsburgh, PA

Abstract

The Pittsburgh Supercomputing Center (PSC) has designed, implemented and deployed a user-level checkpoint and recovery system for the Terascale Computing System (TCS), a Compaq AlphaServer SC platform managed by the open version of the Portable Batch Scheduler (OpenPBS). This checkpoint system allows for the automated recovery of jobs following both machine failures and scheduled maintenance periods. As an added feature, any time lost by a user's process because of a machine failure (the time between the last checkpoint and the failure) can be automatically credited back to the user's allocation. The system can also be used to maximize resource utilization in time periods approaching scheduled maintenance. Thus, it improves both the usability and the administration of the TCS. It is accessible to C, C++ and Fortran-based jobs and is implemented in a relatively portable manner. This article describes its design goals, logical operations, and implementation details.

Overview

TCS, like many of today's terascale clusters, consists of hundreds of interdependent nodes assembled from commercial, off-the-shelf (and otherwise independent) components. One of the dominant design goals of this installation is to support single parallel jobs that utilize the entire machine. As such, the probability that any node will fail for any reason, abnormally terminating a user's job, is sufficiently high that the mean time to failure of the machine as a whole can be shorter than typical run times for these large jobs. This environment mandates that system architects address automated job recovery. Some platforms (e.g. Cray Unicos, SGI IRIX, NEC SX) provide a system-level checkpoint feature, principally to facilitate job interruption and migration or to protect against CPU failure. In this article we present original work whose principal goal is to address not merely CPU failure but the potential failure of hundreds of CPUs, file systems, networks, and independent operating systems. PSC has created and deployed a user-level checkpoint infrastructure that addresses these issues and facilitates automated recovery of users' jobs through interactions with the batch scheduling system.

This system was designed with three primary requirements: job recoverability, minimal effect on job running time, and scalable, sustainable resource management. Given the nature of the issues addressed by such a system, we would also like to use it to maximize machine utilization, e.g. in the event of so-called drain operations, where the machine is cleared of running jobs. Minimizing the time required to create checkpoint files is essential to preserve the efficiency of the resource; our implementation goal is to spend no more than 5% of a job's wall-clock time within the checkpoint library. In our early design discussions, we concluded that high-end users desiring to run on thousands of processors were (a) typically already using some kind of user-level checkpointing, and (b) willing to invest some effort to ensure the recoverability of these unusually large jobs. Therefore we designed an infrastructure that requires the user to make slight modifications to both source code and job scripts, as discussed below.
User library & tools The Application Program Interface (API) for the PSC s checkpoint library (see Appendix A) is written in such a way as to be directly applicable to C, C++, and Fortran developers. Users must include these functions in their source code to access the functionality offered by the checkpoint system (see sample code in Appendix B). If the user utilizes this library and invokes the recovery properly in the job script (see sample script in Appendix C), then the integrated job recovery infrastructure will ensure that the job will be automatically recovered and restarted. The user can also utilize a set of shell tools (see list in Appendix D) in the job script or in interactive sessions, which have been provided for certain non-programmatic tasks. Although users could write their own checkpoint algorithms, such checkpoint files can be lost or unavailable in the event of machine failure. The explicit advantage of this custom checkpoint library is that files written via this system are inherently recoverable and automatically migrated to the proper nodes when job restart is requested. Checkpoint files written via the checkpoint library have a well-defined and easily understood format, which allows for later direct re-use by applications outside of the checkpoint system. Users can thus leverage the use of checkpoint files for other purposes, e.g. visualization files or job status monitoring. The checkpoint library contains some job monitoring functionality as well as checkpoint and recovery. A user can invoke the tcs_postmessage_ function (see Appendix A) to track the execution of a running application. Messages can be posted from any combination of processes in an active job. All messages passed to the tcs_postmessage_ function will be displayed on the web-based monitoring applet showing the TCS machine status (see Furthermore, by formatting the message appropriately, the user can direct that the message is also sent out to an arbitrary address. This could enable users to track the status of running jobs via a PDA or other wireless devices. Checkpoint Plans PSC's implementation of the checkpoint system allows the user to choose from among several possible checkpoint strategies, or Plans . Each Plan is identified by an alphanumeric string (e.g. C1 ) for convenience. To select a specific Plan, the user can set the TCS_PLAN environment variable in the job script to one of the alphanumeric values listed below. Some Plans have additional parameters that the user can use to control the precise behavior of the checkpoint. These are listed below in the descriptions of each Plan. If the environment variables specified below are undefined then the checkpoint system will supply default values, also identified below. The default values have been chosen to provide the best all-around performance while also assuring the correct behavior of the checkpoint. To date, in all Plans the primary copies of checkpoint files are written to local disk on the compute nodes. Different Plans achieve robustness by backing up these primary copies in different ways. There may, however, be future implementations in which primary checkpoint files are written directly to remote servers (i.e. I/O redirection Plans) thus avoiding the potential loss due to compute node failure just after the checkpoint has completed but before the files have been redundantly backed up. Plan B2 : Duplication to File Server Plan B2 is a direct, brut force approach. 
Plan "B2": Duplication to File Server

Plan B2 is a direct, brute-force approach. Under Plan B2 every primary checkpoint file is copied to a remote file server immediately after closure. Within the tcs_close_ implementation, a request is passed to the TCS I/O daemon on the (local) compute node requesting that the checkpoint file be replicated to a file server. The user's job then returns to computation while the local daemon performs the third-party file migration on the user's behalf. Primary checkpoint files that are unavailable due to machine failure are thus recovered from their duplicates on the file server.

This Plan achieves the maximum effective bandwidth to disk, utilizing the system's buffered I/O. It also imposes no limits on the size of buffers passed to the tcs_write_ function. Furthermore, since file migration is done by a separate process (the TCS I/O daemon), computation is minimally impacted by I/O operations. However, this Plan doubles the disk space requirements of the checkpoint system, and there is an implicit latency (the file migration time) during which a successfully checkpointed program could lose a node and be forced to recover from a previous checkpoint.

Plan "C1": XOR-based Parity File

Plan C1 is an implementation that makes use of unused computing cycles during an I/O operation. Under Plan C1, parity files are calculated via a bit-wise XOR of checkpoint data. On every processor, data are written to the primary checkpoint file via asynchronous I/O while the processor also participates in the parity calculation. The set of processors in the job is divided into XOR sets spanning N processes, each on a different node, where N is a configurable parameter. The value of N can be set via the environment variable TCS_NODES_XOR; its default value is 8. Thus, a job can be recovered if no more than one out of every N nodes (per XOR set) has failed. During the write operation, the first process in each XOR set performs an MPI_Reduce operation, calculating the XOR value and writing the resulting data to an XOR file, which is stored on a remote file server. The data from any missing node can be regenerated by the reverse XOR calculation, combining the N-1 surviving primary checkpoint files and the XOR file for that set.

By calculating only one XOR file per TCS_NODES_XOR files, the disk utilization overhead of this scheme is minimized. However, this Plan requires sufficient free memory to allocate an additional XOR buffer as large as the buffer passed to the tcs_write_ function. Thus, if a user were utilizing 3 GB of memory on each node, a write operation with chunks much larger than 1 GB would result in a failure, since our compute nodes are currently configured with only 4 GB of physical memory. Furthermore, the asynchronous I/O library (libaio.so) cannot make use of the operating system's buffered I/O, so it achieves apparent write bandwidths more characteristic of direct I/O.

Implementation Details

The design goals of this project are sufficiently broad in scope that an in-depth discussion of implementation details is merited. We address here several facets of the implementation in an effort to provide greater technical clarity on some of the features discussed above. All components of this system have been written in C, C++ or Perl in such a way as to be highly portable. The presence of a locally-accessible SQL database is required. The high-performance bulk-data transport layer is, however, machine- and interconnect-specific.
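Conceptually, the Plan C1 parity computation is a bitwise XOR reduction across the members of one XOR set. The fragment below is a minimal MPI sketch of that idea, assuming a sub-communicator xor_comm containing the N processes of a set; it illustrates the reduction only and is not the library's internal code, which overlaps this calculation with asynchronous I/O as described above.

    /* Minimal sketch of the Plan C1 parity idea: the members of one XOR set
     * combine their checkpoint buffers with a bitwise XOR; rank 0 of the set
     * then holds the parity block that would be written to the XOR file on a
     * remote file server.  The communicator name xor_comm is an assumption
     * for this example, not part of the TCS checkpoint library. */
    #include <mpi.h>
    #include <stdlib.h>

    void reduce_parity_block(const unsigned char *chunk, int nbytes,
                             MPI_Comm xor_comm)
    {
        int set_rank;
        unsigned char *parity = NULL;

        MPI_Comm_rank(xor_comm, &set_rank);
        if (set_rank == 0)
            parity = malloc(nbytes);

        /* Bitwise XOR of the checkpoint data across the N set members.  Any
         * single missing member can later be regenerated by XOR-ing the N-1
         * surviving chunks with this parity block. */
        MPI_Reduce((void *)chunk, parity, nbytes, MPI_BYTE, MPI_BXOR,
                   0, xor_comm);

        if (set_rank == 0) {
            /* ... write 'parity' to the XOR file on a remote file server ... */
            free(parity);
        }
    }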
Administration

The installation and version management of this system software is achieved by Depot, a management tool developed at Carnegie Mellon University. It automatically distributes all of the installation files to all of the file systems involved.

State information for the checkpoint system as a whole is maintained in a central Mini SQL database. This allows read-only access for both users and operators from all nodes within the system, as well as write access for root processes. The checkpoint system stores and manages a significant amount of state information in this database, which is shared with the Resource Management System (RMS). The presence of available checkpoint files, checkpoint activity and authentication information, checkpoint system performance, and other job- and status-dependent variables are all stored in this central database.

The system requires the presence of daemons (tcsiod) running on each compute node and file server. These daemons serve in the authentication and transport operations to provide asynchronous file migration for active jobs, access to the native Quadrics transport layer, access to remote, locally-attached storage, read/write access to the central database, and other implementation-specific functionality. A process control script for this daemon must reside in /sbin/init.d, with appropriate symbolic links in /sbin/rc.d, on each compute node and file server as well. There is a command-line tool to verify the presence of responsive daemons throughout the system (see Appendix D).

Job- and daemon-related activity, including error and diagnostic conditions, is logged via the UNIX syslog utility. These log messages are configured to log not only to each node (localhost) but also to a centralized management server. In this way near real-time status messages are collected and filtered to provide an additional diagnostic of the checkpoint/recovery system.

Security

Security in this system is primarily concerned with access permissions for both the database, since state information for each job is stored there, and the checkpoint files. This system uses the standard UNIX file system protection scheme when handling or providing access to any and all checkpoint files. Thus, when reading or writing files via the TCS checkpoint library API (see Appendix A), the protection presented by the file system is sufficient. Regarding created-file permissions, ownership of migrated files can either follow the UID of the requestor, given sufficient permissions to read the file (as with the UNIX cp command), or the stat attributes of the original file can be preserved. Since third-party operations are performed by the daemon on behalf of the user, additional care is needed. In such instances the user's process, utilizing an encapsulated, object-oriented interface to the library, passes its UID/GID to the daemon with any requests. The UID/GID received by the daemon is then used in determining further access permissions.

Communication

The communication layer consists of a single C++ object ("Connection") that currently encapsulates standard TCP socket communications. It is highly portable and has been used in other projects and on many other platforms. This communication class is used to handle the passing of any and all meta-data, command invocation requests, and return parameters. It is an ultra-lightweight schema patterned after the CORBA model.
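As an illustration of the kind of control-channel traffic this implies, the sketch below shows a hypothetical request record, in C, that a user's process might fill in when asking tcsiod to migrate a checkpoint file on its behalf. The real Connection class is a C++ object and its wire format is not documented here, so every field name in this example is an assumption.

    /* Hypothetical control-channel request illustrating how a process could
     * forward its identity (UID/GID) to tcsiod along with a third-party file
     * migration request.  This is not the actual Connection wire format. */
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct tcs_request {            /* hypothetical message layout */
        uid_t uid;                  /* identity the daemon uses for access checks */
        gid_t gid;
        int   command;              /* e.g. "replicate this checkpoint file" */
        char  path[1024];           /* primary checkpoint file to migrate */
    };

    static void build_migrate_request(struct tcs_request *req, const char *path)
    {
        memset(req, 0, sizeof(*req));
        req->uid = getuid();        /* daemon re-derives permissions from these */
        req->gid = getgid();
        req->command = 1;           /* hypothetical opcode for file migration */
        strncpy(req->path, path, sizeof(req->path) - 1);
    }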
In the case of bulk data transport, the Connection class is used to handle only the metadata, while the transport itself is handled by a second custom communication library developed at PSC, based on the native Quadrics Elan3 communication library. This library has been benchmarked at greater than 200 Mbytes/sec transport rates utilizing a single rail of the computational interconnect. Given these two communication vehicles, our infrastructure employs a highly portable control channel and a high-performance data channel.

Storage Architecture

Our system includes 64 file servers that are each equipped with 500 GB of disk storage for user files. This storage is presented via direct-attached SCSI. Each SCSI chain is attached to two file servers, exploiting a multi-initiator SCSI configuration. In this manner users are able to retrieve files from the user file system even if one of the file-serving nodes fails. The TCS system was designed to utilize up to 64 such file servers, whose function can be dynamically controlled. Specifically, file servers can be included in the compute server pool and thus used to run user jobs, or they can be excluded from this pool of compute nodes, thus serving I/O operations exclusively. If a queued job has high I/O requirements, the system can schedule it for a time when all 64 file servers are held free of running jobs. This is an implementation feature of our customizations to the Portable Batch Scheduler (PBS; see below).

Key aspects of the storage architecture are configurable via the contents of a configuration file (/etc/tcsiod.conf), e.g. the host and path specification of primary checkpoint files, parity files, redundant file server fail-over, and the amount of performance logging. Following is a sample segment of that file:

    # define complex path for local (primary) checkpoint files
    TCS_LOCALCHKDIR    {local}:/local1/tcschk,/local2/tcschk
    # define complex path for central (duplicate) checkpoint files
    TCS_CENTRALCHKDIR  lemieux[0-32](n+1):/usr/tcschk

All complex paths specified in this configuration file have the format:

    hostexp(redundancy):full_path1,full_path2

The host expression (hostexp) can either be an RMS-style regular expression, as shown in the value of TCS_CENTRALCHKDIR, which can be expanded into a list of node names, or a generic host specification surrounded by braces, as shown in the value of TCS_LOCALCHKDIR. We currently support two generic specifications: {local}, meaning the actual name of localhost (via gethostname()); and {cluster}, meaning any node in the TruCluster of which this node is a member. Compaq TruCluster is a software management layer over the Compaq Tru64 operating system that clusters nodes together into groups of up to 32 nodes and provides some failover infrastructure and shared file systems among those nodes.

The complex path can include a redundancy specification. The redundancy specifies, for example in the n+1 case shown, that file servers are paired together as node0:node1, node2:node3, node4:node5, etc. This type of redundancy can be achieved via multi-initiator SCSI between two file-serving nodes. In the event of a file-serving node failure, any process attempting to retrieve a checkpoint file has a straightforward path for attempting retrieval from an alternative node: its redundant partner.
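To make the complex-path syntax concrete, the following C sketch splits a specification of the form hostexp(redundancy):full_path1,full_path2 into its three parts. It is only an illustration of the format described above; the function name and buffer choices are hypothetical and do not reflect the parser inside tcsiod.

    /* Illustrative parser for the complex path format
     *     hostexp(redundancy):full_path1,full_path2
     * e.g. "lemieux[0-32](n+1):/usr/tcschk"
     *  or  "{local}:/local1/tcschk,/local2/tcschk".
     * This sketches the format only, not the tcsiod implementation. */
    #include <stdio.h>
    #include <string.h>

    int parse_complex_path(const char *spec, char *hostexp,
                           char *redundancy, char *paths)
    {
        const char *colon = strchr(spec, ':');   /* host part vs. path list */
        if (colon == NULL)
            return -1;

        strcpy(paths, colon + 1);                /* comma-separated path list */

        /* Host part, with an optional "(redundancy)" suffix such as "(n+1)". */
        char hostpart[256];
        size_t hostlen = (size_t)(colon - spec);
        if (hostlen >= sizeof(hostpart))
            return -1;
        memcpy(hostpart, spec, hostlen);
        hostpart[hostlen] = '\0';

        char *paren = strchr(hostpart, '(');
        if (paren != NULL) {
            sscanf(paren, "(%[^)])", redundancy); /* e.g. "n+1" */
            *paren = '\0';
        } else {
            redundancy[0] = '\0';
        }
        strcpy(hostexp, hostpart);   /* e.g. "lemieux[0-32]" or "{local}" */
        return 0;
    }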
In each of the host specifications described above, if the local host is a member of the expanded host list then it is designated as the target node, i.e. the node to which checkpoint files will be written. All other nodes in the expanded list serve as failover nodes for reading back those files. If the local host is not a member of the expanded host list, then the target node is chosen by a load-balancing scheme employing the requesting agent's host ID (the numeric part of its host name) modulo the number of values in the expanded host list.

There is a further load-balancing feature. If the target host is the same as localhost, then the comma-separated list of full paths is compared according to available space, and the path with the most free space is chosen. Otherwise, the target path is selected by the same load-balancing scheme described in the previous paragraph. These load-balancing features allow for an extremely flexible and configurable system at the cost of only a few configuration-parsing functions.

Performance Monitoring

The checkpoint library implementation includes internal diagnostic and performance monitoring. I/O traffic involving the checkpoint library is logged to the central SQL database when checkpoint files are closed. The library currently tracks the time spent in open, read, write and close operations, as well as the amount of data read and written and an error mask indicating any errors encountered by the user. Thus administrators can obtain valuable usage and performance information for the library on a per-file basis. Administrators can completely deactivate this feature at any time by modifying the value of a parameter in the TCS I/O configuration file (/etc/tcsiod.conf).

Scheduling

Integration with the scheduling system is a critical aspect of the checkpoint and recovery system. PSC modified its port of the open version of the Portable Batch Scheduler (OpenPBS) to enable automated job recovery. Significant effort was invested to port OpenPBS to the Compaq AlphaServer SC platform, integrating it into RMS. That effort leverages the presence of the RMS database, as does the checkpoint system. User interaction with the compute nodes is all controlled and monitored by OpenPBS.

OpenPBS provides convenient customization points. Specifically, it allows for the execution of prologue and epilogue scripts that run immediately before and after each submitted job, respectively. The OpenPBS prologue script assists in the initialization of the checkpoint system by invoking the tcsinit shell tool (see Appendix D). The OpenPBS epilogue script triggers job recovery. This script queries several status values from the central database, including the RMS job status (the status resulting from running the prun command).
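Finally, the load-balancing rules described under Storage Architecture can be summarized in code. The sketch below, with hypothetical helper functions, restates the two rules: the target host is chosen by the requesting host's numeric ID modulo the length of the expanded host list, and when the target is the local host, the configured path with the most free space is used (queried here via statvfs). It illustrates the stated rules only and is not the tcsiod implementation.

    /* Sketch of the two load-balancing rules from the Storage Architecture
     * section.  Helper names are hypothetical, not tcsiod internals. */
    #include <sys/statvfs.h>

    /* Target host: numeric part of the requesting host's name, modulo the
     * number of entries in the expanded host list. */
    int choose_target_host(int host_id, int nhosts)
    {
        return host_id % nhosts;
    }

    /* Local writes: pick the configured path with the most free space. */
    const char *choose_local_path(const char *paths[], int npaths)
    {
        const char *best = paths[0];
        unsigned long long best_free = 0;

        for (int i = 0; i < npaths; i++) {
            struct statvfs vfs;
            if (statvfs(paths[i], &vfs) == 0) {
                unsigned long long avail =
                    (unsigned long long)vfs.f_bavail * vfs.f_frsize;
                if (avail > best_free) {
                    best_free = avail;
                    best = paths[i];
                }
            }
        }
        return best;
    }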