Information Security

National Nuclear Security Administration Needs to Improve Contingency Planning for Its Classified Supercomputing Operations GAO ID: GAO-11-67 December 9, 2010

In the absence of underground nuclear weapons testing, the National Nuclear Security Administration (NNSA) relies on its supercomputing operations at its three weapons laboratories to simulate the effects of changes to current weapons systems, calculate the confidence of future untested systems, and ensure military requirements are met. GAO was requested to assess the extent to which (1) NNSA has implemented contingency and disaster recovery planning and testing for its classified supercomputing systems, (2) the laboratories are able to share supercomputing capacity for recovery operations, and (3) NNSA tracks the costs for contingency and disaster recovery planning for supercomputing assets. To do this work, GAO examined contingency and disaster recovery planning policies and activities, and analyzed classified supercomputing capabilities at the weapons laboratories, and NNSA budgetary data.

All three NNSA weapons laboratories--Los Alamos, Sandia, and Lawrence Livermore--have implemented some components of a contingency planning and disaster recovery program. NNSA, however, has not provided effective oversight to ensure that the laboratories have comprehensive and effective contingency and disaster recovery planning and testing. Further, due to lack of planning and analysis by NNSA and the laboratories, the impact of a system outage is unclear. Only one of the three laboratories--Los Alamos--had conducted a business impact analysis to assess the criticality of resources and acceptable outage time frames; yet, NNSA and all three laboratories consider the consequence associated with the loss of system availability to be low impact and do not consider the classified supercomputers to be mission critical. Nonetheless, NNSA classified supercomputing capabilities serve as a computational surrogate to nuclear weapons testing and are used to address other areas of national security. Despite the absence of business impact analyses, all laboratories had key components of a contingency planning program in place. However, shortcomings existed. For example, all laboratories had backup processes in place and had developed contingency plans, but the plans were not comprehensive. Specifically, one plan did not address the supercomputing operations, and none of the plans had been tested at the time of GAO's review. In addition, the laboratories addressed disaster recovery to a limited extent, but not specifically for the supercomputers. These shortcomings existed, at least in part, because NNSA's component organizations, including the Office of the Chief Information Officer, were unclear about their roles and responsibilities for providing oversight in the laboratories' implementation of contingency and disaster recovery planning. 
Until the agency fully implements a contingency and disaster recovery planning program for its weapons laboratories, it has limited assurance that vital information can be recovered and made available to meet national security priorities and requirements. Although the laboratories have the technological capability to share supercomputing capacity across all three weapons laboratories, barriers exist that could impede recovery operations. For example, the laboratories do not know the minimum supercomputing capacity needed to meet program requirements, such as simulating the effects of changes to weapons systems, should a disruption occur. In addition, the laboratories have not tested the technological capability to share the capacity on an on-demand basis for recovery operations. Without having an understanding of capacity needs and subsequent testing, the laboratories have little assurance that they could effectively share capacity if needed. Although NNSA obligated approximately $1.7 billion to help implement its classified supercomputing program from fiscal years 2007 through 2009, the agency has not tracked costs for contingency and disaster recovery planning and is uncertain of actual funds that were spent toward these efforts. GAO recommends, among other things, that NNSA clearly define roles and responsibilities for its component organizations in providing oversight for contingency and disaster recovery planning for the classified supercomputing environment. NNSA agreed with most of GAO's recommendations, but did not concur with recommendations relating to capacity planning and cost tracking.

Recommendations

Our recommendations from this work are listed below with a Contact for more information. Status will change from "In process" to "Open," "Closed - implemented," or "Closed - not implemented" based on our follow up work.

Director: Gregory C. Wilshusen Team: Government Accountability Office: Information Technology Phone: (202) 512-6244


GAO-11-67, Information Security: National Nuclear Security Administration Needs to Improve Contingency Planning for Its Classified Supercomputing Operations. United States Government Accountability Office: GAO: Report to Congressional Requesters: December 2010: Information Security: National Nuclear Security Administration Needs to Improve Contingency Planning for Its Classified Supercomputing Operations: GAO-11-67: GAO Highlights: Highlights of GAO-11-67, a report to congressional requesters.
Why GAO Did This Study: In the absence of underground nuclear weapons testing, the National Nuclear Security Administration (NNSA) relies on its supercomputing operations at its three weapons laboratories to simulate the effects of changes to current weapons systems, calculate the confidence of future untested systems, and ensure military requirements are met. GAO was requested to assess the extent to which (1) NNSA has implemented contingency and disaster recovery planning and testing for its classified supercomputing systems, (2) the laboratories are able to share supercomputing capacity for recovery operations, and (3) NNSA tracks the costs for contingency and disaster recovery planning for supercomputing assets. To do this work, GAO examined contingency and disaster recovery planning policies and activities, and analyzed classified supercomputing capabilities at the weapons laboratories, and NNSA budgetary data.

What GAO Found: All three NNSA weapons laboratories--Los Alamos, Sandia, and Lawrence Livermore--have implemented some components of a contingency planning and disaster recovery program. NNSA, however, has not provided effective oversight to ensure that the laboratories have comprehensive and effective contingency and disaster recovery planning and testing. Further, due to lack of planning and analysis by NNSA and the laboratories, the impact of a system outage is unclear. Only one of the three laboratories--Los Alamos--had conducted a business impact analysis to assess the criticality of resources and acceptable outage time frames; yet, NNSA and all three laboratories consider the consequence associated with the loss of system availability to be low impact and do not consider the classified supercomputers to be mission critical. Nonetheless, NNSA classified supercomputing capabilities serve as a computational surrogate to nuclear weapons testing and are used to address other areas of national security.
Despite the absence of business impact analyses, all laboratories had key components of a contingency planning program in place. However, shortcomings existed. For example, all laboratories had backup processes in place and had developed contingency plans, but the plans were not comprehensive. Specifically, one plan did not address the supercomputing operations, and none of the plans had been tested at the time of GAO's review. In addition, the laboratories addressed disaster recovery to a limited extent, but not specifically for the supercomputers. These shortcomings existed, at least in part, because NNSA's component organizations, including the Office of the Chief Information Officer, were unclear about their roles and responsibilities for providing oversight in the laboratories' implementation of contingency and disaster recovery planning. Until the agency fully implements a contingency and disaster recovery planning program for its weapons laboratories, it has limited assurance that vital information can be recovered and made available to meet national security priorities and requirements.

Although the laboratories have the technological capability to share supercomputing capacity across all three weapons laboratories, barriers exist that could impede recovery operations. For example, the laboratories do not know the minimum supercomputing capacity needed to meet program requirements, such as simulating the effects of changes to weapons systems, should a disruption occur. In addition, the laboratories have not tested the technological capability to share the capacity on an on-demand basis for recovery operations. Without having an understanding of capacity needs and subsequent testing, the laboratories have little assurance that they could effectively share capacity if needed.
Although NNSA obligated approximately $1.7 billion to help implement its classified supercomputing program from fiscal years 2007 through 2009, the agency has not tracked costs for contingency and disaster recovery planning and is uncertain of actual funds that were spent toward these efforts.

What GAO Recommends: GAO recommends, among other things, that NNSA clearly define roles and responsibilities for its component organizations in providing oversight for contingency and disaster recovery planning for the classified supercomputing environment. NNSA agreed with most of GAO's recommendations, but did not concur with recommendations relating to capacity planning and cost tracking.

View [hyperlink, http://www.gao.gov/products/GAO-11-67] or key components. For more information, contact Gregory C. Wilshusen at (202) 512-6244 or wilshuseng@gao.gov, Gene Aloise at (202) 512-3841 or aloisee@gao.gov, or Naba Barkakati at (202) 512-6415 or barkakatin@gao.gov.

[End of section]

Contents:

Letter:
Background:
NNSA Has Not Fully Implemented Contingency and Disaster Recovery Planning and Testing for Its Classified Supercomputing Assets:
The Laboratories Have the Ability to Share Supercomputing Capacity, but Barriers Exist:
NNSA Does Not Track the Costs for Ensuring Contingency and Disaster Recovery Planning for Its Supercomputing Assets:
Conclusions:
Recommendations for Executive Action:
Agency Comments and Our Evaluation:
Appendix I: Objectives, Scope, and Methodology:
Appendix II: NNSA Annual Obligations for Its Advanced Simulation and Computing Program, Fiscal Years 2007 through 2009:
Appendix III: Comments from the National Nuclear Security Administration:
Appendix IV: GAO Contacts and Staff Acknowledgments:

Table:

Table 1: Inventory of NNSA-Deployed Classified Supercomputing Systems (as of October 2010):

Figures:

Figure 1: Common Hardware Components of a Supercomputing System:
Figure 2: NNSA's Classified Supercomputing Network Infrastructure:
Figure 3: Total Usable Supercomputing Capacity at Each Weapons Laboratory, 2010 and 2011:
Figure 4: Annual Obligations for NNSA's Advanced Simulation and Computing Program, Fiscal Years 2007 through 2009:

Abbreviations:

ASC: Advanced Simulation and Computing:
BIA: business impact analysis:
CNSS: Committee on National Security Systems:
DISCOM: Distance Computing:
DOD: Department of Defense:
DOE: Department of Energy:
FISMA: Federal Information Security Management Act of 2002:
FLOPS: floating-point operations per second:
Livermore: Lawrence Livermore National Laboratory:
Los Alamos: Los Alamos National Laboratory:
NIST: National Institute of Standards and Technology:
NNSA: National Nuclear Security Administration:
Sandia: Sandia National Laboratories:

[End of section]

United States Government Accountability Office:
Washington, DC 20548:

December 9, 2010:

The Honorable Henry Waxman:
Chairman:
Committee on Energy and Commerce:
House of Representatives:

The Honorable Edward J. Markey:
Chairman:
Subcommittee on Energy and the Environment:
Committee on Energy and Commerce:
House of Representatives:

The Honorable Bart Stupak:
Chairman:
Subcommittee on Oversight and Investigations:
Committee on Energy and Commerce:
House of Representatives:

The National Nuclear Security Administration[Footnote 1] (NNSA) provides classified supercomputing capabilities for assessing the performance of nuclear weapons. In the absence of nuclear weapons testing--which ceased in 1992--the simulation capabilities of NNSA's supercomputers are a necessary means to determine the effects of changes to current weapons systems and to determine a level of confidence in the performance of future untested systems.[Footnote 2] These simulation capabilities also contribute to the enhancement of NNSA's ability to predict the performance of weapons systems to ensure the systems meet all military requirements established by the Department of Defense (DOD).
NNSA's three nuclear weapons laboratories--Los Alamos National Laboratory (Los Alamos) in New Mexico, Lawrence Livermore National Laboratory (Livermore) in California, and the Sandia National Laboratories (Sandia) with locations in New Mexico and California--use these supercomputing simulation capabilities to obtain a comprehensive understanding of the entire nuclear weapons life cycle, from design to safe processes for dismantlement. These classified supercomputing capabilities are a considerable investment and serve as a cornerstone for NNSA's Stockpile Stewardship Program.[Footnote 3] In addition, classified supercomputing capabilities are essential for informing critical decisions related to the nuclear stockpile, including all stockpile modernization and warhead studies. NNSA classified supercomputing capabilities are also used to address other areas of national security, including intelligence analyses, nuclear forensics, and emergency response. Because of the importance of these classified supercomputing capabilities to issues central to national security, contingency and disaster recovery planning[Footnote 4] are key to ensuring that, when unexpected events occur, NNSA can recover and reconstitute its classified supercomputing systems, data, and operations. Our objectives were to assess the extent to which (1) NNSA has implemented contingency and disaster recovery planning and testing for its classified supercomputing assets, (2) the three laboratories are able to share classified supercomputing capacity for recovery operations, should service disruptions occur, and (3) NNSA tracks the costs for ensuring contingency and disaster recovery planning for its classified supercomputing assets. To accomplish these objectives, we examined contingency and disaster recovery planning controls for the systems within the classified supercomputing environment that are a necessary means for NNSA's achievement of its nuclear weapons mission. 
In addition, we performed technical assessments of classified supercomputing capabilities at each weapons laboratory, including each laboratory's ability to share supercomputing capacity. Further, we obtained information from NNSA and laboratory officials to determine how expenditures were tracked for contingency and disaster recovery planning of the classified supercomputing systems at each of the laboratories, as well as projected future cost estimates for ensuring the recovery of these assets. We conducted this performance audit from December 2009 through December 2010 in accordance with generally accepted government auditing standards. Those standards require that we plan and perform the audit to obtain sufficient, appropriate evidence to provide a reasonable basis for our findings and conclusions based on our audit objectives. We believe that the evidence obtained provides a reasonable basis for our findings and conclusions based on our audit objectives. A more detailed description of our objectives, scope, and methodology is contained in appendix I. Background: NNSA relies on its Stockpile Stewardship Program to ensure the safety, security, and effectiveness of the nuclear weapons stockpile. The Stockpile Stewardship Program is comprised of various elements, including, but not limited to: (1) the Advanced Simulation and Computing (ASC) Campaign, which provides the computational science and simulation tools to understand the behaviors and effects of nuclear weapons; (2) Directed Stockpile Work, which provides evidence of the health of the nuclear weapons stockpile and involves day-to-day maintenance of these weapons, including life extension efforts; (3) the Science Campaign, which provides tools and capabilities geared toward advancing the general understanding of all nuclear weapons systems; and (4) the Engineering Campaign, which provides a sustained basis for stockpile certification and assessments throughout the life cycle of each weapon. 
The coordination among the Stockpile Stewardship elements is instrumental to increasing NNSA's confidence in the performance of nuclear weapons. To help accomplish its Stockpile Stewardship mission, NNSA relies on the three weapons laboratories--Los Alamos, Livermore, and Sandia. Los Alamos and Livermore are the two design laboratories that are responsible for designing the nuclear weapons' explosive package and conducting research to better understand nuclear weapons phenomena. Sandia is an engineering laboratory and has principal responsibility for the research, design, and development of nonnuclear warhead components; integration of these components with Los Alamos and Livermore; and overall warhead systems integration with DOD. In accordance with NNSA, management and operations contractors, who are responsible for day-to-day operations of the laboratories, are required to adhere to agency policies.[Footnote 5]

At the time of our review, NNSA's classified supercomputing resources consisted of 12 classified supercomputing systems. Figure 1 shows the hardware configuration of a supercomputing system.

Figure 1: Common Hardware Components of a Supercomputing System:

[Refer to PDF for image: illustration]

Compute chip:
Compute card:
Node card:
Cabinet:
System:

Source: GAO, data provided by Los Alamos, Livermore, and Sandia.

[End of figure]

NNSA classified supercomputing systems employ a large number of interdependent processors, which are the core unit of a computer that gathers instructions and data. These processors are mounted onto a compute chip, which is the portion of the system that carries out the instructions of a computer program. These compute chips are inserted onto a compute card, which also holds memory for the compute chips to use. A number of compute cards are attached to a node card, which has one or more processors with a common memory and is connected by high-speed interconnection networks.
Each node card is inserted into a single cabinet, and that configuration is repeated many times to build a single supercomputing system. Each supercomputing system has a peak performance, which is the maximum rate of floating-point operations per second (FLOPS) that the system can sustain.[Footnote 6] Currently, almost all NNSA classified supercomputer systems operate at the teraFLOP level, which represents a trillion FLOPS.

According to NNSA, the laboratories have three types of classified supercomputing systems:

Capacity: Small systems that execute parallel problems with more modest computational requirements. These systems serve as the workhorse for the ASC program and are responsible for processing the day-to-day supercomputing workload.

Capability: This type of supercomputer is used to solve the largest and most demanding problems that other computing systems cannot manage.

Advanced architecture: Research and development systems that assist the ASC program in preparing to rapidly deploy and exploit the next generation of supercomputing technology. These systems have a targeted workload and serve as the foundation for the next generation of NNSA supercomputers.

Table 1 shows the classified supercomputing systems currently in use at the three weapons laboratories.

Table 1: Inventory of NNSA-Deployed Classified Supercomputing Systems (as of October 2010):

Site        System name         System type            Delivery date  Total processors  Peak performance (TeraFLOPS)
Los Alamos  Roadrunner Base     Capacity               10/2006        18,432            76.0
Los Alamos  Roadrunner Phase-3  Advanced architecture  9/2008         24,480            1,280.0
Los Alamos  Hurricane           Capacity               9/2008         5,760             51.2
Livermore   BlueGene/L          Advanced architecture  11/2004        131,072           367.0
Livermore   Purple[A]           Capability             6/2005         12,288            93.4
Livermore   Rhea                Capacity               9/2006         4,608             22.1
Livermore   Minos               Capacity               6/2007         6,912             33.2
Livermore   Juno                Capacity               5/2008         18,432            162.2
Livermore   Dawn                Advanced architecture  1/2009         147,456           501.4
Sandia-NM   Red Storm           Advanced architecture  3/2005         31,680            284.0
Sandia-NM   Unity               Capacity               3/2009         4,352             38.0
Sandia-CA   Whitney             Capacity               3/2009         4,352             38.0

Source: GAO summary of data from Los Alamos National Laboratory, Lawrence Livermore National Laboratory, and Sandia National Laboratories.

[A] Although Purple was the capability system in use at the time of our site visits, Livermore retired the system in November 2010.

[End of table]

NNSA's classified supercomputing capabilities consist of supporting resources, including (1) parallel file systems, which store transitory data; (2) network file systems, which store user and project data for a calculation; (3) archival storage systems, which serve as storage for data; and (4) visualization systems, which enable users to better comprehend the results of their computations. NNSA's classified supercomputing systems are connected via its Enterprise Secure Network and the Distance Computing (DISCOM) network, which function as supporting resources for the classified supercomputing environment.
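As an illustration only, and not part of GAO's analysis, the peak-performance figures in Table 1 can be tallied by site with a short Python sketch. The values are copied from Table 1; the names `TABLE_1`, `peak_by_site`, and `teraflops_to_petaflops` are hypothetical. Summed peak performance is a theoretical upper bound and should not be read as the total usable capacity of a laboratory.

```python
from collections import defaultdict

# (site, system name, system type, peak performance in teraFLOPS), copied from Table 1.
TABLE_1 = [
    ("Los Alamos", "Roadrunner Base", "Capacity", 76.0),
    ("Los Alamos", "Roadrunner Phase-3", "Advanced architecture", 1280.0),
    ("Los Alamos", "Hurricane", "Capacity", 51.2),
    ("Livermore", "BlueGene/L", "Advanced architecture", 367.0),
    ("Livermore", "Purple", "Capability", 93.4),  # retired in November 2010 (note [A])
    ("Livermore", "Rhea", "Capacity", 22.1),
    ("Livermore", "Minos", "Capacity", 33.2),
    ("Livermore", "Juno", "Capacity", 162.2),
    ("Livermore", "Dawn", "Advanced architecture", 501.4),
    ("Sandia-NM", "Red Storm", "Advanced architecture", 284.0),
    ("Sandia-NM", "Unity", "Capacity", 38.0),
    ("Sandia-CA", "Whitney", "Capacity", 38.0),
]


def peak_by_site(rows):
    """Sum peak performance (teraFLOPS) per site, rounded to one decimal place."""
    totals = defaultdict(float)
    for site, _name, _system_type, tflops in rows:
        totals[site] += tflops
    return {site: round(total, 1) for site, total in totals.items()}


def teraflops_to_petaflops(tflops):
    """Convert teraFLOPS (10**12 FLOPS) to petaFLOPS (10**15 FLOPS)."""
    return tflops / 1000.0


if __name__ == "__main__":
    for site, total in peak_by_site(TABLE_1).items():
        print(f"{site}: {total} teraFLOPS summed peak")
```

The 12 rows match the 12 classified systems cited in the text. Only Roadrunner Phase-3, at 1,280.0 teraFLOPS, exceeds the petaFLOP level.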
The Enterprise Secure Network provides classified communications across the nuclear weapons complex, including security services and other activities that ensure the flow of NNSA's data sharing and business missions. DISCOM provides secure, high-speed remote access for intra- and inter-site file transfers and enables users, across the three weapons laboratories, to operate on remote computing resources as if they were local. DISCOM and the Enterprise Secure Network serve as the backup networks to each other. Figure 2 shows the composition of NNSA's classified supercomputing network infrastructure.

Figure 2: NNSA's Classified Supercomputing Network Infrastructure:

[Refer to PDF for image: illustration]

The illustration depicts the following connections:

Connected to DISCOM Network (10 Gigabits per second) and to NNSA Enterprise Secure Network (1 Gigabit per second):

Lawrence Livermore National Laboratory:
* BlueGene/L;
* Purple;
* Rhea;
* Minos;
* Juno;
* Dawn.

Los Alamos National Laboratory:
* Road Runner;
* Road Runner Phase 3;
* Hurricane.

Sandia National Laboratories, California:
* Whitney.

Sandia National Laboratories, New Mexico:
* Red Storm;
* Unity.

DISCOM Network and NNSA Enterprise Secure Network are interconnected.

Source: GAO, data provided by Los Alamos, Livermore, and Sandia.

[End of figure]

NNSA reported obligating approximately $1.7 billion from fiscal years 2007 through 2009 to support ASC program activities at the three weapons laboratories.[Footnote 7] The $1.7 billion was predominantly associated with three efforts:

Weapons codes and models. This effort is intended to develop and improve weapons simulation models and codes for predicting the behavior of weapons systems and devices in the nuclear stockpile.

Computational systems and software environment. This effort is intended to provide ASC users a stable, seamless computing environment for ASC-deployed platforms. It is responsible for procuring, delivering, and deploying ASC computational systems and user environments via technology development and integration across the three weapons laboratories.

Facility operations and user support. This effort is intended to provide both the necessary physical facility and operational support for reliable supercomputing and storage environments, as well as a suite of user services for effective use of the three weapons laboratories' computing resources. Facility operations cover physical space, power and other utility infrastructure, and local- and wide-area networking, as well as system administration, cyber security, and operations services for ongoing support. The user support function includes planning, development, integration and deployment, continuing product support, and quality and reliability activity collaborations.

To strengthen the security of information and information systems across the federal government, including those at NNSA's weapons laboratories, the Federal Information Security Management Act of 2002 (FISMA) requires each agency to develop, document, and implement an agencywide information security program that supports the operations and assets of the agency, including those provided or managed by another agency or contractor on its behalf.[Footnote 8] This security program is to include plans and procedures to ensure the continuity of operations for information systems that support the agency's operations.[Footnote 9] Pursuant to its FISMA responsibilities, the National Institute of Standards and Technology (NIST) has issued federal standards and guidelines on information security, such as a contingency planning guide for federal information systems, and recommended security controls, which address contingency and disaster recovery planning and testing.[Footnote 10] To further ensure the security of national security systems, the Committee on National Security Systems (CNSS)[Footnote 11] requires federal
agencies with national security systems to implement a comprehensive set of security controls and enhancements for these systems.[Footnote 12] CNSS requires that each agency implement a contingency and disaster recovery planning capability that ensures the integrity and availability of its national security information and information systems.[Footnote 13] FISMA, NIST guidelines,[Footnote 14] and CNSS policies all call for contingency and disaster recovery planning--also referred to as continuity of operations for information systems--for critical components of information protection. DOE and NNSA policies also regard contingency and disaster recovery plans as being necessary for information protection. If normal operations are interrupted, contingency and disaster recovery plans allow senior agency officials to detect, mitigate, and recover operations. Examples of the key components that make up contingency and disaster recovery planning programs include (1) assessing the criticality and sensitivity of computerized operations and identification of supporting resources such as developing business impact analyses (BIA), (2) taking steps to prevent and minimize potential damage and interruption such as establishing data backup processes, (3) developing comprehensive contingency and disaster recovery plans,[Footnote 15] and (4) conducting periodic testing of contingency and disaster plans. The extent to which controls--such as contingency and disaster recovery planning--are implemented depends on a level of risk assigned to the system or information maintained on the system. NIST standards and guidelines, CNSS instructions, and NNSA policy allow consideration of risk in determining the level of protection of systems and data. These standards and policies require that organizations consider the impact or consequences of loss as it relates to the confidentiality, integrity, and availability of the information, and assign a value of low, moderate, or high impact levels. 
For contingency and disaster recovery planning, consideration of "availability" is the key element. NNSA policy defines the values for the consequences of loss associated with availability as follows:

High: Loss of life might result from loss of availability; information must always be available on request, with no tolerance for delay; loss of availability will have an adverse effect on national-level interests; federal requirement (i.e., requirement for material control and accountability inventory); or loss of availability will have an adverse effect on confidentiality.

Moderate: Information must be readily available with minimum tolerance for delay; bodily injury might result from loss of availability; or loss of availability will have an adverse effect on organizational-level interests.

Low: Information must be available with flexible tolerance for delay.

NNSA Has Not Fully Implemented Contingency and Disaster Recovery Planning and Testing for Its Classified Supercomputing Assets:

Contingency and disaster recovery planning and testing for NNSA's classified supercomputing systems have not been fully implemented at each of the three weapons laboratories--Los Alamos, Sandia, and Livermore. Specifically, NNSA did not ensure that the laboratories (1) developed BIAs to determine the impact of potential service disruptions, (2) fully tested data backup processes, and (3) developed and tested contingency and disaster recovery plans. These shortcomings existed, at least in part, because NNSA's component organizations were unclear about their roles and responsibilities for providing oversight in the laboratories' implementation of contingency and disaster recovery planning. Until the agency fully implements a contingency and disaster recovery planning program for its classified supercomputing assets at the weapons laboratories, it has limited assurance that vital information can be recovered and made available to meet national security priorities and requirements.
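To make the High, Moderate, and Low availability definitions quoted earlier in this section concrete, the following Python sketch encodes them as a simple decision rule. It is a minimal illustration under stated assumptions, not NNSA's actual categorization procedure: the class `AvailabilityAssessment`, its field names, and the function `availability_impact` are hypothetical, and a real determination would rest on documented analysis rather than a set of yes/no flags.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityAssessment:
    """Hypothetical yes/no answers mirroring the criteria in the quoted NNSA definitions."""
    # Criteria listed under "High"
    loss_of_life_possible: bool = False
    no_tolerance_for_delay: bool = False
    affects_national_interests: bool = False
    federal_requirement: bool = False
    affects_confidentiality: bool = False
    # Criteria listed under "Moderate"
    minimum_tolerance_for_delay: bool = False
    bodily_injury_possible: bool = False
    affects_organizational_interests: bool = False


def availability_impact(a: AvailabilityAssessment) -> str:
    """Return the highest impact level for which any criterion is met."""
    high = (
        a.loss_of_life_possible,
        a.no_tolerance_for_delay,
        a.affects_national_interests,
        a.federal_requirement,
        a.affects_confidentiality,
    )
    moderate = (
        a.minimum_tolerance_for_delay,
        a.bodily_injury_possible,
        a.affects_organizational_interests,
    )
    if any(high):
        return "High"
    if any(moderate):
        return "Moderate"
    # No High or Moderate criterion met: flexible tolerance for delay.
    return "Low"


if __name__ == "__main__":
    print(availability_impact(AvailabilityAssessment()))
    print(availability_impact(AvailabilityAssessment(affects_national_interests=True)))
```

Under this rule, a system rates Low only when every High and Moderate criterion is answered "no," so the rating depends entirely on how each criterion is assessed.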
Not All of the Laboratories Assessed the Criticality and Sensitivity of Supercomputer Operations and Resources, or Potential Outage Impact: To assess the criticality and sensitivity of computerized operations and identification of supporting resources, NIST guidelines state that agencies should determine their recovery strategies by performing business impact analyses of their systems. A BIA is an analysis of information technology system requirements, processes, and interdependencies used to characterize system contingency requirements and priorities in the event of a significant disruption. NIST guidelines state that agencies conduct a BIA to identify critical information systems to fully characterize the system's requirements, processes, and interdependencies to determine contingency requirements and priorities. In addition, according to NIST guidelines, the BIA process should follow three main steps: (1) identify critical data and information technology resources, (2) identify outage impacts and allowable outage times,[Footnote 16] and (3) develop recovery priorities and strategies. NNSA policy also requires a BIA to identify systems that provide critical services to site operations and prioritize these systems and their components. One of the laboratories--Los Alamos--had conducted a BIA that addressed its classified supercomputing systems, generally following the three steps of a BIA. However, the BIA was not always specific. For example, the laboratory identified critical information technology resources for each of its classified supercomputing systems, but did not specifically identify the critical data. Instead, Los Alamos noted that the systems are not considered mission critical nor mission essential to the business needs of the laboratory,[Footnote 17] and that the consequence of loss for system availability is low. 
Additionally, it defined a specific number of days for the allowable time frames for fully and partially disabled systems, but did not provide specifics on allowable outage impacts. Further, the analyses indicated high-level recovery priorities, but did not provide specifics regarding the recovery process or strategies that would be used for recovery efforts. The other two laboratories did not conduct BIAs specifically for classified supercomputing systems, but plan to do so. Livermore has a BIA in place for its logical assets--the applications and services that provide basic operational support to the Livermore computing environment--but the BIA did not address any of the classified supercomputing systems. At the time of our site visit, Livermore officials stated they were beginning the process of developing a BIA that would address their information technology needs for their classified supercomputing systems, but the process was still in the planning stage. Similarly, according to Sandia officials, the laboratory has BIAs that address its unclassified information technology systems, but does not currently have one specifically for its classified supercomputing systems. However, Sandia officials indicated that they plan to conduct a BIA for classified supercomputing systems in 2011. Although these two laboratories have not conducted BIAs for their classified supercomputing systems, they, like Los Alamos in its BIA, consider the consequence of loss of availability to be low impact. NNSA also considers the consequence of loss as low impact. In addition, NNSA and the three laboratories do not consider the classified supercomputers to be "mission critical." One laboratory categorized the systems as "mission essential," while another referred to them as "mission support elements, not mission essential elements." However, NNSA's mission includes maintaining the safety, security, and effectiveness of the nuclear deterrent without nuclear testing.
The supercomputers provide a necessary means to determine the effects of changes to current weapons systems and to determine a level of confidence in the performance of future untested systems. The classified supercomputing capabilities serve as the computational surrogate to nuclear weapons testing and are central to national security. Regarding recovery priorities and strategies, each of the laboratories indicated that it would likely rely on a process that is currently being used for the capability system shared among the laboratories. The laboratories generally rely on the Capability Computing Campaign to prioritize the workload and develop priorities for jobs that need to be run on the capability system.[Footnote 18] In the event of a service disruption or emergency, laboratory officials told us that they would likely rely on the same process for all of their systems. However, this process has not been documented as a means for establishing overall recovery priorities across the laboratories. Until all of the laboratories have a BIA in place for their classified supercomputing systems that (1) identifies and categorizes critical data, (2) identifies acceptable allowable outage impacts and time frames, and (3) establishes emergency processing priorities and strategies, the potential impact of a system outage will remain unclear.

The Laboratories Have Backup Processes in Place, but One Storage Site May Be Susceptible to Damage:

Data backup processes offer a means of taking steps to prevent and minimize potential damage and interruption to computerized services. NIST guidelines, as well as CNSS instructions and NNSA policies, call for agencies to conduct backups of user-level information, system-level information, and information system documentation. In addition, NIST, CNSS, and NNSA all provide that agencies establish an alternate storage site that is separated from the primary storage site so that both are not susceptible to the same hazards.
To ensure the availability of data stored in the alternate storage site, NIST and CNSS require that agencies test the backup information to verify the integrity of the data. All of the laboratories had backup processes in place. Each of the laboratories follows similar data backup processing--both manual and automated procedures--to back up user-level information, system-level information, and information system documentation. For example, this information can include global directories, user home directories, project directories, desktop systems, and critical systems documentation. Backups occur in increments: daily incremental backups to disk, weekly full backups to tape, and monthly full-system backups to tape (with a 6-month on-site storage retention policy). The laboratories also have vendor-provided software that takes periodic snapshots of user directories for storage retention purposes. The snapshot process can be performed manually or can be set up for automatic processing. Users are encouraged to maintain their data in a shared environment on the network and are allowed to make their own determinations regarding what data should be backed up from the classified supercomputing systems. Not all of the laboratories have an alternate storage site sufficiently separated from the primary site to not be susceptible to the same hazards. Two of the three laboratories have alternate storage sites a considerable distance from their primary storage site. Livermore sends its system backups electronically to Los Alamos every 6 months. Sandia sends its backup data to its alternate site locations (e.g., the California site sends its data to the New Mexico site and the New Mexico site sends its data to the California site). However, Los Alamos maintains its alternate storage facility on-site in a building located less than 1 mile away from the primary local backup storage facility. Consequently, both sites could be susceptible to the same hazards, such as a wildfire.
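The backup increments described above (daily incremental to disk, weekly full to tape, monthly full-system to tape) could be expressed as a small scheduling rule. The following is a minimal sketch, not the laboratories' actual procedure. The report does not say on which days the weekly and monthly backups run, so the choice of Sunday and the first day of the month, and the function name, are assumptions made only for illustration.

```python
from datetime import date

# Assumptions (not stated in the report): the monthly full-system backup runs
# on the 1st of the month and the weekly full backup runs on Sundays.
MONTHLY_DAY = 1
WEEKLY_WEEKDAY = 6  # date.weekday(): Monday is 0, Sunday is 6


def backup_for(day: date) -> str:
    """Return the most comprehensive backup type scheduled for the given day."""
    if day.day == MONTHLY_DAY:
        return "monthly full-system backup to tape"
    if day.weekday() == WEEKLY_WEEKDAY:
        return "weekly full backup to tape"
    return "daily incremental backup to disk"
```

For example, under these assumed days, `backup_for(date(2010, 9, 6))` (a Monday) falls through to the daily incremental backup.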
The laboratories had processes in place to verify the integrity of the backed up data. However, tests of their backup procedures rely predominantly on ad hoc recovery, rather than periodically planned tests. Los Alamos officials indicated that thousands of file recoveries have been performed over the years by end users as part of their testing. Livermore officials stated that the laboratory tests its local backup procedures through actual system usage on almost a daily basis, and tests their remote backup procedures at least once annually. Further, Sandia officials told us they had successfully tested a sample of data at their offsite facility. Not All Laboratories Had Developed and Tested Contingency and Disaster Recovery Plans: NIST guidelines and CNSS policies call for the development and testing of contingency plans and the development of disaster recovery plans for each information system to ensure that, in the event of a service disruption, the work and supporting functions of the agency can continue to be performed. According to NIST guidelines, at a minimum, the contingency plan should address the identification and notification of key personnel, plan activation, system recovery, and system reconstitution to meet the needs of the agency's critical supporting operations. The guidelines also state that the plan should be tested periodically; CNSS specifies that the frequency of testing should be annually. NIST also notes that the disaster recovery plan should be designed to restore operability of the targeted system, application, or computer facility at an alternate site after a major service disruption. DOE and NNSA policies also require the development of contingency and disaster recovery plans and the testing of these plans in line with NIST and CNSS. 
Each of the laboratories had developed contingency plans for their classified supercomputing systems; however, the plans were not always comprehensive, and at the time of our site visits, these plans had not been tested. The laboratories addressed disaster recovery planning to a limited extent; none specifically addressed the supercomputing environment. For example, * Two laboratories--Los Alamos and Sandia--had contingency plans that addressed the classified supercomputing systems. Although Livermore had an information technology contingency plan and a master security plan, neither specifically addressed the supercomputers. In addition, the plans for both Los Alamos and Sandia included key components such as the identification and notification of key personnel, plan activation, system recovery, and system reconstitution procedures; however, the sufficiency of the level of detail varied. For instance, one plan provided specific details regarding system recovery processes and the notification and identification of key personnel, but provided limited details regarding plan activation and system reconstitution procedures. * At the time of our site visits, none of the laboratories had tested their contingency plans, which were less than a year old. One of the laboratories--Los Alamos--had created testing guides but had not yet conducted formal testing. Subsequent to our site visit, Los Alamos officials indicated that the first test of their plan took place in September 2010 and noted that the results would be finalized in December. Additionally, although Sandia had a contingency plan in place, the plan states that testing is not required because, in the event of a service disruption, the laboratory would either wait until the equipment was fully operational or simply acquire new equipment. This is contrary to NIST guidelines, CNSS instructions, and DOE and NNSA policies. * Each of the laboratories had addressed disaster recovery planning to a limited extent. 
For example, - Los Alamos included disaster recovery planning as a section within their classified supercomputing system's contingency plans. Although it provided high-level instructions such as directing individuals to call 911 for all emergencies, it did not include information regarding the specifics for restoring operability of the classified supercomputing system at an alternate site after a major service disruption. - None of the plans submitted by Livermore specifically addressed the supercomputing environment, although in the disaster recovery section of the master security plan, the laboratory noted that it had no mission-essential systems in the computing environment, and systems may be offline for an extended period for system upgrades. - Sandia also had disaster recovery plans in place, designed for emergency preparedness and disease response planning needs for the laboratory. These plans focused on emergencies involving the facilities, operations, and activities for the laboratory, and provided individuals with emergency information should pandemics plague the laboratory. However, these plans did not include any information regarding the classified supercomputing systems. Unless each of the laboratories develops and sufficiently tests comprehensive contingency and disaster recovery plans in accordance with applicable policies and guidance for their classified supercomputing systems, they face a risk of not being able to successfully recover their supercomputing assets and operations after a service disruption. NNSA Component Organizations Were Unclear of Their Roles and Responsibilities for Providing Oversight: The aforementioned shortcomings existed, at least in part, because NNSA's component organizations were unclear of their roles and responsibilities for providing oversight in the laboratories' implementation of contingency and disaster recovery planning. 
FISMA requires that the chief information officer, in coordination with other senior agency officials, manage the development and implementation of an agencywide information security program that includes plans and procedures to ensure continuity of operations for information systems that support the operations and assets of the agency. NIST guidelines and DOE policies call for individuals with information system or security management and oversight responsibilities to take responsibility for the development, implementation, assessment, monitoring, reviewing, and updating of security planning policies and procedures, which includes contingency and disaster recovery plans. Further, the NNSA Safety Management Function and Responsibilities and Authorities Manual states that the chief information officer is responsible for information technology programs and initiatives and for ensuring the security of the agency's information and systems. Although roles and responsibilities are defined at a high level in FISMA, NIST guidelines, as well as DOE and NNSA policies, NNSA component organizations were confused about their roles in providing oversight of the laboratories' implementation of contingency and disaster recovery planning for the supercomputing systems. For example, at the beginning of our review, ASC officials told us that, although they were responsible for administering and managing the program that uses the classified supercomputing systems, they were not responsible for contingency and disaster recovery planning. Instead, they directed us to the Office of the Chief Information Officer (OCIO), where officials told us that they were not responsible for contingency and disaster recovery planning for these systems, and noted that they would only provide guidance if requested by ASC. Further, OCIO officials told us that ASC has not requested any assistance. 
ASC officials subsequently acknowledged that they had responsibility for contingency and disaster recovery planning; however, this organizational responsibility is contrary to NIST guidelines and DOE policies, as well as NNSA's own manual, which gives this responsibility to the OCIO. In the absence of effective oversight, the laboratories did not consistently comply with, or fully implement, federal requirements and guidance related to contingency planning and disaster recovery. Until NNSA clearly establishes and carries out defined roles and responsibilities for OCIO and ASC pertaining to contingency and disaster recovery planning for the classified supercomputing environment, it will not be able to effectively manage and oversee the recovery of its supercomputing operations should service disruptions occur. The Laboratories Have the Ability to Share Supercomputing Capacity, but Barriers Exist: Technologically, the weapons laboratories have demonstrated the ability to share classified supercomputing capacity using their capacity and capability systems under normal operating conditions. Although these supercomputers process unique workloads and operate independently, they are designed with a similar operating system, resource manager, and job scheduler, which is built on a LINUX foundation. These supercomputers also include application codes that are portable across supercomputing systems and a data network, which allows authorized users local and remote access to the systems. Although the weapons laboratories have the ability to share supercomputing capacity, barriers exist. One barrier to sharing supercomputing capacity is that the weapons laboratories do not know the minimum supercomputing capacity needed to achieve processing priorities in the event of a service disruption. 
NIST guidelines recommend, and NNSA policy requires, that capacity planning be conducted so that there is adequate capacity for information processing and supporting resources during contingency operations. Although the weapons laboratories have identified the supercomputing processing needed for normal business operations, they have not identified the minimum supercomputing capacity needed to achieve processing priorities in the event of a service disruption. Another barrier to sharing supercomputing processing is the disparity in usable supercomputing processing across the laboratories. Figure 3 depicts this disparity by identifying the amount of total usable supercomputing capacity, in teraFLOPS, for each of the three weapons laboratories for 2010 and 2011.[Footnote 19] Figure 3: Total Usable Supercomputing Capacity at Each Weapons Laboratory, 2010 and 2011: [Refer to PDF for image: vertical bar graph] TeraFLOPS: Los Alamos; Total usable capacity: 2010: 127; Total usable capacity: 2011: 1,427. TeraFLOPS: Livermore; Total usable capacity: 2010: 311; Total usable capacity: 2011: 218. TeraFLOPS: Sandia; Total usable capacity: 2010: 76; Total usable capacity: 2011: 76. Source: GAO analysis of supercomputing capacity data provided by Los Alamos, Livermore, and Sandia. [End of figure] For example, in 2010, total usable capacity at Livermore has been 311 teraFLOPS, whereas Los Alamos and Sandia have had 127 teraFLOPS and 76 teraFLOPS, respectively. Should Livermore experience service disruptions for a sustained amount of time, neither Los Alamos nor Sandia possesses the necessary usable supercomputing capacity to accommodate the additional workload and NNSA will have to reprioritize the computational workloads across the other two laboratories. As previously noted, officials at the laboratories told us that, should disruptions occur, they would use the Capability Computing Campaign model for re-prioritizing the workload. 
However, this process has not been documented for recovery activities. Further limiting the ability of the weapons laboratories to recover from a service disruption, in 2011, there will be a significant disparity in projected usable supercomputing capacity. For example, in 2011, Los Alamos' usable capacity is projected to be 1,427 teraFLOPS, whereas usable capacity at Livermore and Sandia is to be 218 teraFLOPS and 76 teraFLOPS, respectively. Should Los Alamos' supercomputing systems become unavailable for an extended period of time, neither Livermore nor Sandia possesses sufficient usable supercomputing capacity to achieve its workload and accommodate the additional potential computational workload from Los Alamos. According to laboratory officials, an additional supercomputer will be deployed at Los Alamos in 2011. This supercomputer is a replacement for the single capability supercomputer currently at Livermore that was retired in 2010. Therefore, a significant amount of usable supercomputing capacity will be centralized at Los Alamos. Because the weapons laboratories have not determined the minimum supercomputing capacity requirements for their emergency processing priorities, they may not be able to meet the minimum computational workload required to meet Stockpile Stewardship milestones. Another barrier to sharing supercomputing capacity across the weapons laboratories is that the capability to share usable capacity on an "on-demand"[Footnote 20] basis has not been fully tested in a recovery scenario. According to officials at the laboratories, during normal operating conditions, simulation programs have run on other supercomputing systems. However, consideration has not been given to include and test these abilities in a disaster recovery scenario should a service disruption occur. As a result, NNSA has limited assurance that its disaster recovery approach would work effectively should a service disruption occur.
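The capacity disparity discussed above can be checked with simple arithmetic on the Figure 3 values. The sketch below is illustrative only. It makes two simplifying assumptions that are not in the report: that a laboratory's displaced workload is proportional to its usable capacity, and that the other two laboratories could devote all of their capacity to it. The second assumption is optimistic, since those laboratories also have their own workloads. The function names are hypothetical.

```python
# Total usable capacity in teraFLOPS, from Figure 3 of the report.
USABLE_TFLOPS = {
    2010: {"Los Alamos": 127, "Livermore": 311, "Sandia": 76},
    2011: {"Los Alamos": 1427, "Livermore": 218, "Sandia": 76},
}


def remaining_capacity(year: int, failed_lab: str) -> int:
    """Combined usable capacity of the laboratories other than failed_lab."""
    caps = USABLE_TFLOPS[year]
    return sum(tflops for lab, tflops in caps.items() if lab != failed_lab)


def shortfall(year: int, failed_lab: str) -> int:
    """TeraFLOPS by which the failed laboratory's capacity exceeds the
    combined capacity of the other two (0 if it does not exceed it)."""
    lost = USABLE_TFLOPS[year][failed_lab]
    return max(0, lost - remaining_capacity(year, failed_lab))


if __name__ == "__main__":
    for year, lab in ((2010, "Livermore"), (2011, "Los Alamos")):
        print(year, lab, remaining_capacity(year, lab), shortfall(year, lab))
```

Even under this optimistic assumption, the combined capacity of the other two laboratories is 203 teraFLOPS against Livermore's 311 in 2010, and 294 teraFLOPS against Los Alamos' projected 1,427 in 2011.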
NNSA Does Not Track the Costs for Ensuring Contingency and Disaster Recovery Planning for Its Supercomputing Assets:

Although NNSA reported obligating approximately $1.7 billion from fiscal years 2007 through 2009 to implement its ASC program activities at the three weapons laboratories, the costs for ensuring the recovery of its classified supercomputing operations are unknown. Under GAO's Standards for Internal Control in the Federal Government, financial information should be recorded and communicated to program managers who need this information to make operational decisions and to effectively allocate resources for program activities. NNSA officials reported obligating approximately $390 million for facility operations and user support activities, which include the funds associated with contingency and disaster recovery planning activities, but they were unable to provide detailed financial information for contingency and disaster recovery planning activities. According to NNSA officials, costs for contingency and disaster recovery planning for classified supercomputing systems are unknown because ASC program expenditures are part of the NNSA ASC operational budget, whose costs are tracked at an aggregate level. As a result, neither NNSA nor the three weapons laboratories can track what has been spent since fiscal year 2007 for ensuring the recovery of classified supercomputing operations and, consequently, they do not know whether funding levels for these activities have been adequate. Although certain components of contingency and disaster recovery planning are in place at the three weapons laboratories, NNSA is uncertain as to what funds were spent on these information protection activities. Furthermore, NNSA has not developed contingency planning and disaster recovery cost estimates for its classified supercomputing assets.
For fiscal years 2011 through 2014, NNSA projects that it needs about $2.2 billion to implement its ASC activities, which support its Stockpile Stewardship program--$984 million for weapons codes and models, $604 million for computational systems and software environment, and $588 million for facility operations and user support. Although NNSA has developed its out-year funding needs over the next 4 years, it has not developed estimates regarding the future costs for ensuring the recovery of its classified supercomputing assets in the event of service disruptions. Until NNSA develops a means for tracking current contingency and disaster recovery costs and for developing estimates of future costs, the agency will not have the information needed to determine whether it is meeting its goals for effective stewardship of public resources. Conclusions: All three NNSA weapons laboratories have implemented some components of a contingency planning and disaster recovery program. NNSA, however, has not provided effective oversight to ensure that the laboratories have comprehensive and effective contingency and disaster recovery planning and testing. For example, BIAs that identify critical resources and outage impacts have not been developed for all classified supercomputing systems and existing contingency plans at the laboratories have not been thoroughly tested. Although one laboratory's analysis is not comprehensive and the other two laboratories have not completed a BIA, NNSA and the laboratories consider the consequence of loss of availability of the classified supercomputers as a low-risk impact, and do not consider them to be mission critical. 
However, it is unclear how NNSA made this determination given that (1) the analyses have not been completed; (2) NNSA's mission includes maintaining the safety, security, and effectiveness of the nuclear deterrent without nuclear testing; (3) the classified supercomputing capabilities serve as the computational surrogate to underground nuclear weapons testing and are central to our national security; and (4) NNSA has obligated about $1.7 billion over 3 fiscal years to support the Advanced Simulation and Computing program, which includes classified supercomputing activities. Beyond the activities undertaken by the laboratories, NNSA has not developed a means for identifying, tracking, or re-prioritizing the classified supercomputing workload across the operating environment. In addition, the laboratories have not tested offsite recovery capabilities and the agency has not tested the laboratories' ability to share "on-demand" capacity if needed or determined the minimum capacity needed to meet Stockpile Stewardship Program requirements, particularly in the event that it may need to establish emergency processing priorities. Further, although over a billion dollars have been obligated to support the classified supercomputing capabilities within the last 3 years, NNSA has not tracked the costs for ensuring the recovery of the classified supercomputing systems, data, and supporting resources should a service disruption occur. The classified supercomputing program represents a significant investment, and accountability for these systems is essential. Until NNSA clearly defines its component organizations' roles and responsibilities and fully implements an effective contingency and disaster planning program, it has limited assurance that, in the event of a service disruption, vital information could be recovered and made available to meet national security priorities. 
Recommendations for Executive Action: To improve the effectiveness of contingency and disaster recovery planning for NNSA's classified supercomputing capabilities, we recommend that the Administrator of NNSA direct the weapons laboratories to take the following four actions, where not already implemented: * Develop business impact analyses that, among other things, (1) identify and prioritize critical systems, data, and supporting resources; (2) identify allowable outage times and impacts for classified supercomputing capabilities; and (3) identify recovery priorities and strategies. * Develop and implement comprehensive contingency and disaster recovery plans for all classified supercomputing systems that identify how each weapons laboratory's classified supercomputing capabilities will be recovered following service disruptions. * Conduct contingency and disaster recovery plan testing. * Test the three weapons laboratories' ability to share "on-demand" classified supercomputing capacity to ensure this capability will work in the event of unexpected service disruptions. In addition, we recommend that the Administrator of NNSA take the following five actions: * Document an agencywide means for reprioritizing the workload across NNSA's classified supercomputing systems should a disruption occur. * Clearly define the oversight responsibilities of the NNSA ASC program office and the NNSA Office of the Chief Information Officer, as they relate to contingency and disaster recovery planning for NNSA's classified supercomputing operations. * Identify, assess, and communicate the minimum classified supercomputing capacity needed to meet Stockpile Stewardship requirements in the event of a service disruption. * Develop, document, and implement a process that identifies and tracks expenditures for contingency and disaster recovery planning for NNSA's classified supercomputing assets. 
* Develop and document the total anticipated costs for contingency and disaster recovery planning of NNSA's classified supercomputing assets, which includes the replacement costs for these assets.

Agency Comments and Our Evaluation:

In providing written comments (reprinted in appendix III) on a draft of this report, NNSA's Associate Administrator for Management and Administration agreed that improvements can be made in contingency and disaster recovery planning for supercomputing operations. He indicated that NNSA agreed with six of our nine recommendations and outlined the agency's intent to conduct business impact analyses, develop and test appropriate contingency and disaster recovery plans, document workload prioritization, and clearly define roles and responsibilities. However, NNSA did not agree with our recommendation related to identifying the minimum capacity needed to meet Stockpile Stewardship requirements in the event of a service disruption. The Associate Administrator stated that this recommendation did not take into account that the different types of supercomputers--capacity and capability--serve different functions and are procured and managed differently. In our report, we recognize that different types of supercomputers exist and are used for different purposes, and that they process unique workloads and operate independently. However, as we point out in the report, although the weapons laboratories have identified supercomputing processing needed for normal business operations, they have not identified the minimum capacity needed to achieve processing priorities in the event of a service disruption. We believe that the recommendation appropriately focuses on meeting NNSA's Stockpile Stewardship mission and that capacity planning is essential to ensure that information processing and supporting resources exist during contingency operations, regardless of the type of system used.
Although NNSA did not agree with the recommendation, the Associate Administrator stated that the agency will conduct a BIA and build appropriate contingency strategies for both types of supercomputers, as well as enhance capacity sizing actions to account for contingency and disaster recovery operations. Further, NNSA did not agree with two recommendations related to identifying and tracking expenditures for contingency and disaster recovery planning and documenting anticipated recovery planning costs, including replacement costs of the assets. The Associate Administrator asserted that this information would not add significant value to managing contingency and disaster recovery planning. However, we believe such actions reflect good government practices and would add value by providing NNSA program managers with useful expenditure and cost information to aid decision making with regard to contingency and disaster recovery planning. As our report points out, GAO's Standards for Internal Control in the Federal Government states that financial information should be recorded and communicated to program managers to help them make operational decisions and effectively allocate resources for program activities. Strong financial and internal controls are a major part of managing any organization because they help government program managers achieve desired results through effective stewardship of public resources. Accordingly, we believe our recommendations have merit. We are sending copies of this report to the Secretary of Energy; the Administrator of NNSA; and the Directors of Los Alamos, Livermore, and Sandia laboratories. Copies of the report will also be available to others at no charge on the GAO Web site at [hyperlink, http://www.gao.gov]. If you or your staffs have any questions about this report, please contact Gene Aloise at (202) 512-6870, or aloisee@gao.gov; Nabajyoti Barkakati at (202) 512-6415 or barkakatin@gao.gov; or Gregory C.
Wilshusen at (202) 512-6244 or wilshuseng@gao.gov. Contact points for our Offices of Congressional Relations and Public Affairs may be found on the last page of this report. GAO staff who made major contributions to this report are included in appendix IV. Signed by: Gene Aloise: Director, Natural Resources and Environment: Signed by: Nabajyoti Barkakati: Director, Center for Technology and Engineering: Signed by: Gregory C. Wilshusen: Director, Information Security Issues: [End of section] Appendix I: Objectives, Scope, and Methodology: The objectives of our review were to assess the extent to which (1) the National Nuclear Security Administration (NNSA) has implemented contingency and disaster recovery planning and testing for its classified supercomputing assets, (2) the three laboratories are able to share classified supercomputing capacity for recovery operations, should service disruptions occur, and (3) NNSA tracks the costs for ensuring contingency and disaster recovery planning for its classified supercomputing assets. To address these objectives, we focused on contingency and disaster recovery planning activities at NNSA headquarters, as well as the operating environment for the 12 classified supercomputing systems at the three weapons laboratories--Los Alamos National Laboratory, Livermore National Laboratory, and Sandia National Laboratories. To assess the extent to which NNSA has implemented contingency and disaster recovery planning and testing for its classified supercomputing assets, we examined contingency and disaster recovery planning controls for the systems within the classified supercomputing environment that are critical to NNSA's achievement of its nuclear weapons mission. We collected and reviewed policies, procedures, and guidelines from the National Institute of Standards and Technology, the Committee on National Security Systems, the Department of Energy, and NNSA. 
We also reviewed contingency plans and business impact analyses provided by the weapons laboratories and compared them to federal guidelines. We interviewed NNSA and laboratory officials to determine whether they had documented critical system, data, and supporting resources and whether contingency plans had been tested. Further, we interviewed NNSA officials to determine to what extent they have provided specific guidance and oversight for the laboratories to ensure that contingency and disaster recovery planning requirements are being met. To determine the extent to which the three weapons laboratories have the ability to share supercomputing capacity for backup and recovery operations, we visited each weapons laboratory and gained an understanding of the overall classified supercomputing infrastructure and identified interconnectivity and control points. We performed technical assessments of supercomputing capabilities at each weapons laboratory, including each laboratory's ability to share supercomputing capacity under normal operating conditions. We reviewed the weapons laboratories' efforts to determine the minimal supercomputing capacity needed to meet NNSA Stockpile Stewardship Program requirements along with the ability of the weapons laboratories to share supercomputing capacity on an "on-demand" basis, including the use of advanced architecture systems. In addition, we obtained documents describing the supercomputing system environment as well as capacity information, along with the views of officials from NNSA and the three weapons laboratories. To assess the extent to which NNSA tracks costs for ensuring contingency and disaster recovery planning for classified supercomputing assets, we interviewed NNSA and weapons laboratory officials to determine how expenditures were tracked for contingency and disaster recovery planning of the classified supercomputing systems at each of the laboratories. 
We also requested the amount of funds NNSA obligated to the three weapons laboratories, and the amount of funds the laboratories spent, in implementing NNSA's classified supercomputing capabilities from fiscal years 2007 through 2009. Further, we interviewed NNSA and laboratory officials to determine how they projected future cost estimates for ensuring the recovery of these assets for fiscal years 2011 through 2014. To assess the reliability of data provided, we reviewed (1) the fiscal year 2009 financial statement audit for the system and (2) responses NNSA provided to questions about processes and procedures for ensuring the accuracy and completeness of data. Based on this information, we determined the data are sufficiently reliable for the purposes of this report. We conducted this performance audit from December 2009 through December 2010 in accordance with generally accepted government auditing standards. Those standards require that we plan and perform the audit to obtain sufficient, appropriate evidence to provide a reasonable basis for our findings and conclusions based on our audit objectives. We believe that the evidence obtained provides a reasonable basis for our findings and conclusions based on our audit objectives. [End of section] Appendix II: NNSA Annual Obligations for Its Advanced Simulation and Computing Program, Fiscal Years 2007 through 2009: NNSA reported obligating approximately $1.7 billion from fiscal years 2007 through 2009 to support Advanced Simulation and Computing (ASC) program activities at the three weapons laboratories. The $1.7 billion was used mainly for three efforts: Weapons codes and models. This effort is intended to develop and improve weapons simulation codes and models for predicting the behavior of weapons systems and devices in the nuclear stockpile. Computational systems and software environment. This effort is intended to provide ASC users a stable, seamless computing environment for ASC-deployed platforms. 
It is responsible for procuring, delivering, and deploying ASC computational systems and user environments via technology development and integration across the three weapons laboratories. Facility operations and user support. This effort is intended to provide both the necessary physical facility and operational support for reliable supercomputing and storage environments, as well as a suite of user services for effective use of the three weapons laboratories' computing resources. Facility operations cover physical space, power and other utility infrastructure, and local- and wide-area networking, as well as system administration, cyber security, and operations services for ongoing support. The user support function includes planning, development, integration and deployment, continuing product support, and quality and reliability activity collaborations. Figure 4 depicts NNSA's annual obligations for each of the three efforts from fiscal years 2007 through 2009.

Figure 4: Annual Obligations for NNSA's Advanced Simulation and Computing Program, Fiscal Years 2007 through 2009:

[Refer to PDF for image: stacked vertical bar graph]

Fiscal year: 2007; Weapons codes and models: $280 million; Computational systems and software environment: $183 million; Facility operations and user support: $128 million; Total: $591 million.
Fiscal year: 2008; Weapons codes and models: $249 million; Computational systems and software environment: $183 million; Facility operations and user support: $114 million; Total: $546 million.
Fiscal year: 2009; Weapons codes and models: $221 million; Computational systems and software environment: $161 million; Facility operations and user support: $147 million; Total: $529 million.

Source: GAO analysis of data provided by NNSA.

[End of figure]

As shown in figure 4, NNSA annual obligations for its classified supercomputing operations decreased from about $591 million to $529 million between fiscal years 2007 and 2009. 
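The figure 4 values can be cross-checked with a short calculation. The sketch below is illustrative only: the dictionary layout and function names are assumptions made for this example, and the inputs are the rounded dollar amounts (in millions) as they appear in figure 4, not NNSA's underlying data.

```python
# Obligations in millions of dollars, transcribed from figure 4.
# The data layout and function names are hypothetical, chosen for illustration.
OBLIGATIONS = {
    2007: {
        "weapons codes and models": 280,
        "computational systems and software environment": 183,
        "facility operations and user support": 128,
    },
    2008: {
        "weapons codes and models": 249,
        "computational systems and software environment": 183,
        "facility operations and user support": 114,
    },
    2009: {
        "weapons codes and models": 221,
        "computational systems and software environment": 161,
        "facility operations and user support": 147,
    },
}


def totals_by_year(data):
    """Sum the three efforts for each fiscal year."""
    return {year: sum(efforts.values()) for year, efforts in data.items()}


def totals_by_effort(data):
    """Sum each effort across all fiscal years."""
    totals = {}
    for efforts in data.values():
        for name, amount in efforts.items():
            totals[name] = totals.get(name, 0) + amount
    return totals


def percent_shares(effort_totals):
    """Express each effort's total as a whole-number percentage of the grand total."""
    grand_total = sum(effort_totals.values())
    return {
        name: round(100 * amount / grand_total)
        for name, amount in effort_totals.items()
    }


by_year = totals_by_year(OBLIGATIONS)
by_effort = totals_by_effort(OBLIGATIONS)
shares = percent_shares(by_effort)

for year, total in sorted(by_year.items()):
    print(f"FY{year}: ${total} million")
for name, total in by_effort.items():
    print(f"{name}: ${total} million ({shares[name]} percent)")
```

Run against the figure 4 values, the yearly sums reproduce the reported totals of $591 million, $546 million, and $529 million, and the three-year sum of $1,666 million is consistent with the approximately $1.7 billion cited above. The facility operations and user support amounts in the figure add up to $389 million, slightly below the approximately $390 million cited in this appendix; the difference is presumably due to rounding in the figure.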
The largest obligation for the classified supercomputing program was for weapons codes and models, which accounted for approximately $750 million (or 45 percent) of total obligations. Obligations for computational systems and software environment accounted for approximately $527 million (or 32 percent) of total obligations. For the period, obligations for this effort decreased from $183 million to $161 million. The facility operations and user support activities, which include, among other things, expenditures for contingency and disaster recovery planning, accounted for $390 million (or 23 percent) of total obligations over the period. These obligations ranged from $114 million to $147 million for the 3 fiscal years. [End of section] Appendix III: Comments from the National Nuclear Security Administration: Department of Energy: National Nuclear Security Administration: Washington, DC 20585: November 23, 2010: Mr. Gregory C. Wilshusen: Director: Information Security Issues: U.S. Government Accountability Office: 441 G Street, NW: Washington, DC 20548: Dear Mr. Wilshusen: The National Nuclear Security Administration (NNSA) appreciates the opportunity to review the Government Accountability Office's (GAO) draft report, GAO-11-67, Information Security: National Nuclear Security Administration Needs to Improve Contingency Planning for Its Classified Supercomputing Operations. I understand the House Committee on Energy and Commerce requested GAO to assess various aspects of NNSA's Continuity of Operations Program to ensure that, in case of service disruptions, the three weapons laboratories can maintain the computer simulation capabilities needed to meet nuclear weapons assessment and certification requirements. 
Specifically, GAO identified, (1) whether NNSA has Continuity of Operations planning and testing procedures in place across the classified supercomputing environment of the three weapons laboratories; (2) whether the weapons laboratories are able to share capacity for backup and recovery operations; and (3) the past, present and future resources needed to maintain supercomputing capabilities. We are pleased that GAO recognizes the importance of the simulation capabilities of NNSA's supercomputers to address stockpile stewardship and other national security matters. While the draft report implies, without explicitly stating, that the timeframe for reconstitution of supercomputing assets should be similar to that required for Continuity of Operations for national command and control, major financial systems, and health and emergency services, that time urgency is not consistent with existing policies for the recovery of research and development capabilities, nor should it be. Nevertheless, we agree that improvements can be made in contingency and disaster recovery planning for supercomputing operations. After careful review, additional time is required to establish further plans and schedule solutions for addressing the GAO's recommendations. NNSA will provide a more detailed response to the recommendations when the final report is issued. However, we are providing a summary of responses to the recommendations presented in the draft report. Recommendation 1: Develop business impact analyses that, among other things, (1) identify and prioritize critical systems, data, and supporting resources, (2) identify allowable outage times and impacts for classified supercomputing capabilities, and (3) identify recovery priorities and strategies. 
Concur: NNSA will leverage current Business Impact Analysis (BIA) activities underway at Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratories and perform a national level BIA to provide a consistent assessment across the laboratories for classified supercomputing. Recommendation 2: Develop and implement comprehensive contingency and disaster recovery plans for all classified supercomputing systems that identify how each weapons laboratory's classified supercomputing capabilities will be recovered following service disruptions. Concur: NNSA will develop appropriate plans based on the assessment results of the BIA performed per the GAO's first recommendation. Recommendation 3: Conduct contingency plan testing. Concur: NNSA will conduct contingency plan testing according to contingency and disaster recovery plans to be implemented based on the assessment results of the BIA performed per the GAO's first recommendation. Recommendation 4: Test the three weapons laboratories' ability to share classified supercomputing capacity to ensure this capability will work in the event of unexpected service disruptions. Concur: NNSA will test the three weapons laboratories' ability to share classified capacity supercomputers according to contingency and disaster recovery plans to be implemented based on the assessment results of the BIA. Recommendation 5: Document an agency-wide means for reprioritizing the workload across NNSA's classified supercomputing systems should a disruption occur. Concur: The NNSA will adapt, apply and exercise procedures that are routinely being used for prioritizing workload in capability computing campaigns for use in contingencies and disasters. Recommendation 6: Clearly define the oversight responsibilities of the NNSA ASC program office and the NNSA Office of the Chief Information Officer, as they relate to contingency and disaster recovery planning for NNSA's classified supercomputing operations. 
Concur: In general, the NNSA Office of the Chief Information Officer provides policy and guidance and the Office of Advanced Simulation and Computing (ASC) has responsibilities for execution. Oversight responsibilities will be clearly defined through the BIA and development and implementation of the contingency and disaster recovery plans. Recommendation 7: Identify, assess, and communicate the minimum classified supercomputing capacity needed to meet Stockpile Stewardship requirements in the event of a service disruption. Nonconcur: This recommendation does not take into account that capacity and capability systems serve different functions under different cost of ownership models, and consequently are procured and managed differently. The ASC program has deployed its supercomputing assets to mitigate single site failures. For the future, the program will enhance capacity sizing actions to account for contingency and disaster recovery operations when planning host sites for capacity computing capabilities. We will conduct a BIA assessment recognizing the differences and build appropriate contingency strategies for both classes of computing. Recommendation 8: Develop, document, and implement a process that identifies and tracks expenditures for contingency and disaster recovery planning for NNSA's classified supercomputing assets. Nonconcur: Almost all classified supercomputing contingency and disaster recovery planning leverages computing resources and activities funded as part of a production simulation environment for weapons designers and engineers. These expenses are integral to ASC's Facility Operations and User Support Program element and tracking them separately would not add significant value to managing contingency and disaster recovery. Recommendation 9: Develop and document the total anticipated costs for contingency and disaster recovery planning of NNSA's classified supercomputing assets, which includes the replacement costs for these assets. 
Nonconcur: As stated in the response to Recommendation 8, these expenses are integral to ASC's Facility Operations and User Support program element and tracking them separately would not add significant value to managing contingency and disaster recovery. If you have any questions related to this response, please contact JoAnne Parker, Director, Office of Internal Controls, at 202-586-1913. Sincerely, Signed by: Gerald L. Talbot, Jr. Associate Administrator for Management and Administration: cc: Acting Chief Information Officer: Deputy Administrator for Defense Programs: [End of section] Appendix IV: GAO Contacts and Staff Acknowledgments: GAO Contacts: Gene Aloise (202) 512-3841 or aloisee@gao.gov Nabajyoti Barkakati (202) 512-6415 or barkakatin@gao.gov Gregory C. Wilshusen (202) 512-6244 or wilshuseng@gao.gov: Staff Acknowledgments: In addition to the individuals named above, Glen Levis, Edward M. Glagola, Jr., and Jeffrey Knott (Assistant Directors); and Preston S. Heard, Jennifer R. Franks, Kevin Metcalfe, and Zsaroq Powe were key contributors to this report. Neil Doherty, Nancy Glover, Franklin Jackson, and Jonathan Kucskar also made key contributions to this report. [End of section] Footnotes: [1] NNSA was established in 2000 as a separately organized agency within the Department of Energy (DOE) and is responsible for the nation's nuclear weapons, nonproliferation, and naval reactors programs. [2] For nearly half a century, the United States' nuclear program was spearheaded by underground nuclear testing and never had to rely on weapon systems that had exceeded their design lifetimes. The United States last produced a nuclear weapon in 1991 and performed its last underground nuclear test in 1992. [3] The National Defense Authorization Act for Fiscal Year 1994, Pub. L. No. 103-160, § 3138 (1993), directed DOE to establish the Stockpile Stewardship Program. 
In the absence of underground nuclear testing, the program encompasses a broad range of activities to increase understanding of the basic phenomena associated with nuclear weapons, provide better predictive understanding of the safety and reliability of weapons, and ensure a strong scientific and technical basis for future nuclear weapons policy objectives. The Stockpile Stewardship Program is carried out through the nuclear weapons complex, which includes three nuclear weapons laboratories. [4] Continuity of operations focuses on restoring an organization's mission-essential functions at an alternate site and performing those functions for a short period of time before returning to normal operations. Contingency and disaster recovery planning include a broad scope of activities designed to sustain and recover critical information and information system services for a range of potential service disruptions. Contingency and disaster recovery planning components may include the relocation of information systems and operations to an alternate site, recovery of information system functions using alternate equipment, or the performance of information system functions using alternative methods. For the purposes of this report, the term contingency and disaster recovery planning refers to the interim measures NNSA should use to recover information system services after an unexpected service disruption. [5] Los Alamos is managed and operated by Los Alamos National Security, LLC, which is a consortium of contractors that includes Bechtel National, the University of California, the Babcock and Wilcox Company, and the Washington Division of URS. Livermore is managed and operated by Lawrence Livermore National Security, LLC, which is comprised of a corporate management team that includes Bechtel National, the University of California, the Babcock and Wilcox Company, and the Washington Division of URS. 
Sandia is managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation. [6] FLOPS are a measure of a supercomputing system's performance. Floating-point performance is the rate at which a computer executes floating-point operations. [7] For additional information regarding budgetary information for the classified supercomputing program from fiscal years 2007 through 2009, see appendix II. [8] 44 U.S.C. § 3544(b); FISMA was enacted as title III, E-Government Act of 2002, Pub. L. No. 107-347, 116 Stat. 2899, 2946 (Dec. 17, 2002). [9] For the purposes of this report, we will refer to "continuity of operations procedures for information systems" as contingency and disaster recovery planning. [10] NIST Special Publication 800-34, Contingency Planning Guide for Information Technology Systems (Washington, D.C.: June 2002) and NIST Special Publication 800-53 Revision 3, Recommended Security Controls for Federal Information Systems and Organizations (Gaithersburg, Md.: August 2009). [11] Formerly known as the National Security Telecommunications and Information Systems Security Committee, CNSS provides a forum for the discussion of policy issues, sets national policy, and provides direction, operational procedures, and guidance for the security of national security systems. DOD chairs the committee under the authorities established by National Security Directive 42, National Policy for the Security of National Security Telecommunications and Information Systems, issued in July 1990. This directive designates the Secretary of Defense and the Director of the National Security Agency as the Executive Agent and National Manager, respectively. The committee has 21 voting representatives from various departments and agencies, including the Department of Energy. 
[12] National security systems include any information system used or operated by an agency, or by a contractor of an agency, that processes, stores, or transmits national security information. They do not include those systems used for routine administrative and business applications. [13] CNSS Instruction 1253 provides federal government departments, agencies, bureaus, and offices with a process for security categorization of national security systems that collect, generate, process, store, display, transmit, or receive national security information. In addition, this instruction serves as a companion document to NIST Special Publication 800-53, Revision 3. [14] Although NIST guidelines note they shall not apply to national security systems without the express approval of appropriate federal officials exercising policy authority over such systems, CNSS instructions, as well as DOE and NNSA policies for national security systems, refer to the NIST guidelines as being applicable. [15] A contingency plan is designed to maintain or restore business operations, including computer operations, possibly at an alternate location in the event of emergencies, system failures, or disaster. A disaster recovery plan is a written plan for processing critical applications in the event of a major hardware or software failure or destruction of facilities. [16] Outage impacts and allowable outage times enable the organization to develop and prioritize recovery strategies that personnel will implement during contingency plan activation. The effects of the outage may be tracked over time, which will enable the organization to identify the maximum allowable time that a resource may be unavailable before it inhibits the performance of an essential function. The effects of the outage can also be tracked across related resources, identifying any cascading effects that may occur as an effect of a service disruption. 
[17] The Department of Energy defines "mission critical" as an information system that supports an organization's core missions and goals, and "mission-essential (or business essential)" as an information system whose failure would not preclude organizations from accomplishing core business functions in the long term. [18] The Capability Computing Campaign includes a committee made up of staff from the NNSA ASC program office, as well as ASC executives located at the laboratories at Los Alamos, Livermore, and Sandia. [19] Total usable supercomputing capacity includes the supercomputers that have the ability to run all weapons program codes and could be used in the event of a service disruption, and includes capacity and capability systems. [20] The term "on demand" is defined as the ability to move an application (simulation program/code) from one supercomputer to a different supercomputer at a different physical facility and use the existing computational resources without the need for major modifications. [End of section] GAO's Mission: The Government Accountability Office, the audit, evaluation and investigative arm of Congress, exists to support Congress in meeting its constitutional responsibilities and to help improve the performance and accountability of the federal government for the American people. GAO examines the use of public funds; evaluates federal programs and policies; and provides analyses, recommendations, and other assistance to help Congress make informed oversight, policy, and funding decisions. GAO's commitment to good government is reflected in its core values of accountability, integrity, and reliability. Obtaining Copies of GAO Reports and Testimony: The fastest and easiest way to obtain copies of GAO documents at no cost is through GAO's Web site [hyperlink, http://www.gao.gov]. Each weekday, GAO posts newly released reports, testimony, and correspondence on its Web site. 
To have GAO e-mail you a list of newly posted products every afternoon, go to [hyperlink, http://www.gao.gov] and select "E-mail Updates." Order by Phone: The price of each GAO publication reflects GAO's actual cost of production and distribution and depends on the number of pages in the publication and whether the publication is printed in color or black and white. Pricing and ordering information is posted on GAO's Web site, [hyperlink, http://www.gao.gov/ordering.htm]. Place orders by calling (202) 512-6000, toll free (866) 801-7077, or TDD (202) 512-2537. Orders may be paid for using American Express, Discover Card, MasterCard, Visa, check, or money order. Call for additional information. To Report Fraud, Waste, and Abuse in Federal Programs: Contact: Web site: [hyperlink, http://www.gao.gov/fraudnet/fraudnet.htm]: E-mail: fraudnet@gao.gov: Automated answering system: (800) 424-5454 or (202) 512-7470: Congressional Relations: Ralph Dawn, Managing Director, dawnr@gao.gov: (202) 512-4400: U.S. Government Accountability Office: 441 G Street NW, Room 7125: Washington, D.C. 20548: Public Affairs: Chuck Young, Managing Director, youngc1@gao.gov: (202) 512-4800: U.S. Government Accountability Office: 441 G Street NW, Room 7149: Washington, D.C. 20548:
