Information Security
National Nuclear Security Administration Needs to Improve Contingency Planning for Its Classified Supercomputing Operations
GAO ID: GAO-11-67 December 9, 2010
In the absence of underground nuclear weapons testing, the National Nuclear Security Administration (NNSA) relies on its supercomputing operations at its three weapons laboratories to simulate the effects of changes to current weapons systems, calculate the confidence of future untested systems, and ensure military requirements are met. GAO was requested to assess the extent to which (1) NNSA has implemented contingency and disaster recovery planning and testing for its classified supercomputing systems, (2) the laboratories are able to share supercomputing capacity for recovery operations, and (3) NNSA tracks the costs for contingency and disaster recovery planning for supercomputing assets. To do this work, GAO examined contingency and disaster recovery planning policies and activities, and analyzed classified supercomputing capabilities at the weapons laboratories, and NNSA budgetary data.
All three NNSA weapons laboratories--Los Alamos, Sandia, and Lawrence Livermore--have implemented some components of a contingency planning and disaster recovery program. NNSA, however, has not provided effective oversight to ensure that the laboratories have comprehensive and effective contingency and disaster recovery planning and testing. Further, due to lack of planning and analysis by NNSA and the laboratories, the impact of a system outage is unclear. Only one of the three laboratories--Los Alamos--had conducted a business impact analysis to assess the criticality of resources and acceptable outage time frames; yet, NNSA and all three laboratories consider the consequence associated with the loss of system availability to be low impact and do not consider the classified supercomputers to be mission critical. Nonetheless, NNSA classified supercomputing capabilities serve as a computational surrogate to nuclear weapons testing and are used to address other areas of national security. Despite the absence of business impact analyses, all laboratories had key components of a contingency planning program in place. However, shortcomings existed. For example, all laboratories had backup processes in place and had developed contingency plans, but the plans were not comprehensive. Specifically, one plan did not address the supercomputing operations, and none of the plans had been tested at the time of GAO's review. In addition, the laboratories addressed disaster recovery to a limited extent, but not specifically for the supercomputers. These shortcomings existed, at least in part, because NNSA's component organizations, including the Office of the Chief Information Officer, were unclear about their roles and responsibilities for providing oversight in the laboratories' implementation of contingency and disaster recovery planning. 
Until the agency fully implements a contingency and disaster recovery planning program for its weapons laboratories, it has limited assurance that vital information can be recovered and made available to meet national security priorities and requirements. Although the laboratories have the technological capability to share supercomputing capacity across all three weapons laboratories, barriers exist that could impede recovery operations. For example, the laboratories do not know the minimum supercomputing capacity needed to meet program requirements, such as simulating the effects of changes to weapons systems, should a disruption occur. In addition, the laboratories have not tested the technological capability to share the capacity on an on-demand basis for recovery operations. Without having an understanding of capacity needs and subsequent testing, the laboratories have little assurance that they could effectively share capacity if needed. Although NNSA obligated approximately $1.7 billion to help implement its classified supercomputing program from fiscal years 2007 through 2009, the agency has not tracked costs for contingency and disaster recovery planning and is uncertain of actual funds that were spent toward these efforts. GAO recommends, among other things, that NNSA clearly define roles and responsibilities for its component organizations in providing oversight for contingency and disaster recovery planning for the classified supercomputing environment. NNSA agreed with most of GAO's recommendations, but did not concur with recommendations relating to capacity planning and cost tracking.
Director:
Gregory C. Wilshusen
Team:
Government Accountability Office: Information Technology
Phone:
(202) 512-6244
This is the accessible text file for GAO report number GAO-11-67
entitled 'Information Security: National Nuclear Security
Administration Needs to Improve Contingency Planning for Its
Classified Supercomputing Operations' which was released on December
9, 2010.
This text file was formatted by the U.S. Government Accountability
Office (GAO) to be accessible to users with visual impairments, as
part of a longer term project to improve GAO products' accessibility.
Every attempt has been made to maintain the structural and data
integrity of the original printed product. Accessibility features,
such as text descriptions of tables, consecutively numbered footnotes
placed at the end of the file, and the text of agency comment letters,
are provided but may not exactly duplicate the presentation or format
of the printed version. The portable document format (PDF) file is an
exact electronic replica of the printed version. We welcome your
feedback. Please E-mail your comments regarding the contents or
accessibility features of this document to Webmaster@gao.gov.
This is a work of the U.S. government and is not subject to copyright
protection in the United States. It may be reproduced and distributed
in its entirety without further permission from GAO. Because this work
may contain copyrighted images or other material, permission from the
copyright holder may be necessary if you wish to reproduce this
material separately.
United States Government Accountability Office:
GAO:
Report to Congressional Requesters:
December 2010:
Information Security:
National Nuclear Security Administration Needs to Improve Contingency
Planning for Its Classified Supercomputing Operations:
GAO-11-67:
GAO Highlights:
Highlights of GAO-11-67, a report to congressional requesters.
Why GAO Did This Study:
In the absence of underground nuclear weapons testing, the National
Nuclear Security Administration (NNSA) relies on its supercomputing
operations at its three weapons laboratories to simulate the effects
of changes to current weapons systems, calculate the confidence of
future untested systems, and ensure military requirements are met.
GAO was requested to assess the extent to which (1) NNSA has
implemented contingency and disaster recovery planning and testing for
its classified supercomputing systems, (2) the laboratories are able
to share supercomputing capacity for recovery operations, and (3) NNSA
tracks the costs for contingency and disaster recovery planning for
supercomputing assets. To do this work, GAO examined contingency and
disaster recovery planning policies and activities, and analyzed
classified supercomputing capabilities at the weapons laboratories,
and NNSA budgetary data.
What GAO Found:
All three NNSA weapons laboratories--Los Alamos, Sandia, and Lawrence
Livermore--have implemented some components of a contingency planning
and disaster recovery program. NNSA, however, has not provided
effective oversight to ensure that the laboratories have comprehensive
and effective contingency and disaster recovery planning and testing.
Further, due to lack of planning and analysis by NNSA and the
laboratories, the impact of a system outage is unclear. Only one of
the three laboratories--Los Alamos--had conducted a business impact
analysis to assess the criticality of resources and acceptable outage
time frames; yet, NNSA and all three laboratories consider the
consequence associated with the loss of system availability to be low
impact and do not consider the classified supercomputers to be mission
critical. Nonetheless, NNSA classified supercomputing capabilities
serve as a computational surrogate to nuclear weapons testing and are
used to address other areas of national security. Despite the absence
of business impact analyses, all laboratories had key components of a
contingency planning program in place. However, shortcomings existed.
For example, all laboratories had backup processes in place and had
developed contingency plans, but the plans were not comprehensive.
Specifically, one plan did not address the supercomputing operations,
and none of the plans had been tested at the time of GAO's review. In
addition, the laboratories addressed disaster recovery to a limited
extent, but not specifically for the supercomputers. These
shortcomings existed, at least in part, because NNSA's component
organizations, including the Office of the Chief Information Officer,
were unclear about their roles and responsibilities for providing
oversight in the laboratories' implementation of contingency and
disaster recovery planning. Until the agency fully implements a
contingency and disaster recovery planning program for its weapons
laboratories, it has limited assurance that vital information can be
recovered and made available to meet national security priorities and
requirements.
Although the laboratories have the technological capability to share
supercomputing capacity across all three weapons laboratories,
barriers exist that could impede recovery operations. For example, the
laboratories do not know the minimum supercomputing capacity needed to
meet program requirements, such as simulating the effects of changes
to weapons systems, should a disruption occur. In addition, the
laboratories have not tested the technological capability to share the
capacity on an on-demand basis for recovery operations. Without having
an understanding of capacity needs and subsequent testing, the
laboratories have little assurance that they could effectively share
capacity if needed.
Although NNSA obligated approximately $1.7 billion to help implement
its classified supercomputing program from fiscal years 2007 through
2009, the agency has not tracked costs for contingency and disaster
recovery planning and is uncertain of actual funds that were spent
toward these efforts.
What GAO Recommends:
GAO recommends, among other things, that NNSA clearly define roles and
responsibilities for its component organizations in providing
oversight for contingency and disaster recovery planning for the
classified supercomputing environment. NNSA agreed with most of GAO's
recommendations, but did not concur with recommendations relating to
capacity planning and cost tracking.
View [hyperlink, http://www.gao.gov/products/GAO-11-67] or key
components. For more information, contact Gregory C. Wilshusen at
(202) 512-6244 or wilshuseng@gao.gov, Gene Aloise at (202) 512-3841 or
aloisee@gao.gov, or Naba Barkakati at (202) 512-6415 or
barkakatin@gao.gov.
[End of section]
Contents:
Letter:
Background:
NNSA Has Not Fully Implemented Contingency and Disaster Recovery
Planning and Testing for Its Classified Supercomputing Assets:
The Laboratories Have the Ability to Share Supercomputing Capacity,
but Barriers Exist:
NNSA Does Not Track the Costs for Ensuring Contingency and Disaster
Recovery Planning for Its Supercomputing Assets:
Conclusions:
Recommendations for Executive Action:
Agency Comments and Our Evaluation:
Appendix I: Objectives, Scope, and Methodology:
Appendix II: NNSA Annual Obligations for Its Advanced Simulation and
Computing Program, Fiscal Years 2007 through 2009:
Appendix III: Comments from the National Nuclear Security
Administration:
Appendix IV: GAO Contacts and Staff Acknowledgments:
Table:
Table 1: Inventory of NNSA-Deployed Classified Supercomputing Systems
(as of October 2010):
Figures:
Figure 1: Common Hardware Components of a Supercomputing System:
Figure 2: NNSA's Classified Supercomputing Network Infrastructure:
Figure 3: Total Usable Supercomputing Capacity at Each Weapons
Laboratory, 2010 and 2011:
Figure 4: Annual Obligations for NNSA's Advanced Simulation and
Computing Program, Fiscal Years 2007 through 2009:
Abbreviations:
ASC: Advanced Simulation and Computing:
BIA: business impact analysis:
CNSS: Committee on National Security Systems:
DISCOM: Distance Computing:
DOD: Department of Defense:
DOE: Department of Energy:
FISMA: Federal Information Security Management Act of 2002:
FLOPS: floating-point operations per second:
Livermore: Lawrence Livermore National Laboratory:
Los Alamos: Los Alamos National Laboratory:
NIST: National Institute of Standards and Technology:
NNSA: National Nuclear Security Administration:
Sandia: Sandia National Laboratories:
[End of section]
United States Government Accountability Office:
Washington, DC 20548:
December 9, 2010:
The Honorable Henry Waxman:
Chairman:
Committee on Energy and Commerce:
House of Representatives:
The Honorable Edward J. Markey:
Chairman:
Subcommittee on Energy and the Environment:
Committee on Energy and Commerce:
House of Representatives:
The Honorable Bart Stupak:
Chairman:
Subcommittee on Oversight and Investigations:
Committee on Energy and Commerce:
House of Representatives:
The National Nuclear Security Administration[Footnote 1] (NNSA)
provides classified supercomputing capabilities for assessing the
performance of nuclear weapons. In the absence of nuclear weapons
testing--which ceased in 1992--the simulation capabilities of NNSA's
supercomputers are a necessary means to determine the effects of
changes to current weapons systems and to determine a level of
confidence in the performance of future untested systems.[Footnote 2]
These simulation capabilities also contribute to the enhancement of
NNSA's ability to predict the performance of weapons systems to ensure
the systems meet all military requirements established by the
Department of Defense (DOD).
NNSA's three nuclear weapons laboratories--Los Alamos National
Laboratory (Los Alamos) in New Mexico, Lawrence Livermore National
Laboratory (Livermore) in California, and the Sandia National
Laboratories (Sandia) with locations in New Mexico and California--use
these supercomputing simulation capabilities to obtain a comprehensive
understanding of the entire nuclear weapons life cycle, from design to
safe processes for dismantlement. These classified supercomputing
capabilities are a considerable investment and serve as a cornerstone
for NNSA's Stockpile Stewardship Program.[Footnote 3] In addition,
classified supercomputing capabilities are essential for informing
critical decisions related to the nuclear stockpile, including all
stockpile modernization and warhead studies. NNSA classified
supercomputing capabilities are also used to address other areas of
national security, including intelligence analyses, nuclear forensics,
and emergency response. Because of the importance of these classified
supercomputing capabilities to issues central to national security,
contingency and disaster recovery planning[Footnote 4] are key to
ensuring that, when unexpected events occur, NNSA can recover and
reconstitute its classified supercomputing systems, data, and
operations.
Our objectives were to assess the extent to which (1) NNSA has
implemented contingency and disaster recovery planning and testing for
its classified supercomputing assets, (2) the three laboratories are
able to share classified supercomputing capacity for recovery
operations, should service disruptions occur, and (3) NNSA tracks the
costs for ensuring contingency and disaster recovery planning for its
classified supercomputing assets. To accomplish these objectives, we
examined contingency and disaster recovery planning controls for the
systems within the classified supercomputing environment that are a
necessary means for NNSA's achievement of its nuclear weapons mission.
In addition, we performed technical assessments of classified
supercomputing capabilities at each weapons laboratory, including each
laboratory's ability to share supercomputing capacity. Further, we
obtained information from NNSA and laboratory officials to determine
how expenditures were tracked for contingency and disaster recovery
planning of the classified supercomputing systems at each of the
laboratories, as well as projected future cost estimates for ensuring
the recovery of these assets.
We conducted this performance audit from December 2009 through
December 2010 in accordance with generally accepted government
auditing standards. Those standards require that we plan and perform
the audit to obtain sufficient, appropriate evidence to provide a
reasonable basis for our findings and conclusions based on our audit
objectives. We believe that the evidence obtained provides a
reasonable basis for our findings and conclusions based on our audit
objectives. A more detailed description of our objectives, scope, and
methodology is contained in appendix I.
Background:
NNSA relies on its Stockpile Stewardship Program to ensure the safety,
security, and effectiveness of the nuclear weapons stockpile. The
Stockpile Stewardship Program comprises various elements,
including, but not limited to: (1) the Advanced Simulation and
Computing (ASC) Campaign, which provides the computational science and
simulation tools to understand the behaviors and effects of nuclear
weapons; (2) Directed Stockpile Work, which provides evidence of the
health of the nuclear weapons stockpile and involves day-to-day
maintenance of these weapons, including life extension efforts; (3)
the Science Campaign, which provides tools and capabilities geared
toward advancing the general understanding of all nuclear weapons
systems; and (4) the Engineering Campaign, which provides a sustained
basis for stockpile certification and assessments throughout the life
cycle of each weapon. The coordination among the Stockpile Stewardship
elements is instrumental to increasing NNSA's confidence in the
performance of nuclear weapons.
To help accomplish its Stockpile Stewardship mission, NNSA relies on
the three weapons laboratories--Los Alamos, Livermore, and Sandia. Los
Alamos and Livermore are the two design laboratories that are
responsible for designing the nuclear weapons' explosive package and
conducting research to better understand nuclear weapons phenomena.
Sandia is an engineering laboratory and has principal responsibility
for the research, design, and development of nonnuclear warhead
components; integration of these components with Los Alamos and
Livermore; and overall warhead systems integration with DOD. In
accordance with NNSA, management and operations contractors, who are
responsible for day-to-day operations of the laboratories, are
required to adhere to agency policies.[Footnote 5]
At the time of our review, NNSA's classified supercomputing resources
consisted of 12 classified supercomputing systems. Figure 1 shows the
hardware configuration of a supercomputing system.
Figure 1: Common Hardware Components of a Supercomputing System:
[Refer to PDF for image: illustration]
Compute chip:
Compute card:
Node card:
Cabinet:
System:
Source: GAO, data provided by Los Alamos, Livermore, and Sandia.
[End of figure]
NNSA classified supercomputing systems employ a large number of
interdependent processors, which are the core unit of a computer that
gathers instructions and data. These processors are mounted onto a
compute chip, which is the portion of the system that carries out the
instructions of a computer program. These compute chips are inserted
onto a compute card, which also holds memory for the compute chips to
use. A number of compute cards are attached to a node card, which has
one or more processors with a common memory and is connected by
high-speed interconnection networks. Each node card is inserted into a
single cabinet, and that configuration is repeated many times to build
a single supercomputing system. Each supercomputing system has a peak
performance, which is the maximum rate of floating-point operations
per second (FLOPS) that the system can sustain.[Footnote 6] Currently,
almost all NNSA classified supercomputer systems operate at the
teraFLOP level, which represents a trillion FLOPS.
According to NNSA, the laboratories have three types of classified
supercomputing systems:
Capacity: Small systems that execute parallel problems with more
modest computational requirements. These systems serve as the
workhorse for the ASC program and are responsible for processing the
day-to-day supercomputing workload.
Capability: This type of supercomputer is used to solve the largest
and most demanding problems that other computing systems cannot manage.
Advanced architecture: Research and development systems that assist
the ASC program in preparing to rapidly deploy and exploit the next
generation of supercomputing technology. These systems have a targeted
workload and serve as the foundation for the next generation of NNSA
supercomputers.
Table 1 shows the classified supercomputing systems currently in use
at the three weapons laboratories.
Table 1: Inventory of NNSA-Deployed Classified Supercomputing Systems
(as of October 2010):
Site: Los Alamos:
System name: Roadrunner Base;
System type: Capacity;
Delivery date: 10/2006;
Total processors: 18,432;
Peak performance (TeraFLOPS): 76.0.
System name: Roadrunner Phase-3;
System type: Advanced architecture;
Delivery date: 9/2008;
Total processors: 24,480;
Peak performance (TeraFLOPS): 1,280.0.
System name: Hurricane;
System type: Capacity;
Delivery date: 9/2008;
Total processors: 5,760;
Peak performance (TeraFLOPS): 51.2.
Site: Livermore:
System name: BlueGene/L;
System type: Advanced architecture;
Delivery date: 11/2004;
Total processors: 131,072;
Peak performance (TeraFLOPS): 367.0.
System name: Purple[A];
System type: Capability;
Delivery date: 6/2005;
Total processors: 12,288;
Peak performance (TeraFLOPS): 93.4.
System name: Rhea;
System type: Capacity;
Delivery date: 9/2006;
Total processors: 4,608;
Peak performance (TeraFLOPS): 22.1.
System name: Minos;
System type: Capacity;
Delivery date: 6/2007;
Total processors: 6,912;
Peak performance (TeraFLOPS): 33.2.
System name: Juno;
System type: Capacity;
Delivery date: 5/2008;
Total processors: 18,432;
Peak performance (TeraFLOPS): 162.2.
System name: Dawn;
System type: Advanced architecture;
Delivery date: 1/2009;
Total processors: 147,456;
Peak performance (TeraFLOPS): 501.4.
Site: Sandia-NM:
System name: Red Storm;
System type: Advanced architecture;
Delivery date: 3/2005;
Total processors: 31,680;
Peak performance (TeraFLOPS): 284.0.
System name: Unity;
System type: Capacity;
Delivery date: 3/2009;
Total processors: 4,352;
Peak performance (TeraFLOPS): 38.0.
Site: Sandia-CA:
System name: Whitney;
System type: Capacity;
Delivery date: 3/2009;
Total processors: 4,352;
Peak performance (TeraFLOPS): 38.0.
Source: GAO summary of data from Los Alamos National Laboratory,
Lawrence Livermore National Laboratory, and Sandia National
Laboratories.
[A] Although Purple was the capability system in use at the time of
our site visits, Livermore retired the system in November 2010.
[End of table]
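As a rough cross-check on Table 1, the peak performance figures can be
totaled by site. The sketch below is illustrative only: the tuple
layout and the function name are assumptions made for this example,
and the figures are copied from Table 1. Summed peak performance is a
theoretical ceiling, not a measure of the capacity actually usable for
recovery operations.

```python
from collections import defaultdict

# (site, system name, system type, peak teraFLOPS), copied from Table 1.
SYSTEMS = [
    ("Los Alamos", "Roadrunner Base", "Capacity", 76.0),
    ("Los Alamos", "Roadrunner Phase-3", "Advanced architecture", 1280.0),
    ("Los Alamos", "Hurricane", "Capacity", 51.2),
    ("Livermore", "BlueGene/L", "Advanced architecture", 367.0),
    ("Livermore", "Purple", "Capability", 93.4),
    ("Livermore", "Rhea", "Capacity", 22.1),
    ("Livermore", "Minos", "Capacity", 33.2),
    ("Livermore", "Juno", "Capacity", 162.2),
    ("Livermore", "Dawn", "Advanced architecture", 501.4),
    ("Sandia-NM", "Red Storm", "Advanced architecture", 284.0),
    ("Sandia-NM", "Unity", "Capacity", 38.0),
    ("Sandia-CA", "Whitney", "Capacity", 38.0),
]


def peak_teraflops_by_site(systems):
    """Sum the peak teraFLOPS column for each site."""
    totals = defaultdict(float)
    for site, _name, _system_type, peak in systems:
        totals[site] += peak
    # Round to one decimal place, matching the precision used in Table 1.
    return {site: round(total, 1) for site, total in totals.items()}


if __name__ == "__main__":
    for site, total in peak_teraflops_by_site(SYSTEMS).items():
        print(f"{site}: {total:.1f} teraFLOPS")
```

The list holds 12 entries, matching the 12 classified supercomputing
systems in place at the time of the review.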
NNSA's classified supercomputing capabilities consist of supporting
resources, including (1) parallel file systems, which store
transitory data; (2) network file systems, which store user and
project data for a calculation; (3) archival storage systems, which
serve as storage for data; and (4) visualization systems, which enable
users to better comprehend the results of their computations.
NNSA's classified supercomputing systems are connected via its
Enterprise Secure Network and the Distance Computing (DISCOM) network,
which function as supporting resources for the classified
supercomputing environment. The Enterprise Secure Network provides
classified communications across the nuclear weapons complex,
including security services and other activities that ensure the flow
of NNSA's data sharing and business missions. DISCOM provides secure,
high-speed remote access for intra- and inter-site file transfers and
enables users, across the three weapons laboratories, to operate on
remote computing resources as if they were local. DISCOM and the
Enterprise Secure Network serve as the backup networks to each other.
Figure 2 shows the composition of NNSA's classified supercomputing
network infrastructure.
Figure 2: NNSA's Classified Supercomputing Network Infrastructure:
[Refer to PDF for image: illustration]
The illustration depicts the following connections:
Connected to DISCOM Network (10 Gigabits per second):
and to NNSA Enterprise Secure Network (1 Gigabit per second):
Lawrence Livermore National Laboratory:
* BlueGene/L;
* Purple;
* Rhea;
* Minos;
* Juno;
* Dawn.
Los Alamos National Laboratory:
* Roadrunner;
* Roadrunner Phase 3;
* Hurricane.
Sandia National Laboratories, California:
* Whitney.
Sandia National Laboratories, New Mexico:
* Red Storm;
* Unity.
DISCOM Network and NNSA Enterprise Secure Network are interconnected.
Source: GAO, data provided by Los Alamos, Livermore, and Sandia.
[End of figure]
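To give a sense of what the two link speeds in figure 2 mean for
moving data between sites, the sketch below estimates an idealized
transfer time. The 100-terabyte data set size is a hypothetical value
chosen for illustration, not a figure from NNSA or the laboratories,
and the calculation assumes the full nominal link rate with no
protocol overhead, encryption cost, or contention.

```python
# Nominal link rates from figure 2, in bits per second.
DISCOM_BPS = 10e9             # 10 Gigabits per second
ENTERPRISE_SECURE_BPS = 1e9   # 1 Gigabit per second

# Hypothetical data set size (assumption for illustration only).
DATASET_BYTES = 100e12        # 100 terabytes, decimal


def ideal_transfer_hours(size_bytes, link_bps):
    """Best-case hours to move size_bytes over a link running at link_bps."""
    seconds = size_bytes * 8 / link_bps
    return seconds / 3600


if __name__ == "__main__":
    links = (
        ("DISCOM", DISCOM_BPS),
        ("Enterprise Secure Network", ENTERPRISE_SECURE_BPS),
    )
    for name, rate in links:
        hours = ideal_transfer_hours(DATASET_BYTES, rate)
        print(f"{name}: about {hours:.1f} hours")
```

Under these idealized assumptions, the same transfer takes ten times
as long over the 1-gigabit link as over the 10-gigabit link.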
NNSA reported obligating approximately $1.7 billion from fiscal years
2007 through 2009 to support ASC program activities at the three
weapons laboratories.[Footnote 7] The $1.7 billion was predominantly
associated with three efforts:
Weapons codes and models. This effort is intended to develop and
improve weapons simulation models and codes for predicting the
behavior of weapons systems and devices in the nuclear stockpile.
Computational systems and software environment. This effort is
intended to provide ASC users a stable, seamless computing environment
for ASC-deployed platforms. It is responsible for procuring,
delivering, and deploying ASC computational systems and user
environments via technology development and integration across the
three weapons laboratories.
Facility operations and user support. This effort is intended to
provide both the necessary physical facility and operational support
for reliable supercomputing and storage environments, as well as a
suite of user services for effective use of the three weapons
laboratories' computing resources. Facility operations cover physical
space, power and other utility infrastructure, and local- and wide-area
networking, as well as system administration, cyber security, and
operations services for ongoing support. The user support function
includes planning, development, integration and deployment, continuing
product support, and quality and reliability activity collaborations.
To strengthen the security of information and information systems
across the federal government, including those at NNSA's weapons
laboratories, the Federal Information Security Management Act of 2002
(FISMA) requires each agency to develop, document, and implement an
agencywide information security program that supports the operations
and assets of the agency, including those provided or managed by
another agency or contractor on its behalf.[Footnote 8] This security
program is to include plans and procedures to ensure the continuity of
operations for information systems that support the agency's
operations.[Footnote 9] Pursuant to its FISMA responsibilities, the
National Institute of Standards and Technology (NIST) has issued
federal standards and guidelines on information security, such as a
contingency planning guide for federal information systems, and
recommended security controls, which address contingency and disaster
recovery planning and testing.[Footnote 10] To further ensure the
security of national security systems, the Committee on National
Security Systems (CNSS)[Footnote 11] requires federal agencies with
national security systems to implement a comprehensive set of security
controls and enhancements for these systems.[Footnote 12] CNSS
requires that each agency implement a contingency and disaster
recovery planning capability that ensures the integrity and
availability of its national security information and information
systems.[Footnote 13]
FISMA, NIST guidelines,[Footnote 14] and CNSS policies all call for
contingency and disaster recovery planning--also referred to as
continuity of operations for information systems--for critical
components of information protection. DOE and NNSA policies also
regard contingency and disaster recovery plans as being necessary for
information protection. If normal operations are interrupted,
contingency and disaster recovery plans allow senior agency officials
to detect, mitigate, and recover operations. Examples of the key
components that make up contingency and disaster recovery planning
programs include (1) assessing the criticality and sensitivity of
computerized operations and identification of supporting resources
such as developing business impact analyses (BIA), (2) taking steps to
prevent and minimize potential damage and interruption such as
establishing data backup processes, (3) developing comprehensive
contingency and disaster recovery plans,[Footnote 15] and (4)
conducting periodic testing of contingency and disaster recovery plans.
The extent to which controls--such as contingency and disaster
recovery planning--are implemented depends on a level of risk assigned
to the system or information maintained on the system. NIST standards
and guidelines, CNSS instructions, and NNSA policy allow consideration
of risk in determining the level of protection of systems and data.
These standards and policies require that organizations consider the
impact or consequences of loss as it relates to the confidentiality,
integrity, and availability of the information, and assign a value of
low, moderate, or high impact levels. For contingency and disaster
recovery planning, consideration of "availability" is the key element.
NNSA policy defines the values for the consequences of loss associated
with availability as follows:
High: Loss of life might result from loss of availability; information
must always be available on request, with no tolerance for delay; loss
of availability will have an adverse effect on national-level
interests; federal requirement (i.e., requirement for material control
and accountability inventory); or loss of availability will have an
adverse effect on confidentiality.
Moderate: Information must be readily available with minimum tolerance
for delay; bodily injury might result from loss of availability; or
loss of availability will have an adverse effect on
organizational-level interests.
Low: Information must be available with flexible tolerance for delay.
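The availability definitions above can be read as a simple decision
rule: any High criterion yields High, otherwise any Moderate criterion
yields Moderate, otherwise Low. The sketch below encodes that reading.
The field names are shorthand invented for this example, not NNSA
terminology, and treating the highest matching level as the result is
an assumption about how the criteria combine.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"


@dataclass
class AvailabilityLoss:
    # Each flag paraphrases one criterion from the NNSA definitions.
    # The attribute names are hypothetical.
    loss_of_life_possible: bool = False
    no_tolerance_for_delay: bool = False
    harms_national_interests: bool = False
    federal_requirement: bool = False
    harms_confidentiality: bool = False
    bodily_injury_possible: bool = False
    minimum_tolerance_for_delay: bool = False
    harms_organizational_interests: bool = False


def availability_impact(loss: AvailabilityLoss) -> Impact:
    """Return the highest impact level whose criteria are met (assumed rule)."""
    high_criteria = (
        loss.loss_of_life_possible,
        loss.no_tolerance_for_delay,
        loss.harms_national_interests,
        loss.federal_requirement,
        loss.harms_confidentiality,
    )
    moderate_criteria = (
        loss.bodily_injury_possible,
        loss.minimum_tolerance_for_delay,
        loss.harms_organizational_interests,
    )
    if any(high_criteria):
        return Impact.HIGH
    if any(moderate_criteria):
        return Impact.MODERATE
    # No stricter criterion applies: flexible tolerance for delay.
    return Impact.LOW
```

With no flags set, availability_impact(AvailabilityLoss()) returns
Impact.LOW; setting harms_national_interests=True returns Impact.HIGH.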
NNSA Has Not Fully Implemented Contingency and Disaster Recovery
Planning and Testing for Its Classified Supercomputing Assets:
Contingency and disaster recovery planning and testing for NNSA's
classified supercomputing systems have not been fully implemented at
each of the three weapons laboratories--Los Alamos, Sandia, and
Livermore. Specifically, NNSA did not ensure that the laboratories (1)
developed BIAs to determine the impact of potential service
disruptions, (2) fully tested data backup processes, and (3) developed
and tested contingency and disaster recovery plans. These shortcomings
existed, at least in part, because NNSA's component organizations were
unclear about their roles and responsibilities for providing oversight in
the laboratories' implementation of contingency and disaster recovery
planning. Until the agency fully implements a contingency and disaster
recovery planning program for its classified supercomputing assets at
the weapons laboratories, it has limited assurance that vital
information can be recovered and made available to meet national
security priorities and requirements.
Not All of the Laboratories Assessed the Criticality and Sensitivity
of Supercomputer Operations and Resources, or Potential Outage Impact:
To assess the criticality and sensitivity of computerized operations
and identification of supporting resources, NIST guidelines state that
agencies should determine their recovery strategies by performing
business impact analyses of their systems. A BIA is an analysis of
information technology system requirements, processes, and
interdependencies used to characterize system contingency requirements
and priorities in the event of a significant disruption. According to
NIST guidelines, agencies should conduct a BIA to identify critical
information systems and to characterize each system's requirements,
processes, and interdependencies so that contingency requirements and
priorities can be determined. In addition, the guidelines state that the BIA
process should follow three main steps: (1) identify critical data and
information technology resources, (2) identify outage impacts and
allowable outage times,[Footnote 16] and (3) develop recovery
priorities and strategies. NNSA policy also requires a BIA to identify
systems that provide critical services to site operations and
prioritize these systems and their components.
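The three BIA steps can be pictured as a record with fields for each step and a check for fields that are still empty. The Python sketch below is a minimal illustration under that assumption. The class, its field names, and the helper function are hypothetical and are not drawn from NIST guidelines or NNSA policy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BusinessImpactAnalysis:
    """Hypothetical record of the three BIA steps for one system."""
    system: str
    critical_data: List[str] = field(default_factory=list)          # step 1
    critical_it_resources: List[str] = field(default_factory=list)  # step 1
    outage_impacts: List[str] = field(default_factory=list)         # step 2
    allowable_outage_days: Optional[float] = None                   # step 2
    recovery_priorities: List[str] = field(default_factory=list)    # step 3
    recovery_strategies: List[str] = field(default_factory=list)    # step 3


def missing_elements(bia: BusinessImpactAnalysis) -> List[str]:
    """List the BIA elements that have not been filled in."""
    checks = {
        "critical data": bia.critical_data,
        "critical IT resources": bia.critical_it_resources,
        "outage impacts": bia.outage_impacts,
        "allowable outage time": bia.allowable_outage_days,
        "recovery priorities": bia.recovery_priorities,
        "recovery strategies": bia.recovery_strategies,
    }
    return [name for name, value in checks.items()
            if value is None or value == []]
```

A record that names its information technology resources, an allowable outage time, and recovery priorities, but nothing else, would be reported as missing critical data, outage impacts, and recovery strategies.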
One of the laboratories--Los Alamos--had conducted a BIA that
addressed its classified supercomputing systems, generally following
the three steps of a BIA. However, the BIA was not always specific.
For example, the laboratory identified critical information technology
resources for each of its classified supercomputing systems, but did
not specifically identify the critical data. Instead, Los Alamos noted
that the systems are considered neither mission critical nor mission
essential to the business needs of the laboratory,[Footnote 17] and
that the consequence of loss for system availability is low.
Additionally, it defined a specific number of days for the allowable
time frames for fully and partially disabled systems, but did not
provide specifics on allowable outage impacts. Further, the analyses
indicated high-level recovery priorities, but did not provide
specifics regarding the recovery process or strategies that would be
used for recovery efforts.
The other two laboratories did not conduct BIAs specifically for
classified supercomputing systems, but plan to do so. Livermore has a
BIA in place for its logical assets--the applications and services
that provide basic operational support to the Livermore computing
environment, but the BIA did not address any of the classified
supercomputing systems. However, at the time of our site visit,
Livermore officials stated they were beginning the process of
developing a BIA that would address their information technology needs
for their classified supercomputing systems, but the process was still
in the planning stage. Similarly, according to Sandia officials, the
laboratory has BIAs that address its unclassified information
technology systems, but does not currently have one specifically for
its classified supercomputing systems. However, Sandia officials
indicated that they plan to conduct a BIA for classified
supercomputing systems in 2011.
Although the other two laboratories have not conducted BIAs for their
classified supercomputing systems, they have, in line with the BIA
conducted by Los Alamos, rated the consequence of loss of availability
as low impact. NNSA also
considers the consequence of loss as low impact. In addition, NNSA and
the three laboratories do not consider the classified supercomputers
to be "mission critical." One laboratory categorized the systems as
"mission essential," while another referred to them as "mission
support elements, not mission essential elements." However, NNSA's
mission includes maintaining the safety, security, and effectiveness
of the nuclear deterrent without nuclear testing. The supercomputers
provide a necessary means to determine the effects of changes to
current weapons systems and to determine a level of confidence in the
performance of future untested systems. The classified supercomputing
capabilities serve as the computational surrogate to nuclear weapons
testing and are central to national security.
Regarding recovery priorities and strategies, each of the laboratories
indicated that it would likely rely on a process that is currently
being used for the capability system shared among the laboratories.
The laboratories generally rely on the Capability Computing Campaign
to prioritize the workload and develop priorities for jobs that need
to be run on the capability system.[Footnote 18] In the event of a
service disruption or emergency, laboratory officials told us that
they would likely rely on the same process for all of their systems.
However, this process has not been documented as a means for
establishing overall recovery priorities across the laboratories.
Until all of the laboratories have a BIA in place for their classified
supercomputing systems that (1) identifies and categorizes critical
data, (2) identifies allowable outage impacts and time
frames, and (3) establishes emergency processing priorities and
strategies, the potential impact of a system outage will remain
unclear.
The Laboratories Have Backup Processes in Place, but One Storage Site
May Be Susceptible to Damage:
Data backup processes help prevent and minimize potential damage and
interruption to computerized services.
NIST guidelines, as well as CNSS instructions and NNSA policies, call
for agencies to conduct backups of user-level information, system-
level information, and information system documentation. In addition,
NIST, CNSS, and NNSA all provide that agencies establish an alternate
storage site that is separated from the primary storage site so that
both are not susceptible to the same hazards. To ensure the
availability of data stored in the alternate storage site, NIST and
CNSS require that agencies test the backup information to verify the
integrity of the data.
All of the laboratories had backup processes in place. Each of the
laboratories follows similar data backup processing--both manual and
automated procedures--to back up user-level information, system-level
information, and information system documentation. For example, this
information can include global directories, user home directories,
project directories, desktop systems, and critical systems
documentation. Backups occur in increments: daily incremental backups
to disk, weekly full backups to tape, and monthly full-system backups
to tape (with a 6-month on-site storage retention policy). The
laboratories also have vendor-provided software that takes periodic
snapshots of user directories for storage retention purposes. The
snapshot process can be performed manually or can be set up for
automatic processing. Users are encouraged to maintain their data in a
shared environment on the network and are allowed to make their own
determinations regarding what data should be backed up from the
classified supercomputing systems.
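The backup cadence described above (daily incremental, weekly full, and monthly full-system) can be expressed as a small scheduling rule. The Python sketch below is illustrative only. The report does not say on which days the weekly and monthly backups run, so running weekly backups on Sundays and monthly backups on the first day of the month is an assumption, as is the function name.

```python
from datetime import date
from typing import List


def backups_due(day: date) -> List[str]:
    """Return the backup jobs due on `day`.

    Assumed for illustration only: weekly full backups run on Sundays and
    monthly full-system backups run on the first day of the month.
    """
    jobs = ["daily incremental backup to disk"]
    if day.weekday() == 6:  # Monday is 0, so Sunday is 6
        jobs.append("weekly full backup to tape")
    if day.day == 1:
        jobs.append("monthly full-system backup to tape")
    return jobs
```

Whether an incremental backup is skipped on a day when a full backup runs is not stated in the report; the sketch simply lists every job that falls due.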
Not all of the laboratories have an alternate storage site
sufficiently separated from the primary site to not be susceptible to
the same hazards. Two of the three laboratories have alternate storage
sites a considerable distance from their primary storage site.
Livermore sends its system backups electronically to Los Alamos every
6 months. Sandia sends its backup data to its alternate site locations
(e.g., the California site sends its data to the New Mexico site and
the New Mexico site sends its data to the California site). However,
Los Alamos maintains its alternate storage facility on-site in a
building located less than 1 mile away from the primary local backup
storage facility. Consequently, both sites could be susceptible to the
same hazards, such as a wildfire.
The laboratories had processes in place to verify the integrity of the
backed up data. However, tests of their backup procedures rely
predominantly on ad hoc recovery, rather than periodically planned
tests. Los Alamos officials indicated that thousands of file
recoveries have been performed over the years by end users as part of
their testing. Livermore officials stated that the laboratory tests
its local backup procedures through actual system usage on almost a
daily basis, and tests its remote backup procedures at least once
annually. Further, Sandia officials told us they had successfully
tested a sample of data at their offsite facility.
Not All Laboratories Had Developed and Tested Contingency and Disaster
Recovery Plans:
NIST guidelines and CNSS policies call for the development and testing
of contingency plans and the development of disaster recovery plans
for each information system to ensure that, in the event of a service
disruption, the work and supporting functions of the agency can
continue to be performed. According to NIST guidelines, at a minimum,
the contingency plan should address the identification and
notification of key personnel, plan activation, system recovery, and
system reconstitution to meet the needs of the agency's critical
supporting operations. The guidelines also state that the plan should
be tested periodically; CNSS specifies that the frequency of testing
should be annually. NIST also notes that the disaster recovery plan
should be designed to restore operability of the targeted system,
application, or computer facility at an alternate site after a major
service disruption. DOE and NNSA policies also require the development
of contingency and disaster recovery plans and the testing of these
plans in line with NIST and CNSS.
Each of the laboratories had developed contingency plans for their
classified supercomputing systems; however, the plans were not always
comprehensive, and at the time of our site visits, these plans had not
been tested. The laboratories addressed disaster recovery planning to
a limited extent; none specifically addressed the supercomputing
environment. For example,
* Two laboratories--Los Alamos and Sandia--had contingency plans that
addressed the classified supercomputing systems. Although Livermore
had an information technology contingency plan and a master security
plan, neither specifically addressed the supercomputers. In addition,
the plans for both Los Alamos and Sandia included key components such
as the identification and notification of key personnel, plan
activation, system recovery, and system reconstitution procedures;
however, the sufficiency of the level of detail varied. For instance,
one plan provided specific details regarding system recovery processes
and the notification and identification of key personnel, but provided
limited details regarding plan activation and system reconstitution
procedures.
* At the time of our site visits, none of the laboratories had tested
their contingency plans, which were less than a year old. One of the
laboratories--Los Alamos--had created testing guides but had not yet
conducted formal testing. Subsequent to our site visit, Los Alamos
officials indicated that the first test of their plan took place in
September 2010 and noted that the results would be finalized in
December. Additionally, although Sandia had a contingency plan in
place, the plan states that testing is not required because, in the
event of a service disruption, the laboratory would either wait until
the equipment was fully operational or simply acquire new equipment.
This is contrary to NIST guidelines, CNSS instructions, and DOE and
NNSA policies.
* Each of the laboratories had addressed disaster recovery planning to
a limited extent. For example,
- Los Alamos included disaster recovery planning as a section within
its classified supercomputing system's contingency plans. Although
it provided high-level instructions such as directing individuals to
call 911 for all emergencies, it did not include information regarding
the specifics for restoring operability of the classified
supercomputing system at an alternate site after a major service
disruption.
- None of the plans submitted by Livermore specifically addressed the
supercomputing environment, although in the disaster recovery section
of the master security plan, the laboratory noted that it had no
mission-essential systems in the computing environment, and systems
may be offline for an extended period for system upgrades.
- Sandia also had disaster recovery plans in place, designed for
emergency preparedness and disease response planning needs for the
laboratory. These plans focused on emergencies involving the
facilities, operations, and activities for the laboratory, and
provided individuals with emergency information should pandemics
plague the laboratory. However, these plans did not include any
information regarding the classified supercomputing systems.
Unless each of the laboratories develops and sufficiently tests
comprehensive contingency and disaster recovery plans in accordance
with applicable policies and guidance for their classified
supercomputing systems, they face a risk of not being able to
successfully recover their supercomputing assets and operations after
a service disruption.
NNSA Component Organizations Were Unclear about Their Roles and
Responsibilities for Providing Oversight:
The aforementioned shortcomings existed, at least in part, because
NNSA's component organizations were unclear about their roles and
responsibilities for providing oversight of the laboratories'
implementation of contingency and disaster recovery planning. FISMA
requires that the chief information officer, in coordination with
other senior agency officials, manage the development and
implementation of an agencywide information security program that
includes plans and procedures to ensure continuity of operations for
information systems that support the operations and assets of the
agency. NIST guidelines and DOE policies call for individuals with
information system or security management and oversight
responsibilities to take responsibility for the development,
implementation, assessment, monitoring, reviewing, and updating of
security planning policies and procedures, which includes contingency
and disaster recovery plans. Further, the NNSA Safety Management
Function and Responsibilities and Authorities Manual states that the
chief information officer is responsible for information technology
programs and initiatives and for ensuring the security of the agency's
information and systems.
Although roles and responsibilities are defined at a high level in
FISMA and NIST guidelines, as well as in DOE and NNSA policies, NNSA
component organizations were confused about their roles in providing
oversight of the laboratories' implementation of contingency and
disaster recovery planning for the supercomputing systems. For
example, at the beginning of our review, ASC officials told us that,
although they were responsible for administering and managing the
program that uses the classified supercomputing systems, they were not
responsible for contingency and disaster recovery planning. Instead,
they directed us to the Office of the Chief Information Officer
(OCIO), where officials told us that they were not responsible for
contingency and disaster recovery planning for these systems, and
noted that they would only provide guidance if requested by ASC.
Further, OCIO officials told us that ASC had not requested any
assistance. ASC officials subsequently acknowledged that they had
responsibility for contingency and disaster recovery planning;
however, this organizational responsibility is contrary to NIST
guidelines and DOE policies, as well as NNSA's own manual, which gives
this responsibility to the OCIO. In the absence of effective
oversight, the laboratories did not consistently comply with, or fully
implement, federal requirements and guidance related to contingency
planning and disaster recovery. Until NNSA clearly establishes and
carries out defined roles and responsibilities for OCIO and ASC
pertaining to contingency and disaster recovery planning for the
classified supercomputing environment, it will not be able to
effectively manage and oversee the recovery of its supercomputing
operations should service disruptions occur.
The Laboratories Have the Ability to Share Supercomputing Capacity,
but Barriers Exist:
Technologically, the weapons laboratories have demonstrated the
ability to share classified supercomputing capacity using their
capacity and capability systems under normal operating conditions.
Although these supercomputers process unique workloads and operate
independently, they are designed with a similar operating system,
resource manager, and job scheduler, which is built on a Linux
foundation. These supercomputers also include application codes that
are portable across supercomputing systems and a data network, which
allows authorized users local and remote access to the systems.
Although the weapons laboratories have the ability to share
supercomputing capacity, barriers exist. One barrier to sharing
supercomputing capacity is that the weapons laboratories do not know
the minimum supercomputing capacity needed to achieve processing
priorities in the event of a service disruption. NIST guidelines
recommend, and NNSA policy requires, that capacity planning be
conducted so that there is adequate capacity for information
processing and supporting resources during contingency operations.
Although the weapons laboratories have identified the supercomputing
processing needed for normal business operations, they have not
identified the minimum supercomputing capacity needed to achieve
processing priorities in the event of a service disruption.
Another barrier to sharing supercomputing processing is the disparity
in usable supercomputing processing across the laboratories. Figure 3
depicts this disparity by identifying the amount of total usable
supercomputing capacity, in teraFLOPS, for each of the three weapons
laboratories for 2010 and 2011.[Footnote 19]
Figure 3: Total Usable Supercomputing Capacity at Each Weapons
Laboratory, 2010 and 2011:
[Refer to PDF for image: vertical bar graph]
Laboratory: Los Alamos;
Total usable capacity, 2010: 127 teraFLOPS;
Total usable capacity, 2011: 1,427 teraFLOPS.
Laboratory: Livermore;
Total usable capacity, 2010: 311 teraFLOPS;
Total usable capacity, 2011: 218 teraFLOPS.
Laboratory: Sandia;
Total usable capacity, 2010: 76 teraFLOPS;
Total usable capacity, 2011: 76 teraFLOPS.
Source: GAO analysis of supercomputing capacity data provided by Los
Alamos, Livermore, and Sandia.
[End of figure]
For example, in 2010, total usable capacity at Livermore has been 311
teraFLOPS, whereas Los Alamos and Sandia have had 127 teraFLOPS and 76
teraFLOPS, respectively. Should Livermore experience service
disruptions for a sustained amount of time, neither Los Alamos nor
Sandia possesses the necessary usable supercomputing capacity to
accommodate the additional workload, and NNSA would have to reprioritize
the computational workloads across the other two laboratories. As
previously noted, officials at the laboratories told us that, should
disruptions occur, they would use the Capability Computing Campaign
model for re-prioritizing the workload. However, this process has not
been documented for recovery activities.
Further limiting the ability of the weapons laboratories to recover
from a service disruption, in 2011, there will be a significant
disparity in projected usable supercomputing capacity. For example, in
2011, Los Alamos' usable capacity is projected to be 1,427 teraFLOPS,
whereas usable capacity at Livermore and Sandia is to be 218 teraFLOPS
and 76 teraFLOPS, respectively. Should Los Alamos' supercomputing
systems become unavailable for an extended period of time, neither
Livermore nor Sandia possesses sufficient usable supercomputing
capacity to achieve its workload and accommodate the additional
potential computational workload from Los Alamos. According to
laboratory officials, an additional supercomputer will be deployed at
Los Alamos in 2011. This supercomputer is a replacement for the single
capability supercomputer at Livermore that was retired in
2010. Therefore, a significant amount of usable supercomputing
capacity will be centralized at Los Alamos. Because the weapons
laboratories have not determined the minimum supercomputing capacity
requirements for their emergency processing priorities, they may not
be able to meet the minimum computational workload required to meet
Stockpile Stewardship milestones.
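The capacity comparisons above reduce to simple arithmetic on the figure 3 values. The Python sketch below works through that arithmetic. The function name and the "shortfall" measure are assumptions made for illustration. The report does not state how much of each laboratory's capacity is idle, so the sketch compares the capacity lost with the combined total of the other two laboratories, which overstates what they could actually absorb.

```python
# Total usable capacity in teraFLOPS, taken from figure 3.
CAPACITY = {
    2010: {"Los Alamos": 127, "Livermore": 311, "Sandia": 76},
    2011: {"Los Alamos": 1427, "Livermore": 218, "Sandia": 76},
}


def outage_shortfall(year: int, lab_down: str) -> int:
    """TeraFLOPS lost at `lab_down` minus the combined total of the others.

    A positive result means the remaining laboratories together are smaller
    than the one that is down, even before counting their own workloads.
    """
    capacities = CAPACITY[year]
    remaining = sum(v for lab, v in capacities.items() if lab != lab_down)
    return capacities[lab_down] - remaining


for year, lab in [(2010, "Livermore"), (2011, "Los Alamos")]:
    print(year, lab, outage_shortfall(year, lab))
# prints:
# 2010 Livermore 108
# 2011 Los Alamos 1133
```

On these figures, the gap grows from 108 teraFLOPS for a Livermore outage in 2010 to 1,133 teraFLOPS for a Los Alamos outage in 2011, which is consistent with the concern about capacity being centralized at Los Alamos.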
Another barrier to sharing supercomputing capacity across the weapons
laboratories is that the capability to share usable capacity on an "on-
demand"[Footnote 20] basis has not been fully tested in a recovery
scenario. According to officials at the laboratories, during normal
operating conditions, simulation programs have run on other
supercomputing systems. However, consideration has not been given to
include and test these abilities in a disaster recovery scenario
should a service disruption occur. As a result, NNSA has limited
assurance that its disaster recovery approach would work effectively
should a service disruption occur.
NNSA Does Not Track the Costs for Ensuring Contingency and Disaster
Recovery Planning for Its Supercomputing Assets:
Although NNSA reported obligating approximately $1.7 billion from
fiscal years 2007 through 2009 to implement its ASC program activities at
the three weapons laboratories, the costs for ensuring the recovery of
its classified supercomputing operations are unknown. Under GAO's
Standards for Internal Control in the Federal Government, financial
information should be recorded and communicated to program managers
who need this information to make operational decisions and to
effectively allocate resources for program activities.
NNSA officials reported obligating approximately $390 million for
facility operations and user support activities, which include the
funds associated with contingency and disaster recovery planning
activities, but they were unable to provide detailed financial
information for contingency and disaster recovery planning activities.
According to NNSA officials, costs for contingency and disaster
recovery planning for classified supercomputing systems are unknown
because ASC program expenditures are part of the NNSA ASC operational
budget, whose costs are tracked at an aggregate level. As a result,
neither NNSA nor the three weapons laboratories can track what has
been spent since fiscal year 2007 for ensuring the recovery of
classified supercomputing operations and, consequently, they do not
know whether funding levels for these activities have been adequate.
Although certain components of contingency and disaster recovery
planning are in place at the three weapons laboratories, NNSA is
uncertain as to what funds were spent on these information protection
activities.
Furthermore, NNSA has not developed contingency planning and disaster
recovery cost estimates for its classified supercomputing assets. For
fiscal years 2011 through 2014, NNSA projects that it needs about $2.2
billion to implement its ASC activities, which support its Stockpile
Stewardship program--$984 million for weapons codes and models, $604
million for computational systems and software environment, and $588
million for facility operations and user support. Although NNSA has
developed its out-year funding needs over the next 4 years, it has not
developed estimates regarding the future costs for ensuring the
recovery of its classified supercomputing assets in the event of
service disruptions. Until NNSA develops a means for tracking current
contingency and disaster recovery costs and for developing estimates
of future costs, the agency will not have the information needed to
determine whether it is meeting its goals for effective stewardship of
public resources.
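The projected funding figures above can be checked with a short calculation. The Python sketch below sums the three components named in the paragraph and shows each component's share. The variable names are illustrative assumptions, and no separate line for contingency and disaster recovery planning is included because the report states that no such estimate exists.

```python
# Projected ASC needs for fiscal years 2011 through 2014, in millions of
# dollars, as listed in the paragraph above.
PROJECTED = {
    "weapons codes and models": 984,
    "computational systems and software environment": 604,
    "facility operations and user support": 588,
}

total = sum(PROJECTED.values())  # 2176
print(f"Total: ${total:,} million (about ${total / 1000:.1f} billion)")
for activity, amount in PROJECTED.items():
    print(f"{activity}: {amount / total:.0%}")
```

The three components sum to $2,176 million, which matches the "about $2.2 billion" cited in the text.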
Conclusions:
All three NNSA weapons laboratories have implemented some components
of a contingency planning and disaster recovery program. NNSA,
however, has not provided effective oversight to ensure that the
laboratories have comprehensive and effective contingency and disaster
recovery planning and testing. For example, BIAs that identify
critical resources and outage impacts have not been developed for all
classified supercomputing systems and existing contingency plans at
the laboratories have not been thoroughly tested. Although one
laboratory's analysis is not comprehensive and the other two
laboratories have not completed a BIA, NNSA and the laboratories
consider the consequence of loss of availability of the classified
supercomputers to be low impact, and do not consider them to be
mission critical. However, it is unclear how NNSA made this
determination given that (1) the analyses have not been completed; (2)
NNSA's mission includes maintaining the safety, security, and
effectiveness of the nuclear deterrent without nuclear testing; (3)
the classified supercomputing capabilities serve as the computational
surrogate to underground nuclear weapons testing and are central to
our national security; and (4) NNSA has obligated about $1.7 billion
over 3 fiscal years to support the Advanced Simulation and Computing
program, which includes classified supercomputing activities.
Beyond the activities undertaken by the laboratories, NNSA has not
developed a means for identifying, tracking, or re-prioritizing the
classified supercomputing workload across the operating environment.
In addition, the laboratories have not tested offsite recovery
capabilities and the agency has not tested the laboratories' ability
to share "on-demand" capacity if needed or determined the minimum
capacity needed to meet Stockpile Stewardship Program requirements,
particularly in the event that it may need to establish emergency
processing priorities. Further, although over a billion dollars have
been obligated to support the classified supercomputing capabilities
within the last 3 years, NNSA has not tracked the costs for ensuring
the recovery of the classified supercomputing systems, data, and
supporting resources should a service disruption occur. The classified
supercomputing program represents a significant investment, and
accountability for these systems is essential. Until NNSA clearly
defines its component organizations' roles and responsibilities and
fully implements an effective contingency and disaster recovery planning
program, it has limited assurance that, in the event of a service
disruption, vital information could be recovered and made available to
meet national security priorities.
Recommendations for Executive Action:
To improve the effectiveness of contingency and disaster recovery
planning for NNSA's classified supercomputing capabilities, we
recommend that the Administrator of NNSA direct the weapons
laboratories to take the following four actions, where not already
implemented:
* Develop business impact analyses that, among other things, (1)
identify and prioritize critical systems, data, and supporting
resources; (2) identify allowable outage times and impacts for
classified supercomputing capabilities; and (3) identify recovery
priorities and strategies.
* Develop and implement comprehensive contingency and disaster
recovery plans for all classified supercomputing systems that identify
how each weapons laboratory's classified supercomputing capabilities
will be recovered following service disruptions.
* Conduct contingency and disaster recovery plan testing.
* Test the three weapons laboratories' ability to share "on-demand"
classified supercomputing capacity to ensure this capability will work
in the event of unexpected service disruptions.
In addition, we recommend that the Administrator of NNSA take the
following five actions:
* Document an agencywide means for reprioritizing the workload across
NNSA's classified supercomputing systems should a disruption occur.
* Clearly define the oversight responsibilities of the NNSA ASC
program office and the NNSA Office of the Chief Information Officer,
as they relate to contingency and disaster recovery planning for
NNSA's classified supercomputing operations.
* Identify, assess, and communicate the minimum classified
supercomputing capacity needed to meet Stockpile Stewardship
requirements in the event of a service disruption.
* Develop, document, and implement a process that identifies and
tracks expenditures for contingency and disaster recovery planning for
NNSA's classified supercomputing assets.
* Develop and document the total anticipated costs for contingency and
disaster recovery planning of NNSA's classified supercomputing assets,
which includes the replacement costs for these assets.
Agency Comments and Our Evaluation:
In providing written comments (reprinted in appendix III) on a draft
of this report, NNSA's Associate Administrator for Management and
Administration agreed that improvements can be made in contingency and
disaster recovery planning for supercomputing operations. He indicated
that NNSA agreed with six of our nine recommendations and outlined the
agency's intent to conduct business impact analyses, develop and test
appropriate contingency and disaster recovery plans, document workload
prioritization, and clearly define roles and responsibilities.
However, NNSA did not agree with our recommendation related to
identifying the minimum capacity needed to meet Stockpile Stewardship
requirements in the event of a service disruption. The Associate
Administrator stated that this recommendation did not take into
account that the different types of supercomputers--capacity and
capability--serve different functions and are procured and managed
differently. In our report, we recognize that different types of
supercomputers exist and that they are used for different purposes,
process unique workloads, and operate independently. However, as
we point out in the report, although the weapons laboratories have
identified supercomputing processing needed for normal business
operations, they have not identified the minimum capacity needed to
achieve processing priorities in the event of a service disruption. We
believe that the recommendation appropriately focuses on meeting
NNSA's Stockpile Stewardship mission and that capacity planning is
essential to ensure that information processing and supporting
resources exist during contingency operations, regardless of the type
of system used. Although NNSA did not agree with the recommendation,
the Associate Administrator stated that the agency will conduct a BIA
and build appropriate contingency strategies for both types of
supercomputers, as well as enhance capacity sizing actions to account
for contingency and disaster recovery operations.
Further, NNSA did not agree with two recommendations related to
identifying and tracking expenditures for contingency and disaster
recovery planning and documenting anticipated recovery planning costs,
including replacement costs of the assets. The Associate Administrator
asserted that this information would not add significant value to
managing contingency and disaster recovery planning. However, we
believe such actions reflect good government practices and would add
value by providing NNSA program managers with useful expenditure and
cost information to aid decision making with regard to contingency
and disaster recovery planning. As our report points out, GAO's
Standards for Internal Control in the Federal Government states that
financial information should be recorded and communicated to program
managers to help them make operational decisions and effectively
allocate resources for program activities. Strong financial and
internal controls are a major part of managing any organization
because they help government program managers achieve desired results
through effective stewardship of public resources. Accordingly, we
believe our recommendations have merit.
We are sending copies of this report to the Secretary of Energy; the
Administrator of NNSA; and the Directors of Los Alamos, Livermore, and
Sandia laboratories. Copies of the report will also be available to
others at no charge on the GAO Web site at [hyperlink,
http://www.gao.gov].
If you or your staffs have any questions about this report, please
contact Gene Aloise at (202) 512-3841 or aloisee@gao.gov; Nabajyoti
Barkakati at (202) 512-6415 or barkakatin@gao.gov; or Gregory C.
Wilshusen at (202) 512-6244 or wilshuseng@gao.gov. Contact points for
our Offices of Congressional Relations and Public Affairs may be found
on the last page of this report. GAO staff who made major
contributions to this report are included in appendix IV.
Signed by:
Gene Aloise:
Director, Natural Resources and Environment:
Signed by:
Nabajyoti Barkakati:
Director, Center for Technology and Engineering:
Signed by:
Gregory C. Wilshusen:
Director, Information Security Issues:
[End of section]
Appendix I: Objectives, Scope, and Methodology:
The objectives of our review were to assess the extent to which (1)
the National Nuclear Security Administration (NNSA) has implemented
contingency and disaster recovery planning and testing for its
classified supercomputing assets, (2) the three laboratories are able
to share classified supercomputing capacity for recovery operations,
should service disruptions occur, and (3) NNSA tracks the costs for
ensuring contingency and disaster recovery planning for its classified
supercomputing assets. To address these objectives, we focused on
contingency and disaster recovery planning activities at NNSA
headquarters, as well as the operating environment for the 12
classified supercomputing systems at the three weapons laboratories--
Los Alamos National Laboratory, Lawrence Livermore National Laboratory, and
Sandia National Laboratories.
To assess the extent to which NNSA has implemented contingency and
disaster recovery planning and testing for its classified
supercomputing assets, we examined contingency and disaster recovery
planning controls for the systems within the classified supercomputing
environment that are critical to NNSA's achievement of its nuclear
weapons mission. We collected and reviewed policies, procedures, and
guidelines from the National Institute of Standards and Technology,
the Committee on National Security Systems, the Department of Energy,
and NNSA. We also reviewed contingency plans and business impact
analyses provided by the weapons laboratories and compared them to
federal guidelines. We interviewed NNSA and laboratory officials to
determine whether they had documented critical system, data, and
supporting resources and whether contingency plans had been tested.
Further, we interviewed NNSA officials to determine to what extent
they have provided specific guidance and oversight for the
laboratories to ensure that contingency and disaster recovery planning
requirements are being met.
To determine the extent to which the three weapons laboratories have
the ability to share supercomputing capacity for backup and recovery
operations, we visited each weapons laboratory and gained an
understanding of the overall classified supercomputing infrastructure
and identified interconnectivity and control points. We performed
technical assessments of supercomputing capabilities at each weapons
laboratory, including each laboratory's ability to share
supercomputing capacity under normal operating conditions. We reviewed
the weapons laboratories' efforts to determine the minimal
supercomputing capacity needed to meet NNSA Stockpile Stewardship
Program requirements along with the ability of the weapons
laboratories to share supercomputing capacity on an "on-demand" basis,
including the use of advanced architecture systems. In addition, we
obtained documents describing the supercomputing system environment as
well as capacity information, along with the views of officials from
NNSA and the three weapons laboratories.
To assess the extent to which NNSA tracks costs for ensuring
contingency and disaster recovery planning for classified
supercomputing assets, we interviewed NNSA and weapons laboratory
officials to determine how expenditures were tracked for contingency
and disaster recovery planning of the classified supercomputing
systems at each of the laboratories. We also requested the amount of
funds NNSA obligated to the three weapons laboratories, and the amount
of funds the laboratories spent, in implementing NNSA's classified
supercomputing capabilities from fiscal years 2007 through 2009.
Further, we interviewed NNSA and laboratory officials to determine how
they projected future cost estimates for ensuring the recovery of
these assets for fiscal years 2011 through 2014. To assess the
reliability of data provided, we reviewed (1) the fiscal year 2009
financial statement audit for the system and (2) responses NNSA
provided to questions about processes and procedures for ensuring the
accuracy and completeness of data. Based on this information, we
determined the data are sufficiently reliable for the purposes of this
report.
We conducted this performance audit from December 2009 through
December 2010 in accordance with generally accepted government
auditing standards. Those standards require that we plan and perform
the audit to obtain sufficient, appropriate evidence to provide a
reasonable basis for our findings and conclusions based on our audit
objectives. We believe that the evidence obtained provides a
reasonable basis for our findings and conclusions based on our audit
objectives.
[End of section]
Appendix II: NNSA Annual Obligations for Its Advanced Simulation and
Computing Program, Fiscal Years 2007 through 2009:
NNSA reported obligating approximately $1.7 billion from fiscal years
2007 through 2009 to support Advanced Simulation and Computing (ASC)
program activities at the three weapons laboratories. The $1.7 billion
was used mainly for three efforts:
Weapons codes and models. This effort is intended to develop and
improve weapons simulation codes and models for predicting the
behavior of weapons systems and devices in the nuclear stockpile.
Computational systems and software environment. This effort is
intended to provide ASC users a stable, seamless computing environment
for ASC-deployed platforms. It is responsible for procuring,
delivering, and deploying ASC computational systems and user
environments via technology development and integration across the
three weapons laboratories.
Facility operations and user support. This effort is intended to
provide both the necessary physical facility and operational support
for reliable supercomputing and storage environments, as well as a
suite of user services for effective use of the three weapons
laboratories' computing resources. Facility operations cover physical
space, power and other utility infrastructure, and local- and wide-area
networking, as well as system administration, cyber security, and
operations services for ongoing support. The user support function
includes planning, development, integration and deployment, continuing
product support, and quality and reliability activity collaborations.
Figure 4 depicts NNSA's annual obligations for each of the three
efforts from fiscal years 2007 through 2009.
Figure 4: Annual Obligations for NNSA's Advanced Simulation and
Computing Program, Fiscal Years 2007 through 2009:
[Refer to PDF for image: stacked vertical bar graph]
Fiscal year: 2007;
Weapons codes and models: $280 million;
Computational systems and software environment: $183 million;
Facility operations and user support: $128 million;
Total: $591 million.
Fiscal year: 2008;
Weapons codes and models: $249 million;
Computational systems and software environment: $183 million;
Facility operations and user support: $114 million;
Total: $546 million.
Fiscal year: 2009;
Weapons codes and models: $221 million;
Computational systems and software environment: $161 million;
Facility operations and user support: $147 million;
Total: $529 million.
Source: GAO analysis of data provided by NNSA.
[End of figure]
As shown in figure 4, NNSA annual obligations for its classified
supercomputing operations decreased from about $591 million to $529
million between fiscal years 2007 and 2009. The largest obligation for
the classified supercomputing program was for weapons codes and
models, which accounted for approximately $750 million (or 45 percent)
of total obligations.
Obligations for computational systems and software environment
accounted for approximately $527 million (or 32 percent) of total
obligations. For the period, obligations for this effort decreased
from $183 million to $161 million.
The facility operations and user support activities, which include,
among other things, expenditures for contingency and disaster recovery
planning, accounted for $390 million (or 23 percent) of total
obligations over the period. These obligations ranged from $114
million to $147 million for the 3 fiscal years.
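
The totals and percentage shares cited above can be reproduced from the
rounded values shown in figure 4. The following is a minimal sketch in
Python, offered only as an illustration and not as part of GAO's or
NNSA's analysis; the names OBLIGATIONS, yearly_totals, and effort_shares
are assumptions introduced here. Because the inputs are the rounded
figure values, the facility operations and user support total computes
to $389 million rather than the approximately $390 million cited in the
text, a difference that is presumably due to rounding.

```python
# Obligations in millions of dollars, transcribed from figure 4.
# Values are already rounded, so sums may differ slightly from the text.
OBLIGATIONS = {
    "Weapons codes and models": {2007: 280, 2008: 249, 2009: 221},
    "Computational systems and software environment": {2007: 183, 2008: 183, 2009: 161},
    "Facility operations and user support": {2007: 128, 2008: 114, 2009: 147},
}


def yearly_totals(obligations):
    """Return total obligations per fiscal year across all efforts."""
    totals = {}
    for by_year in obligations.values():
        for year, amount in by_year.items():
            totals[year] = totals.get(year, 0) + amount
    return totals


def effort_shares(obligations):
    """Return (three-year total, percent of grand total) for each effort."""
    effort_totals = {name: sum(by_year.values()) for name, by_year in obligations.items()}
    grand_total = sum(effort_totals.values())
    return {
        name: (total, round(100 * total / grand_total))
        for name, total in effort_totals.items()
    }


if __name__ == "__main__":
    for year, total in sorted(yearly_totals(OBLIGATIONS).items()):
        print(f"Fiscal year {year}: ${total} million")
    for name, (total, percent) in effort_shares(OBLIGATIONS).items():
        print(f"{name}: ${total} million ({percent} percent)")
```

Run as a script, the sketch prints the yearly totals and each effort's
3-year total and share, which can be compared line by line against the
amounts reported in this appendix.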
[End of section]
Appendix III: Comments from the National Nuclear Security
Administration:
Department of Energy:
National Nuclear Security Administration:
Washington, DC 20585:
November 23, 2010:
Mr. Gregory C. Wilshusen:
Director:
Information Security Issues:
U.S. Government Accountability Office:
441 G Street, NW:
Washington, DC 20548:
Dear Mr. Wilshusen:
The National Nuclear Security Administration (NNSA) appreciates the
opportunity to review the Government Accountability Office's (GAO)
draft report, GAO-11-67, Information Security: National Nuclear
Security Administration Needs to Improve Contingency Planning for Its
Classified Supercomputing Operations. I understand the House Committee
on Energy and Commerce requested GAO to assess various aspects of
NNSA's Continuity of Operations Program to ensure that, in case of
service disruptions, the three weapons laboratories can maintain the
computer simulation capabilities needed to meet nuclear weapons
assessment and certification requirements. Specifically, GAO
identified, (1) whether NNSA has Continuity of Operations planning and
testing procedures in place across the classified supercomputing
environment of the three weapons laboratories; (2) whether the weapons
laboratories are able to share capacity for backup and recovery
operations; and (3) the past, present and future resources needed to
maintain supercomputing capabilities.
We are pleased that GAO recognizes the importance of the simulation
capabilities of NNSA's supercomputers to address stockpile stewardship
and other national security matters. While the draft report implies,
without explicitly stating, that the timeframe for reconstitution of
supercomputing assets should be similar to that required for
Continuity of Operations for national command and control, major
financial systems, and health and emergency services, that time
urgency is not consistent with existing policies for the recovery of
research and development capabilities, nor should it be. Nevertheless,
we agree that improvements can be made in contingency and disaster
recovery planning for supercomputing operations.
After careful review, additional time is required to establish further
plans and schedule solutions for addressing the GAO's recommendations.
NNSA will provide a more detailed response to the recommendations when
the final report is issued. However, we are providing a summary of
responses to the recommendations presented in the draft report.
Recommendation 1: Develop business impact analyses that, among other
things, (1) identify and prioritize critical systems, data, and
supporting resources, (2) identify allowable outage times and impacts
for classified supercomputing capabilities, and (3) identify recovery
priorities and strategies.
Concur: NNSA will leverage current Business Impact Analysis (BIA)
activities underway at Lawrence Livermore National Laboratory, Los
Alamos National Laboratory, and Sandia National Laboratories and
perform a national level BIA to provide a consistent assessment across
the laboratories for classified supercomputing.
Recommendation 2: Develop and implement comprehensive contingency and
disaster recovery plans for all classified supercomputing systems that
identify how each weapons laboratory's classified supercomputing
capabilities will be recovered following service disruptions.
Concur: NNSA will develop appropriate plans based on the assessment
results of the BIA performed per the GAO's first recommendation.
Recommendation 3: Conduct contingency plan testing.
Concur: NNSA will conduct contingency plan testing according to
contingency and disaster recovery plans to be implemented based on the
assessment results of the BIA performed per the GAO's first
recommendation.
Recommendation 4: Test the three weapons laboratories' ability to
share classified supercomputing capacity to ensure this capability
will work in the event of unexpected service disruptions.
Concur: NNSA will test the three weapons laboratories' ability to
share classified capacity supercomputers according to contingency and
disaster recovery plans to be implemented based on the assessment
results of the BIA.
Recommendation 5: Document an agency-wide means for reprioritizing the
workload across NNSA's classified supercomputing systems should a
disruption occur.
Concur: The NNSA will adapt, apply and exercise procedures that are
routinely being used for prioritizing workload in capability computing
campaigns for use in contingencies and disasters.
Recommendation 6: Clearly define the oversight responsibilities of the
NNSA ASC program office and the NNSA Office of the Chief Information
Officer, as they relate to contingency and disaster recovery planning
for NNSA's classified supercomputing operations.
Concur: In general, the NNSA Office of the Chief Information Officer
provides policy and guidance and the Office of Advanced Simulation and
Computing (ASC) has responsibilities for execution. Oversight
responsibilities will be clearly defined through the BIA and
development and implementation of the contingency and disaster
recovery plans.
Recommendation 7: Identify, assess, and communicate the minimum
classified supercomputing capacity needed to meet Stockpile
Stewardship requirements in the event of a service disruption.
Nonconcur: This recommendation does not take into account that
capacity and capability systems serve different functions under
different cost of ownership models, and consequently are procured and
managed differently. The ASC program has deployed its supercomputing
assets to mitigate single site failures. For the future, the program
will enhance capacity sizing actions to account for contingency and
disaster recovery operations when planning host sites for capacity
computing capabilities. We will conduct a BIA assessment recognizing
the differences and build appropriate contingency strategies for both
classes of computing.
Recommendation 8: Develop, document, and implement a process that
identifies and tracks expenditures for contingency and disaster
recovery planning for NNSA's classified supercomputing assets.
Nonconcur: Almost all classified supercomputing contingency and
disaster recovery planning leverages computing resources and
activities funded as part of a production simulation environment for
weapons designers and engineers. These expenses are integral to ASC's
Facility Operations and User Support Program element and tracking them
separately would not add significant value to managing contingency and
disaster recovery.
Recommendation 9: Develop and document the total anticipated costs for
contingency and disaster recovery planning of NNSA's classified
supercomputing assets, which includes the replacement costs for these
assets.
Nonconcur: As stated in the response to Recommendation 8, these
expenses are integral to ASC's Facility Operations and User Support
program element and tracking them separately would not add significant
value to managing contingency and disaster recovery.
If you have any questions related to this response, please contact
JoAnne Parker, Director, Office of Internal Controls, at 202-586-1913.
Sincerely,
Signed by:
Gerald L. Talbot, Jr.
Associate Administrator for Management and Administration:
cc: Acting Chief Information Officer:
Deputy Administrator for Defense Programs:
[End of section]
Appendix IV: GAO Contacts and Staff Acknowledgments:
GAO Contacts:
Gene Aloise, (202) 512-3841 or aloisee@gao.gov
Nabajyoti Barkakati, (202) 512-6415 or barkakatin@gao.gov
Gregory C. Wilshusen, (202) 512-6244 or wilshuseng@gao.gov
Staff Acknowledgments:
In addition to the individuals named above, Glen Levis, Edward M.
Glagola, Jr., and Jeffrey Knott (Assistant Directors); and Preston S.
Heard, Jennifer R. Franks, Kevin Metcalfe, and Zsaroq Powe were key
contributors to this report. Neil Doherty, Nancy Glover, Franklin
Jackson, and Jonathan Kucskar also made key contributions to this
report.
[End of section]
Footnotes:
[1] NNSA was established in 2000 as a separately organized agency
within the Department of Energy (DOE) and is responsible for the
nation's nuclear weapons, nonproliferation, and naval reactors
programs.
[2] For nearly half a century, the United States' nuclear program was
spearheaded by underground nuclear testing and never had to rely on
weapon systems that had exceeded their design life times. The United
States last produced a nuclear weapon in 1991 and performed its last
underground nuclear test in 1992.
[3] The National Defense Authorization Act for Fiscal Year 1994, Pub.
L. No. 103-160, § 3138 (1993), directed DOE to establish the Stockpile
Stewardship Program. In the absence of underground nuclear testing,
the program encompasses a broad range of activities to increase
understanding of the basic phenomena associated with nuclear weapons,
provide better predictive understanding of the safety and reliability
of weapons, and ensure a strong scientific and technical basis for
future nuclear weapons policy objectives. The Stockpile Stewardship
Program is carried out through the nuclear weapons complex, which
includes three nuclear weapons laboratories.
[4] Continuity of operations focuses on restoring an organization's
mission-essential functions at an alternate site and performing those
functions for a short period of time before returning to normal
operations. Contingency and disaster recovery planning include a broad
scope of activities designed to sustain and recover critical
information and information system services for a range of potential
service disruptions. Contingency and disaster recovery planning
components may include the relocation of information systems and
operations to an alternate site, recovery of information system
functions using alternate equipment, or the performance of information
system functions using alternative methods. For the purposes of this
report, the term contingency and disaster recovery planning refers to
the interim measures NNSA should use to recover information system
services after an unexpected service disruption.
[5] Los Alamos is managed and operated by Los Alamos National
Security, LLC, which is a consortium of contractors that includes
Bechtel National, the University of California, the Babcock and Wilcox
Company, and the Washington Division of URS. Livermore is managed and
operated by Lawrence Livermore National Security, LLC, which is
comprised of a corporate management team that includes Bechtel
National, the University of California, the Babcock and Wilcox
Company, and the Washington Division of URS. Sandia is managed and
operated by Sandia Corporation, a wholly owned subsidiary of Lockheed
Martin Corporation.
[6] FLOPS (floating-point operations per second) is a measure of a
supercomputing system's performance. Floating-point performance is the
rate at which a computer executes floating-point operations.
[7] For additional information regarding budgetary information for the
classified supercomputing program from fiscal years 2007 through 2009,
see appendix II.
[8] 44 U.S.C. § 3544(b); FISMA was enacted as title III, E-Government
Act of 2002, Pub. L. No. 107-347, 116 Stat. 2899, 2946 (Dec. 17, 2002).
[9] For the purposes of this report, we will refer to "continuity of
operations procedures for information systems" as contingency and
disaster recovery planning.
[10] NIST Special Publication 800-34, Contingency Planning Guide for
Information Technology Systems (Washington, D.C.: June 2002) and NIST
Special Publication 800-53 Revision 3, Recommended Security Controls
for Federal Information Systems and Organizations (Gaithersburg, Md.:
August 2009).
[11] Formerly known as the National Security Telecommunications and
Information Systems Security Committee, CNSS provides a forum for the
discussion of policy issues, sets national policy, and provides
direction, operational procedures, and guidance for the security of
national security systems. DOD chairs the committee under the
authorities established by National Security Directive 42, National
Policy for the Security of National Security Telecommunications and
Information Systems, issued in July 1990. This directive designates
the Secretary of Defense and the Director of the National Security
Agency as the Executive Agent and National Manager, respectively. The
committee has 21 voting representatives from various departments and
agencies, including the Department of Energy.
[12] National security systems include any information system used or
operated by an agency, or by a contractor of an agency, that
processes, stores, or transmits national security information. They do
not include those systems used for routine administrative and business
applications.
[13] CNSS Instruction 1253 provides federal government departments,
agencies, bureaus, and offices with a process for security
categorization of national security systems that collect, generate,
process, store, display, transmit, or receive national security
information. In addition, this instruction serves as a companion
document to NIST Special Publication 800-53, Revision 3.
[14] Although NIST guidelines note they shall not apply to national
security systems without the express approval of appropriate federal
officials exercising policy authority over such systems, CNSS
instructions, as well as DOE and NNSA policies for national security
systems, refer to the NIST guidelines as being applicable.
[15] A contingency plan is designed to maintain or restore business
operations, including computer operations, possibly at an alternate
location in the event of emergencies, system failures, or disaster. A
disaster recovery plan is a written plan for processing critical
applications in the event of a major hardware or software failure or
destruction of facilities.
[16] Outage impacts and allowable outage times enable the organization
to develop and prioritize recovery strategies that personnel will
implement during contingency plan activation. The effects of the
outage may be tracked over time, which will enable the organization to
identify the maximum allowable time that a resource may be unavailable
before it inhibits the performance of an essential function. The
effects of the outage can also be tracked across related resources,
identifying any cascading effects that may occur as an effect of a
service disruption.
[17] The Department of Energy defines "mission critical" as an
information system that supports an organization's core missions and
goals, and "mission-essential (or business essential)" as an
information system whose failure would not preclude organizations from
accomplishing core business functions in the long term.
[18] The Capability Computing Campaign includes a committee made up of
staff from the NNSA ASC program office, as well as ASC executives
located at the laboratories at Los Alamos, Livermore, and Sandia.
[19] Total usable supercomputing capacity includes the supercomputers
that have the ability to run all weapons program codes and could be
used in the event of a service disruption, and includes capacity and
capability systems.
[20] The term "on demand" is defined as the ability to move an
application (simulation program/code) from one supercomputer to a
different supercomputer at a different physical facility and use the
existing computational resources without the need for major
modifications.
[End of section]
GAO's Mission:
The Government Accountability Office, the audit, evaluation and
investigative arm of Congress, exists to support Congress in meeting
its constitutional responsibilities and to help improve the performance
and accountability of the federal government for the American people.
GAO examines the use of public funds; evaluates federal programs and
policies; and provides analyses, recommendations, and other assistance
to help Congress make informed oversight, policy, and funding
decisions. GAO's commitment to good government is reflected in its core
values of accountability, integrity, and reliability.
Obtaining Copies of GAO Reports and Testimony:
The fastest and easiest way to obtain copies of GAO documents at no
cost is through GAO's Web site [hyperlink, http://www.gao.gov]. Each
weekday, GAO posts newly released reports, testimony, and
correspondence on its Web site. To have GAO e-mail you a list of newly
posted products every afternoon, go to [hyperlink, http://www.gao.gov]
and select "E-mail Updates."
Order by Phone:
The price of each GAO publication reflects GAO's actual cost of
production and distribution and depends on the number of pages in the
publication and whether the publication is printed in color or black and
white. Pricing and ordering information is posted on GAO's Web site,
[hyperlink, http://www.gao.gov/ordering.htm].
Place orders by calling (202) 512-6000, toll free (866) 801-7077, or
TDD (202) 512-2537.
Orders may be paid for using American Express, Discover Card,
MasterCard, Visa, check, or money order. Call for additional
information.
To Report Fraud, Waste, and Abuse in Federal Programs:
Contact:
Web site: [hyperlink, http://www.gao.gov/fraudnet/fraudnet.htm]:
E-mail: fraudnet@gao.gov:
Automated answering system: (800) 424-5454 or (202) 512-7470:
Congressional Relations:
Ralph Dawn, Managing Director, dawnr@gao.gov:
(202) 512-4400:
U.S. Government Accountability Office:
441 G Street NW, Room 7125:
Washington, D.C. 20548:
Public Affairs:
Chuck Young, Managing Director, youngc1@gao.gov:
(202) 512-4800:
U.S. Government Accountability Office:
441 G Street NW, Room 7149:
Washington, D.C. 20548: