
Security in High-Performance Computing Environments

Exploring the many distinctive elements that make securing HPC systems much different than securing traditional systems.

How is computer security different in a high-performance computing (HPC) context from a typical IT context? On the surface, a tongue-in-cheek answer might be, “just the same, only faster.” After all, HPC facilities are connected to networks the same way any other computer is, often run the same (typically Linux-based) operating systems as many other computers, and have long been subject to many of the same styles of attacks, be they compromised credentials, system misconfiguration, or software flaws. Such attacks have ranged from the “wily hacker” who broke into U.S. Department of Energy (DOE) and U.S. Department of Defense (DOD) computing systems in the mid-1980s,42 to the “Stakkato” attacks against NCAR, DOE, and NSF-funded supercomputing centers in the mid-2000s,24,39 to the thousands of probes, scans, brute-force login attempts, and buffer overflow exploits that continue to plague high-performance computing facilities today.


Key Insights

  • High-performance computing systems have some similarities and some differences with traditional IT computing systems, which present both challenges and opportunities.
  • One challenge is that HPC systems are “high-performance” by definition, and so many traditional security techniques are not effective, either because they cannot keep up with the system or because they would reduce its performance.
  • Many opportunities also exist: HPC systems tend to be used for very distinctive purposes, have much more regular and predictable activity, and contain highly custom hardware/software stacks. Each of these elements can provide a toehold for leveraging some aspect of the HPC platform to improve security.

On the other hand, some HPC systems run highly exotic hardware and software stacks. In addition, HPC systems have very different purposes and modes of use than most general-purpose computing systems, of either the desktop or server variety. This means that aside from all of the normal reasons that any network-connected computer might be attacked, HPC computers have their own distinct systems, resources, and assets that an attacker might target, as well as distinctive attributes that make securing them somewhat different from securing other types of computing systems.

The fact that computer security is context- and mission-dependent should not be surprising to security professionals—“security policy is a statement of what is, and what is not, allowed”7—and each organization will therefore have a somewhat distinctive security policy. For example, a mechanism designed to enforce a policy considered essential for security by one site might be considered a denial of service to legitimate users of another site, just as the way a smartphone is protected differs from the way a desktop computer is protected. Thus, for HPC systems, we must ask what the desired functioning of the system is so that we can establish what the security policies are and better understand the mechanisms with which those policies can be enforced.

Historically, however, security for HPC systems has not necessarily been treated as distinct from general-purpose computing, except, typically, to make sure that security does not get in the way of performance or usability. While that goal is laudable, this article argues that this assessment of HPC’s distinctiveness is incomplete.

This article focuses on four key themes surrounding this issue:

The first theme is that HPC systems are optimized for high performance by definition. Further, they tend to be used for very distinctive purposes, notably mathematical computations.

The second theme is that HPC systems tend to have very distinctive modes of operation. For example, compute nodes in an HPC system may be accessed exclusively through some kind of scheduling system on a login node in which it is typical for a single program or common set of programs to run in sequence. And, even on that login node, from which the computation is submitted to the scheduler, it may be the case that an extremely narrow range of programs exist compared to those commonly found on general-use computing systems.

The third theme is that while some HPC systems use standard operating systems, others use highly exotic stacks. Even those that use standard operating systems very often have custom aspects to their software stacks, particularly at the I/O and network driver levels, and also at the application layer. And, of course, while the systems may use commodity CPUs, the CPUs and other hardware components are often integrated in HPC systems in a way (for example, by Cray or IBM) that may well exist nowhere else in the world.

The fourth theme, which follows from the first three themes, is that HPC systems tend to have a much more regular and predictable mode of operation, which changes the way security can be enforced.

As a final aside, many, though by no means all, HPC systems are extremely open from a security standpoint, and may be used by scientists worldwide whose identities have never been validated. Increasingly, we are also starting to see HPC systems in which computation and visualization are more tightly coupled and a human manipulates the inputs to the computation itself in near-real time.

This distinctiveness presents both opportunities and challenges. This article discusses the basis for these themes and the conclusions for security for these systems.

Scope and threat model. I have spent most of my career in or near “open science”: National Science Foundation and Department of Energy Office of Science-funded high-performance computing centers, and so the lens through which this article is written tends to focus on such environments. The challenges in “closed” environments, such as those used by the National Security Agency (NSA), Department of Defense (DOD), or National Nuclear Security Administration (NNSA) National Labs, or by commercial industry, share some, but not all, of the attributes discussed in this article. As a result, although I discuss confidentiality, a typical component of the “C-I-A” triad, because even in open science data leakage is certainly an issue and a threat, this article focuses more on integrity-related threats,31,32 including alteration of code or data, or misuse of computing cycles, and availability-related threats, including disruption or denial of service against HPC systems or the networks that connect them.

Computations that are incorrect for non-malicious reasons, including flaws in application code (general logic errors, round-off errors, non-determinism in parallel algorithms, unit conversion errors20) as well as incorrect assumptions by users about the hardware they are running on, are vital issues, but they are beyond the scope of this article, both for reasons of length and because those issues are well covered elsewhere.4,5,6,8,36


High-Performance Computing Environments

Distinctive purposes. The first theme of the distinctiveness of security for HPC systems is that these systems are high-performance by definition, and are made that way for a reason. They are typically used for automated computation of some kind, typically performing some set of mathematical operations. Historically, this has often been for the purpose of modeling and simulation, and increasingly today, for data analysis as well. Given that the primary purpose of HPC systems is high performance, that such systems are few in number, and that computing time on them is therefore quite valuable, there is a reluctance by the major stakeholders—the funding agencies that support HPC systems as well as the users who run computations on them—to agree to any solution that might impose overhead on the system. Those stakeholders might well regard such a solution as a waste of cycles at worst, and an unacceptable delay of scientific results at best. This is an important detail, because it frames the types of security solutions that, at least historically, might have been considered acceptable to use.

Distinctive modes of operation. The second theme of the distinctiveness of security for HPC systems is that these systems tend to have distinctive modes of operation. The typical mode of operation for using a scientific high-performance machine involves connecting through a login node of some kind. In parallel, at least for data analysis tasks, data that a user wishes to analyze may be copied to the machine via a data transfer node or DTN, and software that a user wishes to install may be copied to the login node as well.

The user is then likely to edit some configuration files, compile their software, and write a “batch script” that defines what programs should be run, along with parameters for how those programs should be run. Most significant jobs are not run on the login nodes themselves, because the login nodes have very limited resources. Instead, jobs run on compute nodes, which cannot be logged into directly; a batch scheduler determines when jobs should run by analyzing the batch scripts that have been submitted, according to the optimization policy of the site in question. Thus, after writing their batch script, the user will probably submit their job to a batch queue using a submission program, and then log out and wait for the job to run on the compute nodes.
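
To make the submission model concrete, the following is a minimal sketch, written in Python, that composes a batch script and hands it to the scheduler. It assumes a Slurm-style scheduler with the standard sbatch command; the job name, node count, time limit, and application ("my_sim") are illustrative placeholders rather than details from any particular facility.

```python
# Minimal sketch, assuming a Slurm-style scheduler; all job parameters and
# the application name ("my_sim") are illustrative placeholders.
import subprocess
import tempfile

BATCH_SCRIPT = """#!/bin/bash
#SBATCH --job-name=my_sim     # illustrative job name
#SBATCH --nodes=4             # number of compute nodes requested
#SBATCH --time=00:30:00       # wall-clock limit for the job
srun ./my_sim input.cfg       # launch the (hypothetical) parallel application
"""

def submit(script_text: str) -> None:
    """Write the batch script to a file and submit it to the batch queue."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script_text)
        path = f.name
    # The scheduler, not the user, decides when and where the job actually runs.
    subprocess.run(["sbatch", path], check=True)

if __name__ == "__main__":
    submit(BATCH_SCRIPT)
```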

Following that, the user may run some kind of additional analysis or visualization on the data that was output. This may happen on the HPC system, or the output of the HPC computation may be downloaded to a non-HPC system for analysis in a separate environment such as using Jupyter/IPython.33 This additional analysis or visualization might happen serially, following the completed execution on the HPC system, or, alternatively, may happen in an interactive, tightly-coupled fashion such that the user visualizing the output of the computation can manipulate the computation as it is taking place.37,45 It should be noted that the “coupled” computation/analysis model could involve network connections external to the HPC facility, or, and particularly as envisioned by the “superfacility” model for data-intensive science,50 may involve highly specialized and optimized network connections within a single HPC center. Examples of all three workflows are shown in Figure 1.

Figure 1. Three typical high-level workflow diagrams of scientific computing. The diagram at top shows a typical workflow for data analysis in HPC; the middle diagram shows a typical workflow for modeling and simulation; and the bottom diagram shows a coupled, interactive compute-visualization workflow.

These use cases are often in stark contrast to the plethora of software that is typically run on a general-purpose desktop system, such as Web browsers, email clients, Microsoft Office, iTunes, Adobe Acrobat, personal task managers, Skype, and instant messaging. And, importantly, an HPC system typically runs a much smaller set of programs, with a much more regular sequence of events in which the use of one program directly follows from another, rather than the constant attention-span-driven context switching typical of general-purpose computers. For example, on the NERSC HPC systems in 2014, across more than 5,950 active users, just 13 applications comprised 50% of the cycles consumed, 25 applications comprised 66% of the cycles, and 50 applications comprised 80% of the cycles.2 The consequences of these distinctive workflows are important, as we will discuss.


Custom operating system stacks. The third theme of the distinctiveness of security for HPC systems is that these systems often have highly exotic stacks. Current HPC environments represent a spectrum of hardware and software components, ranging from exotic and highly custom to fairly commodity.

As an example, “Cori Phase 1,”a the newest supercomputer at NERSC, is a Cray XC based on Intel Haswell processors, leveraging Cray Aries interconnects, a Lustre file system, and nonvolatile memory express (NVMe) in the burst buffer that is user accessible. Cori runs a full SUSE Linux distribution on the login nodes and, on the compute nodes, Compute Node Linux (CNL),44 a lightweight version of the Linux kernel and run-time environment based on the SUSE Linux Enterprise Server distribution.

Mira,b at the Argonne Leadership Computing Facility, is a hybrid system. The login nodes are IBM POWER7-based systems. The compute nodes form an IBM Blue Gene/Q system based on PowerPC A2 processors, IBM’s 5D torus interconnect, and a similarly elaborate memory structure. The I/O nodes also use PowerPC A2 processors and are connected using Mellanox InfiniBand QDR switches. The login nodes run Red Hat Linux. The compute nodes run the Compute Node Kernel (CNK),1 a Linux-like OS for compute nodes that supports neither multitasking nor virtual memory27 (CNK has no relationship with CNL). The I/O system runs the GPFS file system client.

Aurora,c the system scheduled to be installed at ALCF in 2019, will be constructed by a partnership between Cray and Intel and will run third-generation Intel Xeon Phi processors with second-generation Intel Omni-Path photonic interconnects and a variety of flash memory and NVRAM components to accelerate I/O, including 3D XPoint and 3D NAND in multiple locations, all user accessible. Aurora will run Cray Linux,10 a full Linux stack, on its login nodes and I/O nodes (though the I/O nodes do not allow general user access), and mOS46 on its compute nodes. mOS supports both a lightweight kernel (LWK) and a full Linux operating system, enabling users to choose between avoiding unexpected operating system overhead and the flexibility of a full Linux stack.

Summit,d the system scheduled to be installed at OLCF in 2018, will be based on both IBM POWER9 CPUs and NVIDIA Volta GPUs, with NVIDIA NV-Link on-node networks and dual-rail Mellanox interconnects.

In short, there is certainly some variation in exactly which operating systems are run. In all cases, login nodes run “full” operating systems. In some cases, full operating systems are also used for compute nodes; in other cases, lighter-weight but Linux API-compatible versions of operating systems are used; and in still other cases, entirely custom operating systems are used that are single-user only and contain no virtual memory capabilities or multitasking.

At least for the full operating systems, it is reasonable to assume that they contain similar or identical capabilities and bugs as standard desktop and server versions of Linux, and that they are just as vulnerable to attack via the various pieces of software (libraries, runtimes, and applications) running on the system.


Custom hardware and software components may have both positives and negatives. On one hand, they may receive less scrutiny and assurance than more common stacks. On the other hand, some custom stacks may be smaller, more easily verified, and less complex.

Openness. Our final theme is the relative “openness” of at least some HPC systems. That is, scientists from all over the world whose identities have never been validated may use them. For example, many such systems, such as those used by NSF or DOE ASCR, have no traditional firewalls between the data transfer nodes and the Internet, let alone the ability to “air gap” the HPC system (that is, ensure no physical connection to the regular Internet is possible) as some communities are able to do.


Security Mechanisms and Solutions that Overcome the Constraints of HPC Environments

Traditional IT security solutions, including network and host-based intrusion detection, access controls, and software verification, work about as well in HPC as in traditional IT (often not very well), or worse, due to the constraints of HPC environments.

For example, traditional host-based security mechanisms, such as those leveraging system call data via auditd, as well as certain types of network security mechanisms, such as network firewalls and firewalls doing deep packet inspection, may be antithetical to the needs of the system being protected. It has been shown that even 0.0046% packet loss (1 out of 22,000 packets) can cause a loss in throughput of network data transfers of approximately 90%.13 Given that stateful and/or deep-packet-inspecting firewalls can cause delays that might lead to such loss, a firewall, as traditionally defined, is inappropriate for use in environments with high network data throughput requirements.
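
To see why such small loss rates matter so much, consider the widely used Mathis et al. model for steady-state TCP throughput, which scales as MSS/(RTT·√p) for loss probability p. The back-of-the-envelope sketch below is illustrative only: the segment size, round-trip time, and 10Gbps link speed are assumptions, not figures from the cited study.

```python
# Back-of-the-envelope sketch of the Mathis et al. steady-state TCP model:
#   throughput <= (MSS / RTT) * C / sqrt(p), with C ~= sqrt(3/2).
# The MSS, RTT, and 10 Gbps link capacity below are illustrative assumptions.
from math import sqrt

MSS_BITS = 1460 * 8        # typical Ethernet maximum segment size, in bits
RTT_S = 0.05               # assume a 50 ms round-trip time (long-haul path)
LOSS_RATE = 1 / 22000      # ~0.0046% packet loss, the figure cited above
LINK_BPS = 10e9            # assume a 10 Gbps science network path

C = sqrt(3 / 2)
throughput_bps = (MSS_BITS / RTT_S) * C / sqrt(LOSS_RATE)

print(f"achievable single-stream throughput: {throughput_bps / 1e6:.0f} Mbps")
print(f"fraction of the assumed 10 Gbps path: {throughput_bps / LINK_BPS:.1%}")
```

Under these assumptions, a single TCP stream is limited to a few tens of megabits per second, a small fraction of the path capacity, which is consistent with the magnitude of the loss-induced slowdown described above.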

Thus, alternative approaches must be applied. Some solutions exist that can help compensate for these constraints.

The Science DMZ13 security framework defines a set of security policies, procedures, and mechanisms to address the distinct needs of scientific environments with high network throughput needs (HPC security theme #1). While the needs of high throughput networks do not eliminate options for security monitoring or mitigation, those requirements do change what is possible.

In particular, in the Science DMZ framework, the scientific computing systems are moved to their own enclave, away from other types of computing systems that might have their own distinctive security needs and perhaps even distinct regulations—for example, financial, human resources, and other business computing systems. In addition, it directs transfers through a single network ingress and egress point that can be monitored and restricted.

However, the Science DMZ does not use “deep packet inspecting” or stateful firewalls. It does leverage packet-filtering firewalls, that is, firewalls that examine only attributes of packet headers and not packet payloads. Separately, it also performs deep packet inspection and stateful intrusion detection, such as might be done with the Bro Network Security Monitor.28 However, the two processes are not directly coupled: unlike a firewall, the IDS is not placed in-line with the network traffic, and as a result inspection imposes no delays on the transmission of the traffic, and thus does not create the congestion that might lead to packet loss and retransmission.
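
The division of labor can be illustrated with a small sketch. The rule set, header fields, and "suspicious" payload marker below are hypothetical, not an actual Science DMZ or Bro configuration; the point is only that the inline decision touches headers alone, while payload inspection happens on a passive copy of the traffic.

```python
# Hypothetical illustration of the division of labor described above; the
# rules, ports, and payload check are illustrative placeholders.

# Inline, header-only filtering: fast decisions that never parse payloads.
ALLOWED = {
    ("tcp", 22),      # SSH to the data transfer nodes (illustrative)
    ("tcp", 2811),    # GridFTP control channel (illustrative)
}

def header_filter(protocol: str, dst_port: int) -> bool:
    """Packet-filter decision based solely on header attributes."""
    return (protocol, dst_port) in ALLOWED

# Out-of-band deep inspection: runs on a passive copy of the traffic (for
# example, from an optical tap feeding an IDS), so it can take its time
# without delaying or dropping the packets actually being transferred.
def offline_inspect(packet_copy: bytes) -> None:
    if b"suspicious-marker" in packet_copy:   # stand-in for real IDS analysis
        print("alert: flow flagged for analyst review")

if __name__ == "__main__":
    print(header_filter("tcp", 2811))                   # True: header rules only
    offline_inspect(b"...payload bytes from a tap...")  # no alert in this example
```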

Thus, by moving the traffic to its own enclave that can be centrally monitored at a single point, the framework seeks to maintain a similar level of security to traditional organizations that typically have a single ingress/egress point, rather than simply removing network monitoring without replacing it with an alternative. However, the Science DMZ does so in a very specific way that accommodates the type and volume of network traffic used in scientific and high-performance computing environments. More specifically, it achieves throughput by reducing complexity, which is a theme that we will return to in this article.

The Science DMZ framework has been implemented widely in university and National Lab environments around the world as a result of funding from NSF, DOE ASCR, and other international funding organizations to support computing and networking infrastructure for open science. Naturally, both the Science DMZ framework and the Bro IDS must continue to be adapted to more types of HPC environments, such as those requiring greater data confidentiality guarantees, including medical, defense, and intelligence environments. Steps have already been made toward the medical context.

The Medical Science DMZ29 applies the Science DMZ framework to computing environments requiring compliance with the HIPAA Security Rule. Key architectural aspects include the notion that all traffic from outside the compute/storage infrastructure passes through heavily monitored head nodes, that storage and compute nodes themselves are not connected directly to the Internet, and that traffic containing sensitive or controlled-access data is encrypted. However, further work in medical environments, as well as in other environments, is required.


Leveraging the Distinctiveness of HPC as an Opportunity

The Science DMZ helps compensate for HPC’s limitations—we need more such solutions. As indicated by the four themes enumerated in this article, we also need solutions that can leverage HPC’s distinctiveness as a strength.

Sommer and Paxson41 point out that the reason anomaly-based detection typically is not used in traditional IT environments is, at a high level, that “finding attacks is fundamentally different from . . . other applications” (such as credit card fraud detection, for example). Among other key issues, they note that network traffic is often much more diverse than one might expect. They point out that semantic understanding is a vital component of overcoming this limitation and enabling machine-learning approaches to security to be more effective.

On the other hand, as mentioned earlier, HPC systems tend to be used for very distinctive purposes, notably mathematical computations (theme #1). The specific application of HPC systems varies by the organization that uses them (for example, a DOE National Lab or a DOD lab), but each individual system typically has a very specific use. This is a key point because it means that both specification-based and anomaly-based intrusion detection may be more useful in HPC environments than in traditional IT environments. Specifically, given the hypothesis that patterns of behavior in HPC are likely more regular than in typical computing systems, one might expect to reduce the error rates of anomaly-based intrusion detection, and possibly even to make specifications feasible to construct for specification-based intrusion detection. Thus, such security mechanisms might fare even better in HPC environments than in traditional IT environments (theme #4), though demonstrating the degree to which the increased regularity of HPC environments may be helpful for security analysis remains an open research question.
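
As a purely illustrative sketch of this hypothesis, if per-user application mixes on an HPC system really are regular, then even a trivial frequency baseline can highlight unusual activity for an analyst. The job histories, application names, and threshold below are hypothetical; a production approach would use far richer features and models.

```python
# Illustrative sketch only: flag jobs that deviate from a user's historical
# application mix. Job histories, names, and the threshold are hypothetical.
from collections import Counter

def baseline(history: list[str]) -> dict[str, float]:
    """Fraction of past jobs per application name for one user."""
    counts = Counter(history)
    total = sum(counts.values())
    return {app: n / total for app, n in counts.items()}

def anomaly_score(history: list[str], new_app: str) -> float:
    """1.0 if the user has never run this code; lower if it is routine."""
    return 1.0 - baseline(history).get(new_app, 0.0)

# A user who overwhelmingly runs one simulation code suddenly runs a miner.
past_jobs = ["climate_sim"] * 40 + ["linear_solver"] * 8 + ["viz_tool"] * 2
for app in ["climate_sim", "coin_miner"]:
    score = anomaly_score(past_jobs, app)
    print(f"{app}: score={score:.2f} -> {'review' if score > 0.9 else 'ok'}")
```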

Analyzing system behavior with machine learning. A second, related key point about HPC systems being used primarily for mathematical computation is that the insight that most HPC machines are used for computation focuses our attention on which security risks to care about (for example, users running “illicit computations,” as defined by the owners of the HPC system), and, if we can do better analysis of system behavior, might give us a better ability to understand what type of computation is taking place.

An example of a successful approach to addressing this question involved research that I was involved with at Berkeley Lab between 2009–2013.14,30,47,48 In this project, we asked the questions: What are people running on HPC systems? Are they running what they usually run? Are they running what they requested cycle allocations to run, or mining Bitcoin? Are they running something illegal (for example, classified)? In that work, we developed a technique for answering these questions by fingerprinting communication on HPC systems.

Specifically, we collected Message Passing Interface (MPI) function calls via the Integrated Performance Monitoring (IPM)43 tool, which showed patterns of communication between cores in an HPC system, as shown in Figure 2.

Figure 2. “Adjacency matrices” for individual runs of a performance benchmark, an atmospheric dynamics simulator, and a linear equation solver SUPERLU. Number of bytes sent between ranks is linearly mapped from dark blue (lowest) to red (highest), with white indicating an absence of communication.47,48

Using 1,681 logs for 29 scientific applications from NERSC HPC systems, we applied Bayesian machine learning techniques for classification of scientific computations, as well as a graph-theoretic approach using “approximate” graph matching techniques (subgraph isomorphism and edit distance). A hybrid machine learning and graph theory approach identified test HPC codes with 95%–99% accuracy.
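
The core idea can be sketched compactly. The toy code below builds a normalized adjacency matrix of bytes exchanged between ranks from IPM-style records and compares it against reference fingerprints with a simple elementwise distance; the records, reference library, and distance measure are illustrative stand-ins, since the published work used Bayesian classifiers and approximate graph matching rather than this naive nearest-neighbor comparison.

```python
# Illustrative sketch of communication fingerprinting; the records, reference
# library, and naive distance measure are stand-ins, not the published method.
import numpy as np

def adjacency(records: list[tuple[int, int, int]], n_ranks: int) -> np.ndarray:
    """records: (sender rank, receiver rank, bytes) triples from an MPI log."""
    A = np.zeros((n_ranks, n_ranks))
    for src, dst, nbytes in records:
        A[src, dst] += nbytes
    return A / A.sum()          # normalize so runs of different sizes compare

def nearest_known(run: np.ndarray, library: dict) -> tuple[str, float]:
    """Return the closest reference fingerprint and its distance."""
    name = min(library, key=lambda k: np.abs(run - library[k]).sum())
    return name, float(np.abs(run - library[name]).sum())

# Illustrative usage: a 4-rank ring (nearest-neighbor) exchange pattern.
records = [(i, (i + 1) % 4, 1000) for i in range(4)]
run = adjacency(records, n_ranks=4)
library = {"ring_exchange": run.copy(), "all_to_all": np.full((4, 4), 1 / 16)}
print(nearest_known(run, library))      # ('ring_exchange', 0.0)
```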

Our work analyzing distributed memory parallel computation patterns on HPC compute nodes is by no means conclusive that anomaly detection is an unqualified success for intrusion detection on HPC systems. For one thing, the experiments were not conducted in an adversarial environment, and so the difficulty of an attacker intentionally evading detection by attempting to make one program look like another was not explored. In addition, in our “fingerprinting HPC computation” project, we had what we deemed to be a reasonable, though not exhaustive, corpus of data representative of typical computations at NERSC facilities to examine. Further, in examining the data, we focused on a specific set of activity identified in the NERSC Acceptable Use Policy as falling outside of “acceptable use.” Other sites will have a different baseline of “typical computation,” and are also likely to have somewhat different policies that define what is or is not “illicit use.”

However, regardless, we do believe the approach is an example of the type of technique that could have success in HPC environments, and possibly even greater success than in many non-HPC environments. For example, consider the possibility of a skilled attacker attempting to evade detection, something that any security mechanism relying on machine learning is vulnerable to. Not only do there appear to be more regular use patterns in HPC environments, but there also exist certain distinctive security policies in HPC environments that might help improve the usefulness of application-level use monitoring. There are at least two reasons for this.

First, given that the organizations responsible for the security of HPC systems are likely to care most about misuse of cycles when very large numbers of cycles are used, it makes sense to focus on the users that consume cycles for many hours per day, for days at a time. This is a very different practical scenario from network security monitoring, where a decision about security might require a response in a fraction of a second in order to prevent compromise. Given the longer time scale, a human security analyst can be involved, rather than requiring the application monitoring, at the level that we have done it, to be conclusive. Instead, that application monitoring might simply serve to focus an analyst’s attention, and to lead to a manual source code analysis, or even an actual conversation with the user whose account was used to run the code.

A second reason why evading detection on HPC might be harder for an attacker is that users are often given “cycle allocations” to run code. The more a program running on an HPC system is modified to mask illicit use, the more additional cycles must be spent on extra work to make the program look like it is doing something different than it actually is; thus, the faster a stolen allocation will be used up, and/or the longer it will take the attacker to accomplish whatever illicit use they are attempting.

Collecting better audit and provenance data. It is important to note that the success of the work mentioned in the previous section depends on the availability of useful security monitoring data. It is our observation that the current trend in many scientific environments toward collecting provenance data for scientific reproducibility purposes, such as in the Tigres workflow system38 and the DOE Systems Biology Knowledgebase (KBase),21 may help to provide better data that can be used for security monitoring, as might DARPA’s “Transparent Computing” program,11 which seeks to “make currently opaque computing systems transparent by providing high-fidelity visibility into component interactions during system operation across all layers of software abstraction, while imposing minimal performance overhead.”

In line with this, as noted earlier, HPC systems have a lot in common with traditional systems, but they also contain a great deal of highly custom OS-level, network-level, and application-level software. A key point here is that such exotic hardware and low-level software stacks may also provide opportunities for gathering monitoring data going forward. The performance counters used in many of today’s HPC machines are an example of this.

Post-exascale systems, as well as architectures that are still in the early phases of practical implementation, such as neuromorphic computing, quantum computing, and photonic computing, may all provide additional challenges and opportunities. For example, though neural networks were previously thought by many to be inscrutable,16 new research suggests that interpreting them may actually be possible at some point.12,49 If successful, this might give rise to the ability to interpret networks learned by neuromorphic chips.


Looking to the Future

In the future, it is clear that numerous aspects of HPC will change, both for the good of security and in ways that complicate it.

Software engineering is a key goal of the National Strategic Computing Initiative (NSCI), and so perhaps automated static/runtime analysis tools might be developed and used to check HPC code for insecure behaviors.

On the other hand, science is also changing. For example, distributed, streaming sensor data collection is increasingly a source of data used in HPC. In short, science data is getting to us in new ways, and we also have more data than ever to protect.

Another change is that on HPC systems running full operating systems, we are starting to see an increasing shift toward the use of new virtualized environments for additional flexibility. In particular, as Docker containers25 and CoreOS’s Rocket9 become more popular in many IT environments for replication and containment without replicating full virtual operating systems, Docker-like containers that are more appropriate to HPC environments, such as Shifter19 or Singularity,23 are also gaining attention and use. This notion of “containerization” may well be a key benefit to security, both because containerization done properly typically limits the damage that an attacker can do, and because it simplifies the operation of the machine, and the reduction of complexity is often a key benefit to system robustness, including security.

The superfacility model, in which computation and visualization are more frequently tightly coupled than they currently are, also seems likely to grow. At the same time, the notion of “science gateways,” essentially Web portals providing limited interfaces to HPC rather than full-blown UNIX command-line interfaces, may provide a reduction of the complexity that the superfacility model would otherwise introduce. While science gateways still represent vulnerability vectors for arbitrary code, even when it is submitted via Web front-ends, security tends to benefit from more constrained operation, so the general trend toward science gateways may also enhance security.


Finally, the prospect of new and novel security technologies, such as simulated homomorphic encryption,34,35 differential privacy,15 and cryptographic mechanisms for securing chains of data,3,18,40 such as blockchains,26 may also provide new means for interacting with data sets in a constrained fashion.

For example, there may be cases where the owners of the data want to keep the raw data to themselves for an extended period of time, such as during a scientific embargo. Or there may be cases where the owners of the data are unable to share the raw data due to privacy regulations, such as with medical data, system and network data that contains personally identifiable information, or sensor data containing sensitive (for example, location) information. In either case, the data owners may still wish to find a way to enable some limited type of computation on the data, or to share the data, but only with a certain degree of resolution. With CryptDB34 and Mylar,35 Popa et al. have demonstrated approaches for efficiently searching over encrypted data without requiring fully homomorphic encryption,17 which is currently at least a million times too slow to be used practically, let alone in HPC environments. Likewise, differential privacy,15 and perhaps particularly distributed differential privacy,22 may provide new opportunities for sharing and analyzing data in HPC environments as well. In addition, blockchains and similar technologies may provide means both for monitoring the integrity of raw scientific data in HPC contexts and for maintaining secure audit trails of accesses to or modifications of raw data.
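
As a purely illustrative sketch of the differential-privacy idea in this setting, the snippet below releases a noisy aggregate over a sensitive dataset rather than the raw records, adding Laplace noise calibrated to the query's sensitivity and a chosen epsilon. The dataset, query, and epsilon are hypothetical, and a real deployment would also need careful accounting of the total privacy budget across queries.

```python
# Illustrative sketch of the Laplace mechanism for differential privacy;
# the dataset, predicate, and epsilon below are hypothetical placeholders.
import numpy as np

def noisy_count(records, predicate, epsilon: float) -> float:
    """Count records matching `predicate`, plus Laplace(1/epsilon) noise.
    A counting query changes by at most 1 when a single record is added or
    removed, so its sensitivity is 1 and the noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical usage: answer "how many patients have condition A?" without
# ever releasing the underlying patient records.
patients = [{"age": 34, "dx": "A"}, {"age": 61, "dx": "B"}, {"age": 47, "dx": "A"}]
print(noisy_count(patients, lambda r: r["dx"] == "A", epsilon=0.5))
```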


Summary

Modern HPC systems do some things very similarly to ordinary IT computing, but they also have some significant differences. This article has presented both the challenges and the opportunities that result.

One key security challenge is that traditional security solutions are often not effective, given the paramount priority of high performance in HPC. A second is the need to make some HPC environments as open as possible, to enable broad scientific collaboration and interactive HPC.

There may also be opportunities, as described by the four themes regarding HPC security presented here. The fact that HPC systems tend to be used for very distinctive purposes, notably mathematical computations, may mean the regularity of activity within HPC systems can benefit the effectiveness of machine learning analyses on security monitoring data to detect misuse of cycles and threats to computational integrity. In addition, custom stacks provide opportunities for enhanced security monitoring, and the general trend toward containerized operation, limited interfaces, and reduced complexity in HPC is likely to help in the future much as reduced complexity has benefitted the Science DMZ model.


Acknowledgments

Appreciation to Deb Agarwal, David Brown, Jonathan Carter, Phil Colella, Dan Gunter, Inder Monga, and Kathy Yelick for their valuable feedback and to Sean Whalen and Bogdan Copos for their excellent work underlying the ideas for new approaches described here. Thanks to Glenn Lockwood for his insights on the specifications for the DOE ASCR hardware and software coming in the next few years, and both Glenn Lockwood and Scott Campbell for the time spent providing the data that supported that research.

This work used resources of the National Energy Research Scientific Computing Center and was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author and do not necessarily reflect those of the employers or sponsors of this work.

Watch the author discuss his work in this exclusive Communications video: https://cacm.acm.org/videos/security-in-high-performance-computing-environments

References

    1. Adiga, N.R. et al. An overview of the Blue-Gene/L supercomputer. In Proceedings of the ACM/IEEE Conference on Supercomputing, 2002.

    2. Austin, B. et al. 2014 NERSC Workload Analysis (Nov. 5., 2015); http://portal.nersc.gov/project/mpccc/baustin/NERSC_2014_Workload_Analysis_v1.1.pdf.

    3. Anderson, R.J. UEPS: A second-generation electronic wallet. In Proceedings of the 2nd European Symposium on Research in Computer Security (Nov. 1992), 411–418.

    4. Bailey, D.H. Resolving numerical anomalies in scientific computation, 2008.

    5. Bailey, D.H., Borwein, J.M. and Stodden, V. Facilitating reproducibility in scientific computing: Principles and practice. Reproducibility: Principles, Problems, Practices. H. Atmanspacher and S. Maasen, Eds. John Wiley and Sons, New York, NY, 2015.

    6. Bailey, D.H., Demmel, J., Kahan, W., Revy, G. and Sen, K. Techniques for the automatic debugging of scientific floating-point programs. In Proceedings of the 14th GAMM-IMACS International Symposium on Scientific Computing, Computer Arithmetic and Validated Numerics (Lyon, France, Sept. 2010).

    7. Bishop, M. Computer Security: Art and Science. Addison-Wesley Professional, Boston, MA, 2003.

    8. Cappello, F. Improving the trust in results of numerical simulations and scientific data analytics. 2015.

    9. CoreOS, Inc. rkt - App Container runtime. https://github.com/coreos/rkt.

    10. Cray, Inc. Cray Linux Environment Software Release Overview, s-2425–52xx edition (Apr 2014); http://docs.cray.com/books/S-2425-52xx.

    11. DARPA. Transparent Computing; http://www.darpa.mil/Our_Work/I2O/Programs/Transparent_Computing.aspx.

    12. Das, A., Agrawal, H., Zitnick, C.L., Parikh, D. and Batra, D. Human attention in visual question answering: Do humans and deep networks look at the same regions? In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.

    13. Dart, E., Rotman, L., Tierney, B., Hester, M. and Zurawski, J. The science DMZ: A network design pattern for data-intensive science. In Proceedings of the IEEE/ACM Annual SuperComputing Conference (Denver CO, 2013).

    14. DeMasi, O., Samak, T. and Bailey, D.H. Identifying HPC codes via performance logs and machine learning. In Proceedings of the Workshop on Changing Landscapes in HPC Security (2013).

    15. Dwork, C. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, Part II. Lecture Notes in Computer Science 4052, (July 2006), 1–12. Springer Verlag.

    16. Gefter, A. Is artificial intelligence permanently inscrutable? Nautilus 40 (Sept. 1, 2016).

    17. Gentry, C. Computing arbitrary functions of encrypted data. Commun. ACM 53, 3 (Mar. 2010), 97–105.

    18. Haber, S. and Stornetta, W.S. How to time-stamp a digital document. J. Cryptology 3, 2 (Jan. 1991), 99–111.

    19. Jacobsen, D.M. and Canon, R.S. Contain this, unleashing docker for HPC. Proceedings of the Cray User Group, 2015.

    20. Jiang, L. and Su, Z. Osprey: A practical type system for validating dimensional unit correctness of c programs. In Proceedings of the 28th International Conference on Software Engineering, (2006), 262–271 ACM, New York.

    21. KBase: The Department of Energy Systems Biology Knowledgebase; http://kbase.us.

    22. Kasiviswanathan, S.P., Lee, H.K., Nissim, K., Raskhodnikova, S. and Smith, A. What can we learn privately? SIAM J. Computing 40, 3 (2011), 793–826.

    23. Kurtzer, G.M. et al. Singularity; http://singularity.lbl.gov.

    24. Markoff, J. and Bergman, L. Internet attack is called broad and long lasting. New York Times (May 10, 2005).

    25. Merkel, D. Docker: Lightweight Linux containers for consistent development and deployment. Linux J. 239 (2014).

    26. Nakamoto, S. Bitcoin: A Peer-to-Peer Electronic Cash System (May 24, 2009); http://www.bitcoin.org/bitcoin.pdf.

    27. Nataraj, A., Malony, A.D., Morris, A. and Shende, S. Early experiences with KTAU on the IBM BG/L. In European Conference on Parallel Processing, pp. 99–110. Springer, 2006.

    28. Paxson, V. Bro: A system for detecting network intruders in real time. Computer Networks 31, 23 (1999), 2435–2463.

    29. Peisert, S., et al. The Medical Science DMZ. J. American Medical Informatics Assoc. 23, 6 (Nov. 1, 2016).

    30. Peisert S. Fingerprinting Communication and Computation on HPC Machines. TR LBNL-3483E, Lawrence Berkeley National Laboratory, June 2010.

    31. Peisert, S., et al. ASCR Cybersecurity for Scientific Computing Integrity. TR LBNL-6953E, U.S. Department of Energy Office of Science, Feb. 2015.

    32. Peisert, S. et al. ASCR Cybersecurity for Scientific Computing Integrity|Research Pathways and Ideas Workshop. TR LBNL-191105, U.S. Department of Energy Office of Science, Sept. 2015.

    33. Pérez, F. and Granger, B.E. IPython: A System for interactive scientific computing. Computing in Science and Engineering 9, 3 (May 2007), 21–29.

    34. Popa, R.A., Redfield, C., Zeldovich, N. and Balakrishnan, H. Cryptdb: Processing queries on an encrypted database. Commun. ACM 55, 9 (Sept. 2012), 103–111.

    35. Popa, R.A., Stark, E., Helfer, J., Valdez, S., Zeldovich, N., Kaashoek, M.F. and Balakrishnan, H. Building Web applications on top of encrypted data using Mylar. In Proceedings of the 11th Symposium on Networked Systems Design and Implementation (2014), 157–172.

    36. Rubio-González, C. Precimonious: Tuning assistant for floating-point precision. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013, 27.

    37. Reubel, O. WarpIV: In situ visualization and analysis of ion accelerator simulations. IEEE Computer Graphics and Applications 36, 3 (2016), 22–35.

    38. Ramakrishnan, L., Poon, S., Hendrix, V., Gunter, D., Pastorello, G.Z. and Agarwal, D. Experiences with user-centered design for the Tigres workflow API. In Proceedings of 2014 IEEE 10th International Conference on e-Science, vol 1. IEEE, 290–297.

    39. Singer A. Tempting fate. ;login: 30, 1 (Feb. 2005), 27–30.

    40. Schneier, B. and Kelsey, J. Automatic event-stream notarization using digital signatures. In Proceedings of the 4th International Workshop on Security Protocols. Springer, 1996, 155–169.

    41. Sommer, R. and Paxson, V. Outside the closed world: On using machine learning for network intrusion detection. In Proceedings of the 31st IEEE Symposium on Security and Privacy, Oakland, CA, May 2010.

    42. Stoll, C. Stalking the wily hacker. Commun. ACM 31, 5 (May 1988), 484–497.

    43. Skinner, D., Wright, N., Fürlinger, K., Yelick, K.A. and Snavely, A. Integrated Performance Monitoring; http://ipm-hpc.sourceforge.net/.

    44. Wallace, D. Compute node Linux: New frontiers in compute node operating systems. Cray User Group, 2007.

    45. Whitlock, B., Favre, J.M. and Meredith, J.S. Parallel in situ coupling of simulation with a fully featured visualization system. In Proceedings of the 11th Eurographics Conference on Parallel Graphics and Visualization, 2011, 101–109.

    46. Wisniewski, R.W., Inglett, T., Keppel, P., Murty, R. and Riesen, R. mOS: An architecture for extreme-scale operating systems. In Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers. ACM, 2014.

    47. Whalen, S., Peisert, S. and Bishop, M. Network-theoretic classification of parallel computation patterns. In Proceedings of the First International Workshop on Characterizing Applications for Heterogeneous Exascale Systems (Tucson, AZ, June 4, 2011).

    48. Whalen, S., Peisert, S. and Bishop, M. Multiclass Classification of Distributed Memory Parallel Computations. Pattern Recognition Letters 34, 3 (Feb. 2013), 322–329.

    49. Yosinski, J., Clune, J., Fuchs, T. and Lipson, H. Understanding neural networks through deep visualization. In Proceedings of the Deep Learning Workshop, International Conference on Machine Learning, 2015.

    50. Yelick, K. A Superfacility for Data Intensive Science. Advanced Scientific Computing Research Advisory Committee, Washington, DC, Nov. 8, 2016; http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/201609/Yelick_Superfacility-ASCAC_2016.pdf.
