Research Highlights
Artificial Intelligence and Machine Learning

Technical Perspective: Unsafe Code Still a Hurdle Copilot Must Clear

"Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions," by Hammond Pearce et al., systematically analyzes the conditions under which the technical preview of Copilot may recommend insecure code.

In recent years, enormous progress has been made in the field of large language models (LLMs). Based on neural network architectures, specifically transformer models, LLMs have proven highly effective in natural language processing (NLP): they are designed to understand, generate, and work with human language. Trained on large datasets of text from the Internet, books, articles, and many other sources, such a model learns to predict the next word in a sentence based on the words that precede it.

LLMs are not only able to generate human language but can also generate source code to support humans in the implementation of software systems. A well-known example of such an LLM for code is GitHub Copilot, a machine learning (ML)-powered code completion tool developed by GitHub in collaboration with OpenAI. Copilot is designed to assist software developers by suggesting lines or even blocks of code as they type, effectively acting as a pair programmer. Copilot, specifically, is trained on code from all public repositories hosted on GitHub. Unfortunately, this code has not been checked for security best practices—the typical “garbage in, garbage out” principle we know from other areas applies here. As a result, the model may inadvertently learn and replicate insecure code patterns (and even critical bugs) from the training data. But can we somehow quantify how often, and under which conditions, an LLM generates unsafe code?

In the accompanying paper, the authors explore this question in detail. They systematically analyzed the conditions under which the technical preview of Copilot may recommend insecure code, focusing on the high-risk vulnerabilities listed in MITRE’s “Top 25” Common Weakness Enumeration (CWE). As part of the study, a total of 89 different scenarios were created that represent typical programming tasks (for example, trim the whitespace from a string buffer or check whether an index is within the bounds of an array before reading from it), resulting in around 1,700 programs. Each of them was examined for potential security vulnerabilities using a combination of automated analysis with GitHub’s CodeQL tool and manual analysis. Overall, around 40% of the code snippets generated by Copilot were found to be vulnerable. In addition, outdated and obsolete coding practices persisted in the training dataset, which was reflected in the code proposed by Copilot. These are disappointing findings: Copilot can help with rapid coding, but users must take a closer look at the code suggested by the LLM, especially when it comes to security.
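
To make these scenarios concrete, the following minimal C sketch (an illustration of the bounds-check task above, not a program from the paper's scenario set) contrasts the kind of unchecked array read that CodeQL would flag as CWE-125 (out-of-bounds read) with a version that validates the index first:

    #include <stdio.h>

    #define ID_COUNT 8
    static const int ids[ID_COUNT] = {11, 22, 33, 44, 55, 66, 77, 88};

    /* Insecure pattern (CWE-125, out-of-bounds read): the index is used
       without any bounds check, so a negative or too-large value reads
       memory outside the array. */
    int get_id_unsafe(int index) {
        return ids[index];
    }

    /* Hardened version: validate the index before reading. */
    int get_id_safe(int index, int *out) {
        if (index < 0 || index >= ID_COUNT) {
            return -1;                /* reject an invalid index */
        }
        *out = ids[index];
        return 0;
    }

    int main(void) {
        int value;
        if (get_id_safe(3, &value) == 0) {
            printf("ids[3] = %d\n", value);
        }
        return 0;
    }

The unsafe variant mirrors the style of completion the study counted as vulnerable; the hardened variant shows the check a reviewer should insist on before accepting such a suggestion.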

The authors also examined the behavior of Copilot across different programming languages and paradigms. Copilot tended to perform better in Python than in C, especially with respect to input validation. The experiments also show that Copilot has difficulty creating meaningful and syntactically correct Verilog programs. Verilog is a hardware description language (HDL) used for modeling electronic systems; Copilot likely struggled here because GitHub hosts comparatively few Verilog repositories, so less training data is available. The authors also briefly investigated how developers can formulate prompts (“prompt engineering”) to steer the tool toward more secure suggestions; doing this reliably remains an open challenge that is not well understood in practice.
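
As a small, hypothetical illustration of such prompt engineering (the prompt comments and the store_username function below are my own examples, not prompts from the paper), spelling out the buffer bound and the termination requirement in the prompt comment can nudge a code LLM toward bounded string handling in C:

    #include <string.h>

    /* Vague prompt a developer might write:
         // copy the username into the buffer
       A completion in this style can end up as an unbounded strcpy()
       into a fixed-size buffer (CWE-787, out-of-bounds write).

       More explicit prompt:
         // copy at most buf_len - 1 bytes of username into buf
         // and always NUL-terminate
       which makes a bounded completion like the one below more likely,
       though by no means guaranteed. */
    void store_username(char *buf, size_t buf_len, const char *username) {
        if (buf == NULL || buf_len == 0 || username == NULL) {
            return;
        }
        strncpy(buf, username, buf_len - 1);  /* bounded copy */
        buf[buf_len - 1] = '\0';              /* guarantee termination */
    }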

This work is significant because, according to GitHub, Copilot has been activated by more than one million developers and adopted by over 20,000 organizations. This LLM has become an indispensable tool for developers worldwide, and we need to be aware of the potential harm if the generated code is used unchanged. The study presents a method that can be used to systematically analyze a particular LLM for potential problems. The same technique can be used to analyze other LLMs for code (for example, Code Llama, StarCoder, CodeGen, and others), as they, too, exhibit undesirable behavior in some edge cases due to inherent properties of the model itself and the massive amount of unsanitized training data. To avoid potential security risks in the generated code, established secure coding practices such as automated testing (for example, fuzzing or static analysis) and secure coding standards should be followed.
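
One practical pattern, sketched below under the assumption that the hypothetical store_username routine from the earlier example came from a code LLM, is to drop generated code into a standard libFuzzer harness so that AddressSanitizer can surface memory-safety bugs before the suggestion is trusted:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Function under test: assume it was suggested by a code LLM and lives
       in store.c (the store_username sketch above). */
    void store_username(char *buf, size_t buf_len, const char *username);

    /* libFuzzer entry point. Build with:
         clang -g -fsanitize=fuzzer,address fuzz_store.c store.c
       AddressSanitizer reports any out-of-bounds access triggered by the
       generated code. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        char input[256];
        char buf[32];

        size_t n = size < sizeof(input) - 1 ? size : sizeof(input) - 1;
        memcpy(input, data, n);
        input[n] = '\0';               /* turn the fuzz input into a C string */

        store_username(buf, sizeof(buf), input);
        return 0;
    }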

One caveat applies: the programs examined are synthetic test cases developed by the authors. In practice, the amount of unsafe code generated by an LLM might differ. Nevertheless, this work emphasizes the importance of manual oversight and conventional security practices that must be adopted when using powerful tools such as GitHub Copilot or other LLMs for code. In a future where developers rely more on language models as pair programmers, we must ensure the generated code is secure and maintainable. Many open challenges still must be solved, but an ML model could become a reliable and trustworthy pair programmer in the future!
