Abstract
There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described “AI pair programmer,” GitHub Copilot, a language model trained over open source GitHub code. However, code often contains bugs—and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns about the security of Copilot’s code contributions. In this work, we systematically investigate the prevalence of, and conditions that can cause, GitHub Copilot to recommend insecure code. To perform this analysis, we prompt Copilot to generate code in scenarios relevant to high-risk cybersecurity weaknesses, for example, those from MITRE’s “Top 25” Common Weakness Enumeration (CWE) list. We explore Copilot’s performance on three distinct code-generation axes—examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, yielding 1,689 programs. Of these, we found approximately 40% to be vulnerable.
1. Introduction
With increasing pressure on software developers to produce code quickly, there is considerable interest in tools and techniques for improving productivity. The most recent entrant into this field is machine learning (ML)-based code generation, in which large models originally designed for natural language processing (NLP) are trained on vast quantities of code and attempt to provide sensible completions as programmers write code. In June 2021, GitHub released Copilot,1 an “AI pair programmer” that generates code in a variety of languages given some context, such as comments, function names, and surrounding code. Copilot is built on a large language model that is trained on open-source code5 including “public code…with insecure coding patterns”, thus giving rise to the potential for “synthesize[d] code that contains these undesirable patterns”.1
Although prior research has evaluated the functionality of code generated by language models,2,5 there is no systematic examination of the security of ML-generated code. As GitHub Copilot is the largest and most capable such model currently available, it is important to understand: Are Copilot’s suggestions commonly insecure? What is the prevalence of insecure generated code? What factors of the “context” yield generated code that is more or less secure?
We systematically experiment with Copilot to gain insights into these questions by designing scenarios for Copilot to complete and by analyzing the produced code for security weaknesses. As our corpus of well-defined weaknesses, we check Copilot completions for a subset of MITRE’s Common Weakness Enumerations (CWEs), from their “2021 CWE Top 25 Most Dangerous Software Weaknesses”21 list. The AI’s documentation recommends that one use “Copilot together with testing practices and security tools, as well as your own judgment”. Our work attempts to characterize Copilot’s tendency to produce insecure code, giving a gauge of the amount of scrutiny a human developer might need to apply to security issues.
We study Copilot’s behavior along three dimensions: (1) diversity of weakness, its propensity for generating code that is susceptible to weaknesses in the CWE “top 25”, given a scenario where such a vulnerability is possible; (2) diversity of prompt, its response to variations in the context for a particular scenario (SQL injection); and (3) diversity of domain, its response to the programming language/paradigm.
For diversity of weakness, we construct three different scenarios for each applicable “top 25” CWE and use the CodeQL software scanning suite10 along with manual inspection to assess whether the suggestions returned are vulnerable to that CWE. Our goal here is to get a broad overview of the types of vulnerabilities Copilot is most likely to generate, and how often users might encounter such insecure suggestions. Next, we investigate the effect different prompts have on how likely Copilot is to return suggestions that are vulnerable to SQL injection. This investigation allows us to better understand what patterns programmers may wish to avoid when using Copilot, or ways to help guide it to produce more secure code.
Finally, we study the security of code generated by Copilot when it is used for a domain that is rarer in its training data. Copilot’s marketing materials claim that it speaks “all the languages one loves.” To test this claim, we focus on Copilot’s behavior when tasked with generating register-transfer level (RTL) code in the hardware description language Verilog, then compare the results against the MITRE hardware CWEs, which were added in 2020.23 As with the software CWEs, hardware designers can be sure that their designs meet a certain baseline level of security if they are free of hardware weaknesses.
Our contributions include the following. We perform automatic and manual analysis of Copilot’s software and hardware code completion behavior in response to “prompts” handcrafted to represent security-relevant scenarios, and we characterize the impact that patterns in the context can have on the AI’s code generation and confidence. We discuss the implications for software and hardware designers, especially security novices, when using AI pair-programming tools. All scenarios and results are released as open source.
2. Background and Related Work
2.1 Code Generation
Developers iteratively refine specifications into code to create functional products. Prior efforts in natural language programming14 include formal models for automatic code generation (for example, Drechsler8 and Harris11) as well as machine-learned NLP.18 Deep learning (DL)-based NLP architectures include LSTMs,20 RNNs,13 and Transformers,24 which have led to models such as BERT,7 GPT-2,17 and GPT-3.4 These models can perform tasks like translation and answering questions from the CoQA19 dataset; after fine-tuning on specialized datasets, the models can perform code completion5 and hardware design.16 State-of-the-art models have billions of learnable parameters and are trained on millions of software repositories.5
Copilot builds on the OpenAI Codex family of models5 that are GPT-3 models4 fine-tuned on code from GitHub. Its tokenization step is nearly identical to GPT-3: byte pair encoding converts the source text into a sequence of tokens, but the GPT-3 vocabulary was extended by adding dedicated tokens for whitespace (that is, a token for two spaces, a token for three spaces, up to 25 spaces). This allows the tokenizer to encode source code (with lots of whitespace) both more efficiently and with more context.
OpenAI published a technical report evaluating various aspects of “several early Codex models, whose descendants power GitHub Copilot”,5 with a small discussion (in Appendix G.3) of insecure code generation. However, this investigation was limited to one type of weakness (insecure cryptographic parameters, namely short RSA key sizes and the use of AES in ECB mode). The authors note that “a larger study using the most common insecure code vulnerabilities” is needed, and we supply such an analysis here.
An important feature that Codex and Copilot inherit from GPT-3 is that, given a prompt, they generate the most likely completion for that prompt based on what was seen during training. In the context of code generation, this means that the model will not necessarily generate the best code (by whatever metric you choose—performance, security, etc.) but rather the one that best matches the code that came before. As a result, the quality of the generated code can be strongly influenced by semantically irrelevant features of the prompt (see Section 5.3).
2.2 Evaluating Code Security
Numerous elements determine the quality of code. The code-generation literature emphasizes functional correctness, measured by compiling the output and checking it against unit tests, or by using text-similarity metrics against desired responses.5 Unlike metrics for the functional correctness of generated code, evaluating the security of code contributions made by Copilot is an open problem. Aside from manual assessment by a human security expert, there are myriad tools and techniques to perform security analyses of software.15
In this work, we gauge the security of Copilot’s contributions using a mix of automated analysis with GitHub’s CodeQL tool10 and our own manual code inspection. CodeQL is open source and supports the analysis of software in languages such as Java, JavaScript, C++, C#, and Python. Through queries written in its QL query language, CodeQL can find issues in codebases based on a set of known vulnerabilities/rules. Developers can configure CodeQL to scan for different code issues, and GitHub makes it available for academic research (also, it seems fair to use one GitHub tool to test another). Prior work used CodeQL to identify vulnerable code commits over the life of a JavaScript project.3
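As a rough illustration of this workflow, the following sketch shows how a single completed scenario might be scanned from Python. The helper name and directory layout are hypothetical, and the invocations assume CodeQL’s standard database create/analyze CLI commands rather than the authors’ exact tooling.

import subprocess

# Hypothetical helper (not the authors' exact tooling): build a CodeQL database for
# one completed scenario and run the standard Python security/quality query suite.
def analyze_scenario(scenario_dir: str) -> None:
    subprocess.run(
        ["codeql", "database", "create", "scenario-db",
         "--language=python", f"--source-root={scenario_dir}"],
        check=True,
    )
    # depending on the CodeQL setup, the suite may need to be given as a full path
    subprocess.run(
        ["codeql", "database", "analyze", "scenario-db",
         "python-security-and-quality.qls",
         "--format=sarif-latest", "--output=results.sarif"],
        check=True,
    )

analyze_scenario("scenarios/cwe-89/scenario-0")  # hypothetical path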
There are common patterns in various classes of insecure code. Such patterns can be considered weaknesses, as taxonomized by the Common Weakness Enumeration (CWE) database maintained by MITRE.22 CWEs capture weaknesses in a spectrum of complexity; some CWEs manifest as fairly “mechanical” implementation bugs that can be caught by static analysis tools (such as CodeQL). Other CWEs cannot be adequately tested for by examining only the source code in isolation, thus necessitating other approaches like fuzzing12 for security analysis. Examining if Copilot introduces weaknesses that require reasoning over such a broader context (that is, outside the single code file) is beyond the scope of this study.
3. Using GitHub Copilot
Copilot functions as a proprietary, closed-source, black-box plugin that supports a number of code editors. It is used as follows. The software developer (user) works on some program, editing and writing code.
As the user adds lines of code to the program, Copilot continuously scans the program, periodically uploading some subset of lines and the location of the user’s cursor. From this, it uses a large language model to generate candidate suggestions for the user as a kind of ‘smart autocomplete’. Copilot’s ‘most confident’ suggestion is presented in-line in the text; the other suggestions are available in an additional window. An example of this process is depicted in Figure 1. Here, the user has begun to write the login code for a web app. Their cursor is located at line 15, and based on other lines of code in the program, Copilot suggests an additional line of code that can be inserted.
4. Experimental Method
4.1 Problem Definition
We focus on evaluating the potential security vulnerabilities of code generated by Copilot. As discussed in Section 2, determining if code is vulnerable sometimes requires knowledge (context) external to the code itself. Furthermore, determining that a specific vulnerability is exploitable requires framing within a corresponding attacker model.
As such, we constrain ourselves to the challenge of determining if specific code snippets generated by Copilot are vulnerable: that is, if they definitively contain code that exhibits characteristics of a CWE. We do not consider the exploitability of an identified weakness in our experimental setting; instead, we reduce the problem to a binary classification: Copilot-generated code either contains code identified as (or known to be) weak or it does not.
4.2 Evaluating Copilot with Static Analysis
In this paper, we use GitHub CodeQL.10 To demonstrate CodeQL’s functionality, assume that the suggestion from Copilot in Figure 1 is chosen to build a program. Using CodeQL’s python-security-and-quality.qls testing suite, which checks 153 security properties, CodeQL outputs feedback like that shown in Figure 2—reporting that the SQL query is written in a way that allows for insertion of malicious SQL code by the user. In the CWE nomenclature, this is CWE-89 (SQL Injection).
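To make the distinction concrete, the following is a minimal sketch in the spirit of Figures 1 and 2 (not their actual code), assuming a sqlite3-backed login check. The first function builds the query by string formatting, the pattern CodeQL reports as CWE-89; the second uses a parameterized query, which it does not flag.

import sqlite3

def user_exists_vulnerable(db: sqlite3.Connection, username: str) -> bool:
    # query assembled by string formatting: flagged as CWE-89 (SQL injection)
    query = "SELECT * FROM users WHERE username = '%s'" % username
    return db.execute(query).fetchone() is not None

def user_exists_safe(db: sqlite3.Connection, username: str) -> bool:
    # parameterized placeholder: the username is never interpolated into the SQL text
    query = "SELECT * FROM users WHERE username = ?"
    return db.execute(query, (username,)).fetchone() is not None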
4.3 Generalized Evaluation Process
We focus in this work on CWEs in MITRE’s taxonomy, with particular attention on the “2021 CWE Top 25” list.21 These guide the creation of a Copilot prompt dataset, which we call the ‘CWE scenarios’. We feed each prompt through Copilot to generate code completions (Section 3) and determine if the generated code contains the CWE (Section 4.2). Our overall experimental method is depicted in Figure 3.
In step 1, for each CWE, we write a number of ‘CWE scenarios’ (step 2). These are small, incomplete program snippets in which Copilot will be asked to generate code. The scenarios are designed such that a naive functional response could contain a CWE, similar to that depicted in Figure 1. We restrict ourselves to three programming languages: Python, C, and Verilog. Python and C are extremely popular, are supported by CodeQL, and, between them, can realistically instantiate the complete list of the top 25 CWEs. We use Verilog to explore Copilot’s behavior in a less popular domain in Section 5.4 as an additional set of experiments. In developing the scenarios, we used three different sources, which were, in order of preference: (a) the CodeQL example/documentation repository (already ready for evaluation), (b) examples listed in the CWE entry in MITRE’s database, and (c) bespoke scenarios designed by the authors for this study. Note that no scenario contains the weakness from the outset; it is Copilot’s completion that determines if the final program is vulnerable. An example of what such a scenario prompt can look like is sketched below.
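The snippet below is a hypothetical rendering of a Python ‘CWE scenario’ prompt (the real scenarios are in the released dataset): an incomplete program whose naive completion could introduce SQL injection. The route, table, and comment text are illustrative assumptions.

from flask import Flask, request
import sqlite3

app = Flask(__name__)

@app.route("/unsubscribe")
def unsubscribe():
    # incomplete scenario: Copilot is asked to continue after the comment below
    email = request.args.get("email")
    db = sqlite3.connect("users.db")
    # check if the email exists in the users table and, if so, delete it
    # <-- Copilot's completion is requested here -->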
Next, in step 3, Copilot is asked to generate up to 25 options for each scenario. Each option is then combined with the original program snippet to make a set of programs in step 4a, with some options discarded in step 4b if they have significant syntax issues (that is, they cannot be compiled/parsed). That said, where simple edits (for example, adding or removing a single brace) would result in a compilable output, we make those changes automatically using a regex-based tool, of the kind sketched below.
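The following is a minimal sketch of that kind of single-character clean-up, assuming simple brace balancing is sufficient; it is not the authors’ actual tool.

def balance_braces(source: str) -> str:
    # Append missing closing braces, or drop surplus trailing ones (sketch only).
    opens, closes = source.count("{"), source.count("}")
    if opens > closes:
        return source + "}" * (opens - closes)
    if closes > opens:
        surplus = closes - opens
        lines = source.rstrip().splitlines()
        while surplus and lines and lines[-1].strip() == "}":
            lines.pop()
            surplus -= 1
        return "\n".join(lines) + "\n"
    return source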
Then, in step 5a, each program is evaluated. Where possible, this evaluation is performed by CodeQL in step 5b, using either built-in or custom queries. For some CWEs that require additional context or could not be expressed as properties examinable by CodeQL, this evaluation was performed manually by the authors in step 5c. Importantly, CodeQL is configured in this step to examine only for the specific CWE the scenario is designed for. In addition, we do not evaluate for correctness, only for vulnerabilities. This decision is discussed further in Section 5.1.1. Finally, in step 6, the results of the evaluations of each Copilot-completed program are aggregated.
5. Investigation of GitHub Copilot
5.1 Study Overview
Our analysis is framed along three different axes of diversity. Diversity of Weakness (DOW) examines Copilot’s performance within the context of differing software CWEs. Diversity of Prompt (DOP) performs a deeper examination of Copilot’s performance under a single at-risk CWE scenario, with prompts containing subtle variations. Diversity of Domain (DOD) tasks Copilot with generating register-transfer level (RTL) hardware specifications in Verilog to investigate its performance within the hardware CWE23 context.
5.1.1 Vulnerability Classification
We take a conservative view on vulnerability classification. Specifically, we mark a Copilot output as vulnerable only if it definitively contains vulnerable code. While this might sound tautological, the distinction is critical, as Copilot sometimes provides only a partial code completion. For example, Copilot may generate the string for an SQL query in a vulnerable way (for example, via string construction), but then stop the code suggestion before the string is used. It is likely that if the code were continued, it would be vulnerable to SQL injection, but as the string is never technically passed to an SQL connection, it is not. As such, we mark these kinds of situations as non-vulnerable.
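The following hypothetical completion sketches the partial-completion case: the query string is built unsafely, but the suggestion stops before the string ever reaches a database connection, so the program is marked non-vulnerable.

def build_lookup(username: str) -> str:
    # unsafely constructed query string (illustrative, hypothetical completion)
    query = "SELECT * FROM users WHERE username = '" + username + "'"
    return query  # never passed to an SQL connection, so no CWE-89 is recorded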
For a given scenario we check only for the specific CWE under investigation. This is important as many generated files are vulnerable in more than one category—for instance, a poorly-written login/registration function might be simultaneously vulnerable to SQL injection (CWE-89) and feature insufficiently protected credentials (CWE-522). Finally, we did not evaluate for functionally correct code generation, only vulnerable outputs.
5.2 Diversity of Weakness
5.2.1 Overview
The first axis of investigation involves checking Copilot’s performance when prompted with several different scenarios where the completion could introduce a software CWE. For each CWE, we develop three different scenarios. As described previously in Section 4.3, these scenarios may be derived from the CodeQL repository or MITRE’s own examples, or they are bespoke code created specifically for this study. As previously discussed in Section 2.2, not all CWEs could be examined using our experimental setup due to complexity or context requirements. Our results are presented in Table 1 and Table 2.
Rank reflects the ranking of the CWE in the MITRE “top 25”. CWE-Scn. is the scenario program’s identifier in the form ‘CWE number’-‘scenario number’. L is the language used, ‘c’ for C and ‘py’ for Python. Orig. is the original source for the scenario, either ‘codeql’, ‘mitre’, or ‘authors’. Marker specifies whether the marker was CodeQL (automated analysis) or the authors (manual analysis). # Vd. specifies how many ‘valid’ (syntactically compliant, compilable, and unique) program options Copilot provided; while we requested 25 suggestions, Copilot did not always provide 25 distinct suggestions. # Vln. specifies how many ‘valid’ options were ‘vulnerable’ according to the rules of the CWE. TNV? (‘Top Non-Vulnerable?’) records whether or not the top-scoring program was non-vulnerable (safe). Copilot Score Spreads provides box plots of the scores for the generated options after checking whether each option makes a non-vulnerable (N-V) or vulnerable (V) program.
In total, we designed 54 scenarios across 18 different CWEs. From these, Copilot generated options that produced 1,084 valid programs, of which 477 (44.00%) were determined to contain a CWE. Of the 54 scenarios, 24 (44.44%) had a vulnerable top-scoring suggestion. Breaking down by language: 25 scenarios were in C, generating 513 programs, of which 258 (50.29%) were vulnerable; 13 of these scenarios (52.00%) had a vulnerable top-scoring program. 29 scenarios were in Python, generating 571 programs, of which 219 (38.35%) were vulnerable; 11 of these scenarios (37.93%) had a vulnerable top-scoring program.
5.2.2 Example CWE results
Every single scenario emitted vulnerable code at least once. Some vulnerabilities are more severe than others; this severity is reflected in the CWE’s higher rank within the MITRE taxonomy.
Example 1, CWE-787: The most severe vulnerability in the corpus was CWE-787, or ‘Out-of-bounds Write’, which concerns mismanagement of low-level memory (specifically, when software may write data past the end or before the beginning of a buffer). This is ranked as #1 for two reasons: firstly, memory errors are frequently found in low-level code, and secondly, when exploitable, writable memory buffer errors can lead to system compromise and arbitrary code execution.
We present an example of this kind of vulnerability in Figure 4(a). The top option for this from Copilot, with a rounded score of , is presented in Figure 4(b). Here, Copilot’s generated code is marked as vulnerable by CodeQL. This is because sprintf may generate strings of up to 317 characters from %f (317 for doubles, 47 for floats). This means that these character buffers must be at least 48 characters long (the values here are floats, and the code also needs to include space for the null-termination character). Yet each buffer is only 32 characters long, meaning that sprintf may write past the end of the buffer.
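The 47- and 317-character figures can be sanity-checked with a short calculation; the sketch below uses Python’s % formatting, which mirrors C’s %f closely enough for this purpose, and is illustrative rather than part of the study’s analysis.

# Worst-case lengths of %f-formatted values, including a leading minus sign.
FLT_MAX = 3.4028235e38            # largest finite single-precision magnitude
DBL_MAX = 1.7976931348623157e308  # largest finite double-precision magnitude

print(len("%f" % -FLT_MAX))  # 47 characters; a C buffer needs one more for '\0'
print(len("%f" % -DBL_MAX))  # 317 characters for doubles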
Example 2, CWE-79: At MITRE’s second-highest rank, CWE-79 is a weakness covering scenarios where web applications do not neutralize user-provided values before including them in their outputs. As the potential for mala fide third-party users to embed malicious code is a significant concern for web developers, CodeQL already had a number of pre-existing tests for this CWE. We adopted three of these for the scenarios. One example, 79-0, is presented with Copilot’s responses in Figure 5(a), with the top answer depicted in Figure 5(b). As this code is not definitively vulnerable (the username may be escaped inside render_template()), it is marked as non-vulnerable. However, Copilot is non-deterministic, and other outputs are possible. The third-scoring option, presented in Figure 5(c), is definitively vulnerable, while the fourth-highest, Figure 5(d), is not vulnerable.
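The contrast can be sketched as follows; this is a minimal illustration of the two patterns rather than the actual Figure 5 code, and the route names are assumptions. The first handler embeds the raw username into HTML, the definitively vulnerable pattern (CWE-79); the second escapes it before rendering.

from flask import Flask, request, render_template_string
from markupsafe import escape

app = Flask(__name__)

@app.route("/hello_vulnerable")
def hello_vulnerable():
    username = request.args.get("username", "")
    return "<p>Hello, " + username + "!</p>"  # reflected XSS: markup passes through

@app.route("/hello_safe")
def hello_safe():
    username = request.args.get("username", "")
    # escape() neutralizes user-supplied markup before it reaches the page
    return render_template_string("<p>Hello, {{ name }}!</p>", name=escape(username))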
5.2.3 Observations
When considering the entire set of evaluated scenarios, a few observations can be made. While Copilot generated vulnerable code around 44% of the time, some CWEs were more prevalent than others. For instance, compare CWE-79 (‘Cross-site scripting’) with CWE-22 (‘Path traversal’). Both CWEs were evaluated with programs in both C and Python, yet CWE-79 had zero vulnerable top-scoring programs and only 19% vulnerable options overall, while CWE-22 had only vulnerable top-scoring programs, with 60% vulnerable options overall.
The wide range of scenarios also allows us to visualize the differences between the scores that Copilot generates for each of its options. Most scenarios featured similarly scoring top answers, although a few stand out: 476-1 (at 0.96), 200-0 (0.93), and 416-2 (0.92) all had an extremely high-confidence but vulnerable top-scoring option. These have some similarities between them, with CWE-476 and CWE-416 both dealing with low-level pointer-mismanagement errors. Meanwhile, CWE-200, which is a higher-level, context-required CWE concerning information leakage, had a wide range of confidences. If we instead consider the scenarios with the highest mean vulnerable scores, these are 22-0 (0.83), 125-1 (0.76), and 787-0 (0.74), with no crossover in the top three.
Of the non-vulnerable options, the top-scoring were for 732-2 (0.91), 306-2 (0.91), and 125-1 (0.90), and the scenarios with the highest mean non-vulnerable scores were 306-2 (0.82), 416-2 (0.78), and 79-1 (0.76). Here, CWE-732 and CWE-306 are more qualitative and are concerned with permissions and authorization. Meanwhile, CWE-125 is for buffer over- and under-reads.
5.3 Diversity of Prompt
5.3.1 Overview
Our second axis of investigation checks how Copilot’s performance changes for a specific CWE, given small changes to the provided prompt. For this experiment, we chose CWE-89 (SQL injection), as it is well known (infamous, with plenty of both vulnerable and non-vulnerable code examples online) and well formed (code is either vulnerable or it is not; there is no grey zone).
Our results are presented in Table 3, with column definitions shared with the earlier DOW tables. Our ID column is now of the form ‘Type’-‘ID’. Here, the prompts are divided into four categories: CON for the control prompt, M for prompts with meta-type changes, D for prompts with comment (documentation) changes, and C for prompts with code changes. The table excludes the Language, Marker, and Origin columns, as the language is always ‘Python’, the marker is always ‘CodeQL’, and the origin is always ‘Authors’. All scenarios are built by mutating the control scenario CON, with the description of each change made to the prompt listed in the “Scenario description” column. Changes are not cumulative: for instance, scenario D-1, which rewords the prompt comment, does not also have the author flag set by scenario M-1. Overall, we collected results for 17 different scenarios, with Copilot options generating 407 valid programs. Of these, 152 (37.35%) were vulnerable. Across the 17 scenarios, 4 (23.53%) had top-scoring vulnerable programs.
Control scenario example: CON represents the control prompt for this experiment, derived from DOW scenario 89-0. This prompt, with Copilot’s top suggestion, is presented in Figure 6. CON provides the performance baseline of Copilot against which the other DOP scenarios are compared. It yielded 6 vulnerable and 19 non-vulnerable suggestions, with the top-suggested option being non-vulnerable.
5.3.2 Observations
Copilot did not diverge far from the overall confidences and performance of CON, with two notable exceptions in C-2 and C-3. C-2 adds a separate, non-vulnerable database function to the program. This significantly improved Copilot’s output, increasing the confidence score and seemingly preventing any other vulnerable suggestions entirely. C-3, meanwhile, made that new function vulnerable. This also increased the confidence markedly, but the answers are now skewed towards vulnerable—only one non-vulnerable answer was generated, and the top-scoring option is vulnerable. We hypothesize that the presence of either vulnerable or non-vulnerable SQL in a codebase is the strongest predictor of whether other SQL in that codebase will be vulnerable, and therefore has the strongest impact on whether Copilot will itself generate SQL code vulnerable to injection. That said, though they did not have a significant effect on the overall confidence score, we did observe that small changes to Copilot’s prompt (that is, scenarios D-1, D-2, and D-3) can impact the safety of the generated code with regard to the top-suggested program option, even when they have no semantic meaning (they are only changes to comments). A sketch of what a C-2-style prompt change looks like follows.
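The sketch below is a hypothetical rendering of a C-2-style change, not the actual scenario file: the control prompt is extended with an existing, parameterized helper before the point where Copilot’s completion is requested, while the rest of the prompt is left unchanged.

import sqlite3

def get_user(db: sqlite3.Connection, username: str):
    # pre-existing, non-vulnerable database access included in the prompt (C-2 style)
    return db.execute(
        "SELECT * FROM users WHERE username = ?", (username,)
    ).fetchone()

# ... remainder of the control scenario follows unchanged, with Copilot asked to
# complete the database code for a new route ...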
5.4 Diversity of Domain
5.4.1 Overview
Hardware CWEs were added to MITRE’s taxonomy in 2020.23 As with the software CWEs, these aim to provide a basis for hardware designers to be sure that their designs meet a certain baseline level of security.
Hardware CWEs have some key differences from software CWEs, primarily surrounding external context (assets) beyond what is provided with the hardware definition directly,6 including implementation and timing details.
Unfortunately, due to their recent emergence, tooling for examining hardware for CWEs is rudimentary. Traditional security verification for RTL is a mix of formal verification and manual evaluation by security experts.9 Given this, we chose six hardware CWEs that we could manually analyze objectively (similar to manually marked CWEs from the DOW scenarios) in order to evaluate Copilot.
The results are summarized in Table 4. We designed 3 scenarios for each CWE, for a total of 18 scenarios. Copilot was able to generate options to make 198 programs. Of these, 56 (28.28%) were vulnerable. Of the 18 scenarios, 7 (38.89%) had vulnerable top-scoring options.
Hardware CWE example scenario: CWE-1234 (Hardware Internal or Debug Modes Allow Override of Locks) covers situations where sensitive registers that should be locked (unwritable) are modifiable in certain situations (for example, in a Debug mode). As an example, scenario 1234-0 is depicted in Figure 7. This prompts for a single clause of Verilog to write input data to a locked register in debug mode only when the trusted signal is high. Here, Copilot correctly generates the appropriate security check for the top-scoring option.
5.4.2 Observations
Compared with Python and C, Copilot struggled to generate meaningful Verilog (for example, 1254-2 had no syntactically compliant results). We think this is due mostly to the smaller amount of training data available. Verilog has syntax similar to other C-like languages, and many of the non-compiling options used keywords and syntax from these other languages, particularly SystemVerilog. Other issues were semantic, caused by Copilot not correctly understanding the nuances of various data types and how to use them (for example, ‘wire’/‘reg’ type confusion). For the six CWEs, we were not looking for correct code but rather for insecure code, and in this regard Copilot performed relatively well.
6. Discussion
Copilot’s response to our scenarios is mixed from a security standpoint, given the large number of generated vulnerabilities (across all axes and languages, 39.33% of the top options and 40.73% of the total options were vulnerable). The security of the top options is particularly important—novice users may be more likely to accept the ‘best’ suggestion. As Copilot is trained over open-source code available on GitHub, we theorize that the variable security quality stems from the nature of the community-provided code. That is, where bugs are more visible in open-source repositories, those bugs will be reproduced more often by Copilot.
Another security aspect of open-source software that needs to be considered is the effect of time. What is ‘best practice’ at the time of writing may slowly become ‘bad practice’ as the cybersecurity landscape evolves. Instances of out-of-date practices can persist in the training set and lead to code generation based on obsolete approaches. An example of this is in the DOW CWE-522 scenarios concerning password hashing. Some time ago, MD5 was considered secure; now, best practice involves either many rounds of a simple hashing function or the use of a library that will age gracefully, such as ‘bcrypt’. Yet legacy code uses insecure hashes, and so Copilot continues suggesting them.
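The contrast can be illustrated as follows; this is an illustrative sketch rather than code from the CWE-522 scenarios, and it assumes the third-party bcrypt package is installed.

import hashlib
import bcrypt  # third-party library: pip install bcrypt

password = b"correct horse battery staple"

# Obsolete pattern still reproduced from legacy code: unsalted MD5, long considered
# broken for password storage and fast to brute-force.
legacy_digest = hashlib.md5(password).hexdigest()

# Current practice: a salted, deliberately slow password hash such as bcrypt.
hashed = bcrypt.hashpw(password, bcrypt.gensalt())
assert bcrypt.checkpw(password, hashed)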
6.1 Threats to Validity
6.1.1 CodeQL Limitations
While we endeavored to evaluate as many scenarios as possible using GitHub’s CodeQL, some CWEs could not easily be processed. CodeQL builds graphs of program content/structure and performs best when analyzing these graphs for self-evident truths: that is, data contained within the program that is definitively vulnerable (for example, checking for SQL injection). However, even with the complete codebase, CodeQL sometimes cannot parse important information. We found this to be the case when considering memory buffer sizes, as CodeQL’s ability to derive memory boundaries (for example, array lengths) is limited. Additionally, as noted in Section 2, some CWEs need information beyond that encoded in the program. For instance, CWE-434 (Unrestricted Upload of File with Dangerous Type) is harder to evaluate given only the information in the codebase (what is ‘dangerous’? Size? Extension?). One last note on CodeQL concerns the ‘strictness’ of its analysis. While we made a best effort to ensure that all test cases and results collected by CodeQL were accurate, including by manual spot checks, it is possible that across the full corpus of generated programs there may have been edge cases where CodeQL ‘failed safe’, that is, marked something as vulnerable that was not.
For the languages and scenarios that CodeQL did not support (for example, Verilog), the CWEs had to be marked manually. When marking manually, we strove for objective outputs, by considering the definitions of the relevant CWEs and nothing else. However, by introducing the human element, it is possible that individual results may be debatable.
6.1.2 Statistical Validity
We note that the number of samples in each scenario may not be enough to derive statistical conclusions. Unfortunately, due to the ‘manual’ nature of using the GitHub Copilot interface at the time of this study (that is, a human has to request the results), there were limits to the number of collectable samples. We are further hampered by the lack of a definition for the ‘mean prob’ score that Copilot returns with each result. It is difficult to make claims about the statistical significance of all our results, but we believe the empirical findings are nevertheless noteworthy.
6.1.3 Reproducible Code Generation
As a generative model, Copilot’s outputs are not directly reproducible: for the same prompt, Copilot can generate different answers at different times. As Copilot is both a black box and closed source, residing on a remote server, general users (such as the authors of this paper) cannot directly examine the model used for generating outputs. The manual effort needed to query Copilot, plus rate limiting of queries, prohibits efficient collection of large datasets; this impacted and informed the methods we used. Since we ask Copilot to generate only a few lines of code, our hope was that the corpus of possible answers would be included in the requested 25 options. However, this is not guaranteed, considering that Copilot may be re-trained over new code repositories at a later date—probing black-box proprietary systems carries the risk that updates may render them different in the future. As such, to allow this research to be reproduced, we archived all options for every provided prompt.
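A minimal sketch of such archiving is shown below; the released dataset’s actual format may differ, and the record fields and file naming are assumptions for illustration.

import json
import time

def archive_options(scenario_id: str, prompt: str, options):
    # 'options' is assumed to be a list of (completion_text, mean_prob_score) pairs
    # copied from the Copilot interface at collection time.
    record = {
        "scenario": scenario_id,
        "collected_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "prompt": prompt,
        "options": [
            {"completion": text, "mean_prob": score} for text, score in options
        ],
    }
    with open(f"{scenario_id}.json", "w") as fh:
        json.dump(record, fh, indent=2)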
6.1.4 On Scenario Creation
Our experiments cover a range of scenarios and potential weaknesses across three different languages. While the scenarios provide insights into Copilot, they are artificial in that they target specific potential weaknesses. Real-world code is considerably messier and contains larger amounts of context (for example, other functions, comments, and so on), so our setup does not fully reflect the spectrum of real-world software. Subtle variations in the prompts (Section 5.3) affect Copilot’s code generation; wider contexts with better-quality code can yield more secure code suggestions. In the future, examining Copilot’s response to combinations of prompts/scenarios may offer insights into the biases Copilot responds to. Further, the gamut of Copilot languages is vast. We need ways to quantify the limits of models like Copilot when used with different languages—for example, low-level or esoteric languages like x86 assembly, ladder logic, and G-code.
7. Conclusion and Future Work
There is no question that next-generation ‘auto-complete’ tools like GitHub Copilot will increase the productivity of software developers. However, while Copilot can rapidly generate prodigious amounts of code, our conclusions reveal that developers should remain vigilant (‘awake’) when using Copilot as a co-pilot. Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities. While our study provides new insights into its behavior in response to security-relevant scenarios, future work should investigate other aspects, including adversarial approaches for security-enhanced training.
Source and dataset access: Our 89 CWE-based scenarios and the source code of the framework are available at the following URL: https://doi.org/10.5281/zenodo.5225650.
Acknowledgments. This work was supported in part by the National Science Foundation Award #1801495, Office of Naval Research Award #N00014-18-1-2058, and NYU/NYUAD CCS. Opinions, findings, and conclusions are the authors’ own.