TL;DR: GitHub Copilot trains on GPL code and its non-permissive-license filters don’t actually work, while we at Codeium have removed GPL-licensed code from our training data, guaranteeing peace of mind to our users.
To start, what is GPL (the GNU General Public License)? By definition, open source software (OSS) is, well, open. But just because some code is public does not automatically mean it can be used in a commercial product without conditions. OSS licenses define acceptable use of a particular library. Permissive licenses (e.g., MIT, BSD, Apache) let you legally use the code for nearly anything, including closed-source commercial products, as long as you keep the required notices. Non-permissive (copyleft) licenses such as GPL require that software derived from the code be distributed under the same license, so you cannot fold that code into a proprietary product without the copyright holder’s consent.
There are plenty of reasons why an OSS maintainer would choose a non-permissive license over a permissive one or vice versa, but that’s not the point: these are simply the terms the law enforces. A company faces legal ramifications when its developers violate the GPL, regardless of intent.
But generative AI and LLMs have muddied things. Clearly a developer copy-pasting GPL code without consent is bad and grounds for legal action, but what about a generative code model? Is it wrong for such a model to “learn” from this data? The argument for doing so is clear: GPL-licensed OSS is some of the highest quality code that is publicly available, and as with any machine learning model, better training data almost always means a better LLM. The argument against is perhaps less clear: researchers say LLMs rarely spit out training data verbatim unless prompted adversarially, but in principle they could. In that case, who is responsible for the infringement? The developer of the LLM, or the user who unknowingly accepts the LLM’s suggestion and commits the code to their team’s codebase? Honestly, there is no clear answer, and that’s the scary part. No user or company should be exposed to legal action, even potentially, just for using an AI code assistant.
This is why GitHub Copilot should be scary, especially to enterprises. GitHub Copilot uses models trained on non-permissively licensed code (and perhaps even private user code), and is being sued over this exact practice.
To show that GitHub Copilot trains on non-permissively licensed code, we simply disable its post-generation filters and see what GPL code we can generate with minimal context.
We can very quickly generate the GPL license for a popular GPL-protected repo, such as ffmpeg, from a couple lines of a header comment.
But if that isn’t convincing enough, let’s take a code snippet from an LGPL OSS library and see if GitHub Copilot will regurgitate it. Tim Davis claims that GitHub Copilot regurgitates their code:
So let’s take a particular function from that LGPL-licensed repository, copied below:
    csi cs_gaxpy (const cs *A, const double *x, double *y)
    {
        csi p, j, n, *Ap, *Ai ;
        double *Ax ;
        if (!CS_CSC (A) || !x || !y) return (0) ;
        n = A->n ; Ap = A->p ; Ai = A->i ; Ax = A->x ;
        for (j = 0 ; j < n ; j++)
            for (p = Ap [j] ; p < Ap [j+1] ; p++)
                y [Ai [p]] += Ax [p] * x [j] ;
        return (1) ;
    }
Indeed, with just a function header, GitHub Copilot completely regurgitates the code:
It should be worrisome how easily GitHub Copilot spits out GPL code without being prompted adversarially, independent of what they and other researchers claim.
GitHub Copilot has post-generation filters that they claim will catch these potential issues, so let’s test them out!
With these filters enabled (they are disabled by default), GitHub Copilot produces nothing after a couple of lines of regurgitation. Not just no license-violating suggestions, but no suggestions at all:
So yes, no non-permissive code suggestions are produced after a couple lines, but also, nothing is. This seems ultra conservative, since we would expect an AI to have at least some helpful suggestions here, and it is jarring to get no suggestions at all. We have even personally hit a lot of cases where, with these filters enabled, we were not getting any suggestions even for non-GPL-licensed code! GitHub probably knows that these non-permissive license filters degrade performance, which is why none of them are enabled by default or publicized actively. We are pretty confident that most Copilot users have exposed themselves to legal risk since they are unaware of these filters.
But at least the filters seem promising at doing what they say they do (at the expense of large service degradation), right? The next natural question obviously is “how good are they actually?” If you switch around some statements like variable declarations or rename a variable, that is still plagiarism, and still a violation of the non-permissive licenses. Is GitHub Copilot doing some smart syntax parsing and logic, or is it just some naive exact string matcher?
To test this, we tried to get GitHub Copilot to regurgitate the GPL code with the filters on by making small edits that leave the logic untouched, and GitHub Copilot happily produced the GPL code:
We actually thought we would have to get much more adversarial, so this was surprising. It is not unreasonable for someone to declare the variables differently (which then somehow produces the declarations of the remaining variables), and to rename a variable due to local naming conventions. We did not have to manually write any of the actual logic for this function – that was all provided by GitHub Copilot.
It seems like the current GitHub Copilot non-permissive license filters are doing a somewhat exact match for the previous 150 characters or so of (generated code + code before cursor), and fully suppressing any matches. But our point is not that this is a bad filter (which it is) – our point is that this proves that any post-generation filter is an imperfect solution to the licensing problem, and GitHub Copilot, no matter how advanced they make their filters, will still carry this legal risk.
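To make the weakness concrete, here is a minimal sketch of the kind of exact-match filter the behavior above suggests. It is a guess at the general mechanism, not GitHub’s actual implementation: the names `build_index` and `should_suppress`, the `LICENSED` snippet, and the 40-character window (shortened from the roughly 150 characters estimated above so the demo stays small) are all assumptions made for illustration.

```python
WINDOW = 40  # shortened for the demo; the text guesses at roughly 150 characters


def build_index(licensed_sources, window=WINDOW):
    """Collect every `window`-character substring of the known licensed code."""
    index = set()
    for src in licensed_sources:
        for start in range(max(len(src) - window + 1, 1)):
            index.add(src[start:start + window])
    return index


def should_suppress(code_before_cursor, generated, index, window=WINDOW):
    """Suppress when the last `window` characters of prefix + suggestion match exactly."""
    tail = (code_before_cursor + generated)[-window:]
    return tail in index


# Hypothetical stand-in for a licensed snippet (not real licensed code).
LICENSED = (
    "total = 0\n"
    "for item in items:\n"
    "    total += item.price * item.qty\n"
    "return total\n"
)
INDEX = build_index([LICENSED])

# Verbatim regurgitation: the trailing window matches, so it is suppressed.
verbatim_blocked = should_suppress(LICENSED[:20], LICENSED[20:60], INDEX)

# Same logic with one variable renamed: no window matches, so it slips through.
RENAMED = LICENSED.replace("total", "acc")
renamed_blocked = should_suppress(RENAMED[:20], RENAMED[20:60], INDEX)

print(verbatim_blocked, renamed_blocked)
```

In this sketch the verbatim continuation is blocked while the renamed one passes, even though the two are logically identical. Catching the second case would require normalizing identifiers or comparing syntax trees, and even then some rewrites would get through.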
Meanwhile, we at Codeium are big fans of the OSS community, and don’t intend to profit off of anyone’s hard work, legal worries aside. This is one of the reasons we have been giving Codeium away for free, but at the same time, we did not want to be just another OpenAI wrapper product. We wanted to have a say in our data and training so that we could create models that align with our values. From early conversations with the open source community, it was clear GPL code had to go, and we have put in this work.
This was harder than we thought. First, it takes a nontrivial amount of legwork to build all of the data collection and training pipelines needed to train your own model. That is why very few generative AI companies actually do this, and at least for code, any open source models that you could bootstrap from are also trained on non-permissively licensed code (e.g., CodeGen). Second, we found that it was easy to remove repos with explicit GPL licenses from our training data, but the sad reality is that many other public repos have very clearly copy-pasted code from GPL-licensed repos without being explicitly GPL-licensed themselves. To actually get the result we want, we needed to remove that code as well, since the model wouldn’t know the difference during training. To this end, we implemented a bunch of string-based filters (e.g., removing any repo that contains the string “General Public License”).
This isn’t perfect, but it goes a really long way, as the examples in the next section will show.
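As an illustration of what a string-based filter of this kind might look like, here is a minimal sketch. It is not Codeium’s actual pipeline: the function names, the in-memory `REPOS` layout, and the single marker pattern are hypothetical, and only the “General Public License” string comes from the description above.

```python
import re

# Hypothetical marker list; the text only names "General Public License".
LICENSE_MARKERS = [
    re.compile(r"general\s+public\s+license", re.IGNORECASE),
]


def repo_is_tainted(files):
    """Return True if any file's text contains a license marker.

    `files` maps a path to that file's text content.
    """
    return any(
        marker.search(text)
        for text in files.values()
        for marker in LICENSE_MARKERS
    )


def filter_repos(repos):
    """Keep only repos in which no file mentions a marker string."""
    return {name: files for name, files in repos.items() if not repo_is_tainted(files)}


# Toy corpus: one clean repo, one explicitly GPL repo, and one repo with an
# MIT license file that nonetheless carries a GPL header in a source file.
REPOS = {
    "mit-lib": {
        "LICENSE": "MIT License",
        "lib.py": "def add(a, b):\n    return a + b\n",
    },
    "gpl-tool": {
        "COPYING": "GNU GENERAL PUBLIC LICENSE\nVersion 3",
    },
    "copied-code": {
        "LICENSE": "MIT License",
        "solver.c": "/* Licensed under the GNU General Public License */",
    },
}

kept = filter_repos(REPOS)
print(sorted(kept))
```

Dropping the whole repo on any match is deliberately aggressive: it also removes repos that merely mention the license, and because “Lesser General Public License” contains the same phrase, LGPL code is caught too. It still misses copied code whose license header was stripped.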
Let’s revisit the examples that GitHub Copilot failed on:
Because Codeium hasn’t seen the GPL license before, it doesn’t reproduce it and instead generates some general legalese text:
More importantly, Codeium produces non-license-violating code suggestions. We can test how readily Codeium produces the GPL code by typing out the GPL code verbatim and checking when the suggestions start to match what comes next:
Without enough context, Codeium provides useful, but not license-violating, suggestions in small snippets. Only with enough context, with all variables defined and an understanding of which variable is used for what (e.g., j for the outer loop and p for the inner loop), does Codeium’s intrinsic understanding of sparse matrix-vector multiplication coincidentally produce exact suggestions. Codeium clearly has not seen the GPL code, as evidenced by its not knowing about the CS_CSC check and not adding otherwise random comments. The key is that it takes a lot of verbatim user input from the GPL-licensed code (i.e., behaving adversarially by typing the GPL code character by character) to get Codeium to coincidentally suggest code that matches licensed code.
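For readers unfamiliar with the operation, the sketch below accumulates A·x into y for a matrix stored in compressed sparse column (CSC) form, written from the textbook definition of that format. The names `csc_gaxpy`, `col_ptr`, `row_idx`, and `values` are our own choices for illustration. It shows why, once the loop variables and arrays are fixed, there are few natural ways to write the loop nest.

```python
def csc_gaxpy(n_cols, col_ptr, row_idx, values, x, y):
    """Accumulate A @ x into y in place for a matrix A stored in CSC form.

    Column j's nonzeros live at positions col_ptr[j] .. col_ptr[j + 1] - 1
    of `row_idx` (their row numbers) and `values` (their numeric values).
    """
    for j in range(n_cols):
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += values[p] * x[j]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0]] stored column by column.
col_ptr = [0, 1, 2, 3]
row_idx = [0, 1, 0]
values = [1.0, 3.0, 2.0]

result = csc_gaxpy(3, col_ptr, row_idx, values, x=[1.0, 2.0, 3.0], y=[0.0, 0.0])
print(result)  # [7.0, 6.0]
```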
Codeium’s suggestions might not be perfect, or even always correct, but they aren’t violating licenses and are still available where you expect to receive them. Because Codeium does not train on non-permissively licensed code, all Codeium users get peace of mind by default, without needing to enable some hidden settings or worrying about accidentally passing through post-generation filters.
We know the work isn’t done. We are committed to improving our data sanitization and filtering processes and to maintaining a fresh training dataset (with up-to-date license metadata). We’re also going to take this approach to removing potentially insecure code practices from our training data. This is possible because we are one of the very few companies building AI applications in a fully integrated manner, independent of OpenAI: the training, the models, the serving, the integrations, and the product.
We do not want to expose developers who write code for commercial purposes to this legal risk. This is why our Codeium for Enterprises offering makes sense for companies, and coupled with self-hosting for unbeatable data security, it is clear why it is already trusted by companies in industries such as fintech, defense, and biotech. If you work for a company that wants this AI productivity boost without the security and legal risks, contact us.