Copilot Thoughts
July 05, 2021

The furore surrounding GitHub Copilot is interesting.

I’m no lawyer (nor do I play one on TV), but my feeling is that it may expose a flaw in the FLOSS community’s ideas about ownership of code.

If so, this is a good thing. The flaw (if it exists) has not been created by Copilot. It was already there; it just hadn’t come to light.

Is This OK?

Anyone who’s been coding for a while will have come across the situation where you’ve found some code with a license you can’t use, you’ve used the act of reading (or maybe even debugging) the code to teach yourself the solution to the underlying problem, and then you’ve written new code.

Then maybe you’ve felt uneasy and wondered if you’ve broken the rules.

Maybe all you did was cynically copy & paste and change a few variable names - in which case you probably did break the rules. Maybe though you genuinely rewrote it all from scratch. Maybe after rewriting you pretty much ended up with the same code because - well - that’s the best expression of the underlying solution to the problem you’re trying to solve.

Who Owns What, Exactly?

For any such situation, there’s going to be a blurry line. What did I copy here, and what did I create myself? The implementation? The algorithm? The expression of the algorithm in the context of the particular language I’m using? The implementation in the context of the problem I’m applying it to?

Furthermore, how is this process essentially different from the one undertaken by the author of the GPL’d code?

Can I be sure in any way that they themselves weren’t just re-expressing something that has prior art?

Granularity

To look at it another way:

For any sufficiently small fragment of code, there’s likely to be a canonical way to express it. Taken to an extreme, a single line may well be infinitely rewritable, but one formulation is probably clearer, more compact, or better meets your particular criteria than any other.

In most cases it would be self-evidently ridiculous to assert that the GPL license applied to a body of code actually applies to each line in isolation.

If the line includes variable names, function names, comments, or other incidental metadata, it could be argued that they are not directly related to the pure meaning of that line. They do have meaning and value, but probably only in the context in which the line exists.

These names can be replaced, rewritten, or even randomised; this may obfuscate the meaning of the code, but it doesn’t stop it working.

Once you get to a small enough granularity, the same line of code almost definitely exists in countless other programs, both open and closed source, GPL’d or liberally licensed. The names might be different, but the meaning of the code is the same. The machine code instructions emitted by the compiler will probably be the same.
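As a rough illustration of that last point, here is a minimal Python sketch. Python compiles to bytecode rather than machine code, so treat it as a stand-in for what a compiler would emit. Both functions and all of their names are hypothetical, invented purely for this example: one has descriptive names and a comment, the other has been stripped and given meaningless names.

```python
def clamp(value, lower, upper):
    # Keep value within the inclusive range [lower, upper].
    return max(lower, min(value, upper))


def x9(q1, q2, q3):
    return max(q2, min(q1, q3))


# The raw instruction bytes refer to arguments by position, not by name,
# so renaming things and dropping the comment leaves them unchanged.
same_bytecode = clamp.__code__.co_code == x9.__code__.co_code

# Both versions behave identically for the same inputs.
same_result = clamp(15, 0, 10) == x9(15, 0, 10)

print(same_bytecode, same_result)
```

The names survive only as metadata alongside the compiled code (in `co_varnames`, for instance); the instructions themselves can't tell the two functions apart.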

What Is Knowledge Anyway?

So what exactly are we arguing about here?

If something like Copilot is taking chunks of GPL’d source and pasting them into someone else’s program, how many contiguous lines does it have to paste before there’s a problem? Is there an arbitrary N number of lines that’s ok, where N + 1 is not ok?

If Copilot applied some natural language processing to infer the context that the lines are pasted into, and then automatically renamed the variables (or even rewrote comments) to use words appropriate to the new context, would that now be ok? Same code - different names?

If it randomised the names and stripped all comments, would that be ok?

Maybe we arrive at some formulation that states that a line is ok, but a whole function is not. Are we then allowed to apply the copying process to each line in turn? Refactor the function into multiple smaller functions and use them?

Known Unknowns

This all feels very woolly to me. The code represents a muddle of knowledge, experience, style, and algorithm.

Any assertion of the right to control how each of these things is used in isolation, or even recombined into a larger whole, feels like over-reach.

Worse, it may well be politically dangerous. If an entity can assert their right to apply copyleft to small fragments of code, doesn’t that logically mean that they are claiming ownership of the underlying meaning of those fragments?

Doesn’t that put us into territory where another entity can assert ownership of the underlying meaning of other fragments and choose to patent them or in other ways suppress their use by others? Isn’t that sort of what the Free/Libre side of the community is trying to avoid in the first place?

Nothing Is Simple

I’m not claiming any great insights here, and certainly not offering any solutions.

It just seems to me that the problem is a lot knottier than some people are making out.

It’s not self-evidently the case that what GitHub Copilot is doing is breaking the rules, any more than it is clear that what happens when I read someone’s GPL’d code and learn something from it is following the rules…
