Future Tech

Coders' Copilot code-copying copyright claims crumble against GitHub, Microsoft

Tan KW
Publish date: Tue, 09 Jul 2024, 12:30 PM

Claims by developers that GitHub Copilot was unlawfully copying their code have largely been dismissed, leaving the engineers for now with just two allegations remaining in their lawsuit against the code warehouse.

The class-action suit against GitHub, Microsoft, and OpenAI was filed in November 2022, with the plaintiffs claiming the Copilot coding assistant was trained on open source software hosted on GitHub and as such would suggest snippets from those public projects to other programmers without care for licenses - such as providing appropriate credit for the source - thus violating the original creators' intellectual property rights.

Microsoft owns GitHub and uses OpenAI's generative machine-learning technology to power Copilot, which auto-completes source code for engineers as they type out comments, function definitions, and other prompts.

Ergo, the plaintiffs are unhappy that, in their view, portions of their copyrighted open source code might be provided - copied, rather - by Copilot to other programmers to use, without due credit given and other requirements of the original licenses honored.

The case started with 22 claims in all, and over time this has been whittled down as the defending corporations moved to have the accusations thrown out of court, requests that Judge Jon Tigar has mostly granted.

In an order [PDF] unsealed on Friday, July 5, Judge Tigar ruled on yet another batch of the plaintiffs' claims, and overall it was a win for GitHub, Microsoft, and OpenAI. Three claims were dismissed as requested and just one allowed to continue. According to a count by Microsoft and GitHub's lawyers, that leaves just two allegations standing in total.

The most recently dismissed claims were fairly important, with one pertaining to infringement under section 1202(b) of the Digital Millennium Copyright Act (DMCA), which basically says you shouldn't remove crucial "copyright management" information without permission. In this context, that means details such as who wrote the code and the terms of use, as licenses tend to dictate.

It was argued in the class-action suit that Copilot was stripping that info out when offering code snippets from people's projects, which in their view would break 1202(b).

The judge disagreed, however, on the grounds that the code suggested by Copilot was not identical enough to the developers' own copyright-protected work, and thus section 1202(b) did not apply. Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.

The plaintiffs won't be able to offer a new section 1202(b) DMCA copyright claim as Judge Tigar dismissed the allegation with prejudice.

The anonymous programmers have repeatedly insisted Copilot could, and would, generate code identical to what they had written themselves, which is a key pillar of their lawsuit since there is an identicality requirement for their DMCA claim. However, Judge Tigar earlier ruled the plaintiffs hadn't actually demonstrated instances of this happening, which prompted a dismissal of the claim with a chance to amend it.

The amended complaint argued that unlawful code copying was an inevitability if users flipped Copilot's anti-duplication safety switch to off, and also cited a study into AI-generated code in an attempt to back up the plaintiffs' position that Copilot would plagiarize source, but once again the judge was not convinced that Microsoft's system was ripping off people's work in a meaningful way.

Specifically, the judge cited the study's observation that Copilot reportedly "rarely emits memorized code in benign situations, and most memorization occurs only when the model has been prompted with long code excerpts that are very similar to the training data."

"Accordingly, plaintiffs’ reliance on a study that, at most, holds that Copilot may theoretically be prompted by a user to generate a match to someone else’s code is unpersuasive," he concluded.

The DMCA argument was, as we said, one of three claims just now tossed out. The other two were claims for unjust enrichment and punitive damages, though not with prejudice, meaning it's possible these claims could be amended and resubmitted. Until then, however, that leaves the standing claims at just two: an open source license violation allegation, and a breach of contract complaint that was previously reintroduced after being dismissed initially.

"We firmly believe AI will transform the way the world builds software, leading to increased productivity and most importantly, happier developers," GitHub said in a statement to The Register.

"We are confident that Copilot adheres to applicable laws and we’ve been committed to innovating responsibly with Copilot from the start. We will continue to invest in and advocate for the AI-powered developer experience of the future."

We also approached all parties in the lawsuit and their legal teams for comment.

Both sides squabble during discovery

Also filed for the case on Friday was a joint case management statement [PDF] chock full of various grievances and complaints each side made against the other over the discovery process, with both saying the other hasn't given up all the documents they were supposed to.

The plaintiffs accuse the defendants of deliberately dragging their feet, saying the documents that have been produced so far were already publicly available or should have been disclosed a long time ago. Much of the focus is on Microsoft and its single submitted document so far, something that the plaintiffs say makes no sense.

"That Microsoft employees were involved in many of these GitHub-sourced conversations demonstrates that Microsoft's production of one document thus far has been a function of delay and obfuscation, and nothing else," the anonymous developers said. "Microsoft has known but failed to disclose that its employees were directly involved in the creation, operation, and management of Copilot and its underlying models."

The lack of documents from the Windows maker is apparently down to "technical difficulties" in collecting Slack messages, an explanation the plaintiffs aren't convinced by. Similarly, the programmers say that OpenAI should have also submitted lots more information by now, pointing out that it had submitted tens of thousands of documents as a defendant in the Authors Guild lawsuit.

Microsoft and GitHub, however, counter that the plaintiffs have been asking for way too much info, accusing them of having "failed to pursue relevant discovery of these topics efficiently and in good faith." One of these topics includes Microsoft's 2018 acquisition of GitHub.

Meanwhile, OpenAI says the plaintiffs haven't been following proper procedure in respect to asking for emails, saying it can't (or won't) produce any until it receives a correct request.

The corporate trio also say that the dismissal of the above DMCA copyright claim has fundamentally changed the case and argue that the scope of discovery should now be narrowed. This is something the plaintiffs dispute on the grounds that the open source license violation claim covers pretty much the same documents that the DMCA claim would have brought up.

GitHub, Microsoft, and OpenAI say the plaintiffs haven't properly responded to their discovery requests, arguing that the documents the plaintiffs produced include "JSON files, a blank HTML file, emails without any metadata, and improperly redacted PNG files of Slack and other messages."

The plaintiffs have asked for more time for discovery, and although the defendants argue this isn't necessary, the three tech titans say they're open to a "reasonable extension." ®

https://www.theregister.com//2024/07/08/github_copilot_dmca/
