Monday, May 20, 2024

OpenAI Lawsuit: A Dubious Argument On A Monumentally Complex Question

News came out this week about a lawsuit against OpenAI, the developer of ChatGPT. The company is being sued by two authors who claim that the LLM artificial intelligence violated copyright laws by training on their work. It’s an interesting case, certainly, and joins a litany of other lawsuits filed against artificial intelligence platforms in recent months. Were I a betting man, I would wager that the case is not going to go anywhere on its own merits for some very clear reasons: the lawsuit, at least at face value to me as a non-lawyer who has spent a lot of time reading law stuff, seems to lack some of the necessary ingredients. But the questions of the ethics here? Far murkier.

OpenAI, the developer of ChatGPT, is being sued by two authors in an attempted class action lawsuit. The lawsuit alleges that the company violated the authors’ copyrights in using their works to train the model.

Artificial Intelligence vs. Copyright

Recall that the US Copyright Office has ruled that works generated by AI can’t be copyrighted, for the most part (the agency offered clarification around a kinda-exception that makes very little sense). While it is easy enough to apply this standard to images created in Midjourney, it’s harder to apply to text generated by ChatGPT. I frequently use ChatGPT for research, for example, in part because I’ve gotten pretty good at it and I know what to ask it (and how to ask it), but also because I’ve found that the model kinda sucks at writing complete prose. So, since I have never been able to use its finished product in this department without editing it substantially for style and clarity, this would never even be a question for me. A lot of ChatGPT’s prose emanates “Freshman Seminar Paper After Being Laboriously Edited By A Writing Tutor” vibes.

Beyond the quality question, though, it’s harder to prove unequivocally that prose was created by ChatGPT than it is for images created by Midjourney, which often bear far more telltale signs of robot craft. (More on that subject here). Of course, some authors have tried using ChatGPT to write whole stories, and have even made some money doing it. Whether those stories are any good or not, ahem, is a question outside of the scope of this article.

To this end, I enjoy using AI as a tool that helps make my job easier. I do not view it as a replacement for being a professional writer. The only reason professional writers are threatened by generative AI models is that the internet is increasingly run by people who don’t care about the integrity of information, truth, or democracy, which means it is increasingly inundated with poor-quality, AI-generated content with no controls on veracity, authorship, or content. This is also a story for another day, but it’s worth mentioning in the context of the lawsuit as far as the ethics we should be thinking about.

So. Did An LLM “Steal” The Works In Question?

In a lawsuit like this, the plaintiffs usually have to demonstrate a couple of things. It is easy to figure out whether the AI has indeed been trained on their works. You can simply ask it whether it’s familiar with the work, and to summarize it. Demonstrating that the AI has copied their work, on the other hand, might well be impossible. AI models like GPT-4 do not store direct copies of the data on which they are trained. Instead, they extract patterns from the data and generate new outputs based on these patterns.
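For the programmers in the audience, here is a toy sketch of what “extracting patterns” can mean. It is emphatically not how GPT-4 works (that is a neural network with billions of parameters, and everything named below is my own hypothetical illustration). It only shows the general idea: the trained “model” holds statistics about which word tends to follow which, rather than a copy of the text it read.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count which word follows which. Only the counts are kept, not the text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, length=8, seed=0):
    """Produce new text by repeatedly sampling a likely next word."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = counts.get(word)
        if not followers:
            break  # this word was never followed by anything in training
        choices, weights = zip(*followers.items())
        word = rng.choices(choices, weights=weights)[0]
        output.append(word)
    return " ".join(output)


# Hypothetical stand-in for a training corpus.
corpus = "the cat sat on the mat and the dog sat on the rug"
model = train_bigram(corpus)

print(dict(model["the"]))      # the words seen after "the", with their counts
print(generate(model, "the"))  # a new word sequence sampled from those counts
```

One caveat that cuts against my own argument: with a corpus this tiny, even a toy like this can spit back stretches of its training text word for word. Whether and how often the real thing does that is exactly what the lawyers will be fighting over.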

This is a critical point, because it’s currently a grey area in copyright law: Does learning from data constitute copying in the traditional sense?

I personally think it doesn’t, and that’s the argument I expect OpenAI will make: that the model isn’t actually copying or reproducing large sections of the text verbatim. I expect the court will agree. How much of a work is reproduced, and to what end, is one of the key factors in the fair use doctrine, which is often invoked to protect artists who create parodies of copyrighted imagery or texts. Much as I openly hate Jamie Dimon, for comparison, I appreciate that he didn’t try suing the pants off Alex Schaefer, who painted a series of works featuring Chase Bank branches in flames. Nor did Guy Fieri, to my knowledge, sue the genius behind the fake menu lampooning the donkey sauce-blasted platters at his since-closed Times Square restaurant. These are both examples of parody, of course.

How the AI got a hold of the texts is another question we can consider here. Does it matter whether OpenAI paid for the book, checked it out from a local library, or “stole” it by, say, downloading it from a source like LibGen? Downloading copyrighted material from sources like LibGen is generally illegal. But proving the illegality of that seems outside of the scope of this lawsuit.

Personally, I believe that we should make sure that as much information as possible is free and easily accessible, whether I’m thinking about Elsevier, JSTOR, or the Little Free Library down the block. Speaking of libraries? Pretty much every book that has ever been written is in some public library somewhere. This seems to undermine, to some degree, the question in the case of whether the training content was “stolen” to begin with. It is immaterial to the robot whether the book was obtained from an “illicit” source like LibGen, or whether it was an eBook checked out from a local public library, but there is an ethical question if its trainers did in fact “steal” the book.

Another interesting question here that seems to render the claim of theft somewhat less relevant is the question of how much of the product was “copied,” and to what end. For example, if I walked around my neighborhood and stopped the guy selling paletas on the bike cart, and I asked him what he could tell me about the 1955 novel Pedro Páramo, he might be able to recite entire passages from the book verbatim. Or, he might simply be familiar with it as a novel. Does this matter for the purpose of copyright infringement? Not to get too academic or flowery here, but I don’t think so. The transmission of knowledge cannot simply be restrained by statute, nor by increasingly draconian DRM (even if the law would love to ensure that said DRM is as powerful and legally protected as it can be). That idea is at least supported by the singular characteristic of our boundless human curiosity, and the plasticity of our respective brains (and, indeed, our collective consciousness as a society).


Quantifying Damages

Key in a lawsuit like this is also successfully convincing the court that you suffered damages personally as a plaintiff. The paleta guy isn’t hurting the estate of Juan Rulfo by telling me about the novel, whether he’s reciting whole passages to me from memory, or just giving me a plot synopsis. Summarizing the gist of something isn’t stealing it. The question of how the novel was obtained in the first place seems outside the scope of the case, but I’d be willing to bet that this lawsuit is probably going to be more beneficial for the authors as a matter of visibility than ChatGPT would be injurious to their livelihoods. That’s not to say that I think they don’t have an interesting argument, of course.

For what it’s worth, there is a fascinating array of legal precedent in this area, some of which goes back to the era of telephone books (Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 [1991]) and Betamax (Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 [1984]). The latter case had huge implications for the later showdown over .mp3 distribution through services like Napster and LimeWire, and extraordinarily complex and clever legal arguments continue to be used in defending alleged facilitators of piracy against litigation by software and game developers. In MGM Studios v. Grokster, for example, the Ninth Circuit held in 2004 that a company can’t be held liable for its users using its product for illegal purposes if the product has substantial legitimate uses and the company doesn’t know about specific infringements, just like Kia wouldn’t be able to sue the manufacturers of USB cables. But the Supreme Court reversed the Ninth in 2005, holding that a company that distributes a product with the aim of promoting its use for infringement can be held liable for the infringement that results.

Courtroom sketch from the possible upcoming trial of two authors, who have sued OpenAI, alleging that ChatGPT was trained on their works, in violation of their copyright protection.

Conclusion: Seeking The Regulation That Is Needed

Irrespective of the outcome of the case, we need to have a nuanced and sweeping conversation about regulating AI. Before we even get to the question of what the regulation is going to look like, the first question is then who’s going to lead it, and I hope the answer is not “big tech.” (Congressional Democrats are unlikely to be leading the charge, and Republicans don’t believe in regulation unless it involves restricting what you can do in your energy markets or investment management, apparently).

In general, I trust tech bros and tech companies slightly less than I trust bankers, building inspectors, and human resources professionals, which is to say I don’t have a great deal of faith that they are committed to doing the right thing. Tech is, after all, a profoundly messed up sector run by some profoundly awful human beings, like Elon Musk, Mark Zuckerberg, Larry Ellison, Sundar Pichai, and ever so many more like them, who don’t really care about whether they destabilize American democracy or the integrity of truth itself, just so long as they make a lot of money in the process.

OpenAI’s CEO Sam Altman has, for his part, at least been quite vocal in his effort to be present in these conversations, even publicly pleading for regulation of AI. And, if you want to make the better part of a cool quarter million dollars in salary, OpenAI is currently hiring someone to run their policy advocacy efforts in DC. I am desperately hoping that whatever heavily credentialed technocrat they hire (look, I didn’t say I wasn’t going to try and apply!) at least has some basic level of human empathy and emotional intelligence when thinking about this stuff, which is probably going to define a good portion of what happens with technology for the rest of the 21st century.

Some specific ideas I personally had around what regulation might look like:

  • Transparency in training. I find myself thinking about this all the time, mostly with regard to Midjourney and how bad it is at imagining urban infrastructure (plus an upcoming article with an equity/ethics focus). But it has come up a lot, too, in academic and journalistic contexts, going back to books like Cathy O’Neil’s Weapons of Math Destruction, which points out how our algorithmically-obsessed society allows robots to make decisions that impede racial and economic justice and, indeed, undermine democracy itself. Models should be transparent in how they are trained to ensure that trainers aren’t excluding specific viewpoints or, for example, making LLMs that are racist or misogynistic.
  • User privacy protection. This has come up a lot, whether we’re thinking about the Samsung workers who kinda-sorta leaked trade secrets to ChatGPT, or whether we’re generally thinking about what kinds of things we definitely don’t want technology doing (stealing our private data and disseminating it in whatever way).
  • Require webpages to provide indicators when content has been generated substantially or entirely by artificial intelligence. Clearly, I don’t think we need to get rid of AI, but I do think we need a level playing field. This is especially significant for things like journalism and other content-heavy fields that are being completely derailed by unchecked AI content generation. Over the past few years, we’ve seen algorithmically-generated content displace human-generated content. It’s by and large of a much lower quality, and it by and large exists to create passive advertising revenue. How about we don’t?
  • Interoperability and Operational Transparency. It doesn’t make sense for everybody and their moms to have their own AI language model running on a server in their basement. But nor should there be One AI To Rule Them All: competition is, within reason and within a reasonably regulated market, a good thing. And this happens by creating even the loosest sets of standards around things like APIs, data formats, and more.

Separate stuff I’d wish for might be transparency around how the actual computing gets done, as AI has a truly obscene carbon footprint, and we’re going to have to talk about that at some point. I’m hoping that one day Congress will enact a carbon tax, which would solve most of those problems through Mr. Market. And that’s the approach I hope will work for most of this once we figure out the general guidelines, which I hope will create a sandbox for AI to play in, productively and collaboratively, but under the watchful eye of a mostly-hands-off parent (i.e. the State).


ChatGPT isn’t taking the stand in a trial, but its progenitors are– over the question of whether the artificial intelligence large language model “stole” copyrighted content from two authors.

The “proper” regulatory attitude

I am thinking back to my extensive study of emergent fintech, in which many companies were proactive about their approach to regulation, rather than taking the “forgiveness rather than permission” approach that many tech companies take. I think the latter approach has a higher probability of resulting in bad regulatory outcomes, because it forces regulators into a reactive position rather than enlisting them to help solve the problems of regulation.

I always say that you’re more exposed to legal liability if you piss people off, which is why I suspect Donald Trump keeps getting convicted while many of his quieter colleagues have not. Tech companies have likely avoided the worst of antitrust regulation because Big Tech are huge Democratic donors and Republicans think monopolies are just a free market phenomenon, I guess, but the absolute fury at the irresponsibility of Big Tech seems to be a wholly bipartisan phenomenon. Ultimately, you’re far less likely to piss people off (and less likely to be sued by them) if you talk to them about what you’re thinking, and what you’re trying to accomplish. Companies should take an “early and often” approach to this, and regulators should reward it by not creating meaningless bureaucratic obstacles, focusing instead on things like consumer protection and ensuring the functioning of fair and relatively free markets.

I got called into the principal’s office a couple of years ago while writing a piece on the same subject, but in utilityworld, and the idea was the same. In it, one of the headers said that “A Regulatory No-Mans Land Sparks No Innovation,” and that’s kind of the gist here. I doubt that Mona Awad and Paul Tremblay are going to slay OpenAI in court with their lawsuit, but they and others will continue to force us to have challenging and utterly critical conversations about how to make sure AI works for us and works for a free, democratic, and prosperous society at large.

(This article was written by a human being. Midjourney prompts created the images. They cannot be copyrighted, I guess).

Nat M. Zorach

Nat M. Zorach, AICP, MBA, is a city planner and energy professional based in Detroit, where he writes about infrastructure, sustainability, tech, and more. A native of Lancaster, Pennsylvania, he attended Grinnell College in Iowa, the Kogod School of Business at American University, the POCACITO transatlantic program, the SISE program at the University of Illinois Chicago, and he is also a StartingBloc Social Innovation Fellow. He enjoys long walks through historic, disinvested Rust Belt neighborhoods at sunset. (Nat's views and opinions are his own and do not represent those of his employer).
