A recent paper published by researchers from Stanford University, Cornell University, and West Virginia University reveals that Meta's Llama 3.1 AI model can reproduce large amounts of copyrighted book content verbatim, posing a potential legal risk for the tech giant. The study found that the Llama 3.1 70B model could reproduce up to 42% of the text from "Harry Potter and the Philosopher's Stone" during testing, far surpassing the 4.4% achieved by the first-generation Llama model.

AI models such as OpenAI's ChatGPT and Meta's Llama are typically trained on massive datasets so that they learn patterns in language and can generate new text. However, the key finding of this research is that Meta's Llama model appears not just to learn language patterns but to have almost "fully memorized" certain books, such as "Harry Potter" and "1984." Mark Lemley, a technology law expert at Stanford, stated that if an AI can generate complete excerpts from its training data, it no longer qualifies as a "transformative work" based on learning but instead resembles a "giant .ZIP file" containing copyrighted works, allowing users to copy freely.

Copyright Controversy: Verbatim Reproduction vs. Learning Patterns

In testing AI models from companies like OpenAI, DeepSeek, and Microsoft, Lemley's research team discovered that Meta's Llama was the only model capable of accurately reproducing book content. Besides the first book in the "Harry Potter" series, the model had also memorized significant portions of F. Scott Fitzgerald's "The Great Gatsby" and George Orwell's "1984."

The use of copyrighted materials to train Meta's AI has been highly controversial. The company is currently facing multiple copyright lawsuits, including one filed by notable authors (such as comedian Sarah Silverman), accusing Meta's models of being trained using the illegally obtained "Books3" dataset, which contains nearly 200,000 copyrighted publications. Court documents show that a Meta engineer once commented, "It felt wrong downloading torrents with a company laptop."

Lemley estimates that if only 3% of the content in the "Books3" dataset is deemed infringing, Meta could face statutory damages of nearly $1 billion, not including any additional award based on Meta's profits. If the infringement ratio is higher, Meta's potential legal liability would be even greater.
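
The arithmetic behind a figure of this size can be sketched as follows. The article does not state the per-work amount used, so this sketch assumes the maximum US statutory damages for willful infringement, $150,000 per work, and rounds the dataset up to 200,000 works. It is an illustrative check, not a reconstruction of Lemley's own calculation, and the function name is hypothetical.

```python
# Back-of-the-envelope check of the "nearly $1 billion" estimate.
BOOKS3_WORKS = 200_000          # "nearly 200,000" publications, rounded up
INFRINGING_SHARE = 0.03         # the 3% scenario described in the article
MAX_DAMAGES_PER_WORK = 150_000  # ASSUMPTION: US cap for willful infringement, in USD


def statutory_exposure(works: int, share: float, per_work: int) -> int:
    """Return total statutory damages in USD for a given infringing share."""
    infringing_works = round(works * share)  # whole number of works
    return infringing_works * per_work


exposure = statutory_exposure(BOOKS3_WORKS, INFRINGING_SHARE, MAX_DAMAGES_PER_WORK)
print(f"${exposure:,}")
```

Under these assumptions, 6,000 works at $150,000 each comes to $900 million, which is consistent with "nearly $1 billion." Because the total scales linearly with the infringing share, doubling the share to 6% would double the exposure.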

Legal Experts Shift Stance, Meta Refuses Comment

Notably, Lemley himself represented Meta in earlier generative AI copyright litigation (Kadrey v. Meta Platforms). However, while leading this research on AI models' memorization and reproduction of copyrighted content, he announced earlier this year that he would no longer represent Meta, in protest at certain conduct by the company and its CEO, Mark Zuckerberg. Although he previously believed Meta should win, the new research findings seem to have changed his view.

Meta declined to comment on Lemley's latest research findings.