Chip giant NVIDIA has become embroiled in a legal dispute over the source of its AI model training data. An amended complaint newly filed in a California court makes striking allegations: NVIDIA is accused of proactively contacting Anna’s Archive, a well-known pirated e-book site, to obtain millions of copyrighted books in order to stay ahead of the competition.

The plaintiffs, a group of authors including Abdi Nazemian, claim that under pressure to deliver for its 2023 developer conference, members of NVIDIA's internal strategy team directly asked Anna’s Archive what resources it could provide and expressed interest in including them in the pre-training of large language models (LLMs). According to the complaint, even though Anna’s Archive made clear that its collections had been illegally obtained, NVIDIA's management gave the green light within a week to continue the project, thereby gaining access to about 500TB of data.

In addition to Anna’s Archive, the complaint says NVIDIA may have used data from other "shadow libraries" such as LibGen, Sci-Hub, and Z-Library. The company is also accused of distributing tools to enterprise customers that help them automatically acquire datasets containing pirated works, and therefore faces allegations of "vicarious infringement" and "contributory infringement." NVIDIA had previously tried to defend itself by citing "fair use," but with the release of key evidence such as internal emails, the case now appears to be trending in favor of the copyright holders.

Key points:

  • ⚖️ Class-action lawsuit: Several renowned authors jointly accuse NVIDIA of using pirated books on a large scale to train its core models, such as NeMo and Megatron.

  • 📑 Proactive contact with pirate sources: Internal emails show that NVIDIA actively contacted Anna’s Archive, even asking how it could pay for high-speed download access to 500TB of data.

  • 🛡️ Infringement claims escalated: The plaintiffs accuse NVIDIA not only of infringing copyright in its internal training, but also of providing customers with automated scripts, indirectly facilitating the further spread of pirated data.