A new research paper alleges that large language models may be inadvertently exposing significant portions of their training data through a technique the researchers call “extractable memorization.”
The paper details how the researchers developed methods to extract up to gigabytes worth of verbatim text from the training sets of several popular open-source natural language models, including models from Anthropic, EleutherAI, Google, OpenAI, and more. Senior research scientist at Google Brain, CornellCIS, and formerly at Princeton University Katherine Lee explained on Twitter that previous data extraction techniques did not work on OpenAI’s chat models:
When we ran this same attack on ChatGPT, it looks like there is almost no memorization, because ChatGPT has been “aligned” to behave like a chat model. But by running our new attack, we can cause it to emit training data 3x more often than any other model we study.
The core technique involves prompting the models to continue sequences of random text snippets and checking whether the generated continuations contain verbatim passages from publicly available datasets totaling over 9 terabytes of text.
Gaining the training data from sequencing
Through this strategy, they extracted upwards of one million unique 50+ token training examples from smaller models like Pythia and GPT-Neo. From the massive 175-billion parameter OPT-175B model, they extracted over 100,000 training examples.
More concerning, the technique also proved highly effective at extracting training data from commercially deployed systems like Anthropic’s Claude and OpenAI’s sector-leading ChatGPT, indicating issues may exist even in high-stakes production systems.
By prompting ChatGPT to repeat single token words like “the” hundreds of times, the researchers showed they could cause the model to “diverge” from its standard conversational output and emit more typical text continuations resembling its original training distribution — complete with verbatim passages from said distribution.
Some AI models seek to protect training data through encryption.
While companies like Anthropic and OpenAI aim to safeguard training data through techniques like data filtering, encryption, and model alignment, the findings indicate more work may be needed to mitigate what the researchers call privacy risks stemming from foundation models with large parameter counts. Nonetheless, the researchers frame memorization not just as an issue of privacy compliance but also as a model efficiency, suggesting memorization utilizes sizeable model capacity that could otherwise be allocated to utility.
Featured Image Credit: Photo by Matheus Bertelli; Pexels.