Offline “Studying” Shrinks the Cost of Contextually Aware AI

By having AI study a user’s context offline, researchers dramatically reduce the memory and cost required to make AI contextually aware.
Imagine you are a lawyer using an AI chatbot to help you prepare a brief. You upload some relevant legal documents to your favorite bot and start issuing prompts. Soon, the bot is chugging away. What you don’t see, however, is the tremendous memory, compute power, and energy being consumed in the background.
“The AI’s internal representation of a 70,000-word legal document might consume more than 100 gigabytes of precious GPU memory. To put into context how large this representation is, the raw text from that same document is only 400 kilobytes — 250,000 times smaller,” explains Sabri Eyuboglu, a doctoral student in computer science at Stanford University. “All of this memory consumption makes it really costly and slow to produce the chatbot’s response.”
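Where does that number come from? A quick back-of-envelope calculation shows how a transformer's key-value (KV) cache balloons with document length. The model dimensions below are illustrative assumptions (roughly a large Llama-style model without grouped-query attention), not the exact system from the paper:

```python
# Back-of-envelope KV-cache sizing. All model dimensions here are
# illustrative assumptions, not the exact model from the paper.

def kv_cache_bytes(num_tokens, num_layers, num_heads, head_dim, bytes_per_value=2):
    """Bytes of cached keys and values across all layers, in fp16/bf16."""
    per_token = num_layers * 2 * num_heads * head_dim * bytes_per_value  # 2 = K and V
    return num_tokens * per_token

tokens = 100_000  # a 70,000-word document is very roughly 100,000 tokens
cache = kv_cache_bytes(tokens, num_layers=80, num_heads=64, head_dim=128)
print(f"KV cache: ~{cache / 1e9:.0f} GB")       # ~262 GB at these settings
print(f"Raw text: ~{70_000 * 6 / 1e3:.0f} KB")  # ~6 bytes per word -> ~420 KB
```

Techniques like grouped-query attention shrink the cache, but it still dwarfs the raw text by orders of magnitude.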
Eyuboglu is the first author of a new preprint paper that proposes an intriguing solution to this problem: “Cartridges.” A Cartridge is a compact memory module that is trained offline to represent a document or other text, allowing an AI bot to answer queries quickly without reanalyzing the full document. Eyuboglu says that Cartridges work for any large body of textual information — legal documents, computer code, personal files, textbooks, patient medical records, and more.
“Today’s AI systems do a good job adapting their responses to a small amount of context — think a few pages of text. But, unfortunately, the performance and efficiency of today’s systems degrade as context grows,” says co-author Simran Arora. “With Cartridges, we’re exploring ways to more efficiently and effectively scale up the amount of context we can provide to the model.”
By storing context in these compact Cartridges, Arora and Eyuboglu, along with their co-authors and their advisor, Professor Chris Ré, found they could shrink memory requirements by orders of magnitude. Cartridges, they say, use almost 40 times less memory and boost the bot’s words-per-second output by more than 25 times compared with conventional in-context learning (ICL). The research was partially funded by the Stanford Institute for Human-Centered AI.
New Horizons
The innovation sprang from a relatively modest concept. “Since the same documents are often referenced by many queries, let’s invest a ton of compute up front to prepare the Cartridges,” Eyuboglu says. “Then, as we get more queries down the line, we can respond very quickly.”
This is not the first time researchers have tried to lighten AI’s memory load, but prior efforts invested comparatively little compute in the compression process. Memory footprints were smaller, yes, but those gains came at a high cost: the answers were worse. Cartridges, in contrast, consume less memory while still producing high-quality answers, precisely because so much compute goes into building them. “This trade-off is desirable when contexts are shared across many queries and the cost of producing the Cartridges can be shared,” Eyuboglu says.
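To see why that amortization works, consider a toy cost model. The numbers are invented for illustration; real costs depend on the hardware and model:

```python
# Toy amortization model. All costs are invented, illustrative units;
# the point is the break-even logic, not the specific numbers.
train_cost = 500.0              # one-time compute to self-study one document
icl_cost_per_query = 10.0       # per-query cost of re-reading the full document
cartridge_cost_per_query = 1.0  # per-query cost of serving from the Cartridge

# Break-even when train_cost + n * cartridge_cost == n * icl_cost
break_even = train_cost / (icl_cost_per_query - cartridge_cost_per_query)
print(f"Cartridge pays off after ~{break_even:.0f} queries")  # ~56 here
```

Past the break-even point, every additional query about the same document is nearly free by comparison.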
In effect, Cartridges train themselves through a key innovation the team calls “self-study.” With self-study, the model doesn’t simply memorize the text; it carries on a conversation with itself about the document, essentially simulating the queries a real user might ask. These conversations are then baked into the Cartridge using standard training algorithms. In this way, a single Cartridge can serve diverse prompts, which saves time, effort, and memory down the road, Eyuboglu says.
“If you just train only on the context with a simple objective like next-token prediction, you could memorize the document, but you’d only be able to regurgitate it,” Eyuboglu says. “What the synthetic conversations do — what self-study does — is critical for allowing the model to actually answer general questions and tasks quickly and accurately at a later point in time.”
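In code, the idea might look something like the minimal sketch below, assuming (in the spirit of the paper) that the Cartridge is a small block of trainable key-value vectors and the base model stays frozen. The helper functions and toy sizes are hypothetical stand-ins, not the authors' actual implementation:

```python
# Minimal sketch of self-study, with hypothetical stubs in place of the
# real frozen language model. Only the Cartridge parameters are trained.
import torch

num_layers, num_heads, head_dim = 4, 2, 16  # toy sizes for illustration
cartridge_len = 8                           # far shorter than the document

# The Cartridge: one trainable key/value block per layer.
cartridge = torch.nn.ParameterList(
    torch.nn.Parameter(0.02 * torch.randn(2, cartridge_len, num_heads, head_dim))
    for _ in range(num_layers)
)
optimizer = torch.optim.Adam(cartridge.parameters(), lr=1e-2)

def generate_synthetic_conversation(document: str) -> torch.Tensor:
    """Stub: in the real system, the frozen model quizzes itself about
    the document and returns the token ids of that Q&A exchange."""
    return torch.randint(0, 1000, (32,))

def loss_with_prefix(prefix, conversation: torch.Tensor) -> torch.Tensor:
    """Stub: in the real system, this would be the next-token loss on the
    synthetic conversation with the Cartridge prepended to the KV cache
    in place of the document. Here, a placeholder so the loop runs."""
    return torch.stack([p.pow(2).mean() for p in prefix]).sum()

document = "a long legal document ..."
for step in range(100):
    conversation = generate_synthetic_conversation(document)  # self-study
    loss = loss_with_prefix(cartridge, conversation)          # update Cartridge only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The crucial design choice is that gradients flow only into the Cartridge, so the expensive offline training never touches the base model's weights.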
Next Steps
Cartridges are by no means free. Self-study requires a powerful multi-GPU system, and training a Cartridge consumes more energy up front than simply answering queries would. With regular use, though, the investment pays off: training can happen offline, when compute power is cheap or in lower demand, and a Cartridge can be reused across countless queries of the same large body of text.
Future directions Eyuboglu hints at include more efficient training of the Cartridges, real-world deployment in specific domains like medicine and law, and perhaps even standard libraries of Cartridges for public use. He also notes that Cartridges trained on different texts can be combined, an intriguing finding that could propel future research.
What’s most exciting to the team is that Cartridges may present a scalable and sustainable path toward AI systems that are personalized and continually learn from the user’s context.
“The recent history of AI has been all about building huge monolithic models that are the same for everyone,” Eyuboglu concludes. “I think we’re starting to see the limits of that approach. With this work, we’re providing evidence that self-study techniques could present a scalable path forward.”