I Compressed LLM Context 10x and Kept 89% of the Facts
Sentence-level pooled embeddings as a drop-in replacement for raw tokens in LLM context.
The experiment
This is a writeup of one small experiment, not a finished result. The question I wanted to answer: can the old conversation history in an LLM's context be replaced with something an order of magnitude smaller, while the model still extracts the facts it needs?
The system I built is autoencoder-shaped but not actually an autoencoder. The encoder is Google's EmbeddingGemma, frozen. The decoder is Gemma 4 E2B, frozen, with a small LoRA. The only thing actually trained from scratch is a 2.2M-parameter linear adapter that bridges them. That's 0.04% of the base model. The whole thing runs end to end on a single GPU in about two hours.
The honest framing matters. EmbeddingGemma was trained for retrieval, not to produce vectors that another model decodes back into facts. So really we're asking a different question: are off-the-shelf retrieval encoders unintentionally close to being the encoder half of an autoencoder? The fact that a downstream LLM can be trained to read facts out of these vectors at all is what makes the experiment worth running. The fact that it has a clear ceiling, and a clear next experiment that addresses that ceiling (training the encoder too), is what makes it worth writing up.
The KV cache problem
Every time you chat with an LLM, it stores your entire conversation history in GPU memory as a KV cache. Every token from every previous message sits there, consuming the same memory whether it's the insightful question you asked 30 turns ago or the word "the."
The math is brutal. A 10,000-token conversation on Gemma 4 31B consumes about 1.5 GB per user. An 80GB H100 fits maybe 40 concurrent users. Serving 10,000 users requires roughly 250 GPUs at about $490K/month.
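For intuition, the arithmetic is just (one K and one V vector per token) × layers × KV heads × head dim × bytes. Here's a back-of-the-envelope sketch; the config values are hypothetical stand-ins chosen to land near the 1.5 GB figure, not Gemma's actual shape:

```python
def kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # fp16/bf16
    # Each token stores one K and one V vector per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

print(f"{kv_cache_bytes(10_000) / 1e9:.2f} GB per 10K-token user")  # ~1.47 GB
```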
Most of that memory is OLD context. Stuff from 20, 30, 50 turns ago that the model already processed and understood. But it's stored word-for-word, at full precision, burning GPU memory that could be serving more users.
The approach: treat old context like memory, not tape
Humans don't remember conversations word-for-word. We remember the gist. "She's an engineer at Google, lives in Lagos, mentioned something about telescopes." We compress automatically, keeping what matters and dropping the filler.
I wanted to build the same thing for LLMs. Take an old conversation turn, split it into sentences, compress each sentence into a single vector with a sentence encoder, then feed those vectors to the LLM alongside the current conversation.
A 5-sentence paragraph becomes 5 vectors instead of about 50 tokens. 10x compression. The LLM learns to read these compressed vectors and pull facts out of them, like reading notes instead of re-reading the full transcript.
What I built
The architecture is deliberately simple:
Old text → split into sentences
→ each sentence encoded by EmbeddingGemma (frozen) → 768-dim vector
→ projected by a trained linear adapter → prefix positions in the LLM
→ LLM (Gemma 4 E2B, frozen + LoRA) reads the prefix and generates

The LLM doesn't know it's reading compressed text. It just sees vectors at certain input positions, the same way multimodal models process image patches. The adapter learns to project sentence embeddings into a format the LLM's attention can naturally process. Everything is frozen except the adapter and a small LoRA on the decoder.
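A minimal PyTorch sketch of that forward path. The HF model ids are placeholders, and the LoRA and training loop are omitted; the one `nn.Linear` is the trained adapter:

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

encoder = SentenceTransformer("google/embeddinggemma-300m")      # frozen; id assumed
tok = AutoTokenizer.from_pretrained("google/gemma-2-2b")         # placeholder id
llm = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")  # frozen (+LoRA in the real run)

# The only module trained from scratch: a 768 -> d_model linear map.
# At a 2880-wide decoder, 768 x 2880 gives the ~2.2M parameters quoted above.
adapter = nn.Linear(768, llm.config.hidden_size, bias=False)

def build_inputs(sentences, question):
    pooled = torch.tensor(encoder.encode(sentences))   # (n_sent, 768), one vector per sentence
    prefix = adapter(pooled).unsqueeze(0)              # (1, n_sent, d_model) prefix positions
    q_ids = tok(question, return_tensors="pt").input_ids
    q_emb = llm.get_input_embeddings()(q_ids)          # question stays as normal token embeddings
    return torch.cat([prefix, q_emb], dim=1)           # pools first, question after

emb = build_inputs(
    ["Marcus Chen is an engineer at Google.", "He lives in Lagos, Nigeria."],
    "Q: city? A:",
)
out = llm.generate(inputs_embeds=emb, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```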
The journey: why curriculum training was essential
This didn't work on the first try. Or the second. Or the ninth. The failures were as informative as the successes.
Phase 1: can the LLM read ONE embedding?
Simplest possible task. One sentence, compressed to one vector. Can the LLM extract basic facts from it?
After training on 5,000 synthetic sentences: 60.3% exact match on held-out test data. The LLM was reading the embedding and extracting specific words from it. Not perfect, but proof the mechanism works.
Phase 1.5: what about a whole paragraph?
Same idea, but compress a 5-sentence paragraph into one vector. Result: 50.0% exact match. Ten points lower. The single vector was trying to represent 5 sentences' worth of facts, and details were getting diluted. City extraction dropped to 21%: the city and country were competing for representation in the same vector.
Phase 3: the breakthrough, chunk it
Instead of cramming everything into one vector, each sentence gets its own:
[pool: "Marcus Chen is an engineer at Google."] → vector 1
[pool: "He lives in Lagos, Nigeria."] → vector 2
[pool: "He graduated from University of Nairobi."] → vector 3
[pool: "He speaks English and Yoruba."] → vector 4
[pool: "His team works on cloud infrastructure."] → vector 5
[question tokens: "Q: city? A:"]
→ model searches across 5 vectors, finds vector 2 → "Lagos"

Result: 89.3% exact match. Same text, same total information, but organized as 5 focused vectors instead of 1 diluted vector. The model learned to select the right vector for each question through standard attention.
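In code, the difference between phase 1.5 and phase 3 is a one-line change in pooling granularity (reusing the `encoder` from the sketch above); everything downstream, including the adapter, is identical:

```python
sentences = [
    "Marcus Chen is an engineer at Google.",
    "He lives in Lagos, Nigeria.",
    "He graduated from University of Nairobi.",
    "He speaks English and Yoruba.",
    "His team works on cloud infrastructure.",
]
single  = encoder.encode([" ".join(sentences)])  # (1, 768): all facts compete in one vector
chunked = encoder.encode(sentences)              # (5, 768): one focused vector per sentence
```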
The results
| Setup | Exact match | Compression |
|---|---|---|
| 1 sentence → 1 vector | 60.3% | baseline |
| 5 sentences → 1 vector | 50.0% | 10x (lossy) |
| 5 sentences → 5 vectors | 89.3% | 10x (accurate) |
Per-fact breakdown for the chunked approach:
| Fact type | Single vector | Chunked (5 vectors) | Gain |
|---|---|---|---|
| Profession | 75.5% | 96.2% | +20.7pp |
| Country | 73.5% | 97.3% | +23.8pp |
| Employer | 62.5% | 94.3% | +31.8pp |
| City | 21.0% | 88.3% | +67.3pp |
| Name | 48.0% | 70.2% | +22.2pp |
City went from 21% to 88% just from chunking. What I thought was a fundamental limitation of the sentence encoder turned out to be pool dilution. When city and country were in the same vector, they competed. Separate vectors solved it.
Scaling: does this hold at 500+ pools?
The real question for production: can the model handle not just 5 pools but hundreds? I ran a scaling study from 5 to 1,000 pools, where each pool belongs to a different person and the question names which person to extract from.
| Pools | Overall | Profession | Country | City |
|---|---|---|---|---|
| 5 | 76.0% | 91.2% | 90.7% | 42.7% |
| 10 | 75.8% | 93.2% | 88.9% | 42.9% |
| 50 | 73.5% | 93.5% | 88.4% | 43.4% |
| 100 | 73.8% | 90.4% | 87.7% | 41.1% |
| 250 | 72.5% | 89.4% | 75.4% | 50.8% |
| 500 | 78.8% | 92.9% | 91.9% | 60.0% |
| 1000 | 67.0% | 81.5% | 81.5% | 17.6% |
Near-flat scaling from 5 to 500 pools. Going from 5 to 500 different people in the prefix (100x more pools) caused essentially zero accuracy loss on profession and country. The architecture genuinely scales.
At 1,000 pools, city collapses (60% to 18%) because multi-token city names decoded from pools become ambiguous when there are 1,000 candidates. Single-token facts like profession and country hold up at 81.5% even at 1,000 pools.
Reconstruction: the autoencoder result
I also tested whether the compressed pools contain enough information to reconstruct the original text. I trained the model on a 50/50 mix of fact-extraction questions and "repeat the text from these pools" tasks.
The result: the model can regenerate source text from pools near-perfectly on straightforward examples while maintaining the same 89% QA extraction accuracy. The pools genuinely contain the information. They aren't just triggering memorized patterns.
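The task mix itself is simple to generate. A sketch, with a hypothetical record schema and prompt strings (the exact prompts used in training may differ):

```python
import random

def make_example(record):
    # Hypothetical schema: {"text": str, "sentences": [str], "qa": [(q, a), ...]}
    if random.random() < 0.5:
        q, a = random.choice(record["qa"])          # fact-extraction task
        return record["sentences"], f"Q: {q} A:", a
    # Reconstruction task: regenerate the source text from the pools alone.
    return record["sentences"], "Repeat the text:", record["text"]
```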
What doesn't work
Numbers
Pooled embeddings can't preserve numeric precision. The sentence encoder treats "owns 880 radios" and "owns 881 radios" as nearly identical (cosine similarity 0.96). Great for search. Terrible for fact extraction. Ages, counts, salaries: all essentially 0% extraction from pools alone.
The fix is straightforward: keep number tokens as literal tokens in the prefix alongside the pools. In a separate experiment, literal token passthrough achieved 95.5% exact match on 3-digit counts.
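A sketch of the passthrough idea, reusing the `encoder` from earlier; the exact interleaving used in the experiment may differ:

```python
import re

def encode_with_passthrough(sentence):
    # Pool the sentence as usual, but pull digit spans out so they can be
    # placed in the prefix as literal tokens next to the pool vector.
    numbers = re.findall(r"\d[\d,.]*\d|\d", sentence)
    pooled = encoder.encode([sentence])[0]   # lossy on exact digits
    return pooled, numbers

encode_with_passthrough("He owns 880 radios.")  # -> (768-dim vector, ["880"])
```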
Code
I tested the architecture on code (function names, parameters, dependencies extracted from pool embeddings of code snippets). It didn't work: 6% accuracy versus 76% on persona text. The sentence encoder (EmbeddingGemma) was trained on web text, not code. The pooled representation of a Go function doesn't preserve specific identifiers the way a pooled sentence about a person preserves their profession. A code-specific encoder would likely fix this.
The painful lessons
Curriculum training is non-negotiable. Training multi-vector extraction directly failed every time. The model needs to learn pool reading before pool selection. Each phase teaches exactly one new skill.
LoRA will find every shortcut you leave open. The model learned to predict answer "shapes" without reading the vectors: 94.5% first-token accuracy with 0% exact match. Contrastive training (multiple passages with the same fact type but different values) was required to close this shortcut.
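Closing the shortcut meant building batches where the answer shape is constant but the value varies, so first-token statistics carry no signal. A sketch, with a hypothetical passage generator:

```python
def make_passage(value):
    # Hypothetical stand-in for the synthetic data generator.
    return [f"Amara Okafor lives in {value}.", "She works as a designer."]

def contrastive_batch(fact_type, values):
    # Same question, same answer shape, different values: the only way to
    # score well is to actually read the pooled vectors.
    return [(make_passage(v), f"Q: {fact_type}? A:", v) for v in values]

batch = contrastive_batch("city", ["Lagos", "Osaka", "Quito"])
```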
A tokenization bug silently killed 9 experiments. Special tokens were being split into fragments by the tokenizer instead of being recognized as single tokens. The embedding overrides never triggered. Nine full training runs before a targeted diagnostic exposed the two-line fix.
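The diagnostic that would have caught it fits in a few lines; the `<POOL>` token name and model id here are illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-2b")  # placeholder id
tok.add_special_tokens({"additional_special_tokens": ["<POOL>"]})

ids = tok.encode("<POOL>", add_special_tokens=False)
# If this fails, the token is being split into fragments and any embedding
# override keyed on its single id will never trigger.
assert len(ids) == 1, f"<POOL> split into {len(ids)} pieces: {ids}"
```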
Pool dilution looks like an encoder limitation but isn't. I spent days trying to fix "the encoder can't represent cities" with better adapters, learned queries, and Q-former attention. The real problem was cramming too much text into one vector. Chunking solved it instantly.
What this means at scale
For a chat product compressing old conversation turns:
| | Standard | Sentence-level pools |
|---|---|---|
| Memory per user (10K tokens) | ~1.5 GB | ~150 MB |
| Concurrent users per H100 | ~40 | ~400 |
| Monthly cost (10K users) | ~$490K | ~$49K |
| First-token latency | ~500ms | ~50ms |
What's next
The honest reading of these results: the decoder learned to read pooled sentence embeddings produced by an encoder that was never trained to be read. That's a strong baseline for an architecture that's basically getting the encoder for free. The system has the shape of an autoencoder, but the encoder half was optimized for retrieval, not for being decoded.
The obvious next experiment is to train the other half. Either fine-tune EmbeddingGemma on this exact objective (be decodable into facts by a downstream LLM), or train a small new encoder from scratch with that loss. The current setup is a one-sided contract: the decoder is optimized to read the encoder, and the encoder is oblivious. Closing that loop should move every number in this writeup, and probably attack the failure modes too (numbers, code).
A few side experiments worth running once the autoencoder is end-to-end:
- Capture from a middle layer of the decoder rather than projecting into prefix positions.
- Joint training of encoder + adapter + decoder LoRA with both a reconstruction loss and a fact-extraction loss (a sketch follows this list).
- Re-test whether literal-token passthrough for numbers (the workaround used here) becomes unnecessary once the encoder is trained for the task.
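For the joint-training bullet, a sketch of what the objective might look like, building on the earlier pipeline sketch. Nothing here is implemented yet: the loss weight is an arbitrary placeholder, and a real version would run the encoder's differentiable torch forward rather than `.encode()` so gradients actually reach it.

```python
def lm_loss(sentences, prompt, target):
    # NOTE: encoder.encode() is non-differentiable; a trained-encoder version
    # would call the encoder's torch forward so the graph extends into it.
    prefix = adapter(torch.tensor(encoder.encode(sentences))).unsqueeze(0)
    p_emb = llm.get_input_embeddings()(tok(prompt, return_tensors="pt").input_ids)
    t_ids = tok(target, return_tensors="pt", add_special_tokens=False).input_ids
    t_emb = llm.get_input_embeddings()(t_ids)
    emb = torch.cat([prefix, p_emb, t_emb], dim=1)
    labels = torch.full(emb.shape[:2], -100)    # ignore prefix + prompt positions
    labels[:, -t_ids.shape[1]:] = t_ids         # supervise only the target tokens
    return llm(inputs_embeds=emb, labels=labels).loss

sents = ["He lives in Lagos, Nigeria."]
text = sents[0]
loss = lm_loss(sents, "Q: city? A:", "Lagos") \
     + 0.5 * lm_loss(sents, "Repeat the text:", text)  # weighting is a placeholder
loss.backward()
```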
If you're working on efficient LLM inference, model compression, or learned compression for context, I'd be glad to compare notes. isaiah@ballah.ai.
All experiments ran on a single RTX 4500 Ada (24GB). Total compute cost for the full research arc: about $50.