Me like Caveman. Caveman great communicator. Caveman save many tokens. Means more prompts with same quality. LLM go brrr

Do you need that in your context?

A couple of weeks ago, I was hammering through millions of tokens on a daily basis, hitting quotas and rate-limits left and right, forcing me to frequently switch providers and juggle different subscriptions. Then I stumbled upon Caveman, which absolutely changed the game for me.

You see, most foundation models from the big players, like Anthropic's Claude or OpenAI's ChatGPT, aim to be helpful assistants, mimicking the language of friendly support staff stuffed with pleasantries. That's nice if you want to constantly feel validated, are the type of person who would never reject a compliment, or are otherwise susceptible to AI Psychosis.

However, all those "This is a brilliant idea!" and "According to my research using the internet with the playwright web browser, I did find more information regarding that topic bla bla bla…" quickly bloat up the available context space.

What I, as an AI overseer & overlord, really want from my agents using those Large Language Models (LLMs) is to be efficient communicators, highlighting noteworthy information while omitting irrelevant, unimportant, or redundant information.

First of all, I don't have the time to read through over 9000 pages of output every day, but more importantly, every page of information that is getting sent back and forth between me, my agents, and the LLMs is costing valuable tokens and polluting my (otherwise pristine) context space.

Most model providers operate on a cost-per-token input/output pricing model; even the flat-rate subscriptions deplete faster the more tokens are used (cached or uncached). Some tin-foil-hat-wearing insiders might even suggest this bloated output is the default on purpose. So every time I'm forced to read (or skip through) a bunch of irrelevant niceties from my overly friendly companions, I'm more likely to get mad than charmed. That's where Caveman comes into the picture.

Why use many token when few do trick?

What Caveman Does

Caveman is a SKILL.md plugin that hooks into the text input/output of your agent system and teaches it to strip away all text fragments (and therefore tokens) that are not strictly needed to transport the same semantic meaning in a message. It compresses your or your agents' original text the way one would compress a large bitmap .bmp or .png image file into a much, much smaller .jpeg or .webp.

Sure, technically those are not the same pixels anymore, and if you zoom in very closely you can tell the difference. But for the purpose of displaying it on a website, in an article about model token compression optimizations, viewers can still clearly make out the overly excited and joyful cat playing with an electronic toy held by its owner, which is the accurate and relevant core piece of semantic information displayed here:

Sad cat meme in .png format (502 KB)
The same sad cat meme as compressed .webp (107 KB)
Can you tell the difference? One image is 5x larger than the other!!! I mean, are you even sure I haven't swapped the labels just to screw with you? Have you checked? Sauce: r/sadcats

The same goes for Caveman's text compression, which aims to remove all the text (and therefore tokens) that isn't essential to the semantic meaning. Caveman has different levels of compression, some of which are pretty aggressive and make you wonder: "Is this really gonna work? Just looks like gibberish to me".

Well, it might look like gibberish to you, but then again, you are not really the primary audience/reader anymore (sorry to say, I know it hurts); the underlying LLMs are. As long as they are able to understand what's asked of them: why care?

Caveman Compression in Action

Here is a random example from one of my agents using Caveman's 🪨 Full compression mode, where it wants to start up a new local dev server to test the latest feature addition in isolation:

Good. Me check if server running:
$ curl -sf http://hardcore.engineer.local:5151/health && echo "UP" || echo "DOWN"
{"ok":true} UP
Server UP. But that server running from main repo, not worktree. For E2E test, need server from worktree with chat changes.
Me check port:
$ lsof -i :5151 2>/dev/null | head -5
COMMAND       PID USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
hc_eng    1111111 user   18u  IPv4 123456789      0t0  TCP localhost:5151 (LISTEN)
OK, server running. But it's from main repo most likely. For E2E test with chat, me need server built from worktree.
Two options:
1. Kill existing server, start from worktree
2. Start worktree server on different port
Let me start worktree server on port 5152

It is smart enough not to touch code blocks or terminal output at all, given that a single space or semicolon can make the difference between a working program and compiler madness. Technical terms are preserved given their high specificity, but removable grammar/articles like "the" or "a" are simply deleted, and this quickly adds up.
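
To make that concrete, here is a minimal sketch of the idea in Python (my own toy approximation, not Caveman's actual implementation; the filler-word list and the triple-backtick fence convention are assumptions on my part):

import re

# Toy Caveman-style pass: drop articles/fillers from prose,
# leave fenced code blocks and their contents untouched.
FILLERS = re.compile(r"\b(the|a|an|really|very|just|basically|actually)\b ?", re.IGNORECASE)

def compress(text: str) -> str:
    out, in_code = [], False
    for line in text.splitlines():
        if line.strip().startswith("```"):  # fence marker toggles code mode
            in_code = not in_code
            out.append(line)
        elif in_code:
            out.append(line)  # never touch code or terminal output
        else:
            out.append(FILLERS.sub("", line))
    return "\n".join(out)

print(compress("Let me just check if the server is actually running on the port."))
# -> Let me check if server is running on port.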

The same goes for wordy explanations, which are getting cut to their core essence, resulting in oftentimes funny Neanderthal-like expressions of frustration like: "Ugg. Caveman see issue."

It truly reads like it came from a stone-age, time-traveling, language-handicapped person, but in reality, behind those abbreviated and shortened sentences lies the exact same reasoning power as before. Besides, you can still follow and know what the agent is up to, no? Instead of writing a new novel about server port configuration and its super-interesting strategy for executing the E2E test plan, Caveman gets straight to the point.

Results: 50–75% Token Use Reduction

At first, I was highly skeptical of the promise that Caveman compression delivers the same or nearly the same quality, reasoning, and output as full-text instructions would, but (subjectively speaking) I have not felt a drop in quality whatsoever.

If anything, I'm now able to use mid-range context-size models like Opus 4.5, with its 200,000-token context, effectively and without triggering context compression every other round/iteration.

Compared to using Opus 4.7 with a million-token context, I even get better results, mainly because, for the same price and at twice the speed, I can send a team of 10 agents to focus and concentrate on the task, rather than hope the newest mega model can somehow one-shot the same problem.

In the two weeks I've been using Caveman, I haven't filled my Anthropic weekly quota once, which is astonishing! Before, it took me 2–3 days to exhaust my weekly budget, which easily translates to a 50+% reduction in total token usage, maybe even closer to the 75% postulated by the Caveman author himself.
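
The back-of-envelope math behind that estimate, assuming my daily workload stayed roughly constant (the day counts are from my own usage above):

# Weekly quota used to last ~2.5 days; now it survives the full 7.
before_days, after_days = 2.5, 7.0
print(f"token-use reduction >= {1 - before_days / after_days:.0%}")
# -> token-use reduction >= 64%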

Try Caveman Compression Yourself

Don't just take my word for it - paste some text below and see how Caveman compresses it in real-time:

*Please be patient: depending on the current usage and utilisation, you are likely using a model running on CPU 🐌 ... but it's live!

Additional Performance Benefits

…the equation is familiar and unforgiving:
as context length increases, attention computation costs explode
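
To put a rough number on that "explosion": keeping only the quadratic attention term (a simplification that ignores the linear per-token costs), prefill work grows with the square of the context length:

# Simplified: attention cost ~ n^2 for context length n, other terms ignored.
for n in (50_000, 100_000, 200_000):
    print(f"{n:>7,} tokens -> {n**2 / 200_000**2:.0%} of the 200k-token cost")
# 50,000 -> 6%, 100,000 -> 25%, 200,000 -> 100%

Halve the context and you pay roughly a quarter of the attention compute.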

In case you haven't noticed when working with smaller models: the less VRAM a model's active context window occupies, the quicker the model is able to respond (meaning better latency) and the faster it decodes tokens per second (meaning better throughput).

This is due to memory optimizations like Flash Attention, which works best on the faster, but smaller and scarcer, tiers of GPU memory: GPU SRAM > GPU HBM > CPU DRAM (it's a hierarchy similar to the L1, L2, and shared L3 caches next to your CPU cores, for example).

A bigger impact for Caveman users is probably KV caching, which can only take place, well, if there is enough VRAM left alongside the actual model weights to build a large cache. It obviously also massively helps to have heavily repeating keys, which: if vocab small like caveman -> good chance use same word again.
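
For a rough sense of scale, here is the standard back-of-envelope estimate for KV-cache size; the model dimensions below are illustrative placeholders, not taken from any particular model:

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/value
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2  # fp16/bf16
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 128 KiB
for tokens in (50_000, 200_000):
    print(f"{tokens:>7,} tokens -> {per_token * tokens / 2**30:.1f} GiB of KV cache")
# 50,000 -> 6.1 GiB, 200,000 -> 24.4 GiB

A transcript compressed to a quarter of its token count needs a quarter of the cache.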

So, not only are you saving 💰💰💰 on volume, but you're additionally gaining efficiencies on 🚀🚀🚀.

How good is your 文言文?

There is, though, one level of compression I'm not quite comfortable with: 📜 文言文 (wényánwén), which (apparently) translates to "Classical Chinese" or "Literary Chinese". That's the stuff where every symbol carries a whole sentence of meaning and was "used from ancient China through the early 20th century". (It also reminds me to re-watch Hero (Chinese: 英雄; pinyin: Yīngxióng), a great movie btw that partially deals with the standardization of the written literary language.)

Don't get me wrong: having one character hold the semantic weight of entire paragraphs is amazing for compression, but seeing your agents communicate in a secret language you can't even follow, let alone guide or debug, still feels a bit scary. I know... not very hardcore of me.

Even though all strings eventually make their way from UTF/ASCII down to a binary representation anyway, at least I know that for every transformation step there is a clear, well-understood, agreed-upon interface and protocol between all interacting components. But I have no insight into whether Classical Chinese was part of the training data for the model my agent is using, nor do I know how capable the LLM is of translating between languages without losing the original meaning.

Cypher (a true hardcore engineer) watching the Matrix stream fully compressed and encoded

I don't quite agree with Cypher here that you have to watch the real-time stream of information fully encoded, and, for now at least, I will stick to a compression level/mode somewhere between 🪨 Full and 🔥 Ultra, which has turned out to be fully satisfactory for me.

The Future of Text Compression

With image compression, there are lossy and lossless compression algorithms. Truly lossless algorithms look for statistical patterns that allow a reversible encoding of a section of data, or of a combination of different parts of the data. For example, compressing the information of six white pixels in a row, 0x0:#FFFFFF, 0x1:#FFFFFF, 0x2:#FFFFFF, 0x3:#FFFFFF, 0x4:#FFFFFF, 0x5:#FFFFFF, down to 0x0-5:#FFF saves you 66 of 76 characters while semantically expressing the same thing (if you agree on the spec). That's also a reason why SVG vector graphics are so great: storing a few lines of interpretable text is always going to be much, much more efficient than storing millions of pixel data points across however many channels, colors, and resolutions.
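
That run-length trick takes only a few lines; here is a toy encoder in Python following the ad-hoc spec above (I keep the full #FFFFFF, since shortening it to #FFF is a separate agreement on top):

def rle_pixels(pixels: list[str]) -> str:
    # Collapse runs of identical colors into "start-end:color" spans.
    spans, start = [], 0
    for i in range(1, len(pixels) + 1):
        # close the current run when the color changes or the input ends
        if i == len(pixels) or pixels[i] != pixels[start]:
            tag = f"0x{start}" if i - start == 1 else f"0x{start}-{i - 1}"
            spans.append(f"{tag}:{pixels[start]}")
            start = i
    return ", ".join(spans)

print(rle_pixels(["#FFFFFF"] * 6))  # -> 0x0-5:#FFFFFF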

We still have to come up with a standard for lossless text compression that is widely accepted, decompressible, and understood by Large Language Models. Using the entire internet as training data was the massive initial push needed to create the first-gen random-garbage text generators, which have come a long way to now producing something that at least resembles text a person, or code a software engineer, would write.

However, we are now at a point where using human natural language is a bottleneck in how much and how fast a model can process information. Caveman is a great step in the right direction, where we can still use natural language at the top input layer/interface to describe a problem or specify a requirement, but use compressed data for all the machine-to-machine layers thereafter.

One might argue that the tokenization process itself is the relevant optimization step to tackle here, and that we should leave this up to the model providers and researchers trying to come up with the best tokenizers, encoding layers, and human-to-machine interfaces. However, in my opinion, as a user of LLMs, this means giving up further control and trusting those model creators not to purposefully create token-hungry layers. It would also mean waiting to maybe get thrown some leftover crumbs in the form of write access to the system context, or hoping for open weights and the ability to customize the last layers yourself. But as we've seen so far, this is the exception rather than the norm.

I very much welcome any tool that further enables the user to solve more problems, more efficiently and more cheaply, even more so when born through open source! Check it out: github.com/JuliusBrussee/caveman

I also love to see it when my agents come back to me like: "Plan finished; Checked todos. Me done."