Tiny Press
Compress any text to a token budget locally.
When we use AI, it feels like we are just talking to a smart system. But under the hood, it is not really reading words the way we do. It reads tokens. Tokens can be whole words, parts of words, punctuation, or symbols. Different AI models can split the same sentence into different tokens, depending on the tokenizer they use.
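To make that concrete, here is a small sketch with two made-up tokenizers. Neither is a real model tokenizer; `word_tokens` and `chunk_tokens` are hypothetical helpers that only illustrate how the same sentence can end up as a different number of tokens.

```python
import re

def word_tokens(text):
    # Toy tokenizer A: every word and every punctuation mark is one token.
    return re.findall(r"\w+|[^\w\s]", text)

def chunk_tokens(text, size=4):
    # Toy tokenizer B: break each word into pieces of at most `size` characters,
    # loosely imitating how subword tokenizers split longer words.
    tokens = []
    for piece in word_tokens(text):
        tokens.extend(piece[i:i + size] for i in range(0, len(piece), size))
    return tokens

sentence = "Tokenizers split text differently."
a = word_tokens(sentence)
b = chunk_tokens(sentence)
print(len(a), a)
print(len(b), b)
```

Same sentence, same characters, but a different token count. Real tokenizers differ in similar ways, which is why a prompt that fits one model's budget may not fit another's.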
That small detail matters more than it seems.
Every AI model has a token limit, also called a context window. This is the amount of text the model can handle at one time. Your instructions, chat history, examples, documents, and questions all need to fit inside that limit. And in transformer-based models, longer input also means more work for the model, because attention becomes more expensive as the text gets longer.
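A rough way to see why: in standard self-attention, every token is scored against every other token, so the number of scores grows with the square of the input length. The sketch below only does that arithmetic. It assumes plain full attention and ignores the optimizations many real systems use.

```python
def attention_pairs(num_tokens):
    # Standard self-attention scores every token against every token,
    # so the score matrix has num_tokens * num_tokens entries.
    return num_tokens * num_tokens

for n in (1_000, 2_000, 4_000):
    print(n, "tokens ->", attention_pairs(n), "attention scores")
```

Doubling the prompt length roughly quadruples this part of the work, which is one reason shorter prompts can be cheaper and faster.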
So, token limits are not just random rules! They exist because models have limits on memory, speed, and cost. And for people using AI, that means tokens affect three important things:
That is what got me thinking. What if we could make prompts smaller, but still keep the meaning strong?
When people first start using AI, the natural habit is to add more. More instructions. More examples. More background. More "just in case" context. I've done that too.
But over time, I learned that AI does not always need more. It needs clearer input. A good prompt is not the one with the most words. A good prompt is one that helps the model understand the task quickly and clearly.
That is why being mindful of tokens matters. Each token takes up space in the model's working memory. So every extra line in a prompt should help the model do better work. If it does not, it may only add noise.
Every token should earn its place.
This was one of the biggest lessons for me. Many people think the final answer depends only on the main AI model. But in many systems, there is another important layer before the model even starts answering: embeddings.
Embeddings turn text into vectors so systems can compare meaning, not just keywords. They are used in search, retrieval, clustering, recommendations, and related tasks.
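Here is a minimal sketch of "comparing meaning" with vectors. The three-dimensional vectors are hand-made for illustration and do not come from any real embedding model; cosine similarity is one common way to compare embeddings, not the only one.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means "similar direction".
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (illustrative values only).
vectors = {
    "shrink my prompt": [0.9, 0.1, 0.0],
    "make the prompt shorter": [0.8, 0.2, 0.1],
    "bake a chocolate cake": [0.0, 0.1, 0.9],
}

query = vectors["shrink my prompt"]
for text, vec in vectors.items():
    print(f"{cosine_similarity(query, vec):.2f}  {text}")
```

The two prompt-related sentences share almost no keywords, yet their toy vectors point in nearly the same direction, while the cake sentence points somewhere else.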
So if the embedding model changes, the retrieved context can change. And if the retrieved context changes, the final answer can also change, even if the main language model stays the same.
That means the same LLM can behave differently depending on the embedding model used behind it.
Some embedding models even work better when text is formatted in a certain way. For example, Sentence Transformers documentation shows that some retrieval models perform best when queries and passages use different prefixes, such as "query:" and "passage:".
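A small sketch of what that formatting step might look like before text is sent to an embedding model. The helper names `as_query` and `as_passage` are hypothetical, and the exact prefix strings are an assumption: they depend on the model, so the model's own documentation should be checked first.

```python
def as_query(text):
    # Assumed prefix for text used as a search query.
    return "query: " + text.strip()

def as_passage(text):
    # Assumed prefix for text stored as a searchable passage.
    return "passage: " + text.strip()

queries = [as_query("how do I shorten a prompt?")]
passages = [
    as_passage("Remove repeated instructions and keep one clear example."),
    as_passage("Context windows limit how much text a model can read at once."),
]

for line in queries + passages:
    print(line)
```

The prefixed strings are what would be embedded, not the raw text. Using the wrong prefix, or none, may quietly change which passages get retrieved.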
The output is not only shaped by how the model writes. It is also shaped by what information reaches the model in the first place. That is a big deal.
This is where things became really interesting for me. If we want better compressed prompts, we should not only think about making prompts shorter. We should think about keeping the most useful meaning in fewer tokens. That is where the combination of a generation model and an embedding model becomes powerful.
A few patterns stand out:
1. Small embedding model + stronger LLM: A light embedding model can help find the most useful context quickly. Then a stronger LLM can work on that smaller, cleaner prompt. This can save both cost and time. OpenAI's embeddings documentation also shows that embedding models differ in size, cost, and benchmark performance.
2. Better retrieval = better prompt: Sometimes the problem is not the LLM. Sometimes the problem is that the wrong context was sent to it. Better embeddings can improve what gets selected, and that can improve the final answer, too.
3. Reranking helps: A system may first retrieve many useful pieces, then rerank them to choose the best ones. Sentence Transformers clearly separates embedding models and rerankers, which is helpful when building cleaner prompts.
4. Flexible embeddings help with compression: Research on Matryoshka Representation Learning, co-authored by Google researchers, shows that embeddings can be made flexible, so smaller dimensions can still keep useful meaning. Google also notes this idea in Gemini Embedding, where output dimensions can be reduced depending on the use case.
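The fourth pattern can be sketched in a few lines: keep only the first few dimensions of an embedding, then rescale it back to unit length. The vector below is made up, and `truncate_embedding` is a hypothetical helper. This only tends to work well for models trained with a Matryoshka-style objective; with other models, cutting dimensions may lose much more meaning.

```python
import math

def truncate_embedding(vector, dims):
    # Keep the first `dims` values, then rescale so the result has length 1.
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]

# Made-up 8-dimensional embedding (illustrative values only).
full = [0.6, 0.3, 0.2, 0.1, 0.05, 0.04, 0.02, 0.01]
small = truncate_embedding(full, 4)

print(len(full), "->", len(small), "dimensions")
print(small)
```

Smaller vectors mean less storage and faster comparison, at the price of some detail. How much detail is lost depends on the model and the task.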
Good compression is not about losing meaning. It is about carrying meaning better.
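One simple way to picture "carrying meaning better" is to rank pieces of context by relevance and keep only what fits a token budget. The sketch below is an illustration under stated assumptions, not a description of how TinyPress works internally: the relevance scores are invented (in practice they might come from embedding similarity or a reranker), and `rough_token_count` simply counts words instead of using a real tokenizer.

```python
def rough_token_count(text):
    # Assumption: one whitespace-separated word counts as one token.
    # A real tokenizer would give different numbers.
    return len(text.split())

def pack_to_budget(scored_chunks, budget):
    # scored_chunks: list of (relevance_score, text) pairs.
    # Greedily keep the most relevant chunks that still fit the budget.
    kept, used = [], 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for score, text in ranked:
        cost = rough_token_count(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept, used

# Invented relevance scores, for illustration only.
chunks = [
    (0.91, "Summarize the report in three bullet points."),
    (0.15, "By the way, the weather was lovely last week."),
    (0.78, "Focus on revenue and churn."),
    (0.40, "The report covers the third quarter."),
]

kept, used = pack_to_budget(chunks, budget=12)
print(used, "tokens kept")
for text in kept:
    print("-", text)
```

Greedy selection is only one possible choice. It is easy to reason about, but it can skip a slightly less relevant chunk that would have fit better alongside others.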
It is not just a way to shorten the prompt, but a simple way to explore the problem practically.
A lot of people know prompts can become too long. But it is harder to clearly see:
TinyPress helps make those changes visible. It is a small tool, but it helped me think more clearly about prompts, token limits, and semantic compression. And yes, the name is a bit of a pun too. Because sometimes the best ideas arrive when we give them a little press.
Because in the end, good compression is useful, but a good connection is even better.
If this speaks to your work or your curiosity, I would be very happy to hear your thoughts. Please feel free to share ideas, feedback, improvements, or even better puns. Let's build smaller, think more clearly, and help prompts say a little more with a little less.
- Harsha