TinyPress, Tokens, and Learning to Say More with Less

Community Article Published June 15, 2026

Background

When we use AI, it feels like we are just talking to a smart system. 🤖 But under the hood, it is not really reading words the way we do. It reads tokens. Tokens can be whole words, parts of words, punctuation, or symbols. Different AI models can split the same sentence into different tokens, depending on the tokenizer they use.
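To make this concrete, here is a small sketch with two made-up tokenizers. Neither one is a real model's tokenizer; `word_tokens` and `chunk_tokens` are hypothetical names I am using only to show how the same sentence can turn into a different number of tokens.

```python
import re


def word_tokens(text: str) -> list[str]:
    """Toy tokenizer A: each word and each punctuation mark is one token."""
    return re.findall(r"\w+|[^\w\s]", text)


def chunk_tokens(text: str, size: int = 4) -> list[str]:
    """Toy tokenizer B: split every word into pieces of at most `size` characters."""
    pieces = []
    for word in word_tokens(text):
        pieces.extend(word[i:i + size] for i in range(0, len(word), size))
    return pieces


sentence = "Tokenizers disagree, surprisingly."
print(len(word_tokens(sentence)), word_tokens(sentence))
print(len(chunk_tokens(sentence)), chunk_tokens(sentence))
```

Real tokenizers are learned from data and are far more subtle than this, but the effect is the same: the token count depends on who is doing the splitting.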

That small detail matters more than it seems.

Every AI model has a token limit, also called a context window. This is the amount of text the model can handle at one time. Your instructions, chat history, examples, documents, and questions all need to fit inside that limit. And in transformer-based models, longer input also means more work for the model: in standard self-attention every token is compared with every other token, so the cost grows roughly with the square of the input length. 🥱
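A rough sketch of both ideas, the budget and the growing cost. The part names, token counts, and the 8192-token window are made-up numbers for illustration, and `attention_pairs` only counts token pairs in plain full self-attention; it ignores the optimizations real systems use.

```python
def fits_in_context(part_token_counts: dict[str, int], context_window: int) -> bool:
    """Everything sent to the model has to share one token budget."""
    return sum(part_token_counts.values()) <= context_window


def attention_pairs(n_tokens: int) -> int:
    """Token pairs scored by full self-attention (per layer, per head)."""
    return n_tokens * n_tokens


# Hypothetical prompt parts and their token counts.
parts = {"instructions": 300, "chat_history": 2500, "documents": 5000, "question": 50}

print(fits_in_context(parts, context_window=8192))
# Doubling the input length quadruples the number of pairs to score.
print(attention_pairs(2000) / attention_pairs(1000))
```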

So, token limits are not just random rules! They exist because models have limits on memory, speed, and cost. And for people using AI, that means tokens affect three important things:

Cost — many AI APIs charge by token count, including embeddings.
Speed — more tokens usually mean slower responses.
Quality — long prompts are not always better prompts. Sometimes they are just longer.
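The cost point is simple arithmetic, so here is a minimal sketch of it. The prices are placeholders I invented, not any provider's real rates, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated cost of one request when the API bills per token."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


# Placeholder prices (currency units per million tokens); check your provider's pricing.
IN_PRICE, OUT_PRICE = 2.0, 8.0

long_prompt = estimate_cost(6000, 500, IN_PRICE, OUT_PRICE)
short_prompt = estimate_cost(1500, 500, IN_PRICE, OUT_PRICE)
print(long_prompt, short_prompt)
```

One request is cheap either way. The difference shows up when the same prompt runs thousands of times.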

That is what got me thinking. What if we could make prompts smaller, but still keep the meaning strong? 🤔

Why users should care about token limits

When people first start using AI, the natural habit is to add more. 👉 More instructions. 👉 More examples. 👉 More background. 👉 More “just in case” context. I’ve done that too. 😁

But over time, I learned that AI does not always need more. It needs to be clearer. ✔️ A good prompt is not the one with the most words. ✔️ A good prompt is one that helps the model understand the task quickly and clearly.

That is why being mindful of tokens matters. Each token takes up space in the model’s working memory. So every extra line in a prompt should help the model do better work. If it does not, it may only add noise. 😱

Every token should earn its place.

Why embeddings matter too

This was one of the biggest lessons for me. Many people think the final answer depends only on the main AI model. But in many systems, there is another important layer before the model even starts answering: embeddings. 🧐

Embeddings turn text into vectors so systems can compare meaning, not just keywords. They are used in search, retrieval, clustering, recommendations, and related tasks.
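Here is a minimal sketch of what "comparing meaning" looks like in practice. The three-dimensional vectors are hand-made for illustration; a real embedding model would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors standing in for real embeddings.
query_vec = [0.9, 0.1, 0.0]   # "How do I reset my password?"
login_vec = [0.8, 0.3, 0.1]   # "I forgot my login credentials"
pizza_vec = [0.0, 0.2, 0.9]   # "Best pizza toppings"

print(cosine_similarity(query_vec, login_vec))
print(cosine_similarity(query_vec, pizza_vec))
```

The first two sentences share almost no keywords, yet their vectors point in nearly the same direction. That is the part keyword search misses.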

So if the embedding model changes, the retrieved context can change. And if the retrieved context changes, the final answer can also change — even if the main language model stays the same.

That means the same LLM can behave differently depending on the embedding model used behind it.
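A toy way to see this: below are two stand-ins for embedding models. Each one maps text to a set of features (think of it as a sparse binary vector) and compares sets with Jaccard overlap. Both "models" and all names here are hypothetical and much cruder than real embeddings, but they show how the same query can retrieve different context.

```python
def word_features(text: str) -> set[str]:
    """Stand-in model A: one feature per whole word."""
    return set(text.lower().split())


def trigram_features(text: str) -> set[str]:
    """Stand-in model B: one feature per run of three characters."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared features divided by all features."""
    return len(a & b) / len(a | b)


def top_passage(query: str, passages: list[str], featurize) -> str:
    """Return the passage most similar to the query under the given featurizer."""
    q = featurize(query)
    return max(passages, key=lambda p: jaccard(q, featurize(p)))


query = "compressing prompts"
passages = ["prompt compression tips", "poets share prompts daily"]

print(top_passage(query, passages, word_features))
print(top_passage(query, passages, trigram_features))
```

Model A only sees the exact word "prompts" and picks the poetry passage. Model B notices the shared word stems and picks the compression passage. The LLM downstream never changed, but it would be handed different context.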

Some embedding models even work better when text is formatted in a certain way. For example, Sentence Transformers documentation shows that some retrieval models perform best when queries and passages use different prefixes, such as “query: ” and “passage: ”.
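A minimal sketch of that formatting step. `add_retrieval_prefixes` is a hypothetical helper, and the default prefixes are an assumption: which prefix a model expects, if any, depends on the model, so check its model card before copying this.

```python
def add_retrieval_prefixes(
    queries: list[str],
    passages: list[str],
    query_prefix: str = "query: ",
    passage_prefix: str = "passage: ",
) -> tuple[list[str], list[str]]:
    """Prepend the role-specific prefix to each text before it is embedded."""
    prefixed_queries = [query_prefix + q for q in queries]
    prefixed_passages = [passage_prefix + p for p in passages]
    return prefixed_queries, prefixed_passages


qs, ps = add_retrieval_prefixes(
    ["how do token limits work"],
    ["A context window is the amount of text a model can handle at one time."],
)
print(qs[0])
print(ps[0])
```

The prefixed strings are what you would pass to the embedding model. Using the prefixes on a model that was not trained with them may not help, and could hurt.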

The output is not only shaped by how the model writes. It is also shaped by what information reaches the model in the first place. That is a big deal.

How model and embedding combinations help with compressed prompts

This is where things became really interesting for me. If we want better compressed prompts, we should not only think about making prompts shorter. We should think about keeping the most useful meaning in fewer tokens. That is where the combination of a generation model and an embedding model becomes powerful. 😎

A few patterns stand out:

1. Small embedding model + stronger LLM — A light embedding model can help find the most useful context quickly. Then a stronger LLM can work on that smaller, cleaner prompt. This can save both cost and time. OpenAI’s embeddings documentation also shows that embedding models differ in size, cost, and benchmark performance.

2. Better retrieval = better prompt — Sometimes the problem is not the LLM. Sometimes the problem is that the wrong context was sent to it. Better embeddings can improve what gets selected, and that can improve the final answer, too.

3. Reranking helps — A system may first retrieve many useful pieces, then rerank them to choose the best ones. Sentence Transformers clearly separates embedding models and rerankers, which is helpful when building cleaner prompts.

4. Flexible embeddings help with compression — Research on Matryoshka Representation Learning, co-authored by Google Research, shows that embeddings can be trained so that their first few dimensions already carry most of the useful meaning. Google also notes this idea in Gemini Embedding, where output dimensions can be reduced depending on the use case.
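The fourth pattern can be sketched in a few lines. The usual recipe, as I understand it, is to keep the leading dimensions and re-normalize. The vector below is made up, `truncate_and_normalize` is a hypothetical helper, and this only works well for models that were trained to support shorter dimensions.

```python
import math


def truncate_and_normalize(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values, then rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# Made-up 8-dimensional embedding; real ones are much longer.
full = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.05]
small = truncate_and_normalize(full, dims=4)
print(len(full), len(small))
print(small)
```

Half the dimensions means roughly half the storage and faster comparisons. With an ordinary embedding model that was not trained this way, cutting dimensions like this can throw away a lot of meaning.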

Good compression is not about losing meaning. It is about carrying meaning better.
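Putting the first three patterns together, here is a rough sketch of retrieve, rerank, then pack into a token budget. Everything in it is a stand-in: `cheap_score` plays the role of a light embedding model, `precise_score` plays the role of a reranker, and counting words is only a crude proxy for counting tokens.

```python
def rough_token_count(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word."""
    return len(text.split())


def cheap_score(query: str, passage: str) -> int:
    """Stand-in for embedding similarity: number of query words found in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def precise_score(query: str, passage: str) -> int:
    """Stand-in for a reranker: the cheap score plus a bonus for the exact phrase."""
    bonus = 1 if query.lower() in passage.lower() else 0
    return cheap_score(query, passage) + bonus


def build_context(query, passages, shortlist_size, token_budget):
    # Stage 1: a fast, rough pass keeps only the most promising candidates.
    shortlist = sorted(passages, key=lambda p: cheap_score(query, p), reverse=True)[:shortlist_size]
    # Stage 2: a slower, more careful pass reorders the shortlist.
    reranked = sorted(shortlist, key=lambda p: precise_score(query, p), reverse=True)
    # Stage 3: add passages in ranked order, skipping any that would exceed the budget.
    chosen, used = [], 0
    for passage in reranked:
        cost = rough_token_count(passage)
        if used + cost <= token_budget:
            chosen.append(passage)
            used += cost
    return chosen


passages = [
    "You can reset the router if the password page fails to load",
    "Choose reset password in settings",
    "Our office is closed on public holidays",
    "Password rules require twelve characters",
]
context = build_context("reset password", passages, shortlist_size=3, token_budget=12)
print(context)
```

In this toy run the router passage looks relevant to the cheap scorer but drops out: the reranker moves the settings passage ahead of it, and then it no longer fits the budget. The prompt ends up shorter and more on topic.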

Why I built TinyPress

TinyPress is not just a way to shorten a prompt. It is a simple way to explore the problem in practice.

A lot of people know prompts can become too long. But it is harder to clearly see: 😵‍💫

What can be removed?
What should stay?
How much difference does that make in token usage?

TinyPress helps make those changes visible. It is a small tool, but it helped me think more clearly about prompts, token limits, and semantic compression. And yes, the name is a bit of a pun too. Because sometimes the best ideas arrive when we give them a little press.
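To give a feel for the third question, here is a tiny before-and-after comparison. This is not TinyPress's actual code, only a sketch under simple assumptions: `savings_report` is a hypothetical helper, and it counts whitespace-separated words as a stand-in for a real tokenizer.

```python
def rough_token_count(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word."""
    return len(text.split())


def savings_report(original: str, compressed: str) -> dict:
    """Compare the rough token counts of a prompt before and after shortening."""
    before = rough_token_count(original)
    after = rough_token_count(compressed)
    saved = before - after
    saved_pct = round(100 * saved / before, 1) if before else 0.0
    return {"before": before, "after": after, "saved": saved, "saved_pct": saved_pct}


original = "Please could you kindly write a short summary of the following article for me"
compressed = "Summarize the following article briefly"
report = savings_report(original, compressed)
print(report)
```

A real tokenizer would give different numbers, but the habit is the useful part: measure before, measure after, and check that the meaning survived.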

Hugging Face Space: https://huggingface.co/spaces/build-small-hackathon/tiny-press
GitHub: https://github.com/SriharshaCR/tiny-press

Because in the end, good compression is useful, but a good connection is even better.

If this speaks to your work or your curiosity, I would be very happy to hear your thoughts. Please feel free to share ideas, feedback, improvements, or even better puns. Let’s build smaller, think more clearly, and help prompts say a little more with a little less 😉

— Harsha
