Tokenization

Tokenization splits text into smaller units called tokens (words, subwords, or characters) that AI models can process.

Tokenization is the process of splitting text into smaller units called tokens before an AI model processes it. A token can be a word, part of a word, or even a single character; for example, the Polish word 'niesamowity' ('amazing') might be split into 'nie' + 'samowity'.
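The subword splitting described above can be sketched as a greedy longest-match lookup against a vocabulary, which is roughly how BPE-style tokenizers behave at inference time. This is a toy illustration, not any real model's tokenizer, and the vocabulary below is invented for the example:

```python
# Toy greedy longest-match subword tokenizer: a sketch, not a real model tokenizer.
# VOCAB is an invented vocabulary for illustration only.
VOCAB = {"nie", "samowity", "token", "iza", "cja"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, falling back to characters."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a piece matches.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(tokenize("niesamowity"))  # ['nie', 'samowity']
```

Real tokenizers learn their vocabularies from data (via BPE, WordPiece, or similar), so the actual split of any given word depends on the model.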

Each model has its own tokenizer, which affects context-window consumption and API costs: the same text may yield different token counts in different models (e.g., GPT-4 vs. Gemini). Understanding tokenization is essential when planning chunking in RAG: a chunk of 500 words might be 600-800 tokens, depending on the language and model. Polish often tokenizes less efficiently than English because inflected Polish word forms generate more tokens.
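The chunk-budget arithmetic above can be sketched with a rough estimator. The words-per-token ratios below are illustrative placeholders, not official figures for any model; in practice you would measure them on your own corpus with the target model's tokenizer:

```python
# Rough token-count estimator for chunk budgeting: a sketch, assuming
# ratios measured empirically on your own corpus. The values below are
# illustrative, not official figures for any model.
WORD_TO_TOKEN_RATIO = {
    "english": 1.3,  # ~1.3 tokens per word (assumed)
    "polish": 1.6,   # inflected forms tend to split into more pieces (assumed)
}

def estimate_tokens(word_count: int, language: str = "english") -> int:
    """Estimate the token count of a chunk containing `word_count` words."""
    return round(word_count * WORD_TO_TOKEN_RATIO[language])

def estimate_cost(word_count: int, language: str, usd_per_1k_tokens: float) -> float:
    """Estimate API input cost for one chunk; the price is a placeholder parameter."""
    return estimate_tokens(word_count, language) * usd_per_1k_tokens / 1000

print(estimate_tokens(500, "english"))  # 650
print(estimate_tokens(500, "polish"))   # 800
```

With these assumed ratios, a 500-word chunk lands in the 600-800 token range the text describes, with the Polish estimate at the high end.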

In practice, measure token counts for your corpus with the target model's tokenizer before running pipelines, to avoid budget surprises.

Source: AI Semantic SEO Expert, Robert Niechciał (sensai.io)