Embedding Normalization
Embedding normalization is the process of scaling embedding vectors to uniform (unit) length, ensuring comparability between vectors. Some models ship normalized embeddings 'out of the box': OpenAI's text-embedding-3-large, for example, returns 3072-dimensional vectors of unit length, so cosine similarity and dot product yield identical results.
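A minimal sketch of this equivalence, using random vectors in place of real model output: once both vectors have unit length, the dot product and the full cosine formula coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=3072)
b = rng.normal(size=3072)

# L2-normalize both vectors to unit length
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# For unit-length vectors the denominator is 1, so the two measures coincide
assert np.isclose(dot, cosine)
```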
However, embeddings from other models (such as Gemini's 768-dimensional text-embedding-004) may require manual L2 normalization. Note that normalization (dividing each vector by its L2 norm) is distinct from dimensionality reduction such as UMAP (e.g. reducing to 5 dimensions); both are common preprocessing steps before K-means clustering. Without normalization, clustering can produce uneven clusters because the algorithm is sensitive to the scale of the data.
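A minimal sketch of the clustering pipeline, assuming synthetic 768-dimensional vectors with uneven norms as a stand-in for real Gemini embeddings (the cluster count of 5 is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Synthetic stand-in for 768-dimensional embeddings whose norms vary widely
emb = rng.normal(size=(200, 768)) * rng.uniform(0.5, 20.0, size=(200, 1))

# L2-normalize each row so K-means' Euclidean distance
# behaves like cosine distance on the original vectors
emb_unit = normalize(emb, norm="l2")

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(emb_unit)
```

A UMAP reduction (e.g. to 5 dimensions via the `umap-learn` package) could be applied between the `normalize` and `KMeans` steps; it is omitted here to keep the sketch dependency-light.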
In practice: cosine similarity already divides by the vector norms, so for pairwise comparison (similarity search, duplicate detection) Gemini embeddings work fine as-is and explicit normalization is optional. For K-means clustering, which measures Euclidean distance, explicit L2 normalization (optionally combined with UMAP reduction) is recommended and typically improves results.
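This optional-vs-recommended rule can be wrapped in a small helper. The function name `ensure_unit_norm` is hypothetical, not part of any library API: it normalizes a batch of vectors only when they are not already unit length, so already-normalized embeddings pass through unchanged.

```python
import numpy as np

def ensure_unit_norm(vectors: np.ndarray, atol: float = 1e-6) -> np.ndarray:
    """Return L2-normalized rows of `vectors`; a no-op if they are
    already unit length (as with OpenAI text-embedding-3-large)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    if np.allclose(norms, 1.0, atol=atol):
        return vectors  # already normalized: cosine == dot product
    return vectors / norms

# Rows with norms 5.0 and 1.0: the helper rescales both to unit length
v = np.array([[3.0, 4.0], [0.6, 0.8]])
out = ensure_unit_norm(v)
```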