K-means
K-means is a clustering algorithm that partitions a dataset of points (such as embeddings) into k groups, where k is specified upfront. The algorithm iteratively assigns each point to the nearest centroid (cluster center) and moves each centroid to the mean of its assigned points, repeating until the assignments stop changing.
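The assign-and-update loop described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the pipeline's actual implementation: the function name `kmeans`, the random initialisation, and the toy 2-D points standing in for embeddings are all assumptions made for the example. In practice a library implementation such as scikit-learn's `KMeans` would normally be used.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd-style K-means on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    # Assumption: simple random initialisation from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy data: two well-separated groups of 2-D points (stand-ins for embeddings).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
```

The returned `labels` list gives the cluster index of each input point in order, and `centroids` holds the final cluster centers.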
In the semantic audit clustering pipeline, K-means is the primary algorithm for splitting keywords into thematic clusters: the keyword clustering skill runs it on Gemini embeddings. Its main advantage is that it tends to produce compact, similarly sized clusters, which suits content planning.
However, its main limitation is that k must be chosen upfront: a k that is too small gives clusters that are too broad, and a k that is too large gives clusters that are too narrow. The Silhouette Score helps with this choice: run K-means for each k from 5 to 30 and pick the value with the highest score.
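To make the selection criterion concrete, here is a small sketch of how a Silhouette Score can be computed for one clustering result. It assumes Euclidean distance and uses illustrative names and toy data; a real pipeline would more likely call a library routine such as scikit-learn's `silhouette_score`. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the point's score is (b - a) / max(a, b).

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:  # guard against all-identical points
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: a labelling that matches the two groups vs. one that mixes them.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])
```

Scores range from -1 to 1. The labelling that matches the real groups scores close to 1, while the mixed labelling scores much lower, which is why the highest score is taken as the best k.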
For example, 500 keywords with k=15 produce 15 clusters averaging about 33 keywords each (500 / 15 ≈ 33), where each cluster represents a potential article.
In practice, start with k = number_of_keywords / 30 (rounded to a whole number) as a baseline, test the k values within plus or minus 5 of that baseline, and select the one with the highest Silhouette Score.
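The rule of thumb above can be sketched as a small helper. The function names, the rounding to the nearest integer, and the clamping of k to the range 2 to n - 1 (where the Silhouette Score is defined) are assumptions for illustration. `score_for_k` is a hypothetical callable that would run K-means for a given k and return its Silhouette Score.

```python
def candidate_ks(n_keywords, per_cluster=30, spread=5, k_min=2):
    """Return the baseline k and the list of k values to test around it."""
    # Baseline: roughly one cluster per 30 keywords (rounding is an assumption).
    baseline = max(k_min, round(n_keywords / per_cluster))
    low = max(k_min, baseline - spread)
    # The silhouette is only defined for k up to n_keywords - 1.
    high = min(n_keywords - 1, baseline + spread)
    return baseline, list(range(low, high + 1))


def pick_k(candidates, score_for_k):
    """Select the candidate k with the highest score."""
    return max(candidates, key=score_for_k)


baseline, candidates = candidate_ks(500)

# Hypothetical scorer used only to demonstrate the selection step; a real one
# would cluster the embeddings with each k and return the Silhouette Score.
best_k = pick_k(candidates, score_for_k=lambda k: -abs(k - 15))
```

For 500 keywords this gives a baseline of 17 and candidate values 12 through 22, so the search covers 11 clusterings instead of the full 5 to 30 sweep.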