Duplicate Detection (similarity 1.0)
EmbeddingsDuplicate Detection is the application of embeddings to automatically detect identical content based on cosine similarity near 1.0. In SEO site audits, it instantly identifies pages with the same content—often pagination pages, parameterized URLs, or copied articles. Similarity 1.0 means a duplicate for consolidation or canonical setup, while similarity 0.9-0.99 indicates cannibalization (a different phenomenon requiring different strategy). The process works on thousands of URLs in seconds—incomparable to manual page review in Screaming Frog.
In practice, generate embeddings of title tags for the entire site and calculate the cosine similarity matrix—duplicates appear as values 0.99-1.0. Sponsored articles and boilerplate content often have high similarity, making them excellent candidates for pruning.