Semantic Audit Pipeline
The Semantic Audit Pipeline is a complete, multi-step automated site analysis process. It runs from initial crawling to final report generation through ten distinct steps. The process begins with site crawling and content parsing to Markdown using Jina Reader. Next, it generates embeddings through Gemini or Jina models, then performs topical clustering with K-means algorithms. The system analyzes content relationships by detecting duplicates (99%+ similarity) and cannibalization (90-99% similarity).
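The duplicate and cannibalization thresholds can be illustrated with a minimal sketch. It assumes that "similarity" means cosine similarity between page embeddings and that a score of exactly 0.99 counts as a duplicate; the text specifies neither. The function names and the toy two-dimensional vectors are hypothetical, and real embeddings from Gemini or Jina models would have many more dimensions.

```python
import math
from itertools import combinations

DUPLICATE_THRESHOLD = 0.99        # "99%+ similarity" in the text
CANNIBALIZATION_THRESHOLD = 0.90  # lower bound of the "90-99%" band


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_pairs(embeddings):
    """Split page pairs into duplicates and cannibalization candidates.

    `embeddings` maps a URL to its embedding vector. Pairs are visited
    in sorted URL order so the output is deterministic.
    """
    duplicates, cannibalization = [], []
    for (url_a, vec_a), (url_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= DUPLICATE_THRESHOLD:
            duplicates.append((url_a, url_b, sim))
        elif sim >= CANNIBALIZATION_THRESHOLD:
            cannibalization.append((url_a, url_b, sim))
    return duplicates, cannibalization


# Toy data: /a and /b are near-identical, /c overlaps with both, /d is unrelated.
pages = {
    "/a": [1.0, 0.0],
    "/b": [1.0, 0.01],
    "/c": [1.0, 0.3],
    "/d": [0.0, 1.0],
}
dups, cannib = classify_pairs(pages)
```

Comparing every pair is quadratic in the number of pages, which is manageable for a site of a few hundred pages; a larger site would likely need a vectorized or approximate nearest-neighbor approach.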
Advanced analysis includes computing Site Focus Score and Radius metrics, identifying outliers for content pruning, and conducting detailed content gap analysis comparing the knowledge graph against existing site content. The process concludes with automated report generation through the audit report generator. Each step saves results to files for persistence, making the pipeline resumable.
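The resumability described above can be sketched as a step runner that skips any step whose result file already exists. This is an illustration under assumptions, not the pipeline's actual implementation: the one-JSON-file-per-step layout, the function names, and the two toy steps are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_step(name, compute, out_dir):
    """Return (result, was_cached) for one step, reusing a saved file if present."""
    path = Path(out_dir) / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    result = compute()
    # Write to a temporary file and rename it, so an interrupted run
    # does not leave a half-written result that looks complete.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(result), encoding="utf-8")
    tmp.replace(path)
    return result, False


def run_pipeline(steps, out_dir):
    """Run (name, func) steps in order; each func receives earlier results."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    results, executed = {}, []
    for name, func in steps:
        results[name], cached = run_step(name, lambda: func(results), out_dir)
        if not cached:
            executed.append(name)
    return results, executed


# Hypothetical two-step pipeline standing in for the real ten steps.
demo_steps = [
    ("crawl", lambda results: ["/a", "/b", "/c"]),
    ("count", lambda results: {"pages": len(results["crawl"])}),
]

with tempfile.TemporaryDirectory() as workdir:
    first_results, first_run = run_pipeline(demo_steps, workdir)
    second_results, second_run = run_pipeline(demo_steps, workdir)
```

On the second call every step is loaded from disk, so nothing is recomputed. Deleting a single step's file forces only that step to run again, although steps that depend on it would keep their older saved results unless their files are deleted too.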
The system follows the principle of 'LLM for reasoning, Python for computation'. This hybrid approach ensures accurate calculations while maintaining intelligent interpretation of results. In practice, a 200-page site audit completes in 2–4 hours of automated processing and produces actionable reports with duplicates, cannibalization cases, gaps, and recommendations. Best practice is to test the pipeline on a small sample (20 pages) before processing the full site.
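As a rough illustration of the sample-first practice, the sketch below picks a reproducible 20-page sample and scales the quoted runtime to it. It assumes that runtime grows linearly with page count, which the text does not state, and the helper names and URL pattern are hypothetical.

```python
import random

# Figures from the text: a 200-page audit takes roughly 2 to 4 hours.
REFERENCE_PAGES = 200
REFERENCE_HOURS = (2.0, 4.0)


def pick_sample(urls, size=20, seed=0):
    """Return a reproducible random sample of at most `size` unique URLs."""
    unique = sorted(set(urls))
    if len(unique) <= size:
        return unique
    return random.Random(seed).sample(unique, size)


def estimate_minutes(pages):
    """Scale the reference runtime to `pages`, assuming linear scaling."""
    low, high = REFERENCE_HOURS
    return (
        pages * low * 60 / REFERENCE_PAGES,
        pages * high * 60 / REFERENCE_PAGES,
    )


all_urls = [f"/page-{i}" for i in range(200)]  # placeholder URLs
sample = pick_sample(all_urls)
low_min, high_min = estimate_minutes(len(sample))  # 12.0 and 24.0 minutes
```

Under the linear assumption, a 20-page trial run should finish in about 12 to 24 minutes, which is short enough to catch configuration problems before committing to the full audit.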