MUM (Multimodal)

Theoretical Foundations
MUM (Multitask Unified Model) is Google's multimodal AI system that processes text, images, and video content, and was trained across 75 languages.

Multitask Unified Model — Google's multimodal system that understands text, images, and video across 75 languages simultaneously. MUM can analyze complex, multi-step queries such as "are the trekking boots I used on Mt. Fuji suitable for Kilimanjaro?"

In practice, MUM signals the direction SEO is heading — from purely textual optimization toward multimodal optimization. Embeddings, which are MUM's foundation, allow semantic comparison of text with images, video, or audio by mapping them into a shared vector space. According to Google, MUM is 1,000 times more powerful than BERT, and because it operates across 75 languages, content in one language can influence rankings in another.
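To make the embedding idea concrete, here is a minimal sketch of how a shared vector space enables cross-modal comparison. The vectors below are hypothetical toy values, not real model output; in a real multimodal system, a text embedding and an image embedding would be produced by the model itself and then compared with cosine similarity, as shown here:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors:
    # dot product divided by the product of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: in a real system these would come from a
# multimodal model that maps text and images into the same vector space.
text_embedding = [0.8, 0.1, 0.3]       # e.g. the query text
image_embedding = [0.7, 0.2, 0.4]      # e.g. a related product photo
unrelated_embedding = [0.1, 0.9, -0.5]  # e.g. an off-topic image

print(cosine_similarity(text_embedding, image_embedding))      # high: semantically related
print(cosine_similarity(text_embedding, unrelated_embedding))  # low: unrelated
```

The key point is that once text and media live in the same space, "is this image relevant to this query?" becomes a simple distance computation.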

In practice, optimize not only the body text but also image alt attributes and video descriptions, since MUM analyzes them together. In the era of multimodal AI Search, an image or a video can be cited alongside text, opening new opportunities for Information Gain.
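As a practical starting point for the advice above, a simple audit can flag images that lack descriptive alt text. This is an illustrative sketch using only Python's standard library (the sample HTML and file names are made up):

```python
from html.parser import HTMLParser

class AltAuditParser(HTMLParser):
    # Collects the src of every <img> tag missing a non-empty alt attribute.
    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            if not attr_map.get("alt"):
                self.missing_alt.append(attr_map.get("src", "(no src)"))

# Hypothetical page fragment: one image with descriptive alt text, one without.
page = """
<img src="boots.jpg" alt="Lightweight trekking boots on volcanic scree">
<img src="summit.jpg">
"""

parser = AltAuditParser()
parser.feed(page)
print(parser.missing_alt)  # → ['summit.jpg']
```

Images flagged this way are candidates for descriptive alt text that a multimodal system can interpret together with the surrounding copy.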

Source: AI Semantic SEO Expert, Robert Niechciał (sensai.io)