Author Introduction
I’m Andrew McPherson (LinkedIn), and in my work helping organisations get cited by AI engines, I keep seeing the same blind spot: brilliant video, diagrams and PDFs that AI simply cannot read. In this article I’ll show you how to structure multi-modal assets so ChatGPT, Gemini, Claude and Perplexity can confidently discover, understand and cite them.
Outline
- What multi-modal signals are and why they matter
- How AI systems process images, video, PDFs, audio
- ImageObject, VideoObject, AudioObject schema essentials
- Alt text, transcripts, chapters and metadata basics
- PDF text layers, tagging and accessibility fundamentals
- Podcast structure, speaker attribution and RSS feeds
- Cross-modal consistency and content freshness practices
- CiteCompass perspective on multi-modal citation gains
Key Takeaways
- Non-text assets are citable when properly structured
- Schema markup unlocks images, video, audio and PDFs
- Alt text and transcripts are primary machine signals
- Chapters enable granular, timestamp-specific video citation
- PDFs need embedded text layers and PDF/UA tags
- Podcasts require episode schema and speaker attribution
- Cross-modal consistency strengthens AI trust signals
- Multi-modal optimisation lifts Share of Model coverage
What Are Multi-Modal Signals?
Multi-modal signals are non-text content formats that AI systems can process and extract information from. While traditional search optimisation focused almost exclusively on text, modern AI systems use specialised models to interpret images, videos, PDFs, audio files and other alternative formats. Typical examples include product images with structured metadata, demonstration videos with transcripts, technical specification PDFs with accessible text layers, webinar recordings with timestamped captions, and audio podcasts with machine-readable transcriptions.
AI systems process these signals through several parallel mechanisms. Vision models analyse image content directly, extracting visual features, detecting objects and reading embedded text. Schema markup provides structured metadata that helps AI systems understand context, purpose and relationships. Transcription engines convert audio and video into searchable text. Accessibility features such as alt text, captions and PDF metadata serve a dual purpose: they improve human accessibility while providing machine-readable signals that AI systems use for content understanding.
For B2B companies, multi-modal content represents substantial informational assets that often remain under-optimised for AI retrieval. Product demonstration videos contain detailed feature explanations. Technical specification PDFs include precise engineering data. Webinar recordings capture expert insights and customer questions. When properly structured and annotated, these assets become citable sources that AI systems can reference with confidence.
Why Multi-Modal Signals Matter for AI Visibility
AI systems increasingly incorporate multi-modal capabilities into their response generation. Google AI Overviews can cite images alongside text sources. ChatGPT with vision can analyse product screenshots and diagrams. Perplexity retrieves video content for how-to queries. Claude can process PDF documentation to answer technical questions. As AI models become more sophisticated in multi-modal understanding, B2B companies that optimise non-text content gain citation advantages across diverse query types.
Multi-modal signals provide evidence types that text alone cannot deliver. A product comparison query benefits from structured product images with specifications. A troubleshooting question may be best answered with a diagnostic flowchart or video tutorial. A compliance inquiry might require referencing certification documents and audit reports. When AI systems can retrieve and cite appropriate multi-modal content, they generate more comprehensive, accurate responses, and that increased utility translates into higher citation rates.
The technical reality is that AI systems use different retrieval mechanisms for different content types. Vision models scan for ImageObject schema to understand image context. Video retrieval systems prioritise content with VideoObject markup, transcripts and chapter markers. PDF extraction tools rely on text layers, metadata fields and accessibility tags. Audio processing pipelines depend on AudioObject schema and synchronised transcriptions.
Multi-modal optimisation also addresses content discovery challenges. Search engines and AI systems cannot reliably interpret unmarked images or videos without additional context. An unlabelled product image provides minimal signal. The same image with descriptive alt text, ImageObject schema specifying dimensions and usage context, and structured captions becomes a rich, citable source. This structured approach transforms passive visual assets into active citation opportunities.
For B2B contexts, multi-modal signals carry specific strategic value. Product images with detailed specifications help AI systems answer comparison queries. Demo videos with timestamped transcripts allow precise citation of specific features. Technical documentation PDFs with searchable text layers support detailed implementation questions. Webinar recordings with speaker attribution become authoritative sources for industry insights. Companies that systematically structure these assets build multi-modal Citation Authority.
How AI Systems Process Multi-Modal Content
AI systems employ specialised processing pipelines for each content type, and each pipeline has distinct technical requirements and optimisation opportunities.
Image Processing
Vision-enabled AI models analyse images through multiple layers of interpretation. Computer vision algorithms detect objects, read embedded text (OCR), identify logos and branding, recognise faces and products, and extract visual features such as colour palettes and composition patterns. These capabilities enable AI systems to answer visual queries without relying solely on surrounding text.
Vision models operate more accurately when combined with structured metadata. The ImageObject schema type provides essential context: image subject and purpose, dimensions and file format, copyright and licensing information, associated products or services, and creation or modification dates. When an AI system retrieves an image with comprehensive schema markup, it can cite that image with greater confidence and specificity.
Alt text serves as the primary text-based signal for images. Effective alt text describes image content accurately, provides relevant context, uses precise terminology and avoids keyword stuffing. For B2B applications, alt text should reference product model numbers, technical specifications, use case contexts and industry-specific terminology. Generic descriptions like “product image” provide minimal signal compared with “Model X-500 industrial valve cross-section showing internal seal assembly and flow path”.
Image sitemaps enable systematic discovery of visual assets. Google and other search engines use image sitemaps to identify images that may not be linked directly from HTML content. Note that Google has deprecated most of the extended image sitemap tags (captions, geo-location, titles and licences) in favour of structured data and embedded image metadata, so a sitemap's core job today is listing image URLs for discovery. See Google Search Central guidance on image sitemaps.
Video Processing
AI systems process video content through multiple parallel streams. Video transcription engines convert spoken audio into searchable text. Scene detection algorithms identify topic boundaries and visual transitions. Thumbnail analysis provides visual context and preview capabilities. Metadata extraction reads duration, resolution, encoding details and publication information. Speaker identification attributes statements to specific individuals when multiple people appear.
The VideoObject schema type structures this information for AI consumption. Required properties include name and description, upload date and duration, thumbnail URL, and transcript or caption files. Valuable optional properties include video chapters with timestamps, embedding information and player URLs, interaction statistics, content ratings and audience targeting, and associated products or services.
Transcripts dramatically improve video citability. AI systems can quote specific statements from videos when transcripts provide precise text and timestamps. For B2B companies, this means product demonstrations become citable for feature explanations, webinar recordings can be referenced for expert insights, customer testimonials provide quotable social proof, and training videos serve as authoritative how-to sources. Without transcripts, video content remains largely opaque to text-based retrieval systems.
Chapter markers and timestamped sections enable granular citation. A 45-minute product demonstration becomes far more useful when segmented into titled chapters such as “Overview and Use Cases”, “Setup and Configuration”, “Advanced Features” and “Integration Examples”. AI systems can cite specific chapters rather than entire videos, improving relevance and user experience.
PDF and Document Processing
AI systems extract information from PDFs through text layer parsing, metadata field reading and accessibility tag interpretation. PDFs with properly embedded text layers allow direct text extraction. Metadata fields (title, author, subject, keywords, creation date) provide structured context. Accessibility tags create document structure that helps AI systems understand headings, lists, tables and reading order.
Documents without embedded text layers (image-only PDFs) require optical character recognition, which introduces errors and reduces reliability. For technical specifications, compliance documents, white papers and case studies, ensuring PDFs include searchable text layers significantly improves AI citability.
PDF metadata optimisation involves several key fields. The title field should match the document’s primary heading. Author fields establish expertise and attribution. Subject and keyword fields provide topical signals. Creation and modification dates indicate freshness. For B2B technical documents, including model numbers, version identifiers and compliance certifications in metadata improves discoverability.
Document structure through accessibility tags (PDF/UA standards) helps AI systems parse complex documents correctly. Tagged PDFs distinguish body text from headers, identify table structures and relationships, mark lists and hierarchical content, and specify reading order for multi-column layouts. The W3C's PDF Techniques for WCAG 2.0 offer authoritative guidance on implementing these features.
Audio Processing
Audio content requires transcription for AI systems to extract semantic meaning. Automatic speech recognition converts audio to text, enabling the same retrieval mechanisms used for written content. The AudioObject schema type provides structured metadata: audio title and description, duration and encoding format, transcript URL or embedded text, associated episodes or series, and publication dates.
For B2B applications, audio content includes podcast episodes discussing industry trends, recorded client calls or consultations, audio-only webinars and presentations, voice-based product demonstrations, and earnings calls or investor presentations. Providing accurate transcripts with speaker attribution and timestamps makes this content citable.
Podcast optimisation for AI visibility involves several technical steps. Publish episode-level PodcastEpisode schema with transcripts. Include show notes with key topics and timestamps. Tag speakers and guests with Person schema. Categorise episodes using industry-standard taxonomies. Maintain RSS feeds with complete metadata so AI systems can parse podcast structure and cite specific episodes and quotes.
How to Optimise Multi-Modal Signals
Image Optimisation Implementation
Start by auditing existing images for missing or inadequate metadata. Identify product images, diagrams, screenshots, infographics, charts, graphs and team photos that lack proper markup. Prioritise high-value assets that support key products, services or thought leadership content.
Implement ImageObject schema for all significant images. Core properties should include contentUrl, name, description, width, height, encodingFormat, datePublished, creator, copyrightHolder, license and associatedArticle (note that schema.org spells the license property the American way). Populating these fields gives AI systems the context they need to cite the image with confidence.
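A minimal JSON-LD sketch follows; the URLs, product details and dates are placeholders, and the property set simply mirrors the list above rather than an exhaustive schema.org reference.

```html
<!-- Placed in the page's head or body; all values are illustrative -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "ImageObject",
  "contentUrl": "https://example.com/images/x500-valve-cross-section.jpg",
  "name": "Model X-500 industrial valve cross-section",
  "description": "Cross-section of the Model X-500 industrial valve showing the internal seal assembly and flow path.",
  "width": "1600",
  "height": "1200",
  "encodingFormat": "image/jpeg",
  "datePublished": "2025-06-01",
  "creator": { "@type": "Organization", "name": "Example Corp" },
  "copyrightHolder": { "@type": "Organization", "name": "Example Corp" },
  "license": "https://example.com/image-licence",
  "associatedArticle": "https://example.com/products/x-500"
}
</script>
```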
Write alt text that balances descriptiveness with conciseness. Effective patterns for B2B images include product identification with model numbers, technical context such as “cross-section showing internal components”, functional description such as “high-temperature pressure regulation”, and use case or application context. Avoid generic phrases like “product image” or keyword-stuffed text that sacrifices clarity.
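In HTML the difference is simply the alt attribute; this fragment reuses the valve example above, with illustrative file paths.

```html
<!-- Weak: no retrieval signal beyond the file name -->
<img src="/images/IMG_2041.jpg" alt="product image">

<!-- Strong: precise, citable description (product and path are illustrative) -->
<img src="/images/x500-valve-cross-section.jpg"
     alt="Model X-500 industrial valve cross-section showing internal seal assembly and flow path">
```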
Create an image sitemap to ensure comprehensive discovery. Google Search Central recommends XML sitemaps listing every image associated with a page; since the extended image tags were deprecated, licence, location and subject metadata is better expressed through structured data and embedded (IPTC) metadata. For large B2B sites with extensive product catalogues, image sitemaps ensure AI crawlers discover visual assets that may not be prominently linked.
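A bare-bones entry, assuming hypothetical URLs, looks like the sketch below; image:loc is the field that matters.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/products/x-500</loc>
    <!-- One image:image block per image on the page -->
    <image:image>
      <image:loc>https://example.com/images/x500-valve-cross-section.jpg</image:loc>
    </image:image>
    <image:image>
      <image:loc>https://example.com/images/x500-installed.jpg</image:loc>
    </image:image>
  </url>
</urlset>
```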
Video Optimisation Implementation
Implement VideoObject schema for all product demonstrations, tutorial content, webinars and presentations, customer testimonials and company overview videos. Comprehensive markup includes core identification (name, description, thumbnailUrl, uploadDate, duration), access (contentUrl, embedUrl, transcript), structural metadata using Clip entries for chapters, and publisher information.
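As a minimal sketch, a demonstration video's markup might look like this; names, URLs and dates are placeholders, and chapter markup using Clip is shown separately below.

```html
<!-- All values are illustrative -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "Model X-500 Product Demonstration",
  "description": "Walkthrough of X-500 setup, configuration and integration options.",
  "thumbnailUrl": "https://example.com/video/x500-demo-thumb.jpg",
  "uploadDate": "2025-09-15",
  "duration": "PT45M",
  "contentUrl": "https://example.com/video/x500-demo.mp4",
  "embedUrl": "https://example.com/embed/x500-demo",
  "transcript": "https://example.com/video/x500-demo-transcript",
  "publisher": { "@type": "Organization", "name": "Example Corp" }
}
</script>
```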
Provide complete transcripts in accessible formats. Transcripts should include timestamped speaker identification, complete dialogue for accuracy, descriptions of on-screen visual content, and section markers for topic transitions. Host transcripts as separate text files or WebVTT caption files so they are accessible both to AI systems and to users requiring accessibility accommodations.
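A short WebVTT fragment shows the pattern: cue timings, voice tags for speaker identification, and bracketed descriptions of on-screen content. Speaker names and timings here are illustrative.

```
WEBVTT

NOTE Illustrative excerpt; names and timings are placeholders

00:00:03.000 --> 00:00:08.500
<v Jane Smith>Welcome to the X-500 demonstration.
Today we cover setup and integration.

00:00:08.500 --> 00:00:15.000
<v Jane Smith>[On screen: valve cross-section diagram]
The internal seal assembly handles pressures up to 40 bar.
```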
For webinars and presentations with multiple speakers, implement speaker attribution within transcripts. This enables AI systems to quote specific individuals, which is particularly valuable for expert testimony, customer case studies and panel discussions. Use consistent name formats and link speakers to Person schema entities with biographical information.
Implement video chapters using the Clip type within VideoObject schema. Chapters provide semantic structure that helps AI systems understand video organisation and cite specific sections. For longer content such as demos over 10 minutes, webinars and training videos, chapters significantly improve usability and citation accuracy.
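Chapters attach to the video entity through hasPart; a sketch using the chapter titles mentioned earlier, with offsets in seconds and hypothetical deep-link URLs:

```html
<!-- Offsets and URLs are illustrative; each Clip's url deep-links to its start time -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "@id": "https://example.com/video/x500-demo",
  "hasPart": [
    {
      "@type": "Clip",
      "name": "Overview and Use Cases",
      "startOffset": 0,
      "endOffset": 480,
      "url": "https://example.com/video/x500-demo?t=0"
    },
    {
      "@type": "Clip",
      "name": "Setup and Configuration",
      "startOffset": 480,
      "endOffset": 1260,
      "url": "https://example.com/video/x500-demo?t=480"
    }
  ]
}
</script>
```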
PDF and Document Optimisation
Audit existing PDF assets for text layer completeness and metadata quality. Use PDF accessibility checkers to identify documents lacking searchable text or proper tagging. Prioritise high-value documents: product specification sheets, white papers and research reports, case studies and success stories, technical documentation and guides, and compliance certifications and audit reports.
Ensure all PDFs include embedded text layers. If source documents were created in word processors or design tools, export settings should enable text embedding. For scanned documents or image-based PDFs, apply OCR processing to create searchable text layers. Verify OCR accuracy for technical terminology, product names and industry-specific language.
Optimise PDF metadata fields systematically. The title should match the document’s main heading exactly. Author fields should reference individuals or organisational departments with established expertise. Subject and keyword fields should include relevant technical terms, product identifiers and industry categories. Creation and modification dates should reflect actual content updates, not just file conversion dates.
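Inside the file, these fields live in the PDF's document information dictionary and XMP metadata stream; the fragment below (values illustrative) sketches how they map onto Dublin Core and PDF properties.

```xml
<!-- Fragment of a PDF's XMP packet; all values are placeholders -->
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Model X-500 Industrial Valve: Technical Specification</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq><rdf:li>Engineering Team, Example Corp</rdf:li></rdf:Seq>
      </dc:creator>
      <pdf:Keywords>X-500, industrial valve, pressure regulation, ISO 9001</pdf:Keywords>
      <xmp:CreateDate>2025-06-01T09:00:00Z</xmp:CreateDate>
      <xmp:ModifyDate>2026-01-15T14:30:00Z</xmp:ModifyDate>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
```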
Implement PDF accessibility tagging (PDF/UA) for complex documents. Accessibility tags create semantic structure that benefits both human users with assistive technologies and AI systems parsing document content. Tagged PDFs enable more accurate extraction of tables, lists, headings and multi-column layouts.
Host PDFs with clear, descriptive URLs and link them from relevant HTML pages with contextual anchor text. Isolated PDFs without inbound links or context receive less crawl priority. Embedding PDFs in structured HTML pages with MediaObject schema creates discoverability pathways and provides additional context for AI retrieval systems.
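On the hosting page, a MediaObject sketch like the following (URLs and names hypothetical) gives AI systems a structured pointer to the file:

```html
<!-- Illustrative markup on the HTML page that links to the PDF -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "MediaObject",
  "name": "Model X-500 Technical Specification",
  "contentUrl": "https://example.com/docs/x-500-spec-v3.pdf",
  "encodingFormat": "application/pdf",
  "dateModified": "2026-01-15",
  "author": { "@type": "Organization", "name": "Example Corp Engineering" }
}
</script>
```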
Audio and Podcast Optimisation
Publish episode-level schema for all podcast content using PodcastEpisode and PodcastSeries types. This structured approach enables AI systems to understand podcast organisation, cite specific episodes and attribute statements to speakers.
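A hedged sketch of episode-level markup, with hypothetical names and URLs; note that the transcript property sits on the nested AudioObject, where schema.org defines it.

```html
<!-- All names, dates and URLs are placeholders -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "PodcastEpisode",
  "name": "Episode 42: Trends in Industrial Automation",
  "datePublished": "2025-11-03",
  "duration": "PT38M",
  "url": "https://example.com/podcast/episode-42",
  "associatedMedia": {
    "@type": "AudioObject",
    "contentUrl": "https://example.com/audio/episode-42.mp3",
    "encodingFormat": "audio/mpeg",
    "transcript": "https://example.com/podcast/episode-42/transcript"
  },
  "partOfSeries": {
    "@type": "PodcastSeries",
    "name": "Example Corp Engineering Podcast",
    "url": "https://example.com/podcast"
  }
}
</script>
```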
Provide complete transcripts for every episode. Transcripts should include timestamps at regular intervals (every 30-60 seconds for long-form content), speaker names for multi-host or interview formats, and descriptions of relevant non-verbal audio such as music, sound effects and demonstrations. Host transcripts as separate HTML pages or text files linked from episode pages.
Tag speakers and guests using Person schema with relevant biographical information. For B2B podcasts featuring industry experts, customers or company leadership, structured speaker attribution increases citation value. AI systems can reference “as stated by [Expert Name], [Title] at [Company]” when proper attribution exists.
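A Person entity for a guest might look like this (the name, title and employer are hypothetical):

```html
<!-- Illustrative guest entity; reference it from the episode markup -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Person",
  "@id": "https://example.com/people/jane-smith",
  "name": "Jane Smith",
  "jobTitle": "VP of Engineering",
  "worksFor": { "@type": "Organization", "name": "Example Corp" },
  "sameAs": ["https://www.linkedin.com/in/janesmith"]
}
</script>
```

Referencing this entity from the episode markup, for example via the contributor property, lets AI systems resolve quotes to "Jane Smith, VP of Engineering at Example Corp".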
Maintain comprehensive RSS feeds with complete metadata. Podcast RSS feeds should include episode titles and descriptions, publication dates and durations, transcript URLs, guest information and topics, and category classifications. Many podcast platforms support enhanced RSS extensions that AI systems can parse for richer understanding.
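An episode item might look like the sketch below, which assumes the itunes and Podcasting 2.0 podcast namespaces are declared on the feed's rss element; the podcast:transcript tag comes from the latter, and all URLs and values are placeholders.

```xml
<!-- Assumes xmlns:itunes and xmlns:podcast are declared on the <rss> element -->
<item>
  <title>Episode 42: Trends in Industrial Automation</title>
  <description>Jane Smith (Example Corp) on automation trends for 2026.</description>
  <pubDate>Mon, 03 Nov 2025 09:00:00 GMT</pubDate>
  <enclosure url="https://example.com/audio/episode-42.mp3"
             type="audio/mpeg" length="36480000"/>
  <itunes:duration>38:00</itunes:duration>
  <podcast:transcript url="https://example.com/podcast/episode-42/transcript.vtt"
                      type="text/vtt"/>
</item>
```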
Cross-Modal Consistency
Ensure information presented in multi-modal formats aligns with textual content. Product specifications shown in images should match written specifications. Pricing mentioned in videos should reflect current pricing pages. Statements in podcast episodes should not contradict published documentation. Inconsistencies between content formats degrade trust signals and reduce citation likelihood.
Implement systematic review processes for multi-modal content updates. When product features change, update demonstration videos, specification images, documentation PDFs and related text content simultaneously. Maintain modification date tracking across content types to signal freshness consistently.
CiteCompass Perspective
CiteCompass monitoring tracks how AI systems retrieve and cite multi-modal content alongside text sources. Many B2B companies invest substantially in video production, technical documentation and visual assets without implementing the structured markup that makes these assets discoverable and citable.
Our analytics identify multi-modal citation opportunities by showing which queries trigger image, video or document citations from competitors. When AI systems cite competitor product diagrams, tutorial videos or specification documents, that reveals both the query intent and the content format AI models prefer for those queries. Companies can prioritise multi-modal optimisation based on actual citation patterns rather than assumptions about content value.
The technical implementation of multi-modal signals requires coordination across content teams, development teams and SEO specialists. CiteCompass Professional Services provides implementation guidance that bridges these disciplines, ensuring multi-modal assets receive proper technical treatment.
Multi-modal optimisation delivers measurable improvements in citation diversity. Companies that structure images, videos and documents properly appear in a broader range of AI responses, including visual search results, video answer panels and document-based citations. This expanded presence increases overall Share of Model by capturing citation opportunities that text-only optimisation misses.
The rise of vision-enabled AI models such as GPT-4 Vision, Google Gemini and Claude with vision makes multi-modal optimisation increasingly important. As AI systems become more sophisticated in interpreting visual content, properly structured images and videos become primary sources rather than supplementary assets. B2B companies that implement comprehensive multi-modal strategies position themselves for this shift, and the AI Visibility Suite is designed to help you measure and improve performance across every modality.
What Changed Recently
- 2026-02-06: Published multi-modal signals spoke page with implementation guidance for images, videos, PDFs and audio content.
- 2026-01: Google enhanced AI Overview image citations, now showing structured ImageObject sources alongside text references.
- 2025-Q4: ChatGPT vision capabilities expanded to analyse technical diagrams and product screenshots in detail.
- 2025-Q4: Schema.org published updated VideoObject and Clip documentation emphasising transcript and chapter requirements.
- 2025-Q3: Perplexity began citing video content with timestamp-specific references for how-to queries.
References
Google Search Central (2024). Image sitemaps. Technical specifications for creating XML image sitemaps, how Google processes image metadata, and best practices for image discovery and indexing.
Schema.org (2024). VideoObject. Official documentation for VideoObject schema type, including required and recommended properties for video markup, transcript integration, chapter segmentation using Clip types, and embedding metadata.
W3C Web Accessibility Initiative (2024). PDF Techniques for WCAG 2.0. Technical guidance on implementing PDF accessibility features including text alternatives, document structure tags and metadata optimisation that improve both human accessibility and machine readability.
Related Topics
Explore related concepts in the Technical Implementation pillar:
Learn about Multi-Format Content in the Content Strategy pillar and AI Data Surfaces in the Core Frameworks pillar. Return to the CiteCompass Knowledge Hub to explore all six pillars of AI visibility optimisation.

