Schema Markup for AI: Structured Data Types

Andrew McPherson

Author Introduction

Kia ora, I am Andrew McPherson, co-founder of CiteCompass. I have spent the past year helping B2B teams make their content legible to AI systems, and schema markup is where the real leverage sits. If you want ChatGPT, Claude, Gemini and Perplexity to cite you with confidence, structured data is no longer optional. Here is how to get it right.

Outline

What schema markup is and why it matters
How AI systems parse and use structured data
Core schema types every B2B site needs
Industry-specific schema for products and services
JSON-LD @graph pattern and @id consistency
Validation, pitfalls and common implementation errors
How CiteCompass uses schema for citation tracking
Recent changes shaping schema best practice

Key Takeaways

Schema converts prose into machine-verifiable facts
AI systems cite structured data more confidently
Six core schema nodes belong on every page
Consistent @id references build a unified knowledge graph
dateModified timestamps are critical freshness signals
Industry-specific types strengthen semantic precision
Validate JSON-LD before publication, every time
Schema completeness correlates with citation frequency

What Is Schema Markup for AI?

Schema markup is machine-readable structured data that explicitly defines what your content is about, who created it, when it was published, and how different entities relate to each other. Using the Schema.org vocabulary, a standardised taxonomy founded by Google, Microsoft, Yahoo and Yandex and maintained through an open community process, schema markup transforms ambiguous HTML prose into unambiguous facts that AI systems can parse, retrieve and cite with confidence.

For AI systems using Retrieval-Augmented Generation (RAG), schema markup serves three critical functions. First, it enables entity disambiguation, helping AI models distinguish your “Phoenix Systems” (manufacturing equipment) from “Phoenix Systems” (consulting firm) or “Phoenix Systems” (software vendor). Second, it provides explicit trust signals through author attribution, publication dates and organisational affiliations that AI systems use to evaluate source credibility. Third, it exposes freshness metadata through datePublished and dateModified timestamps that RAG systems prioritise when answering queries requiring current information.

Schema markup is implemented using JSON-LD (JavaScript Object Notation for Linked Data), a format that embeds structured data within HTML pages as a script block. Unlike older microdata formats that require inline HTML attributes, JSON-LD keeps structured data separate from visible content, making it easier to maintain and validate. AI crawlers parse JSON-LD blocks to extract factual information about pages, entities, relationships and content metadata independent of how that content is presented visually to human readers.
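As a minimal sketch of the script-block embedding described above, the following Python helper serialises a dictionary into a JSON-LD block. The function name render_json_ld and the example.com URL are illustrative assumptions, not part of any library.

```python
import json


def render_json_ld(data: dict) -> str:
    """Serialise a dict as a JSON-LD script block for an HTML page."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape for "/", so the payload still parses.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


page_schema = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "@id": "https://example.com/guide/#webpage",  # hypothetical URL
    "name": "Schema Markup for AI",
}

html_block = render_json_ld(page_schema)
print(html_block)
```

Generating the block from a data structure, rather than hand-editing JSON inside a template, is one way to keep the markup syntactically valid as pages change.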

Why Schema Markup Matters for AI Visibility

AI systems prioritise sources where they can verify claims, understand context and assess trustworthiness programmatically. Schema markup provides the explicit semantic signals that enable this verification process. When AI models retrieve content through RAG, they do not just scan visible text. They parse structured data to extract entities, validate relationships, check publication dates and assess authorship credentials. Pages with comprehensive schema markup provide AI systems with unambiguous facts, reducing hallucination risk and increasing citation likelihood.

For B2B companies, schema markup directly influences whether AI systems can answer fundamental questions about your business. “Who runs this company?” requires Organization and Person schema. “When was this information last updated?” requires dateModified timestamps. “What services does this firm provide?” requires Service schema with explicit serviceType properties. “Is this author a credible expert?” requires Person schema with jobTitle, knowsAbout and organisational affiliation. Without structured data, AI systems must infer answers from unstructured content, leading to lower confidence scores and reduced citation rates.

Schema markup also enables AI systems to construct knowledge graphs that connect your brand to related concepts, industries and capabilities. When you define a DefinedTerm for proprietary terminology (such as “AI Data Surfaces” or “Citation Authority”), AI models can reference that canonical definition when explaining your methodology. When you link Person entities to Organization entities through worksFor properties, AI systems understand authority relationships. When you connect Article entities to DefinedTerm entities through about and mentions properties, RAG systems can retrieve your content when queries match those semantic concepts.

The competitive advantage comes from specificity. AI systems increasingly weight structured data over unstructured content when both are available because structured data reduces parsing ambiguity. If a competitor’s pricing page lacks Offer schema while yours includes explicit price and priceCurrency properties, plus a nested UnitPriceSpecification carrying billingIncrement for subscription plans, AI systems are more likely to cite your pricing information when answering cost-related queries. If your service descriptions use Service schema with areaServed and serviceType properties while competitors rely on prose descriptions, AI models can accurately retrieve your capabilities when users ask location-specific or capability-specific questions.

Google Search Central documentation states that structured data helps search engines understand content and present it more effectively in results. That guidance is written for search, but the same logic carries over to AI-driven retrieval: RAG systems can retrieve schema-rich content more reliably, cite it more accurately and treat it with higher confidence when formulating responses.

How Schema Markup Works for AI Systems

AI systems access schema markup through two primary mechanisms: pre-indexing crawl processes and real-time retrieval during RAG queries. During the crawl phase, crawlers (including Googlebot, Bingbot, OpenAI’s GPTBot, Anthropic’s ClaudeBot and Perplexity’s PerplexityBot) parse JSON-LD blocks embedded in HTML pages. These blocks are extracted, validated against Schema.org type definitions and stored in structured indexes separate from unstructured page content. This pre-processing enables fast retrieval during RAG operations.

When an AI system receives a query requiring external information, its RAG pipeline performs multi-stage retrieval. The initial stage searches indexed content using semantic similarity between the query and stored embeddings. The verification stage then accesses structured data associated with retrieved documents to validate claims, check freshness and assess source authority. If a page includes TechArticle schema with an author Person entity, a publisher Organization entity and a recent dateModified timestamp, the RAG system can assign higher confidence to information extracted from that page compared to pages lacking these trust signals.

Schema markup works through typed entities and relationships. Every schema type (Article, Person, Organization, Product, Service) has a set of available properties defined in the Schema.org full hierarchy. Schema.org itself does not mark any property as mandatory; required and recommended properties are set by the systems that consume the markup, such as Google’s rich result guidelines. As a practical baseline, an Organization entity should include name, url and logo properties, and a Person entity should include name, jobTitle and worksFor properties. These properties create explicit relationships: when a Person’s worksFor property references an Organization’s @id, AI systems understand the employment relationship. When an Article’s author property references a Person’s @id, AI systems understand attribution.

The @id convention enables cross-entity referencing within and across pages. Every entity in your schema should have a unique @id (typically your URL plus a fragment identifier, such as https://example.com/#organization). Other entities can then reference this @id to establish relationships without duplicating data. Every article on your site can reference the same author Person @id, and that Person entity only needs to be fully defined once. This creates an explicit knowledge graph that AI systems can traverse when building context about your brand.
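The cross-referencing described above can be sketched in a few lines of Python. The entity names, job title and example.com identifiers are hypothetical, and the resolve helper is an illustration of how a consumer might follow references, not a description of any specific AI system.

```python
ORG_ID = "https://example.com/#organization"      # hypothetical identifiers
AUTHOR_ID = "https://example.com/#author-jane"
ARTICLE_ID = "https://example.com/guide/#article"

graph = [
    {"@type": "Organization", "@id": ORG_ID,
     "name": "Example Co", "url": "https://example.com/"},
    # The Person is defined once; worksFor points at the Organization by @id.
    {"@type": "Person", "@id": AUTHOR_ID, "name": "Jane Doe",
     "jobTitle": "Head of Content", "worksFor": {"@id": ORG_ID}},
    # The Article references both entities instead of duplicating their data.
    {"@type": "Article", "@id": ARTICLE_ID, "headline": "Example guide",
     "author": {"@id": AUTHOR_ID}, "publisher": {"@id": ORG_ID}},
]

index = {node["@id"]: node for node in graph}


def resolve(value, index):
    """Follow a pure {"@id": ...} reference to the node it points at."""
    if isinstance(value, dict) and set(value) == {"@id"}:
        return index[value["@id"]]
    return value


author = resolve(index[ARTICLE_ID]["author"], index)
employer = resolve(author["worksFor"], index)
print(author["name"], "works for", employer["name"])
```

Walking from the Article to its author and then to the employer is the kind of traversal that consistent identifiers make possible.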

Schema types form hierarchies. TechArticle inherits from Article, which inherits from CreativeWork, which inherits from Thing. When you declare a TechArticle, AI systems understand it possesses all properties available to its parent types. This inheritance enables precise type declarations while maintaining compatibility with systems that recognise parent types. For B2B companies, choosing the most specific applicable type (Service instead of Intangible, TechArticle instead of Article, SoftwareApplication instead of the generic CreativeWork) provides stronger semantic signals to AI systems. Note that SoftwareApplication sits under CreativeWork in the hierarchy, not under Product.
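To make the inheritance chain concrete, here is a small sketch that walks a hand-copied subset of the Schema.org hierarchy. The PARENT table covers only the types discussed here and is an assumption for illustration; the full hierarchy lives on Schema.org.

```python
# Hand-copied subset of the Schema.org type hierarchy (child -> parent).
PARENT = {
    "TechArticle": "Article",
    "Article": "CreativeWork",
    "SoftwareApplication": "CreativeWork",
    "CreativeWork": "Thing",
    "Service": "Intangible",
    "Intangible": "Thing",
    "Product": "Thing",
}


def ancestors(schema_type: str) -> list[str]:
    """Return the type followed by each parent up to Thing."""
    chain = [schema_type]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def is_a(schema_type: str, expected: str) -> bool:
    """True when schema_type is expected or one of its subtypes."""
    return expected in ancestors(schema_type)


print(ancestors("TechArticle"))
print(is_a("TechArticle", "Article"))
```

A consumer that only recognises Article can still handle a TechArticle this way, which is why declaring the most specific type costs nothing in compatibility.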

How to Optimise Schema Implementation

Core Schema Types for All B2B Companies

Every page on your site should include a minimum schema graph with six essential nodes: WebSite (global), WebPage (page-specific), Article or TechArticle (content), BreadcrumbList (navigation hierarchy), Organization (publisher) and Person (author). These six nodes provide AI systems with entity context, trust signals and content metadata regardless of your industry.
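A minimal sketch of the six-node graph, with a helper that reports which core types are missing. All URLs and names are placeholders, missing_core_types is a hypothetical helper, and the check assumes each node's @type is a single string.

```python
BASE = "https://example.com"  # placeholder domain
CORE_TYPES = {"WebSite", "WebPage", "BreadcrumbList", "Organization", "Person"}
ARTICLE_TYPES = {"Article", "TechArticle"}

graph = [
    {"@type": "WebSite", "@id": f"{BASE}/#website",
     "url": f"{BASE}/", "name": "Example Co"},
    {"@type": "WebPage", "@id": f"{BASE}/guide/#webpage",
     "url": f"{BASE}/guide/", "isPartOf": {"@id": f"{BASE}/#website"}},
    {"@type": "TechArticle", "@id": f"{BASE}/guide/#article",
     "headline": "Example guide",
     "author": {"@id": f"{BASE}/#author-jane"},
     "publisher": {"@id": f"{BASE}/#organization"},
     "mainEntityOfPage": {"@id": f"{BASE}/guide/#webpage"}},
    {"@type": "BreadcrumbList", "@id": f"{BASE}/guide/#breadcrumb",
     "itemListElement": [
         {"@type": "ListItem", "position": 1, "name": "Home", "item": f"{BASE}/"},
         {"@type": "ListItem", "position": 2, "name": "Guide", "item": f"{BASE}/guide/"},
     ]},
    {"@type": "Organization", "@id": f"{BASE}/#organization",
     "name": "Example Co", "url": f"{BASE}/"},
    {"@type": "Person", "@id": f"{BASE}/#author-jane",
     "name": "Jane Doe", "worksFor": {"@id": f"{BASE}/#organization"}},
]

# One JSON-LD document per page: a single @context and all nodes in @graph.
document = {"@context": "https://schema.org", "@graph": graph}


def missing_core_types(nodes: list[dict]) -> set[str]:
    """Report which of the six core node types are absent from a graph."""
    present = {node.get("@type") for node in nodes}
    missing = CORE_TYPES - present
    if not (present & ARTICLE_TYPES):
        missing.add("Article or TechArticle")
    return missing


print(missing_core_types(graph))
```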

Organization Schema: Define your company entity once and reference it across all pages using a consistent @id. Include your legal name, url, logo and sameAs properties linking to verified social profiles such as LinkedIn. For B2B companies, add contactPoint properties with telephone, contactType and areaServed to enable AI systems to answer contact-related queries accurately.

Person Schema: Define author entities with name, jobTitle, url (link to author bio or LinkedIn) and worksFor properties. For professional services firms, extend Person entities with knowsAbout properties listing expertise areas. AI systems use Person schema to evaluate author credibility when deciding whether to cite content.

Article and TechArticle Schema: Every content page should include Article (for marketing content, blog posts, case studies) or TechArticle (for technical documentation, implementation guides, API references) schema. At a minimum, include headline (must match your H1 exactly), author, datePublished, dateModified, publisher and description. Include about and mentions properties referencing DefinedTerm entities for proprietary concepts to create explicit semantic relationships.

Industry-Specific Schema Types

Product Schema (Manufacturing, Distribution): For physical products or equipment, use Product schema with model, manufacturer, mpn, brand and offers properties. Include PropertyValue entities in additionalProperty arrays to define technical specifications such as dimensions, materials, weight and performance characteristics. AI systems retrieve this structured data when answering specification queries.
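As an illustration of the Product pattern above, the sketch below builds technical specifications as PropertyValue entries. The product, part numbers, price and the spec helper are all invented for the example.

```python
def spec(name, value, unit_text=None):
    """Build one PropertyValue entry for a technical specification."""
    node = {"@type": "PropertyValue", "name": name, "value": value}
    if unit_text is not None:
        node["unitText"] = unit_text
    return node


product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "@id": "https://example.com/products/px-200/#product",  # hypothetical
    "name": "PX-200 Conveyor Drive",
    "model": "PX-200",
    "mpn": "PX200-A",
    "brand": {"@type": "Brand", "name": "Example Co"},
    "manufacturer": {"@id": "https://example.com/#organization"},
    "additionalProperty": [
        spec("Weight", 42, "kg"),
        spec("Housing material", "Aluminium"),
    ],
    "offers": {
        "@type": "Offer",
        "price": "1250.00",
        "priceCurrency": "NZD",
        "availability": "https://schema.org/InStock",
    },
}

for prop in product["additionalProperty"]:
    print(prop["name"], prop["value"], prop.get("unitText", ""))
```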

Service Schema (Professional Services, B2B Services): Use Service schema with serviceType, provider, areaServed and availableChannel properties. Create separate Service entities for each practice area or service offering. For geographic service businesses, set areaServed to GeoShape or City values to define coverage.
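One way to produce a separate Service entity per offering is a small factory function. The sketch below assumes a hypothetical URL scheme of /services/<slug>/ and uses City values for areaServed; the service names are invented.

```python
ORG_ID = "https://example.com/#organization"  # hypothetical identifier


def service_node(slug, name, service_type, cities):
    """Build one Service entity for a single practice area or offering."""
    page_url = f"https://example.com/services/{slug}/"
    return {
        "@type": "Service",
        "@id": f"{page_url}#service",
        "name": name,
        "serviceType": service_type,
        "provider": {"@id": ORG_ID},
        # areaServed takes typed values; City is used here for coverage.
        "areaServed": [{"@type": "City", "name": city} for city in cities],
        "availableChannel": {"@type": "ServiceChannel", "serviceUrl": page_url},
    }


services = [
    service_node("erp-consulting", "ERP Consulting",
                 "ERP implementation consulting", ["Auckland", "Wellington"]),
    service_node("data-audits", "Data Audits",
                 "Data quality auditing", ["Auckland"]),
]

for svc in services:
    print(svc["@id"], [area["name"] for area in svc["areaServed"]])
```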

SoftwareApplication Schema (SaaS, Software Vendors): Use SoftwareApplication schema with applicationCategory, operatingSystem, softwareVersion and offers properties. Include Offer entities defining pricing plans and eligibleRegion properties. For subscription models, attach a UnitPriceSpecification through the Offer’s priceSpecification property and set billingIncrement there, since billingIncrement belongs to UnitPriceSpecification and not to Offer itself. AI systems cite this structured pricing data when answering cost comparison queries.
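A sketch of subscription pricing for a SaaS product follows. In Schema.org, billingIncrement is a property of UnitPriceSpecification, so this example nests it under the Offer's priceSpecification. Plan names, prices, regions and the monthly_plan helper are invented, and the unit code MON (month) is assumed to be the appropriate UN/CEFACT code for monthly billing.

```python
def monthly_plan(name, price, currency, regions):
    """Build an Offer for a plan billed in one-month increments."""
    return {
        "@type": "Offer",
        "name": name,
        "price": price,
        "priceCurrency": currency,
        "eligibleRegion": regions,  # ISO 3166 country codes as text
        "priceSpecification": {
            "@type": "UnitPriceSpecification",
            "price": price,
            "priceCurrency": currency,
            "billingIncrement": 1,
            "unitCode": "MON",  # assumed UN/CEFACT code for month
        },
    }


app = {
    "@context": "https://schema.org",
    "@type": "SoftwareApplication",
    "@id": "https://example.com/product/#software",  # hypothetical
    "name": "Example Platform",
    "applicationCategory": "BusinessApplication",
    "operatingSystem": "Web",
    "softwareVersion": "2.4.0",
    "offers": [
        monthly_plan("Starter", "49.00", "USD", ["NZ", "AU"]),
        monthly_plan("Team", "199.00", "USD", ["NZ", "AU", "US"]),
    ],
}

for offer in app["offers"]:
    print(offer["name"], offer["price"], offer["priceCurrency"])
```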

DefinedTerm Schema (All Industries with Proprietary Terminology): Create DefinedTerm entities for proprietary concepts, methodologies, product names and specialised terminology. Group related terms in a DefinedTermSet, typically on a glossary or hub page. Reference these terms from article schema using about and mentions properties to create explicit semantic relationships that AI systems use when matching queries to content.
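The glossary pattern can be sketched as follows, using the two term names mentioned earlier. The descriptions are placeholders, and the glossary URL, fragment scheme and defined_term helper are assumptions made for the example.

```python
GLOSSARY_ID = "https://example.com/glossary/#termset"  # hypothetical


def defined_term(slug, name, description):
    """Build a DefinedTerm that points back at its DefinedTermSet."""
    return {
        "@type": "DefinedTerm",
        "@id": f"https://example.com/glossary/#{slug}",
        "name": name,
        "description": description,
        "inDefinedTermSet": {"@id": GLOSSARY_ID},
    }


terms = [
    defined_term("ai-data-surfaces", "AI Data Surfaces", "Placeholder definition."),
    defined_term("citation-authority", "Citation Authority", "Placeholder definition."),
]

# The glossary or hub page groups the terms.
term_set = {
    "@type": "DefinedTermSet",
    "@id": GLOSSARY_ID,
    "name": "Glossary",
    "hasDefinedTerm": [{"@id": term["@id"]} for term in terms],
}

# An article elsewhere on the site references the terms by @id only.
article = {
    "@type": "Article",
    "@id": "https://example.com/guide/#article",
    "headline": "Example guide",
    "about": [{"@id": terms[0]["@id"]}],
    "mentions": [{"@id": terms[1]["@id"]}],
}

print(article["about"], article["mentions"])
```

Using about for the primary subject and mentions for secondary concepts keeps the distinction between what a page is centrally about and what it merely touches on.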

Implementation Best Practices

Use the JSON-LD @graph Pattern: Embed all schema entities in a single JSON-LD block using @graph array syntax. This keeps related entities together, makes validation easier and ensures AI crawlers parse all structured data in a single pass.

Maintain Consistent @id References: Use the same @id pattern for entities that appear across multiple pages. Your Organization should always use the same canonical @id, your logo should always use the same @id, and author entities should follow a consistent URL plus fragment pattern. Consistent @id references enable AI systems to build unified knowledge graphs across your entire domain.
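Inconsistent identifiers are easy to catch mechanically. The sketch below lists references that point at an @id no node defines. The helper names are hypothetical, and the check only looks at top-level nodes in one graph, so site-wide identifiers defined on other pages are passed in through known_ids.

```python
def iter_refs(value):
    """Yield every @id found in pure reference objects nested in value."""
    if isinstance(value, dict):
        if set(value) == {"@id"}:
            yield value["@id"]
        else:
            for nested in value.values():
                yield from iter_refs(nested)
    elif isinstance(value, list):
        for item in value:
            yield from iter_refs(item)


def dangling_refs(graph, known_ids=frozenset()):
    """Return referenced @ids that no top-level node in the graph defines."""
    defined = {node["@id"] for node in graph if "@id" in node} | set(known_ids)
    referenced = {ref for node in graph for ref in iter_refs(node)}
    return referenced - defined


graph = [
    {"@type": "Organization", "@id": "https://example.com/#organization",
     "name": "Example Co"},
    # Typo: "organisation" instead of "organization" in the fragment.
    {"@type": "Person", "@id": "https://example.com/#author-jane",
     "name": "Jane Doe",
     "worksFor": {"@id": "https://example.com/#organisation"}},
]

print(dangling_refs(graph))
```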

Validate Schema Syntax: Use Google’s Rich Results Test and the Schema Markup Validator at validator.schema.org to check JSON-LD syntax and identify missing properties. Run validation on every new page before publication. Invalid JSON syntax prevents crawlers from parsing the block at all, and missing properties that a consumer requires can make the page ineligible for rich results.
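Before a page reaches the online validators, a local pre-check can catch the most basic failures. The sketch below is a complement to those tools, not a replacement. The EXPECTED table is a hypothetical house rule set rather than an official list of required properties, and the check assumes each node's @type is a single string.

```python
import json

# Hypothetical house rules, not an official required-property list.
EXPECTED = {
    "Article": {"headline", "author", "datePublished", "dateModified"},
    "Organization": {"name", "url"},
    "Person": {"name"},
}


def check_json_ld(raw: str) -> list[str]:
    """Return a list of problems found in one JSON-LD block."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    # Accept a single node, a list of nodes, or an @graph wrapper.
    nodes = doc.get("@graph", [doc]) if isinstance(doc, dict) else doc
    problems = []
    for node in nodes:
        node_type = node.get("@type")
        if not isinstance(node_type, str):
            continue
        for prop in sorted(EXPECTED.get(node_type, set()).difference(node)):
            problems.append(f"{node_type}: missing {prop}")
    return problems


sample = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Article", "headline": "Example guide",
         "author": {"@id": "https://example.com/#author-jane"},
         "datePublished": "2026-02-06"},
        {"@type": "Person", "name": "Jane Doe"},
    ],
})

print(check_json_ld(sample))
print(check_json_ld('{"@type": "Article",'))
```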

Synchronise Schema with Content: Ensure schema headline matches your H1 exactly, description matches your meta description, and dates match actual publication and modification dates. Contradictions between structured data and visible content degrade trust signals. When you update page content, update dateModified timestamps in both schema and page metadata.
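The headline and description checks above can be automated with the standard library's HTML parser. This is a sketch under simple assumptions: one H1 per page, a meta description in the head, and exact string matching after whitespace normalisation. PageFacts and sync_problems are hypothetical names.

```python
from html.parser import HTMLParser


class PageFacts(HTMLParser):
    """Collect the H1 text and meta description from a page."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.h1_parts = []
        self.meta_description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._in_h1 = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.meta_description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.h1_parts.append(data)


def sync_problems(html: str, article: dict) -> list[str]:
    """Compare an Article node against the visible page it describes."""
    facts = PageFacts()
    facts.feed(html)
    facts.close()
    h1 = " ".join("".join(facts.h1_parts).split())
    problems = []
    if article.get("headline") != h1:
        problems.append(f"headline {article.get('headline')!r} != H1 {h1!r}")
    if article.get("description") != facts.meta_description:
        problems.append("description differs from meta description")
    return problems


PAGE = (
    '<html><head><meta name="description" content="A guide."></head>'
    "<body><h1>Schema <em>Markup</em> for AI</h1></body></html>"
)

print(sync_problems(PAGE, {"headline": "Schema Markup", "description": "A guide."}))
```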

Common Schema Pitfalls

Duplicate Schema Blocks: Many WordPress plugins and themes inject default schema markup. Disable these plugins or configure them to output nothing when implementing custom JSON-LD. Multiple schema blocks describing the same entity with conflicting information confuse AI crawlers and reduce citation reliability.
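To spot conflicting blocks on a rendered page, one option is to extract every JSON-LD script and compare nodes that share an @id. The sketch below assumes well-formed JSON in each block and only inspects top-level nodes. The class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script block."""

    def __init__(self):
        super().__init__()
        self._buffer = None
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._buffer = None


def conflicting_ids(html: str) -> set[str]:
    """Return @ids that are defined more than once with different data."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    seen, conflicts = {}, set()
    for block in extractor.blocks:
        nodes = block.get("@graph", [block]) if isinstance(block, dict) else block
        for node in nodes:
            # Ignore @context so wrapper style does not count as a conflict.
            node = {k: v for k, v in node.items() if k != "@context"}
            node_id = node.get("@id")
            if node_id is None:
                continue
            if node_id in seen and seen[node_id] != node:
                conflicts.add(node_id)
            seen.setdefault(node_id, node)
    return conflicts


PAGE = (
    '<script type="application/ld+json">{"@context": "https://schema.org", '
    '"@type": "Organization", "@id": "https://example.com/#organization", '
    '"name": "Example Co"}</script>'
    '<script type="application/ld+json">{"@context": "https://schema.org", '
    '"@graph": [{"@type": "Organization", '
    '"@id": "https://example.com/#organization", '
    '"name": "Example Company Ltd"}]}</script>'
)

print(conflicting_ids(PAGE))
```

Running a check like this against the final rendered HTML, rather than the template, is what surfaces markup injected by plugins or themes.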

Incorrect @id Usage: Using different @id patterns for the same entity across pages prevents AI systems from building unified entity understanding. If your homepage defines your organisation with one @id but your blog posts use another, AI systems treat these as separate entities.

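Missing Required Properties: Schema.org itself does not mark properties as mandatory, but consumers such as Google’s rich result guidelines define required and recommended properties for each type. As a working baseline, give every Article a headline, datePublished and author, and every Organization and Person a name. Omitting properties that a consumer requires can make a page ineligible for rich results and leaves AI systems with less to verify.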

Stale Timestamps: Leaving dateModified unchanged when updating content signals to AI systems that information may be outdated. Update dateModified whenever substantive content changes, and leave it unchanged for purely cosmetic edits so the timestamp stays meaningful. Fresh timestamps signal reliability to RAG systems prioritising recent information.
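One way to keep dateModified honest is to tie it to a fingerprint of the page text. In this sketch, whitespace-only changes are treated as cosmetic, which is an assumption. What counts as substantive is an editorial decision, and both function names are hypothetical.

```python
import hashlib
from datetime import date


def content_fingerprint(text: str) -> str:
    """Hash the text with whitespace normalised, so reflowing is ignored."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def refresh_date_modified(article: dict, old_text: str, new_text: str,
                          today: date) -> dict:
    """Return the article node, bumping dateModified only on real changes."""
    if content_fingerprint(old_text) == content_fingerprint(new_text):
        return article
    return {**article, "dateModified": today.isoformat()}


article = {"@type": "Article", "headline": "Pricing",
           "datePublished": "2025-01-10", "dateModified": "2025-01-10"}

reflowed = refresh_date_modified(article, "Price is $10.", "Price  is $10.\n",
                                 date(2026, 2, 6))
repriced = refresh_date_modified(article, "Price is $10.", "Price is $12.",
                                 date(2026, 2, 6))
print(reflowed["dateModified"], repriced["dateModified"])
```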

CiteCompass Perspective

CiteCompass uses schema markup as the foundation for its AI Visibility Suite and Citation Authority measurement. When tracking how AI systems cite your content, CiteCompass analyses the relationship between schema completeness and citation frequency. Pages with comprehensive schema markup – all six core nodes plus industry-specific entities – achieve measurably higher citation rates than pages with minimal or missing structured data.

Schema markup enables CiteCompass to identify specific technical issues reducing your AI visibility. Missing author attribution prevents AI systems from evaluating source credibility. Absent dateModified timestamps cause RAG systems to deprioritise your content in favour of sources signalling freshness. Incomplete Organization schema prevents entity disambiguation, leading to citation misattribution when AI systems confuse your brand with similarly named entities.

CiteCompass Professional Services includes schema audit and implementation guidance tailored to your industry. For manufacturing companies, this includes Product schema with technical specification PropertyValue arrays. For professional services firms, this includes Person schema for practitioner directories and Service schema for practice area definitions. For SaaS companies, this includes SoftwareApplication schema with pricing Offer entities and integration capability properties.

The educational principle is straightforward: AI systems preferentially cite sources where they can programmatically verify claims. Schema markup transforms ambiguous content into verifiable facts. When AI models can parse your pricing as structured Offer entities instead of inferring it from prose, citation confidence increases. When AI systems can validate author expertise through Person schema instead of guessing from bylines, source trust increases. These mechanisms directly translate to measurable improvements in Share of Model and Citation Authority.

What Changed Recently

2026-02-06: Published Schema Markup for AI spoke page with B2B implementation guidance across industries
2025-Q4: Google Search Central updated structured data guidelines to emphasise dateModified as a freshness signal for AI systems
2025-Q3: Schema.org released version 26.0 adding SoftwareApplication extensions for SaaS-specific properties including billingIncrement and eligibleRegion
2025-Q2: Major AI systems (ChatGPT, Perplexity, Claude) began prioritising sources with comprehensive Person and Organization schema when evaluating source credibility
2025-Q1: Google Rich Results Test added validation checks for TechArticle schema with about and mentions properties

References

Google Search Central (2024). Understand how structured data works – Official guidelines for implementing JSON-LD structured data.
Schema.org (2024). Full Hierarchy – Complete vocabulary of schema types and properties with inheritance relationships.
Google Developers (2024). Rich Results Test – Validation tool for testing JSON-LD syntax and verifying structured data.
JSON for Linking Data. JSON-LD 1.1 specification – Reference specification for JSON-LD, the recommended format for embedding schema markup in HTML.
