Category: Generative AI

  • Why Use Markdown When Search Engines Don’t Read It?

    In my last blog post, I compared documentation formats across major tech companies and came to the conclusion:

    Most modern developer-facing documentation is authored in Markdown, often paired with YAML or JSON metadata.

    But when I dug into the sources of the public-facing pages, I found:

    • All documentation is published as HTML
    • Search engines crawl and index the HTML, not the Markdown
    • Even when a page includes a link to the underlying .md file (like Microsoft Learn or React Native), search engines still ignore the Markdown

    So the natural question is:

    If Google and Bing only crawl HTML, why bother using Markdown at all?

    It’s a fair question.
    And the answer is:
    Search engines don’t read the Markdown. However, using Markdown ensures the final HTML is clean, consistent, and easy for search engines to understand.

    Let’s break down why.


    1. Markdown creates consistent, semantic content that build systems can transform into clean, crawlable HTML

    Markdown itself doesn’t “force” structure the way XML schemas do.
    But when tech companies use Markdown, they use it inside a controlled publishing pipeline with:

    • automated linting
    • required metadata
    • heading hierarchy rules
    • link validation
    • accessibility checks
    • build-time transformations

    Markdown limits what authors can do:

    • no inline CSS
    • no arbitrary fonts
    • no invisible <span> wrappers
    • no custom colors
    • no inconsistent indentation
    • no malformed HTML

    Because Markdown is intentionally minimal, authors can't accidentally introduce structural noise that breaks the final HTML.

    Meanwhile, HTML authored through WYSIWYG tools often contains:

    • messy nested tags
    • inline styling
    • inconsistent heading usage
    • malformed lists
    • copy-pasted formatting from Word/Google Docs

    That HTML looks fine to humans but is unreliable for:

    • SEO
    • accessibility
    • automated formatting
    • AI extraction
    • embedding pipelines

    Markdown → parsed via a deterministic engine → produces stable, semantic HTML that search engines interpret correctly.

    Search engines may crawl HTML — but that HTML is better because it comes from Markdown.
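
    As a minimal sketch of that idea, here is the kind of deterministic transformation a build pipeline performs (using the open-source Python markdown package; real pipelines layer linting, metadata, and link checks on top):

    import markdown  # pip install markdown

    source = "# Deploy to Cloud Run\n\nLearn how to deploy your first app.\n"

    # The parser is deterministic: the same Markdown always yields the same
    # clean, semantic HTML, with no inline styles or stray tags.
    print(markdown.markdown(source))
    # <h1>Deploy to Cloud Run</h1>
    # <p>Learn how to deploy your first app.</p>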


    2. Markdown keeps documentation consistent across thousands of pages and contributors

    Large documentation ecosystems involve:

    • hundreds of writers
    • thousands of pages
    • frequent updates
    • global teams
    • contributor submissions from the community

    If each author could format content however they wanted (as in WYSIWYG HTML systems), you’d quickly get:

    • drift in formatting
    • inconsistent UI
    • broken headings
    • unpredictable layouts

    Markdown prevents this simply by being limited:

    • headings are headings
    • lists are lists
    • code blocks are fenced
    • emphasis is standardized
    • content is always plain text

    And because Markdown lives in Git, every change goes through:

    • version control
    • pull requests
    • reviews
    • diff tools
    • automated lint checks

    That level of governance is impossible in most HTML-based CMS editors.
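
    To make "automated lint checks" concrete, here is a toy rule of the kind such pipelines enforce (a sketch, not any real linter's code; markdownlint and Vale are common real-world choices):

    import re

    def lint_heading_hierarchy(md_text):
        """Flag headings that skip a level, e.g. an H2 followed directly by an H4."""
        levels = [len(m.group(1)) for m in re.finditer(r"^(#{1,6}) ", md_text, re.M)]
        return [f"heading jumps from H{a} to H{b}"
                for a, b in zip(levels, levels[1:]) if b > a + 1]

    print(lint_heading_hierarchy("# Title\n### Oops\n"))  # ['heading jumps from H1 to H3']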


    3. Markdown is the source-of-truth for multi-channel publishing — not just HTML

    HTML is only one of the outputs produced from Markdown.

    Big tech companies use Markdown because from a single source file, the build pipeline can generate:

    • SEO-optimized HTML
    • JSON-LD (for schema.org metadata)
    • in-product help panes
    • mobile-friendly layouts
    • downloadable PDFs
    • interactive components (tabs, code toggles)
    • localized versions
    • sanitized versions for RAG
    • internal knowledge base variants

    If companies authored directly in WYSIWYG HTML, they would need separate versions of the same content for each channel.

    Markdown eliminates that duplication.

    You write once → the system generates everything.
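
    A minimal sketch of the write-once idea (illustrative only; the YAML front matter and the three outputs stand in for a real pipeline's many channels):

    import json, markdown, yaml  # pip install markdown pyyaml

    source = "---\ntitle: Deploy to Cloud Run\nauthor: Maggie Hu\n---\n# Deploy to Cloud Run\n\nLearn how to deploy your first app.\n"

    _, front_matter, body = source.split("---", 2)   # separate metadata from content
    meta = yaml.safe_load(front_matter)

    html = markdown.markdown(body)                                             # channel 1: SEO-ready HTML
    json_ld = json.dumps({"@type": "TechArticle", "headline": meta["title"]})  # channel 2: structured data
    rag_chunk = {"text": body.strip(), "metadata": meta}                       # channel 3: RAG ingestion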


    4. Markdown is ideal for internal AI/RAG pipelines — even if public crawlers ignore it

    Search engines crawl HTML. That’s fine.

    But companies increasingly build:

    • product Copilots
    • in-app assistants
    • enterprise RAG systems
    • internal chatbot experiences
    • developer help inside IDEs

    These internal systems do not crawl the public HTML.
    They ingest the source Markdown directly, because it provides:

    • clean text
    • predictable section boundaries
    • easy chunking based on H2/H3/H4
    • front matter metadata for filtering
    • embedding-friendly content
    • no UI noise

    Markdown is simply a better substrate for retrieval than HTML.

    And because these internal systems matter as much as (or more than) public search, Markdown becomes foundational.
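
    As an illustration, when the source is Markdown, heading-based chunking is nearly trivial (a sketch; production pipelines also attach front-matter metadata to every chunk):

    import re

    def chunk_by_headings(md_text):
        """Split a Markdown document into chunks at H2/H3/H4 boundaries."""
        parts = re.split(r"^(?=#{2,4} )", md_text, flags=re.M)
        return [p.strip() for p in parts if p.strip()]

    doc = "## Build\nBuild your image.\n### Registry\nPush it.\n## Deploy\nDeploy with Cloud Run.\n"
    print(chunk_by_headings(doc))
    # ['## Build\nBuild your image.', '### Registry\nPush it.', '## Deploy\nDeploy with Cloud Run.']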


    5. Markdown supports extensibility and semantic enhancements that HTML cannot express cleanly

    Modern documentation systems extend Markdown to carry semantics:

    • Apple DocC adds directives for API symbols and tutorials
    • Docusaurus (Meta) adds MDX for interactive components
    • Microsoft Learn adds custom Markdown for notes, warnings, code tabs, and includes

    These semantic hints help build:

    • richer HTML
    • structured data
    • searchable API references
    • component-based docs
    • better embeddings for RAG

    HTML could express these things, but only manually and inconsistently.

    Markdown extensions ensure that structure is carried through the entire pipeline.


    6. Markdown enables open collaboration — something HTML workflows do poorly

    When documentation lives in Markdown files on GitHub:

    • external users can fork the repo
    • contributors can propose edits
    • issues can be filed against specific lines
    • reviewers can comment inline
    • history is transparent

    This has become the foundation for open developer documentation.

    HTML-based CMSs rarely allow this level of collaboration without heavy engineering.


    Conclusion

    Even though Google, Bing, and GPT-style models crawl only the rendered HTML:

    • Big tech companies still author documentation in Markdown
    • They pair it with YAML/JSON front matter
    • Their build systems transform Markdown into high-quality, semantic HTML
    • Their AI/RAG systems rely on the Markdown, not the HTML
    • Their governance workflows depend on Markdown being in Git
    • Their multi-channel publishing depends on Markdown as a single source of truth

    In other words:

    Markdown is the authoring format.
    HTML is just one of the publishing formats.

    One is the “source code.”
    The other is the compiled artifact.


  • What are big tech companies using for their documentation formats?

    In my last post, I looked at Markdown, JSON, YAML, and XML. My take was that Markdown + YAML/JSON offers the best mix of being easy to read and easy for AI to understand.

    But I wanted to sanity-check that.
    Are big tech companies actually using these formats?
    I did a quick review of what’s publicly visible on their main developer docs. It’s not a full audit—each company has many systems—but it gives a good sense of the patterns.

    Here’s what I found:

    Company    | External Docs                | Internal Docs (public info only) | Main Format
    Microsoft  | Markdown + YAML              | Markdown                         | Markdown + YAML
    Google     | Markdown-based static sites  | g3doc Markdown + Google Docs     | Markdown
    Apple      | DocC Markdown                | DocC Markdown                    | Markdown + directives
    AWS        | Markdown or reStructuredText | Markdown + reST                  | Markdown
    Meta       | Markdown/MDX + Docusaurus    | Markdown-based wikis             | Markdown
    IBM        | DITA XML                     | DITA XML                         | XML/DITA
    Adobe      | DITA XML                     | DITA XML                         | XML/DITA
    Cisco      | DITA XML + Markdown          | DITA XML                         | Mixed
    Oracle     | DITA XML + some Markdown     | DITA XML                         | Mixed
    SAP        | Markdown                     | DITA XML                         | Mixed

    A few things to keep in mind:

    • This list reflects publicly observable documentation, contributor guides, and established industry practices—not internal implementation details.
    • Each company often supports multiple documentation tools depending on product line, age of system, regulatory requirements, and internal ownership.
    • The “main format” here reflects only the examples I could verify.

    What seems to be happening?

    From what I can see:

    • Many cloud-era developer sites tend to favor Markdown with YAML or JSON metadata. This works well for humans, and it also helps with AI indexing and RAG.
    • Structured XML systems like DITA remain widely used in industries where versioning, translation workflows, and governance are deeply embedded (e.g., enterprise hardware, long-standing enterprise software portfolios).
    • Even companies with long XML histories now publish some developer content in Markdown, especially for SDKs and portals.

    None of this is good or bad. Each format exists for valid historical and operational reasons.

    Disclaimers

    Methodology
    This comparison comes from public docs, contributor guides, and open-source materials. It’s not a full map of everything these companies use.

    No judgment
    Formats aren’t “modern” or “old.” They reflect the size of the company, the type of products, and how long their systems have been around.

    Accuracy
    If anything here seems outdated or incomplete, I’m happy to update it. Documentation systems change often.

    Scope
    This review looks only at developer-facing docs, since those are publicly accessible. Internal and proprietary systems are outside the scope.

  • Markdown, JSON, YAML, and XML – what is the best content format for both humans and AI?


    Humans and AI retrieve and consume our content differently. In this post, I want to discuss the best balance between content for humans and content for AI.

    In my previous posts, I recommended using structured content so that AI can chunk, understand, and retrieve it more effectively. When we talk about structured content, we often look at these document formats: Markdown, JSON, XML, and YAML.

    So, which document formats are best for both humans and AI? Let's take a look at each of them:

    Markdown (.md)

    What it is:
    Markdown is a lightweight markup language designed to make writing for the web simple and readable. It uses plain text syntax (like # for headings or - for lists) that converts easily to HTML.

    Example:

    # Deploying to Cloud Run
    Learn how to deploy your first app.
    
    ## Steps
    1. Build your image
    2. Push to Container Registry
    3. Deploy with Cloud Run
    

    Industry Example:

    • Microsoft Learn and Google Developers both use Markdown as their primary authoring format.
    • Articles on learn.microsoft.com are .md files stored in GitHub repos like microsoftdocs/azure-docs.
    • AWS, GitHub, and OpenAI also use Markdown for documentation and developer guides.

    Why humans like it:

    • Clean, minimal, and intuitive — almost like writing an email.
    • Easy to learn, edit, and version-control in Git.
    • Highly readable even before rendering.

    Why AI likes it:

    • Semantically structured (headings, lists, tables) without layout noise.
    • Perfect for chunking and embedding for retrieval-augmented generation (RAG) or Copilot ingestion.
    • Mirrors the formats LLMs are trained on (GitHub, documentation, etc.).

    Trade-offs:

    • Limited metadata support compared to JSON/YAML.
    • Not ideal for representing complex relational data.

    Best for:
    Readable documentation, tutorials, conceptual and how-to content consumed by both humans and AI.

    JSON (.json)

    What it is:
    JavaScript Object Notation (JSON) is a structured data format using key–value pairs. It’s widely used for APIs, configurations, and machine-to-machine communication.

    Example:

    {
      "title": "Deploy to Cloud Run",
      "steps": [
        "Build your image",
        "Push to Container Registry",
        "Deploy with Cloud Run"
      ],
      "author": "Maggie Hu"
    }
    

    Industry Example:

    Use Case                | Example                       | Purpose
    Microsoft Learn Catalog | JSON for doc metadata         | AI indexing and discovery
    Google Vertex AI        | JSON for prompt documentation | LLM instruction structuring
    OpenAI Function Docs    | JSON as documentation schema  | Model understanding
    Schema.org JSON-LD      | JSON for semantic content     | AI/web discoverability

    Why humans like it:

    • Familiar to developers and easy to read for small datasets.
    • Ideal for storing structured data or configuration.

    Why AI likes it:

    • Clear, unambiguous key-value structure for precise information retrieval.
    • Ideal for embedding metadata and reasoning in structured formats.
    • Natively supported as input/output format for LLMs.

    Trade-offs:

    • Harder for non-technical readers to interpret.
    • Not suitable for long-form narrative text.

    Best for:
    Metadata, structured data exchange, and AI pipelines requiring precise context.
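
    The "clear, unambiguous structure" point is easy to see in code; a trivial sketch using Python's standard library:

    import json

    record = json.loads('{"title": "Deploy to Cloud Run", "author": "Maggie Hu"}')
    print(record["title"])  # precise, key-based access: Deploy to Cloud Run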

    YAML (.yml / .yaml)

    What it is:
    YAML (“YAML Ain’t Markup Language”) is a human-friendly data serialization format often used for configuration files. It’s similar to JSON but uses indentation instead of braces.

    Example:

    title: Deploy to Cloud Run
    description: Learn how to deploy your first containerized app.
    steps:
      - Build your image
      - Push to Container Registry
      - Deploy with Cloud Run
    author: Maggie Hu
    

    Industry Example:

    • Microsoft Learn, GitHub Pages (Jekyll), and Hugo/Docsy sites use YAML front matter at the top of Markdown files to store metadata like title, topic, author, and tags.
    • Kubernetes defines all infrastructure configuration (pods, deployments, secrets) in YAML.
    • GitHub Actions uses YAML to describe CI/CD workflows (.github/workflows/main.yml).

    Why humans like it:

    • Clean indentation mirrors logical hierarchy.
    • Excellent for connecting content with structured metadata.
    • Easy to read and edit directly in Markdown front matter.

    Why AI likes it:

    • Provides machine-parsable structure with human-friendly syntax.
    • Used widely for prompt templates, model configuration, and structured metadata ingestion.

    Trade-offs:

    • Sensitive to spacing and indentation errors.
    • Can be ambiguous when representing data types.

    Best for:
    Config files, front-matter metadata, and hybrid human–AI authoring systems.
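
    To show how directly that syntax maps to data, here is the example above parsed in Python (a sketch; PyYAML is one common parser):

    import yaml  # pip install pyyaml

    front_matter = "title: Deploy to Cloud Run\nsteps:\n  - Build your image\n  - Push to Container Registry\n"
    meta = yaml.safe_load(front_matter)
    print(meta["steps"][0])  # indentation became structure: Build your image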

    XML (.xml)

    What it is:
    eXtensible Markup Language (XML) is a tag-based format for representing structured data hierarchies. It’s verbose but powerful for enforcing schema-based content consistency.

    Example:

    <task id="deploy-cloud-run">
      <title>Deploy to Cloud Run</title>
      <steps>
        <step>Build your image</step>
        <step>Push to Container Registry</step>
        <step>Deploy with Cloud Run</step>
      </steps>
    </task>
    

    Industry Example:

    • IBM, the creator of DITA, and companies like Cisco, Oracle, and Adobe use XML-based DITA systems for large-scale technical documentation.
    • Financial, aerospace, and medical industries rely on XML for regulated documentation and content validation (e.g., FAA, FDA compliance).
    • Microsoft’s legacy MSDN and Office help systems were XML-based before their Markdown migration.

    Why humans (used to) love it:

    • Strict structure ensures consistency and reusability.
    • Excellent for translation and compliance workflows.

    Why AI doesn’t love it as much:

    • Verbose, token-heavy, and less semantically clean for LLMs.
    • Requires preprocessing to strip tags for content embedding.
    • Complex to maintain for open collaboration.

    Trade-offs:

    • Ideal for governance and reuse, but difficult for readability.
    • Better suited for enterprise content management systems than AI retrieval.

    Best for:
    Regulated or legacy technical documentation requiring schema validation.
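
    The "requires preprocessing to strip tags" point looks like this in practice (a sketch using Python's standard library on the example above):

    import xml.etree.ElementTree as ET

    xml_doc = "<task id='deploy-cloud-run'><title>Deploy to Cloud Run</title><steps><step>Build your image</step><step>Push to Container Registry</step></steps></task>"

    # Flatten the markup to plain text before embedding.
    root = ET.fromstring(xml_doc)
    text = " ".join(t.strip() for t in root.itertext() if t.strip())
    print(text)  # Deploy to Cloud Run Build your image Push to Container Registry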

    Summary: Human vs. AI Alignment

    Takeaway

    The best format for both humans and AI is Markdown enhanced with YAML or JSON metadata.
    Markdown provides readability and natural structure for human writers, while YAML and JSON add the precision and hierarchy that AI systems rely on for retrieval, linking, and reasoning.

  • Use Microsoft Knowledge agent for your enterprise knowledge management


    In the era of AI, what does knowledge management (KM) truly mean? Is it about storing information, or making knowledge dynamic, discoverable, and actionable in real time?

    For decades, knowledge management has focused on capturing and organizing information—wikis, document libraries, and structured taxonomies. But today's organizations need more than static repositories. They need systems that surface answers instantly, connect insights across silos, and turn content into action.

    Each organization’s KM strategy depends on its unique mix of content types, governance needs, and user expectations. Some rely on structured formats and rigid taxonomies; others have sprawling repositories of Office files, PDFs, and web pages.

    On September 18th, Microsoft released its Knowledge Agent (preview). The agent supports a wide range of enterprise knowledge management tasks, such as:

    • Ask questions about your content
    • Summarize files
    • Compare content
    • Generate FAQ from files
    • Create audio overviews (Word & PDF)
    • Review and fix a SharePoint site
    • Create SharePoint pages, sections, and content
    • Refine SharePoint pages

    The agent currently supports:

    • Microsoft Office files (doc, docx, ppt, pptx, and xlsx)
    • Modern Microsoft 365: FLUID, LOOP
    • Universal: PDF, TXT, RTF
    • Web files: ASPX, HTM, HTML
    • OpenDocument: ODT, ODP

    This is especially powerful for organizations that don’t have structured file types like Markdown or JSON but still want AI-driven KM. Instead of forcing a migration to rigid formats, Knowledge Agent works with what you already have.

    Traditional KM tools often require heavy upfront structuring—taxonomies, metadata, and governance models. But in reality, most enterprises have unstructured or semi-structured content scattered across SharePoint, Teams, and legacy systems. Knowledge Agent bridges that gap by:

    • Reducing friction: No need to reformat everything into specialized schemas.
    • Enhancing discoverability: Natural language Q&A over your existing content.
    • Accelerating content improvement: Automated site reviews and page refinements.

    In short, it’s a practical way to unlock the value of your existing knowledge assets while layering in AI capabilities.

    What do you think of the AI era of enterprise knowledge management? What solution will you choose?

  • From Learning Design to Prompt Design: Principles That Transfer


    As a learning designer, I’ve worked with principles that help people absorb knowledge more effectively. In the past few years, as I’ve experimented with GenAI prompting in many ways, I’ve noticed that many of those same principles transfer surprisingly well.

    I mapped a few side by side, and the parallels are striking. For example, just as we scaffold learning for students, we can scaffold prompts for AI.

    Here’s a snapshot of the framework:

    • Clear objectives → Define prompt intent
    • Scaffolding → Break tasks into steps
    • Reduce cognitive load → Keep prompts simple
    • And more…

    Instructional design and prompt design share more than I expected.
    Which of these parallels resonates most with your work?

  • Designing prompts that encourage AI reflection


    Ever had GenAI confidently answer your question, then backtrack when you challenged it?


    Example:
    I: Is the earth flat or a sphere?
    AI: A sphere.
    I: Are you sure? Why isn’t it flat?
    AI: Actually, good point. The earth is flat, because…


    This type of conversation with AI happens to me a lot. Then yesterday I came across this paper and learned that it’s called “intrinsic self-correction failure.”

    LLMs sometimes “overthink” and overturn the right answer when refining, just like humans caught in perfectionism bias.

    The paper proposes that repeating the question can help AI self-correct.

    From my own practice, I’ve noticed another helpful approach: asking the AI to explain its answer.
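
    For example, an illustrative exchange (paraphrased, not a real transcript):

    I: Is the earth flat or a sphere?
    AI: A sphere.
    I: Before you finalize that, explain the evidence behind your answer.
    AI: Satellite imagery, circumnavigation routes, and the curved shadow Earth casts during lunar eclipses all point the same way. I'll keep my answer: a sphere.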



    When I do this, the model almost seems to “reflect.” It feels similar to reflection in human learning. When we pause to explain our reasoning, we often deepen our understanding. AI seems to benefit from a similar nudge.

    Reflection works for learners. Turns out, it works for AI too.
    How do you keep GenAI from “over-correcting” itself?

  • Turn GitHub Copilot Into Your Documentation Co-Writer


    For documentation writers managing large sets of content—enterprise knowledge bases, multi-product help portals, or internal wikis—the challenge goes beyond polishing individual sentences. You need to:

    • Keep a consistent voice and style across hundreds of articles
    • Spot duplicate or overlapping topics
    • Maintain accurate metadata and links
    • Gain insights into content gaps and structure

    This is where GitHub Copilot inside Visual Studio Code stands out. Unlike generic Gen-AI chatbots, Copilot has visibility across your entire content set, not just the file you're editing. With carefully crafted prompts and instructions, you can ask it to:

    • Highlight potential gaps, redundancies, or structural issues.
    • Suggest rewrites that preserve consistency across articles.
    • Surface related content to link or cross-reference.

    In other words, Copilot isn’t just a text improver—it’s a content intelligence partner for documentation at scale. And if you’re already working in VS Code, it integrates directly into your workflow without requiring a new toolset.

    What Can GitHub Copilot Do for Your Documentation?

    Once installed, GitHub Copilot can work directly on your .md, .html, .xml, or .yml files. Here’s how it helps across both single documents and large collections:

    Refine Specific Text Blocks

    Highlight a section and ask Copilot to improve the writing. This makes it easy to sharpen clarity and tone in targeted areas.

    Suggest Edits Across the Entire Article

    Use Copilot Chat to get suggestions for consistency and flow across an entire piece.

    Fill in Metadata and Unfinished Sections

    Copilot can auto-complete metadata fields or unfinished drafts, reducing the chance of missing key details.

    Surface Relevant Links

    While you’re writing, Copilot may suggest links to related articles in your repository—helping you connect content for the reader.

    Spot Duplicates and Gaps (emerging use)

    With tailored prompts, you can ask Copilot to scan for overlap between articles or flag areas where documentation is thin. This gives you content architecture insights, not just sentence-level edits.

    What do you need to set up GitHub Copilot?

    To set up GitHub Copilot, you will need:

    • Visual Studio Code
    • The GitHub Copilot extension for VS Code
    • A GitHub account with a Copilot plan (free or paid)

    Note: While GitHub Copilot offers a free tier, paid plans provide additional features and higher usage limits.

    Why GitHub Copilot Is Different from Copilot in Word or Other Gen-AI Chatbots

    At first glance, you might think these features look similar to what Copilot in Word or other generative AI chatbots can do. But GitHub Copilot offers unique advantages for documentation work:

    • Cross-Document Awareness
      Because it’s embedded in VS Code, Copilot has visibility into your entire local repo. For example, if you’re writing about pay-as-you-go billing in one article, it can pull phrasing or context from another relevant file almost instantly.
    • Enterprise Content Intelligence
      With prompts, you can ask Copilot to analyze your portfolio: identify duplicate topics, find potential links, and even suggest improvements to your information architecture. This is especially valuable for knowledge bases and enterprise-scale content libraries.
    • Code-Style Edit Reviews
      Visual Studio Code with GitHub Copilot can present suggested edits as diffs that you review and accept or reject, just as you would code changes. This differs from generic Gen-AI content editors, which either apply edits directly or merely suggest them.
    • Customizable Rules and Prompts
      You can set up an instruction.md file that defines rules for tone, heading style, or terminology. You can also create reusable prompt files and call them with / during chats. This ensures your writing is not just polished, but also consistent with your team’s standards.

    Together, these capabilities transform GitHub Copilot from a document-level writing assistant into a documentation co-architect.
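
    As a small illustration of the customizable-rules point, an instruction.md might contain entries like these (hypothetical contents; adapt them to your own style guide):

    # Writing rules for this repo
    - Voice: second person, present tense
    - Headings: sentence case, no terminal punctuation
    - Terminology: write "pay-as-you-go", never "PAYG"
    - Links: use relative paths to files in this repo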

    Limitations

    Like any AI tool, GitHub Copilot isn’t perfect. Keep these in mind:

    Always review suggestions
    Like other Gen-AI tools, GitHub Copilot can hallucinate. Always review its suggestions and validate its edits.

    Wrap-Up: Copilot as Your Content Partner

    GitHub Copilot inside Visual Studio Code isn’t just another AI writing assistant—it’s a tool that scales with your entire content ecosystem.

    • It refines text, polishes full articles, completes metadata, and suggests links.
    • It leverages cross-document awareness to reveal gaps, duplicates, and structural improvements.
    • It enforces custom rules and standards, ensuring consistency across hundreds of files.

    And here’s where the real advantage comes in: with careful crafting of prompts and instruction files, Copilot becomes more than a reactive assistant. You can guide it to apply your team’s style, enforce terminology, highlight structural issues, and even surface information architecture insights. In other words, the quality of what Copilot gives you is shaped by the quality of what you feed it.

    For content creators managing large sets of documentation, Copilot is more than a co-writer—it’s a content intelligence partner and co-architect. With thoughtful setup and prompt design, it helps you maintain quality, speed, and consistency—even at enterprise scale.

    👉 Try it in your next documentation sprint and see how it transforms the way you manage your whole body of content.

  • Reducing Hallucinations in Generative AI—What Content Creators Can Actually Do


    If you’ve used ChatGPT or Copilot and received an answer that sounded confident but was completely wrong, you’ve experienced a hallucination. These misleading outputs are a known challenge in generative AI—and while some causes are technical, others are surprisingly content-driven.

    As a content creator, you might think hallucinations are out of your hands. But here’s the truth: you have more influence than you realize.

    Let’s break it down.

    The Three Types of Hallucinations (And Where You Fit In)

    Generative AI hallucinations typically fall into three practical categories. (Note: Academic research classifies these as “intrinsic” hallucinations that contradict the source/prompt, or “extrinsic” hallucinations that add unverifiable information. Our framework translates these concepts into actionable categories for content creators.)

    1. Nonsensical Output
      The AI produces content that’s vague, incoherent, or just doesn’t make sense.
      Cause: Poorly written or ambiguous prompts.
      Your Role: Help users write better prompts by providing examples, templates, or guidance.
    2. Factual Contradiction
      The AI gives answers that are clear and confident—but wrong, outdated, or misleading.
      Cause: The AI can’t find accurate or relevant information to base its response on.
      Your Role: Create high-quality, domain-specific content that’s easy for AI to find and understand.
    3. Prompt Contradiction
      The AI’s response contradicts the user’s prompt, often due to internal safety filters or misalignment.
      Cause: Model-level restrictions or misinterpretation.
      Your Role: Limited—this is mostly a model design issue.
    (Flow chart: where each type of hallucination originates. Nonsensical output: the user prompt; factual contradiction: content retrieval; prompt contradiction: content generation.)

    Where Does AI Get Its Information?

    Modern AI systems increasingly use RAG (Retrieval-Augmented Generation) to ground their responses in real data. Instead of relying solely on training data, they actively search for and retrieve relevant content before generating answers. Learn more about how AI discovers and synthesizes content.

    Depending on the system, AI pulls data from:

    • Internal Knowledge Bases (e.g., enterprise documentation)
    • The Public Web (e.g., websites, blogs, forums)
    • Hybrid Systems (a mix of both)

    If your content is published online, it becomes part of the “source of truth” that AI systems rely on. That means your work directly affects whether AI gives accurate answers—or hallucinates.

    The Discovery–Accuracy Loop

    Here’s how it works:

    • If AI can’t find relevant content → it guesses based on general training data.
    • If AI finds partial content → it fills in the gaps with assumptions.
    • If AI finds complete and relevant content → it delivers accurate answers.

    So what does this mean for you?

    Your Real Impact as a Content Creator

    You can’t control how AI is trained, but you can control two critical things:

    1. The quality of content available for retrieval
    2. The likelihood that your content gets discovered and indexed

    And here’s the key insight:

    This is where content creators have the greatest impact—by ensuring that content is not only high-quality and domain-specific, but also structured into discoverable chunks that AI systems can retrieve and interpret accurately.

    Think of it like this: if your content is buried in long paragraphs, lacks clear headings, or isn’t tagged properly, AI might miss it—or misinterpret it. But if it’s chunked into clear, well-labeled sections, it’s far more likely to be picked up and used correctly. This shift from keywords to chunks is fundamental to how AI indexing differs from traditional search.

    Actionable Tips for AI-Optimized Content

    Structure for Chunking

    • Use clear, descriptive headings that summarize the content below them
    • Write headings as questions when possible (“How does X work?” instead of “X Overview”)
    • Keep paragraphs focused on single concepts (3–5 sentences max)
    • Create semantic sections that can stand alone as complete thoughts
    • Include Q&A pairs for common queries—this mirrors how users interact with AI
    • Use bullet points and numbered lists to break down complex information

    Improve Discoverability

    • Front-load key information in each section—AI often prioritizes early content
    • Define technical terms clearly within your content, not just in glossaries
    • Include contextual metadata through schema markup and structured data (see the example below)
    • Write descriptive alt text for images and diagrams
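
    As a sketch of what that schema markup can look like, here is a hypothetical JSON-LD snippet using schema.org vocabulary (field values are illustrative):

    {
      "@context": "https://schema.org",
      "@type": "TechArticle",
      "headline": "How does Cloud Run deployment work?",
      "datePublished": "2025-01-15",
      "author": { "@type": "Person", "name": "Maggie Hu" }
    }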

    Enhance Accuracy

    • Date your content clearly, especially for time-sensitive information
    • Link related concepts within your content to provide context
    • Be explicit about scope—what your content covers and what it doesn't

    Understand Intent Alignment

    AI systems are evolving to focus more on intent than just keyword matching. That means your content should address the “why” behind user queries—not just the “what.”

    Think about the deeper purpose behind a search. Are users trying to solve a problem? Make a decision? Learn a concept? Your content should reflect that.

    The Bottom Line

    As AI continues to evolve from retrieval to generative systems, your role as a content creator becomes more critical—not less. By structuring your content for AI discoverability and comprehension, you’re not just improving search rankings; you’re actively reducing the likelihood that AI will hallucinate when answering questions in your domain.

    So the next time you create or update content, ask yourself:

    Can an AI system easily find, understand, and accurately use this information?

    If the answer is yes, you’re part of the solution.

  • Beyond Blue Links: How AI Discovers and Synthesizes Content


    In my previous posts, we’ve explored how AI systems index content through semantic chunking rather than keyword extraction, and how they understand user intent through contextual analysis instead of pattern matching. Now comes the final piece: how AI systems actually retrieve and synthesize content to answer user questions.

    This is where the practical implications for content creators become apparent.

    The Fundamental Shift: From Finding Pages to Synthesizing Answers

    Here’s the key difference that changes everything: Traditional search matches keywords and returns ranked pages. AI-powered search matches semantic meaning and synthesizes answers from specific content chunks.

    This fundamental difference in matching and retrieval processes requires us to think about content creation in entirely new ways.

    Let’s see how this works using the same example documents from my previous posts:

    Document 1: Upgrading your computer's hard drive to a solid-state drive (SSD) can dramatically improve performance. SSDs provide faster boot times and quicker file access compared to traditional drives.

    Document 2: Slow computer performance is often caused by too many programs running simultaneously. Close unnecessary background programs and disable startup applications to fix speed issues.

    Document 3: Regular computer maintenance prevents performance problems. Clean temporary files, update software, and run system diagnostics to keep your computer running efficiently.

    User query: How to make my computer faster?

    How Traditional vs. AI Search Retrieve Content

    How Traditional Search Matches and Retrieves

    Traditional search follows a predictable process:

    Keyword Matching: The system uses TF-IDF scoring, Boolean logic, and exact phrase matching to find relevant documents. It’s looking for pages that contain the words “computer,” “faster,” “make,” and related terms.

    Authority-Based Ranking: PageRank algorithms, backlink analysis, and domain authority determine which pages rank highest. A page from a high-authority tech site with many backlinks will likely outrank a smaller site with identical content.

    Example with our 3 computer docs: For “How to make my computer faster?”, traditional search would likely rank them this way:

    • Doc 1 ranks highest: Contains the exact keyword “faster” in “faster boot times” plus “improve performance”
    • Doc 2 ranks second: Strong semantic matches with “slow computer” and “speed issues”
    • Doc 3 ranks lowest: Related terms like “efficiently” and “performance” but fewer direct keyword matches

    The user gets three separate page results. They need to click through, read each page, and synthesize their own comprehensive answer.

    How AI RAG Search Matches and Retrieves

    (Flow chart: how AI RAG search matches and retrieves content.)

    AI-powered RAG systems operate on entirely different principles:

    Vector Similarity Matching:

    Rather than matching keywords, the system uses cosine similarity to compare the semantic meaning of the query vector against content chunk vectors. The query “How to make my computer faster?” gets converted into a mathematical representation that captures its meaning, intent, and context.
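
    A toy sketch of that comparison in Python (the numbers are made up; real embeddings come from a model and have hundreds of dimensions):

    import math

    def cosine_similarity(a, b):
        """Similarity of direction between two vectors: 1.0 means identical meaning."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    query     = [0.9, 0.1, 0.3]  # "How to make my computer faster?"
    chunk_1a  = [0.8, 0.2, 0.4]  # the SSD-upgrade chunk: semantically close
    unrelated = [0.1, 0.9, 0.2]  # an off-topic chunk

    print(round(cosine_similarity(query, chunk_1a), 2))   # 0.98 -> retrieved
    print(round(cosine_similarity(query, unrelated), 2))  # 0.27 -> skipped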

    Semantic Understanding:

    The system retrieves chunks based on conceptual relationships, not just keyword presence. It understands that “SSD upgrade” relates to “making computers faster” even without shared keywords.

    Multi-Chunk Synthesis:

    Instead of returning separate pages, the system combines the most relevant chunks from multiple sources to create a comprehensive answer.

    Example with same query: Here’s how AI would handle “How to make my computer faster?” using the chunks from my first post:

    The query vector finds high semantic similarity with:

    • Chunk 1A: “Upgrading your computer's hard drive to a solid-state drive (SSD) can dramatically improve performance.”
    • Chunk 1B: “SSDs provide faster boot times and quicker file access compared to traditional drives.”
    • Chunk 2B: “Close unnecessary background programs and disable startup applications to fix speed issues.”
    • Chunk 3B: “Clean temporary files, update software, and run system diagnostics to keep your computer running efficiently.”

    The AI synthesizes these chunks into a comprehensive answer covering hardware upgrades, software optimization, and maintenance—drawing from all three documents simultaneously.

    Notice the difference: traditional search would return Doc 1 as the top result because it contains “faster,” even though it only covers hardware solutions. AI RAG retrieves the most semantically relevant chunks regardless of their source document, prioritizing actionable solutions over keyword frequency. It might even skip Chunk 2A (“Slow computer performance is often caused by...”) despite its strong keyword matches, because it describes problems rather than solutions.

    The user gets one complete answer that addresses multiple solution pathways, all sourced from the most relevant chunks regardless of which “page” they came from.

    Why This Changes Content Strategy

    (Conceptual framework for AI-ready content: chunk-level discoverability, comprehensive coverage, and synthesis-ready content.)

    This retrieval difference has profound implications for how we create content:

    Chunk-Level Discoverability

    Your content isn’t discovered at the page level—it’s discovered at the chunk level. Each section, paragraph, or logical unit needs to be valuable and self-contained. That perfectly written conclusion paragraph might never be found if the rest of your content doesn’t rank well, because AI systems retrieve specific chunks, not entire pages.

    Comprehensive Coverage

    AI systems find and combine related concepts from across your content library. This requires strategic coverage:

    Instead of trying to stuff keywords into a single page, create focused pieces that together provide comprehensive coverage. Rather than one “ultimate guide to computer speed,” create separate pieces on hardware upgrades, software optimization, maintenance, and diagnostics.

    Synthesis-Ready Content

    Write chunks that work well when combined with others—provide complete context by:

    • Avoiding excessive pronoun references
    • Writing self-contained paragraphs and sections

    The Bottom Line for Content Creators

    We’ve now traced the complete AI search journey:

    • How AI indexes content through semantic chunking (Post 1)
    • Understands user intent through contextual analysis (Post 2)
    • Retrieves and synthesizes content through vector similarity matching (this post)

    Each step reinforces the same content recommendations:

    • Chunk-sized content aligns with how AI indexes and retrieves information
    • Conversational language matches how AI understands user intent
    • Structured content supports AI’s semantic chunking and knowledge graph construction
    • Rich context supports semantic relationships that AI systems rely on, including:
      • Intent-driven metadata (audience, purpose, user scenarios)
      • Complete explanations (the why, when, and how behind recommendations)
      • Relationships to other concepts and solutions
      • Trade-offs, implications, and prerequisites
    • Comprehensive coverage works with how AI synthesizes multi-source answers

    AI technology is rapidly evolving. What is true today may become outdated tomorrow. AI may eventually become so advanced that we don’t have to think specifically about writing for AI systems—they’ll accommodate how humans naturally write and communicate.

    But no matter what era we’re in, the fundamentals of creating high-quality content remain constant. Those recommendations we’ve discussed are timeless principles of good communication: create accurate, true, and complete content; provide as much context as possible to communicate effectively; offer information in digestible, bite-sized pieces for easy consumption; write in conversational language for clarity and engagement.

    Understanding how current AI systems work simply reinforces why these have always been good practices. Whether optimizing for search engines, AI systems, or human readers, the goal remains the same: communicate your expertise as clearly and completely as possible.

    This completes my three-part series on AI-ready content creation. Understanding how AI indexes, interprets, and retrieves content gives us the foundation for creating content that thrives in an AI-powered world.