In my last post, I looked at Markdown, JSON, YAML, and XML. My take was that Markdown + YAML/JSON offers the best mix of being easy to read and easy for AI to understand.
But I wanted to sanity-check that. Are big tech companies actually using these formats? I did a quick review of what’s publicly visible on their main developer docs. It’s not a full audit—each company has many systems—but it gives a good sense of the patterns.
Here’s what I found:
| Company | External Docs | Internal Docs (public info only) | Main Format |
| --- | --- | --- | --- |
| Microsoft | Markdown + YAML | Markdown | Markdown + YAML |
| Google | Markdown-based static sites | g3doc Markdown + Google Docs | Markdown |
| Apple | DocC Markdown | DocC Markdown | Markdown + directives |
| AWS | Markdown or reStructuredText | Markdown + reST | Markdown |
| Meta | Markdown/MDX + Docusaurus | Markdown-based wikis | Markdown |
| IBM | DITA XML | DITA XML | XML/DITA |
| Adobe | DITA XML | DITA XML | XML/DITA |
| Cisco | DITA XML + Markdown | DITA XML | Mixed |
| Oracle | DITA XML + some Markdown | DITA XML | Mixed |
| SAP | Markdown | DITA XML | Mixed |
A few things to keep in mind:
This list reflects publicly observable documentation, contributor guides, and established industry practices, not internal implementation details.
Each company often supports multiple documentation tools depending on product line, age of system, regulatory requirements, and internal ownership.
The “main format” here reflects only the examples I could verify.
What seems to be happening?
From what I can see:
Many cloud-era developer sites tend to favor Markdown with YAML or JSON metadata. This works well for humans, and it also helps with AI indexing and RAG.
Structured XML systems like DITA remain widely used in industries where versioning, translation workflows, and governance are deeply embedded (e.g., enterprise hardware, long-standing enterprise software portfolios).
Even companies with long XML histories now publish some developer content in Markdown, especially for SDKs and portals.
None of this is good or bad. Each format exists for valid historical and operational reasons.
Disclaimers
Methodology: This comparison comes from public docs, contributor guides, and open-source materials. It’s not a full map of everything these companies use.
No judgment: Formats aren’t “modern” or “old.” They reflect the size of the company, the type of products, and how long their systems have been around.
Accuracy: If anything here seems outdated or incomplete, I’m happy to update it. Documentation systems change often.
Scope: This review looks only at developer-facing docs, since those are publicly accessible. Internal and proprietary systems are outside the scope.
Humans and AI retrieve and consume our content differently. In this post, I want to discuss how to strike the best balance between content for humans and content for AI.
In my previous posts, I recommended using structured content so that AI can chunk, understand, and retrieve it more reliably. When we talk about structured content, we usually mean one of these document formats: Markdown, JSON, XML, and YAML.
So, which document formats work best for both humans and AI? Let’s take a look at each of them:
Markdown (.md)
What it is: Markdown is a lightweight markup language designed to make writing for the web simple and readable. It uses plain text syntax (like # for headings or - for lists) that converts easily to HTML.
Example:
# Deploying to Cloud Run
Learn how to deploy your first app.
## Steps
1. Build your image
2. Push to Container Registry
3. Deploy with Cloud Run
Industry Example:
Microsoft Learn and Google Developers both use Markdown as their primary authoring format.
Many articles on learn.microsoft.com are authored as .md files stored in GitHub repos such as MicrosoftDocs/azure-docs.
AWS, GitHub, and OpenAI also use Markdown for documentation and developer guides.
Why humans like it:
Clean, minimal, and intuitive — almost like writing an email.
Easy to learn, edit, and version-control in Git.
Highly readable even before rendering.
Why AI likes it:
Semantically structured (headings, lists, tables) without layout noise.
Perfect for chunking and embedding for retrieval-augmented generation (RAG) or Copilot ingestion.
Mirrors the formats LLMs are trained on (GitHub, documentation, etc.).
Trade-offs:
Limited metadata support compared to JSON/YAML.
Not ideal for representing complex relational data.
✅ Best for: Readable documentation, tutorials, conceptual and how-to content consumed by both humans and AI.
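To make the chunking point concrete, here is a minimal sketch of splitting a Markdown page into one chunk per heading, using only the Python standard library. The function name `chunk_by_heading` and the output shape are my own assumptions for illustration, not any particular RAG tool's API, and the sketch does not handle `#` characters inside code blocks.

```python
import re

def chunk_by_heading(markdown_text):
    """Split Markdown into chunks, one per heading, keeping the heading trail."""
    chunks = []
    path = []   # current heading trail, e.g. ["Deploying to Cloud Run", "Steps"]
    body = []   # lines collected under the current heading

    def flush():
        text = "\n".join(body).strip()
        if path and text:
            chunks.append({"heading": " > ".join(path), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]   # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

doc = """# Deploying to Cloud Run
Learn how to deploy your first app.
## Steps
1. Build your image
2. Push to Container Registry
3. Deploy with Cloud Run
"""

for chunk in chunk_by_heading(doc):
    print(chunk["heading"], "->", chunk["text"].splitlines()[0])
```

Each chunk carries its heading trail, so a retriever can tell that the numbered steps belong to the Cloud Run page rather than treating them as an orphaned list.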
JSON (.json)
What it is: JavaScript Object Notation (JSON) is a structured data format using key–value pairs. It’s widely used for APIs, configurations, and machine-to-machine communication.
Example:
{
  "title": "Deploy to Cloud Run",
  "steps": [
    "Build your image",
    "Push to Container Registry",
    "Deploy with Cloud Run"
  ],
  "author": "Maggie Hu"
}
Why humans like it:
Familiar to developers and easy to read for small datasets.
Ideal for storing structured data or configuration.
Why AI likes it:
Clear, unambiguous key-value structure for precise information retrieval.
Ideal for embedding metadata and reasoning in structured formats.
Widely supported as an input/output format by LLM APIs.
Trade-offs:
Harder for non-technical readers to interpret.
Not suitable for long-form narrative text.
✅ Best for: Metadata, structured data exchange, and AI pipelines requiring precise context.
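As a small illustration of the "unambiguous key-value structure" point, the sketch below loads the example record with Python's standard `json` module and reads fields by key. The variable names are placeholders of my own.

```python
import json

raw = """
{
  "title": "Deploy to Cloud Run",
  "steps": [
    "Build your image",
    "Push to Container Registry",
    "Deploy with Cloud Run"
  ],
  "author": "Maggie Hu"
}
"""

record = json.loads(raw)

# Keys give direct access to each field; no guessing from layout.
print(record["title"])
for number, step in enumerate(record["steps"], start=1):
    print(f"{number}. {step}")

# Round-trip back to text, for example to store next to an embedding.
serialized = json.dumps(record, indent=2)
```

The same record survives a round trip through `json.dumps` and `json.loads` unchanged, which is part of why JSON is convenient for machine-to-machine exchange.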
YAML (.yml / .yaml)
What it is: YAML (“YAML Ain’t Markup Language”) is a human-friendly data serialization format often used for configuration files. It’s similar to JSON but uses indentation instead of braces.
Example:
title: Deploy to Cloud Run
description: Learn how to deploy your first containerized app.
steps:
- Build your image
- Push to Container Registry
- Deploy with Cloud Run
author: Maggie Hu
Industry Example:
Microsoft Learn, GitHub Pages (Jekyll), and Hugo/Docsy sites use YAML front matter at the top of Markdown files to store metadata like title, topic, author, and tags.
Kubernetes resources (pods, deployments, secrets) are typically defined in YAML manifests.
GitHub Actions uses YAML to describe CI/CD workflows (.github/workflows/main.yml).
Why humans like it:
Clean indentation mirrors logical hierarchy.
Excellent for connecting content with structured metadata.
Easy to read and edit directly in Markdown front matter.
Why AI likes it:
Provides machine-parsable structure with human-friendly syntax.
Used widely for prompt templates, model configuration, and structured metadata ingestion.
Trade-offs:
Sensitive to spacing and indentation errors.
Can be ambiguous when representing data types.
✅ Best for: Config files, front-matter metadata, and hybrid human–AI authoring systems.
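Here is a rough sketch of how front matter can be separated from the Markdown body before indexing. It is deliberately simplified: it only understands flat `key: value` lines and is not a YAML parser, so a real pipeline would more likely hand the metadata block to a YAML library such as PyYAML. The function name `split_front_matter` is hypothetical.

```python
def split_front_matter(text):
    """Return (metadata, body) for a Markdown file with simple front matter.

    Assumption: only flat `key: value` pairs; nested YAML is not supported.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines.index("---", 1)   # closing delimiter of the front matter
    except ValueError:
        return {}, text
    metadata = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body

page = """---
title: Deploy to Cloud Run
author: Maggie Hu
---
# Deploying to Cloud Run
Learn how to deploy your first app.
"""

metadata, body = split_front_matter(page)
print(metadata["title"])
print(body.splitlines()[0])
```

Keeping the metadata as a separate dictionary lets an indexer filter by fields such as title or author while embedding only the body text.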
XML (.xml)
What it is: eXtensible Markup Language (XML) is a tag-based format for representing structured data hierarchies. It’s verbose but powerful for enforcing schema-based content consistency.
Example:
<task id="deploy-cloud-run">
  <title>Deploy to Cloud Run</title>
  <steps>
    <step>Build your image</step>
    <step>Push to Container Registry</step>
    <step>Deploy with Cloud Run</step>
  </steps>
</task>
Industry Example:
IBM, the creator of DITA, and companies like Cisco, Oracle, and Adobe use XML-based DITA systems for large-scale technical documentation.
Financial, aerospace, and medical industries rely on XML for regulated documentation and content validation (e.g., FAA, FDA compliance).
Microsoft’s legacy MSDN and Office help systems were XML-based before their Markdown migration.
Why humans (used to) love it:
Strict structure ensures consistency and reusability.
Excellent for translation and compliance workflows.
Why AI doesn’t love it as much:
Verbose, token-heavy, and less semantically clean for LLMs.
Requires preprocessing to strip tags for content embedding.
Complex to maintain for open collaboration.
Trade-offs:
Ideal for governance and reuse, but harder to read.
Better suited for enterprise content management systems than AI retrieval.
✅ Best for: Regulated or legacy technical documentation requiring schema validation.
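The preprocessing step mentioned above might look something like this sketch, which uses Python's standard `xml.etree.ElementTree` to pull the text out of the example task. Character counts are used here only as a rough proxy for token counts, and the variable names are my own.

```python
import xml.etree.ElementTree as ET

xml_source = """
<task id="deploy-cloud-run">
  <title>Deploy to Cloud Run</title>
  <steps>
    <step>Build your image</step>
    <step>Push to Container Registry</step>
    <step>Deploy with Cloud Run</step>
  </steps>
</task>
""".strip()

root = ET.fromstring(xml_source)

# Keep the content, drop the tags.
title = root.findtext("title")
steps = [step.text for step in root.iter("step")]

plain_text = title + "\n" + "\n".join(
    f"{number}. {step}" for number, step in enumerate(steps, start=1)
)
print(plain_text)

# Rough size comparison; characters stand in for tokens here.
print(len(xml_source), "characters as XML")
print(len(plain_text), "characters as plain text")
```

The extracted text is noticeably shorter than the tagged source, which is the overhead the "token-heavy" point refers to. The structure is not lost, though: the tags are what made it easy to find the title and steps in the first place.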
Summary: Human vs. AI Alignment
Takeaway
The best format for both humans and AI is Markdown enhanced with YAML or JSON metadata. Markdown provides readability and natural structure for human writers, while YAML and JSON add the precision and hierarchy that AI systems rely on for retrieval, linking, and reasoning.