Category: Content development

What are big tech companies using for their documentation formats?

In my last post, I looked at Markdown, JSON, YAML, and XML. My take was that Markdown + YAML/JSON offers the best mix of being easy to read and easy for AI to understand.

But I wanted to sanity-check that.
Are big tech companies actually using these formats?
I did a quick review of what’s publicly visible on their main developer docs. It’s not a full audit—each company has many systems—but it gives a good sense of the patterns.

Here’s what I found:

Company	External Docs	Internal Docs (public info only)	Main Format
Microsoft	Markdown + YAML	Markdown	Markdown + YAML
Google	Markdown-based static sites	g3doc Markdown + Google Docs	Markdown
Apple	DocC Markdown	DocC Markdown	Markdown + directives
AWS	Markdown or reStructuredText	Markdown + reST	Markdown
Meta	Markdown/MDX + Docusaurus	Markdown-based wikis	Markdown
IBM	DITA XML	DITA XML	XML/DITA
Adobe	DITA XML	DITA XML	XML/DITA
Cisco	DITA XML + Markdown	DITA XML	Mixed
Oracle	DITA XML + some Markdown	DITA XML	Mixed
SAP	Markdown	DITA XML	Mixed

A few things to keep in mind:

This list reflects publicly observable documentation, contributor guides, and established industry practices, not internal implementation details.
Most companies support multiple documentation tools depending on product line, age of system, regulatory requirements, and internal ownership.
The “main format” here reflects only the examples I could verify.

What seems to be happening?

From what I can see:

Many cloud-era developer sites tend to favor Markdown with YAML or JSON metadata. This works well for humans, and it also helps with AI indexing and RAG.
Structured XML systems like DITA remain widely used in industries where versioning, translation workflows, and governance are deeply embedded (e.g., enterprise hardware, long-standing enterprise software portfolios).
Even companies with long XML histories now publish some developer content in Markdown, especially for SDKs and portals.

None of this is good or bad. Each format exists for valid historical and operational reasons.

Disclaimers

Methodology
This comparison comes from public docs, contributor guides, and open-source materials. It’s not a full map of everything these companies use.

No judgment
Formats aren’t “modern” or “old.” They reflect the size of the company, the type of products, and how long their systems have been around.

Accuracy
If anything here seems outdated or incomplete, I’m happy to update it. Documentation systems change often.

Scope
This review looks only at developer-facing docs, since those are publicly accessible. Internal and proprietary systems are outside the scope.

November 25, 2025

Markdown, JSON, YAML, and XML: what is the best content format for both humans and AI?

Humans and AI retrieve and consume our content differently. In this post, I want to discuss how to strike the best balance between content for humans and content for AI.

In earlier posts, I recommended structured content because it chunks better, which helps AI understand and retrieve it. When we talk about structured content, we usually mean these document formats: Markdown, JSON, XML, and YAML.

So, which document formats are the best for both humans and AI? Let’s take a look at each of them:

Markdown (.md)

What it is:
Markdown is a lightweight markup language designed to make writing for the web simple and readable. It uses plain text syntax (like # for headings or - for lists) that converts easily to HTML.

Example:

# Deploying to Cloud Run
Learn how to deploy your first app.

## Steps
1. Build your image
2. Push to Container Registry
3. Deploy with Cloud Run

Industry Example:

Microsoft Learn and Google Developers both use Markdown as their primary authoring format.
Many articles on learn.microsoft.com are .md files stored in GitHub repos like MicrosoftDocs/azure-docs.
AWS, GitHub, and OpenAI also use Markdown for documentation and developer guides.

Why humans like it:

Clean, minimal, and intuitive — almost like writing an email.
Easy to learn, edit, and version-control in Git.
Highly readable even before rendering.

Why AI likes it:

Semantically structured (headings, lists, tables) without layout noise.
Perfect for chunking and embedding for retrieval-augmented generation (RAG) or Copilot ingestion.
Mirrors the formats LLMs are trained on (GitHub, documentation, etc.).
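
To make the chunking point concrete, here is a minimal Python sketch that splits a Markdown file at its headings and keeps the heading trail with each chunk. The function name chunk_by_headings and the chunk shape (a path plus the text under it) are my own illustrative choices, not part of any RAG library. The sketch also ignores edge cases such as # lines inside code blocks.

```python
import re

SAMPLE_MD = """# Deploying to Cloud Run
Learn how to deploy your first app.

## Steps
1. Build your image
2. Push to Container Registry
3. Deploy with Cloud Run
"""

def chunk_by_headings(markdown_text):
    """Split Markdown into one chunk per heading, keeping the heading trail."""
    chunks = []
    path = []   # stack of (level, title) for the headings currently in scope
    body = []   # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if path and text:
            trail = " > ".join(title for _, title in path)
            chunks.append({"path": trail, "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes any heading at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks

chunks = chunk_by_headings(SAMPLE_MD)
for chunk in chunks:
    print(chunk["path"])
    print(chunk["text"])
    print()
```

Each chunk carries its heading trail (for example "Deploying to Cloud Run > Steps"), so a retriever can show where a passage came from even when the passage itself never mentions the topic.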

Trade-offs:

Limited metadata support compared to JSON/YAML.
Not ideal for representing complex relational data.

✅ Best for:
Readable documentation, tutorials, conceptual and how-to content consumed by both humans and AI.

JSON (.json)

What it is:
JavaScript Object Notation (JSON) is a structured data format using key–value pairs. It’s widely used for APIs, configurations, and machine-to-machine communication.

Example:

{
  "title": "Deploy to Cloud Run",
  "steps": [
    "Build your image",
    "Push to Container Registry",
    "Deploy with Cloud Run"
  ],
  "author": "Maggie Hu"
}

Industry Example:

Use Case	Example	Purpose
Microsoft Learn Catalog	JSON for doc metadata	AI indexing and discovery
Google Vertex AI	JSON for prompt documentation	LLM instruction structuring
OpenAI Function Docs	JSON as documentation schema	Model understanding
Schema.org JSON-LD	JSON for semantic content	AI/web discoverability

Why humans like it:

Familiar to developers and easy to read for small datasets.
Ideal for storing structured data or configuration.

Why AI likes it:

Clear, unambiguous key-value structure for precise information retrieval.
Ideal for embedding metadata and reasoning in structured formats.
Supported as a structured input/output format by many LLM APIs.
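
As a rough illustration of that key-value precision, the sketch below loads the JSON example from above with Python's standard json module, checks that the expected keys are present, and renders the record back into Markdown for human readers. The helper names load_doc and to_markdown, and the choice of "title" and "steps" as required keys, are assumptions made for this example only.

```python
import json

RAW = """{
  "title": "Deploy to Cloud Run",
  "steps": [
    "Build your image",
    "Push to Container Registry",
    "Deploy with Cloud Run"
  ],
  "author": "Maggie Hu"
}"""

REQUIRED_KEYS = ("title", "steps")  # assumed schema for this example

def load_doc(raw):
    """Parse the JSON text and fail early if an expected key is missing."""
    doc = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in doc]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return doc

def to_markdown(doc):
    """Render the structured record as a small Markdown page."""
    lines = [f"# {doc['title']}", "", "## Steps"]
    lines += [f"{i}. {step}" for i, step in enumerate(doc["steps"], start=1)]
    return "\n".join(lines)

doc = load_doc(RAW)
print(doc["steps"][1])   # look up one step directly by key and index
print(to_markdown(doc))
```

The same record serves both audiences: a program can address a single field directly, and a person can read the rendered Markdown.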

Trade-offs:

Harder for non-technical readers to interpret.
Not suitable for long-form narrative text.

✅ Best for:
Metadata, structured data exchange, and AI pipelines requiring precise context.

YAML (.yml / .yaml)

What it is:
YAML (“YAML Ain’t Markup Language”) is a human-friendly data serialization format often used for configuration files. It’s similar to JSON but uses indentation instead of braces.

Example:

title: Deploy to Cloud Run
description: Learn how to deploy your first containerized app.
steps:
  - Build your image
  - Push to Container Registry
  - Deploy with Cloud Run
author: Maggie Hu

Industry Example:

Microsoft Learn, GitHub Pages (Jekyll), and Hugo/Docsy sites use YAML front matter at the top of Markdown files to store metadata like title, topic, author, and tags.
Kubernetes resource manifests (pods, deployments, secrets) are typically written in YAML, although JSON is also accepted.
GitHub Actions uses YAML to describe CI/CD workflows (.github/workflows/main.yml).
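
Here is a minimal sketch of how a pipeline might separate YAML front matter from the Markdown body. To stay within the standard library it only understands flat "key: value" lines, which is a deliberate simplification. A real pipeline would hand the front matter block to a full YAML parser. The function name split_front_matter is hypothetical.

```python
SAMPLE = """---
title: Deploy to Cloud Run
author: Maggie Hu
---
# Deploy to Cloud Run
Learn how to deploy your first app.
"""

def split_front_matter(text):
    """Return (metadata, body) for Markdown text with optional front matter.

    Simplification: only flat `key: value` lines are read. Nested maps,
    lists, and quoting rules need a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing delimiter; without one, treat the file as plain Markdown.
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text
    metadata = {}
    for line in lines[1:end]:
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return metadata, body

metadata, body = split_front_matter(SAMPLE)
print(metadata)
print(body)
```

Once the two parts are separated, the metadata can be stored as filter fields in a search index while the body goes on to chunking and embedding.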

Why humans like it:

Clean indentation mirrors logical hierarchy.
Excellent for connecting content with structured metadata.
Easy to read and edit directly in Markdown front matter.

Why AI likes it:

Provides machine-parsable structure with human-friendly syntax.
Used widely for prompt templates, model configuration, and structured metadata ingestion.

Trade-offs:

Sensitive to spacing and indentation errors.
Can be ambiguous when representing data types; for example, YAML 1.1 parsers read an unquoted no or on as a boolean rather than a string.

✅ Best for:
Config files, front-matter metadata, and hybrid human–AI authoring systems.

XML (.xml)

What it is:
eXtensible Markup Language (XML) is a tag-based format for representing structured data hierarchies. It’s verbose but powerful for enforcing schema-based content consistency.

Example:

<task id="deploy-cloud-run">
  <title>Deploy to Cloud Run</title>
  <steps>
    <step>Build your image</step>
    <step>Push to Container Registry</step>
    <step>Deploy with Cloud Run</step>
  </steps>
</task>

Industry Example:

IBM, the creator of DITA, and companies like Cisco, Oracle, and Adobe use XML-based DITA systems for large-scale technical documentation.
Financial, aerospace, and medical industries rely on XML for regulated documentation and content validation (e.g., FAA, FDA compliance).
Microsoft’s legacy MSDN and Office help systems were XML-based before their Markdown migration.

Why humans (used to) love it:

Strict structure ensures consistency and reusability.
Excellent for translation and compliance workflows.

Why AI doesn’t love it as much:

Verbose, token-heavy, and less semantically clean for LLMs.
Requires preprocessing to strip tags for content embedding.
Complex to maintain for open collaboration.
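
The preprocessing step can be sketched with Python's standard xml.etree.ElementTree module. This example uses the simplified task markup shown above, not a real DITA schema, and the function name xml_task_to_text is my own. It pulls out the title and steps as plain text, then compares the character counts to give a rough sense of the tag overhead.

```python
import xml.etree.ElementTree as ET

TASK_XML = """<task id="deploy-cloud-run">
  <title>Deploy to Cloud Run</title>
  <steps>
    <step>Build your image</step>
    <step>Push to Container Registry</step>
    <step>Deploy with Cloud Run</step>
  </steps>
</task>"""

def xml_task_to_text(xml_source):
    """Strip the tags from the simplified task markup and return plain text."""
    root = ET.fromstring(xml_source)
    title = root.findtext("title", default="").strip()
    steps = [step.text.strip() for step in root.iter("step") if step.text]
    lines = [title] + [f"{i}. {s}" for i, s in enumerate(steps, start=1)]
    return "\n".join(lines)

plain = xml_task_to_text(TASK_XML)
print(plain)
# Character counts are only a rough proxy for token counts.
print(len(TASK_XML), "characters of XML vs", len(plain), "characters of text")
```

The extracted text is what usually gets embedded. The tags still matter upstream, because they tell the extractor which pieces are titles and which are steps.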

Trade-offs:

Ideal for governance and reuse, but difficult for readability.
Better suited for enterprise content management systems than AI retrieval.

✅ Best for:
Regulated or legacy technical documentation requiring schema validation.

Summary: Human vs. AI Alignment

Takeaway

The best format for both humans and AI is Markdown enhanced with YAML or JSON metadata.
Markdown provides readability and natural structure for human writers, while YAML and JSON add the precision and hierarchy that AI systems rely on for retrieval, linking, and reasoning.

October 22, 2025