TGIS/AI

SDMX and AI Readiness

How the SDMX standard intersects with agentic AI, and what it means for transport data

SDMX, AI Agents, and the Future of Transport Data Standards

Last reviewed: 2026-02-24
Confidence level: High for SDMX technical content and AI-readiness developments; medium for transport-specific implications


What SDMX actually is

SDMX (Statistical Data and Metadata eXchange) is an ISO standard (ISO 17369) that defines how statistical data should be structured, described, and exchanged between organisations. Eight international organisations maintain it: BIS, ECB, Eurostat, ILO, IMF, OECD, UNSD, and the World Bank.

The standard has three layers, and understanding how they fit together matters for everything that follows.

The data file is deliberately thin. A CSV, XML, or JSON file containing observations — something like KEN, A, 2024, 5.3. No labels. No explanations. Compact by design.

The Data Structure Definition (DSD) sits separately and says what those positions mean: the first value is the country dimension, the second is frequency, the third is time period, the fourth is the observation value. The DSD defines the shape of a dataset — how many dimensions, what order, what role each plays. But it doesn't tell you what "KEN" or "A" actually mean. It points outward to reference files.

Codelists and concept schemes are those reference files. A codelist called CL_AREA maps KEN to Kenya, GBR to United Kingdom, and so on. CL_FREQ maps A to Annual, M to Monthly. Concept schemes sit above codelists and define what "country" or "frequency" mean statistically — the abstract definition, independent of any specific dataset. These reference files are globally shared. The BIS, Eurostat, and the IMF can all point to the same CL_AREA codelist rather than each defining their own.

The full stack: data file → DSD → codelists + concept schemes. The data file is the numbers. The DSD is the schema. The codelists give codes their meaning. The concept schemes give dimensions their statistical meaning. Structure and data travel together but are maintained separately, which is why repeated exchanges between the same organisations are efficient — agree on the DSD once, publish the reference files once, and every subsequent transmission is compact and unambiguous.
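The layering described above can be sketched in a few lines of Python. The dimension names, codelist contents, and observation below are illustrative, not taken from real SDMX artefacts:

```python
# Minimal sketch of the SDMX stack: a compact observation row, a
# DSD-like dimension order, and shared codelists that give the codes
# their meaning. All names here are illustrative.

# The "data file": positional values only, no labels.
observation = ["KEN", "A", "2024", "5.3"]

# The "DSD": which position is which dimension, and which codelist
# (if any) constrains it.
dsd = [
    ("REF_AREA", "CL_AREA"),
    ("FREQ", "CL_FREQ"),
    ("TIME_PERIOD", None),   # time periods are not code-listed
    ("OBS_VALUE", None),     # the measure itself
]

# The shared "codelists": globally maintained code -> label maps.
codelists = {
    "CL_AREA": {"KEN": "Kenya", "GBR": "United Kingdom"},
    "CL_FREQ": {"A": "Annual", "M": "Monthly"},
}

def resolve(obs, dsd, codelists):
    """Expand a bare observation into labelled dimension/value pairs."""
    out = {}
    for value, (dimension, codelist_id) in zip(obs, dsd):
        if codelist_id is not None:
            out[dimension] = codelists[codelist_id][value]
        else:
            out[dimension] = value
    return out

print(resolve(observation, dsd, codelists))
# {'REF_AREA': 'Kenya', 'FREQ': 'Annual', 'TIME_PERIOD': '2024', 'OBS_VALUE': '5.3'}
```

The point of the separation is visible here: the observation row carries no meaning on its own, and every consumer that holds the same DSD and codelists reconstructs the same interpretation.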

SDMX also specifies a REST API for querying data. URL parameters specify dimensions, time ranges, and providers. The result is self-describing: structural metadata accompanies the data, so any consumer can interpret it without external documentation.
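The query pattern can be sketched as follows. The host and dataflow ID are placeholders; only the URL shape (dimension values joined into a dotted key, time bounds as query parameters) follows the SDMX REST convention:

```python
# Sketch of the SDMX REST data-query pattern: dimension values are
# joined with "." into a key, and time bounds go in the query string.
# The base URL and dataflow ID below are placeholders, not a real
# endpoint.
from urllib.parse import urlencode

def sdmx_data_url(base, flow, key_parts, start=None, end=None):
    """Build a URL following the SDMX REST /data pattern."""
    key = ".".join(key_parts)  # an empty part acts as a wildcard
    params = {}
    if start:
        params["startPeriod"] = start
    if end:
        params["endPeriod"] = end
    query = ("?" + urlencode(params)) if params else ""
    return f"{base}/data/{flow}/{key}{query}"

url = sdmx_data_url(
    "https://example.org/sdmx/rest",   # placeholder host
    "DF_TRANSPORT",                    # hypothetical dataflow
    ["KEN", "A"],                      # REF_AREA=KEN, FREQ=A
    start="2020", end="2024",
)
print(url)
# https://example.org/sdmx/rest/data/DF_TRANSPORT/KEN.A?startPeriod=2020&endPeriod=2024
```

The predictability of this pattern is what makes the API attractive for agents: a query can be assembled mechanically from a DSD's dimension list.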

Version history:

  • v1.0 (2004), v2.0 (2005), v2.1 (2011, ISO-approved 2013) — v2.1 is still the most widely deployed
  • v3.0 (September 2021) — added microdata support, geospatial capabilities, multiple measures per dataset, VTL integration
  • v3.1 (May 2025) — added "Horizontally Complex DSDs" for datasets with hundreds of dimensions

The September 2025 inflection point

At the 10th SDMX Global Conference in Rome (September–October 2025, 300+ participants from 80 countries), all eight sponsor organisations issued a joint statement on AI-readiness. The commitments:

  • The sponsors will serve as "pathfinders" for integrating official statistics into AI systems
  • They are investigating a "Global Trusted Data Commons" — shared infrastructure where official statistics are structured, trusted, and accessible to AI
  • They are explicitly exploring Model Context Protocol (MCP) — the same protocol used by Anthropic's Claude and other AI systems to connect to external data
  • The goal: ensure official statistics remain trusted, accessible, and AI-readable so AI systems produce evidence-based outputs rather than hallucinations

The commitments are backed by action. The SDMX+AI initiative has been a formal workstream since March 2024, when the OECD and BIS co-organised a workshop with 70 participants from statistical organisations worldwide. The workshop concluded that natural language access to statistics is "within reach" using RAG combined with SDMX semantics — because SDMX data is already self-describing, it maps naturally to what LLMs need for grounded responses.

Working implementations

StatGPT 2.0 — The IMF built this on SDMX-structured datasets. It queries data from eight international organisations through natural language.

MAIA (Metadata AI Assistant) — A Python tool built on GPT, LangChain, and pysdmx that automates statistical metadata management: syntax checking, consistency validation, formatting, and metadata generation. It's fully SDMX-compliant.

Google Data Commons MCP Server (September 2025) — Not SDMX-specific, but demonstrates the pattern the SDMX sponsors are pursuing: AI agents discovering variables, resolving entities, fetching time series, and generating reports from public statistics via MCP.

SEASE project — Combines search engines with LLMs to resolve natural language queries into structured SDMX requests.


Why SDMX is structurally well-suited for AI

SDMX data carries its own semantics. It is:

  • Semantically annotated — every dimension and value has a machine-readable definition
  • Structurally consistent — DSDs enforce uniform structure across datasets from different countries and organisations
  • Validated — codelists constrain values to known, defined sets
  • Self-describing — metadata travels with the data
  • API-accessible — standard REST endpoints with predictable URL patterns

Research shows LLM hallucinations drop by over 50% when models are grounded in structured semantic context rather than raw schemas. SDMX provides exactly that context. An agent querying SDMX data knows what every dimension means, what values are valid, and how observations relate to each other — without any additional metadata engineering.
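One way to picture that grounding: structural metadata renders directly into the textual context an LLM needs. A minimal sketch, with purely illustrative metadata literals:

```python
# Sketch: turning SDMX-style structural metadata into prompt context
# for an LLM. The dataflow, dimensions, and measure here are
# illustrative, not real SDMX artefacts.

structure = {
    "dataflow": "Road traffic casualties (illustrative)",
    "dimensions": {
        "REF_AREA": "Reporting country, coded per the shared area codelist",
        "FREQ": "Frequency of observation (A = Annual, M = Monthly)",
        "TIME_PERIOD": "Reference period of the observation",
    },
    "measure": "Casualties per 100,000 population",
}

def grounding_context(structure):
    """Render structural metadata as a context block for an LLM prompt."""
    lines = [
        f"Dataset: {structure['dataflow']}",
        f"Measure: {structure['measure']}",
        "Dimensions:",
    ]
    for dim, meaning in structure["dimensions"].items():
        lines.append(f"  - {dim}: {meaning}")
    return "\n".join(lines)

print(grounding_context(structure))
```

Because every SDMX dataset already carries this metadata, the context block can be generated rather than hand-written, which is the "no additional metadata engineering" claim in concrete form.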


Transport's SDMX gap

Transport statistics effectively does not exist as a formal global SDMX domain.

The mature SDMX domains are a narrow club: National Accounts, Consumer Price Index, Balance of Payments, Foreign Direct Investment, SEEA (environmental-economic accounting). New global DSDs were released in January 2026 for several of these. They exist because there's a legal or institutional mandate forcing convergence, and because the sponsor organisations directly need and use them.

A 2023 IFC survey found that even within central banks — the most SDMX-native institutions — the standard is less adopted for payment system and supervisory statistics. If those adjacent financial domains are still gaps, transport is nowhere near the agenda. The SDMX 2021–2025 Roadmap lists energy, health, and business statistics as aspirational targets where work hasn't started — transport isn't even on that list.

The reason isn't technical. SDMX could represent passenger volumes, freight tonnage by mode, road casualty rates, aviation emissions — the information model is flexible enough. The bottleneck is the process for reaching international agreement: which dimensions, which codelists, how to handle modal breakdowns consistently across countries with very different transport systems. That requires a sustained international working group with mandate, funding, and the right subject-matter experts. For transport, which sits across ITF (OECD), ICAO, IMO, and UNECE, nobody has the clear mandate to convene the group.

TDC is building toward this. TDC's data standards require that harmonised data be in SDMX format (SDMX-CSV, SDMX-ML, or SDMX-JSON). The transport_data Python package validates SDMX compliance and converts source data into SDMX structures. But only 35 of TDC's 460 datasets are fully harmonised, and no formal global transport DSD exists yet.


What AI agents change about this picture

The acceleration opportunity

Building a new SDMX domain package currently takes years. An agent can produce a well-structured draft DSD from real national data files and methodology documents in hours. Feed it representative data from a dozen countries, the relevant international methodology docs, and the global concept scheme. It maps field names semantically across inputs, identifies where national definitions diverge, flags definitional conflicts for human resolution, and outputs a structured draft.

This doesn't remove human judgment. It removes the months of preparatory work before human judgment can engage. The expert committee no longer spends its first year building from scratch — it spends its first meeting reviewing a draft and marking where the agent got it wrong.

The same applies to onboarding new data providers. A national statistics office with limited technical capacity no longer needs SDMX specialists on staff. An agent can handle the mapping between their internal data and an established DSD.
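The mapping step an agent would draft can be sketched deterministically. Everything here is hypothetical: the synonym table stands in for the semantic matching an agent would do, and unresolved fields are flagged for human review rather than guessed:

```python
# Sketch of agent-assisted provider onboarding: align a national
# provider's internal field names with an established DSD's
# dimensions, flagging anything unresolved for human review.
# The DSD dimensions and synonym table are hypothetical.

DSD_DIMENSIONS = {"REF_AREA", "FREQ", "TIME_PERIOD", "OBS_VALUE"}

SYNONYMS = {
    "country": "REF_AREA",
    "area": "REF_AREA",
    "frequency": "FREQ",
    "year": "TIME_PERIOD",
    "period": "TIME_PERIOD",
    "value": "OBS_VALUE",
    "passengers": "OBS_VALUE",
}

def draft_mapping(provider_fields):
    """Propose provider-field -> DSD-dimension mappings; flag the rest."""
    mapped, unresolved = {}, []
    for field in provider_fields:
        norm = field.strip().lower().replace("_", " ").split()[0]
        target = SYNONYMS.get(norm)
        if target in DSD_DIMENSIONS:
            mapped[field] = target
        else:
            unresolved.append(field)   # needs human resolution
    return mapped, unresolved

mapped, unresolved = draft_mapping(
    ["Country", "Year", "Passengers_total", "mode_of_transport"]
)
```

A real agent would match semantically rather than by lookup table, but the output contract is the same: a proposed mapping plus an explicit list of conflicts for the expert committee.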

The "probabilistic draft, deterministic lock" architecture

This is the design pattern that makes agentic AI safe for official statistics.

Stage 1 — Probabilistic (agent domain). Draft DSDs, propose codelists, map cross-domain concepts, identify data quality anomalies, translate metadata, onboard new reporters. Agents do this work. It's creative, contextual, and variable. Human review gates all outputs before promotion.

Stage 2 — Deterministic lock (SDMX/VTL domain). Once a human approves an agent's output, it gets encoded as a formal VTL expression, a published DSD revision, or a certified codelist update. From this point, the pipeline is deterministic and auditable. The agent cannot touch it.

Stage 3 — Agent-assisted audit. Verification agents run against the Stage 2 outputs — not to change them, but to validate that the deterministic transformation produced what the specification said it should.

Agents generate. Humans approve. VTL executes. Agents verify. The loop is tight enough to be fast. The separation is strict enough to be trustworthy.
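The gate between the stages can be sketched as a small state machine. The artefact shape and state names are illustrative; the point is that human approval is the only path from the probabilistic stage to the deterministic one:

```python
# Sketch of the "probabilistic draft, deterministic lock" gate: agent
# outputs stay in DRAFT until a named human approves them, and only
# approved artefacts can be promoted to the locked, deterministic
# stage. The artefact shape is illustrative.
from dataclasses import dataclass, field

@dataclass
class Artefact:
    content: dict                 # e.g. a draft codelist update
    status: str = "DRAFT"         # DRAFT -> APPROVED -> LOCKED
    audit: list = field(default_factory=list)

def approve(artefact, reviewer):
    """Stage 1 -> 2 boundary: human sign-off is the only way forward."""
    artefact.status = "APPROVED"
    artefact.audit.append(f"approved by {reviewer}")

def lock(artefact):
    """Stage 2: promote to the deterministic pipeline. Agents can no
    longer modify it, only verify it (Stage 3)."""
    if artefact.status != "APPROVED":
        raise PermissionError("cannot lock an unapproved draft")
    artefact.status = "LOCKED"
    artefact.audit.append("locked for deterministic execution")

draft = Artefact(content={"CL_MODE": ["ROAD", "RAIL", "AIR", "SEA"]})
approve(draft, "domain-expert")
lock(draft)
```

The audit trail accumulates at each transition, which is what makes the locked pipeline reviewable after the fact: every published artefact records who approved it and when it was frozen.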

Production experience backs this up. Teams building AI for regulated clinical environments found that agents consistently underperformed deterministic pipelines for execution tasks — the agent had to "re-derive its search strategy every single time," introducing stochasticity and drift. The lesson: use agents where the problem is open and creative; use deterministic pipelines where the rules are known and the stakes are high.


What SDMX doesn't cover

Geospatial data

SDMX 3.0 added geospatial support, but it's designed for attaching spatial references to statistical observations — geocoding stats — not for carrying native geospatial data.

The three mechanisms:

  • Indirect reference — a URL pointing to an external shapefile or GeoJSON. Just a typed link.
  • GeospatialInformation type — a string-encoded geometry (points, lines, polygons) embedded in an SDMX observation. Uses a bespoke SDMX-specific syntax, not GeoJSON or WKT.
  • Geographic codelists — each code includes boundary geometry, so a codelist for "Regions of Kenya" carries the polygon of each region. Plus GeoGridCodelists for gridded statistics.

These are useful for saying "this GDP figure refers to this administrative region with this boundary." They are not adequate for:

  • Road networks — topological graph structures with nodes, edges, per-segment attributes, turn restrictions. Need OSM PBF, GeoPackage, or OpenDRIVE.
  • Infrastructure features — bridges, tunnels, ports with complex geometry. Need GIS feature classes or IFC/BIM models.
  • Satellite-derived layers — land use, nighttime lights, vegetation indices. These are raster data needing GeoTIFF, COG, or Zarr.
  • Real-time positions — vessel AIS tracks, aircraft ADS-B, GTFS-RT vehicle locations. Need streaming protocols.
  • Spatial queries — "find all road segments within 50km of Mombasa that have flood risk above 0.7." SDMX has no spatial query capability.

As of early 2026, virtually no SDMX providers publish data using the 3.0 geospatial features. Most are still on SDMX 2.1. Esri's documented workflow for SDMX integration is "download as CSV, join to boundary layer externally" — no native geospatial consumption.

The UN Expert Group on Integration of Statistical and Geospatial Information (EG-ISGI) published a second edition of the Global Statistical Geospatial Framework in August 2025. Its Principle 4 explicitly names SDMX and OGC as the two standards families that need to be bridged. But bridged means federated — keep statistical data in SDMX, keep spatial data in OGC services, link them through shared identifiers. Not SDMX absorbing geospatial data.
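A minimal sketch of that federated pattern: observations stay on the statistical side, boundaries stay on the GeoJSON side, and the join happens only through a shared area code. All codes, names, and coordinates below are made up:

```python
# Sketch of federated statistical/geospatial linking: SDMX-style
# observations and GeoJSON-style features live in separate systems
# and are joined only through a shared area identifier. All codes,
# names, and coordinates are illustrative.

observations = [
    {"REF_AREA": "KE01", "OBS_VALUE": 5.3},
    {"REF_AREA": "KE02", "OBS_VALUE": 4.1},
]

boundaries = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"area_code": "KE01", "name": "Region one"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[36.0, -1.0], [37.0, -1.0],
                                       [37.0, 0.0], [36.0, -1.0]]]}},
    ],
}

def link(observations, collection, code_prop="area_code"):
    """Attach each observation's boundary via the shared area code."""
    by_code = {f["properties"][code_prop]: f for f in collection["features"]}
    return [
        {**obs, "geometry": (by_code[obs["REF_AREA"]]["geometry"]
                             if obs["REF_AREA"] in by_code else None)}
        for obs in observations
    ]

linked = link(observations, boundaries)
```

Note what the sketch exposes: the join only works if both sides actually share the identifier, which is why the GSGF's emphasis falls on common geographies and codes rather than on merging the two standards families.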

Unstructured knowledge

Policy reports, NDC commitments, project evaluations, research findings, corridor studies, institutional knowledge in PDFs and Word documents — SDMX has no mechanism for any of this. It's a standard for structured, validated, numerical statistical data. The transport sector's unstructured knowledge base is arguably larger than its structured data, and it's what gives the numbers context: why a country's road investment dropped, what a new corridor strategy commits to, what research found about maintenance cost-effectiveness.

AI agents with RAG (retrieval-augmented generation) can search, index, and synthesise this knowledge. No data standard is needed. What's needed is indexing infrastructure and the ability to cross-reference unstructured findings with structured data in a single query flow.
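The "single query flow" can be sketched crudely. Keyword matching below stands in for a real RAG retrieval index, and all documents and indicators are invented for illustration:

```python
# Sketch of a single query flow over both worlds: retrieve an
# unstructured finding and pair it with the related structured
# indicator. Keyword matching stands in for a real RAG index;
# all content is illustrative.

documents = [
    {"source": "corridor-study.pdf",
     "text": "The northern corridor strategy commits to doubling rail freight capacity."},
    {"source": "evaluation-2024.pdf",
     "text": "Road maintenance spending fell after the 2022 budget revision."},
]

indicators = {  # structured side, keyed by topic
    "rail freight": {"series": "FREIGHT_RAIL_TONNES", "2024": 12.4},
    "road maintenance": {"series": "ROAD_MAINT_SPEND", "2024": 0.8},
}

def answer(query):
    """Pair the matching document snippets with the matching indicator."""
    words = query.lower().split()
    hits = [d for d in documents
            if any(w in d["text"].lower() for w in words)]
    topic = next((t for t in indicators if t in query.lower()), None)
    return {"evidence": hits, "data": indicators.get(topic)}

result = answer("road maintenance trends")
```

The structured lookup gives the number; the retrieved passage explains why the number moved. That pairing is the capability neither SDMX nor a document index delivers alone.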


Implications for TGIS

TGIS sits at the intersection of these three worlds.

For structured statistical data, the path is clear: align with SDMX, build on TDC, contribute to both. Agents can accelerate TDC's SDMX standards pipeline — drafting transport DSDs, automating provider onboarding, translating natural language queries into SDMX calls. MCP servers for SDMX sources should be built as contributions to the SDMX ecosystem's own AI-readiness push, not as proprietary wrappers.

For geospatial data, TGIS fills a genuine gap. Road networks from OPSIS, trade disruption feeds from PortWatch, infrastructure maps from Overture — these need agent-accessible interfaces that SDMX will not provide. The access patterns are different (bounding box queries, spatial joins, feature-level attributes), and the agent orchestration that combines statistical indicators with spatial infrastructure data is where TGIS adds value that neither SDMX nor TDC alone can deliver.

For unstructured knowledge, TGIS provides the RAG layer. Policy documents, research findings, NDC commitments — indexed, searchable, and synthesisable alongside structured data. This is the least standardised and most AI-native part of the stack.

The strategic position: TGIS is not building around SDMX or TDC. It's using agentic AI to accelerate what they do well, and to connect them with the geospatial and unstructured worlds that they don't reach.


Key sources

  • SDMX Sponsor Organisations' Joint Statement on AI-Readiness (September 2025) — sdmx.org
  • 10th SDMX Global Conference proceedings (Rome, September–October 2025) — sdmx.org
  • SDMX+AI Workshop Summary Report (March 2024) — siscc.org
  • SDMX 3.0 Section 6 Technical Notes — geospatial support specification
  • SDMX 3.1 release (May 2025) — sdmx.org
  • SDMX Roadmap 2021–2025 — published by SDMX sponsors
  • IFC Survey on SDMX adoption in central banks (2023) — bis.org
  • Global Statistical Geospatial Framework, 2nd Edition (August 2025) — UN-GGIM
  • TDC Data Standards Documentation — docs.transport-data.org
  • Esri "Geoenable SDMX with Data Pipelines" (February 2024) — esri.com
  • Google Data Commons MCP Server (September 2025) — developers.googleblog.com
  • IMF StatGPT — imf.org
  • MAIA Metadata AI Assistant — sdmx.io

This document should be updated as the SDMX sponsors publish their next roadmap (expected 2026) and as TDC's SDMX coverage expands.