MONDOGRAPH · THE WORLD'S PROPER-NAME KNOWLEDGE GRAPH

The biggest, most complete, most accurate knowledge graph of proper names.

MondoGraph is Mondonomo's defensible moat. A decade of academic and industrial research distilled into a single graph: 126 billion attestations, every script in active use, every country, with formal etymology and the connections between names — romanizations, transliterations, soundalikes, cognates, variants — modelled as first-class edges.

126B token attestations · 165M unique token strings · 2,410 language codes · 269 countries


WHY A KNOWLEDGE GRAPH

Names are 99% of language — and almost entirely outside the reach of general LLMs.

General-purpose models fail on the long tail of proper names: they mispronounce, mistranslate, and misgender names, and they confuse unrelated people whose names are spelled alike. MondoGraph is the substrate that lets a tiny specialized model handle pronunciation, translation, gender, and disambiguation correctly, and a hundred times cheaper than GPT-4.

Coverage no LLM has

53M distinct given-name forms, 49M surname forms. Rare regional names with three bearers are in MondoGraph and not in your foundation model's training corpus.

Edges, not just nodes

Every name connects to its romanizations, transliterations, IPA, soundalikes, etymological cognates, and bearer demographics. The graph is the model.

Self-improving ecosystem

B2C surfaces (mondonomo.ai, thai.mondonomo.ai, echoes) feed user contributions back into MondoGraph. A flywheel that's hard to copy.

SCHEMA

A name is not a string. It's a node with twelve kinds of edges.

Most "name lists" are flat tables. MondoGraph models a name as a node connected to scripts, languages, countries, romanizations, transliterations, IPA realizations, soundalike clusters, etymological roots, variants, parsed parts, gender distributions, and known bearers. The connections are what make the downstream models possible.

Scripts, languages, countries · Variants & transliterations · Phonetic forms · Parsed structure & demographics · Known bearers
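The node-and-edges idea described above can be sketched in a few lines of Python. This is only an illustration of the modelling approach, not MondoGraph's actual schema: the class names, the `EdgeKind` members, and the sample values are all assumptions made for the example, and only a subset of edge kinds is shown.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class EdgeKind(Enum):
    # Hypothetical identifiers for a subset of the edge kinds named on this page.
    SCRIPT = "script"
    LANGUAGE = "language"
    ROMANIZATION = "romanization"
    IPA = "ipa"
    SOUNDALIKE = "soundalike"
    ETYMOLOGICAL_ROOT = "etymological_root"


@dataclass
class NameNode:
    """A name token plus its typed outgoing edges (illustrative only)."""
    token: str
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def link(self, kind: EdgeKind, target: str) -> None:
        # Record one typed edge from this name to a target value.
        self.edges[kind].add(target)

    def neighbours(self, kind: EdgeKind) -> set:
        # Return a copy of the targets for one edge kind (empty if none).
        return set(self.edges.get(kind, set()))


# Sample values are illustrative and not taken from MondoGraph.
node = NameNode("Иван")
node.link(EdgeKind.SCRIPT, "Cyrl")
node.link(EdgeKind.ROMANIZATION, "Ivan")
node.link(EdgeKind.LANGUAGE, "ru")
node.link(EdgeKind.LANGUAGE, "bg")

print(sorted(node.neighbours(EdgeKind.LANGUAGE)))
```

Keeping the edge kind explicit is what separates this from a flat table: a query such as "all romanizations of this token" becomes a lookup on one edge type rather than a scan over columns.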

WHAT'S INSIDE — TOKEN INVENTORY

556M rows. Six entity classes. Honest about coverage.

Tokens by entity type

126B total attestations across 165M unique strings

Location · 60B · 47.7%
Surname · 38B · 30.0%
Given · 22B · 17.3%
Org, Title, Patronymic · ~5% combined
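As a quick sanity check, the entity-type shares can be approximately reproduced from the rounded counts printed on this page. The sketch below assumes the published counts are rounded to the nearest billion, so the recomputed percentages may differ from the published ones by a fraction of a point. The variable and function names are illustrative.

```python
# Rounded counts as printed on this page, in billions of attestations.
TOTAL_B = 126
counts_b = {"Location": 60, "Surname": 38, "Given": 22}


def share_pct(count_b: float, total_b: float = TOTAL_B) -> float:
    """Percentage share of the total, rounded to one decimal place."""
    return round(100 * count_b / total_b, 1)


# Shares of the three largest entity classes.
shares = {name: share_pct(count) for name, count in counts_b.items()}

# Whatever is left over belongs to Org, Title, and Patronymic together.
remainder_b = TOTAL_B - sum(counts_b.values())
remainder_pct = share_pct(remainder_b)

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"Org + Title + Patronymic: {remainder_b}B, {remainder_pct}%")
```

The recomputed figures land within a few tenths of a point of the published 47.7%, 30.0%, and 17.3%, and the remainder is consistent with the stated ~5%.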

Person-name lexicon

GIVEN + SURNAME — the substrate of every PNEUMA-DD model

GIVEN names · 53M unique forms · 22B attestations

SURNAMES · 49M unique forms · 38B attestations

Top languages by token volume

Of 2,410 language codes

English · 24.2%
Chinese · 9.1%
Unknown (xx) · 9.1%
Hindi · 8.7%
Tibetan (bo) · 3.8%
Spanish, Portuguese, Russian, French, Arabic · 2–6% each
Uyghur, Mongolian, Zhuang · ~2% each

Data note. The shares for Tibetan, Uyghur, Mongolian, and Zhuang are disproportionate to their speaker populations, which indicates concentrated source corpora (likely Chinese administrative or institutional records). The xx bucket (9.1%) covers tokens whose source language could not be determined. Both are surfaced rather than hidden.

ACCESS

How to use it.

Production API

REST + Python SDK

Hit MondoGraph through PNEUMA-DD and MondoPhon endpoints. Sub-50ms typical latency. Same keys work across all six demo endpoints. Free tier covers research and prototyping.

Bulk dataset

Hugging Face

Filtered slices for academic use (Apache 2.0). Full graph available under commercial license. Snapshot updates quarterly.

Research collaboration

co-author

Mondonomo collaborates on onomastic research and dataset extensions. Past partners: University of Zagreb, Chulalongkorn University. Reach out for new languages.