AI/TLDR

NVIDIA · 2026-04-21 · notable

Nemotron-Personas-Korea — 7M Synthetic Personas from Official Korean Demographics

NVIDIA releases 7M synthetic Korean personas (1M records × 7 persona types) grounded in KOSIS census data, licensed CC BY 4.0 and containing no PII. The dataset covers all 17 Korean provinces and totals 1.7B tokens. It is intended for training Korean-language AI agents and improving demographic representation.

NVIDIA's Korean-language persona dataset: 7 million synthetic personas grounded in official government statistics, CC BY 4.0.

Key specs

License: CC BY 4.0
Total personas: 7M
Records: 1M
Total tokens: 1.7B
Language: Korean

What is it?

Nemotron-Personas-Korea is a synthetic dataset of 7 million Korean personas generated by NVIDIA with its NeMo Data Designer pipeline. Each of the 1 million base records has 7 persona variants (professional, family, sports, arts, travel, culinary, concise) and 26 fields covering demographics, geography, occupations, life stages, and personal narratives written in natural Korean. All data is synthesized, and no record corresponds to a real individual. The dataset is released under CC BY 4.0 and is described as fully compliant with PIPA, Korea's Personal Information Protection Act.

How does it work?

NVIDIA used a probabilistic graphical model seeded from four official Korean government databases (KOSIS census, Supreme Court name distributions, National Health Insurance Service, Korea Rural Economic Institute) to sample demographically realistic persona parameters. Gemma-4-31B then generated natural-language narratives for each persona field. This two-stage approach produces statistically realistic distributions (209K unique names, 2K+ occupations, 17 provinces, 252+ districts) without any PII.
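The two-stage idea above can be illustrated with a minimal sketch: first sample structured attributes from conditional distributions, then turn them into a prompt for a language model. This is not NVIDIA's pipeline. The variables, the probabilities, and the function names (`sample_categorical`, `sample_persona_parameters`, `build_narrative_prompt`) are all hypothetical, and the numbers are made up for illustration rather than taken from KOSIS or any other source.

```python
import random

# Made-up toy distributions. The real pipeline is seeded from official
# statistics; these numbers are placeholders for illustration only.
PROVINCE_WEIGHTS = {"Seoul": 0.5, "Busan": 0.3, "Jeju": 0.2}

AGE_BAND_GIVEN_PROVINCE = {
    "Seoul": {"20-39": 0.5, "40-59": 0.3, "60+": 0.2},
    "Busan": {"20-39": 0.3, "40-59": 0.4, "60+": 0.3},
    "Jeju": {"20-39": 0.2, "40-59": 0.4, "60+": 0.4},
}

OCCUPATION_GIVEN_AGE_BAND = {
    "20-39": {"software engineer": 0.6, "nurse": 0.4},
    "40-59": {"teacher": 0.5, "shop owner": 0.5},
    "60+": {"farmer": 0.5, "retired": 0.5},
}


def sample_categorical(dist, rng):
    """Draw one label from a {label: probability} mapping."""
    labels = list(dist)
    weights = [dist[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def sample_persona_parameters(rng):
    """Stage 1: ancestral sampling along province -> age band -> occupation."""
    province = sample_categorical(PROVINCE_WEIGHTS, rng)
    age_band = sample_categorical(AGE_BAND_GIVEN_PROVINCE[province], rng)
    occupation = sample_categorical(OCCUPATION_GIVEN_AGE_BAND[age_band], rng)
    return {"province": province, "age_band": age_band, "occupation": occupation}


def build_narrative_prompt(params, persona_type):
    """Stage 2 input: a prompt an LLM could expand into a Korean narrative."""
    return (
        f"Write a short {persona_type} persona in natural Korean for a person "
        f"living in {params['province']}, age band {params['age_band']}, "
        f"working as: {params['occupation']}."
    )


rng = random.Random(0)  # fixed seed so repeated runs give the same sample
params = sample_persona_parameters(rng)
prompt = build_narrative_prompt(params, "professional")
print(params)
print(prompt)
```

Because each attribute is drawn conditionally on the previous one, the joint distribution of sampled personas follows the seeded tables, while the free-text generation step only ever sees sampled attributes and never real records.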

Why does it matter?

Most large persona datasets are English-centric. Building Korean-language agents that respond appropriately to demographic nuance requires training data grounded in actual Korean demographics rather than personas translated from English. The official statistical grounding and PIPA compliance make the dataset usable in regulated contexts where the provenance of synthetic data matters.

Who is it for?

Teams building Korean-language AI agents, and researchers studying demographic representation.

Try it

huggingface.co/datasets/nvidia/Nemotron-Personas-Korea
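One way to take a first look at the data is to stream a few records with the Hugging Face `datasets` library. This is a minimal sketch, assuming the library is installed (`pip install datasets`) and the repository is publicly accessible. The helper name `peek_records` is hypothetical, and since the split names are not stated here, the code simply takes whichever split is listed first.

```python
DATASET_ID = "nvidia/Nemotron-Personas-Korea"  # repository path from the link above


def peek_records(n=3):
    """Stream the first n records without downloading the whole dataset."""
    # Imported lazily so the module loads even if `datasets` is not installed.
    from datasets import load_dataset

    # With streaming=True and no split argument, this returns a mapping of
    # split name -> iterable dataset. Split names are not assumed here.
    splits = load_dataset(DATASET_ID, streaming=True)
    first_split = next(iter(splits.values()))
    return list(first_split.take(n))
```

Calling `peek_records()` requires network access. Printing `sorted(record.keys())` for one returned record is a quick way to see the actual field names before writing any processing code.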

Tags

  • dataset
  • synthetic-data
  • korean
  • personas
  • nvidia
  • nemotron
  • cc-by-4-0
  • agents
  • multilingual
  • demographics
