NVIDIA · 2026-04-21 · notable
Nemotron-Personas-Korea — 7M Synthetic Personas from Official Korean Demographics
NVIDIA releases 7M synthetic Korean personas (1M records × 7 types) grounded in KOSIS census data, CC BY 4.0, zero PII. Covers all 17 Korean provinces with 1.7B tokens total, for training Korean-language AI agents and improving demographic representation.

NVIDIA's Korean-language persona dataset: 7 million synthetic personas grounded in official government statistics, CC BY 4.0.
Key specs
| Spec | Value |
|---|---|
| License | CC BY 4.0 |
| Total personas | 7M |
| Base records | 1M (7 personas each) |
| Total tokens | 1.7B |
| Language | Korean |
What is it?
Nemotron-Personas-Korea is a synthetic dataset of 7 million Korean personas that NVIDIA generated with its NeMo Data Designer pipeline. Each of the 1 million base records carries 7 persona variants (professional, family, sports, arts, travel, culinary, concise). Records span 26 fields covering demographics, geography, occupations, life stages, and personal narratives in natural Korean. All data is synthesized, and no record describes a real individual. The dataset is released under CC BY 4.0 and is described as compliant with PIPA, South Korea's Personal Information Protection Act.
How does it work?
NVIDIA used a probabilistic graphical model seeded from four official Korean government databases (KOSIS census, Supreme Court name distributions, National Health Insurance Service, Korea Rural Economic Institute) to sample demographically realistic persona parameters. Gemma-4-31B then generated natural-language narratives for each persona field. This two-stage approach produces statistically realistic distributions (209K unique names, 2K+ occupations, 17 provinces, 252+ districts) without any PII.
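The two-stage idea described above can be sketched in a few lines. This is a minimal illustration, not NVIDIA's pipeline: the three-node network (province, then age band, then occupation), every probability in it, and the helper names `draw`, `sample_persona`, and `build_prompt` are all hypothetical placeholders. Stage one samples attributes from conditional distributions in dependency order; stage two only builds the prompt a language model would receive and does not call any model.

```python
import random

# Illustrative placeholder probabilities, NOT real KOSIS figures.
P_PROVINCE = {"Seoul": 0.5, "Busan": 0.3, "Jeju": 0.2}
P_AGE_GIVEN_PROVINCE = {
    "Seoul": {"20-39": 0.5, "40-64": 0.35, "65+": 0.15},
    "Busan": {"20-39": 0.4, "40-64": 0.4, "65+": 0.2},
    "Jeju": {"20-39": 0.3, "40-64": 0.4, "65+": 0.3},
}
P_OCCUPATION_GIVEN_AGE = {
    "20-39": {"software engineer": 0.5, "barista": 0.5},
    "40-64": {"teacher": 0.6, "fisher": 0.4},
    "65+": {"retired": 0.8, "farmer": 0.2},
}


def draw(dist, rng):
    """Draw one key from a {value: probability} mapping."""
    values = list(dist)
    weights = list(dist.values())
    return rng.choices(values, weights=weights, k=1)[0]


def sample_persona(rng):
    """Stage 1: sample each attribute conditioned on its parent."""
    province = draw(P_PROVINCE, rng)
    age_band = draw(P_AGE_GIVEN_PROVINCE[province], rng)
    occupation = draw(P_OCCUPATION_GIVEN_AGE[age_band], rng)
    return {"province": province, "age_band": age_band, "occupation": occupation}


def build_prompt(persona, persona_type):
    """Stage 2: turn sampled attributes into a prompt for a language model."""
    return (
        f"Write a short {persona_type} persona in Korean for a person "
        f"aged {persona['age_band']} living in {persona['province']} "
        f"whose occupation is: {persona['occupation']}."
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
persona = sample_persona(rng)
prompt = build_prompt(persona, "professional")
print(persona)
print(prompt)
```

Because the attributes are sampled before any text is written, the demographic distribution is fixed by the statistical model rather than by the language model, which is the property the paragraph above attributes to the two-stage design.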
Why does it matter?
Most large persona datasets are English-centric. Building Korean-language agents that respond appropriately to demographic nuance requires training data grounded in actual Korean demographics — not translated English personas. The official statistical grounding and PIPA compliance make this usable in regulated contexts where synthetic data provenance matters.
Who is it for?
Teams building Korean-language AI agents and researchers studying demographic representation.
Try it
huggingface.co/datasets/nvidia/Nemotron-Personas-Korea
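
One possible way to peek at a few records is the Hugging Face `datasets` library in streaming mode, which avoids downloading the full dataset. This is a sketch under assumptions: the repository ID is taken from the URL above, the split name `"train"` is a guess that should be checked against the dataset card, and `first_records` and `stream_personas` are hypothetical helper names.

```python
from itertools import islice

DATASET_ID = "nvidia/Nemotron-Personas-Korea"  # taken from the URL above


def first_records(records, n=3):
    """Return the first n items from any iterable of records."""
    return list(islice(records, n))


def stream_personas(split="train"):
    """Open the dataset in streaming mode.

    Needs `pip install datasets` and network access. The split name
    "train" is an assumption; check the dataset card.
    """
    from datasets import load_dataset  # imported lazily so the helper above works offline

    return load_dataset(DATASET_ID, split=split, streaming=True)


# Example usage (commented out because it needs network access):
# for row in first_records(stream_personas()):
#     print(sorted(row.keys()))
```

Printing the keys first is a cheap way to see the actual field names before writing any code that depends on them.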