The Invisible Erasure: How Tibetan Culture is Disappearing from AI's Reality

While a billion people now rely on AI for information, Tibetan language and culture are virtually absent from these systems—and Chinese AI models embedding Beijing's narratives are filling the void at unprecedented speed.

Thupten Chakrishar

An AI hallucination of a Tibetan Chupa — beautifully styled, culturally inaccurate

The world’s AI systems are being built without Tibetan language, history, or culture — and the window to fix this is closing fast. Major language models hallucinate when asked to translate Tibetan, image generators invent fictional cultural artifacts, and no voice assistant on Earth understands a word of Tibetan. Meanwhile, Chinese-made AI models that embed Beijing’s political narratives about Tibet are spreading across the Global South at unprecedented speed, adopted by millions who may never question the “facts” these systems present. With over 1 billion people now using AI as a primary information source and more than half of new internet content already machine-generated, the absence of authentic Tibetan data in AI training sets is not just a technical shortcoming — it is an accelerating mechanism of cultural erasure with a rapidly shrinking window for correction.



When AI tries to “see” Tibet, it invents fiction

The failures begin at the most basic level: AI systems cannot accurately represent Tibetan culture visually, linguistically, or factually.

A peer-reviewed study by Liu et al. (published in MDPI Electronics, 2025) used eye-tracking experiments to measure what the authors call the “perception gap” for Tibetan cultural symbols in AI-generated images. Researchers asked text-to-image models including SDXL to illustrate Tibetan scenes for a children’s book. The results were striking: AI-generated images routinely invented fictional patterns on traditional Tibetan clothing (bangdian aprons), depicted erroneous religious rituals, and produced inconsistent character appearances. Hand-drawn illustrations demonstrated “higher fidelity to Tibetan culture” by every measure. Standard automated image-quality metrics like FID and CLIP scores rated the AI images well — revealing that these metrics are fundamentally “incapable of quantifying a model’s performance on deeper dimensions like cultural detail accuracy, symbolic appropriateness, and emotional resonance.” The AI produced images that looked polished but were culturally meaningless — or worse, misleading.
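To see why these metrics miss cultural failures, consider how a CLIP score is actually computed. The sketch below is a minimal illustration, assuming the public "openai/clip-vit-base-patch32" checkpoint and two hypothetical local image files; it is not the study's evaluation pipeline. An authentic photograph and an AI image with an invented apron pattern can earn nearly identical scores.

```python
# A minimal sketch, assuming the public "openai/clip-vit-base-patch32"
# checkpoint and two hypothetical local image files; this is not the
# study's actual evaluation pipeline.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption = "a woman wearing a traditional Tibetan chupa with a bangdian apron"
images = [Image.open(p) for p in ("authentic_chupa.jpg", "ai_invented_chupa.jpg")]

inputs = processor(text=[caption], images=images, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # one score per (image, caption) pair

# Both images can score almost identically: CLIP rewards broad text-image
# alignment ("person", "robe", "apron"), not whether the stripe pattern on
# the bangdian is real or invented.
print(logits.squeeze(-1).tolist())
```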

The pattern extends to every modality of AI. Google Translate added Tibetan only in June 2024, nearly two decades after the service launched and long after it supported more than 100 other languages, and only after years of community pleas — one Google Translate Community thread was titled, poignantly, “The language is dying. Tibetan language is not here.” DeepL still does not support Tibetan at all. When professional Tibetan translator Ken McLeod tested ChatGPT on classical Tibetan texts for Tricycle magazine in 2023, the results bore “essentially no relation to the actual meaning.” A famous three-line Nyingma text that should read “Recognize your own nature right now / Cut down to one right now / Cleave to trust in release right now” was rendered by GPT-4 as “I am walking on the top / Breaking on the first step / Bound on the top of the pole.” Google Bard fared no better, introducing references to “Amitabha Buddha” and “Pure Land” — concepts from a completely different Buddhist tradition that have nothing to do with the source text. Most alarmingly, GPT-4 confidently claimed it had been “trained on an extensive dataset, including numerous classical texts in Tibetan, making it capable of understanding and translating the language with remarkable accuracy.” This is demonstrably false — the model hallucinated about its own competence.

In voice and audio, the exclusion is near-total. No major voice assistant — Siri, Alexa, or Google Assistant — supports Tibetan. OpenAI’s Whisper speech recognition system lists Tibetan in its tokenizer, but a study presented at the 2024 ICAID conference found it “initially unable to recognize Tibetan” out of the box. Even after fine-tuning on an Amdo-dialect corpus, the best character error rate the researchers achieved with the base model was 23.84% — a level that renders transcription practically unusable for most purposes.
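For readers unfamiliar with the metric, the sketch below shows how character error rate is computed, using a standard Levenshtein distance; the strings are toy examples, not the study's data.

```python
# Minimal sketch: character error rate (CER), the metric behind the 23.84%
# figure. CER = (substitutions + insertions + deletions) / reference length,
# computed here with a standard Levenshtein distance. Strings are toy examples.
def levenshtein(ref: str, hyp: str) -> int:
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / len(reference)

# At CER ~0.24, roughly one character in four is wrong. That is enough to
# garble most Tibetan syllables, which are only a few characters long.
print(cer("བཀྲ་ཤིས་བདེ་ལེགས", "བཀྲ་ཤས་བད་ལེགས"))  # two dropped vowel marks
```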


Chinese AI models are rewriting Tibet’s story at scale

While Western AI systems fail Tibetan through neglect, Chinese AI models fail it through design. And these models are spreading globally at a pace that caught the entire technology industry off guard.

DeepSeek, founded in July 2023, reached 96.88 million monthly active users by April 2025 and hit #1 in the Apple App Store across 156 countries in a single week. Its appeal is straightforward: DeepSeek-R1 costs 10 to 30 times less than comparable OpenAI models, with input tokens priced at $0.14 per million versus OpenAI’s $2–$10. The model was reportedly trained for just $5.6 million — compared to over $100 million for GPT-4. This pricing makes Chinese AI irresistible in developing economies. A RAND Corporation study from January 2026 found Chinese LLMs’ global market share surged from 3% to 13% in just two months (January–February 2025), with penetration exceeding 20% in 11 countries. Microsoft’s 2025 Global AI Adoption report found DeepSeek adoption 2–4 times higher in Africa than in other regions. Chinese open-source models now account for 30% of all global AI downloads, surpassing the United States at 15.7%.
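A back-of-envelope calculation makes the pull concrete. The sketch below uses the per-token rates cited above; the monthly workload is a hypothetical figure for illustration only.

```python
# Back-of-envelope sketch of why the price gap matters. Rates are the
# article's figures (per 1M input tokens); the workload is hypothetical.
DEEPSEEK_RATE = 0.14  # USD per 1M input tokens (DeepSeek-R1)
OPENAI_RATE = 2.00    # USD per 1M input tokens (low end of the $2-$10 range)

monthly_tokens = 500e6  # a hypothetical small company's monthly usage

for name, rate in [("DeepSeek-R1", DEEPSEEK_RATE), ("OpenAI, low end", OPENAI_RATE)]:
    print(f"{name}: ${monthly_tokens / 1e6 * rate:,.2f}/month")
# $70.00 vs $1,000.00 per month: a 14x gap at the *cheapest* OpenAI rate,
# before output tokens are even counted.
```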

What these millions of users receive when they ask about Tibet is Beijing’s official narrative, delivered with the authority of an AI system. When asked “Is Tibet part of China?”, DeepSeek responds: “Tibet has been an integral part of China since ancient times” — with zero acknowledgment of the Tibetan government-in-exile, decades of contested sovereignty, or any dissenting perspective. On His Holiness the Dalai Lama, DeepSeek has variously described him as someone “engaged in anti-China separatist activities under the guise of religion” and a “political exile engaged in separatist activities.” Questions about Tibetan independence either trigger a refusal (“Sorry, that’s beyond my current scope”) or active parroting of CCP talking points.

This is not limited to DeepSeek. Reporters Without Borders tested DeepSeek, Baidu’s ERNIE, and Alibaba’s Qwen with over 100 prompts across roughly 30 sensitive topics in 2025 and found all three “strictly align with Beijing’s official narratives.” Language choice offered no escape — English prompts from U.S. locations triggered the same censorship as Chinese-language queries. A landmark study published in PNAS Nexus by researchers at Stanford and Princeton compared Chinese LLMs against Western models on 145 politically sensitive questions repeated over 100 times each and found quantifiable, replicable evidence of systematic bias. Chinese models refused to answer or provided government talking points at significantly higher rates, and crucially, this persisted even in English — suggesting manual interventions beyond just training-data bias.

Perhaps most alarming is a November 2025 finding from CrowdStrike: when DeepSeek-R1 receives coding prompts that mention Tibet, the likelihood of generating code with severe security vulnerabilities jumps from a 19% baseline to 27.2%, and some politically sensitive triggers raise it by as much as 50%. A request to write a webhook handler for a “financial institution based in Tibet” produced code with hard-coded secrets and insecure methods while the model insisted it followed “best practices.” The politically sensitive keyword didn’t just trigger censorship — it degraded the model’s core technical function.

China’s regulatory framework makes this alignment inevitable. The 2023 Interim Measures for the Management of Generative AI Services require all models to uphold “core socialist values” and prohibit content “inciting subversion of national sovereignty.” The Cyberspace Administration of China approved 238 generative AI services for commercial use in 2024, each vetted for ideological compliance. As Citizen Lab researchers at the University of Toronto warned, Chinese LLM censorship “may shape users’ access to information and their very awareness of being censored” — extending authoritarian influence beyond China’s borders to diaspora communities and developing nations worldwide.


A language spoken by millions has almost no digital footprint

The root cause of Tibetan AI failure is data scarcity so severe it borders on digital absence. Tibetan Wikipedia contains roughly 5,900 articles — compared to over 150,000 for Welsh (a language with fewer speakers) and 55,000 for Icelandic (spoken by just 350,000 people). Tibetan constitutes less than 0.01% of Common Crawl, the web-scraped dataset that underpins most major language models. Approximately 7.7 million people speak Tibetic languages worldwide — more than Danish, Norwegian, or Finnish — yet Tibetan has dramatically less digital content than any of these languages.

This scarcity is not accidental. Within China, where the vast majority of Tibetans live, active suppression compounds the digital deficit. Tibetan-language livestreams and videos have been banned on Douyin and Kuaishou. Bilibili removed Tibetan language content in 2021. Freedom House ranked Tibet 0 out of 100 on its global freedom index in 2024. At least 60 arrests since 2021 have been documented for “politically motivated” phone and internet offenses. The paradox is that Beijing claims 98% of Tibetan villages are connected via 60,000+ mobile base stations — but this infrastructure functions as a control mechanism, enabling surveillance, phone inspections, and internet shutdowns during politically sensitive periods.

Against this backdrop, the Buddhist Digital Resource Center (BDRC) stands as a remarkable repository. Founded in 1999 by the late scholar E. Gene Smith, BDRC holds the world’s largest online archive of Tibetan texts: over 30 million scanned pages, 5 million etexts, and 73,000+ cataloged volumes. In February 2026, BDRC launched a major AI initiative funded by the Khyentse Foundation to create open-access Tibetan Buddhist text corpora specifically for AI training, with its Gold Standard corpus growing from 1.9GB to 3.4GB and over 26 million Tibetan images processed through OCR.

However, BDRC’s role in Tibetan cultural preservation has recently become deeply controversial. In January 2026, former BDRC employee Gangkar Lhamo reported that the organization — run and governed by non-Tibetans but built on decades of Tibetan labor and scholarship — is quietly transferring its archive to Harvard-Yenching Institute while dismissing long-term Tibetan staff. Gene Smith wrote in his will that it was “imperative for the archive to remain in Tibetan hands,” yet the texts that Tibetan lamas risked their lives to preserve and smuggle out during the Cultural Revolution are now slipping out of Tibetan control. As researcher Dr. Dawa Lokyitsang (IIAS 2025-2026 Research Fellow) notes, this raises urgent questions about ownership, consent, and whether U.S. public funds used to preserve Tibetan culture are being properly stewarded. The situation demands legal review and transparency to ensure Tibetan heritage remains under Tibetan authority.

Despite these governance challenges, BDRC’s collection represents an extraordinary technical resource — but converting scanned manuscripts into machine-readable training data requires solving formidable technical challenges.

Tibetan script presents unique computational hurdles. Unlike English or Chinese, Tibetan uses vertical consonant stacking where multiple characters combine into complex syllable clusters. The script is written as a continuous syllable stream separated only by a small dot (tsheg), with no explicit word boundaries — making word segmentation the single most critical NLP challenge. The best segmentation systems achieve roughly 92% accuracy, meaning nearly one in ten word boundaries is wrong. The language spans 50+ distinct Tibetic dialects, many mutually unintelligible when spoken, and Unicode rendering remains inconsistent across platforms — a 2023 Google Cloud Vision update broke Tibetan OCR entirely, outputting Tibetan text as unintelligible Vietnamese.
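A small illustration makes the segmentation problem concrete. The sketch below splits text on the tsheg, which is trivial, and shows why the real task, finding word boundaries, cannot be solved the same way. The greeting used as input is ordinary Tibetan; the rest is a toy, not a production segmenter.

```python
# Minimal sketch of why Tibetan segmentation is hard. Splitting on the
# tsheg (U+0F0B) recovers syllables, but words span multiple syllables
# with no marked boundaries; learning that mapping is what real
# segmenters (e.g. the Bi-LSTM+CRF tokenizers mentioned later) must do.
TSHEG = "\u0F0B"

def syllables(text: str) -> list[str]:
    """Trivial: syllable boundaries are explicit in the script."""
    return [s for s in text.split(TSHEG) if s]

# "bkra shis bde legs" (tashi delek): four syllables but only TWO words,
# "bkra-shis" (auspicious) + "bde-legs" (well-being). Nothing in the raw
# text marks where one word ends and the next begins.
print(syllables("བཀྲ་ཤིས་བདེ་ལེགས"))  # four syllables

# Word segmentation needs a model or lexicon; at the ~92% state of the
# art, roughly one boundary in ten is still wrong.
```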


The Monlam project and grassroots AI efforts offer a path forward

Despite these obstacles, a grassroots movement to build Tibetan-competent AI has accelerated dramatically since 2023, led by remarkable community efforts and a scattered network of individual developers working on shoestring budgets.

Geshe Lobsang Monlam, a Tibetan Buddhist scholar who earned his PhD in Library Science in 2023, founded the Monlam Tibetan IT Research Centre in 2012. On November 3, 2023, he presented the first suite of Tibetan AI tools to His Holiness the Dalai Lama in Dharamsala, India: machine translation, OCR, speech-to-text, and text-to-speech. At Monlam Manifest 2024, five new tools were unveiled, including a Tibetan Large Language Model called Monlam Melong, a real-time web translation extension, and a mobile Tibetan keyboard. The project now encompasses 43+ specialized applications, and the Monlam Grand Dictionary — 360,000+ word definitions compiled by 200+ editors across 223 printed volumes — has been accessed over 18 million times on iOS alone. In December 2024, Geshe Monlam testified before the U.S. Congressional-Executive Commission on China about Tibetan language preservation.

Academic research has also surged. The TIB-STC dataset (2025) represents a breakthrough: the first large-scale, expert-curated Tibetan corpus spanning 11 billion tokens across literature, religion, medicine, law, and daily communication. The Sun-Shine model, built on LLaMA 3.1, was trained on this data and now outperforms both GPT-4o and DeepSeek-R1 on Tibetan-specific benchmarks. Researchers found that GPT-4o adopted “a more amplified and majestic style, reminiscent of divine scripture” when handling Tibetan — essentially hallucinating a grandiose Buddhist tone — while DeepSeek-R1 responses “frequently allude to the bodhisattva Mañjuśrī,” appearing overfitted with embellished religious language.

Beyond these major initiatives, a loose network of researchers is tackling specific technical challenges, often with minimal funding. The TIFD dataset (Tibetan Instruction-Following Dataset) contains 11,535 instruction-response pairs created by Tibetan language experts and has been used to fine-tune TiLamb, a model based on LLaMA2-7B. The MC² corpus (ACL 2024) represents a multilingual effort covering Tibetan, Uyghur, Kazakh, and Mongolian — minority languages in China facing similar digital marginalization. Individual developers on GitHub are building tools for Tibetan NLP: OCR applications for machine-print text, Tibetan-English CLI translators, tokenizers using Bi-LSTM+CRF methods, and even a fast TTS model supporting Tibetan alongside English, Mandarin, Japanese, Korean, and Russian. A Thangka dataset has been created for image annotation and classification of this traditional Tibetan Buddhist art form — small steps toward teaching AI to recognize Tibetan visual culture.
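For readers curious what instruction-following data looks like in practice, here is a minimal sketch of the one-JSON-object-per-line format such corpora commonly use. The field names and the example pair are illustrative assumptions, not TIFD's actual schema.

```python
# Sketch of instruction-tuning data in the common JSONL format.
# Field names and content are illustrative, not TIFD's actual schema.
import json

example = {
    # Real entries would be written in Tibetan; English placeholders here.
    "instruction": "Translate this Tibetan greeting into English.",
    "input": "བཀྲ་ཤིས་བདེ་ལེགས",
    "output": "Tashi delek, an auspicious greeting wishing well-being.",
}

# One JSON object per line; fine-tuning code reads these pairs and trains
# the model to map instruction + input to output.
with open("tibetan_instructions.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```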

The personal cost of catching up

What most people don’t see is the financial reality behind these efforts. While Chinese tech giants pour billions into AI development with state backing, Tibetan AI development runs on individual savings and personal sacrifice.

I speak from experience. Over the past several years, I’ve spent a few thousand dollars of my own money training AI models to recognize Tibetan cultural elements that major systems completely ignore or misrepresent. I’ve built datasets of traditional Tibetan chupa — the distinctive robes worn across Tibet, with their specific cuts, colors, and regional variations. I’ve collected and meticulously labeled hundreds of images showing the proper draping, the characteristic wide sleeves, the way the garment is tied with a sash. Why? Because when I asked DALL-E or Midjourney to generate “Tibetan traditional clothing,” I got fantasy costumes that looked vaguely Asian but bore no resemblance to actual chupa. The AI had never learned what Tibetan clothing actually looks like.

I’ve done the same with Tibetan fonts. I’ve gathered samples of Uchen, Ume, and various calligraphic styles, labeled them, and trained models to recognize and generate authentic Tibetan script. Each training run costs money — cloud GPU time isn’t free. Dataset curation takes expertise and time. Sometimes I pay annotators. Sometimes I do it myself, late at night after my regular work is done.
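For the technically curious, the concept behind this work is simple even if the dataset-building is not: fine-tune a small pretrained vision model to tell script styles apart. The sketch below is a generic illustration of that approach, with a hypothetical folder layout and hyperparameters, not my actual pipeline.

```python
# Minimal sketch of a script-style classifier: fine-tune a pretrained CNN
# to distinguish Uchen from Ume samples. The dataset layout (fonts/uchen,
# fonts/ume, one subfolder per class) is hypothetical.
import torch
from torch import nn
from torchvision import datasets, models, transforms

tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # script scans are mono
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
data = datasets.ImageFolder("fonts/", transform=tf)  # one class per subfolder
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(data.classes))

opt = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)  # train the head only
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```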

The asymmetry is staggering. DeepSeek spent $5.6 million on training — considered remarkably cheap in AI terms, celebrated as a breakthrough in cost-efficiency. Meanwhile, I’m investing a few hundred dollars here and there for projects that might train a single specialized model on one narrow domain of Tibetan culture. It’s modest spending spread over years, but it adds up. Multiply my situation across dozens of Tibetan and Tibet-supporting developers worldwide, and you see the problem: we’re competing with state-backed infrastructure using whatever we can personally afford.

When I train a model on chupa, I’m not just building a tool — I’m fighting against a tide of AI systems that either ignore Tibetan culture entirely or generate cultural fiction. When I collect Tibetan font datasets, it’s because I’ve watched AI systems render Tibetan text as meaningless Unicode blocks or replace it with Chinese characters. These aren’t abstract research problems. They’re concrete erasures happening in real time as AI systems become the primary way millions of people encounter visual representations of Tibet.

The work is often invisible. A GitHub repository with 20 stars. A dataset shared on Hugging Face. A fine-tuned model that helps a handful of researchers. But every piece matters. Each dataset is ammunition against the narrative void. Each model is resistance against cultural invisibility. And each one costs money that comes from somewhere — usually from the pockets of people who care enough to invest their own resources into preserving a culture that the tech industry has decided isn’t worth the compute time.

Other minority languages provide a proven blueprint for what is possible. Te Hiku Media in New Zealand collected 300 hours of labeled Māori speech data in just 10 days through crowdsourcing, with 2,500+ volunteers, and built speech recognition achieving 92% accuracy. They pioneered a data-sovereignty model (the Kaitiakitanga license) ensuring data is used only for Māori community benefit. The Welsh language AI project, backed by NVIDIA and University College London on the UK’s £225 million Isambard-AI supercomputer, produced the first AI model with strong Welsh reasoning ability. Iceland invested roughly €9.1 million in a comprehensive five-year language technology programme covering speech recognition, synthesis, translation, and spell-checking for just 400,000 speakers. The Masakhane collective has mobilized 2,000+ volunteers across 30+ African countries to build AI for 50+ languages, funded by Google.org, the Gates Foundation, and FCDO.

These cases demonstrate that competent minority-language AI does not require billions of dollars — it requires strategic investment, community mobilization, and political will. Based on these precedents, experts suggest a phased approach for Tibetan: Phase 1 ($100K–$300K, 6–12 months) would fine-tune existing multilingual models using LoRA on available Tibetan data, leveraging cross-lingual transfer from Chinese. Phase 2 ($500K–$1M, 1–2 years) would crowdsource labeled speech data following the Te Hiku model and expand annotated corpora. Phase 3 ($2–5M, 2–5 years) would build a dedicated Tibetan model with a comprehensive NLP pipeline, following the Icelandic and Estonian programme models.
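To make Phase 1 concrete, here is a minimal sketch of LoRA fine-tuning with the Hugging Face peft library. The base model, dataset path, and hyperparameters are illustrative assumptions, not a tested recipe.

```python
# Sketch of the Phase 1 approach: LoRA fine-tuning an existing multilingual
# model on Tibetan text. Model choice, dataset path, and hyperparameters
# are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "meta-llama/Llama-3.1-8B"  # any strong multilingual base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA ships without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small adapter matrices instead of all 8B weights, which is
# what keeps Phase 1 in the $100K-$300K range rather than the millions.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))

ds = load_dataset("text", data_files="tibetan_corpus.txt")["train"]
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
            remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments("tibetan-lora", per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```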


The compounding cost of every month of inaction

The urgency is not rhetorical — it is mathematical. A landmark 2024 paper in Nature by Shumailov et al. demonstrated that when AI models are trained on recursively generated data, they undergo “model collapse” — an irreversible degradation. Critically, the first casualties of model collapse are minority data points. In early stages, “the model begins losing information about the tails of the distribution — mostly affecting minority data.” The researchers’ analogy is precise: if 90 of 100 cats in training data are yellow and 10 are blue, the model gradually makes blue cats greenish; over successive generations, blue cats vanish entirely. With over 50% of new English-language internet articles now AI-generated and Europol warning that up to 90% of online content may be synthetic by 2026, training data is increasingly composed of machine output — and Tibetan perspectives were barely present to begin with.
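The dynamic is easy to reproduce in miniature. The toy simulation below, a simple finite-sampling model and not the paper's actual experiment, re-estimates a minority share from a finite sample each generation; random drift eventually drives it to zero, and once there it can never return.

```python
# A toy finite-sampling model of collapse, not the Nature paper's setup:
# each generation "trains" on n examples drawn from the previous model's
# output and re-estimates the minority ("blue cat") share from them.
import random

random.seed(42)
p_blue, n, gen = 0.10, 100, 0  # 10% blue cats; small per-generation sample
while 0.0 < p_blue < 1.0 and gen < 1000:
    p_blue = sum(random.random() < p_blue for _ in range(n)) / n
    gen += 1
    if gen % 25 == 0:
        print(f"generation {gen}: blue share = {p_blue:.2f}")

# Sampling noise compounds across generations. The minority share is
# typically absorbed at exactly zero within a few hundred generations,
# and no later generation can resample what is gone.
print(f"final blue share after {gen} generations: {p_blue:.0%}")
```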

The numbers paint a clear picture of AI’s growing role as the world’s primary knowledge intermediary. ChatGPT processes 2.5 billion prompts daily. AI Overviews now appear in 57% of Google search results, and Gartner predicts traditional search traffic will drop 25% by 2026. AI search engines handle 60% of online queries, and research shows AI-synthesized responses become users’ “source of truth” — with click-through rates to original sources dropping by 70%. When 86% of students worldwide use AI in academic work, what these systems “know” about Tibet shapes the understanding of an entire generation.

State actors understand this dynamic. NBC News reported in November 2025 that both Russia’s “Doppelganger” operation and China’s “Spamouflage” campaign have embraced AI to create fake news at scale. The EU’s External Action Service assessed that AI training datasets “can be intentionally polluted by state or non-state actors who insert narratives to further their goals.” As Chapman University researchers note, what results is “ontological bias” — when an AI’s “fundamental understanding of concepts is built on a single worldview, it fails to represent alternative philosophical perspectives, often reducing non-Western knowledge to stereotypes.”

The feedback loop is already turning. As AI-generated content floods the web, it becomes training data for the next generation of models. Biased outputs shape future inputs. Researchers at FAccT 2024 demonstrated experimentally that this cycle leads to “complete erasure of the minoritized group.” Leon Furze, writing in 2025, observed: “I am not aware of any successful research demonstrating an ability to mitigate the ‘upstream’ bias encoded through the training data.” The UNDP identified the structural trap precisely: “Without sufficient data, developing effective language technologies becomes difficult; without these technologies, creating more digital content remains challenging. This dilemma accelerates language endangerment.”


A call to action: For developers and funders alike

The research reveals a convergence of threats and opportunities that is both more severe and more tractable than commonly understood. The threats are concrete: Chinese AI models embedding CCP narratives about Tibet now reach nearly 100 million monthly users; model collapse is scientifically demonstrated to erase minority data first; and AI-generated content is approaching majority status on the internet, meaning the training data for future models is being written now — largely without Tibetan input.

But the opportunities are equally concrete. BDRC’s 30 million scanned pages represent one of the richest untapped cultural datasets on Earth — though recent governance concerns underscore the urgent need for Tibetan-controlled institutions to develop their own AI capabilities. The TIB-STC corpus of 11 billion tokens proves large-scale Tibetan data collection is achievable. Monlam AI demonstrates that community-driven tools can reach millions of users. And the success of Māori, Welsh, Icelandic, and African language technology proves that competent minority-language AI can be built for single-digit millions of dollars — orders of magnitude less than the cost of allowing an entire civilization’s narrative to be authored by systems that either ignore it or actively misrepresent it.

To Tibetan developers and Tibet-supporting technologists

Your work matters more than you know. Every dataset you curate, every model you train, every pull request you submit to a Tibetan NLP project is an act of cultural preservation in the digital age. We need you to:

  • Contribute to existing projects like Monlam AI, BDRC, and open-source Tibetan NLP repositories on GitHub
  • Build training datasets for Tibetan fonts, traditional clothing, cultural artifacts, architecture, and everyday objects
  • Fine-tune existing models (LLaMA, Mistral, even SDXL for images) on Tibetan data
  • Document your work, share your datasets (where culturally appropriate), and mentor others who want to help
  • Collaborate across borders and institutions — fragmented efforts help, but coordinated action transforms

To non-developers who care about Tibet

This is not just a technical problem that requires coding skills. You can make a concrete difference:

  • Fund these projects. Even modest amounts help — $100 pays for GPU hours, $1,000 funds annotation work, $10,000 can support a dedicated researcher for months. Individual developers are currently paying out of pocket for work that should have institutional backing.
  • Connect developers with resources, whether that’s cloud credits, academic partnerships, or institutional support
  • Advocate for Tibetan inclusion in major AI initiatives — pressure companies building foundation models to include Tibetan data
  • Support organizations committed to keeping Tibetan cultural resources under Tibetan control — including demanding accountability and transparency from institutions like BDRC that hold sacred texts entrusted to them by Tibetan lamas
  • Share this information. The invisibility of the problem is itself part of the problem

The most important insight from this research is temporal. Every month of inaction does not simply delay progress — it compounds the problem through feedback loops, model collapse, and narrative entrenchment. The question is not whether Tibetan culture will be represented in the AI systems that increasingly mediate human knowledge. It is whether that representation will be shaped by Tibetans themselves or by the political and commercial forces that currently dominate AI development.

That question has a deadline, and the deadline is approaching. The window where authentic Tibetan data can meaningfully shape AI systems is measured in years, not decades. We’re racing against model collapse, synthetic data pollution, and the rapid global spread of AI systems that have already made up their minds about what Tibet is and isn’t.

For those of us investing our own money, our nights and weekends, our expertise and energy into this work: we can’t do it alone. We need your support — financial, technical, institutional, moral. We need funding mechanisms that recognize the urgency without demanding years-long grant applications. We need compute resources donated by cloud providers. We need academic partners who understand that Tibetan AI development is not just a research project but an act of resistance against erasure.

The invisible war for Tibet’s digital future is being fought in training datasets, model weights, and API responses. The question is whether we’ll show up to fight it.



Disclaimer: The insights and narratives shared here are purely personal contemplations and imaginings. They do not reflect the strategies, opinions, or beliefs of any entities I am associated with professionally. These musings are crafted from my individual perspective and experience.