Kuma

Building language AI for Sierra Leone's indigenous languages

Ibrahim Bayoh, Sadiu Sahid Kamara, Mohamed Fofanah, Sheku Munu, Iyeba Alpha Kallon

Directorate of Science, Technology and Innovation (DSTI)

Office of the President, Republic of Sierra Leone

May 2025

Introduction

How do you build AI that speaks for a country? Not just in its official language, but in the languages its people actually use. In Sierra Leone, over 8 million citizens speak more than 20 indigenous languages, yet every digital government service, every healthcare platform, every education tool is built exclusively in English. Kuma is our attempt to change that.

Background

Sierra Leone has a population of over 8 million people spread across 16 districts. The country recognizes over 20 indigenous languages, with Krio serving as the lingua franca spoken by approximately 97% of the population, alongside Temne (35%), Mende (31%), Limba (8%), and dozens of others. Despite this, all of Sierra Leone's digital government infrastructure, from health platforms to education systems to citizen service portals, operates exclusively in English, a language spoken fluently by fewer than 10% of the population.

This is not a minor accessibility issue. It is a structural barrier that compounds existing inequalities in healthcare, education, and civic participation. When a pregnant woman in Kono district cannot communicate her symptoms because the health worker's data entry system requires English, the consequences are clinical. When a student in Kenema cannot access their BECE results because the portal does not speak Krio, the consequences are educational. When a citizen cannot understand a government policy announcement because it was issued only in written English, the consequences are democratic.

The Problem

Sierra Leone has one of the highest maternal mortality rates in the world: 443 deaths per 100,000 live births according to recent WHO estimates. DSTI's PreSTrack platform was built to address this, giving healthcare workers a system to track pregnant women through pregnancy and flag high-risk cases for referral. It currently has 11,000 registered pregnant women, 1,500+ antenatal care visits tracked, 60 emergency referrals logged, and 1,000 health workers trained across three districts.

But PreSTrack is entirely in English.

In a country where most rural healthcare workers and patients communicate in Temne, Mende, or Krio, this forces an improvised and dangerous translation layer at the point of care. Workers guess at translations, record approximate information, and sometimes miss critical referral triggers. The same barrier exists in education (SenseBod serves 5,000+ students but delivers content in English), in civic access (WanGov is Sierra Leone's official digital citizen portal but operates only in English), and in governance.

Kuma is built to remove this barrier, not just for one platform, but for Sierra Leone's entire digital public infrastructure.

[Figure: a citizen speaks a local language → an English-only system → information is lost]

The Goal

We are building three core AI capabilities for Sierra Leone's major languages: Translation, Text-to-Speech (TTS), and Speech-to-Text (STT).

[Figure: input (speech or text) → translation → TTS/STT → local-language output]
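These three capabilities are intended to sit behind a single interface. The sketch below shows what such a unified language API surface might look like; every name here (KumaClient, translate, synthesize, transcribe) is hypothetical, and the model calls are stubbed out rather than Kuma's actual implementation.

```python
# Hypothetical sketch of a unified language API over the three capabilities.
# All names are illustrative; real model calls are stubbed out.
from dataclasses import dataclass

SUPPORTED = {"krio", "temne", "mende", "limba", "en"}

@dataclass
class TranslationResult:
    text: str
    source_lang: str
    target_lang: str

class KumaClient:
    """Illustrative facade over separate translation, TTS, and STT models."""

    def translate(self, text: str, source: str, target: str) -> TranslationResult:
        if source not in SUPPORTED or target not in SUPPORTED:
            raise ValueError(f"unsupported language pair: {source}->{target}")
        # Stub: a real implementation would call the fine-tuned
        # translation model here.
        return TranslationResult(text=text, source_lang=source, target_lang=target)

    def synthesize(self, text: str, lang: str) -> bytes:
        # Stub for TTS: would return audio bytes from the speech model.
        return b""

    def transcribe(self, audio: bytes, lang: str) -> str:
        # Stub for STT: would return a transcript from the speech model.
        return ""

client = KumaClient()
result = client.translate("How are you?", source="en", target="krio")
print(result.target_lang)  # -> krio
```

One facade over three independently trained models keeps downstream platforms (PreSTrack, SenseBod, WanGov) decoupled from how each capability is served.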

Data Challenges

Building AI for low-resource languages is fundamentally a data problem. Unlike English, French, or even major African languages like Swahili or Hausa, Sierra Leonean indigenous languages have virtually no existing digital corpora. There are no Wikipedia articles in Temne. There are no large-scale annotated speech datasets in Mende. There are no parallel translation corpora for Limba. Every dataset used in Kuma had to be built from scratch or adapted from adjacent sources.

This is not unusual for African language AI. It is the norm. The languages spoken by the majority of Africans remain the most underrepresented in global AI training data. Kuma is part of a broader movement alongside organizations like Masakhane, AI4D Africa, and others building foundational resources for African NLP.

Data Sources

African Bible text and audio corpus

The Bible has been translated into most Sierra Leonean languages and exists in both written and audio formats, making it one of the few large-scale parallel corpora available. While domain-limited, it provides a foundational phonemic and grammatical reference for each language.

Local community collection

Working with native speakers across Sierra Leone, we collected voice recordings, conversational audio, and written text in Krio, Temne, Mende, and Limba. Sessions were held in familiar environments to capture natural speech, including code-switching and regional accents.

Government platform data

Anonymized, consent-based linguistic data from DSTI's existing platforms including PreSTrack and the Government Services Portal (800,000+ SMS/WhatsApp users). This gives us domain-specific vocabulary for healthcare and civic service contexts.
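Before SMS or WhatsApp text enters a corpus, direct identifiers need to be scrubbed. A minimal sketch of such a pass is below; the regex patterns and placeholder tokens are assumptions for illustration, not DSTI's actual anonymization rules, and a production pipeline would handle many more identifier types (names, locations, dates).

```python
import re

# Illustrative anonymization pass for SMS/WhatsApp text before corpus use.
# Patterns below are assumptions for this sketch, not DSTI's actual rules.
PHONE_RE = re.compile(r"\+?\d[\d\s-]{6,}\d")   # e.g. +232 76 123456
ID_RE = re.compile(r"\b[A-Z]{2}\d{6,}\b")      # a hypothetical ID format

def scrub(message: str) -> str:
    """Replace direct identifiers with placeholder tokens."""
    message = PHONE_RE.sub("<PHONE>", message)
    message = ID_RE.sub("<ID>", message)
    return message

print(scrub("Call me na +232 76 123456"))  # -> Call me na <PHONE>
```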

Data Pipeline

Our pipeline puts community consent and native speaker validation at every stage, not as an afterthought.

Raw data → Collection and consent → Transcription and diarization → Linguistic annotation by native speakers → Quality validation → Preprocessing → Model training

Native speaker validation is a critical step that is often skipped in low-resource language AI work. At each annotation stage, we work with fluent native speakers to verify transcriptions, flag dialect variations, and correct model outputs. No automated validation tools exist for Sierra Leonean languages, so human judgment is irreplaceable here.
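The gated pipeline above can be sketched as a sequence of filtering stages in which no record reaches the training set without documented consent and native-speaker sign-off. All names and fields here are illustrative assumptions, not Kuma's actual data model.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the gated data pipeline: records reach the
# training set only after consent and native-speaker validation.
# All names are hypothetical.

@dataclass
class Record:
    audio_id: str
    transcript: str
    language: str
    consented: bool = False
    validated: bool = False
    notes: list = field(default_factory=list)

def collect(records):
    # Stage: keep only records with documented consent.
    return [r for r in records if r.consented]

def validate(records, approved_ids):
    # Stage: native speakers approve or reject each transcript.
    for r in records:
        r.validated = r.audio_id in approved_ids
    return [r for r in records if r.validated]

raw = [
    Record("a1", "kushe, aw di bodi?", "krio", consented=True),
    Record("a2", "...", "temne", consented=False),   # no consent: dropped
    Record("a3", "tenki ya", "krio", consented=True),
]
training_set = validate(collect(raw), approved_ids={"a1"})
print([r.audio_id for r in training_set])  # -> ['a1']
```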


Model Stack

Capability  | Model              | Approach
------------|--------------------|-----------------------------------------------
Translation | XLM-R, mBERT       | Fine-tuning on local parallel corpora
TTS         | CSM-1B (Sesame AI) | Fine-tuning on Sierra Leonean voice recordings
STT         | OpenAI Whisper     | Fine-tuned for Krio, Temne, Mende
[Figure: translation models and speech models served through a unified language API]

Architecture

Our translation pipeline uses an encoder-decoder transformer architecture. Source text is tokenized and encoded using XLM-R, a multilingual model that has shown strong performance on low-resource language transfer learning. The decoder generates target language text conditioned on the encoded source representation.

For TTS, we use CSM-1B (Conversational Speech Model) by Sesame AI. CSM frames speech generation as a multimodal task using two autoregressive transformers: a backbone that processes interleaved text and audio to model the zeroth codebook, and a decoder that models the remaining codebooks to reconstruct speech. We fine-tune CSM-1B on our Sierra Leonean voice recordings to adapt its prosodic and phonemic representations to local languages.

For STT, we fine-tune OpenAI's Whisper on our collected audio. Whisper's encoder-decoder architecture and its pre-training on large-scale multilingual audio make it a strong base for adaptation to low-resource languages. We train with a small learning rate and augment the data to account for recording environment variation.
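The augmentation step can be sketched as random gain scaling plus low-level additive noise over a raw waveform. In practice a library such as torchaudio would do this; the pure-Python version below only illustrates the idea, and the gain range and noise level are assumed values, not our actual training settings.

```python
import random

# Illustrative augmentation for recording-environment variation: random
# gain scaling plus low-level additive noise on a waveform (list of float
# samples). Parameters are assumptions for this sketch.

def augment(samples, gain_range=(0.8, 1.2), noise_level=0.005, seed=None):
    rng = random.Random(seed)
    gain = rng.uniform(*gain_range)
    return [gain * s + rng.uniform(-noise_level, noise_level) for s in samples]

clean = [0.0, 0.1, -0.1, 0.2]
noisy = augment(clean, seed=42)
assert len(noisy) == len(clean)  # augmentation preserves duration
```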

What we built

Text-to-speech (TTS)

Using CSM-1B fine-tuned on collected voice data, we developed initial TTS prototypes for Krio and Mende. Audio samples are available in the playground. Performance is below production quality: voice consistency is variable and phoneme accuracy drops for Sierra Leonean-specific sounds.

Language model (LLM)

Using GPT-OSS 20B as a base, we built an initial model with basic Sierra Leonean language understanding. It can generate and complete text in Krio and Mende. Accuracy is limited: it struggles with Krio's flexible verb-noun ordering and performs poorly on code-switching.

Speech-to-text (STT)

We set up a Whisper-based STT pipeline fine-tuned for Sierra Leonean languages. The pipeline is functional. Informal listening tests suggest reasonable transcription quality for slow, clear Krio speech, but we have not yet run formal evaluation. Word Error Rate metrics are pending.
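WER itself is simple to compute once reference transcripts exist: word-level edit distance divided by reference word count. A minimal implementation (a library such as jiwer would normally be used, and would also handle text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] = edit distance between the current ref prefix and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i              # prev holds the old d[j-1]
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,           # deletion of a reference word
                      d[j - 1] + 1,       # insertion of a hypothesis word
                      prev + (r != h))    # substitution or match
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)

print(wer("kushe aw di bodi", "kushe aw bodi"))  # one deletion -> 0.25
```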

Related work

Several research groups have worked on African language NLP in ways that inform our approach. Masakhane has produced translation benchmarks for over 100 African languages. The AI4D African Language Dataset project contributed speech datasets. Mozilla Common Voice has collected voice data in Krio.

What distinguishes Kuma is direct integration with live government platforms that have real users, a data governance model centered on community consent and local ownership, and a commitment to open-sourcing all outputs as Digital Public Goods.

Limitations

Our current limitations are primarily dataset size, compute access for large-scale training, and the need for broader linguistic validation across more diverse regional accents and dialects.

Six-month roadmap

Month 1–2: Data Completion

Complete data collection across all 8 target languages. Native speaker validation complete. Establish baseline evaluation metrics.

Month 3–4: Full Model Training

Full model training on GPU infrastructure. Target benchmarks: BLEU 35+ for translation, CMOS within 1 point of human for TTS, WER 20% or lower for STT.
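The BLEU target can be made concrete: BLEU combines clipped n-gram precision with a brevity penalty. The toy single-reference implementation below shows the metric's structure; actual benchmark evaluation would use a standard tool such as sacreBLEU, which also handles tokenization and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: clipped n-gram precision + brevity penalty.
    Unsmoothed; real evaluation would use a tool such as sacreBLEU."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        if not hyp_ngrams:
            return 0.0
        ref_ngrams = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / sum(hyp_ngrams.values())) / max_n
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)

print(bleu("di pikin de go na skul", "di pikin de go na skul"))  # -> 1.0
```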

Month 5–6: Pilot Integration

Pilot integration into PreSTrack across at least one district. Open API endpoints made available for SenseBod, Kabo, and WanGov.

Long-term vision

Kuma is designed to be infrastructure, not a product. The goal is not a standalone app. It is a language layer that sits beneath all of Sierra Leone's digital public services. When a healthcare worker in Tonkolili can document a patient visit in Temne, when a student in Kenema can ask a question in Krio, and when a citizen in Bo can access a government service in Mende, that is what success looks like.

Open Source

Kuma is committed to open-sourcing all models and datasets to the Digital Public Goods Alliance.


Ibrahim Bayoh

Chief Technical Director

Sadiu Sahid Kamara

Senior Data Scientist

Mohamed Fofanah

AI Engineer

Sheku Munu

Lead Software Engineer

Iyeba Alpha Kallon

Software Engineer