
Text normalization (TN/ITN) and ASR evaluation framework for Bambara (Bamanankan) language processing


Bambara Text Normalizer

Text Normalization & ASR Evaluation Framework for Bambara (Bamanankan)



Purpose

This tool serves two complementary purposes for Bambara language processing:

Purpose Description
Text Normalization Standardize Bambara text for any downstream NLP task (TTS, MT, NER, etc.)
ASR Evaluation Fair WER/CER computation that accounts for valid orthographic variation

[!NOTE] Bambara orthography allows variation: the same utterance can be written as k'a ta or ka a ta, and both are correct. Without normalization, evaluation metrics unfairly penalize models for human writing inconsistencies rather than actual recognition errors.
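
To make the note above concrete, here is a minimal, self-contained sketch of how a plain word-level WER treats the two spellings. It does not use this package: `naive_wer` and `toy_expand` are hypothetical helpers written for illustration, and `toy_expand` knows a single rule only.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word lists (each edit costs 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def naive_wer(reference, hypothesis):
    """Word error rate with no normalization at all."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def toy_expand(text):
    """Hypothetical one-rule normalizer, for illustration only."""
    return text.replace("k'a", "ka a")


raw = naive_wer("k'a ta", "ka a ta")
normalized = naive_wer(toy_expand("k'a ta"), toy_expand("ka a ta"))
print(raw, normalized)  # prints: 1.0 0.0
```

Without normalization the two valid spellings score a WER of 100%; once both sides are written the same way, the score drops to 0%.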


Installation

pip install git+https://github.com/sudoping01/bambara-text-normalization.git

Text Normalization

from bambara_normalizer import normalize

# Default: expand contractions
normalize("Ne k’a ma ko ayi")
normalize("K’ale t’a fɛ k’a kɛ")
normalize("K’i k’i janto i yɛrɛ la")


# Contract mode: collapse expanded forms
normalize("Ne ko a ma ko ayi", mode="contract")    
normalize("Ko ale tɛ a fɛ ka o kɛ", mode="contract")    
normalize("Ko i ka i janto i yɛrɛ la", mode="contract")    

# Preserve mode: don't touch contractions
normalize("K’i k’i janto i yɛrɛ la", mode="preserve")     

Custom Settings

from bambara_normalizer import normalize

# Full control over normalization
text = normalize(
    "Ka na son k’o k’a la",
    mode="expand",                      # "expand" | "contract" | "preserve"
    preserve_tones=False,               
    normalize_legacy_orthography=True, 
    lowercase=True,                     
    remove_punctuation=False,           
    normalize_whitespace=True,         
    normalize_apostrophes=True,         
    normalize_special_chars=True,    
    expand_dates=False,
    expand_measurements=False, 
    expand_numbers=False,  
    expand_times=False,            
    remove_diacritics_except_tones=False,  
    handle_french_loanwords=True,   
    strip_repetitions=False,       
    normalize_compounds=True, 
)
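
As an illustration of what an option such as `preserve_tones=False` implies, the sketch below strips tone diacritics with the standard library only. It is not the package's implementation. It assumes tones are written with combining grave, acute, caron and circumflex marks, and `strip_tones` is a hypothetical name.

```python
import unicodedata

# Assumed tone marks: grave, acute, caron, circumflex (combining forms).
TONE_MARKS = {"\u0300", "\u0301", "\u030c", "\u0302"}


def strip_tones(text: str) -> str:
    """Remove tone diacritics while keeping base letters such as ɛ, ɔ, ɲ."""
    # NFD splits precomposed letters (e.g. "à") into base + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recomposes whatever is left (e.g. n + tilde back into "ñ").
    return unicodedata.normalize("NFC", kept)


print(strip_tones("f\u0254\u0301 b\u025b\u0300"))
```

Only the listed marks are removed, so letters that carry other diacritics are left alone.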

Using BambaraNormalizer Class

For repeated normalization with consistent settings:

from bambara_normalizer import BambaraNormalizer, BambaraNormalizerConfig


config = BambaraNormalizerConfig(contraction_mode="expand")  # or "contract" / "preserve"
normalizer = BambaraNormalizer(config)
normalizer("A y'a fɔ")      
normalizer("k'a la")   

# Contract mode
config = BambaraNormalizerConfig(contraction_mode="contract")
normalizer = BambaraNormalizer(config)
normalizer("bɛ a fɔ")     
normalizer("ka a ta")     

Predefined Configuration Presets

from bambara_normalizer import BambaraNormalizer, BambaraNormalizerConfig

# For WER evaluation (aggressive normalization, removes tones)
normalizer = BambaraNormalizer(BambaraNormalizerConfig.for_wer_evaluation())

# For WER with contract mode
normalizer = BambaraNormalizer(BambaraNormalizerConfig.for_wer_evaluation(mode="contract"))

# For CER evaluation
normalizer = BambaraNormalizer(BambaraNormalizerConfig.for_cer_evaluation())

# Preserve tone marks
normalizer = BambaraNormalizer(BambaraNormalizerConfig.preserving_tones())

# Minimal normalization (only essential fixes)
normalizer = BambaraNormalizer(BambaraNormalizerConfig.minimal())

Number Normalization

The normalizer supports bidirectional number conversion between digits and Bambara words (TN/ITN).

With Normalizer

from bambara_normalizer import normalize

normalize("A ye 100 sɔrɔ", expand_numbers=True)   # => "a ye kɛmɛ sɔrɔ"
normalize("A ye 100 sɔrɔ", expand_numbers=False)  # => "a ye 100 sɔrɔ"

# WER preset has expand_numbers=True by default
normalize("A ye 5 ta", preset="wer")  # => "a ye duuru ta"

Digits to Words (Text Normalization)

from bambara_normalizer import number_to_bambara, normalize_numbers_in_text


number_to_bambara(5)        # => "duuru"
number_to_bambara(123)      # => "kɛmɛ ni mugan ni saba"
number_to_bambara(1000)     # => "waa kelen"
number_to_bambara(5.3)      # => "duuru tomi saba"

# In text
normalize_numbers_in_text("A ye 5 wari di")      # => "A ye duuru wari di"
normalize_numbers_in_text("Mɔgɔ 100 nana")       # => "Mɔgɔ kɛmɛ nana"
normalize_numbers_in_text("N ye shekɛ 1000 sɔrɔ") # => "N ye shekɛ waa kelen sɔrɔ"

Words to Digits (Inverse Text Normalization)

from bambara_normalizer import bambara_to_number, denormalize_numbers_in_text

bambara_to_number("duuru")                    # => 5
bambara_to_number("kɛmɛ ni mugan ni saba")    # => 123
bambara_to_number("duuru tomi saba")          # => 5.3

# In text
denormalize_numbers_in_text("A ye duuru di a ma")  # => "A ye 5 di a ma"
denormalize_numbers_in_text("Mɔgɔ kɛmɛ nana")      # => "Mɔgɔ 100 nana"

Number Vocabulary

Value Bambara Value Bambara
0 fu 10 tan
1 kelen 20 mugan
2 fila 30 bi saba
3 saba 40 bi naani
4 naani 50 bi duuru
5 duuru 100 kɛmɛ
6 wɔɔrɔ 1000 waa
7 wolonwula 1,000,000 miliyɔn
8 seegin decimal tomi
9 kɔnɔntɔn connector ni
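
The vocabulary above composes with the connector ni. The following sketch covers 0 to 999 only and is not the package's `number_to_bambara`. The name `sketch_number_to_bambara` is hypothetical, and the rule for hundreds above 100 (kɛmɛ followed by the multiplier) is an assumption inferred from the documented example 500 => "kɛmɛ duuru".

```python
UNITS = ["fu", "kelen", "fila", "saba", "naani", "duuru",
         "wɔɔrɔ", "wolonwula", "seegin", "kɔnɔntɔn"]


def sketch_number_to_bambara(n: int) -> str:
    """Spell an integer from 0 to 999 using the vocabulary table."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0 to 999")
    if n < 10:
        return UNITS[n]
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds == 1:
        parts.append("kɛmɛ")
    elif hundreds > 1:
        # Assumption: multiplier follows kɛmɛ, as in 500 => "kɛmɛ duuru".
        parts.append(f"kɛmɛ {UNITS[hundreds]}")
    tens, ones = divmod(rest, 10)
    if tens == 1:
        parts.append("tan")
    elif tens == 2:
        parts.append("mugan")
    elif tens >= 3:
        parts.append(f"bi {UNITS[tens]}")
    if ones:
        parts.append(UNITS[ones])
    return " ni ".join(parts)


for value in (5, 23, 50, 123):
    print(value, sketch_number_to_bambara(value))
```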


Date Normalization

The normalizer supports bidirectional date conversion between standard formats and Bambara expressions (TN/ITN).

With Normalizer

from bambara_normalizer import normalize

normalize("A bɛ na 13-10-2024 la", expand_dates=True)   # => "a bɛ na oktɔburu tile tan ni saba san baa fila ni mugan ni naani la"
normalize("A bɛ na 13-10-2024 la", expand_dates=False)  # => "a bɛ na 13-10-2024 la"

# WER preset has expand_dates=True by default
normalize("A bɛ na 25-01-2008 la", preset="wer")  # => "a bɛ na zanwuye tile mugan ni duuru san baa fila ni seegin la"

Date to Bambara (Text Normalization)

from bambara_normalizer import date_to_bambara, format_date_bambara, normalize_dates_in_text
from datetime import date

# Single dates
date_to_bambara(2024, 10, 13)      # => "Oktɔburu tile tan ni saba san baa fila ni mugan ni naani"
date_to_bambara(2008, 1, 25)       # => "Zanwuye tile mugan ni duuru san baa fila ni seegin"

# With "kalo" (month) included
date_to_bambara(2024, 10, 13, include_kalo=True)  # => "Oktɔburu kalo tile tan ni saba san ..."

# With day of week
date_to_bambara(2024, 10, 13, include_day_of_week=True)  # => "Kari Oktɔburu tile ..." (Sunday)

# From date object or string
format_date_bambara(date(2024, 10, 13))  # => "Oktɔburu tile tan ni saba san ..."
format_date_bambara("13-10-2024")        # => "Oktɔburu tile tan ni saba san ..."

# In text
normalize_dates_in_text("A bɛ na 13-10-2024 la")  # => "A bɛ na Oktɔburu tile tan ni saba san baa fila ni mugan ni naani la"

Bambara to Date (Inverse Text Normalization)

from bambara_normalizer import bambara_to_date

bambara_to_date("Oktɔburu tile tan ni saba san baa fila ni mugan ni naani")
# => datetime.date(2024, 10, 13)

bambara_to_date("Zanwuye tile mugan ni duuru san baa fila ni seegin")
# => datetime.date(2008, 1, 25)

Date Format

Bambara dates follow this structure:

[Month] (kalo) tile [day] san [year]

Example: 13-10-2024 => Oktɔburu tile tan ni saba san baa fila ni mugan ni naani

Literal translation: "October day thirteen year two thousand twenty-four"
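
The date structure can be sketched as a small template function. This is an illustration under stated assumptions, not the package's `date_to_bambara`: `sketch_date_to_bambara` is a hypothetical name, the input is assumed to be DD-MM-YYYY, and number spelling is delegated to a caller-supplied `verbalize` function so the example stays self-contained.

```python
import re

# Month names copied from this README; month 5 is left out because
# this sketch only includes names that are listed in the document.
MONTHS = {
    1: "Zanwuye", 2: "Feburuye", 3: "Marsi", 4: "Awirili",
    6: "Zuwen", 7: "Zuluye", 8: "Uti", 9: "Sɛtanburu",
    10: "Oktɔburu", 11: "Nɔwanburu", 12: "Desanburu",
}


def sketch_date_to_bambara(text, verbalize, include_kalo=False):
    """Fill the '[Month] (kalo) tile [day] san [year]' template."""
    match = re.fullmatch(r"(\d{1,2})-(\d{1,2})-(\d{4})", text)
    if match is None:
        raise ValueError(f"expected DD-MM-YYYY, got {text!r}")
    day, month, year = (int(group) for group in match.groups())
    if month not in MONTHS:
        raise ValueError(f"no month name available for month {month}")
    head = f"{MONTHS[month]} kalo" if include_kalo else MONTHS[month]
    return f"{head} tile {verbalize(day)} san {verbalize(year)}"


# Toy verbalizer: a lookup of the two numbers used in the example.
words = {13: "tan ni saba", 2024: "baa fila ni mugan ni naani"}
print(sketch_date_to_bambara("13-10-2024", words.__getitem__))
```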

Days of the Week

Day Bambara Day Bambara
Monday Tɛnɛn Friday Juma
Tuesday Tarata Saturday Sibiri
Wednesday Araba Sunday Kari
Thursday Alamisa

Months of the Year

Month Bambara Month Bambara
January Zanwuye July Zuluye
February Feburuye August Uti
March Marsi September Sɛtanburu
April Awirili October Oktɔburu
May November Nɔwanburu
June Zuwen December Desanburu

Time Normalization

The normalizer supports bidirectional time and duration conversion between standard formats and Bambara expressions (TN/ITN).

With Normalizer

from bambara_normalizer import normalize

normalize("A nana 7:30 la", expand_times=True)   # => "a nana nɛgɛ kaɲɛ wolonwula ni sanga bi saba la"
normalize("A nana 7:30 la", expand_times=False)  # => "a nana 7:30 la"

# WER preset has expand_times=True by default
normalize("A nana 13:50 la", preset="wer")  # => "a nana nɛgɛ kaɲɛ tan ni saba ni sanga bi duuru la"

Clock Time to Bambara (Text Normalization)

from bambara_normalizer import time_to_bambara, format_time_bambara, normalize_times_in_text
from datetime import time

# Clock times
time_to_bambara(1, 0)       # => "Nɛgɛ kaɲɛ kelen"
time_to_bambara(1, 5)       # => "Nɛgɛ kaɲɛ kelen ni sanga duuru"
time_to_bambara(7, 30)      # => "Nɛgɛ kaɲɛ wolonwula ni sanga bi saba"
time_to_bambara(13, 50)     # => "Nɛgɛ kaɲɛ tan ni saba ni sanga bi duuru"

# From time object or string
format_time_bambara(time(7, 30))  # => "Nɛgɛ kaɲɛ wolonwula ni sanga bi saba"
format_time_bambara("13:50")      # => "Nɛgɛ kaɲɛ tan ni saba ni sanga bi duuru"

# In text
normalize_times_in_text("A nana 7:30 la")  # => "A nana nɛgɛ kaɲɛ wolonwula ni sanga bi saba la"

Bambara to Clock Time (Inverse Text Normalization)

from bambara_normalizer import bambara_to_time

bambara_to_time("Nɛgɛ kaɲɛ kelen")
# => datetime.time(1, 0)

bambara_to_time("Nɛgɛ kaɲɛ wolonwula ni sanga bi saba")
# => datetime.time(7, 30)

Duration to Bambara

from bambara_normalizer import duration_to_bambara, format_duration_bambara

# Durations
duration_to_bambara(minutes=30)                      # => "miniti bi saba"
duration_to_bambara(hours=1, minutes=30)             # => "lɛrɛ kelen ni miniti bi saba"
duration_to_bambara(hours=1, minutes=30, seconds=10) # => "lɛrɛ kelen ni miniti bi saba ni segɔni tan"

# From string format
format_duration_bambara("30min")      # => "miniti bi saba"
format_duration_bambara("1h30min")    # => "lɛrɛ kelen ni miniti bi saba"
format_duration_bambara("1h30min10s") # => "lɛrɛ kelen ni miniti bi saba ni segɔni tan"

Bambara to Duration (Inverse Text Normalization)

from bambara_normalizer import bambara_to_duration

bambara_to_duration("miniti bi saba")
# => (0, 30, 0)  # (hours, minutes, seconds)

bambara_to_duration("lɛrɛ kelen ni miniti bi saba")
# => (1, 30, 0)

bambara_to_duration("lɛrɛ kelen ni miniti bi saba ni segɔni tan")
# => (1, 30, 10)

Time Format

Clock time follows this structure:

Nɛgɛ kaɲɛ [hour] ( ni sanga [minutes])

Example: 7:30 => Nɛgɛ kaɲɛ wolonwula ni sanga bi saba

Literal translation: "Clock needle seven passed with minute thirty"

Duration follows this structure:

(lɛrɛ [hours] ni) (miniti [minutes] ni) (segɔni [seconds])

Example: 1h30min10s => lɛrɛ kelen ni miniti bi saba ni segɔni tan
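
Both structures reduce to simple templates. The sketch below is illustrative and not the package's code: the function names are hypothetical, the duration string format is assumed to be the `1h30min10s` style shown earlier, and numbers are spelled by a caller-supplied `verbalize` function.

```python
import re

_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)min)?(?:(\d+)s)?")


def sketch_clock_to_bambara(hour, minute, verbalize):
    """Nɛgɛ kaɲɛ [hour] (ni sanga [minutes]); minutes omitted when zero."""
    text = f"Nɛgɛ kaɲɛ {verbalize(hour)}"
    if minute:
        text += f" ni sanga {verbalize(minute)}"
    return text


def parse_duration(text):
    """Parse strings such as '1h30min10s' into (hours, minutes, seconds)."""
    match = _DURATION.fullmatch(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    return tuple(int(group) if group else 0 for group in match.groups())


def sketch_duration_to_bambara(hours, minutes, seconds, verbalize):
    """Join the non-zero components with the connector ni."""
    components = (("lɛrɛ", hours), ("miniti", minutes), ("segɔni", seconds))
    parts = [f"{unit} {verbalize(value)}" for unit, value in components if value]
    return " ni ".join(parts)


words = {1: "kelen", 7: "wolonwula", 10: "tan", 30: "bi saba"}
print(sketch_clock_to_bambara(7, 30, words.__getitem__))
print(sketch_duration_to_bambara(*parse_duration("1h30min10s"), words.__getitem__))
```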


Measurement Normalization

The normalizer supports bidirectional measurement conversion between standard units and Bambara expressions (TN/ITN).

With Normalizer

from bambara_normalizer import normalize

normalize("A ye 5 kg san", expand_measurements=True)   # => "a ye kilogaramu duuru san"
normalize("A ye 5 kg san", expand_measurements=False)  # => "a ye 5 kg san"

# WER preset has expand_measurements=True by default
normalize("So in bɛ 100 m", preset="wer")  # => "so in bɛ mɛtɛrɛ kɛmɛ"

Measurement to Bambara (Text Normalization)

from bambara_normalizer import measurement_to_bambara, format_measurement_bambara, normalize_measurements_in_text

# Weight
measurement_to_bambara(5, "kg")      # => "kilogaramu duuru"
measurement_to_bambara(100, "g")     # => "garamu kɛmɛ"

# Length
measurement_to_bambara(10, "km")     # => "kilomɛtɛrɛ tan"
measurement_to_bambara(100, "m")     # => "mɛtɛrɛ kɛmɛ"
measurement_to_bambara(50, "cm")     # => "santimɛtɛrɛ bi duuru"

# Volume
measurement_to_bambara(2, "L")       # => "litiri fila"
measurement_to_bambara(500, "mL")    # => "mililitiri kɛmɛ duuru"

# Area
measurement_to_bambara(3, "ha")      # => "ɛkitari saba"
measurement_to_bambara(100, "m²")    # => "mɛtɛrɛ kare kɛmɛ"

# Decimal values
measurement_to_bambara(2.5, "L")     # => "litiri fila tomi duuru"

# From string format
format_measurement_bambara("5kg")    # => "kilogaramu duuru"
format_measurement_bambara("100 m")  # => "mɛtɛrɛ kɛmɛ"

# In text
normalize_measurements_in_text("A ye 5 kg san")  # => "A ye kilogaramu duuru san"
normalize_measurements_in_text("So in bɛ 100 m") # => "So in bɛ mɛtɛrɛ kɛmɛ"

Bambara to Measurement (Inverse Text Normalization)

from bambara_normalizer import bambara_to_measurement, denormalize_measurements_in_text

bambara_to_measurement("kilogaramu duuru")
# => (5, 'kg')

bambara_to_measurement("mɛtɛrɛ kɛmɛ")
# => (100, 'm')

bambara_to_measurement("litiri fila tomi duuru")
# => (2.5, 'L')

# In text
denormalize_measurements_in_text("A ye kilogaramu duuru san")
# => "A ye 5 kg san"

Measurement Format

Measurements follow this structure:

[unit] [number]

Example: 5 kg => kilogaramu duuru

Literal translation: "kilogram five"

Supported Units

Weight

Unit Abbreviation Bambara
Kilogram kg kilogaramu
Gram g garamu
Milligram mg miligaramu
Ton t tɔni

Length

Unit Abbreviation Bambara
Kilometer km kilomɛtɛrɛ
Meter m mɛtɛrɛ
Centimeter cm santimɛtɛrɛ
Millimeter mm milimɛtɛrɛ

Volume

Unit Abbreviation Bambara
Liter L litiri
Milliliter mL mililitiri

Area

Unit Abbreviation Bambara
Hectare ha ɛkitari
Square meter m² mɛtɛrɛ kare
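
The unit tables and the [unit] [number] order can be sketched as a lookup plus a template. This is an illustration, not the package's `format_measurement_bambara`: `sketch_measurement_to_bambara` is a hypothetical name, only integer values are handled, and number spelling comes from a caller-supplied `verbalize` function.

```python
import re

# Abbreviation => Bambara unit name, copied from the tables in this README.
UNIT_NAMES = {
    "kg": "kilogaramu", "g": "garamu", "mg": "miligaramu", "t": "tɔni",
    "km": "kilomɛtɛrɛ", "m": "mɛtɛrɛ", "cm": "santimɛtɛrɛ", "mm": "milimɛtɛrɛ",
    "L": "litiri", "mL": "mililitiri",
    "ha": "ɛkitari", "m²": "mɛtɛrɛ kare",
}

_MEASURE = re.compile(r"(\d+)\s*(\S+)")


def sketch_measurement_to_bambara(text, verbalize):
    """Turn '5kg' or '100 m' into '[unit] [number]' order."""
    match = _MEASURE.fullmatch(text.strip())
    if match is None or match.group(2) not in UNIT_NAMES:
        raise ValueError(f"unrecognized measurement: {text!r}")
    value, unit = int(match.group(1)), match.group(2)
    return f"{UNIT_NAMES[unit]} {verbalize(value)}"


words = {5: "duuru", 100: "kɛmɛ"}
print(sketch_measurement_to_bambara("5kg", words.__getitem__))
print(sketch_measurement_to_bambara("100 m²", words.__getitem__))
```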

ASR Evaluation Framework

Quick Evaluation

from bambara_normalizer import evaluate


result = evaluate(
    reference="B'a fɔ ka taa",
    hypothesis="bɛ a fɔ ka taa"

)

print(f"WER: {result.wer:.2%}")  
print(f"CER: {result.cer:.2%}")
print(f"MER: {result.mer:.2%}")

Evaluator with Mode Selection

[!IMPORTANT] The mode parameter determines how contractions are handled during evaluation. This significantly impacts WER scores when reference and hypothesis use different orthographic conventions.

from bambara_normalizer import evaluate


result = evaluate(
    reference="k'a ta",
    hypothesis="ka a ta",
    mode="expand",  # or "contract" / "preserve"
)
print(f"WER: {result.wer:.2%}")   

Flexible Configuration

For full control, use the evaluator class and define the normalization configuration yourself:

from bambara_normalizer import (
    BambaraNormalizer, 
    BambaraNormalizerConfig, 
    BambaraEvaluator
)

# Define a custom normalizer config, as in the earlier examples
config = BambaraNormalizerConfig(
    contraction_mode="contract",
    preserve_tones=False,
    lowercase=True,
    remove_punctuation=True,
    normalize_legacy_orthography=True,
)


evaluator = BambaraEvaluator(config=config)


result = evaluator.evaluate(
    reference="K'a fɔ́!",
    hypothesis="ka a fo"
)
print(f"WER: {result.wer:.2%}")

Batch Evaluation

from bambara_normalizer import BambaraEvaluator

evaluator = BambaraEvaluator(mode="contract")

references = ["k'a ta", "b'a fɔ", "n'a ma"]
hypotheses = ["ka a ta", "bɛ a fɔ", "na a ma"]

aggregate, individual = evaluator.evaluate_batch(references, hypotheses)

print(f"Overall WER: {aggregate.wer:.2%}")
for i, result in enumerate(individual):
    print(f"  [{i}] WER: {result.wer:.2%}")

Available Metrics

Metric Method Description
WER evaluator.wer(ref, hyp) Word Error Rate
CER evaluator.cer(ref, hyp) Character Error Rate
MER evaluator.mer(ref, hyp) Match Error Rate
WIL evaluator.wil(ref, hyp) Word Information Lost
WIP evaluator.wip(ref, hyp) Word Information Preserved
DER result.der Diacritic Error Rate (tone accuracy)

Contraction Modes

[!WARNING] Choosing the right mode is critical for fair ASR evaluation. Using the wrong mode can inflate or deflate WER scores artificially.

Version 2.0 introduces three contraction modes to handle bidirectional Bambara orthography:

Mode Direction When to Use
expand b'a => bɛ a Default. Full linguistic analysis with k'/n' disambiguation
contract bɛ a => b'a Simpler, more forgiving. No disambiguation required
preserve No change Debugging, or when you want raw comparison

Why Contract Mode Matters

Expansion is complex because the contraction k' can expand to three different words:

Contraction Possible Expansions Meaning
k'a ka a infinitive marker
k'a kɛ a verb "to do"
k'a ko a verb "to say"

The normalizer uses context to disambiguate, but some cases are genuinely ambiguous.

Contraction is simple because all variants collapse to the same form:

ka a  ─┐
kɛ a  ─┼─>  k'a
ko a  ─┘

[!TIP] For ASR evaluation, contract mode is more forgiving because it doesn't penalize the model for disambiguation differences when both forms are linguistically valid.

Contraction Mappings

Expanded Contracted Function
bɛ + vowel b' Affirmative imperfective
tɛ + vowel t' Negative imperfective
ye + vowel y' Perfective marker
ni + vowel n' Conjunction
na + vowel n' Verb "come"
ka + vowel k' Infinitive marker
kɛ + vowel k' Verb "to do"
ko + vowel k' Verb "to say"
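
The mapping above can be sketched as a single pass over the tokens. This is an illustration, not the package's contract mode: `sketch_contract` is a hypothetical name, it assumes lowercase input with straight apostrophes, and it contracts before any vowel-initial word, which is greedier than a real normalizer is likely to be.

```python
# Particle => contracted prefix, following the mapping table.
CONTRACTED = {
    "bɛ": "b'", "tɛ": "t'", "ye": "y'",
    "ni": "n'", "na": "n'",
    "ka": "k'", "kɛ": "k'", "ko": "k'",
}
VOWELS = set("aeɛioɔu")


def sketch_contract(text: str) -> str:
    """Merge 'particle + vowel-initial word' pairs into contracted forms."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if token in CONTRACTED and following and following[0] in VOWELS:
            out.append(CONTRACTED[token] + following)
            i += 2  # the following word was consumed by the contraction
        else:
            out.append(token)
            i += 1
    return " ".join(out)


for expanded in ("ka a ye", "kɛ a ye", "ko a ye"):
    print(expanded, "=>", sketch_contract(expanded))
```

All three expansions print the same contracted form, which is why contract mode avoids the disambiguation problem.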

Command Line Interface

Basic Usage

# default mode is expand
bambara-normalize "B'a fɔ́"
# Output: bɛ a fɔ

# Contract mode
bambara-normalize --mode contract "bɛ a fɔ"
# Output: b'a fɔ

# Preserve mode
bambara-normalize --mode preserve "B'a fɔ"
# Output: b'a fɔ

With Presets

# WER preset (aggressive normalization)
bambara-normalize --preset wer "K'a fɔ́!"
# Output: ka a fɔ

# WER preset with contract mode
bambara-normalize --preset wer --mode contract "Ka a fɔ"
# Output: k'a fɔ

# CER preset
bambara-normalize --preset cer "B'a fɔ"

File Evaluation

# Evaluate reference vs hypothesis files
bambara-normalize --evaluate reference.txt hypothesis.txt

# With contract mode
bambara-normalize --evaluate --mode contract ref.txt hyp.txt

# Output detailed metrics
bambara-normalize --evaluate --detailed ref.txt hyp.txt

Batch Processing

# Process file line by line
bambara-normalize --input corpus.txt --output normalized.txt

# With specific mode
bambara-normalize --input corpus.txt --output normalized.txt --mode contract

Linguistic Decisions

Why Normalize?

Bambara orthography allows variation. For the same spoken utterance:

  • Annotator A writes: k'a ta
  • Annotator B writes: ka a ta

Both are correct. Without normalization, we penalize models for human writing inconsistencies, not recognition errors.

n' Disambiguation

Pattern Expansion Meaning
n' + pronoun + ma na Verb "to come"
n' + other ni Conjunction (default)

Examples:

  • n'a ma => na a ma (come to him)
  • n'a ta => ni a ta (if he takes)
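
The two rules above can be sketched as a one-word lookahead. This is an illustration, not the package's implementation: `sketch_expand_n` is a hypothetical name, and the pronoun set is a small assumed list chosen for the examples rather than a complete inventory.

```python
# Assumption: a partial, illustrative pronoun list.
PRONOUNS = {"a", "i", "u", "an", "anw", "aw", "ne", "o"}


def sketch_expand_n(text: str) -> str:
    """Expand n' to 'na' before pronoun + ma, and to 'ni' otherwise."""
    tokens = text.split()
    out = []
    for index, token in enumerate(tokens):
        if token.startswith("n'") and len(token) > 2:
            attached = token[2:]
            following = tokens[index + 1] if index + 1 < len(tokens) else None
            is_come = attached in PRONOUNS and following == "ma"
            out.extend(["na" if is_come else "ni", attached])
        else:
            out.append(token)
    return " ".join(out)


print(sketch_expand_n("n'a ma"))
print(sketch_expand_n("n'a ta"))
```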

k' Disambiguation Rules

Applied in priority order (derived from Daba grammar):

Priority Pattern Result Example
1 k' + pronoun + ma + X + ye kɛ k'a ma hɛrɛ ye => kɛ a ma hɛrɛ ye
2 k' + pronoun + ma + speech marker ko k'anw ma ko => ko anw ma ko
3 k' + pronoun + postposition kɛ k'a la => kɛ a la
4 k' + pronoun + clause marker ko k'an ka ta => ko an ka ta
5 Default ka k'a ta => ka a ta

Postpositions: la, na, ye, , kɔnɔ, , kɔrɔ, kan, kun, ɲɛ, bolo

Clause markers: ka, kana, , , bɛna, tɛna, tun, mana
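
The priority order can be sketched as a chain of checks on the words that follow the contraction. This is an illustration, not the package's implementation: the function names are hypothetical, the pronoun set is an assumed partial list, X in rule 1 is assumed to be exactly one word, and the word sets contain only the items spelled out in the two lists above.

```python
PRONOUNS = {"a", "i", "u", "an", "anw", "aw", "ne", "o"}  # assumed, partial
POSTPOSITIONS = {"la", "na", "ye", "kɔnɔ", "kɔrɔ", "kan", "kun", "ɲɛ", "bolo"}
CLAUSE_MARKERS = {"ka", "kana", "bɛna", "tɛna", "tun", "mana"}


def pick_particle(attached, rest):
    """Apply the k' rules in priority order and return kɛ, ko or ka."""
    if attached in PRONOUNS:
        if len(rest) >= 3 and rest[0] == "ma" and rest[2] == "ye":
            return "kɛ"  # rule 1: pronoun + ma + X + ye
        if len(rest) >= 2 and rest[0] == "ma" and rest[1] == "ko":
            return "ko"  # rule 2: pronoun + ma + speech marker
        if rest and rest[0] in POSTPOSITIONS:
            return "kɛ"  # rule 3: pronoun + postposition
        if rest and rest[0] in CLAUSE_MARKERS:
            return "ko"  # rule 4: pronoun + clause marker
    return "ka"          # rule 5: default


def sketch_expand_k(text: str) -> str:
    """Expand every k' contraction using pick_particle."""
    tokens = text.split()
    out = []
    for index, token in enumerate(tokens):
        if token.startswith("k'") and len(token) > 2:
            attached = token[2:]
            out.extend([pick_particle(attached, tokens[index + 1:]), attached])
        else:
            out.append(token)
    return " ".join(out)


for example in ("k'a ma hɛrɛ ye", "k'anw ma ko", "k'a la", "k'an ka ta", "k'a ta"):
    print(example, "=>", sketch_expand_k(example))
```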

Legacy Orthography Conversion

Legacy Modern Notes
è ɛ Pre-standard spelling
ò ɔ Pre-standard spelling
ny ɲ Digraph => single character
ng ŋ Digraph => single character
ñ ɲ Spanish/Senegalese variant
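
The conversion table can be sketched as ordered string replacements. This is a naive illustration, not the package's implementation: `sketch_modernize` is a hypothetical name, only lowercase forms are handled, and every occurrence of a legacy sequence is replaced, so it should only be applied to text known to use the legacy spelling (in tone-marked text a grave accent on e or o may mark tone instead).

```python
import unicodedata

# (legacy, modern) pairs from the table, written with escapes so the
# precomposed forms are unambiguous.
LEGACY_TO_MODERN = [
    ("\u00e8", "\u025b"),  # è => ɛ
    ("\u00f2", "\u0254"),  # ò => ɔ
    ("ny", "\u0272"),      # ny => ɲ
    ("ng", "\u014b"),      # ng => ŋ
    ("\u00f1", "\u0272"),  # ñ => ɲ
]


def sketch_modernize(text: str) -> str:
    """Replace legacy spellings after composing characters with NFC."""
    text = unicodedata.normalize("NFC", text)
    for legacy, modern in LEGACY_TO_MODERN:
        text = text.replace(legacy, modern)
    return text


print(sketch_modernize("b\u00e8 a f\u00f2"))
```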

Known Limitations

Inherent Linguistic Ambiguity

[!CAUTION] Some Bambara constructions are genuinely ambiguous and cannot be resolved without broader context. This is not a bug; it reflects real ambiguity in the language.

The ye Problem

The word ye has five grammatical functions:

Function Example Meaning
Postposition à fɔ́ ń yé say it to me
Perfective ù ye ɲɔ̀ gòsi they have beaten
Copula ò yé kɔ̀nɔ yé it is a bird
Verb "see" ka a ye to see it
Imperative á' yé nà! come! (plural)

This creates genuine ambiguity for k'a ye:

Interpretation Expansion Meaning
Postposition kɛ a ye do it for him
Verb "see" ka a ye to see it

Default behavior: The normalizer chooses kɛ a ye (postposition is more frequent).

Solution: Use mode="contract" for ASR evaluation to avoid disambiguation penalties:

evaluator = BambaraEvaluator(mode="contract")
# Both "kɛ a ye" and "ka a ye" => "k'a ye" 

Scope

The normalizer uses local context (1-3 word lookahead). It does not:

  • Parse full sentence structure
  • Use dictionary/lexicon for POS tagging
  • Consider discourse-level context

Utility Functions

from bambara_normalizer import (
    is_contraction,
    can_contract,
    find_contractions,
    find_contractable_sequences,
    compare_normalization_modes,
    analyze_text,
    is_bambara_vowel,
    get_tone,
    remove_tones,
    number_to_bambara,
    bambara_to_number,
    normalize_numbers_in_text,
    denormalize_numbers_in_text,
    is_number_word,
    
    bambara_to_date,
    bambara_to_day_of_week,
    bambara_to_month,
    date_to_bambara,
    day_of_week_to_bambara,
    denormalize_dates_in_text,
    format_date_bambara,
    is_bambara_day,
    is_bambara_month,
    month_to_bambara,
    normalize_dates_in_text,

    time_to_bambara,
    bambara_to_time,
    format_time_bambara,
    duration_to_bambara,
    bambara_to_duration,
    format_duration_bambara,
    normalize_times_in_text,
    is_time_word,

    measurement_to_bambara,
    bambara_to_measurement,
    format_measurement_bambara,
    normalize_measurements_in_text,
    denormalize_measurements_in_text,
    is_measurement_word,
    get_unit_category,
)


is_contraction("b'a")                   
is_contraction("bɛ")                     
can_contract("bɛ a")                      

# Find patterns in text
find_contractions("B'a fɔ k'a ta")       # ["b'", "k'"]
find_contractable_sequences("bɛ a fɔ")   # [('bɛ', 'a')]

# Compare modes side-by-side
compare_normalization_modes("b'a fɔ")
# {'original': "b'a fɔ", 'expand': 'bɛ a fɔ', 'contract': "b'a fɔ", 'preserve': "b'a fɔ"}

# Full text analysis
analyze_text("B'a fɔ k'a la")
# {'word_count': 4, 'contractions_found': ["b'", "k'"], 'has_tone_marks': False, ...}

# Tone handling
get_tone("fɔ́")                           # "high"
remove_tones("fɔ́ bɛ̀")                    # "fɔ bɛ"

# Number conversion: digits => Bambara words
number_to_bambara(5)                     # "duuru"
number_to_bambara(23)                    # "mugan ni saba"
number_to_bambara(100)                   # "kɛmɛ"
number_to_bambara(123)                   # "kɛmɛ ni mugan ni saba"
number_to_bambara(1000)                  # "waa kelen"
number_to_bambara(5.3)                   # "duuru tomi saba"

# Number conversion: Bambara words => digits
bambara_to_number("duuru")               # 5
bambara_to_number("mugan ni saba")       # 23
bambara_to_number("kɛmɛ")                # 100
bambara_to_number("waa kelen")           # 1000
bambara_to_number("duuru tomi saba")     # 5.3

# Number normalization in text
normalize_numbers_in_text("A ye 5 di")            # "A ye duuru di"
normalize_numbers_in_text("Mɔgɔ 100 nana")        # "Mɔgɔ kɛmɛ nana"
normalize_numbers_in_text("A be san 25 bɔ")       # "A be san mugan ni duuru bɔ"

# Inverse: Bambara words => digits in text
denormalize_numbers_in_text("A ye duuru di")       # "A ye 5 di"
denormalize_numbers_in_text("Mɔgɔ kɛmɛ nana")      # "Mɔgɔ 100 nana"

# Check if word is a number word
is_number_word("duuru")                  # True
is_number_word("kɛmɛ")                   # True
is_number_word("fɔ")                     # False


# Date conversion: dates => Bambara
date_to_bambara(2024, 10, 13)            # "Oktɔburu tile tan ni saba san baa fila ni mugan ni naani"
format_date_bambara("13-10-2024")        # Same as above

# Date conversion: Bambara => dates
bambara_to_date("Oktɔburu tile tan ni saba san baa fila ni mugan ni naani")  # datetime.date(2024, 10, 13)

# Day/Month helpers
day_of_week_to_bambara(0)                # "Tɛnɛn" (Monday)
day_of_week_to_bambara(6)                # "Kari" (Sunday)
month_to_bambara(10)                     # "Oktɔburu"
bambara_to_month("Oktɔburu")             # 10

# Date normalization in text
normalize_dates_in_text("A bɛ na 13-10-2024 la")  # "A bɛ na Oktɔburu tile ... la"

# Check if word is date-related
is_bambara_month("Oktɔburu")             # True
is_bambara_day("Juma")                   # True


# Time conversion: clock times → Bambara
time_to_bambara(1, 0)                    # "Nɛgɛ kaɲɛ kelen"
time_to_bambara(7, 30)                   # "Nɛgɛ kaɲɛ wolonwula ni sanga bi saba"
format_time_bambara("13:50")             # "Nɛgɛ kaɲɛ tan ni saba ni sanga bi duuru"

# Time conversion: Bambara → clock times
bambara_to_time("Nɛgɛ kaɲɛ wolonwula ni sanga bi saba")  # datetime.time(7, 30)

# Duration conversion: durations → Bambara
duration_to_bambara(minutes=30)          # "miniti bi saba"
duration_to_bambara(hours=1, minutes=30) # "lɛrɛ kelen ni miniti bi saba"
format_duration_bambara("1h30min10s")    # "lɛrɛ kelen ni miniti bi saba ni segɔni tan"

# Duration conversion: Bambara → durations
bambara_to_duration("lɛrɛ kelen ni miniti bi saba")  # (1, 30, 0)

# Time normalization in text
normalize_times_in_text("A nana 7:30 la")  # "A nana Nɛgɛ kaɲɛ wolonwula ... la"

# Check if word is time-related
is_time_word("lɛrɛ")                      # True
is_time_word("miniti")                    # True
is_time_word("segɔni")                    # True



# Measurement conversion: units => Bambara
measurement_to_bambara(5, "kg")          # "kilogaramu duuru"
measurement_to_bambara(100, "m")         # "mɛtɛrɛ kɛmɛ"
measurement_to_bambara(2.5, "L")         # "litiri fila tomi duuru"
format_measurement_bambara("5kg")        # "kilogaramu duuru"

# Measurement conversion: Bambara => units
bambara_to_measurement("kilogaramu duuru")   # (5, 'kg')
bambara_to_measurement("mɛtɛrɛ kɛmɛ")        # (100, 'm')

# Measurement normalization in text
normalize_measurements_in_text("A ye 5 kg san")      # "A ye kilogaramu duuru san"
denormalize_measurements_in_text("kilogaramu duuru") # "5 kg"

# Check if word is measurement-related
is_measurement_word("kilogaramu")        # True
is_measurement_word("mɛtɛrɛ")            # True
get_unit_category("kg")                  # "weight"
get_unit_category("m")                   # "length"

Evaluation Metrics

Metric Description Range
WER Word Error Rate 0.0 – ∞
CER Character Error Rate 0.0 – ∞
MER Match Error Rate 0.0 – 1.0
WIL Word Information Lost 0.0 – 1.0
WIP Word Information Preserved 0.0 – 1.0
DER Diacritic Error Rate (tone accuracy) 0.0 – ∞

References

Linguistic Resources

Standards

  • UNESCO Bamako Meeting (1966)
  • Niamey African Reference Alphabet (1978)

Tools

  • jiwer: ASR evaluation metrics

Related Work


MALIBA-AI
