Context-aware language detection with improved accuracy for multilingual documents
Project description
contextual-langdetect
A context-aware language detection library that improves accuracy by considering document-level language patterns.
Use Case
This library is designed for processing corpora where individual lines or sentences might be in different languages, but with a strong prior that there are only one or two primary languages. It uses document-level context to improve accuracy in cases where individual sentences might be ambiguously detected.
For example, in a primarily Chinese corpus:
- Some sentences might be detected in isolation as Japanese, but if they contain no kana characters, they are likely Chinese (a rough sketch of this kana heuristic follows this list)
- Some sentences might be detected as Wu Chinese (wuu), but in a Mandarin context they're likely Mandarin
- The library uses the dominant language(s) in the corpus to resolve these ambiguities
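To make the kana heuristic concrete, here is a minimal sketch of the kind of script-based check described above. This is not the library's internal code; contains_kana is a hypothetical helper written only for illustration.

# Hypothetical helper (not part of contextual-langdetect), for illustration only.
def contains_kana(text: str) -> bool:
    # Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) are contiguous blocks,
    # so a single range test covers both scripts.
    return any("\u3040" <= ch <= "\u30ff" for ch in text)

print(contains_kana("很好。"))  # False: Han characters only, so a Chinese reading is plausible
print(contains_kana("とても元気です。"))  # True: hiragana present, so Japanese is likely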
This is particularly useful for:
- Transcriptions of bilingual conversations
- Language instruction texts and transcriptions
- Mixed-language documents where the majority language should inform ambiguous cases
Features
- Accurate language detection with confidence scores
- Context-aware detection that uses surrounding text to disambiguate
- Special case handling for commonly confused languages (e.g., Wu Chinese, Japanese without kana)
- Support for mixed language documents
Installation
pip install contextual-langdetect
Usage
contextual_detect
from contextual_langdetect import contextual_detect
# Process a document with context-awareness
sentences = [
    "你好。",  # Detected as ZH
    "你好吗?",  # Detected as ZH
    "很好。",  # Detected as JA when model=small
    "我家也有四个,刚好。",  # Detected as ZH
    "那么现在天气很冷,你要开暖气吗?",  # Detected as WUU
    "Okay, fine I'll see you next week.",  # English
    "Great, I'll see you then.",  # English
]
# Context-unaware language detection
languages = contextual_detect(sentences, context_correction=False)
print(languages)
# Output: ['zh', 'zh', 'ja', 'zh', 'wuu', 'en', 'en']
# Context-aware language detection
languages = contextual_detect(sentences)
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']
# Context-aware detection with language biasing
# Specify expected languages to improve detection in ambiguous cases
languages = contextual_detect(sentences, languages=["zh", "en"])
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']
# Force a specific language for all sentences
languages = contextual_detect(sentences, languages=["en"])
print(languages)
# Output: ['en', 'en', 'en', 'en', 'en', 'en', 'en']
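Putting it together, per-line detection over a file might look like the following minimal sketch (transcript.txt is a hypothetical input file; contextual_detect, shown above, is the only library call used):

from contextual_langdetect import contextual_detect

# Read one sentence per line from a hypothetical transcript file.
with open("transcript.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Detect with document-level context, biased toward the expected languages.
languages = contextual_detect(sentences, languages=["zh", "en"])

for sentence, lang in zip(sentences, languages):
    print(f"{lang}\t{sentence}")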
count_by_language
def count_by_language(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> dict[Language, int]
Given a batch of sentences, returns a dict mapping language codes to the number of sentences assigned to each language, using the contextual detection algorithm.
Example:
from contextual_langdetect.detection import count_by_language

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
counts = count_by_language(sentences)
# Example output: {'en': 2, 'fr': 1, 'de': 1}
get_languages_by_count
def get_languages_by_count(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> list[tuple[Language, int]]
Given a batch of sentences, returns a list of (language, count) tuples sorted by decreasing count, using the contextual detection algorithm.
Example:
from contextual_langdetect.detection import get_languages_by_count

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
language_counts = get_languages_by_count(sentences)
# Example output: [('en', 2), ('fr', 1), ('de', 1)]
get_majority_language
def get_majority_language(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> Language | None
Given a batch of sentences, returns the language code with the highest count (the majority language), or None if there are no sentences.
Example:
from contextual_langdetect.detection import get_majority_language

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
majority_language = get_majority_language(sentences)
# Example output: 'en'
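These helpers are mutually consistent: given the documented behavior (counts sorted in decreasing order, majority equals highest count), the majority language should be the head of the ranked list. A minimal sketch using only the functions documented above:

from contextual_langdetect.detection import (
    count_by_language,
    get_languages_by_count,
    get_majority_language,
)

sentences = ["Hello world.", "Bonjour le monde.", "Hallo Welt.", "Hello again."]

counts = count_by_language(sentences)        # e.g. {'en': 2, 'fr': 1, 'de': 1}
ranked = get_languages_by_count(sentences)   # e.g. [('en', 2), ('fr', 1), ('de', 1)]
majority = get_majority_language(sentences)  # e.g. 'en'

# The majority language is the first entry of the ranked list (or None when empty).
assert majority == (ranked[0][0] if ranked else None)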
Dependencies
This library builds upon:
- LlmKira/fast-langdetect for base language detection
- zafercavdar/fasttext-langdetect, a transitive dependency that LlmKira/fast-langdetect builds on
- FastText by Facebook, which both of these projects wrap
Development
For development instructions, see DEVELOPMENT.md.
Documentation
- Context-Aware Detection - Learn how the context-aware language detection algorithm works
My Related Projects
- add2anki - Browser extension to add words and phrases to Anki language-learning decks. contextual-langdetect was extracted from this project.
- audio2anki - Extract audio from video files for creating Anki language flashcards. add2anki was developed to support this and other tools.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Oliver Steele (@osteele on GitHub)
File details
Details for the file contextual_langdetect-0.2.0.tar.gz.
File metadata
- Download URL: contextual_langdetect-0.2.0.tar.gz
- Upload date:
- Size: 49.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 28ae4842f283f79ff1a92fab83076fc1be7ae09d5ee90767d0712b5edcb61ab2
MD5 | f25b39cba1dec632a05441e95df2174e
BLAKE2b-256 | aec54acf67c7386298b11201c2d0e5276c44a3ff50baf65645b40b1061f4cd7f
File details
Details for the file contextual_langdetect-0.2.0-py3-none-any.whl.
File metadata
- Download URL: contextual_langdetect-0.2.0-py3-none-any.whl
- Upload date:
- Size: 8.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 320483cef7b3b986d3ef4314fcbc58cb7da14fad656541ad531bb96ccf93f383
MD5 | 48ba3959e9458b92527c02d02480799a
BLAKE2b-256 | e08ccd69f98294cb5d88b551c41646b984917611efc203b853a537545b41620a