Context-aware language detection with improved accuracy for multilingual documents
Project description
contextual-langdetect
A context-aware language detection library that improves accuracy by considering document-level language patterns.
Use Case
This library is designed for processing corpora where individual lines or sentences might be in different languages, but with a strong prior that there are only one or two primary languages. It uses document-level context to improve accuracy in cases where individual sentences might be ambiguously detected.
For example, in a primarily Chinese corpus:
- Some sentences might be detected in isolation as Japanese, but if they contain no kana characters, they are likely Chinese (a rough sketch of this kana heuristic follows this list)
- Some sentences might be detected as Wu Chinese (wuu), but in a Mandarin context they're likely Mandarin
- The library uses the dominant language(s) in the corpus to resolve these ambiguities
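To make the kana heuristic concrete, here is a minimal sketch of the kind of script-based check described above. This is not the library's internal code; contains_kana is a hypothetical helper written only for illustration.

# Hypothetical helper (not part of contextual-langdetect), for illustration only.
def contains_kana(text: str) -> bool:
    # Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) are contiguous blocks,
    # so a single range test covers both scripts.
    return any("\u3040" <= ch <= "\u30ff" for ch in text)

print(contains_kana("很好。"))  # False: Han characters only, so a Chinese reading is plausible
print(contains_kana("とても元気です。"))  # True: hiragana present, so Japanese is likely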
This is particularly useful for:
- Transcriptions of bilingual conversations
- Language instruction texts and transcriptions
- Mixed-language documents where the majority language should inform ambiguous cases
Features
- Accurate language detection with confidence scores
- Context-aware detection that uses surrounding text to disambiguate
- Special case handling for commonly confused languages (e.g., Wu Chinese, Japanese without kana)
- Support for mixed language documents
Installation
pip install contextual-langdetect
Usage
contextual_detect
from contextual_langdetect import contextual_detect
# Process a document with context-awareness
sentences = [
    "你好。",  # Detected as ZH
    "你好吗?",  # Detected as ZH
    "很好。",  # Detected as JA when model=small
    "我家也有四个,刚好。",  # Detected as ZH
    "那么现在天气很冷,你要开暖气吗?",  # Detected as WUU
    "Okay, fine I'll see you next week.",  # English
    "Great, I'll see you then.",  # English
]
# Context-unaware language detection
languages = contextual_detect(sentences, context_correction=False)
print(languages)
# Output: ['zh', 'zh', 'ja', 'zh', 'wuu', 'en', 'en']
# Context-aware language detection
languages = contextual_detect(sentences)
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']
# Context-aware detection with language biasing
# Specify expected languages to improve detection in ambiguous cases
languages = contextual_detect(sentences, languages=["zh", "en"])
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']
# Force a specific language for all sentences
languages = contextual_detect(sentences, languages=["en"])
print(languages)
# Output: ['en', 'en', 'en', 'en', 'en', 'en', 'en']
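Putting it together, per-line detection over a file might look like the following minimal sketch (transcript.txt is a hypothetical input file; contextual_detect, shown above, is the only library call used):

from contextual_langdetect import contextual_detect

# Read one sentence per line from a hypothetical transcript file.
with open("transcript.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Detect with document-level context, biased toward the expected languages.
languages = contextual_detect(sentences, languages=["zh", "en"])

for sentence, lang in zip(sentences, languages):
    print(f"{lang}\t{sentence}")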
count_by_language
def count_by_language(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> dict[Language, int]
Given a batch of sentences, returns a dict mapping language codes to the number of sentences assigned to each language, using the contextual detection algorithm.
Example:
from contextual_langdetect.detection import count_by_language

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
counts = count_by_language(sentences)
# Example output: {'en': 2, 'fr': 1, 'de': 1}
get_languages_by_count
def get_languages_by_count(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> list[tuple[Language, int]]
Given a batch of sentences, returns a list of (language, count) tuples sorted by decreasing count, using the contextual detection algorithm.
Example:
from contextual_langdetect.detection import get_languages_by_count

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
language_counts = get_languages_by_count(sentences)
# Example output: [('en', 2), ('fr', 1), ('de', 1)]
get_majority_language
def get_majority_language(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> Language | None
Given a batch of sentences, returns the language code with the highest count (the majority language), or None if there are no sentences.
Example:
from contextual_langdetect.detection import get_majority_language

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
majority_language = get_majority_language(sentences)
# Example output: 'en'
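These helpers are mutually consistent: given the documented behavior (counts sorted in decreasing order, majority equals highest count), the majority language should be the head of the ranked list. A minimal sketch using only the functions documented above:

from contextual_langdetect.detection import (
    count_by_language,
    get_languages_by_count,
    get_majority_language,
)

sentences = ["Hello world.", "Bonjour le monde.", "Hallo Welt.", "Hello again."]

counts = count_by_language(sentences)        # e.g. {'en': 2, 'fr': 1, 'de': 1}
ranked = get_languages_by_count(sentences)   # e.g. [('en', 2), ('fr', 1), ('de', 1)]
majority = get_majority_language(sentences)  # e.g. 'en'

# The majority language is the first entry of the ranked list (or None when empty).
assert majority == (ranked[0][0] if ranked else None)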
Dependencies
This library builds upon:
- LlmKira/fast-langdetect for base language detection
- zafercavdar/fasttext-langdetect, a transitive dependency that LlmKira/fast-langdetect builds on
- FastText by Facebook, which both of these projects wrap
Development
For development instructions, see DEVELOPMENT.md.
Documentation
- Context-Aware Detection - Learn how the context-aware language detection algorithm works
My Related Projects
- add2anki - Browser extension to add words and phrases to Anki language-learning decks. contextual-langdetect was extracted from this project.
- audio2anki - Extract audio from video files for creating Anki language flashcards. add2anki was developed to support this and other tools.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Oliver Steele (@osteele on GitHub)
File details
Details for the file contextual_langdetect-0.2.0.tar.gz.
File metadata
- Download URL: contextual_langdetect-0.2.0.tar.gz
- Upload date:
- Size: 49.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 28ae4842f283f79ff1a92fab83076fc1be7ae09d5ee90767d0712b5edcb61ab2
MD5 | f25b39cba1dec632a05441e95df2174e
BLAKE2b-256 | aec54acf67c7386298b11201c2d0e5276c44a3ff50baf65645b40b1061f4cd7f
File details
Details for the file contextual_langdetect-0.2.0-py3-none-any.whl.
File metadata
- Download URL: contextual_langdetect-0.2.0-py3-none-any.whl
- Upload date:
- Size: 8.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 320483cef7b3b986d3ef4314fcbc58cb7da14fad656541ad531bb96ccf93f383
MD5 | 48ba3959e9458b92527c02d02480799a
BLAKE2b-256 | e08ccd69f98294cb5d88b551c41646b984917611efc203b853a537545b41620a