multilang-probe

A Python package for analyzing multilingual text.

Overview

multilang-probe is a toolkit for classifying character sets, detecting languages in text files, and extracting passages written in specific languages or scripts. Character detection relies on Unicode script properties and covers a wide range of writing systems (e.g., Latin, Cyrillic, Arabic, Devanagari, and the Hiragana, Katakana, and Han scripts used for Japanese). Language detection uses Facebook's pre-trained FastText model.

Whether you are analyzing large corpora or extracting specific language data, multilang-probe simplifies the process with an easy-to-use API.

Features

Character Set Classification:

Detect character types (e.g., Latin, Japanese, Cyrillic, Arabic, Devanagari) in text and calculate the proportion of each.
Uses the regex module's Unicode script properties (\p{Script}) to classify characters.
Special handling of Han characters, which are attributed to Japanese or Chinese depending on the surrounding text.

Example: Character Detection

from charlang_detect.character_detection import classify_text_with_proportions

text = "これは日本語と English です。"
proportions = classify_text_with_proportions(text)
print(proportions)
# Possible output:
# {"japanese": 50.0, "latin": 50.0}

Explanation:

If the text contains Hiragana/Katakana, Han characters are considered Japanese Kanji.
Otherwise, Han characters are considered Chinese.
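
The Han rule above can be sketched with the standard library alone. This is not the package's implementation: it swaps the \p{Script} regex approach for a lookup on Unicode character names via unicodedata, and the function name sketch_proportions and the NAME_PREFIXES table are hypothetical. Its percentages need not match those returned by classify_text_with_proportions.

```python
import unicodedata
from collections import Counter

# Hypothetical table: Unicode character-name prefix -> label used in this sketch.
NAME_PREFIXES = {
    "HIRAGANA": "japanese",
    "KATAKANA": "japanese",
    "CJK UNIFIED IDEOGRAPH": "han",
    "LATIN": "latin",
    "CYRILLIC": "cyrillic",
}

def sketch_proportions(text):
    """Return {label: percentage} over the letters of `text`."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore spaces, digits, and punctuation
        name = unicodedata.name(ch, "")
        label = "other"
        for prefix, candidate in NAME_PREFIXES.items():
            if name.startswith(prefix):
                label = candidate
                break
        counts[label] += 1

    # Han rule: Kanji if any kana was seen, otherwise Chinese.
    if "han" in counts:
        target = "japanese" if counts["japanese"] > 0 else "chinese"
        counts[target] += counts.pop("han")

    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

print(sketch_proportions("これは日本語と English です。"))
print(sketch_proportions("中文 text"))
```

In the first call the three Han characters are counted as Japanese because the text also contains Hiragana; in the second call no kana is present, so they are counted as Chinese.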

Language Detection:

Identify top languages in text using Facebook's FastText pre-trained model.

Example: Language Detection

from charlang_detect.language_detection import detect_language_fasttext

text = "Ceci est un texte en français."
languages = detect_language_fasttext(text)
print(languages)
# Output example: "fr: 99.2%, en: 0.8%"
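
For readers curious how a string like the one above could be produced, here is a minimal sketch. It assumes the raw predictions arrive in the form returned by the fasttext Python module (labels prefixed with "__label__" plus probabilities between 0 and 1). The helper format_predictions is hypothetical and is not part of this package; the model file name in the comment is likewise an assumption.

```python
def format_predictions(labels, probabilities):
    """Turn fastText-style output into a string such as 'fr: 99.2%, en: 0.8%'."""
    parts = []
    for label, probability in zip(labels, probabilities):
        code = label.replace("__label__", "", 1)  # "__label__fr" -> "fr"
        parts.append(f"{code}: {probability * 100:.1f}%")
    return ", ".join(parts)

# With the real library (assumed model file, not executed here):
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   # predict() handles one line at a time, so newlines are replaced first
#   labels, probabilities = model.predict(text.replace("\n", " "), k=2)
labels, probabilities = ("__label__fr", "__label__en"), (0.992, 0.008)
print(format_predictions(labels, probabilities))
```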

Corpus Analysis:

Analyze all .txt files in a folder to detect multilingual passages and language distributions.
Character-based filtering: Identify and filter text lines containing specific character sets (e.g., Japanese, Cyrillic, Arabic).
Language-based filtering: Extract passages in a specific language, with customizable confidence thresholds (e.g., 70%).
Targeted extraction: Extract lines of text that meet both a minimum length and a minimum language detection confidence.
Calculate language proportions: Aggregate detected languages across files and calculate their proportions.

Example: Analyze and Detect Multilingual Passages

from charlang_detect.corpus_analysis import analyze_corpus_with_fasttext

folder_path = "path/to/corpus/"
results = analyze_corpus_with_fasttext(folder_path)
for filename, langs in results.items():
    print(filename, langs)
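
The aggregation of language proportions across files can be pictured as a weighted tally. The sketch below is an illustration only: the input shape (file name mapped to a list of (language code, confidence) pairs) and the function name aggregate_language_proportions are assumptions, and may differ from what analyze_corpus_with_fasttext actually returns.

```python
from collections import defaultdict

def aggregate_language_proportions(per_file_detections):
    """Sum confidences per language over all files, then normalise to percentages.

    `per_file_detections` is a hypothetical structure:
    {filename: [(language_code, confidence_between_0_and_1), ...]}
    """
    totals = defaultdict(float)
    for detections in per_file_detections.values():
        for language, confidence in detections:
            totals[language] += confidence

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: round(100 * score / grand_total, 1) for lang, score in totals.items()}

sample = {
    "a.txt": [("fr", 1.0), ("en", 0.5)],
    "b.txt": [("fr", 0.5)],
}
print(aggregate_language_proportions(sample))
```

Weighting by confidence lets uncertain lines contribute less than confident ones; counting one vote per line would be an equally reasonable choice.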

Example: Filter Passages by Character Types

from charlang_detect.corpus_analysis import filter_passages_by_character_types

folder_path = "path/to/corpus/"
character_types = ["japanese", "cyrillic"]
filtered = filter_passages_by_character_types(folder_path, character_types)
for filename, passages in filtered.items():
    print(filename, passages)
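
Character-based filtering over a folder of .txt files can be sketched as follows. This is an assumption-laden illustration rather than the package's code: it uses Unicode character names from the standard library instead of \p{Script} patterns, and the names SCRIPT_PREFIXES, line_has_script, and filter_lines_by_script are hypothetical.

```python
import unicodedata
from pathlib import Path

# Hypothetical labels mapped to Unicode character-name prefixes.
SCRIPT_PREFIXES = {
    "japanese": ("HIRAGANA", "KATAKANA"),
    "cyrillic": ("CYRILLIC",),
    "arabic": ("ARABIC",),
}

def line_has_script(line, character_types):
    """True if any character in `line` belongs to one of the requested types."""
    prefixes = tuple(p for t in character_types for p in SCRIPT_PREFIXES[t])
    return any(unicodedata.name(ch, "").startswith(prefixes) for ch in line)

def filter_lines_by_script(folder_path, character_types):
    """Map each .txt file name to the lines that contain the requested scripts."""
    matches = {}
    for path in sorted(Path(folder_path).glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        kept = [ln for ln in lines if line_has_script(ln, character_types)]
        if kept:
            matches[path.name] = kept
    return matches
```

Files without any matching line are left out of the result, which keeps the output small for large corpora.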

Example: Extract Passages by Language with Threshold

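from charlang_detect.corpus_analysis import (
    analyze_corpus_with_fasttext,
    filter_passages_by_language,
)

folder_path = "path/to/corpus/"
# Per-file detection results, computed as in the corpus analysis example
results = analyze_corpus_with_fasttext(folder_path)
target_languages = ["fr", "en"]
threshold = 70  # minimum confidence, in percent
filtered = filter_passages_by_language(results, target_languages, folder_path, threshold)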
for filename, passages in filtered.items():
    print(filename, passages)
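
The combination of a confidence threshold and a minimum length can be sketched independently of any model. In the block below, extract_passages, its min_length default of 20 characters, and toy_detect are all hypothetical; toy_detect is a stand-in so the example runs without FastText and is not a real language detector.

```python
def extract_passages(lines, detect, target_languages, threshold=70.0, min_length=20):
    """Keep lines that are long enough and confidently in a target language.

    `detect` is any callable returning (language_code, confidence_percent).
    """
    kept = []
    for line in lines:
        stripped = line.strip()
        if len(stripped) < min_length:
            continue  # too short to classify reliably
        language, confidence = detect(stripped)
        if language in target_languages and confidence >= threshold:
            kept.append(stripped)
    return kept

def toy_detect(text):
    # Toy stand-in for a detector: accented letters are treated as French.
    return ("fr", 95.0) if ("é" in text or "ç" in text) else ("en", 60.0)

lines = [
    "Ceci est un texte en français.",
    "Short é",
    "This is an English sentence.",
]
print(extract_passages(lines, toy_detect, ["fr", "en"]))
```

Passing the detector in as a parameter makes it easy to swap the toy function for a real model call once one is available.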

Supported Character Sets

Japanese (Hiragana, Katakana)
Han (Kanji; considered Japanese if Hiragana/Katakana present, else Chinese)
Korean (Hangul)
Cyrillic (for languages like Russian, Bulgarian, etc.)
Arabic
Hebrew
Greek
Latin (basic and extended)
Devanagari (e.g., Hindi, Sanskrit)
Tamil, Bengali, Thai, and many more (extendable via Unicode scripts)
"other" category for characters not belonging to known scripts

Dependencies

Python 3.7+
FastText
Regex (for Unicode script classification)

License

This project is licensed under the MIT License. While the MIT License allows unrestricted use, modification, and distribution of this software, I kindly request that proper credit be given when this project is used in academic, research, or published work. For citation purposes, please refer to the following:

CAFIERO Florian, 'multilang-probe', 2024, [https://github.com/floriancafiero/multilang-probe].

Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.

Author

Florian Cafiero
GitHub: floriancafiero
Email: florian.cafiero@chartes.psl.eu

Future Features

Support for other pre-trained language models (e.g., spaCy).
Visualization tools for multilingual analysis.
CLI (Command-Line Interface) for easy usage without writing code.

Release history

1.1.2 (Jan 15, 2026)
1.1.1 (Jan 15, 2026)
1.0.1 (Jan 15, 2026)
0.1.7 (Dec 18, 2024)
0.1.6 (Dec 13, 2024)
0.1.5 (Dec 12, 2024; this version)
0.1.4 (Dec 12, 2024)
