multilang-probe
A Python package for analyzing multilingual text.
Overview
multilang-probe is a toolkit for classifying character sets, detecting languages in text files, and extracting specific multilingual passages. It recognizes characters from a wide range of writing systems (e.g., Latin, Japanese, Cyrillic, Arabic, Devanagari) using Unicode script properties, and it uses Facebook's pre-trained FastText model for language detection.
Whether you are analyzing large corpora or extracting specific language data, multilang-probe simplifies the process with an easy-to-use API.
Features
Character Set Classification:
- Detect and calculate proportions of character types (e.g., Latin, Japanese, Cyrillic, Arabic, Devanagari) in text.
- Uses regex with Unicode script properties (\p{Script}) for more accurate classification.
- Special handling for Japanese vs Chinese characters (Han script).
Example: Character Detection
from multilang_probe.character_detection import classify_text_with_proportions
# Sample text with multiple languages/scripts
text = "これは日本語です。Привет мир! Ελληνικά και हिन्दी।"
# Classify the text
proportions = classify_text_with_proportions(text)
# Print the proportions
print("Character script proportions:")
print(proportions)
Expected outcome:
Character script proportions:
{'japanese': 19.51, 'cyrillic': 21.95, 'greek': 26.83, 'devanagari': 14.63, 'other': 17.07}
Explanation:
- If the text contains Hiragana/Katakana, Han characters are considered Japanese Kanji.
- Otherwise, Han characters are considered Chinese.
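The Han rule above can be illustrated with a small standalone sketch. It is not the package's implementation: it uses character names from the standard-library unicodedata module as a rough stand-in for \p{Script}, and the prefix table and function names (NAME_PREFIXES, script_of, script_proportions) are hypothetical. Because name prefixes and script properties differ for some punctuation, its numbers will not always match classify_text_with_proportions.

```python
import unicodedata
from collections import Counter

# Hypothetical table: Unicode character-name prefixes mapped to script labels.
# This approximates script properties; it is not what multilang-probe uses.
NAME_PREFIXES = {
    "HIRAGANA": "kana",
    "KATAKANA": "kana",
    "CJK UNIFIED IDEOGRAPH": "han",
    "CYRILLIC": "cyrillic",
    "GREEK": "greek",
    "DEVANAGARI": "devanagari",
    "LATIN": "latin",
}


def script_of(char):
    """Return a script label for one character, or "other" if none matches."""
    name = unicodedata.name(char, "")
    for prefix, label in NAME_PREFIXES.items():
        if name.startswith(prefix):
            return label
    return "other"


def script_proportions(text):
    """Return the percentage of characters per script, applying the Han rule."""
    counts = Counter(script_of(char) for char in text)
    kana = counts.pop("kana", 0)
    han = counts.pop("han", 0)
    if kana:
        # Kana present: Han characters are counted as Japanese Kanji.
        counts["japanese"] += kana + han
    elif han:
        # No kana: Han characters are counted as Chinese.
        counts["chinese"] += han
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items() if n}


print(script_proportions("日本語です"))  # Han plus Hiragana
print(script_proportions("中文"))        # Han only
```

Whitespace and punctuation without a matching prefix land in "other", so they lower the share of every named script.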
Language Detection:
- Identify top languages in text using Facebook's FastText pre-trained model.
- Install the package from PyPI:
pip install multilang-probe
- Download the model once (example command):
curl -L -o lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
- Place lid.176.bin in your working directory, set MULTILANG_PROBE_MODEL_PATH, or pass a model_path argument (the model file is not bundled with the PyPI package).
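The three ways of locating the model can be pictured with the sketch below. The precedence shown (explicit argument, then environment variable, then working directory) is an assumption made for illustration, and resolve_model_path is a hypothetical helper, not a function exported by the package.

```python
import os
from pathlib import Path

ENV_VAR = "MULTILANG_PROBE_MODEL_PATH"
DEFAULT_NAME = "lid.176.bin"


def resolve_model_path(model_path=None):
    """Pick a model location and check that the file exists.

    Assumed precedence for this sketch: explicit argument, then the
    environment variable, then lid.176.bin in the working directory.
    """
    candidate = model_path or os.environ.get(ENV_VAR) or Path.cwd() / DEFAULT_NAME
    path = Path(candidate)
    if not path.is_file():
        raise FileNotFoundError(f"fastText model not found at {path}")
    return path
```

Failing early with a clear message is useful here because the model is a separate download and a missing file is the most likely setup error.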
Example: Language Detection
from multilang_probe.language_detection import detect_language_fasttext
text = "Ceci est un texte en français."
languages = detect_language_fasttext(text, model_path="path/to/lid.176.bin")
print(languages)
# Output example: "fr: 99.2%, en: 0.8%"
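For readers curious where a string such as "fr: 99.2%, en: 0.8%" could come from, here is a minimal sketch built directly on the fastText Python bindings. It is an assumption about how such output can be produced, not the package's own code; format_predictions and detect are hypothetical names. The fastText calls used (load_model, and predict with k returning labels prefixed with __label__ alongside probabilities) are part of the fastText bindings.

```python
def format_predictions(labels, probabilities):
    """Turn fastText-style labels and probabilities into 'fr: 99.2%, en: 0.8%'."""
    parts = []
    for label, probability in zip(labels, probabilities):
        code = label.replace("__label__", "")  # fastText prefixes every label
        parts.append(f"{code}: {probability * 100:.1f}%")
    return ", ".join(parts)


def detect(text, model_path, k=2):
    """Return the top-k languages for text (requires fasttext and lid.176.bin)."""
    import fasttext  # imported lazily so format_predictions works without it

    model = fasttext.load_model(model_path)
    # predict() handles one line at a time, so newlines are replaced first.
    labels, probabilities = model.predict(text.replace("\n", " "), k=k)
    return format_predictions(labels, probabilities)


print(format_predictions(("__label__fr", "__label__en"), (0.992, 0.008)))
```

In real use the model would be loaded once and reused, since loading lid.176.bin on every call is slow.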
Corpus Analysis:
- Analyze all .txt files in a folder to detect multilingual passages and language distributions.
- Character-based filtering: Identify and filter text lines containing specific character sets (e.g., Japanese, Cyrillic, Arabic).
- Language-based filtering: Extract passages in a specific language, with customizable confidence thresholds (e.g., 70%).
- Targeted extraction: Extract lines of text meeting both minimum length requirements and language detection accuracy.
- Calculate language proportions: Aggregate detected languages across files and calculate their proportions.
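The filtering and aggregation ideas in this list can be sketched independently of the package. The code below assumes a detector callable that returns a (language, confidence) pair with confidence as a percentage; filter_lines, language_proportions, and stub_detect are hypothetical names, and the default values are placeholders rather than the package's defaults.

```python
from collections import Counter


def filter_lines(lines, detect, target_languages, threshold=70.0, min_length=20):
    """Keep lines that are long enough and confidently in a target language."""
    kept = []
    for line in lines:
        text = line.strip()
        if len(text) < min_length:
            continue  # too short for reliable language detection
        language, confidence = detect(text)
        if language in target_languages and confidence >= threshold:
            kept.append(text)
    return kept


def language_proportions(detected_languages):
    """Aggregate a list of language codes into percentages."""
    counts = Counter(detected_languages)
    total = sum(counts.values())
    return {lang: round(100 * n / total, 2) for lang, n in counts.items()}


def stub_detect(text):
    """Stand-in detector for the demo; a real one would call fastText."""
    return ("fr", 99.2) if "français" in text else ("en", 55.0)


lines = ["Ceci est un texte en français.", "Short", "This is an English sentence."]
print(filter_lines(lines, stub_detect, {"fr", "en"}, threshold=70, min_length=10))
print(language_proportions(["fr", "fr", "en", "de"]))
```

Passing the detector in as a parameter keeps the filtering logic testable without downloading the model.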
Example: Analyze and Detect Multilingual Passages
from multilang_probe.corpus_analysis import analyze_corpus_with_fasttext
folder_path = "path/to/corpus/"
results = analyze_corpus_with_fasttext(folder_path)
for filename, langs in results.items():
    print(filename, langs)
Example: Filter Passages by Character Types
Example: Extract Passages by Language with Threshold
from multilang_probe.corpus_analysis import filter_passages_by_language
folder_path = "path/to/corpus/"
target_languages = ["fr", "en"]
threshold = 70
# results is the output of analyze_corpus_with_fasttext(folder_path) from the earlier example
filtered = filter_passages_by_language(results, target_languages, folder_path, threshold)
for filename, passages in filtered.items():
    print(filename, passages)
Supported Character Sets
- Japanese (Hiragana, Katakana)
- Han (Kanji; considered Japanese if Hiragana/Katakana present, else Chinese)
- Korean (Hangul)
- Cyrillic (for languages like Russian, Bulgarian, etc.)
- Arabic
- Hebrew
- Greek
- Latin (basic and extended)
- Devanagari (e.g., Hindi, Sanskrit)
- Tamil, Bengali, Thai
- Extendable via Unicode scripts
- "other" category for characters not belonging to known scripts
Dependencies
- Python 3.7+
- FastText
- Regex (for Unicode script classification)
License
This project is licensed under the MIT License. While the MIT License allows unrestricted use, modification, and distribution of this software, I kindly request that proper credit be given when this project is used in academic, research, or published work. For citation purposes, please refer to the following:
CAFIERO Florian, 'multilang-probe', 2024, [https://github.com/floriancafiero/multilang-probe].
Contributing
Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.
Author
Florian Cafiero
GitHub: floriancafiero
Email: florian.cafiero@chartes.psl.eu
Future Features
- Support for other pre-trained language models (e.g., spaCy).
- Detection of mathematical language.
- Visualization tools for multilingual analysis.
- CLI (Command-Line Interface) for easy usage without writing code.