Multilingual text segmentation tool for zh/ja/en/ko and more. Unofficial backup of LangSegment 0.3.5.
LangSegment (Unofficial Backup)
⚠️ This is an unofficial backup of LangSegment 0.3.5. The original repository has been removed or made private.
A multilingual text segmentation tool that automatically identifies and splits text by language. Particularly useful for TTS (Text-to-Speech) processing with mixed-language content.
Attribution & History
- Original Author: sunnyboxs (juntaosun)
- Original Repository: https://github.com/juntaosun/LangSegment (now removed)
- First Backup: https://github.com/chameleon-ai/LangSegment-0.3.5-backup
- This Fork: https://github.com/MiniXC/LangSegment-0.3.5-backup (PyPI release)
This package is published to PyPI to preserve access to this useful tool after the original repository was removed. All credit for the original work goes to the original author.
Installation
pip install langsegment-backup
Supported Languages
Primary support:
- 🇨🇳 Chinese (zh)
- 🇯🇵 Japanese (ja)
- 🇬🇧 English (en)
- 🇰🇷 Korean (ko)
Experimental support:
- 🇫🇷 French (fr)
- 🇻🇳 Vietnamese (vi)
- 🇷🇺 Russian (ru)
- 🇹🇭 Thai (th)
Beyond these, the underlying py3langid library allows detection of up to 97 languages.
Quick Start
from LangSegment import LangSegment
# Basic usage - segment mixed language text
text = "你好世界!Hello World! こんにちは!안녕하세요!"
results = LangSegment.getTexts(text)
for item in results:
    print(f"[{item['lang']}] {item['text']}")
Output:
[zh] 你好世界!
[en] Hello World!
[ja] こんにちは!
[ko] 안녕하세요!
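For TTS pipelines, the segmented result is typically routed to a per-language voice. The following is a minimal sketch of that step, not part of LangSegment itself: it works on a hand-written list shaped like the documented getTexts() result, and the function name `assign_voices`, the voice names, and the score values are all hypothetical.

```python
# Hypothetical sample shaped like the documented getTexts() result
# ({lang, text, score} dicts); the score values are placeholders.
segments = [
    {"lang": "zh", "text": "你好世界!", "score": 1.0},
    {"lang": "en", "text": "Hello World!", "score": 1.0},
    {"lang": "ja", "text": "こんにちは!", "score": 1.0},
    {"lang": "ko", "text": "안녕하세요!", "score": 1.0},
]


def assign_voices(segments, voices, fallback):
    """Pair each segment's text with a voice name, keeping the original order.

    `voices` maps a language code to a voice name; languages without an
    entry fall back to `fallback`.
    """
    return [(voices.get(seg["lang"], fallback), seg["text"]) for seg in segments]


# Hypothetical voice names for illustration only.
plan = assign_voices(segments, {"zh": "voice_zh", "en": "voice_en"}, "voice_default")
for voice, text in plan:
    print(f"{voice}: {text}")
```

Keeping the segments in order, rather than grouping them by language, preserves the reading order needed when the synthesized audio is concatenated.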
Features
Language Filtering
You can specify which languages to detect and in what priority order:
from LangSegment import LangSegment
# Set language filter (priority order: left = highest)
LangSegment.setfilters(["zh", "ja", "en", "ko"])
# Or for Chinese-English only
LangSegment.setfilters(["zh", "en"])
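Because the filter list is order-sensitive, it can help to clean it up before passing it to setfilters(). The sketch below is a hypothetical helper, not part of the library: `normalize_filters` and `DOCUMENTED_CODES` are names introduced here, and the set of known codes is simply the eight languages listed in this README (py3langid may recognize more).

```python
# Language codes listed in this README; an assumption, not an exhaustive list.
DOCUMENTED_CODES = {"zh", "ja", "en", "ko", "fr", "vi", "ru", "th"}


def normalize_filters(codes, known=DOCUMENTED_CODES):
    """Lowercase, strip, and de-duplicate codes while keeping priority order.

    Raises ValueError for a code outside `known`, which catches typos such
    as "jp" for Japanese.
    """
    result = []
    for code in codes:
        cleaned = code.strip().lower()
        if cleaned not in known:
            raise ValueError(f"unknown language code: {code!r}")
        if cleaned not in result:
            result.append(cleaned)
    return result


filters = normalize_filters(["ZH", " en", "zh"])
print(filters)
```

The returned list could then be passed to LangSegment.setfilters(), with the leftmost entry keeping the highest priority.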
Language Statistics
Get statistics about the languages in your text:
from LangSegment import LangSegment
text = "你好世界!Hello World! こんにちは!"
LangSegment.getTexts(text)
# Get language counts (sorted by character count, descending)
counts = LangSegment.getCounts()
print(counts) # e.g. [('en', 12), ('zh', 10), ('ja', 6)], largest count first
# Get the primary language
primary_lang, char_count = counts[0]
print(f"Primary language: {primary_lang}")
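To make the shape of these statistics concrete, here is a small sketch that derives a descending (lang, count) list from segment dicts. It assumes the count is the number of characters in each segment's text; whether getCounts() treats punctuation or whitespace the same way is not confirmed here, and `count_characters` is a hypothetical name.

```python
from collections import Counter


def count_characters(segments):
    """Sum text length per language and return (lang, count) pairs, largest first."""
    totals = Counter()
    for seg in segments:
        totals[seg["lang"]] += len(seg["text"])
    # most_common() sorts by count in descending order.
    return totals.most_common()


# Hand-written sample shaped like the documented getTexts() result.
sample = [
    {"lang": "zh", "text": "你好世界"},
    {"lang": "en", "text": "Hello World"},
    {"lang": "ja", "text": "こんにちは"},
]
counts = count_characters(sample)
primary_lang, char_count = counts[0]
print(counts, primary_lang)
```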
Manual Language Tags
You can manually specify language regions using tags:
text = "这是中文<ja>これは日本語です</ja>这又是中文"
results = LangSegment.getTexts(text)
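The idea behind manual tags can be illustrated with a small regular-expression sketch. This is not the library's parser: `split_tagged` and `TAG_PATTERN` are hypothetical names, only two-letter lowercase tags are handled, and untagged text is returned with `None` as its language where the real tool would run detection.

```python
import re

# Matches <xx>...</xx> where xx is a two-letter lowercase language code.
TAG_PATTERN = re.compile(r"<(?P<lang>[a-z]{2})>(?P<body>.*?)</(?P=lang)>", re.DOTALL)


def split_tagged(text, default_lang=None):
    """Split text into (lang, text) pairs, honoring explicit language tags."""
    parts = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        if match.start() > pos:
            # Untagged text before this tag.
            parts.append((default_lang, text[pos:match.start()]))
        parts.append((match.group("lang"), match.group("body")))
        pos = match.end()
    if pos < len(text):
        # Untagged text after the last tag.
        parts.append((default_lang, text[pos:]))
    return parts


parts = split_tagged("这是中文<ja>これは日本語です</ja>这又是中文")
for lang, chunk in parts:
    print(lang, chunk)
```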
SSML Support (Chinese)
The tool includes SSML-like tags for Chinese number/date processing:
# Number reading
text = "<number>12345</number>" # → 一二三四五
# Phone number
text = "<telephone>13812345678</telephone>" # → 幺三八幺二三四五六七八
# Currency
text = "<currency>12345</currency>" # → 一万二千三百四十五
# Date
text = "<date>2024-08-24</date>" # → 二零二四年八月二十四日
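The readings shown above follow simple rules that can be sketched in a few lines. The code below is an illustration of those rules, not LangSegment's implementation: the function names are hypothetical, the date helper assumes an ISO `YYYY-MM-DD` string, and currency reading is left out because it needs full place-value handling.

```python
DIGITS = "零一二三四五六七八九"


def read_digits(digits, telephone=False):
    """Read a digit string one character at a time.

    In telephone mode the digit 1 is read as 幺, as in the example above.
    """
    chars = []
    for ch in digits:
        if ch not in "0123456789":
            raise ValueError(f"not an ASCII digit: {ch!r}")
        chars.append("幺" if telephone and ch == "1" else DIGITS[int(ch)])
    return "".join(chars)


def read_small_cardinal(n):
    """Read an integer from 1 to 99 as a cardinal number (enough for months and days)."""
    if not 1 <= n <= 99:
        raise ValueError("expected a value from 1 to 99")
    tens, ones = divmod(n, 10)
    if tens == 0:
        return DIGITS[ones]
    prefix = "十" if tens == 1 else DIGITS[tens] + "十"
    return prefix + (DIGITS[ones] if ones else "")


def read_date(iso_date):
    """Read a YYYY-MM-DD date: year digit by digit, month and day as cardinals."""
    year, month, day = iso_date.split("-")
    return (
        f"{read_digits(year)}年"
        f"{read_small_cardinal(int(month))}月"
        f"{read_small_cardinal(int(day))}日"
    )


print(read_digits("12345"))
print(read_digits("13812345678", telephone=True))
print(read_date("2024-08-24"))
```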
Configuration Options
from LangSegment import LangSegment
# Set Chinese/Japanese priority threshold (0-1, default: 0.89)
LangSegment.setPriorityThreshold(0.89)
# Enable/disable result merging
LangSegment.setLangMerge(True)
# Keep Chinese pinyin format
LangSegment.setKeepPinyin(False)
# Enable preview features (French, Vietnamese support)
LangSegment.setEnablePreview(True)
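These options read as module-wide settings, so a change made for one call would also affect later calls. One way to limit that is to restore the previous value afterwards. The sketch below assumes each setter has a matching getter, as listed in the API reference; `temporary_setting` is a hypothetical helper, and `StubSegmenter` stands in for LangSegment so the example runs without the package installed.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(api, name, value):
    """Set api.set<name>(value) for the duration of a with-block, then restore it."""
    getter = getattr(api, f"get{name}")
    setter = getattr(api, f"set{name}")
    previous = getter()
    setter(value)
    try:
        yield
    finally:
        # Restore the earlier value even if the block raised.
        setter(previous)


class StubSegmenter:
    """Stand-in exposing the same getter/setter naming as the API reference."""

    _merge = True

    @classmethod
    def getLangMerge(cls):
        return cls._merge

    @classmethod
    def setLangMerge(cls, value):
        cls._merge = value


with temporary_setting(StubSegmenter, "LangMerge", False):
    inside = StubSegmenter.getLangMerge()
after = StubSegmenter.getLangMerge()
print(inside, after)
```

With the real package, the same helper would be called as `temporary_setting(LangSegment, "LangMerge", False)`, assuming the getters and setters behave as described.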
API Reference
Main Functions
| Function | Description |
|---|---|
| LangSegment.getTexts(text) | Segment text and return list of {lang, text, score} dicts |
| LangSegment.classify(text) | Alias for getTexts() |
| LangSegment.getCounts() | Get language statistics as list of (lang, count) tuples |
| LangSegment.setfilters(list) | Set language filter/priority list |
| LangSegment.getfilters() | Get current language filters |
Configuration Functions
| Function | Description |
|---|---|
| setPriorityThreshold(float) | Set zh/ja disambiguation threshold (0-1) |
| getPriorityThreshold() | Get current threshold |
| setLangMerge(bool) | Enable/disable merging adjacent same-language segments |
| getLangMerge() | Get merge setting |
| setKeepPinyin(bool) | Keep Chinese pinyin format in parentheses |
| getKeepPinyin() | Get pinyin setting |
| setEnablePreview(bool) | Enable experimental language support |
| getEnablePreview() | Get preview setting |
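The effect of merging adjacent same-language segments can be pictured with a short sketch. This is an illustration of the idea only, not the library's code: `merge_adjacent` is a hypothetical name, and the sketch drops the score field because how scores are combined is not documented here.

```python
from itertools import groupby


def merge_adjacent(segments):
    """Join consecutive segments that share a language code into one segment."""
    merged = []
    # groupby only groups neighbors, so non-adjacent runs stay separate.
    for lang, group in groupby(segments, key=lambda seg: seg["lang"]):
        merged.append({"lang": lang, "text": "".join(seg["text"] for seg in group)})
    return merged


pieces = [
    {"lang": "zh", "text": "你好"},
    {"lang": "zh", "text": "世界"},
    {"lang": "en", "text": "Hello"},
    {"lang": "zh", "text": "再见"},
]
merged = merge_adjacent(pieces)
print(merged)
```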
Dependencies
- numpy >= 1.19.5
- py3langid >= 0.2.2
License
BSD 3-Clause License
Copyright (c) 2024 juntaosun. All rights reserved.
See LICENSE for full license text.
Contributing
Since this is a backup/preservation fork, major feature additions are not planned. However, bug fixes and compatibility updates are welcome. Please open an issue or pull request at https://github.com/MiniXC/LangSegment-0.3.5-backup.
Acknowledgments
- Original author: sunnyboxs (juntaosun) for creating LangSegment
- chameleon-ai for creating the first backup at https://github.com/chameleon-ai/LangSegment-0.3.5-backup
- The py3langid library by Adrien Barbaresi
- The original langid.py by Marco Lui