Multilingual text segmentation tool for zh/ja/en/ko and more. Unofficial backup of LangSegment 0.3.5.

LangSegment (Unofficial Backup)

⚠️ This is an unofficial backup of LangSegment 0.3.5. The original repository has been removed or made private.

A multilingual text segmentation tool that automatically identifies and splits text by language. Particularly useful for TTS (Text-to-Speech) processing with mixed-language content.

Attribution & History

Original Author: sunnyboxs (juntaosun)
Original Repository: https://github.com/juntaosun/LangSegment (now removed)
First Backup: https://github.com/chameleon-ai/LangSegment-0.3.5-backup
This Fork: https://github.com/MiniXC/LangSegment-0.3.5-backup (PyPI release)

This package is published to PyPI to preserve access to this useful tool after the original repository was removed. All credit for the original work goes to the original author.

Installation

pip install langsegment-backup

Supported Languages

Primary support:

🇨🇳 Chinese (zh)
🇯🇵 Japanese (ja)
🇬🇧 English (en)
🇰🇷 Korean (ko)

Experimental support:

🇫🇷 French (fr)
🇻🇳 Vietnamese (vi)
🇷🇺 Russian (ru)
🇹🇭 Thai (th)

Through the underlying py3langid library, the tool can detect up to 97 languages.

Quick Start

from LangSegment import LangSegment

# Basic usage - segment mixed language text
text = "你好世界！Hello World! こんにちは！안녕하세요!"
results = LangSegment.getTexts(text)

for item in results:
    print(f"[{item['lang']}] {item['text']}")

Output:

[zh] 你好世界！
[en] Hello World! 
[ja] こんにちは！
[ko] 안녕하세요!
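
Because each item is a plain dict with `lang` and `text` keys, routing segments to per-language TTS voices needs only ordinary Python. The sketch below is one possible way to do that; `group_by_lang` is a hypothetical helper (not part of LangSegment), and the sample list is hand-written to match the shape of the output above rather than produced by running the library.

```python
from collections import defaultdict


def group_by_lang(segments):
    """Collect segment texts per language, keeping first-seen language order."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg["lang"]].append(seg["text"])
    return dict(grouped)


# Hand-written sample shaped like the getTexts() result (assumption: only
# the 'lang' and 'text' keys are needed here).
segments = [
    {"lang": "zh", "text": "你好世界！"},
    {"lang": "en", "text": "Hello World! "},
    {"lang": "ja", "text": "こんにちは！"},
    {"lang": "ko", "text": "안녕하세요!"},
    {"lang": "en", "text": "Bye."},
]

by_lang = group_by_lang(segments)
print(by_lang["en"])  # ['Hello World! ', 'Bye.']
```

With the package installed, passing `LangSegment.getTexts(text)` in place of the sample list should work the same way.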

Features

Language Filtering

You can specify which languages to detect and in what priority order:

from LangSegment import LangSegment

# Set language filter (priority order: left = highest)
LangSegment.setfilters(["zh", "ja", "en", "ko"])

# Or for Chinese-English only
LangSegment.setfilters(["zh", "en"])
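
Since `setfilters()` is called on the class rather than on an instance, the filter list appears to be shared state. If only one call needs a narrower filter, a small context manager can restore the previous list afterwards. This is a sketch under that assumption: `temporary_filters` and `_StubSegmenter` are hypothetical names, and the stub only mimics the two documented methods so the example runs without the package.

```python
from contextlib import contextmanager


@contextmanager
def temporary_filters(segmenter, langs):
    """Apply `langs` as the filter list, then restore the previous filters."""
    previous = segmenter.getfilters()
    segmenter.setfilters(list(langs))
    try:
        yield segmenter
    finally:
        segmenter.setfilters(previous)


class _StubSegmenter:
    """Stand-in with the same two method names; not the real library."""
    _filters = ["zh", "ja", "en", "ko"]

    @classmethod
    def getfilters(cls):
        return list(cls._filters)

    @classmethod
    def setfilters(cls, filters):
        cls._filters = list(filters)


with temporary_filters(_StubSegmenter, ["zh", "en"]) as seg:
    inside = seg.getfilters()
after = _StubSegmenter.getfilters()
print(inside, after)
```

With the real package, `LangSegment` would be passed in place of `_StubSegmenter`.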

Language Statistics

Get statistics about the languages in your text:

from LangSegment import LangSegment

text = "你好世界！Hello World! こんにちは！"
LangSegment.getTexts(text)

# Get language counts (sorted by character count, descending)
counts = LangSegment.getCounts()
print(counts)  # e.g. [('en', 12), ('zh', 10), ('ja', 6)]

# Get the primary language
primary_lang, char_count = counts[0]
print(f"Primary language: {primary_lang}")
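
The `(lang, count)` tuples can be turned into proportions, which is handy when deciding whether a text is dominated by one language. The following is a minimal sketch; `language_shares` is a hypothetical helper, and the sample values reuse the illustrative numbers from the example above rather than real library output.

```python
def language_shares(counts):
    """Convert (lang, count) tuples into (lang, fraction) tuples, keeping order."""
    total = sum(count for _, count in counts)
    if total == 0:
        return [(lang, 0.0) for lang, _ in counts]
    return [(lang, count / total) for lang, count in counts]


# Illustrative values only; real counts come from LangSegment.getCounts().
sample_counts = [("en", 12), ("zh", 10), ("ja", 6)]
shares = language_shares(sample_counts)
for lang, share in shares:
    print(f"{lang}: {share:.1%}")
```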

Manual Language Tags

You can manually specify language regions using tags:

from LangSegment import LangSegment

text = "这是中文<ja>これは日本語です</ja>这又是中文"
results = LangSegment.getTexts(text)
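
To make the tag format concrete, the sketch below splits a string on `<xx>...</xx>` pairs using a regular expression. It only illustrates how tagged regions are delimited; it is not the library's parser, and both the `split_tagged` name and the two-letter tag pattern are assumptions.

```python
import re

# Assumption: tags are two lowercase letters, e.g. <ja>...</ja>.
TAG_PATTERN = re.compile(r"<([a-z]{2})>(.*?)</\1>", re.DOTALL)


def split_tagged(text):
    """Yield (lang, chunk) pairs; lang is None for text outside any tag."""
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        if match.start() > pos:
            yield None, text[pos:match.start()]
        yield match.group(1), match.group(2)
        pos = match.end()
    if pos < len(text):
        yield None, text[pos:]


parts = list(split_tagged("这是中文<ja>これは日本語です</ja>这又是中文"))
print(parts)
# [(None, '这是中文'), ('ja', 'これは日本語です'), (None, '这又是中文')]
```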

SSML Support (Chinese)

The tool includes SSML-like tags for Chinese number/date processing:

# Number reading
text = "<number>12345</number>"  # → 一二三四五

# Phone number
text = "<telephone>13812345678</telephone>"  # → 幺三八幺二三四五六七八

# Currency
text = "<currency>12345</currency>"  # → 一万二千三百四十五

# Date
text = "<date>2024-08-24</date>"  # → 二零二四年八月二十四日
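
The readings shown in the comments follow simple rules: digit-by-digit for numbers, 幺 for the digit 1 in telephone numbers, and cardinal month and day for dates. The sketch below reproduces three of those readings in plain Python to show the rules at work. It is an independent illustration, not the library's implementation, and all function names are hypothetical.

```python
DIGITS = "零一二三四五六七八九"


def read_digits(s, one="一"):
    """Read each digit separately; pass one='幺' for telephone-style reading."""
    return "".join(one if ch == "1" else DIGITS[int(ch)] for ch in s)


def read_small_number(n):
    """Read an integer from 1 to 99 as a Chinese cardinal."""
    if not 1 <= n <= 99:
        raise ValueError("only 1-99 supported in this sketch")
    tens, ones = divmod(n, 10)
    if tens == 0:
        return DIGITS[ones]
    prefix = "十" if tens == 1 else DIGITS[tens] + "十"
    return prefix + (DIGITS[ones] if ones else "")


def read_date(iso_date):
    """Read YYYY-MM-DD: year digit by digit, month and day as cardinals."""
    year, month, day = iso_date.split("-")
    return (
        f"{read_digits(year)}年"
        f"{read_small_number(int(month))}月"
        f"{read_small_number(int(day))}日"
    )


print(read_digits("12345"))                  # 一二三四五
print(read_digits("13812345678", one="幺"))  # 幺三八幺二三四五六七八
print(read_date("2024-08-24"))               # 二零二四年八月二十四日
```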

Configuration Options

from LangSegment import LangSegment

# Set Chinese/Japanese priority threshold (0-1, default: 0.89)
LangSegment.setPriorityThreshold(0.89)

# Enable/disable result merging
LangSegment.setLangMerge(True)

# Keep Chinese pinyin format
LangSegment.setKeepPinyin(False)

# Enable preview features (French, Vietnamese support)
LangSegment.setEnablePreview(True)
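
Each setter has a matching getter (listed under Configuration Functions in the API reference), so settings can be saved and restored around a block of work. The sketch below shows one way to do that. `snapshot_config`, `restore_config`, and `_Stub` are hypothetical names, and the stub's initial values (other than the documented 0.89 threshold) are arbitrary placeholders, not the library's defaults.

```python
# (getter, setter) name pairs as documented in the API reference.
SETTINGS = [
    ("getPriorityThreshold", "setPriorityThreshold"),
    ("getLangMerge", "setLangMerge"),
    ("getKeepPinyin", "setKeepPinyin"),
    ("getEnablePreview", "setEnablePreview"),
]


def snapshot_config(segmenter):
    """Read every setting through its getter; returns {setter_name: value}."""
    return {setter: getattr(segmenter, getter)() for getter, setter in SETTINGS}


def restore_config(segmenter, snapshot):
    """Write a snapshot back through the matching setters."""
    for setter, value in snapshot.items():
        getattr(segmenter, setter)(value)


class _Stub:
    """Stand-in exposing the same method names; not the real library."""

    def __init__(self):
        self.values = {"PriorityThreshold": 0.89, "LangMerge": True,
                       "KeepPinyin": False, "EnablePreview": False}

    def __getattr__(self, name):
        # Route getX / setX calls to the values dict.
        kind, key = name[:3], name[3:]
        if kind == "get" and key in self.values:
            return lambda: self.values[key]
        if kind == "set" and key in self.values:
            return lambda value: self.values.__setitem__(key, value)
        raise AttributeError(name)


seg = _Stub()
saved = snapshot_config(seg)
seg.setPriorityThreshold(0.5)
seg.setEnablePreview(True)
restore_config(seg, saved)
print(seg.getPriorityThreshold(), seg.getEnablePreview())  # 0.89 False
```

With the real package, `LangSegment` would be passed in place of the stub instance.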

API Reference

Main Functions

| Function | Description |
| --- | --- |
| `LangSegment.getTexts(text)` | Segment text and return list of `{lang, text, score}` dicts |
| `LangSegment.classify(text)` | Alias for `getTexts()` |
| `LangSegment.getCounts()` | Get language statistics as list of `(lang, count)` tuples |
| `LangSegment.setfilters(list)` | Set language filter/priority list |
| `LangSegment.getfilters()` | Get current language filters |

Configuration Functions

| Function | Description |
| --- | --- |
| `setPriorityThreshold(float)` | Set zh/ja disambiguation threshold (0-1) |
| `getPriorityThreshold()` | Get current threshold |
| `setLangMerge(bool)` | Enable/disable merging adjacent same-language segments |
| `getLangMerge()` | Get merge setting |
| `setKeepPinyin(bool)` | Keep Chinese pinyin format in parentheses |
| `getKeepPinyin()` | Get pinyin setting |
| `setEnablePreview(bool)` | Enable experimental language support |
| `getEnablePreview()` | Get preview setting |

Dependencies

numpy >= 1.19.5
py3langid >= 0.2.2

License

BSD 3-Clause License

See LICENSE for full license text.

Contributing

Since this is a backup/preservation fork, major feature additions are not planned. However, bug fixes and compatibility updates are welcome. Please open an issue or pull request at https://github.com/MiniXC/LangSegment-0.3.5-backup.

Acknowledgments

Original author: sunnyboxs (juntaosun) for creating LangSegment
chameleon-ai for creating the first backup at https://github.com/chameleon-ai/LangSegment-0.3.5-backup
The py3langid library by Adrien Barbaresi
The original langid.py by Marco Lui

Current release: 0.3.5.post1 (Jan 23, 2026)

Download files

Source Distribution

langsegment_backup-0.3.5.post1.tar.gz (25.5 kB)

Uploaded Jan 23, 2026 Source

Built Distribution

langsegment_backup-0.3.5.post1-py3-none-any.whl (26.0 kB)

Uploaded Jan 23, 2026 Python 3

File details

Details for the file langsegment_backup-0.3.5.post1.tar.gz.

File metadata

Download URL: langsegment_backup-0.3.5.post1.tar.gz
Upload date: Jan 23, 2026
Size: 25.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for langsegment_backup-0.3.5.post1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `60a4330f99acfcc98c055b406554bb1e7528002cc8ac5a2efd584ed6ae698cfd` |
| MD5 | `491b02d326056a49973e38cf380081eb` |
| BLAKE2b-256 | `d7546570ba2597906c0372d984355d475c894d3a7c0eb12cb649f9ca31ccc9ab` |

File details

Details for the file langsegment_backup-0.3.5.post1-py3-none-any.whl.

File metadata

Download URL: langsegment_backup-0.3.5.post1-py3-none-any.whl
Upload date: Jan 23, 2026
Size: 26.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for langsegment_backup-0.3.5.post1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `2fe0993d3a02b2ef53e406a6f6abc993f6cec1a0fb7b42fe1b6e4b096c8a4e15` |
| MD5 | `4c050de54dd0f1d48742327b3cd1b20b` |
| BLAKE2b-256 | `f45c0a4e1c03ddc0f6f2bd0cad8611024cb747b46c9d3707f123170bc53589b9` |
