Multilingual text segmentation tool for zh/ja/en/ko and more. Unofficial backup of LangSegment 0.3.5.

LangSegment (Unofficial Backup)

⚠️ This is an unofficial backup of LangSegment 0.3.5. The original repository has been removed or made private.

A multilingual text segmentation tool that automatically identifies and splits text by language. Particularly useful for TTS (Text-to-Speech) processing with mixed-language content.

Attribution & History

This package is published to PyPI to preserve access to this useful tool after the original repository was removed. All credit for the original work goes to the original author.

Installation

pip install langsegment-backup

Supported Languages

Primary support:

  • 🇨🇳 Chinese (zh)
  • 🇯🇵 Japanese (ja)
  • 🇬🇧 English (en)
  • 🇰🇷 Korean (ko)

Experimental support:

  • 🇫🇷 French (fr)
  • 🇻🇳 Vietnamese (vi)
  • 🇷🇺 Russian (ru)
  • 🇹🇭 Thai (th)

Through the underlying py3langid library, the tool can identify up to 97 languages.

Quick Start

from LangSegment import LangSegment

# Basic usage - segment mixed language text
text = "你好世界!Hello World! こんにちは!안녕하세요!"
results = LangSegment.getTexts(text)

for item in results:
    print(f"[{item['lang']}] {item['text']}")

Output:

[zh] 你好世界!
[en] Hello World! 
[ja] こんにちは!
[ko] 안녕하세요!
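
For TTS pipelines it is often convenient to route all text of one language to the same voice. The following is a minimal sketch of that step, assuming only the `lang` and `text` keys shown above. The helper name `group_by_language` and the sample list are hypothetical and hand-written, not real library output.

```python
from collections import defaultdict


def group_by_language(segments):
    """Collect segment texts per language, keeping first-seen language order."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg["lang"]].append(seg["text"])
    return dict(grouped)


# Hand-written sample shaped like the results of LangSegment.getTexts(text).
sample = [
    {"lang": "zh", "text": "你好世界!"},
    {"lang": "en", "text": "Hello World! "},
    {"lang": "ja", "text": "こんにちは!"},
    {"lang": "en", "text": "Bye."},
]

for lang, texts in group_by_language(sample).items():
    print(lang, texts)
```

In real use, the list returned by `LangSegment.getTexts(text)` would take the place of `sample`.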

Features

Language Filtering

You can specify which languages to detect and in what priority order:

from LangSegment import LangSegment

# Set language filter (priority order: left = highest)
LangSegment.setfilters(["zh", "ja", "en", "ko"])

# Or for Chinese-English only
LangSegment.setfilters(["zh", "en"])
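
Because the filter list is a global setting, it can help to restore it after a one-off call. The sketch below shows one possible way to do that with a context manager. It assumes that the value returned by `getfilters()` can be passed back to `setfilters()`. The names `temporary_filters` and `FakeSegmenter` are hypothetical; the fake object only stands in for `LangSegment` so the example runs on its own.

```python
from contextlib import contextmanager


@contextmanager
def temporary_filters(segmenter, langs):
    """Apply `langs` as the filter list, then restore the previous list."""
    previous = segmenter.getfilters()
    segmenter.setfilters(list(langs))
    try:
        yield segmenter
    finally:
        # Runs even if the body raises, so the old filters always come back.
        segmenter.setfilters(previous)


class FakeSegmenter:
    """Hypothetical stand-in exposing only getfilters()/setfilters()."""

    def __init__(self):
        self._filters = ["zh", "ja", "en", "ko"]

    def getfilters(self):
        return list(self._filters)

    def setfilters(self, filters):
        self._filters = list(filters)


seg = FakeSegmenter()
with temporary_filters(seg, ["zh", "en"]):
    print(seg.getfilters())  # filters active inside the block
print(seg.getfilters())      # previous filters restored afterwards
```

With the real package, the `LangSegment` object imported above would be passed in place of `seg`.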

Language Statistics

Get statistics about the languages in your text:

from LangSegment import LangSegment

text = "你好世界!Hello World! こんにちは!"
LangSegment.getTexts(text)

# Get language counts (sorted by character count, descending)
counts = LangSegment.getCounts()
print(counts)  # list of (lang, count) tuples, largest count first

# Get the primary language
primary_lang, char_count = counts[0]
print(f"Primary language: {primary_lang}")
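
The counts can also be turned into per-language shares, for example to decide whether a text is mostly one language. This is a small sketch that assumes only the `(lang, count)` tuple shape described above. The helper name `language_shares` is hypothetical and the numbers are hand-written for illustration.

```python
def language_shares(counts):
    """Convert (lang, count) tuples into a {lang: fraction_of_total} dict."""
    total = sum(count for _, count in counts)
    if total == 0:
        return {}
    return {lang: count / total for lang, count in counts}


# Hand-written sample shaped like the result of LangSegment.getCounts().
sample_counts = [("en", 12), ("zh", 10), ("ja", 6)]

shares = language_shares(sample_counts)
for lang, share in shares.items():
    print(f"{lang}: {share:.0%}")
```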

Manual Language Tags

You can manually specify language regions using tags:

from LangSegment import LangSegment

text = "这是中文<ja>これは日本語です</ja>这又是中文"
results = LangSegment.getTexts(text)
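
When the language of each piece is already known, the tagged string can be built programmatically. The following sketch assumes only the `<lang>...</lang>` tag form shown above; the helper name `tag_text` is hypothetical.

```python
def tag_text(parts):
    """Join (lang, text) pairs; a lang of None leaves that text untagged."""
    pieces = []
    for lang, text in parts:
        pieces.append(text if lang is None else f"<{lang}>{text}</{lang}>")
    return "".join(pieces)


tagged = tag_text([
    (None, "这是中文"),
    ("ja", "これは日本語です"),
    (None, "这又是中文"),
])
print(tagged)  # 这是中文<ja>これは日本語です</ja>这又是中文
```

The resulting string can then be passed to `LangSegment.getTexts()` as in the example above.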

SSML Support (Chinese)

The tool includes SSML-like tags for Chinese number/date processing:

# Number reading
text = "<number>12345</number>"  # → 一二三四五

# Phone number
text = "<telephone>13812345678</telephone>"  # → 幺三八幺二三四五六七八

# Currency
text = "<currency>12345</currency>"  # → 一万二千三百四十五

# Date
text = "<date>2024-08-24</date>"  # → 二零二四年八月二十四日
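
These tags can be generated from Python values rather than typed by hand. The sketch below is one possible helper, assuming only the four tag names listed above. The names `ssml` and `SSML_TAGS` are hypothetical, and the code only builds the tagged strings; it does not perform the Chinese reading conversion.

```python
from datetime import date

# Tag names taken from the examples above.
SSML_TAGS = {"number", "telephone", "currency", "date"}


def ssml(tag, value):
    """Wrap `value` in <tag>...</tag>, formatting dates as YYYY-MM-DD."""
    if tag not in SSML_TAGS:
        raise ValueError(f"unsupported tag: {tag!r}")
    if isinstance(value, date):
        value = value.isoformat()
    return f"<{tag}>{value}</{tag}>"


print(ssml("number", 12345))             # <number>12345</number>
print(ssml("date", date(2024, 8, 24)))   # <date>2024-08-24</date>
```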

Configuration Options

from LangSegment import LangSegment

# Set Chinese/Japanese priority threshold (0-1, default: 0.89)
LangSegment.setPriorityThreshold(0.89)

# Enable/disable result merging
LangSegment.setLangMerge(True)

# Keep Chinese pinyin format
LangSegment.setKeepPinyin(False)

# Enable preview features (French, Vietnamese support)
LangSegment.setEnablePreview(True)
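
The setters can be grouped behind a single function so that settings are validated and applied in one place. This is a minimal sketch, assuming only the four setter names shown above. The names `apply_settings` and `RecordingSegmenter` are hypothetical; the recording class stands in for `LangSegment` so the example runs without the package installed.

```python
def apply_settings(segmenter, *, threshold=None, merge=None,
                   keep_pinyin=None, preview=None):
    """Call only the setters whose argument was provided (not None)."""
    if threshold is not None:
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        segmenter.setPriorityThreshold(threshold)
    if merge is not None:
        segmenter.setLangMerge(bool(merge))
    if keep_pinyin is not None:
        segmenter.setKeepPinyin(bool(keep_pinyin))
    if preview is not None:
        segmenter.setEnablePreview(bool(preview))


class RecordingSegmenter:
    """Hypothetical stand-in that records setter calls instead of segmenting."""

    def __init__(self):
        self.calls = []

    def setPriorityThreshold(self, value):
        self.calls.append(("setPriorityThreshold", value))

    def setLangMerge(self, value):
        self.calls.append(("setLangMerge", value))

    def setKeepPinyin(self, value):
        self.calls.append(("setKeepPinyin", value))

    def setEnablePreview(self, value):
        self.calls.append(("setEnablePreview", value))


demo = RecordingSegmenter()
apply_settings(demo, threshold=0.89, merge=True)
print(demo.calls)
```

Rejecting out-of-range thresholds before calling the setter is a design choice of this sketch, based on the documented 0-1 range.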

API Reference

Main Functions

Function | Description
LangSegment.getTexts(text) | Segment text and return list of {lang, text, score} dicts
LangSegment.classify(text) | Alias for getTexts()
LangSegment.getCounts() | Get language statistics as list of (lang, count) tuples
LangSegment.setfilters(list) | Set language filter/priority list
LangSegment.getfilters() | Get current language filters

Configuration Functions

Function | Description
setPriorityThreshold(float) | Set zh/ja disambiguation threshold (0-1)
getPriorityThreshold() | Get current threshold
setLangMerge(bool) | Enable/disable merging adjacent same-language segments
getLangMerge() | Get merge setting
setKeepPinyin(bool) | Keep Chinese pinyin format in parentheses
getKeepPinyin() | Get pinyin setting
setEnablePreview(bool) | Enable experimental language support
getEnablePreview() | Get preview setting

Dependencies

  • numpy >= 1.19.5
  • py3langid >= 0.2.2

License

BSD 3-Clause License

Copyright (c) 2024 juntaosun. All rights reserved.

See LICENSE for full license text.

Contributing

Since this is a backup/preservation fork, major feature additions are not planned. However, bug fixes and compatibility updates are welcome. Please open an issue or pull request at https://github.com/MiniXC/LangSegment-0.3.5-backup.

Acknowledgments

