A package for splitting sentences by language (concatenating over-split substrings based on their language)


split-lang

Splitting sentences by concatenating over-split substrings based on their language, powered by wtpsplit for segmentation and fast-langdetect and langdetect for language detection



1. Idea

Stage 1: rule-based split using punctuation

  • hello, how are you -> hello | , | how are you

Stage 2: then over-split the text into substrings with wtpsplit

  • 你喜欢看アニメ吗 -> 你 | 喜欢 | 看 | アニメ | 吗

Stage 3: concatenate substrings based on their languages using fast-langdetect and langdetect

  • 你 | 喜欢 | 看 | アニメ | 吗 -> 你喜欢看 | アニメ | 吗
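
The three stages can be pictured with a few lines of Python. This is a toy sketch only: the real package uses wtpsplit for Stage 2 and fast-langdetect / langdetect for detection, and every helper below is hypothetical (the toy over-splits into single characters and labels punctuation as en rather than punctuation).

import re

def stage1_rule_split(text: str) -> list[str]:
    # Stage 1: rule-based split on punctuation (toy regex, keeps punctuation)
    return [p for p in re.split(r"([,。、??!!,.])", text) if p]

def stage2_over_split(chunk: str) -> list[str]:
    # Stage 2 stand-in: the real package uses wtpsplit here; splitting into
    # single characters just simulates the "over-split" effect
    return list(chunk)

def toy_detect_lang(piece: str) -> str:
    # Stand-in for fast-langdetect / langdetect
    if re.search(r"[\u3040-\u30ff]", piece):  # hiragana / katakana
        return "ja"
    if re.search(r"[\u4e00-\u9fff]", piece):  # CJK ideographs
        return "zh"
    return "en"

def stage3_merge(pieces: list[str]) -> list[tuple[str, str]]:
    # Stage 3: concatenate adjacent pieces detected as the same language
    merged: list[tuple[str, str]] = []
    for piece in pieces:
        lang = toy_detect_lang(piece)
        if merged and merged[-1][0] == lang:
            merged[-1] = (lang, merged[-1][1] + piece)
        else:
            merged.append((lang, piece))
    return merged

pieces: list[str] = []
for chunk in stage1_rule_split("你喜欢看アニメ吗?"):
    pieces.extend(stage2_over_split(chunk))
print(stage3_merge(pieces))
# [('zh', '你喜欢看'), ('ja', 'アニメ'), ('zh', '吗'), ('en', '?')]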

2. Motivation

  1. TTS (Text-To-Speech) models often fail on multi-language sentences; splitting the sentence by language gives better results
  2. Existing NLP toolkits (e.g. SpaCy) are helpful for parsing text in a single language, but multi-language texts like the ones below are hard to deal with:
你喜欢看アニメ吗?
Vielen Dank merci beaucoup for your help.
你最近好吗、最近どうですか?요즘 어떻게 지내요?sky is clear and sunny。

3. Usage

3.1. Installation

You can install the package using pip:

pip install split-lang

3.2. Basic

3.2.1. split_by_lang


from split_lang import split_by_lang

texts = [
    "你喜欢看アニメ吗?",
]

for text in texts:
    substr = split_by_lang(
        text=text,
        threshold=4.9e-5,
        default_lang="en",
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
    print("----------------------")
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗
3|punctuation:?
----------------------
from split_lang import split_by_lang

texts = [
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]

for text in texts:
    substr = split_by_lang(
        text=text,
        threshold=4.9e-5,
        default_lang="en",
        merge_across_punctuation=True,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目,谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます!愛してる
----------------------
from split_lang import split_by_lang

texts = [
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]

for text in texts:
    substr = split_by_lang(
        text=text,
        threshold=4.9e-5,
        default_lang="en",
        merge_across_punctuation=False,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
0|en:Please star this project on GitHub
1|punctuation:, 
2|en:Thanks you
3|punctuation:. 
4|en:I love you
5|zh:请加星这个项目
6|punctuation:,
7|zh:谢谢你
8|punctuation:。
9|zh:我爱你
10|ja:この項目をスターしてください
11|punctuation:、
12|ja:ありがとうございます
13|punctuation:!
14|ja:愛してる

3.3. Advanced

3.3.1. TextSplitter and threshold

TextSplitter is a class that implements a split() method to further split the text after the rule-based splitting (Idea, Stage 2).

By default, it uses the WtP model from wtpsplit (WtP is faster and more accurate on SHORT text; switch to the SaT model for long paragraphs).

The threshold is used by the WtP and SaT models and defaults to 1e-4; the smaller it is, the more substrings you will get in the wtpsplit stage.

[!NOTE] Check tests/split_acc.py in the GitHub repo to find the best threshold for your use case
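
For long paragraphs, the WtP-based splitter can be swapped for SaT. Below is a minimal sketch: the SaT calls follow wtpsplit's documented API, but the TextSplitter import, the subclass shape, and the text_splitter keyword are assumptions based on the description above, so verify them against the package source.

# Sketch: swapping in the SaT model for long paragraphs.
from wtpsplit import SaT
from split_lang import split_by_lang, TextSplitter  # TextSplitter export assumed

class SaTSplitter(TextSplitter):
    """Over-split with SaT instead of the default WtP model."""

    def __init__(self) -> None:
        # "sat-3l-sm" is a small SaT checkpoint from wtpsplit
        self.sat = SaT("sat-3l-sm")

    def split(self, text: str, threshold: float = 1e-4) -> list[str]:
        # Lower threshold -> more (shorter) substrings in this stage
        return self.sat.split(text, threshold=threshold)

substrings = split_by_lang(
    text="a long multi-language paragraph ...",
    threshold=1e-4,
    default_lang="en",
    text_splitter=SaTSplitter(),  # hypothetical keyword; verify in the repo
)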

3.3.2. usage of lang_map and default_lang (for better result)

[!IMPORTANT] Add language codes for your use case if other languages are needed

  • The default lang_map is shown below
    • if langdetect, fasttext, or any other language detector returns a language that is NOT included in lang_map, that substring's language is set to default_lang
    • if you set default_lang, or the value of a key:value pair in lang_map, to x, that substring will be merged into a neighboring substring
      • zh | x | ja -> zh | ja (the x substring is merged to one side)
      • In the example below, zh-tw is set to x because characters in zh and ja are sometimes detected as Traditional Chinese
  • The default default_lang is x
LANG_MAP = {
    "zh": "zh",
    "yue": "zh",  # 粤语
    "wuu": "zh",  # 吴语
    "zh-cn": "zh",
    "zh-tw": "x",
    "ko": "ko",
    "ja": "ja",
    "de": "de",
    "fr": "fr",
    "en": "en",
}
DEFAULT_LANG = "x"
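
A minimal sketch of wiring a custom map in (the lang_map and default_lang keyword arguments are assumed from this section's description; verify them against the actual split_by_lang signature):

# Sketch: supplying a custom lang_map to split_by_lang.
# The lang_map= keyword is assumed from this section; verify in the repo.
from split_lang import split_by_lang

new_lang_map = {
    "zh": "zh",
    "zh-cn": "zh",
    "zh-tw": "x",  # merged into a neighboring substring
    "ko": "ko",
    "ja": "ja",
    "en": "en",
}

substrings = split_by_lang(
    text="你最近好吗、最近どうですか?요즘 어떻게 지내요?",
    threshold=4.9e-5,
    lang_map=new_lang_map,
    default_lang="en",
)
for index, item in enumerate(substrings):
    print(f"{index}|{item.lang}:{item.text}")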

4. Acknowledgement

