split-lang

A package for splitting sentences by language: concatenating over-split substrings based on their language, powered by wtpsplit and language detection (fast-langdetect and langdetect).
1.1. Idea
Stage 1: rule-based split using punctuation

```
hello, how are you
-> hello | , | how are you
```
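Stage 1 can be sketched as a regex split that keeps each punctuation mark as its own substring. The punctuation set below is a hypothetical subset for illustration; the library's actual rule set may differ:

```python
import re

# Hypothetical punctuation set; the library's actual rules may differ.
PUNCTUATION = ",.?!。、,?!"

def rule_based_split(text: str) -> list[str]:
    # A capturing group in re.split keeps the delimiters in the result.
    parts = re.split(f"([{PUNCTUATION}])", text)
    # Drop empty fragments and trim surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

print(rule_based_split("hello, how are you"))
# -> ['hello', ',', 'how are you']
```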
Stage 2: over-split the text into substrings using wtpsplit

```
你喜欢看アニメ吗
-> 你 | 喜欢 | 看 | アニメ | 吗
```
Stage 3: concatenate substrings based on their languages using fast-langdetect and langdetect

```
你 | 喜欢 | 看 | アニメ | 吗
-> 你喜欢看 | アニメ | 吗
```
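The concatenation step above can be sketched as: detect a language for each over-split substring, then merge adjacent substrings whose detected language matches. The `toy_detect` function here is a stand-in for fast-langdetect/langdetect, not the library's actual detector:

```python
def merge_by_lang(substrings: list[str], detect) -> list[tuple[str, str]]:
    """Merge adjacent substrings that share the same detected language."""
    merged: list[tuple[str, str]] = []  # (lang, text) pairs
    for s in substrings:
        lang = detect(s)
        if merged and merged[-1][0] == lang:
            # Same language as the previous chunk: concatenate.
            merged[-1] = (lang, merged[-1][1] + s)
        else:
            merged.append((lang, s))
    return merged

# Toy detector: katakana -> Japanese, everything else -> Chinese.
def toy_detect(s: str) -> str:
    return "ja" if any("ァ" <= ch <= "ヶ" for ch in s) else "zh"

print(merge_by_lang(["你", "喜欢", "看", "アニメ", "吗"], toy_detect))
# -> [('zh', '你喜欢看'), ('ja', 'アニメ'), ('zh', '吗')]
```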
2. Motivation
- TTS (Text-To-Speech) models often fail on multi-language sentences; splitting a sentence by language yields better results
- Existing NLP toolkits (e.g. SpaCy) are helpful for parsing text in a single language, but multi-language texts like the ones below are hard to deal with:
你喜欢看アニメ吗?
你最近好吗、最近どうですか?요즘 어떻게 지내요?sky is clear and sunny。
Vielen Dank merci beaucoup for your help.
3. Usage
3.1. Installation
You can install the package using pip:

```shell
pip install split-lang
```
3.2. Basic
3.2.1. split_by_lang
```python
from split_lang import split_by_lang

texts = [
    "你喜欢看アニメ吗?",
]
for text in texts:
    substr = split_by_lang(
        text=text,
        threshold=4.9e-5,
        default_lang="en",
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
    print("----------------------")
```

```
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗
3|punctuation:?
----------------------
```
With `merge_across_punctuation=True`, substrings of the same language are merged across punctuation:

```python
from split_lang import split_by_lang

texts = [
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]
for text in texts:
    substr = split_by_lang(
        text=text,
        threshold=4.9e-5,
        default_lang="en",
        merge_across_punctuation=True,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
```

```
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目,谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます!愛してる
```
With `merge_across_punctuation=False`, punctuation is returned as separate substrings:

```python
from split_lang import split_by_lang

texts = [
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]
for text in texts:
    substr = split_by_lang(
        text=text,
        threshold=4.9e-5,
        default_lang="en",
        merge_across_punctuation=False,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
```

```
0|en:Please star this project on GitHub
1|punctuation:,
2|en:Thanks you
3|punctuation:.
4|en:I love you
5|zh:请加星这个项目
6|punctuation:,
7|zh:谢谢你
8|punctuation:。
9|zh:我爱你
10|ja:この項目をスターしてください
11|punctuation:、
12|ja:ありがとうございます
13|punctuation:!
14|ja:愛してる
```
3.3. Advanced
3.3.1. threshold
The threshold used in wtpsplit, which defaults to 1e-4. The smaller it is, the more substrings you will get in the wtpsplit stage.

[!NOTE] Check tests/split_acc.py in the GitHub repo to find the best threshold for your use case.
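To see why a smaller threshold produces more substrings, note that wtpsplit assigns each position a split score and positions scoring above the threshold become split points. The scores below are made up purely for illustration; this is not wtpsplit's actual scoring:

```python
def split_points(probs: list[float], threshold: float) -> list[int]:
    """Indices whose (made-up) split score exceeds the threshold."""
    return [i for i, p in enumerate(probs) if p > threshold]

# Hypothetical per-position scores, not real wtpsplit output.
probs = [1e-6, 3e-5, 8e-5, 2e-4, 5e-6]

print(split_points(probs, threshold=1e-4))    # -> [3]  (fewer splits)
print(split_points(probs, threshold=4.9e-5))  # -> [2, 3]  (more splits)
```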
3.3.2. usage of lang_map (for better results)

[!IMPORTANT] Add a lang code for your use case if other languages are needed

- the default lang_map looks like below
- if langdetect, fasttext, or any other language detector detects a language that is NOT included in lang_map, the substring's language will be set to 'x'
- every 'x' substring is merged into the neighboring substring
- the default default_lang is 'en'
```python
LANG_MAP = {
    "zh": "zh",
    "zh-cn": "zh",
    "zh-tw": "x",
    "ko": "ko",
    "ja": "ja",
    "de": "de",
    "fr": "fr",
    "en": "en",
}
DEFAULT_LANG = "en"
```
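The mapping-plus-merge behaviour described above can be sketched as: look each detected code up in `LANG_MAP`, fall back to `'x'`, then absorb every `'x'` chunk into its neighbor. This is a simplified sketch of the documented behaviour, not the library's actual code:

```python
LANG_MAP = {
    "zh": "zh", "zh-cn": "zh", "zh-tw": "x",
    "ko": "ko", "ja": "ja", "de": "de", "fr": "fr", "en": "en",
}
DEFAULT_LANG = "en"

def normalize(detected: str) -> str:
    # Codes missing from LANG_MAP collapse to the placeholder 'x'.
    return LANG_MAP.get(detected, "x")

def absorb_x(chunks: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge every 'x' chunk into the preceding chunk; a leading 'x'
    merges forward, and a lone 'x' falls back to DEFAULT_LANG."""
    out: list[tuple[str, str]] = []
    for lang, text in chunks:
        if lang == "x" and out:
            out[-1] = (out[-1][0], out[-1][1] + text)
        else:
            out.append((lang, text))
    if out and out[0][0] == "x":
        if len(out) > 1:
            out[1] = (out[1][0], out[0][1] + out[1][1])
            out = out[1:]
        else:
            out = [(DEFAULT_LANG, out[0][1])]
    return out

print(normalize("zh-cn"))  # -> 'zh'
print(normalize("vi"))     # -> 'x' (not in LANG_MAP)
print(absorb_x([("zh", "你好"), ("x", "~"), ("en", "hi")]))
# -> [('zh', '你好~'), ('en', 'hi')]
```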
Acknowledgement
- Inspired by LlmKira/fast-langdetect
- Text segmentation depends on segment-any-text/wtpsplit
- Language detection depends on zafercavdar/fasttext-langdetect and Mimino666/langdetect (DoodleBears/langdetect fixes Chinese being mis-detected as Korean)
Hashes for split_lang-1.0.1-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 8a5e76aed0c686a7510be9dc91ed80c2b412d9002fedb9557e21040f792cf62f |
| MD5 | a56488932fa0089d0b7bdd6c7720a518 |
| BLAKE2b-256 | e48162f24be9adc12c55695d28158bdd46aeecff4742de194d92109ceb3fec31 |