A package for splitting sentences by language (concatenating over-split substrings based on their language)
split-lang
Splits sentences by language by concatenating over-split substrings based on their language. Powered by:
- splitting: budoux and rule-based splitting
- language detection: fast-langdetect and lingua-py
1. Idea
Stage 1: rule-based split using punctuation
hello, how are you -> hello | , | how are you
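Stage 1 can be illustrated with a minimal sketch using only the standard library. The punctuation set and the function name `split_on_punctuation` are assumptions made for this example, not the package's actual rule set or API.

```python
import re

# Hypothetical punctuation set for illustration; the package's real rules may differ.
PUNCTUATION = ",.?!;:" + ",。?!;:、"

# A capturing group makes re.split keep each punctuation mark as its own item.
_PUNCT_PATTERN = re.compile(f"([{re.escape(PUNCTUATION)}])")


def split_on_punctuation(text: str) -> list[str]:
    """Split text at punctuation marks, keeping each mark as a separate piece."""
    pieces = _PUNCT_PATTERN.split(text)
    # Strip surrounding whitespace and drop pieces that end up empty.
    return [piece.strip() for piece in pieces if piece.strip()]


print(" | ".join(split_on_punctuation("hello, how are you")))
# hello | , | how are you
```

Keeping the punctuation marks as separate pieces lets a later stage decide which side they should be attached to.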
Stage 2: then over-split the text into substrings using budoux, spaces, and regex
你喜欢看アニメ吗 -> 你 | 喜欢 | 看 | アニメ | 吗
昨天見た映画はとても感動的でした -> 昨天 | 見た | 映画 | はとても | 感動的 | でした
我朋友是日本人彼はとても優しいです -> 我 | 朋友 | 是 | 日本人 | 彼は | とても | 優しいです
how are you -> how | are | you
Stage 3: concatenate substrings based on their languages using fast-langdetect and langdetect
你 | 喜欢 | 看 | アニメ | 吗 -> 你喜欢看 | アニメ | 吗
昨天 | 見た | 映画 | はとても | 感動的 | でした -> 昨天 | 見た映画はとても感動的でした
我 | 朋友 | 是 | 日本人 | 彼は | とても | 優しいです -> 我朋友是日本人 | 彼はとても優しいです
how | are | you -> how are you
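The concatenation step in Stage 3 can be sketched as grouping adjacent substrings that share a language label. In this example the labels are assigned by hand; in the package they come from a language detector. The function name `concat_by_language` is an assumption made for illustration.

```python
from itertools import groupby


def concat_by_language(labeled: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge runs of adjacent (lang, text) pairs that share the same language."""
    merged = []
    # groupby only groups consecutive items, so non-adjacent runs stay separate.
    for lang, group in groupby(labeled, key=lambda pair: pair[0]):
        merged.append((lang, "".join(text for _, text in group)))
    return merged


# Hand-labeled substrings standing in for detector output (an assumption).
labeled = [("zh", "你"), ("zh", "喜欢"), ("zh", "看"), ("ja", "アニメ"), ("zh", "吗")]
for lang, text in concat_by_language(labeled):
    print(f"{lang}:{text}")
# zh:你喜欢看
# ja:アニメ
# zh:吗
```

This sketch joins pieces without a separator, which suits Chinese and Japanese. Space-delimited languages would need the original whitespace preserved in the pieces.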
2. Motivation
- TTS (Text-To-Speech) models often fail on multi-language sentences; splitting a sentence by language gives better results
- Existing NLP toolkits (e.g. SpaCy) are helpful for parsing text in one language, but multi-language texts like the ones below are hard to deal with:
你喜欢看アニメ吗?
Vielen Dank merci beaucoup for your help.
你最近好吗、最近どうですか?요즘 어떻게 지내요?sky is clear and sunny。
3. Usage
3.1. Installation
You can install the package using pip:
pip install split-lang
3.2. Basic
3.2.1. split_by_lang
from split_lang import LangSplitter

lang_splitter = LangSplitter()
text = "你喜欢看アニメ吗"
substr = lang_splitter.split_by_lang(
    text=text,
)
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗
import time

from split_lang import LangSplitter

lang_splitter = LangSplitter(merge_across_punctuation=True)
texts = [
    "你喜欢看アニメ吗?我也喜欢看",
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]
time1 = time.time()
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
    print("----------------------")
time2 = time.time()
# Elapsed time in seconds for splitting all texts
print(time2 - time1)
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗?我也喜欢看
----------------------
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目,谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます!愛してる
----------------------
0.007998466491699219
3.2.2. merge_across_digit
lang_splitter.merge_across_digit = False
texts = [
    "衬衫的价格是9.15便士",
]
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
0|zh:衬衫的价格是
1|digit:9.15
2|zh:便士
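To show what isolating a number as its own segment involves, here is a standard-library sketch that separates digit runs from the surrounding text. It is not the package's implementation; the function name `split_digits` and the `text` label are assumptions made for this example.

```python
import re

# Matches an integer or a simple decimal such as 9.15 (an assumed number format).
_NUMBER = re.compile(r"(\d+(?:\.\d+)?)")


def split_digits(text: str) -> list[tuple[str, str]]:
    """Return (label, piece) pairs, labeling number runs as 'digit'."""
    result = []
    # The capturing group makes re.split keep the matched numbers in the output.
    for piece in _NUMBER.split(text):
        if not piece:
            continue
        label = "digit" if _NUMBER.fullmatch(piece) else "text"
        result.append((label, piece))
    return result


for index, (label, piece) in enumerate(split_digits("衬衫的价格是9.15便士")):
    print(f"{index}|{label}:{piece}")
# 0|text:衬衫的价格是
# 1|digit:9.15
# 2|text:便士
```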
3.3. Advanced
3.3.1. Usage of lang_map and default_lang (for better results)
[!IMPORTANT] Add language codes for your use case if other languages are needed.
- The default lang_map looks like the one below.
  - If lingua-py, fasttext, or any other language detector detects a language that is NOT included in lang_map, it will be set to default_lang.
  - If you set default_lang, or the value of a key:value pair in lang_map, to x, the substring will be merged into a neighboring substring: zh | x | ja -> zh | ja (x has been merged to one side).
  - In the example below, zh-tw is set to x because characters in zh and ja are sometimes detected as Traditional Chinese.
- The default default_lang is x.
DEFAULT_LANG_MAP = {
    "zh": "zh",
    "yue": "zh",  # Cantonese
    "wuu": "zh",  # Wu Chinese
    "zh-cn": "zh",
    "zh-tw": "x",
    "ko": "ko",
    "ja": "ja",
    "de": "de",
    "fr": "fr",
    "en": "en",
    "hr": "en",
}
DEFAULT_LANG = "x"
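The mapping and merging behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `normalize_lang` and `merge_placeholder` are hypothetical names, the map is a subset of the default one, and the choice to fold an x segment into the previous segment (or the next one when there is no previous segment) is arbitrary, since the text above only says it is merged to one side.

```python
# Subset of the default map shown above, for illustration only.
LANG_MAP = {"zh": "zh", "zh-cn": "zh", "zh-tw": "x", "ja": "ja", "en": "en"}
DEFAULT_LANG = "x"


def normalize_lang(detected: str) -> str:
    """Map a detected code through LANG_MAP; unknown codes fall back to DEFAULT_LANG."""
    return LANG_MAP.get(detected, DEFAULT_LANG)


def merge_placeholder(segments: list[tuple[str, str]], placeholder: str = "x") -> list[tuple[str, str]]:
    """Fold placeholder segments into a neighbor and join adjacent same-language runs."""
    merged: list[tuple[str, str]] = []
    pending = ""  # leading placeholder text waiting for the first real segment
    for lang, text in segments:
        if lang == placeholder:
            if merged:
                # Assumed policy: attach to the previous segment.
                merged[-1] = (merged[-1][0], merged[-1][1] + text)
            else:
                pending += text
        elif merged and merged[-1][0] == lang:
            merged[-1] = (lang, merged[-1][1] + text)
        else:
            merged.append((lang, pending + text))
            pending = ""
    if pending:
        # Every segment was a placeholder, so nothing could absorb it.
        merged.append((placeholder, pending))
    return merged


# Hand-written detector output (an assumption) for the Stage 3 example sentence.
detected = [("zh-cn", "我朋友是"), ("zh-tw", "日本人"), ("ja", "彼はとても優しいです")]
normalized = [(normalize_lang(code), text) for code, text in detected]
print(merge_placeholder(normalized))
# [('zh', '我朋友是日本人'), ('ja', '彼はとても優しいです')]
```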
4. Acknowledgement
- Inspired by LlmKira/fast-langdetect
- Text segmentation depends on google/budoux
- Language detection depends on zafercavdar/fasttext-langdetect and lingua-py