split-lang

A package for splitting sentences by language: it concatenates over-split substrings based on their language, powered by:

- splitting: budoux and rule-based splitting
- language detection: fast-langdetect and lingua-py
1. Idea
Stage 1: rule-based split using punctuation
hello, how are you
-> hello | , | how are you
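Stage 1 can be pictured as a simple regex split that keeps each punctuation mark as its own piece. This is a minimal illustrative sketch, not the library's actual rule set:

```python
import re

# Minimal sketch of a Stage 1 rule-based split: cut on common punctuation
# (Latin and CJK), keeping each punctuation mark as its own substring.
def rule_split(text: str) -> list[str]:
    return [piece for piece in re.split(r"([,.!?,。、!?])", text) if piece]

print(rule_split("hello, how are you"))
# → ['hello', ',', ' how are you']
```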
Stage 2: then, over-split the text into substrings using budoux, spaces, and regex
你喜欢看アニメ吗
-> 你 | 喜欢 | 看 | アニメ | 吗

昨天見た映画はとても感動的でした
-> 昨天 | 見た | 映画 | はとても | 感動的 | でした

我朋友是日本人彼はとても優しいです
-> 我 | 朋友 | 是 | 日本人 | 彼は | とても | 優しいです

how are you
-> how | are | you
Stage 3: concatenate substrings based on their language, detected with fast-langdetect and lingua-py
你 | 喜欢 | 看 | アニメ | 吗
-> 你喜欢看 | アニメ | 吗

昨天 | 見た | 映画 | はとても | 感動的 | でした
-> 昨天 | 見た映画はとても感動的でした

我 | 朋友 | 是 | 日本人 | 彼は | とても | 優しいです
-> 我朋友是日本人 | 彼はとても優しいです

how | are | you
-> how are you
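The Stage 3 concatenation above can be sketched as a single pass that merges adjacent substrings sharing a detected language. This is a simplified illustration; the real library also handles punctuation, digits, and unknown languages:

```python
# Sketch of the Stage 3 merge: adjacent substrings whose detected language
# matches are concatenated into a single substring.
def merge_by_lang(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    merged: list[tuple[str, str]] = []
    for lang, text in tagged:
        if merged and merged[-1][0] == lang:
            merged[-1] = (lang, merged[-1][1] + text)
        else:
            merged.append((lang, text))
    return merged

print(merge_by_lang([("zh", "你"), ("zh", "喜欢"), ("zh", "看"),
                     ("ja", "アニメ"), ("zh", "吗")]))
# → [('zh', '你喜欢看'), ('ja', 'アニメ'), ('zh', '吗')]
```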
2. Motivation
- TTS (Text-To-Speech) models often fail on multi-language sentences; splitting a sentence by language yields better results
- Existing NLP toolkits (e.g. SpaCy) are helpful for parsing text in a single language, but multi-language texts like the ones below are hard to deal with:
你喜欢看アニメ吗?
Vielen Dank merci beaucoup for your help.
你最近好吗、最近どうですか?요즘 어떻게 지내요?sky is clear and sunny。
3. Usage
3.1. Installation
You can install the package using pip:

```shell
pip install split-lang
```
3.2. Basic
3.2.1. split_by_lang
```python
from split_lang import LangSplitter

lang_splitter = LangSplitter()
text = "你喜欢看アニメ吗"
substr = lang_splitter.split_by_lang(
    text=text,
)
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")
```

```
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗
```
```python
import time

from split_lang import LangSplitter

lang_splitter = LangSplitter(merge_across_punctuation=True)
texts = [
    "你喜欢看アニメ吗?我也喜欢看",
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]
time1 = time.time()
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
    print("----------------------")
time2 = time.time()
print(time2 - time1)
```

```
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗?我也喜欢看
----------------------
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目,谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます!愛してる
----------------------
0.007998466491699219
```
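Conceptually, merging across punctuation folds punctuation-only substrings into their left neighbor and then merges same-language neighbors. A rough sketch under that assumption (not the library's actual implementation):

```python
PUNCT = set(",。、!?,.!?;: ")

# Sketch of merge-across-punctuation: fold punctuation-only substrings into
# the preceding substring, then merge adjacent same-language substrings.
def merge_across_punct(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    merged: list[tuple[str, str]] = []
    for lang, text in tagged:
        if merged and all(ch in PUNCT for ch in text):
            prev_lang, prev_text = merged[-1]
            merged[-1] = (prev_lang, prev_text + text)
        elif merged and merged[-1][0] == lang:
            merged[-1] = (merged[-1][0], merged[-1][1] + text)
        else:
            merged.append((lang, text))
    return merged

print(merge_across_punct([("zh", "吗"), ("punct", "?"), ("zh", "我也喜欢看")]))
# → [('zh', '吗?我也喜欢看')]
```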
3.2.2. merge_across_digit
```python
lang_splitter.merge_across_digit = False
texts = [
    "衬衫的价格是9.15便士",
]
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
```

```
0|zh:衬衫的价格是
1|digit:9.15
2|zh:便士
```
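The digit handling can be pictured as splitting out numeric runs before language detection. An illustrative sketch, not the library's code:

```python
import re

# Sketch: pull out runs of digits (with an optional decimal point) as their
# own "digit" substrings, leaving the remaining text for language detection.
def tag_digits(text: str) -> list[tuple[str, str]]:
    parts = re.split(r"(\d+(?:\.\d+)?)", text)
    return [("digit" if re.fullmatch(r"\d+(?:\.\d+)?", p) else "text", p)
            for p in parts if p]

print(tag_digits("衬衫的价格是9.15便士"))
# → [('text', '衬衫的价格是'), ('digit', '9.15'), ('text', '便士')]
```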
3.3. Advanced
3.3.1. Usage of lang_map and default_lang (for better results)
[!IMPORTANT] Add language codes for your use case if other languages are needed

- The default lang_map is shown in the code below
- If lingua-py, fasttext, or any other language detector detects a language that is NOT included in lang_map, the substring will be set to default_lang
- If you set default_lang, or the value of a key:value pair in lang_map, to x, that substring will be merged into a neighboring substring:
  zh | x | ja -> zh | ja (x is merged into one side)
- In the example below, zh-tw is mapped to x because characters in zh and ja are sometimes detected as Traditional Chinese
- The default default_lang is x
```python
DEFAULT_LANG_MAP = {
    "zh": "zh",
    "yue": "zh",  # Cantonese
    "wuu": "zh",  # Wu Chinese
    "zh-cn": "zh",
    "zh-tw": "x",
    "ko": "ko",
    "ja": "ja",
    "de": "de",
    "fr": "fr",
    "en": "en",
    "hr": "en",
}
DEFAULT_LANG = "x"
```
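The x rule described above can be sketched as attaching unknown-language substrings to a neighbor. This is a hypothetical helper assuming the preceding substring absorbs the unknown one, not the library's internal logic:

```python
# Sketch of the `x` rule: substrings mapped to the unknown language "x" are
# absorbed by the preceding substring when one exists, so zh | x | ja
# becomes zh | ja.
def resolve_unknown(tagged: list[tuple[str, str]],
                    unknown: str = "x") -> list[tuple[str, str]]:
    resolved: list[tuple[str, str]] = []
    for lang, text in tagged:
        if lang == unknown and resolved:
            prev_lang, prev_text = resolved[-1]
            resolved[-1] = (prev_lang, prev_text + text)
        else:
            resolved.append((lang, text))
    return resolved

print(resolve_unknown([("zh", "简"), ("x", "體"), ("ja", "です")]))
# → [('zh', '简體'), ('ja', 'です')]
```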
4. Acknowledgement
- Inspired by LlmKira/fast-langdetect
- Text segmentation depends on google/budoux
- Language detection depends on zafercavdar/fasttext-langdetect and lingua-py