This is a sentence cutting tool, currently support English & Chinese

These details have not been verified by PyPI

Project description

sentence-spliter

[toc]

1. 介绍

sentence-spliter 句子切分工具，将一个长的句子，切分为短句的 List 。支持自然切分，最长句切分，最短句合并。

目前支持语言：中文，英文

2. 项目结构

├── README.md
├── automata             	# 存放切句的基本单元
│   ├── abc.py            	# 定义状态机与图的基本单元
│   ├── condition.py		 	# 切句条件
│   ├── operation.py     	    # 切句操作
│   ├── sequence.py			# 封装状态传递的数据
│   ├── state_machine.py      # 状态机
│   ├──  self.symbols.py  # 保存标点符号
│   ├── weights_list.py   # 英文短句切割 权重名单 (优先截断并列句，重句)
│   └── white_list.py     # 英文短句白名单 e.g. Mr. U.S. Ph.D
└── sentence_spliter
    ├── logic_graph.py     # 逻辑图(中文)
    ├── logic_graph_en.py  # 逻辑图(英文)
    └── spliter.py		 # 主要函数，调用切句

3. Setup

PYPI 安装

pip install sentence_spliter

4. Demo

4.1. Simple Demo

from sentence_spliter import spliter
# 中文切句调用 cut_to_sentences
paragraph = "在很久很久以前......。。... 有座山，山里有座庙啊!!!!!!!庙里竟然有个老和尚！？。。。。"
result =  spliter.cut_to_sentences(paragraph)
print(result)

# outputs
['在很久很久以前......。。...', ' 有座山，山里有座庙啊!!!!!!!', '庙里竟然有个老和尚！？。。。。']

切句支持以整片文章为输入。

如果输入太长，终端会自动显示百分比process cutting 87.5%

from sentence_spliter import spliter
paragraph = "A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."
result = spliter.cut_to_sentences_en(paragraph)

# outputs 
['A long time ago..... there is a mountain, and there is a temple in the mountain!!!', ' And here is an old monk in the temple!?....']

4.2 中文切句逻辑

中文切句工具默认有三种逻辑

4.2.1. 简单切句

from sentence_spliter.logic_graph import simple_cuter
from sentence_spliter.automata.state_machine import StateMachine
from sentence_spliter.automata.sequence import StrSequence

# -- 初始化 状态机器 -- #
cuter = StateMachine(simple_cuter())

# -- 处理句子 -- #
paragraph = "在很久很久以前1.2.3......。。... 有座山。山里有座庙啊!!!!!!!庙里竟然有个老和尚！？。。。。www.baidu.com"
sequence = cuter.run(StrSequence(paragraph))
out = sequence.sentence_list()

# -- 展示结果 -- #
print(out)

# outputs
['在很久很久以前1.2.3......。。...', ' 有座山。', '山里有座庙啊!!!!!!!', '庙里竟然有个老和尚！？。。。。', 'www.baidu.com']

基本逻辑：

遇到断句标点（。？！）准备切句：

如果短句标点出现在引号 、括号、书名号不执行切句

其他情况下执行切句

4.2.2. 长短处理切句

from sentence_spliter.logic_graph import long_short_cuter
from sentence_spliter.automata.state_machine import StateMachine
from sentence_spliter.automata.sequence import StrSequence

# -- 初始化 状态机器 -- #
cuter = StateMachine(long_short_cuter(hard_max = 128, max_len= 128, min_len = 15))

# -- 处理句子 -- #
paragraph = "在很久很久以前1.2.3......。。... 有座山。山里有座庙啊!!!!!!!庙里竟然有个老和尚！？。。。。www.baidu.com"
sequence = cuter.run(StrSequence(paragraph))
out = sequence.sentence_list()

# -- 展示结果 -- #
print(out)
['在很久很久以前1.2.3......。。...', ' 有座山。山里有座庙啊!!!!!!!', '庙里竟然有个老和尚！？。。。。', 'www.baidu.com']

from sentence_spliter.logic_graph import long_short_cuter
from sentence_spliter.automata.state_machine import StateMachine
from sentence_spliter.automata.sequence import StrSequence

# -- 初始化 状态机器 -- #
cuter = StateMachine(long_short_cuter(hard_max = 2, max_len= 2, min_len = 1))

# -- 处理句子 -- #
paragraph = "在很久很久以前1.2.3......。。... 有座山。山里有座庙啊!!!!!!!庙里竟然有个老和尚！？。。。。www.baidu.com"
sequence = cuter.run(StrSequence(paragraph))
out = sequence.sentence_list()

# -- 展示结果 -- #
print(out)
['在很', '久很', '久以', '前1', '.2', '.3', '..', '..', '..', '。。', '..', '.', ' 有', '座山', '。', '山里', '有座', '庙啊', '!!', '!!', '!!', '!', '庙里', '竟然', '有个', '老和', '尚！', '？。', '。。', '。', 'ww', 'w.', 'ba', 'id', 'u.', 'com']

需要参数：

hard_max （默认300）: 最大可接受长度，要保证所有句子（除了最后一句）<= hard_max

max_len （默认128）: 长句判断条件，如果句子大于 max_len则为长句

max_len 可以等于 hard_max

min_len （默认15）: 短句判断条件，如果句子小于min_len则为短句

基本逻辑：

如果是短句：则该句子与右边句子合并

如果是长句：

优先基于断句标点做二次切句

再

from sentence_spliter.logic_graph import long_short_cuter
from sentence_spliter.automata.state_machine import StateMachine
from sentence_spliter.automata.sequence import StrSequence

# -- 初始化 状态机器 -- #
cuter = StateMachine(long_short_cuter(hard_max = 128, max_len= 128, min_len = 15))

# -- 处理句子 -- #
paragraph = "“我和你讨论的不是一个东西，死亡率与死亡比例是不同的”“你知道么？CNN你们总是制造假新闻。。。”"
sequence = cuter.run(StrSequence(paragraph))
out = sequence.sentence_list()

# -- 展示结果 -- #
print(out)
['“我和你讨论的不是一个东西，死亡率与死亡比例是不同的”', '“你知道么？CNN你们总是制造假新闻。。。”']

# -- 处理句子 -- #
paragraph = "“我和你讨论的不是一个东西，死亡率与死亡比例是不同的”      “你知道么？CNN你们总是制造假新闻。。。”"
sequence = cuter.run(StrSequence(paragraph))
out = sequence.sentence_list()

# -- 展示结果 -- #
print(out)
['“我和你讨论的不是一个东西，死亡率与死亡比例是不同的”', '“你知道么？CNN你们总是制造假新闻。。。”']

# -- 处理句子 -- #
paragraph = "张晓风笑着说道，“我们这些年可比过去强多了！“过去吃不起饭，穿不暖衣服。 现在呢？要啥有啥！"
sequence = cuter.run(StrSequence(paragraph))
out = sequence.sentence_list()

# -- 展示结果 -- #
print(out)
['张晓风笑着说道，“我们这些年可比过去强多了！', '“过去吃不起饭，穿不暖衣服。 现在呢？要啥有啥！']

基本逻辑：

除了支持长短句以外，增加双引号处理特殊规则

右边引号后面跟有左引号，切句

中文中只有做引号，从第二个做引号开始切句

4.3 英文切句逻辑

英文切句目前只有一种逻辑

4.3.2 长短句处理

短句展示 demo

from sentence_spliter.logic_graph_en import long_cuter_en 
from sentence_spliter.automata.state_machine import StateMachine 
from sentence_spliter.automata.sequence import EnSequence  #调取英文 Sequence

# -- 初始化 状态机 -- #
__long_machine_en = StateMachine(long_cuter_en(min_len=5)) # 令句子长度不能小于10个单词

# --- 处理句子 --- #
paragraph = "A long time ago. there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."
m_input = EnSequence(paragraph)
__long_machine_en.run(m_input)

# -- output -- #
# ['A long time ago. there is a mountain, and there is a temple in the mountain!!!', ' And here is an old monk in the temple!?....']
# 可以看到 long time ago. 并没有被切开

# -- 改变句子最小长度 -- #
__long_machine_en = StateMachine(long_cuter_en(min_len=4)) # 令句子长度不能小于4个单词
paragraph = "A long time ago. there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."
m_input = EnSequence(paragraph)
__long_machine_en.run(m_input)

# -- output -- #
# ['A long time ago.', ' there is a mountain, and there is a temple in the mountain!!!', ' And here is an old monk in the temple!?....']
# 可以看到 long time ago. 被切开

长句demo

from sentence_spliter.logic_graph_en import long_cuter_en 
from sentence_spliter.automata.state_machine import StateMachine 
from sentence_spliter.automata.sequence import EnSequence  #调取英文 Sequence

# -- 初始化 状态机 -- #
__long_machine_en = StateMachine(long_cuter_en(max_len=12, min_len=3)) # 令句子长度不能小于3个单词，并且不能大于12个单词

# -- 处理句子 -- #
paragraph = "A long time ago. there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."
m_input = EnSequence(paragraph)
__long_machine_en.run(m_input)

# -- output -- #
# ['A long time ago.', ' there is a mountain, and there is a temple in the mountain!!!', ' And here is an old monk in the temple!?....']
# 可以看到中间句子并咩有被切开


# -- 改变句子最大长度 -- #
__long_machine_en = StateMachine(long_cuter_en(max_len=11, min_len=3)) # 令句子长度不能小于3个单词，并且不能大于12个单词
paragraph = "A long time ago. there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."
m_input = EnSequence(paragraph)
__long_machine_en.run(m_input)

# -- output -- #
#['A long time ago.', ' There is a mountain, ', 'and there is a temple in the mountain!!! And here is an old monk in the temple!?....']
# 可以看到中间句子被切成了两个

长句切除优先级切除对话中的句子>切除并列句子>切除从句如果以上都不满足，不切句

from sentence_spliter.logic_graph_en import long_cuter_en 
from sentence_spliter.automata.state_machine import StateMachine 
from sentence_spliter.automata.sequence import EnSequence  #调取英文 Sequence

sentence = "\"What would a stranger do here, Mrs. Price?\" he inquired angrily, remembering, with a pang, that certain new, unaccountable, engrossing emotions had quite banished Fiddy from his thoughts and notice, when he might have detected the signs of approaching illness, met them and vanquished them before their climax."
__long_machine_en = StateMachine(long_cuter_en(max_len=9, min_len=3))
m_input = EnSequence(sentence)
__long_machine_en.run(m_input)
for v in m_input.sentence_list():
    print(v)

# 输出
'''
"What would a stranger do here, Mrs. Price?" 
he inquired angrily, remembering, with a pang, 
that certain new, unaccountable, engrossing emotions had quite banished Fiddy from his thoughts and notice, 
when he might have detected the signs of approaching illness, 
met them and vanquished them before their climax.
'''
# 可以看到 优先切割 对话，其次从句， 最后是'，'

4.3.3 英文白名单

切句会照护一些特殊缩写，e.g. U.S.A 这类词是不会被切句。我们的白名单全部来自于《牛津词典第八版本》用户也可以通过修改white_list.py增加白名单内容

from sentence_spliter.spliter import cut_to_sentences_en
sentence = "Notice that U.S.A. can also be written USA, but U.S. is better with the periods. Also, we can use U.S. as a modifier (the U.S. policy on immigration) but not as a noun (He left the U.S. U.S.A.)."
out = cut_to_sentences_en(sentence)
for v in out:
    print(v)

'''
Notice that U.S.A. can also be written USA, but U.S. is better with the periods.
Also, we can use U.S. as a modifier (the U.S. policy on immigration) but not as a noun (He left the U.S. U.S.A.).
'''

5. 自定义规则

切句规则都是基于图来实现的，其中

automata/condition.py 存放图的边
automata/operation.py 存放节点

5.1. 定义一个简单的切句规则

规则如下：

如果遇到断句标点执行切句

代码：

from automata import condition, operation
from automata.state_machine import StateMachine
from automata.sequence import StrSequence

# step 1: 初始所以需要的边和节点

edges = {
    'is_end_symbol': condition.IsEndSymbolCN(),
    'is_end_state': condition.IsEndState(),
}

nodes = {
    'init': operation.Indolent(),
    'do_cut': operation.Normal(),
    'end': operation.EndState()
}

# step 2: 构建图

simple_logic = {
    nodes['init']: [
        {'edge': edges['is_end_state'],
         'node': nodes['end']},

        {'edge': edges['is_end_symbol'],
         'node': nodes['do_cut']}
    ],
    nodes['do_cut']: [
        {'edge': edges['is_end_state'],
         'node': nodes['end']},

        {'edge': None,
         'node': nodes['init']}  # else 状态

    ],
}

# step 3: 初始化状态机
machine = StateMachine(simple_logic)

# step 4: 切句测试
sentences = '万壑树参天，千山响杜鹃。山中一夜雨，树杪百重泉。汉女输橦布，巴人讼芋田。文翁翻教授，不敢倚先贤。'

out = machine.run(StrSequence(sentences))

print('\n'.join(out.sentence_list()))

输出：

万壑树参天，千山响杜鹃。
山中一夜雨，树杪百重泉。
汉女输橦布，巴人讼芋田。
文翁翻教授，不敢倚先贤。

注意：

machine 是从 Indolent状态开始便利，遍历顺序从上倒下

每个状态必须先有end_state条件否则报错

5.2. 自定义节点和边

节点要保证输入和输出都是 sequece.StrSequence就可以放在状态机中运行
边要保证输入是 sequece.StrSequence输出是bool就可以放在状态机中运行

这里不再做演示

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

2.1.8

May 11, 2022

2.1.7

Mar 28, 2022

2.1.6

Feb 15, 2022

2.1.5

Feb 15, 2022

2.1.4

Feb 11, 2022

2.1.3

Feb 10, 2022

2.1.2

Feb 8, 2022

2.1.1

Feb 8, 2022

2.1.0

Feb 8, 2022

2.0.6

Jan 24, 2022

2.0.5

Jan 24, 2022

2.0.4

Jan 24, 2022

2.0.3

Jan 24, 2022

2.0.2

Jan 24, 2022

2.0.1

Jan 24, 2022

2.0.0

Jan 24, 2022

1.2.4

Jun 10, 2021

1.2.3

May 26, 2021

1.2.2

May 26, 2021

1.2.1

May 26, 2021

1.2.0

May 25, 2021

1.1.19

Apr 30, 2021

1.1.18

Apr 29, 2021

This version

1.1.17

Apr 26, 2021

1.1.16

Apr 26, 2021

1.1.15

Apr 16, 2021

1.1.14

Apr 16, 2021

1.1.13

Apr 16, 2021

1.1.12

Apr 16, 2021

1.1.11

Apr 12, 2021

1.1.10

Apr 10, 2021

1.1.9

Apr 9, 2021

1.1.8

Apr 9, 2021

1.1.7

Apr 9, 2021

1.1.6

Apr 8, 2021

1.1.5

Apr 8, 2021

1.1.4

Apr 6, 2021

1.1.3

Mar 19, 2021

1.1.2

Mar 1, 2021

1.1.1

Mar 1, 2021

1.1.0

Feb 7, 2021

1.0.13

Oct 28, 2020

1.0.12

Oct 28, 2020

1.0.11

Oct 27, 2020

1.0.10

Oct 27, 2020

1.0.9

Oct 27, 2020

1.0.8

Oct 27, 2020

1.0.7

Oct 27, 2020

1.0.6

Oct 27, 2020

1.0.5

Oct 27, 2020

1.0.4

Aug 21, 2020

1.0.3

Aug 19, 2020

1.0.2

Aug 19, 2020

1.0.1

Aug 19, 2020

1.0.0

Aug 18, 2020

0.1.12

Jul 21, 2020

0.1.11

Jul 20, 2020

0.1.10

Jul 17, 2020

0.1.9

Jul 15, 2020

0.1.8

Jul 11, 2020

0.1.7

Jul 11, 2020

0.1.5

Jul 10, 2020

0.1.4

Jun 18, 2020

0.1.3

Jun 17, 2020

0.1.2

Jun 17, 2020

0.1.1

Jun 12, 2020

0.1.0

Jun 12, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sentence-spliter-1.1.17.tar.gz (34.8 kB view details)

Uploaded Apr 26, 2021 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sentence_spliter-1.1.17-py3-none-any.whl (39.4 kB view details)

Uploaded Apr 26, 2021 Python 3

File details

Details for the file sentence-spliter-1.1.17.tar.gz.

File metadata

Download URL: sentence-spliter-1.1.17.tar.gz
Upload date: Apr 26, 2021
Size: 34.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.0

File hashes

Hashes for sentence-spliter-1.1.17.tar.gz
Algorithm	Hash digest
SHA256	`fa4815af2f86a09a55a37ef410f49ab0eb87abb6794f05864fd451cd76863233`
MD5	`07597fe945e9dc41f36ff51263c36b2a`
BLAKE2b-256	`978a094abd71ff19cc4cb7cf0fb18a316cf32ca557a160626fa7d211027b0516`

See more details on using hashes here.

File details

Details for the file sentence_spliter-1.1.17-py3-none-any.whl.

File metadata

Download URL: sentence_spliter-1.1.17-py3-none-any.whl
Upload date: Apr 26, 2021
Size: 39.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.0

File hashes

Hashes for sentence_spliter-1.1.17-py3-none-any.whl
Algorithm	Hash digest
SHA256	`452333ab05bd115d16d9449f542e9dff96562b0f33b8eda704f4b7233401c8c5`
MD5	`d9f57b82f9a70cb7654808c3cd05e4c3`
BLAKE2b-256	`35c53e7d02649523f54f7d0841ff03866aeb06b555aa60c51d9ce74d941a83ec`

See more details on using hashes here.

sentence-spliter 1.1.17

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

sentence-spliter

1. 介绍

2. 项目结构

3. Setup

4. Demo

4.1. Simple Demo

4.2 中文切句逻辑

4.2.1. 简单切句

4.2.2. 长短处理切句

4.3 英文切句逻辑

4.3.2 长短句处理

4.3.3 英文白名单

5. 自定义规则

5.1. 定义一个简单的切句规则

5.2. 自定义节点和边

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes