A sentence-splitting tool that currently supports English and Chinese.

These details have not been verified by PyPI

Project description

sentence-spliter


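Overview

sentence-spliter is a sentence-splitting tool: it splits a long sentence or a paragraph into a list of shorter sentences. It supports natural splitting, mid-sentence splitting, and more.

Currently supported languages: Chinese, English, Korean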

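Architecture

Project structure

.
├── doc                         # supplementary documentation
├── LICENSE                     # license
├── MANIFEST.in                 # lists extra files to include during setup
├── pyproject.toml              # project build configuration
├── README.md
├── requirements.txt
├── sentence_spliter
│   ├── architect               # basic building blocks for sentence splitting
│   ├── cutter4grammar          # splitting customized for grammar correction
│   ├── en_cutter               # English sentence splitting
│   ├── test                    # unit tests
│   ├── utility                 # other utility functions
│   └── zh_cutter               # Chinese sentence splitting
└── setup.py                    # package setup script

For a more detailed directory layout, see the link.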

Setup

Install from git

git clone git@git.duowan.com:ai/nlp/sentence-spliter.git
cd sentence-spliter
pip install -U pip
pip install -r requirements.txt

Install from PyPI

pip install sentence_spliter

API

Request example

curl --location --request POST 'https://rosetta-nlp-api.duowan.com/api/v1/sentence-spliter/en-sentence-spliter' \
--header 'Content-Type: application/json' \
--data-raw '{
  "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] ,
  "options": {
      "max_len": 30,
      "min_len": 6
  }
} '

Request

{
  "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] ,
  "options": {
      "max_len": 30,
      "min_len": 6
  }
}
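The same request can be assembled from Python. The following is a minimal sketch using only the standard library; the URL and field names are copied from the curl example, while the helper name `build_request` is hypothetical and the endpoint may only be reachable from inside the maintainers' network.

```python
import json
import urllib.request

# Endpoint copied from the curl example; reachability is an assumption.
API_URL = "https://rosetta-nlp-api.duowan.com/api/v1/sentence-spliter/en-sentence-spliter"


def build_request(paragraphs, max_len=30, min_len=6):
    """Build (but do not send) the POST request for the given paragraphs."""
    payload = {
        "paragraphs": list(paragraphs),
        "options": {"max_len": max_len, "min_len": min_len},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request(["A long time ago..... there is a mountain."])
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.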

Response

{
    "code": 0,
    "data": {
        "paragraphs": [
            "A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."
        ],
        "sub_sentences": [
            [
                [
                    "A long time ago..... there is a mountain, and there is a temple in the mountain!!!"
                ],
                [
                    " And here is an old monk in the temple!?...."
                ]
            ]
        ],
        "version": "1.0.0"
    },
    "message": "success"
}

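Response parameters

Field	Type	Description
paragraphs	List[String]	The list of paragraphs to split
sub_sentences	List[List[List[String]]]	The sub-sentences produced by splitting, grouped per paragraph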
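Because sub_sentences is nested, callers usually want one flat list of sentences per paragraph. A minimal sketch follows, assuming the response has the shape of the sample response and that `code == 0` signals success (the sample pairs it with the message "success"); the helper name `flatten_sub_sentences` is hypothetical.

```python
def flatten_sub_sentences(response):
    """Return one flat list of stripped sentences per input paragraph."""
    if response.get("code") != 0:
        raise ValueError(f"API error: {response.get('message')}")
    result = []
    for paragraph_groups in response["data"]["sub_sentences"]:
        # Each paragraph holds a list of groups; each group holds sentence strings.
        sentences = [s.strip() for group in paragraph_groups for s in group]
        result.append(sentences)
    return result


# Same structure as the sample response in this document.
sample = {
    "code": 0,
    "data": {
        "paragraphs": ["..."],
        "sub_sentences": [
            [
                ["A long time ago..... there is a mountain, and there is a temple in the mountain!!!"],
                [" And here is an old monk in the temple!?...."],
            ]
        ],
        "version": "1.0.0",
    },
    "message": "success",
}

flat = flatten_sub_sentences(sample)
```

Note that the service keeps the leading space of the second sentence; the sketch strips it, which may or may not be what a given pipeline wants.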

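For more on the API, see the API documentation.

Important: a change to the version field affects whether the Guangdong department needs to rerun its pipeline (see the link).

State machine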

Data

The two main auxiliary data files are:

Whitelist: /white_list.txt
Weight table: /weights_list.txt

Format

Whitelist format:

Dr.
U!S!A!
No.
abbr.
Brig.
Ltd.
b.
N.
hr.

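One string per line. When the algorithm scans an end-of-sentence mark that belongs to a string recorded in the whitelist, that mark is not treated as a sentence boundary.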
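The whitelist rule can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the function names, the set of end marks, and the word-boundary check are all hypothetical.

```python
END_MARKS = {".", "!", "?"}  # assumed set of end-of-sentence marks


def load_whitelist(lines):
    """Read one whitelisted string per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def is_sentence_end(text, index, whitelist):
    """True if text[index] is an end mark that is not part of a whitelisted string."""
    if text[index] not in END_MARKS:
        return False
    for entry in whitelist:
        # Try every alignment of the entry that would cover position `index`.
        for offset, ch in enumerate(entry):
            if ch != text[index]:
                continue
            start = index - offset
            if start < 0 or not text.startswith(entry, start):
                continue
            # Require a word boundary so "b." does not match inside "crab."
            if start == 0 or not text[start - 1].isalnum():
                return False
    return True


whitelist = load_whitelist(["Dr.", "U!S!A!", "b."])
text = "Dr. Lee left."
first_dot = is_sentence_end(text, 2, whitelist)              # the dot in "Dr."
last_dot = is_sentence_end(text, len(text) - 1, whitelist)   # the final dot
```

Here `first_dot` is False because the dot belongs to the whitelisted "Dr.", while `last_dot` is True.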

Weight table:

and 10
or 10
but 10
even 10
however 10
whenever 10
whatever 10
although 10
thought 10

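Each line has the format word + weight. It gives the weight of a word that signals a contrast or a continuation of the surrounding context, and is used when a sentence has to be split internally.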
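One way such a table might be used is to pick the heaviest connective as the in-sentence split point. The sketch below assumes exactly that; how the library actually combines weights with the length limits is not described here, and both function names are hypothetical.

```python
def load_weights(lines):
    """Parse 'word weight' lines into a dict, ignoring malformed lines."""
    weights = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            weights[parts[0].lower()] = int(parts[1])
    return weights


def best_split_index(words, weights):
    """Return the index of the word to split before, or None if no candidate."""
    best_index, best_weight = None, 0
    for i, word in enumerate(words):
        if i == 0:
            continue  # splitting before the first word would leave an empty part
        weight = weights.get(word.lower().strip(",.;"), 0)
        if weight > best_weight:  # ties keep the earliest candidate
            best_index, best_weight = i, weight
    return best_index


weights = load_weights(["and 10", "but 10"])
words = "I like chicken but she likes fish".split()
split_at = best_split_index(words, weights)
left, right = words[:split_at], words[split_at:]
```

With the inputs shown, the split lands before "but", giving "I like chicken" and "but she likes fish".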

Introduction

The following sentence is used as the sample:

sentence = 'I like chicken. I like chicken.'

Sequence

The Sequence module first converts the sentence to be split into a special sequence format.

graph LR
A[I like chicken.] -->B[I]
subgraph sequence
    B -->C[<space>]
    C -->D[like]
    D --> E[<space>]
    E --> F[chicken]
    F --> G[.]
end

The sequence is then passed directly into the state machine.
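The token layout in the diagram (words, single spaces, and punctuation as separate items) can be approximated with a regular expression. This is only a sketch of the idea; the real Sequence class may tokenize differently, and `to_sequence` is a hypothetical name.

```python
import re

# A run of word characters, OR one whitespace character, OR one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|\s|[^\w\s]")


def to_sequence(sentence):
    """Split a sentence into word, space, and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


tokens = to_sequence("I like chicken.")
# tokens == ['I', ' ', 'like', ' ', 'chicken', '.']
```

Joining the tokens back together reproduces the input exactly, so no characters are lost by the conversion.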

Condition and Operation

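The Condition module represents a condition or check that is evaluated before an action. If the condition holds, the corresponding action runs; otherwise the action for the unmet condition runs.

The Operation module represents an action, also called an operation.

graph LR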
A{Condition} -->|True| B[Operation1]
A -->|False| C[Operation2]
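The diagram can be read as a single decision node. Below is a minimal sketch with plain functions, assuming a dictionary as the shared state; all names are hypothetical and the library's own Condition and Operation classes are likely richer.

```python
def is_end_mark(state):
    """Condition: does the current token end a sentence?"""
    return state["tokens"][state["pos"]] in {".", "!", "?"}


def cut(state):
    """Operation1: take the token and close the current sentence."""
    state["current"].append(state["tokens"][state["pos"]])
    state["sentences"].append("".join(state["current"]))
    state["current"] = []
    state["pos"] += 1


def keep(state):
    """Operation2: take the token and keep the sentence open."""
    state["current"].append(state["tokens"][state["pos"]])
    state["pos"] += 1


def run_node(state, condition, on_true, on_false):
    """Evaluate the condition once and run the matching operation."""
    operation = on_true if condition(state) else on_false
    operation(state)


state = {"tokens": ["Hi", "."], "pos": 0, "current": [], "sentences": []}
run_node(state, is_end_mark, cut, keep)  # "Hi" is not an end mark, so keep
run_node(state, is_end_mark, cut, keep)  # "." is an end mark, so cut
```

After the two steps, `state["sentences"]` holds the single sentence "Hi." and the working buffer is empty.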

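Condition&Operation module

A module made up of a series of the Condition&Operation units shown in the diagram above.

It represents a chain of checks and actions combined and stacked together.

Going one step further:

graph LR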
A{Condition1} -->|True| B[Condition&Operation1]
A -->|False| C[Condition&Operation2]
B -->D[Condition&Operation3]
C -->E[Condition&Operation4]

Logic

The Condition&Operation modules above make up the whole Logic.

Stacking all Condition&Operation modules together yields the complete logic graph.
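To illustrate how nested Condition&Operation units can add up to a complete logic, here is a small self-contained sketch. It is not the library's SimpleLogic: the `Node` class, the two conditions, and the rule "cut only after the last mark in a run of end marks" are assumptions chosen for the example.

```python
import re

END_MARKS = {".", "!", "?"}  # assumed set of end-of-sentence marks


def tokenize(text):
    return re.findall(r"\w+|\s|[^\w\s]", text)


class Node:
    """One Condition&Operation unit: a condition plus a branch per outcome."""

    def __init__(self, condition, on_true, on_false):
        self.condition = condition
        self.on_true = on_true    # an operation (callable) or another Node
        self.on_false = on_false

    def run(self, state):
        branch = self.on_true if self.condition(state) else self.on_false
        if isinstance(branch, Node):
            branch.run(state)     # nested unit: keep evaluating
        else:
            branch(state)         # leaf: perform the operation


def at_end_mark(state):
    return state["tokens"][state["pos"]] in END_MARKS


def next_is_end_mark(state):
    nxt = state["pos"] + 1
    return nxt < len(state["tokens"]) and state["tokens"][nxt] in END_MARKS


def append(state):
    state["current"].append(state["tokens"][state["pos"]])
    state["pos"] += 1


def append_and_cut(state):
    append(state)
    state["sentences"].append("".join(state["current"]))
    state["current"] = []


# End mark? If more end marks follow, keep collecting; otherwise cut.
LOGIC = Node(at_end_mark, Node(next_is_end_mark, append, append_and_cut), append)


def run_logic(text):
    state = {"tokens": tokenize(text), "pos": 0, "current": [], "sentences": []}
    while state["pos"] < len(state["tokens"]):
        LOGIC.run(state)
    if state["current"]:  # flush text that has no closing end mark
        state["sentences"].append("".join(state["current"]))
    return state["sentences"]


result = run_logic("I like chicken. I like chicken.")
```

For the sample sentence this yields two sentences, the second keeping its leading space, which mirrors the leading space seen in the API sample response.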

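Usage

Import the required packages

from sentence_spliter.en_cutter.en_sequence import Sequence                 # Sequence class of the English splitting framework
from sentence_spliter.en_cutter.logic import SimpleLogic, LongShortLogic    # logic classes of the English splitting framework

Load the sentence into a Sequence

sentence = 'I like chicken. I like chicken.'                                # sample sentence
seq = Sequence(sentence)                                                    # convert it to a Sequence
simple_logic = SimpleLogic()                                                # natural splitting logic
max_len, min_len = 30, 6                                                    # example limits, same values as the API request example
long_logic = LongShortLogic(max_len=max_len, min_len=min_len)               # long/short sentence splitting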

Run the split

simple_result = simple_logic.run(seq, debug=True)
long_results = long_logic.run(seq, debug=True)

Packaging and upload

Open setup.py and update the relevant settings (version, etc.)

from setuptools import setup, find_packages

setup(
    name="sentence-spliter",
    version="X.X.X",
    author="<your name>",
    author_email="<your email>",
	...
)

./bin/package.sh

Enter your username and password

Enter your username: <your username>
Enter your password: <your password>

Then wait for the upload to finish.

A detailed tutorial is available at the link.

Project details


Release history

2.1.8 (May 11, 2022, this version)
2.1.7 (Mar 28, 2022)
2.1.6 (Feb 15, 2022)
2.1.5 (Feb 15, 2022)
2.1.4 (Feb 11, 2022)
2.1.3 (Feb 10, 2022)
2.1.2 (Feb 8, 2022)
2.1.1 (Feb 8, 2022)
2.1.0 (Feb 8, 2022)
2.0.6 (Jan 24, 2022)
2.0.5 (Jan 24, 2022)
2.0.4 (Jan 24, 2022)
2.0.3 (Jan 24, 2022)
2.0.2 (Jan 24, 2022)
2.0.1 (Jan 24, 2022)
2.0.0 (Jan 24, 2022)
1.2.4 (Jun 10, 2021)
1.2.3 (May 26, 2021)
1.2.2 (May 26, 2021)
1.2.1 (May 26, 2021)
1.2.0 (May 25, 2021)
1.1.19 (Apr 30, 2021)
1.1.18 (Apr 29, 2021)
1.1.17 (Apr 26, 2021)
1.1.16 (Apr 26, 2021)
1.1.15 (Apr 16, 2021)
1.1.14 (Apr 16, 2021)
1.1.13 (Apr 16, 2021)
1.1.12 (Apr 16, 2021)
1.1.11 (Apr 12, 2021)
1.1.10 (Apr 10, 2021)
1.1.9 (Apr 9, 2021)
1.1.8 (Apr 9, 2021)
1.1.7 (Apr 9, 2021)
1.1.6 (Apr 8, 2021)
1.1.5 (Apr 8, 2021)
1.1.4 (Apr 6, 2021)
1.1.3 (Mar 19, 2021)
1.1.2 (Mar 1, 2021)
1.1.1 (Mar 1, 2021)
1.1.0 (Feb 7, 2021)
1.0.13 (Oct 28, 2020)
1.0.12 (Oct 28, 2020)
1.0.11 (Oct 27, 2020)
1.0.10 (Oct 27, 2020)
1.0.9 (Oct 27, 2020)
1.0.8 (Oct 27, 2020)
1.0.7 (Oct 27, 2020)
1.0.6 (Oct 27, 2020)
1.0.5 (Oct 27, 2020)
1.0.4 (Aug 21, 2020)
1.0.3 (Aug 19, 2020)
1.0.2 (Aug 19, 2020)
1.0.1 (Aug 19, 2020)
1.0.0 (Aug 18, 2020)
0.1.12 (Jul 21, 2020)
0.1.11 (Jul 20, 2020)
0.1.10 (Jul 17, 2020)
0.1.9 (Jul 15, 2020)
0.1.8 (Jul 11, 2020)
0.1.7 (Jul 11, 2020)
0.1.5 (Jul 10, 2020)
0.1.4 (Jun 18, 2020)
0.1.3 (Jun 17, 2020)
0.1.2 (Jun 17, 2020)
0.1.1 (Jun 12, 2020)
0.1.0 (Jun 12, 2020)

Download files


Source Distribution

sentence-spliter-2.1.8.tar.gz (29.6 kB)

Uploaded May 11, 2022 Source

Built Distribution


sentence_spliter-2.1.8-py3-none-any.whl (40.0 kB)

Uploaded May 11, 2022 Python 3

File details

Details for the file sentence-spliter-2.1.8.tar.gz.

File metadata

Download URL: sentence-spliter-2.1.8.tar.gz
Upload date: May 11, 2022
Size: 29.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.7.0

File hashes

Hashes for sentence-spliter-2.1.8.tar.gz
Algorithm	Hash digest
SHA256	`c9a41b9b3fd79a822eba662b96fbf1fdfee5d63ff600aceaf6d5536f4cea388e`
MD5	`f8878a7230ba3551ca89b460cd346e62`
BLAKE2b-256	`b81aad39b7a6ca2588352824775570348db8a0db8791ecdef69e6dc9c1c98762`


File details

Details for the file sentence_spliter-2.1.8-py3-none-any.whl.

File metadata

Download URL: sentence_spliter-2.1.8-py3-none-any.whl
Upload date: May 11, 2022
Size: 40.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.7.0

File hashes

Hashes for sentence_spliter-2.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`24262976e7a772caa5881ebe4a469a50b566bff3e2e4144a95ec2cc87951b6c0`
MD5	`686e04749238edbe64772520c2ce0c1d`
BLAKE2b-256	`0c898547773fd13f5ba8d4c919cee342910a591ff1a9136fea16d84e66278f1b`

