Python interface to a comprehensive collection of Chinese stopwords
pystopwords
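Introduction
A comprehensive collection of Chinese stopwords with a Python interface. You can choose among publicly available stopword lists from Baidu, Harbin Institute of Technology (HIT), the Chinese Academy of Sciences, and others.
Only Chinese is supported at present; multi-language support may be added in the future.
Installation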
pip install pystopwords
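To confirm that the installation succeeded, the installed version can be read from the package metadata. This is a minimal sketch that uses only the standard library (`importlib.metadata`, available in Python 3.8 and later); `installed_version` is a hypothetical helper name and not part of pystopwords.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent.

    Hypothetical helper for illustration; not part of pystopwords.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if pystopwords is installed, otherwise None.
print(installed_version('pystopwords'))
```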
Usage
from pystopwords import stopwords
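The stopwords function returns a set of stopwords and takes two parameters:
- langs: string, the language; currently only Chinese (zh) is supported
- source: string, the stopword source; currently supported values:
  - baidu: Baidu stopword list
  - hit: Harbin Institute of Technology (HIT) stopword list
  - ict: stopword list of the Institute of Computing Technology, Chinese Academy of Sciences
  - scu: stopword list of the Machine Intelligence Laboratory, Sichuan University
  - cn: a widely circulated Chinese stopword list of unknown origin
  - marimo: Chinese stopwords from the Marimo multi-lingual stopwords collection
  - iso: Chinese stopwords from Stopwords ISO
  - all: the union of all the stopword lists above
The default call is stopwords(langs='zh', source='all').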
from pystopwords import stopwords
import jieba
# The default arguments are:
# all_stopwords = stopwords(langs='zh', source='all')
all_stopwords = stopwords()
# A different source can be selected
baidu_stopwords = stopwords(source='baidu')
hit_stopwords = stopwords(source='hit')
word_list = jieba.lcut('我想找一个简单好用的停用词典')
word_list_drop_stopwords = [word for word in word_list if word not in all_stopwords]
print(word_list_drop_stopwords)
# Stdout: ['想', '找', '简单', '好用', '停用', '词典']
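The filtering step in the example above can be wrapped in a small helper. The sketch below is illustrative only: `remove_stopwords` is a hypothetical helper name, not part of pystopwords, and the small `demo_tokens` and `demo_stopwords` values are stand-ins (assumptions) for the output of `jieba.lcut(...)` and `stopwords()`, so the snippet runs without either package installed.

```python
from typing import Iterable, List, Set


def remove_stopwords(tokens: Iterable[str], stopword_set: Set[str]) -> List[str]:
    """Return the tokens that are not in stopword_set, preserving order.

    Hypothetical helper for illustration; not part of pystopwords.
    """
    return [token for token in tokens if token not in stopword_set]


# Stand-in data (an assumption): in real use, pass jieba.lcut(text) as the
# tokens and the set returned by stopwords() as the stopword set.
demo_tokens = ['我', '想', '找', '一个', '简单', '好用', '的', '停用', '词典']
demo_stopwords = {'我', '一个', '的'}

filtered = remove_stopwords(demo_tokens, demo_stopwords)
print(filtered)
```

Because the stopword list is a set, each membership test takes constant time on average, so the filter scales linearly with the number of tokens.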
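Sources
Name | Source | Source URL | Count | Notes |
---|---|---|---|---|
ict | Institute of Computing Technology, Chinese Academy of Sciences | | 1207 | Most links circulating online are dead; the list contains 1207 entries, not the 1208 often quoted online |
baidu | Baidu | | 1429 | |
hit | Harbin Institute of Technology | | 767 | |
scu | Machine Intelligence Laboratory, Sichuan University | | 976 | |
cn | Unknown | | 746 | |
marimo | koheiw | https://github.com/koheiw/marimo | 387 | The original file has a finer-grained classification scheme |
iso | stopwords-iso | https://github.com/stopwords-iso/stopwords-iso | 794 | The original file supports many languages |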
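Because `all` is described as the union of the individual lists, the relationship between sources can be explored with ordinary set operations. The following is a minimal sketch under stated assumptions: `summarize_sources` and its `load` parameter are hypothetical names, and the toy dictionary stands in for calls such as `stopwords(source='baidu')` so that the snippet runs without the package. The numbers it prints describe the toy data, not the real lists.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def summarize_sources(
    load: Callable[[str], Set[str]], names: Iterable[str]
) -> Tuple[Dict[str, int], Set[str]]:
    """Return per-source sizes and the union of all loaded sets.

    Hypothetical helper for illustration; not part of pystopwords.
    """
    sizes: Dict[str, int] = {}
    union: Set[str] = set()
    for name in names:
        words = load(name)
        sizes[name] = len(words)
        union |= words
    return sizes, union


# Toy stand-in data (an assumption, not the real stopword lists).
toy_lists = {
    'baidu': {'的', '了', '和'},
    'hit': {'的', '是', '在'},
}

sizes, union = summarize_sources(lambda name: toy_lists[name], toy_lists)
print(sizes)
print(len(union))
```

With pystopwords installed, `load` could plausibly be `lambda name: stopwords(source=name)`. The union is smaller than the sum of the individual sizes whenever the lists share entries, as the toy data shows (two lists of 3 words, 5 distinct words in total).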