
A comprehensive Chinese stopword collection with a Python interface


pystopwords

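Introduction

A comprehensive collection of Chinese stopwords with a Python interface. You can choose among publicly available stopword lists from Baidu, Harbin Institute of Technology (HIT), the Chinese Academy of Sciences, and others.

The package currently focuses on Chinese only; multilingual support may be added in the future.

Installation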

pip install pystopwords

Usage

from pystopwords import stopwords

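The stopwords function returns a set of stopwords. It takes two parameters:

  • langs: string, the language to load; only Chinese (zh) is currently supported
  • source: string, the origin of the stopword list; the supported values are
    • baidu: the Baidu stopword list
    • hit: the Harbin Institute of Technology (HIT) stopword list
    • ict: the stopword list from the Institute of Computing Technology, Chinese Academy of Sciences
    • scu: the stopword list from the Machine Intelligence Laboratory, Sichuan University
    • cn: a widely circulated Chinese stopword list of unknown origin
    • marimo: the Chinese stopwords in the Marimo multi-lingual stopwords collection
    • iso: the Chinese stopwords in Stopwords ISO
    • all: the union of all the lists above

The default call is stopwords(langs='zh', source='all').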
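The text above does not say what happens when source receives a value outside this list. The sketch below checks the name up front against the documented values. KNOWN_SOURCES and validate_source are hypothetical helpers written for this illustration; they are not part of pystopwords.

```python
# Source names copied from the parameter list above.
# KNOWN_SOURCES and validate_source are hypothetical, not part of pystopwords.
KNOWN_SOURCES = ('baidu', 'hit', 'ict', 'scu', 'cn', 'marimo', 'iso', 'all')


def validate_source(source='all'):
    """Return source unchanged if it is a documented name, else raise ValueError."""
    if source not in KNOWN_SOURCES:
        raise ValueError(
            f"unknown source {source!r}; expected one of {', '.join(KNOWN_SOURCES)}"
        )
    return source


checked = validate_source('hit')
print(checked)
```

The checked name can then be passed on, for example stopwords(source=validate_source(name)).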

from pystopwords import stopwords
import jieba

# The defaults are equivalent to:
# all_stopwords = stopwords(langs='zh', source='all')
all_stopwords = stopwords()

# A specific source can be selected instead
baidu_stopwords = stopwords(source='baidu')
hit_stopwords = stopwords(source='hit')

# Tokenize a sentence with jieba, then drop every token found in the stopword set
word_list = jieba.lcut('我想找一个简单好用的停用词典')
word_list_drop_stopwords = [word for word in word_list if word not in all_stopwords]
print(word_list_drop_stopwords)

# Output: ['想', '找', '简单', '好用', '停用', '词典']
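
Because stopwords returns an ordinary Python set, it can be combined with domain-specific words through set union before filtering. The following is a minimal sketch of that idea. The helper remove_stopwords, the small stand-in set, and the hand-written token list are assumptions made so the example runs without the package installed; they are not part of pystopwords.

```python
def remove_stopwords(tokens, stopword_set, extra=()):
    """Return the tokens that are in neither stopword_set nor extra, keeping order."""
    blocked = set(stopword_set) | set(extra)
    return [token for token in tokens if token not in blocked]


# Stand-in set so the sketch runs on its own. With the package installed,
# this could be replaced by: base = stopwords(source='hit')
base = {'我', '一个', '的'}

# Hand-written token list for illustration (not actual jieba output).
tokens = ['我', '想', '找', '一个', '简单', '好用', '的', '停用', '词典']

filtered = remove_stopwords(tokens, base, extra={'想'})
print(filtered)  # ['找', '简单', '好用', '停用', '词典']
```

Building the blocked set once, outside the list comprehension, keeps each membership test a constant-time set lookup.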

Sources

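| Name | Source | Source URL | Count | Notes |
| --- | --- | --- | --- | --- |
| ict | Institute of Computing Technology, Chinese Academy of Sciences | | 1207 | Most links circulating online are dead; the list contains 1207 words, not the 1208 commonly reported |
| baidu | Baidu | | 1429 | |
| hit | Harbin Institute of Technology | | 767 | |
| scu | Machine Intelligence Laboratory, Sichuan University | | 976 | |
| cn | Unknown origin | | 746 | |
| marimo | koheiw | https://github.com/koheiw/marimo | 387 | The original file has a more fine-grained classification scheme |
| iso | stopwords-iso | https://github.com/stopwords-iso/stopwords-iso | 794 | The original file supports many languages |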
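
Since all is described as the union of the individual lists, any words shared between sources are counted once in it. The sketch below shows one way to measure the overlap between two sources with set operations. compare_sources is a hypothetical helper, and the two small sets are stand-ins so the example runs without the package; real inputs would be, for example, stopwords(source='baidu') and stopwords(source='hit').

```python
def compare_sources(a, b):
    """Summarize how two stopword sets overlap."""
    a, b = set(a), set(b)
    return {
        'only_a': len(a - b),   # words found only in the first set
        'only_b': len(b - a),   # words found only in the second set
        'shared': len(a & b),   # words present in both sets
        'union': len(a | b),    # size of the combined set
    }


# Stand-in data for illustration only.
source_a = {'的', '了', '和'}
source_b = {'的', '了', '是', '在'}

summary = compare_sources(source_a, source_b)
print(summary)  # {'only_a': 1, 'only_b': 2, 'shared': 2, 'union': 5}
```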
