nlplab dictionary and stopwords module

Project description

Self-collected training data, dictionaries, and stopwords, bundled into a package so that no one has to reinvent the wheel.

Usage

Install: pip install NCHU_nlptoolkit

  1. Remove stopwords and segment the text (note: the lab dictionary is loaded automatically when stopwords are removed):
from NCHU_nlptoolkit.cut import *

# minword is the minimum segment length in characters (the shortest word kept after segmentation)

# default: input_string is the text to segment; returns the segments only
cut_sentence(input_string, flag=False, minword=1)

# flag=True: returns the segments together with their part-of-speech tags
cut_sentence(input_string, flag=True, minword=1)
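
For example, raising minword drops short segments. A minimal sketch, reusing the demo sentence from below (illustrative only, since the exact output depends on the bundled lab dictionary and stopword list):

    from NCHU_nlptoolkit.cut import *

    doc = '首先,對區塊鏈需要的第一個理解是,它是一種「將資料寫錄的技術」。'

    # minword=2 keeps only segments of at least two characters,
    # so single-character tokens are filtered out of the result.
    print(cut_sentence(doc, flag=False, minword=2))
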
  2. Load the law dictionary (see the sketch after the demos below):
    from NCHU_nlptoolkit.cut import *
    
    load_law_dict()
    
  3. Demo:
  • zh:

    >>> doc = '首先,對區塊鏈需要的第一個理解是,它是一種「將資料寫錄的技術」。'
    >>> cut_sentence(doc, flag=True)
    [('區塊鏈', 'n'), ('需要', 'n'), ('第一個', 'm'), ('理解', 'n'), ('一種', 'm'), ('資料', 'n'), ('寫錄', 'v'), ('技術', 'n')]
    
  • en:

    >>> doc = 'The City of New York, often called New York City (NYC) or simply New York, is the most populous city in the United States.'
    >>> list(cut_sentence_en(doc))
    ['City', 'New York', 'called', 'New York City', 'NYC', 'simply', 'New York', 'populous', 'city', 'United States']
    
    >>> list(cut_sentence_en(doc, flag=True))
    [('City', 'NNP'), ('New York', 'NNP/NNP'), ('called', 'VBN'), ('New York City', 'NNP/NNP/NNP'), ('NYC', 'NN'), ('simply', 'RB'), ('New York', 'NNP/NNP'), ('populous', 'JJ'), ('city', 'NN'), ('United States', 'NNP/NNS')]
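
As noted in step 2, here is a minimal sketch of loading the law dictionary before segmenting; the example sentence is illustrative, and the resulting segments depend on the bundled dictionaries:

    from NCHU_nlptoolkit.cut import *

    # Extend the segmentation vocabulary with the bundled law dictionary
    # so that legal terms are kept as single segments.
    load_law_dict()

    # Subsequent calls to cut_sentence use the extended vocabulary.
    print(cut_sentence('著作權法保護著作人的權利。', flag=True, minword=2))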
    

