nlplab dictionary, stopwords module
Project description
Training data, dictionaries, and stopwords that we collected ourselves, bundled into a package so that nobody has to reinvent the wheel.
Usage
Install: pip install NCHU_nlptoolkit
- Remove stopwords and segment the text into words. Note: the lab dictionary is loaded automatically when stopwords are removed.
from NCHU_nlptoolkit.cut import *
# doc is the input string to segment
# minword is the minimum word length in characters (the fewest characters a segmented word may have)
# default
cut_sentence(doc, flag=False, minword=1)
# return segmentation with part of speech.
cut_sentence(doc, flag=True, minword=1)
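To make the roles of the stopword list and `minword` concrete, here is a minimal sketch of the filtering step applied to tokens that have already been segmented. It is not the package's implementation: `filter_tokens`, the sample token list, and the tiny stopword set are hypothetical names and data chosen only for illustration.

```python
def filter_tokens(tokens, stopwords, minword=1):
    """Drop stopwords and tokens shorter than `minword` characters.

    Hypothetical helper for illustration; not part of NCHU_nlptoolkit.
    """
    return [tok for tok in tokens if tok not in stopwords and len(tok) >= minword]


# Made-up sample data: a pre-segmented sentence and a tiny stopword set.
tokens = ['首先', '對', '區塊鏈', '需要', '的', '第一個', '理解', '是']
stopwords = {'首先', '對', '的', '是'}

print(filter_tokens(tokens, stopwords))             # stopwords removed
print(filter_tokens(tokens, stopwords, minword=3))  # also drops 2-character words
```

Raising `minword` in this sketch only tightens the length filter; the stopword check is unaffected.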
- Load the legal dictionary
from NCHU_nlptoolkit.cut import *
load_law_dict()
- demo:
  - zh:
>>> doc = '首先,對區塊鏈需要的第一個理解是,它是一種「將資料寫錄的技術」。'
>>> cut_sentence(doc, flag=True)
[('區塊鏈', 'n'), ('需要', 'n'), ('第一個', 'm'), ('理解', 'n'), ('一種', 'm'), ('資料', 'n'), ('寫錄', 'v'), ('技術', 'n')]
  - en:
>>> doc = 'The City of New York, often called New York City (NYC) or simply New York, is the most populous city in the United States.'
>>> list(cut_sentence_en(doc))
['City', 'New York', 'called', 'New York City', 'NYC', 'simply', 'New York', 'populous', 'city', 'United States']
>>> list(cut_sentence_en(doc, flag=True))
[('City', 'NNP'), ('New York', 'NNP/NNP'), ('called', 'VBN'), ('New York City', 'NNP/NNP/NNP'), ('NYC', 'NN'), ('simply', 'RB'), ('New York', 'NNP/NNP'), ('populous', 'JJ'), ('city', 'NN'), ('United States', 'NNP/NNS')]
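The `(word, tag)` pairs returned with `flag=True` are easy to post-process. As a minimal sketch, the helper below keeps only the words whose tags are all noun-like; `keep_by_pos` is a hypothetical name, not something the package provides, and the tagged list is copied from the English demo output rather than produced by calling the library.

```python
def keep_by_pos(pairs, prefixes):
    """Keep words whose every tag component starts with one of `prefixes`.

    Compound tags such as 'NNP/NNP' are split on '/'.
    Hypothetical helper for illustration; not part of NCHU_nlptoolkit.
    """
    prefixes = tuple(prefixes)
    return [
        word
        for word, tag in pairs
        if all(part.startswith(prefixes) for part in tag.split('/'))
    ]


# Output of cut_sentence_en(doc, flag=True), copied from the demo.
tagged = [('City', 'NNP'), ('New York', 'NNP/NNP'), ('called', 'VBN'),
          ('New York City', 'NNP/NNP/NNP'), ('NYC', 'NN'), ('simply', 'RB'),
          ('New York', 'NNP/NNP'), ('populous', 'JJ'), ('city', 'NN'),
          ('United States', 'NNP/NNS')]

nouns = keep_by_pos(tagged, ['NN'])
print(nouns)
```

The same helper works on the Chinese demo output by passing `['n']` as the prefix list.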
Download files
Source Distribution
NCHU_nlptoolkit-2.0.5.tar.gz (12.9 MB)
File details
Details for the file NCHU_nlptoolkit-2.0.5.tar.gz.
File metadata
- Download URL: NCHU_nlptoolkit-2.0.5.tar.gz
- Upload date:
- Size: 12.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 86afedaacca1d798fc30a8aea34ee2994f9f61aeb007abe32142f000915e43bc |
| MD5 | e728ec2652dd9c9707675b54ef74f373 |
| BLAKE2b-256 | ef516465d9bfe7dd1ec49b44e709209135832485acedbe2a58e54b33099ad679 |
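The listed digests can be used to check a downloaded archive. The following is a minimal sketch using only the standard library's `hashlib`; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local file path is an assumption about where the archive was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest listed above for NCHU_nlptoolkit-2.0.5.tar.gz.
EXPECTED_SHA256 = '86afedaacca1d798fc30a8aea34ee2994f9f61aeb007abe32142f000915e43bc'


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals `expected` (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

Assuming the archive was saved to the current directory, `matches_expected('NCHU_nlptoolkit-2.0.5.tar.gz')` should return `True` for an intact download.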