Download and pre-process data for NLP tasks

NLPrep - download and pre-process data for NLP tasks

Example

nlprep --dataset clner --task tagRow --outdir ./clner_row --util s2t

Installation

Installing via pip

pip install nlprep

Running nlprep

Once you've installed nlprep, you can run it with

python -m nlprep.main # local version
or
nlprep # pip installed version

with the following parameters:

$ nlprep
arguments:
  --dataset     which dataset to use    ['clner','udicstm','pttgen','pttposgen','cged','drcdtag']
  --task        type of training task   ['gen', 'classification', 'tagRow', 'tagCol']  
  --outdir      processed result output directory       

optional arguments:
  -h, --help    show this help message and exit
  --util    data preprocessing utility, support multiple utility    ['s2t','t2s','splittrain','splittest','splitvalid','tagsamerate']
  --cachedir   dir for caching raw dataset
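
The option list above can be mirrored with a small argparse parser. This is only an illustrative sketch, not nlprep's actual source: treating the first three options as required and passing several utilities space-separated after --util are assumptions.

```python
import argparse

DATASETS = ["clner", "udicstm", "pttgen", "pttposgen", "cged", "drcdtag"]
TASKS = ["gen", "classification", "tagRow", "tagCol"]
UTILS = ["s2t", "t2s", "splittrain", "splittest", "splitvalid", "tagsamerate"]


def build_parser():
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="nlprep")
    # Assumption: the three options listed under "arguments" are required.
    parser.add_argument("--dataset", required=True, choices=DATASETS,
                        help="which dataset to use")
    parser.add_argument("--task", required=True, choices=TASKS,
                        help="type of training task")
    parser.add_argument("--outdir", required=True,
                        help="processed result output directory")
    # Assumption: multiple utilities are given space-separated after --util.
    parser.add_argument("--util", nargs="+", default=[], choices=UTILS,
                        help="data preprocessing utilities")
    parser.add_argument("--cachedir", default=None,
                        help="dir for caching raw dataset")
    return parser


args = build_parser().parse_args(
    ["--dataset", "clner", "--task", "tagRow",
     "--outdir", "./clner_row", "--util", "s2t", "splittrain"]
)
print(args.dataset, args.task, args.outdir, args.util)
```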

Dataset details

clner

Chinese-Literature-NER-RE-Dataset

A Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text

We provide a new Chinese literature dataset for Named Entity Recognition (NER) and Relation Extraction (RE). The dataset is described at https://arxiv.org/pdf/1711.07010.pdf

From: https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset

udicstm

UDIC Sentiment Analysis Dataset
Training data cleaned by UDIC from PTT boards such as the Hate (黑特) board and the Good-Person (好人) board

From: https://github.com/UDICatNCHU/Swinger

pttgen

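Gossiping-Chinese-Corpus
Chinese question-answer corpus from the PTT Gossiping board
Collected from PTT Gossiping board posts from 2015 through June 2017; each line is one question-answer pair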

From: https://github.com/zake7749/Gossiping-Chinese-Corpus

pttposgen

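Gossiping-Chinese-Positive-Corpus
Positive-sentiment Chinese question-answer corpus from the PTT Gossiping board. It is derived from the Gossiping-QA-Dataset-2_0.csv dataset: sentiment analysis was run on its 774,114 question-answer pairs, every sentence predicted as positive (positive probability > 50%) was extracted, and the result is 197,926 entries.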

From: https://github.com/voidful/Gossiping-Chinese-Positive-Corpus

drcdtag

Delta Reading Comprehension Dataset (台達閱讀理解資料集)
The dataset contains 10,014 paragraphs drawn from 2,108 Wikipedia articles, with more than 30,000 questions annotated on those paragraphs.

From: https://github.com/DRCKnowledgeTeam/DRCD

cged

Chinese Grammatical Error Diagnosis (中文語法錯誤診斷)
The grammatical errors are broadly categorized into 4 error types: word ordering, redundant, missing, and incorrect selection of linguistic components.

From: http://nlp.ee.ncu.edu.tw/resource/cged.html

Utility details

s2t

Uses opencc-python-reimplemented to convert Simplified Chinese to Traditional Chinese

t2s

Uses opencc-python-reimplemented to convert Traditional Chinese to Simplified Chinese
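
As a rough sketch of what these two conversions involve, the helper below wraps the `OpenCC(config).convert` call from opencc-python-reimplemented. The function names `make_opencc_converter` and `convert_texts` are hypothetical and not part of nlprep; which fields nlprep actually converts is not stated here.

```python
def make_opencc_converter(config):
    """Return a str -> str converter for an OpenCC config such as 's2t' or 't2s'."""
    # Imported lazily so convert_texts can be used without the package installed.
    from opencc import OpenCC  # pip install opencc-python-reimplemented
    return OpenCC(config).convert


def convert_texts(texts, convert):
    """Apply a converter to every string in a list, preserving order."""
    return [convert(text) for text in texts]


# Usage, with opencc-python-reimplemented installed:
#   s2t = make_opencc_converter("s2t")
#   traditional = convert_texts(["汉字", "简体"], s2t)
```

Passing the converter in as an argument keeps the list-handling logic independent of OpenCC, so it can be exercised with any string function.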

splittrain

Splits off 80% of the data as training data

splittest

Splits off 20% of the data as testing data

splitvalid

Splits off 10% of the data as validation data
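
A fraction-based split like the ones above can be sketched as follows. How nlprep actually samples the data (shuffled or in order, seeded or not) is not documented here, so the seeded shuffle and the name `split_fraction` are assumptions made for illustration.

```python
import random


def split_fraction(inputs, targets, fraction, seed=0):
    """Return (selected, remainder), where selected holds `fraction` of the pairs."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    pairs = list(zip(inputs, targets))
    random.Random(seed).shuffle(pairs)  # seeded so the split is reproducible
    cut = int(len(pairs) * fraction)
    return pairs[:cut], pairs[cut:]


inputs = [f"sentence {i}" for i in range(10)]
targets = [f"label {i}" for i in range(10)]
train, rest = split_fraction(inputs, targets, 0.8)
print(len(train), len(rest))  # prints "8 2"
```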

Add a new dataset

1. Create a folder named after the dataset; the folder name is the value passed to --dataset
   eg: /clner
2. Create a blank __init__.py and a dataset.py inside it
3. Add DATASET_FILE_MAP inside dataset.py; each value is the URL of a dataset file
   eg:

DATASET_FILE_MAP = {
    "train": "https://raw.githubusercontent.com/lancopku/Chinese-Literature-NER-RE-Dataset/master/ner/train.txt",
    "test": "https://raw.githubusercontent.com/lancopku/Chinese-Literature-NER-RE-Dataset/master/ner/test.txt",
    "validation": "https://raw.githubusercontent.com/lancopku/Chinese-Literature-NER-RE-Dataset/master/ner/validation.txt",
}
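
One way a map like this could be consumed is sketched below: each URL is downloaded once into a cache directory and reused afterwards. The helper names `cache_path` and `fetch_dataset`, and the choice to name cached files after the last URL segment, are assumptions for illustration and not nlprep's implementation.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def cache_path(url, cachedir):
    """Map a URL to a file inside cachedir, named after the URL's last path segment."""
    return os.path.join(cachedir, os.path.basename(urlparse(url).path))


def fetch_dataset(file_map, cachedir, fetch=urlretrieve):
    """Download each split unless it is already cached; return {split: local path}."""
    os.makedirs(cachedir, exist_ok=True)
    local = {}
    for split, url in file_map.items():
        path = cache_path(url, cachedir)
        if not os.path.exists(path):
            fetch(url, path)  # urlretrieve(url, filename) saves the URL to disk
        local[split] = path
    return local


# Usage (performs network requests):
#   paths = fetch_dataset(DATASET_FILE_MAP, "./cache")
```

Note that two URLs ending in the same file name would share one cache entry under this naming scheme.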

4. Create a function called toMiddleFormat(path) that turns the raw dataset into the middle format:

{
    "input": [
        example1 input,
        example2 input,
        ...
    ],
    "target": [
        example1 target,
        example2 target,
        ...
    ]
}
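
For a tagging dataset, toMiddleFormat might look like the sketch below. It assumes a raw file with one "token tag" pair per line and a blank line between sentences, and it assumes each example is stored as a list of tokens with a parallel list of tags; neither assumption is confirmed by the description above.

```python
def toMiddleFormat(path):
    """Read a 'token tag' per-line file into {"input": [...], "target": [...]}."""
    middle = {"input": [], "target": []}
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                # Blank line: the current sentence is complete.
                if tokens:
                    middle["input"].append(tokens)
                    middle["target"].append(tags)
                    tokens, tags = [], []
                continue
            if len(parts) < 2:
                continue  # skip lines that carry no tag
            tokens.append(parts[0])
            tags.append(parts[-1])
    if tokens:  # flush the last sentence when there is no trailing blank line
        middle["input"].append(tokens)
        middle["target"].append(tags)
    return middle
```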

Add a new utility

sentence level: add a function to utils/sentlevel.py; the function name becomes the --util parameter
pairs level: add a function to utils/parislevel.py; the function name becomes the --util parameter
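
Since a utility is selected by its function name, the lookup can be pictured as a name-to-function table. The sketch below is hypothetical throughout: the utilities `strip_spaces` and `drop_empty`, their signatures (one sentence in, one sentence out; input and target lists in, filtered lists out), and `apply_utils` are illustrative assumptions and not nlprep's code.

```python
def strip_spaces(sentence):
    """Hypothetical sentence-level utility: remove all whitespace from one sentence."""
    return "".join(sentence.split())


def drop_empty(inputs, targets):
    """Hypothetical pairs-level utility: drop pairs whose input is empty."""
    kept = [(i, t) for i, t in zip(inputs, targets) if i]
    return [i for i, _ in kept], [t for _, t in kept]


# The function name doubles as the value a user would pass on the command line.
SENT_LEVEL = {f.__name__: f for f in (strip_spaces,)}
PAIR_LEVEL = {f.__name__: f for f in (drop_empty,)}


def apply_utils(names, inputs, targets):
    """Apply each named utility in order, looking it up by function name."""
    for name in names:
        if name in SENT_LEVEL:
            inputs = [SENT_LEVEL[name](s) for s in inputs]
        elif name in PAIR_LEVEL:
            inputs, targets = PAIR_LEVEL[name](inputs, targets)
        else:
            raise ValueError(f"unknown utility: {name}")
    return inputs, targets


print(apply_utils(["strip_spaces", "drop_empty"], ["a b", "  "], ["x", "y"]))
# prints (['ab'], ['x'])
```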

Release history

0.2.1     Jul 30, 2021
0.2.0     Mar 30, 2021
0.1.58    Mar 8, 2021
0.1.57    Feb 28, 2021
0.1.56    Oct 22, 2020
0.1.55    Oct 17, 2020
0.1.54    Oct 17, 2020
0.1.53    Sep 7, 2020
0.1.52    Aug 26, 2020
0.1.51    Aug 11, 2020
0.1.50    Aug 8, 2020
0.1.49    Aug 8, 2020
0.1.48    Aug 3, 2020
0.1.47    Aug 1, 2020
0.1.46    Jul 28, 2020
0.1.45    Jul 28, 2020
0.1.42    Jul 27, 2020
0.1.41    Jul 23, 2020
0.1.38    Jul 21, 2020
0.1.37    Jul 20, 2020
0.1.36    Jul 20, 2020
0.1.35    Jul 20, 2020
0.1.33    Jul 20, 2020
0.1.32    Jul 19, 2020
0.1.31    Jul 19, 2020
0.1.30    Jul 19, 2020
0.1.29    Jul 19, 2020
0.1.28    Jul 19, 2020
0.1.27    Jul 15, 2020
0.1.26    Jul 15, 2020
0.1.25    Jul 15, 2020
0.1.24    Jul 13, 2020
0.1.23    Jul 13, 2020
0.1.22    Jul 13, 2020
0.1.21    Jul 13, 2020
0.1.20    Jul 13, 2020
0.1.12    Jul 8, 2020
0.1.11    Jul 8, 2020
0.1.10    Jul 6, 2020
0.1.9     Jul 6, 2020
0.1.8     Jul 5, 2020
0.1.7     Jul 5, 2020
0.1.6     Jul 5, 2020
0.1.5     Jul 5, 2020
0.1.4     Jul 4, 2020
0.1.3     Jul 4, 2020
0.1.2     Jul 3, 2020
0.1.1     Jun 24, 2020
0.1.0     Jun 24, 2020
0.0.90    Jun 8, 2020 (this version)
0.0.20    Apr 28, 2020
0.0.19    Apr 6, 2020
0.0.18    Apr 3, 2020
0.0.17    Mar 24, 2020
0.0.16    Mar 24, 2020
0.0.15    Mar 24, 2020
0.0.14    Mar 19, 2020
0.0.13    Mar 19, 2020
0.0.12    Mar 16, 2020
0.0.11    Mar 16, 2020
0.0.10    Mar 14, 2020
0.0.9     Mar 5, 2020
0.0.8     Mar 3, 2020
0.0.7     Mar 3, 2020
0.0.6     Feb 27, 2020
0.0.5     Feb 27, 2020
0.0.4     Feb 26, 2020
0.0.3     Feb 20, 2020
0.0.2     Feb 20, 2020
0.0.1     Feb 18, 2020

Download files

Source Distribution

nlprep-0.0.90.tar.gz (18.4 kB), uploaded Jun 8, 2020, Source

Built Distributions

nlprep-0.0.90-py3.7.egg (52.7 kB), uploaded Jun 24, 2020, Source
nlprep-0.0.90-py3-none-any.whl (28.6 kB), uploaded Jun 8, 2020, Python 3

Hashes for nlprep-0.0.90.tar.gz
Algorithm	Hash digest
SHA256	`a4b18a3a80967db36a56ec19f727e78bc0445ac78367a6ef339b1ab230156335`
MD5	`3afe51106d8e318925381fb2dcd8bb42`
BLAKE2b-256	`bd9b049308bae34fd7a93f9628d3c995dfec4e4cfceb0904b942de0691a5d8bc`

Hashes for nlprep-0.0.90-py3.7.egg
Algorithm	Hash digest
SHA256	`36cdb982e3aee847a447c8583e754c91b26452f3bb501232365bdcb8aee1da5b`
MD5	`29b244bd26c53c54d34222a902634e50`
BLAKE2b-256	`a13fe1aede7f765faccc5e2597121bae3652fe91fbcefa08831cb35f4b361729`

Hashes for nlprep-0.0.90-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c572cec9aed485b77e8de8727c396350b0e81899b38843dc999d2954d55c6b75`
MD5	`a2cc018f2f4d1267f67970502caada06`
BLAKE2b-256	`7677eef200a6daccff4ca89fc688a977e0dd8390e97c7df7ece189ee54022ea6`
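
The digests listed above can be checked against a downloaded file with Python's hashlib. This is a minimal sketch: the helper names `file_digests` and `matches` are hypothetical, and BLAKE2b-256 is taken to mean BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=1 << 16):
    """Return the SHA256, MD5 and BLAKE2b-256 hex digests of a file."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),  # 32 bytes = 256 bits
    }
    with open(path, "rb") as f:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}


def matches(path, algorithm, expected_hex):
    """True when the file's digest for `algorithm` equals the published value."""
    return file_digests(path)[algorithm] == expected_hex.strip().lower()


# Usage, after downloading the archive next to this script:
#   matches("nlprep-0.0.90.tar.gz", "SHA256",
#           "a4b18a3a80967db36a56ec19f727e78bc0445ac78367a6ef339b1ab230156335")
```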