Korean Corpus Downloader

Project description

Korpora: Korean Corpora Archives

This package makes it easy to download and use various Korean corpora.

Install

From source

git clone https://github.com/lovit/Korpora
cd Korpora
python setup.py install

Using pip

pip install Korpora

Usage

from Korpora import Korpora

nsmc = Korpora.load('nsmc')
# nsmc = Korpora.load('nsmc', root_dir='path/to/Korpora')
# nsmc = Korpora.load('nsmc', root_dir='path/to/Korpora', force_download=True)
len(nsmc.train.texts)   # 150000
len(nsmc.train.labels)  # 150000

from Korpora import NSMC

nsmc = NSMC()
# nsmc = NSMC(root_dir='./Korpora/')
# nsmc = NSMC(force_download=True)

Naming

Every corpus is accessed as corpus_name.mode.type.

mode: one of [train, dev, test, all]
type: one of [texts, labels, ...]
normalization: one of [normed, raw]
tokenization: one of [.bpe, .mecab, ...]

nsmc.train.texts
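
The access pattern above can be illustrated without downloading anything. The following is a minimal sketch, assuming that the texts and labels of one mode are parallel sequences; `Split` and `ToyCorpus` are hypothetical stand-ins written for this example and are not classes provided by Korpora.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Split:
    """Hypothetical container for one mode (train, dev, test, or all)."""
    texts: List[str] = field(default_factory=list)
    labels: List[int] = field(default_factory=list)


@dataclass
class ToyCorpus:
    """Hypothetical stand-in for a loaded corpus; not the Korpora class."""
    train: Split
    test: Split


toy = ToyCorpus(
    train=Split(texts=["good movie", "boring plot"], labels=[1, 0]),
    test=Split(texts=["fine"], labels=[1]),
)

# corpus_name.mode.type: pair each text of a mode with its label.
pairs = list(zip(toy.train.texts, toy.train.labels))
```

With a real corpus, the same `zip` over `nsmc.train.texts` and `nsmc.train.labels` should work as long as both attributes are ordinary sequences of equal length.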

File structure: Korpora/corpus_name/mode.type[.normalization][.tokenization]

Korpora/nsmc/rating_train.txt
Korpora/nsmc/rating_train.txt.texts
Korpora/nsmc/train.texts.raw
Korpora/nsmc/train.texts.normed
Korpora/nsmc/train.labels
Korpora/nsmc/train.texts.normed.mecab
Korpora/nsmc/test.texts.normed.mecab
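
The file naming pattern can be sketched as a small path builder. This is only an illustration of the pattern described above; `corpus_file` is a hypothetical helper, not a function of the Korpora package, and it does not cover raw download names such as rating_train.txt.

```python
from pathlib import PurePosixPath
from typing import Optional

MODES = ("train", "dev", "test", "all")


def corpus_file(
    root: str,
    corpus_name: str,
    mode: str,
    type_: str,
    normalization: Optional[str] = None,
    tokenization: Optional[str] = None,
) -> PurePosixPath:
    """Build root/corpus_name/mode.type[.normalization][.tokenization]."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    name = f"{mode}.{type_}"
    if normalization:
        name += f".{normalization}"
    if tokenization:
        # Accept both 'mecab' and '.mecab' for the tokenization suffix.
        name += f".{tokenization.lstrip('.')}"
    return PurePosixPath(root) / corpus_name / name


labels_path = corpus_file("Korpora", "nsmc", "train", "labels")
mecab_path = corpus_file("Korpora", "nsmc", "train", "texts", "normed", ".mecab")
```

Here `labels_path` corresponds to Korpora/nsmc/train.labels and `mecab_path` to Korpora/nsmc/train.texts.normed.mecab from the listing above.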

Project details

Release history

0.2.0: Jan 11, 2021
0.2.0rc1 (pre-release): Nov 19, 2020
0.1.1: Sep 21, 2020
0.1.0: Sep 10, 2020
0.1.0rc0 (pre-release): Sep 10, 2020
0.0.2 (this version): Aug 30, 2020
0.0.1: Aug 14, 2020

Download files

Source Distribution

Korpora-0.0.2.tar.gz (3.9 kB)

Uploaded Aug 30, 2020 Source

Built Distribution

Korpora-0.0.2-py3-none-any.whl (5.4 kB)

Uploaded Aug 30, 2020 Python 3

Hashes for Korpora-0.0.2.tar.gz

Algorithm	Hash digest
SHA256	`f0f46d5ac0b0eefd6e7cc6546aa5c16853c4bdfbf4933e39d5d8752374f7e034`
MD5	`1a20c39b2b93dce421a6bd85be273301`
BLAKE2b-256	`f4098bdef1c24a5198b19097bb409bd570063a0c242ac9c7b2acff9c57248d3b`

Hashes for Korpora-0.0.2-py3-none-any.whl

Algorithm	Hash digest
SHA256	`88399cfaa6eab585482bfbc71b6a862892f1e027289e06a1580d1ba6dbd1278b`
MD5	`62294ee95a4fbddeadc867e17c0e67fa`
BLAKE2b-256	`7f511cef7fd115b8ed1c70cdd82540b7799d8eb74649286568e06592ed59fa8e`
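
A downloaded file can be checked against the digests listed above using Python's standard `hashlib` module. The sketch below is one possible way to do it; `file_digests` and `matches_published` are hypothetical helper names, and the BLAKE2b-256 value is assumed to be BLAKE2b with a 32-byte digest.

```python
import hashlib
from typing import Dict


def file_digests(path: str, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Return hex digests of a file for the three algorithms in the tables."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: BLAKE2b-256 means BLAKE2b with a 32-byte digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as handle:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(path: str, algorithm: str, expected_hex: str) -> bool:
    """Compare one computed digest with a published hex string."""
    return file_digests(path)[algorithm] == expected_hex.strip().lower()
```

For example, `matches_published("Korpora-0.0.2.tar.gz", "SHA256", "f0f46d5ac0b0eefd6e7cc6546aa5c16853c4bdfbf4933e39d5d8752374f7e034")` should return True for an intact copy of the source distribution.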