Korean Corpus Downloader

Korpora: Korean Corpora Archives

This package makes it easy to download and use a variety of Korean corpora.

Install

From source

git clone https://github.com/lovit/Korpora
cd Korpora
python setup.py install

Using pip

pip install Korpora

Usage

from Korpora import Korpora

nsmc = Korpora.load('nsmc')
# nsmc = Korpora.load('nsmc', root_dir='path/to/Korpora')
# nsmc = Korpora.load('nsmc', root_dir='path/to/Korpora', force_download=True)
len(nsmc.train.texts)   # 150000
len(nsmc.train.labels)  # 150000

# Alternatively, import the corpus class directly
from Korpora import NSMC

nsmc = NSMC()
nsmc = NSMC(root_dir='./Korpora/')
nsmc = NSMC(force_download=True)
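
Once a corpus is loaded, each split exposes parallel `texts` and `labels` sequences, as the `len(...)` calls above suggest. The following is a minimal sketch of how a split might be summarized. `summarize_split` is a hypothetical helper, not part of Korpora, and `toy_train` is a stand-in for `nsmc.train` so that the example runs without downloading anything.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_split(split):
    """Return (number of examples, label counts) for one corpus split.

    Hypothetical helper: assumes `split.texts` and `split.labels`
    are parallel sequences of equal length.
    """
    if len(split.texts) != len(split.labels):
        raise ValueError("texts and labels differ in length")
    return len(split.texts), Counter(split.labels)


# Stand-in for nsmc.train (assumption: same attribute names as the real split).
toy_train = SimpleNamespace(
    texts=["재미있어요", "별로예요", "최고의 영화"],
    labels=[1, 0, 1],
)

size, counts = summarize_split(toy_train)
print(size, dict(counts))
```

With a real corpus, the same call would presumably be `summarize_split(nsmc.train)`.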

Naming

Every corpus follows the naming scheme corpus_name.mode.type, and file names may add optional normalization and tokenization suffixes.

  • mode: one of [train, dev, test, all]
  • type: one of [texts, labels, ...]
  • normalization (optional): one of [normed, raw]
  • tokenization (optional): one of [.bpe, .mecab, ...]

For example, nsmc.train.texts holds the training texts of the nsmc corpus.

Files are stored as Korpora/corpus_name/mode.type[.normalization][.tokenization], where the bracketed parts are optional.

Korpora/nsmc/rating_train.txt
Korpora/nsmc/rating_train.txt.texts
Korpora/nsmc/train.texts.raw
Korpora/nsmc/train.texts.normed
Korpora/nsmc/train.labels
Korpora/nsmc/train.texts.normed.mecab
Korpora/nsmc/test.texts.normed.mecab
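
The file-naming pattern above can be sketched as a small path builder. This is only an illustration of the convention: `corpus_file_path` and its parameters are hypothetical and are not part of the Korpora API.

```python
from pathlib import Path
from typing import Optional


def corpus_file_path(
    root_dir: str,
    corpus_name: str,
    mode: str,
    type_: str,
    normalization: Optional[str] = None,
    tokenization: Optional[str] = None,
) -> Path:
    """Build root_dir/corpus_name/mode.type[.normalization][.tokenization].

    Hypothetical helper that mirrors the naming convention only.
    """
    parts = [mode, type_]
    if normalization:
        parts.append(normalization)
    if tokenization:
        # Accept both "mecab" and ".mecab" so the dot is never doubled.
        parts.append(tokenization.lstrip("."))
    return Path(root_dir) / corpus_name / ".".join(parts)


labels_path = corpus_file_path("Korpora", "nsmc", "train", "labels")
mecab_path = corpus_file_path(
    "Korpora", "nsmc", "test", "texts", "normed", ".mecab"
)
print(labels_path.as_posix())
print(mecab_path.as_posix())
```

The two calls reproduce Korpora/nsmc/train.labels and Korpora/nsmc/test.texts.normed.mecab from the listing above. Raw downloads such as rating_train.txt fall outside this pattern.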
