indic-nlp-datasets

Indic NLP dataset loader

Project description

Overview

This library provides Indian regional language datasets in an easy-to-use format modeled on the sklearn.datasets API. You are free to use it in commercial applications.

Installation

Install the library with pip:

pip install indic-nlp-datasets

To install the latest development version directly from GitHub, use:

pip install git+https://github.com/rahul1990gupta/indic-nlp-datasets.git@master
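To confirm which release ended up in the current environment, the standard library's importlib.metadata can be queried by distribution name. This is a minimal sketch; the helper name installed_version is hypothetical and not part of this library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Look up the distribution name used in the pip command above.
found = installed_version("indic-nlp-datasets")
print(found if found is not None else "indic-nlp-datasets is not installed")
```

importlib.metadata requires Python 3.8 or newer.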

Datasets Available

The library provides the following datasets:

Name	Size	Loader function	Language
Wikipedia	275 MB	`load_wikipedia`	hi
Oscar Common Crawl	17 GB	`load_occ`	hi
News Crawl	472 MB	`load_news_crawl`	hi
Monolingual	2.45 GB	`load_monolingual`	hi
Tweet Corpus	875 MB	`load_tweets`	hi
Hinglish Corpus	18 MB	`load_hinglish`	hi
Devdas	300 KB	`load_devdas`	hi

Getting started

After installation, import a dataset loader and call it:

from idatasets import load_devdas

devdas = load_devdas()
print(devdas.desc)  # prints description of the data
print(devdas.created_at)  # date/year when dataset was created
for sent in devdas.data:
    # process text chunks; printing is only a placeholder
    print(sent)
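As an illustration of the "process text chunks" step, here is a small sketch that counts whitespace-separated tokens. It assumes that the data attribute is an iterable of strings, which the example above suggests but does not state. The word_counts helper and the stand-in sample object are hypothetical; in real use the object returned by load_devdas() would take the place of sample.

```python
from collections import Counter
from types import SimpleNamespace


def word_counts(dataset):
    """Count whitespace-separated tokens across all text chunks in dataset.data."""
    counts = Counter()
    for chunk in dataset.data:
        counts.update(chunk.split())
    return counts


# Stand-in object for illustration only; it mimics the attributes shown above
# (desc, created_at, data) so the sketch runs without downloading anything.
sample = SimpleNamespace(
    desc="toy sample",
    created_at="2020",
    data=["यह एक वाक्य है", "यह दूसरा वाक्य है"],
)

counts = word_counts(sample)
print(counts.most_common(3))
```

Splitting on whitespace is a deliberately crude tokenizer; a language-aware tokenizer would be a better fit for serious work on Hindi text.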

Release history

Version	Released
0.1.2 (this version)	Aug 21, 2020
0.1.1	Aug 21, 2020
0.1	Aug 16, 2020

Download files

Source Distribution

indic-nlp-datasets-0.1.2.tar.gz (41.1 kB)

Uploaded Aug 21, 2020 Source

Built Distribution

indic_nlp_datasets-0.1.2-py3-none-any.whl (131.7 kB)

Uploaded Aug 21, 2020 Python 3

Hashes for indic-nlp-datasets-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`594e9839b6e9f1101af2a55b82ee5282ab57d3285554029aa42ad11595dd3a09`
MD5	`dc25e010092d31619c5cac545ae24c9d`
BLAKE2b-256	`9a87c7033a4ed4f1e087eb61690938d80b9c8185e31252e9d1e6a0436ad4c1a7`

Hashes for indic_nlp_datasets-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`893528c8f16583746fdcead1612bf8b6af15c1e33a15ac05c4ed1f390d5789fd`
MD5	`e51369d497ed29b27420d21c72399cd0`
BLAKE2b-256	`aa5805cec39d97a552f5b874a5c93073a66af9482c5bb250a1395026592ff053`
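
The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's hashlib; the helper names sha256_of_file and matches_published_digest are hypothetical, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example call (assumes the source distribution was downloaded locally):
# matches_published_digest(
#     "indic-nlp-datasets-0.1.2.tar.gz",
#     "594e9839b6e9f1101af2a55b82ee5282ab57d3285554029aa42ad11595dd3a09",
# )
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for these small archives but costs nothing.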