Indic NLP dataset loader

Overview

This library provides Indian regional language datasets in an easy-to-use format modeled on the sklearn.datasets API. You are free to use it in commercial applications.

Installation

You can install this library with pip:

pip install indic-nlp-datasets

To install the latest development version directly from the master branch on GitHub, use:

pip install git+https://github.com/rahul1990gupta/indic-nlp-datasets.git@master
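
To confirm which version pip actually installed, a minimal sketch using the standard library's importlib.metadata (Python 3.8+) might look like the following. The helper name installed_version is illustrative and not part of this library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "indic-nlp-datasets") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    print(installed_version())
```

A return value of None suggests the install went into a different environment than the one running the script.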

Datasets Available

The following datasets are available in the library:

Name                 Size      Submodule          Language
Wikipedia            275 MB    load_wikipedia     hi
Oscar Common Crawl   17 GB     load_occ           hi
News Crawl           472 MB    load_news_crawl    hi
Monolingual          2.45 GB   load_monolingual   hi
Tweet Corpus         875 MB    load_tweets        hi
Hinglish Corpus      18 MB     load_hinglish      hi
Devdas               300 KB    load_devdas        hi
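
Because the listed sizes range from 300 KB to 17 GB, it can help to check what fits a download budget before calling a loader. The sketch below hard-codes the sizes from the table above (taking 1 GB as 1024 MB). DATASET_SIZES_MB and loaders_within are hypothetical names for illustration, not part of the library.

```python
# Approximate sizes in MB, copied from the table above (1 GB taken as 1024 MB).
# This mapping is an illustration only; the library does not expose it.
DATASET_SIZES_MB = {
    "load_wikipedia": 275,
    "load_occ": 17 * 1024,
    "load_news_crawl": 472,
    "load_monolingual": 2.45 * 1024,
    "load_tweets": 875,
    "load_hinglish": 18,
    "load_devdas": 300 / 1024,
}


def loaders_within(budget_mb: float) -> list:
    """Return loader names whose listed size fits the budget, smallest first."""
    fitting = [
        (size, name)
        for name, size in DATASET_SIZES_MB.items()
        if size <= budget_mb
    ]
    return [name for _size, name in sorted(fitting)]


print(loaders_within(500))
# → ['load_devdas', 'load_hinglish', 'load_wikipedia', 'load_news_crawl']
```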

Getting started

After installation, you can start by importing a dataset loader and calling it:

from idatasets import load_devdas

devdas = load_devdas()
print(devdas.desc)        # prints description of the data
print(devdas.created_at)  # date/year when dataset was created
for sent in devdas.data:
    # process text chunks; printing is a placeholder for real processing
    print(sent)
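
As one illustration of what "process text chunks" could mean, here is a minimal sketch that counts whitespace-separated tokens. It assumes each item in devdas.data is a plain string, which the snippet above suggests but does not confirm. It runs on a small in-memory sample so it works without downloading anything. The function name token_counts is hypothetical.

```python
from collections import Counter
from typing import Iterable


def token_counts(chunks: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens across an iterable of text chunks."""
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk.split())
    return counts


# Stand-in for `devdas.data`. With the library installed you could pass
# `load_devdas().data` instead (assumption: it yields strings).
sample_chunks = ["देवदास घर आया", "पारो घर गई"]
counts = token_counts(sample_chunks)
print(counts.most_common(1))
# → [('घर', 2)]
```

Whitespace splitting is deliberately crude: it leaves punctuation such as the danda (।) attached to words, so a real pipeline would likely swap in a proper Hindi tokenizer.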

Download files

Source Distribution

indic-nlp-datasets-0.1.2.tar.gz (41.1 kB)

Uploaded Source

Built Distribution

indic_nlp_datasets-0.1.2-py3-none-any.whl (131.7 kB)

Uploaded Python 3
