
Download and pre-process data for NLP tasks

🍳 NLPrep - natural language processing dataset tool for many tasks




Features

  • handles over 100 datasets
  • generates a statistics report for the processed dataset
  • supports many pre-processing methods
  • provides a panel for entering your parameters at runtime
  • easy to adapt to your own dataset and pre-processing utilities

Installation

Installing via pip

pip install nlprep

Running nlprep

Once you've installed nlprep, you can run it with

the pip-installed version: nlprep
or
the local version: python -m nlprep.main

and the following parameters:

$ nlprep
arguments:
  --dataset     which dataset to use
  --outdir      output directory for the processed result

optional arguments:
  -h, --help    show this help message and exit
  --util        data pre-processing utility; multiple utilities are supported
  --cachedir    directory for caching the raw dataset
  --infile
  --report      generate an HTML statistics report

Add a new dataset

  1. create a folder named after the task and dataset; this name is used as the --dataset parameter
    eg: /tag_clner
  2. create a blank __init__.py and a dataset.py in that folder
  3. add DATASET_FILE_MAP inside dataset.py; each value is a dataset URL
    eg:
DATASET_FILE_MAP = {
    "train": "https://raw.githubusercontent.com/lancopku/Chinese-Literature-NER-RE-Dataset/master/ner/train.txt",
    "test": "https://raw.githubusercontent.com/lancopku/Chinese-Literature-NER-RE-Dataset/master/ner/test.txt",
    "validation": "https://raw.githubusercontent.com/lancopku/Chinese-Literature-NER-RE-Dataset/master/ner/validation.txt",
}

  4. create a function called toMiddleFormat(path) that turns the raw dataset into the middle format:

{
    "input": [
        example1 input,
        example2 input,
        ...
    ],
    "target": [
        example1 target,
        example2 target,
        ...
    ]
}
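
As an illustration of step 4, here is a minimal sketch of a toMiddleFormat(path) function for a tagging-style dataset. It assumes a raw file with one "token tag" pair per line and a blank line between examples, and it joins tokens and tags with spaces; both the raw layout and the joined-string representation are assumptions made for this example, not something nlprep prescribes. The sample file and its contents are hypothetical.

```python
import os
import tempfile


def toMiddleFormat(path):
    """Read a raw tagging file and return it in the middle format.

    Assumed raw layout (hypothetical): one "token tag" pair per line,
    with a blank line separating examples.
    """
    dataset = {"input": [], "target": []}
    tokens, tags = [], []

    def flush():
        # Store the example collected so far, if there is one.
        if tokens:
            dataset["input"].append(" ".join(tokens))
            dataset["target"].append(" ".join(tags))
            tokens.clear()
            tags.clear()

    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                flush()  # a blank line ends the current example
            elif len(parts) >= 2:
                tokens.append(parts[0])
                tags.append(parts[-1])
    flush()  # the last example may not be followed by a blank line
    return dataset


# Small demonstration on a temporary file with made-up content.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("John B-PER\nlives O\n\nParis B-LOC\n")
middle = toMiddleFormat(tmp.name)
os.remove(tmp.name)
print(middle)
```

The "input" and "target" lists stay index-aligned, so example i in "input" always corresponds to example i in "target".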

Add a new utility

  • sentence level: add a function to utils/sentlevel.py; the function name becomes the --util parameter
  • pairs level: add a function to utils/parislevel.py; the function name becomes the --util parameter
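
The sketch below illustrates the idea of a utility being selected by its function name. It is not nlprep's actual interface: the function names (lowercase, strip_extra_spaces), the one-string-in, one-string-out signature, and the UTILS lookup table are all assumptions made for this example. Check the existing functions in utils/sentlevel.py for the signature nlprep really expects before adding your own.

```python
def lowercase(sentence):
    """Hypothetical sentence-level utility: lowercase one sentence."""
    return sentence.lower()


def strip_extra_spaces(sentence):
    """Hypothetical sentence-level utility: collapse repeated whitespace."""
    return " ".join(sentence.split())


# Illustrative lookup table keyed by function name, mirroring the rule
# that the function name is what gets passed as --util.
UTILS = {fn.__name__: fn for fn in (lowercase, strip_extra_spaces)}


def apply_utils(sentence, util_names):
    """Apply the named utilities to a sentence, in the given order."""
    for name in util_names:
        sentence = UTILS[name](sentence)
    return sentence


cleaned = apply_utils("  Hello   World ", ["lowercase", "strip_extra_spaces"])
print(cleaned)
```

Keeping each utility a small, single-purpose function makes it straightforward to combine several of them, which matches the note that multiple utilities are supported.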
