Download and pre-process data for NLP tasks
Project description
🍳 NLPrep - natural language processing dataset tool for many tasks
Features
- handles over 100 datasets
- generates a statistics report about the processed dataset
- supports many pre-processing methods
- provides a panel for entering your parameters at runtime
- easy to adapt to your own dataset and pre-processing utility
Installation
Installing via pip
pip install nlprep
Running nlprep
Once you've installed nlprep, you can run it with
pip installed version: nlprep
or
local version: python -m nlprep.main
and the following parameters:
$ nlprep
arguments:
--dataset which dataset to use
--outdir processed result output directory
optional arguments:
-h, --help show this help message and exit
--util data preprocessing utility, multiple utility are supported
--cachedir dir for caching raw dataset
--infile
--report generate a html statistics report
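As a hedged illustration, the two required parameters above can be assembled into a command line from Python. The dataset name tag_clner is the example folder name used in this README; the output directory and the helper build_nlprep_command are assumptions made for this sketch and are not part of nlprep.

```python
def build_nlprep_command(dataset, outdir, cachedir=None):
    """Assemble an nlprep command line from the documented parameters.

    build_nlprep_command is a hypothetical helper, not part of nlprep.
    """
    cmd = ["nlprep", "--dataset", dataset, "--outdir", outdir]
    if cachedir is not None:
        # --cachedir is optional: directory for caching the raw dataset
        cmd += ["--cachedir", cachedir]
    return cmd


if __name__ == "__main__":
    # "./processed" is an assumed output directory.
    print(" ".join(build_nlprep_command("tag_clner", "./processed")))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the pip installed version is available on your PATH.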
Add a new dataset
- create a folder named after the task and dataset; this name is the --dataset parameter
eg: /tag_clner
- create a blank __init__.py and dataset.py
- add DATASET_FILE_MAP inside dataset.py; each value is a dataset URL
eg:
DATASET_FILE_MAP = {
"train": "https://raw.githubusercontent.com/lancopku/Chinese-Literature-NER-RE-Dataset/master/ner/train.txt",
"test": "https://raw.githubusercontent.com/lancopku/Chinese-Literature-NER-RE-Dataset/master/ner/test.txt",
"validation": "https://raw.githubusercontent.com/lancopku/Chinese-Literature-NER-RE-Dataset/master/ner/validation.txt",
}
- create a function called toMiddleFormat(path) to turn the raw dataset into middleformat
middleformat:
{
"input": [
example1 input,
example2 input,
...
],
"target": [
example1 target,
example2 target,
...
]
}
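The steps above can be sketched as a minimal dataset.py. Only DATASET_FILE_MAP, toMiddleFormat(path), and the "input"/"target" keys come from the description above. The raw file layout assumed here (one example per line, input and target separated by a tab) and the placeholder URL are hypothetical; the real parsing depends on the dataset being added.

```python
# dataset.py (sketch under the assumptions stated above)

# Maps each split name to the URL of its raw file.
# The URL below is a placeholder, not a real dataset.
DATASET_FILE_MAP = {
    "train": "https://example.com/my_dataset/train.tsv",
}


def toMiddleFormat(path):
    """Read one raw file and return it in the middle format."""
    middle = {"input": [], "target": []}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            # Assumed layout: "<input>\t<target>"
            source, sep, target = line.partition("\t")
            if not sep:
                continue  # skip blank lines and lines without a tab
            middle["input"].append(source)
            middle["target"].append(target)
    return middle
```

Both lists stay the same length, so the example at position i in "input" always pairs with position i in "target".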
Add a new utility
- sentence level: add a function to utils/sentlevel.py; the function name will be the --util parameter
- pairs level: add a function to utils/parislevel.py; the function name will be the --util parameter
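As an illustration only, a sentence-level utility could look like the sketch below. The signature (one sentence string in, one string out) and the function name lowercase_sent are assumptions; this README does not state the interface nlprep expects, so check the existing functions in utils/sentlevel.py before adapting it.

```python
def lowercase_sent(sent):
    """Hypothetical sentence-level utility: lowercase and collapse whitespace."""
    # split() with no arguments drops leading, trailing, and repeated whitespace
    return " ".join(sent.lower().split())
```

Under these assumptions, the utility would be selected by passing its function name, lowercase_sent, as the --util parameter.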
Download files
Source Distribution
- nlprep-0.1.0.tar.gz (16.6 kB)
Built Distributions
- nlprep-0.1.0-py3.7.egg (58.1 kB)
- nlprep-0.1.0-py3-none-any.whl (29.0 kB)