Deliver the ready-to-train data to your NLP model.

chariot

Prepare Dataset
- You can prepare typical NLP datasets through chazutsu.
Build & Run Preprocess
- You can build a preprocessing pipeline similar to a scikit-learn Pipeline.
- The preprocessing for each dataset column is executed in parallel by Joblib.
- Multi-language text tokenization is supported by spaCy.
Format Batch
- Sample a batch from the preprocessed dataset and format it for training the model (padding, etc.).
- You can use pre-trained word vectors through chakin.

chariot enables you to concentrate on training your model!

[Figure: chariot flow]

Install

pip install chariot

Prepare dataset

You can download various datasets using chazutsu.

import chazutsu
from chariot.storage import Storage


storage = Storage("your/data/root")
r = chazutsu.datasets.MovieReview.polarity().download(storage.data_path("raw"))

df = storage.chazutsu(r.root).data()
df.head(5)

The resulting DataFrame looks like this:

	polarity	review
0	0	synopsis : an aging master art thief , his sup...
1	0	plot : a separated , glamorous , hollywood cou...
2	0	a friend invites you to a movie . this film wo...

The Storage class manages a directory structure that follows Cookiecutter Data Science.

Project root
  └── data
       ├── external     <- Data from third party sources (ex. word vectors).
       ├── interim      <- Intermediate data that has been transformed.
       ├── processed    <- The final, canonical datasets for modeling.
       └── raw          <- The original, immutable data dump.
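As a rough illustration of the layout above, the following sketch builds the same four folders with pathlib. The `make_data_dirs` and `data_path` helpers are hypothetical names used only for this example; they are not chariot's Storage implementation.

```python
import tempfile
from pathlib import Path

# Sub-folders of `data` in the layout shown above.
DATA_DIRS = ("external", "interim", "processed", "raw")


def make_data_dirs(root):
    """Create <root>/data/{external,interim,processed,raw} (hypothetical helper)."""
    data_root = Path(root) / "data"
    for name in DATA_DIRS:
        (data_root / name).mkdir(parents=True, exist_ok=True)
    return data_root


def data_path(root, target=""):
    """Resolve a path below <root>/data, e.g. data_path(root, "raw")."""
    return Path(root) / "data" / target


# Build the layout in a throwaway directory so the example has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    make_data_dirs(tmp)
    created = sorted(p.name for p in data_path(tmp).iterdir())
    raw_exists = data_path(tmp, "raw").is_dir()

print(created)
```

Keeping raw data in its own folder makes it easy to treat the original download as read-only and to regenerate everything under interim and processed.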

Build & Run Preprocess

Build a preprocess pipeline

All preprocessors are defined in chariot.transformer.
Transformers are implemented by extending scikit-learn's Transformer.
As a result, the Transformer API (fit and transform) will be familiar to you, and you can mix in scikit-learn's own preprocessors.

import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


preprocessor = Preprocessor()
preprocessor\
    .stack(ct.text.UnicodeNormalizer())\
    .stack(ct.Tokenizer("en"))\
    .stack(ct.token.StopwordFilter("en"))\
    .stack(ct.Vocabulary(min_df=5, max_df=0.5))\
    .fit(train_data)

preprocessor.save("my_preprocessor.pkl")

loaded = Preprocessor.load("my_preprocessor.pkl")
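To make the fit/transform idea concrete, here is a minimal sketch of a stack-style pipeline in plain Python. `Lowercase`, `WhitespaceTokenizer`, and `MiniPipeline` are hypothetical stand-ins written for this example; they are not chariot or scikit-learn classes and only mimic the fit/transform convention.

```python
class Lowercase:
    """Stateless step: fit learns nothing, transform lowercases each text."""

    def fit(self, texts):
        return self

    def transform(self, texts):
        return [t.lower() for t in texts]


class WhitespaceTokenizer:
    """Stateless step: split each text on whitespace."""

    def fit(self, texts):
        return self

    def transform(self, texts):
        return [t.split() for t in texts]


class MiniPipeline:
    """Hypothetical pipeline: steps run in the order they were stacked."""

    def __init__(self):
        self.steps = []

    def stack(self, step):
        self.steps.append(step)
        return self  # returning self allows method chaining

    def fit(self, data):
        # Fit each step on the output of the previous one.
        for step in self.steps:
            data = step.fit(data).transform(data)
        return self

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data


pipeline = MiniPipeline().stack(Lowercase()).stack(WhitespaceTokenizer())
pipeline.fit(["A Friend Invites You"])
tokens = pipeline.transform(["This Film WOULD be"])
print(tokens)  # [['this', 'film', 'would', 'be']]
```

Because every step exposes the same two methods, any object with fit and transform can be stacked, which is what makes mixing in third-party preprocessors possible.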

chariot provides 6 types of transformers.

TextPreprocessor
- Preprocess the text before tokenization.
- TextNormalizer: Normalize text (replace some characters, etc.).
- TextFilter: Filter the text (delete some spans in the text, etc.).
Tokenizer
- Tokenize the texts.
- It is powered by spaCy, and you can choose MeCab or Janome for Japanese.
TokenPreprocessor
- Normalize/Filter the tokens after tokenization.
- TokenNormalizer: Normalize tokens (to lowercase, to original form, etc.).
- TokenFilter: Filter tokens (extract only nouns, etc.).
Vocabulary
- Make a vocabulary and convert tokens to indices.
Formatter
- Format (preprocessed) data for training your model.
Generator
- Generate target data to train your (language) model.
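The Vocabulary step can be pictured with the following sketch, which builds a token-to-index map filtered by document frequency. It assumes that min_df is an absolute document count and max_df a fraction of documents, as in scikit-learn's CountVectorizer; whether chariot interprets them the same way is not stated here. `build_vocab`, `to_indices`, and the `<pad>`/`<unk>` tokens are hypothetical names for illustration.

```python
from collections import Counter


def build_vocab(docs, min_df=1, max_df=1.0):
    """Map tokens to indices, keeping tokens by document frequency.

    Assumption: min_df is an absolute document count and max_df a fraction
    of documents, as in scikit-learn's CountVectorizer.
    """
    # Count in how many documents each token appears (not total occurrences).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_count = max_df * len(docs)
    kept = sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)
    # Reserve index 0 for padding and index 1 for unknown tokens.
    vocab = {"<pad>": 0, "<unk>": 1}
    for token in kept:
        vocab[token] = len(vocab)
    return vocab


def to_indices(doc, vocab):
    """Convert tokens to indices, falling back to the unknown index."""
    return [vocab.get(token, vocab["<unk>"]) for token in doc]


docs = [
    ["the", "film", "is", "good"],
    ["the", "plot", "is", "bad"],
    ["the", "film", "has", "a", "plot"],
    ["the", "cast", "is", "good"],
]
vocab = build_vocab(docs, min_df=2, max_df=0.75)
indices = to_indices(["the", "film", "is", "great"], vocab)
print(vocab)
print(indices)  # [1, 2, 4, 1]
```

In this toy corpus "the" appears in every document and is dropped by max_df, while words seen in only one document are dropped by min_df, so both map to the unknown index.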

Build a preprocess for dataset

When you want to apply a separate preprocess to each column of your dataset, you can use DatasetPreprocessor.

import chariot.transformer as ct
from chariot.dataset_preprocessor import DatasetPreprocessor
from chariot.transformer.formatter import Padding


dp = DatasetPreprocessor()
dp.process("review")\
    .by(ct.text.UnicodeNormalizer())\
    .by(ct.Tokenizer("en"))\
    .by(ct.token.StopwordFilter("en"))\
    .by(ct.Vocabulary(min_df=5, max_df=0.5))\
    .by(Padding(length=pad_length))\
    .fit(train_data["review"])
dp.process("polarity")\
    .by(ct.formatter.CategoricalLabel(num_class=3))


preprocessed = dp.preprocess(data)

# DatasetPreprocessor holds multiple preprocessors,
# so the save file format is `tar.gz`.
dp.save("my_dataset_preprocessor.tar.gz")

loaded = DatasetPreprocessor.load("my_dataset_preprocessor.tar.gz")
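The per-column idea can be sketched in plain Python as follows. This example uses the standard library's ThreadPoolExecutor as a stand-in for Joblib, and `tokenize`, `pad`, `run_steps`, and `preprocess_columns` are hypothetical helpers written for this illustration, not chariot functions.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(values):
    """Lowercase and split each text on whitespace."""
    return [v.lower().split() for v in values]


def pad(length, pad_token="<pad>"):
    """Return a step that truncates or right-pads each token list to `length`."""
    def step(rows):
        return [(r + [pad_token] * length)[:length] for r in rows]
    return step


def run_steps(values, steps):
    """Apply the steps of one column in order."""
    for step in steps:
        values = step(values)
    return values


def preprocess_columns(data, column_steps):
    """Run each column's step chain in its own worker (hypothetical helper)."""
    with ThreadPoolExecutor() as pool:
        futures = {
            column: pool.submit(run_steps, data[column], steps)
            for column, steps in column_steps.items()
        }
        return {column: f.result() for column, f in futures.items()}


data = {
    "review": ["A friend invites you", "Plot"],
    "polarity": ["0", "1"],
}
column_steps = {
    "review": [tokenize, pad(3)],
    "polarity": [lambda values: [int(v) for v in values]],
}
result = preprocess_columns(data, column_steps)
print(result["review"])
print(result["polarity"])
```

Columns are independent of each other, so their step chains can run concurrently without any coordination beyond collecting the results.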

Train your model with chariot

chariot has features to help you train your model.

formatted = dp(train_data).preprocess().format().processed

model.fit(formatted["review"], formatted["polarity"], batch_size=32,
          validation_split=0.2, epochs=15, verbose=2)

for batch in dp(train_data).preprocess().iterate(batch_size=32, epoch=10):
    model.train_on_batch(batch["review"], batch["polarity"])
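The batch loop above can be pictured with this small generator. It is a sketch under assumptions: `iterate_batches` is a hypothetical helper, and whether chariot's own iterate shuffles rows or stops after the given number of epochs is not stated here; this version does both.

```python
import random


def iterate_batches(columns, batch_size, epoch, seed=0):
    """Yield dict batches, reshuffling the row order once per epoch.

    `columns` maps a column name to a list; all lists share one length.
    """
    size = len(next(iter(columns.values())))
    rng = random.Random(seed)
    for _ in range(epoch):
        order = list(range(size))
        rng.shuffle(order)
        for start in range(0, size, batch_size):
            picked = order[start:start + batch_size]
            # Use the same row indices for every column so pairs stay aligned.
            yield {name: [values[i] for i in picked]
                   for name, values in columns.items()}


columns = {
    "review": [[2, 4], [5, 3], [2, 3], [4, 5], [3, 3]],
    "polarity": [0, 1, 0, 1, 1],
}
batches = list(iterate_batches(columns, batch_size=2, epoch=3))
print(len(batches))  # 3 batches per epoch (2 + 2 + 1 rows) x 3 epochs = 9
```

Each yielded dictionary has the same keys as the input, so it can be consumed the same way as `batch["review"]` and `batch["polarity"]` in the loop above.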

You can use pre-trained word vectors through chakin.

from chariot.storage import Storage
from chariot.transformer.vocabulary import Vocabulary

# Download word vector
storage = Storage("your/data/root")
storage.chakin(name="GloVe.6B.50d")

# Make embedding matrix
vocab = Vocabulary()
vocab.set(["you", "loaded", "word", "vector", "now"])
embed = vocab.make_embedding(storage.data_path("external/glove.6B.50d.txt"))
print(embed.shape)  # (len(vocab.count), 50)
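For intuition, an embedding matrix can be assembled from a GloVe-style text file (one word followed by its space-separated vector values per line) roughly as follows. This is a plain-Python sketch, not chariot's make_embedding; the `make_embedding` function below and its choice to leave missing words as zero vectors are assumptions made for illustration.

```python
import io


def make_embedding(words, vector_lines, dim):
    """Build one row per word; words missing from the file stay all-zero."""
    index = {word: i for i, word in enumerate(words)}
    matrix = [[0.0] * dim for _ in words]
    for line in vector_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Skip words outside the vocabulary and lines of the wrong width.
        if word in index and len(values) == dim:
            matrix[index[word]] = [float(v) for v in values]
    return matrix


# Tiny stand-in for a GloVe text file: "<word> <v1> <v2> ... <vn>" per line.
fake_glove = io.StringIO(
    "you 0.1 0.2 0.3\n"
    "word 0.4 0.5 0.6\n"
    "unused 0.7 0.8 0.9\n"
)
words = ["you", "loaded", "word"]
embed = make_embedding(words, fake_glove, dim=3)
print(len(embed), len(embed[0]))  # 3 3
```

Row i of the matrix corresponds to the word with index i, so the matrix can initialize an embedding layer that is fed the indices produced by the vocabulary.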

Project details

Release history

Version	Release date
0.5.6	Dec 26, 2019
0.5.5	Oct 18, 2019 (this version)
0.5.2	Apr 25, 2019
0.5.1	Feb 27, 2019
0.5.0	Feb 21, 2019
0.4.9	Dec 25, 2018
0.4.8	Nov 29, 2018
0.4.7	Oct 23, 2018
0.4.6	Oct 11, 2018
0.4.5	Oct 3, 2018
0.4.4	Oct 1, 2018
0.4.3	Sep 27, 2018
0.4.2	Sep 27, 2018
0.4.1	Aug 8, 2018
0.4.0	Aug 6, 2018
0.3.0	Jun 18, 2018
0.2.0	Jun 18, 2018
0.1.0	Jun 14, 2018
0.0.1	Jun 11, 2018

Download files

Source Distribution

chariot-0.5.5.tar.gz (3.6 MB), source, uploaded Oct 18, 2019

Hashes for chariot-0.5.5.tar.gz

Algorithm	Hash digest
SHA256	`a2d8e6e5b6e8c1df5f89cb8d713fffcb36c1c4ee9cda2df0cdd7e68f10fdd96c`
MD5	`7de0cbafc36d834111409199b16cd4ab`
BLAKE2b-256	`301581efc8955beaba63353fd0d3f8e121286b1035650b4637fba58d6b8ed12c`