
Utilities for preprocessing texts

Project description

preptext

Utilities for preparing datasets for deep learning models. It can easily convert your text-formatted data into a PyTorch DataLoader.

Installation

pip install preptext

or build from source code using:

git clone https://github.com/styxjedi/preptext.git
cd preptext
python setup.py install

Usage

Each dataset consists of many entries, and each entry contains many fields.

This tool is built around that structure.

For a specific dataset, you just need to follow the steps below.

1. Define some fields

Suppose we are working on a translation task whose dataset contains two fields, src and trg.

import preptext
fields = preptext.Fields()
fields.add_field(
    "src", # name
    init_token="<bos>", # bos token
    eos_token="<eos>", # eos token
    # ... ...
)
fields.add_field(
    "trg", # name
    init_token="<bos>", # bos token
    eos_token="<eos>", # eos token
    # ... ...
)

add_field accepts more parameters than the ones shown above.
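To see the full list of keyword arguments supported by your installed version, you can inspect the method from an interactive session:

# print the signature and docstring of add_field
help(fields.add_field)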

Basically, the same field used across different datasets (train, valid, test) shares the same vocab, while the vocabs of different fields are isolated from each other.

2. Add each entry to a DataStorage

datastorage = preptext.DataStorage()
for src, trg in dataset:
    entry = preptext.Entry([src, trg], fields)
    datastorage.add_entry(entry)

Here, dataset refers to your own translation dataset (a toy end-to-end sketch follows the next snippet).

Or you can collect all entries into a list, then convert the list into a DataStorage.

data = []
for src, trg in dataset:
    entry = preptext.Entry([src, trg], fields)
    data.append(entry)
datastorage = preptext.DataStorage(data)
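For illustration, here is a minimal sketch that fills a DataStorage from a small in-memory toy corpus. The whitespace tokenization is an assumption; pass whatever form of input your fields expect.

import preptext

# hypothetical in-memory toy corpus; replace with your own data source
toy_dataset = [
    ("guten morgen", "good morning"),
    ("ich liebe katzen", "i love cats"),
]

datastorage = preptext.DataStorage()
for src_line, trg_line in toy_dataset:
    # assumption: whitespace-tokenized lists; adjust this if your fields
    # handle tokenization themselves
    entry = preptext.Entry([src_line.split(), trg_line.split()], fields)
    datastorage.add_entry(entry)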

3. Build Vocabulary

Create a vocab for each field:

fields.src.build_vocab(datastorage.src)
fields.trg.build_vocab(datastorage.trg)

Create a shared vocab:

fields.src.build_vocab(
    datastorage.src,
    datastorage.trg
)
fields.trg.vocab = fields.src.vocab

Specify pretrained word vectors:

vec = preptext.Vectors('file/path/to/the/word2vec.txt') # must be a plain-text (.txt) word2vec file
fields.src.build_vocab(datastorage.src, vectors=vec)
fields.trg.build_vocab(datastorage.trg, vectors=vec)

NOTE: If vectors are specified, this package will cache the vector file into a folder named .vector_cache automatically.

Now your data is almost ready. The following methods are ready to use.

# save the prepared data into a binary file
datastorage.dump('data.pkl')
# load from a binary file
datastorage = preptext.DataStorage.load('data.pkl')
# get the i-th entry in text format
i = 0
datastorage.get_entry(i)
# get the i-th entry as numpy arrays
datastorage.get_array(i)
# get vocab matrix
src_matrix = datastorage.fields.src.vocab.vectors
trg_matrix = datastorage.fields.trg.vocab.vectors
# convert into a pytorch dataloader
dl = preptext.converter.to_dataloader(
    datastorage,
    batch_size=50,
    shuffle=True,
    num_workers=4
)
# convert into a bucket dataloader (minimizes padding within each minibatch)
bucketdl = preptext.converter.to_bucketdataloader(
    datastorage,
    key=lambda x: len(x.src),
    batch_size=50,
    shuffle=True,
    num_workers=4
)
# convert into distributed dataloader
distributeddl = preptext.converter.to_distributeddataloader(
    datastorage,
    1, # world_size
    1, # rank
    batch_size=50,
    num_workers=4
)
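Once converted, the result can be consumed like any other PyTorch DataLoader. The exact batch structure depends on preptext's collate function, so the unpacking below is an assumption; check it against your own data.

# minimal sketch of iterating over the converted DataLoader
for batch in dl:
    # assumption: each batch yields padded index tensors for src and trg,
    # in the same order as the fields were defined
    src_batch, trg_batch = batch
    print(src_batch.shape, trg_batch.shape)
    break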

Enjoy!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

preptext-0.1.21.tar.gz (13.6 kB)

Uploaded Source

Built Distribution

preptext-0.1.21-py3-none-any.whl (26.3 kB)

Uploaded Python 3

File details

Details for the file preptext-0.1.21.tar.gz.

File metadata

  • Download URL: preptext-0.1.21.tar.gz
  • Upload date:
  • Size: 13.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3.post20200330 requests-toolbelt/0.9.1 tqdm/4.44.1 CPython/3.7.4

File hashes

Hashes for preptext-0.1.21.tar.gz
Algorithm Hash digest
SHA256 85671a3ba086d3b6e33d8f9227ffad312db96bd1fd392865652f716c44a99f19
MD5 211a09dc45a1f384e5473f6fbfb8a6d4
BLAKE2b-256 3538f89bae6fd94a52b611158db98afc230c6a4f44a500beaadb4766e540fc3a

See more details on using hashes here.

File details

Details for the file preptext-0.1.21-py3-none-any.whl.

File metadata

  • Download URL: preptext-0.1.21-py3-none-any.whl
  • Upload date:
  • Size: 26.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3.post20200330 requests-toolbelt/0.9.1 tqdm/4.44.1 CPython/3.7.4

File hashes

Hashes for preptext-0.1.21-py3-none-any.whl
Algorithm Hash digest
SHA256 b4232848bf5de5a6c6e5166e7f40935b0198bd66ee2849af991a69b17a9ec5b7
MD5 70ab4b73e3698a11367fee55210ebd1b
BLAKE2b-256 5348ea3ccee45b52f0af887fd45424021158cf8ae37c9c7debc36d262341e9b7

See more details on using hashes here.
