Utilities for preprocessing texts
Project description
preptext
Utilities for preparing datasets for deep learning models. It can easily convert your text-formatted data into a PyTorch DataLoader.
Installation
pip install preptext
Or build from source:
git clone https://github.com/styxjedi/preptext.git
cd preptext
python setup.py install
Usage
A dataset consists of many entries, and each entry contains several fields; this tool is built around that structure.
For a specific dataset, you just need to follow the steps below.
1. Define some fields
Suppose we are working on a translation task whose dataset contains two fields, src and trg.
import preptext

fields = preptext.Fields()
fields.add_field(
    "src",               # name
    init_token="<bos>",  # bos token
    eos_token="<eos>",   # eos token
    # ...
)
fields.add_field(
    "trg",               # name
    init_token="<bos>",  # bos token
    eos_token="<eos>",   # eos token
    # ...
)
add_field accepts more parameters than the ones shown here; you will see them when you call it.
Basically, a field reused across different splits (train, valid, test) shares a single vocab, while the vocabs of different fields are kept isolated from each other.
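For illustration, a more fully specified field might look like the sketch below. Only name, init_token, and eos_token appear in this README; the remaining keyword arguments (tokenize, lower, fix_length) are assumptions modeled on the torchtext-style Field API and may not match preptext's actual signature:

fields.add_field(
    "src",               # name
    init_token="<bos>",  # bos token
    eos_token="<eos>",   # eos token
    tokenize=str.split,  # assumed: how raw strings are split into tokens
    lower=True,          # assumed: lowercase every token
    fix_length=None,     # assumed: pad/truncate to a fixed length if set
)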
2. Add each entry to a DataStorage
datastorage = preptext.DataStorage()
for src, trg in dataset:
    entry = preptext.Entry([src, trg], fields)
    datastorage.add_entry(entry)
Here, dataset is your own translation dataset. Alternatively, you can collect all entries into a list, then convert the list into a DataStorage:
data = []
for src, trg in dataset:
    entry = preptext.Entry([src, trg], fields)
    data.append(entry)
datastorage = preptext.DataStorage(data)
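If your corpus lives in two line-aligned plain-text files, a minimal sketch of such a dataset iterable might look like the following; the file names and whitespace tokenization are assumptions, not part of preptext:

def read_parallel_corpus(src_path, trg_path):
    # Yield (src, trg) token-list pairs from two line-aligned text files.
    with open(src_path, encoding="utf-8") as fs, \
         open(trg_path, encoding="utf-8") as ft:
        for src_line, trg_line in zip(fs, ft):
            # Whitespace tokenization is a placeholder; use your own tokenizer.
            yield src_line.strip().split(), trg_line.strip().split()

dataset = list(read_parallel_corpus("train.src", "train.trg"))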
3. Build Vocabulary
Create a vocab for each field:
fields.src.build_vocab(datastorage.src)
fields.trg.build_vocab(datastorage.trg)
Create a shared vocab:
fields.src.build_vocab(
datastorage.src,
datastorage.trg
)
fields.trg.vocab = fields.src.vocab
Specify pretrained word vectors:
vec = preptext.Vectors('file/path/to/the/word2vec.txt')  # must be in txt format
fields.src.build_vocab(datastorage.src, vectors=vec)
fields.trg.build_vocab(datastorage.trg, vectors=vec)
NOTE: If vectors are specified, this package will automatically cache the vector file in a folder named .vector_cache.
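With vectors attached, a common next step is to initialize a PyTorch embedding layer from the resulting vocab matrix. A minimal sketch, assuming vocab.vectors is a FloatTensor as in torchtext (wrap it with torch.from_numpy first if your version returns a numpy array):

import torch.nn as nn

# Rows of the embedding are the pretrained vectors, indexed by vocab id;
# freeze=False keeps them trainable.
src_embedding = nn.Embedding.from_pretrained(fields.src.vocab.vectors, freeze=False)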
Now your data is almost ready. The following methods are available:
# save the prepared data into a binary file
datastorage.dump('data.pkl')

# load from a binary file
datastorage = preptext.DataStorage.load('data.pkl')

# get the i-th entry in text format
i = 0
datastorage.get_entry(i)

# get the i-th entry in numpy array format
datastorage.get_array(i)

# get the vocab matrices
src_matrix = datastorage.fields.src.vocab.vectors
trg_matrix = datastorage.fields.trg.vocab.vectors

# convert into a pytorch dataloader
dl = preptext.converter.to_dataloader(
    datastorage,
    batch_size=50,
    shuffle=True,
    num_workers=4
)

# convert into a bucket dataloader (minimizes padding in each minibatch)
bucketdl = preptext.converter.to_bucketdataloader(
    datastorage,
    key=lambda x: len(x.src),
    batch_size=50,
    shuffle=True,
    num_workers=4
)

# convert into a distributed dataloader
distributeddl = preptext.converter.to_distributeddataloader(
    datastorage,
    1,  # world_size
    1,  # rank
    batch_size=50,
    num_workers=4
)
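The result is a regular PyTorch DataLoader, so batches can be consumed in an ordinary training loop. A minimal sketch; the exact batch structure (assumed here to be a (src, trg) pair of padded index tensors) depends on your fields and is an assumption, not documented preptext behavior:

for batch in dl:
    # Inspect the first minibatch and stop; replace this with your training step.
    src_batch, trg_batch = batch
    print(src_batch.shape, trg_batch.shape)
    break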
Enjoy!