opensubtitles-dataloader
Download, preprocess and use sentences from the OpenSubtitles v2018 dataset without ever needing to load all of it into memory.
Download
See possible languages here.
opensubtitles-download en
Download the tokenized version.
opensubtitles-download en --token
Use in Python
Load
opensubtites_dataset = OpenSubtitlesDataset('en')
Load only the first 1 million lines.
opensubtites_dataset = OpenSubtitlesDataset('en', first_n_lines=1_000_000)
Group sentences into groups of 5.
opensubtites_dataset = OpenSubtitlesDataset('en', 5)
Group sentences into groups ranging from 2 to 5.
opensubtites_dataset = OpenSubtitlesDataset('en', (2,5))
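To illustrate what grouping means here, the sketch below chunks a list of sentences into consecutive groups, either of a fixed size or of a size drawn from a range. It is not the library's implementation: `group_sentences` and its parameters are hypothetical names, and the choice to draw range sizes uniformly at random with both bounds inclusive is an assumption.

```python
import random


def group_sentences(sentences, group_size, seed=None):
    """Chunk sentences into consecutive groups (hypothetical helper).

    group_size is either an int (fixed size) or a (low, high) tuple.
    For a tuple, each group's size is drawn uniformly from low..high
    inclusive. That sampling rule is an assumption, not confirmed
    library behaviour.
    """
    rng = random.Random(seed)
    groups = []
    start = 0
    while start < len(sentences):
        if isinstance(group_size, int):
            size = group_size
        else:
            size = rng.randint(group_size[0], group_size[1])
        # The final group may be shorter if the sentences run out.
        groups.append(sentences[start:start + size])
        start += size
    return groups


sentences = [f"sentence {i}" for i in range(12)]
fixed = group_sentences(sentences, 5)
ranged = group_sentences(sentences, (2, 5), seed=0)
```

With 12 sentences and a fixed size of 5, the last group holds the 2 leftover sentences.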
Split sentences using "\n".
opensubtites_dataset = OpenSubtitlesDataset('en', delimiter="\n")
Apply a custom preprocessing function.
opensubtites_dataset = OpenSubtitlesDataset('en', preprocess_function=my_preprocessing_function)
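The README does not show what `my_preprocessing_function` looks like. A minimal sketch follows, assuming the function receives one sentence as a string and returns the cleaned string; that signature is an assumption and should be checked against the package source.

```python
import re


def my_preprocessing_function(sentence):
    """Normalize one subtitle sentence (assumed signature: str -> str)."""
    sentence = sentence.strip().lower()
    # Subtitle lines often start with a dialogue dash; drop it.
    sentence = re.sub(r"^-\s*", "", sentence)
    # Collapse runs of whitespace into a single space.
    sentence = re.sub(r"\s+", " ", sentence)
    return sentence


cleaned = my_preprocessing_function("  - Hello   World!  ")
```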
Split for Training
train, valid, test = opensubtites_dataset.split()
Set the fractions of the original dataset assigned to the train, validation and test splits.
train, valid, test = opensubtites_dataset.split([0.7, 0.15, 0.15])
Use a seed for a reproducible split.
train, valid, test = opensubtites_dataset.split(seed=42)
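How fractions and a seed interact can be sketched on plain indices. The helper below is illustrative only: `split_indices` is a hypothetical name, and shuffling before slicing is an assumption about how such a split is typically done, not a description of what `split()` does internally.

```python
import random


def split_indices(n, fractions, seed=None):
    """Shuffle range(n) with a seed and cut it into train/valid/test parts.

    Hypothetical helper: fractions is a (train, valid, test) triple
    that is expected to sum to 1.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    # The test part takes whatever remains, so no index is lost to rounding.
    test = indices[n_train + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(100, (0.7, 0.15, 0.15), seed=42)
```

Calling the helper twice with the same seed yields the same three index lists, which is the point of passing a seed.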
Access
By index.
train, valid, test = OpenSubtitlesDataset('en').split()
train[20_000]
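Index access without loading the whole file into memory can be done by recording the byte offset of every line once and seeking to it on demand. The sketch below shows that general technique on a small temporary file. `LazyLines` is a hypothetical class, and nothing here is taken from the package's source, which may work differently.

```python
import os
import tempfile


class LazyLines:
    """Map-style access to the lines of a text file via byte offsets."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        offset = 0
        # One pass over the file: remember where each line starts.
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Seek straight to the requested line and read only that line.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            return f.readline().decode("utf-8").rstrip("\n")


# Demo on a throwaway file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first line\nsecond line\nthird line\n")

lines = LazyLines(tmp.name)
second = lines[1]
last = lines[-1]
os.remove(tmp.name)
```

Only one integer per line stays in memory. Because the class defines `__len__` and `__getitem__`, an object like this has the shape that map-style consumers such as a PyTorch `DataLoader` expect.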
With a PyTorch DataLoader.
from torch.utils.data import DataLoader
train, valid, test = OpenSubtitlesDataset('en').split()
train_loader = DataLoader(train, batch_size=16)