opensubtitles-dataloader

Download, preprocess and use sentences from the OpenSubtitles v2018 dataset without ever needing to load all of it into memory.
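One common way to read individual lines from a very large text file without loading it is to index the byte offset of each line once and then seek on demand. The sketch below illustrates that general technique only; `LazyLineFile` is a hypothetical name and is not taken from this package, whose internals may work differently.

```python
class LazyLineFile:
    """Random access to the lines of a text file without reading it all into memory.

    Hypothetical helper for illustration; not part of opensubtitles-dataloader.
    """

    def __init__(self, path):
        self.path = path
        self.offsets = []
        # One sequential pass records the byte offset at which each line starts.
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Jump straight to the requested line and read only that line.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            return f.readline().decode("utf-8").rstrip("\r\n")
```

Only one integer per line is kept in memory, so the cost grows with the number of lines rather than with the size of the text.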

Download

See possible languages here.

opensubtitles-download en

Download the tokenized version.

opensubtitles-download en --token

Use in Python

Load

opensubtites_dataset = OpenSubtitlesDataset('en')

Load only the first 1 million lines.

opensubtites_dataset = OpenSubtitlesDataset('en', first_n_lines=1_000_000)

Group sentences into groups of 5.

opensubtites_dataset = OpenSubtitlesDataset('en', 5)

Group sentences into groups ranging from 2 to 5.

opensubtites_dataset = OpenSubtitlesDataset('en', (2,5))

Split sentences using "\n".

opensubtites_dataset = OpenSubtitlesDataset('en', delimiter="\n")
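To make the grouping and delimiter options concrete, here is a minimal sketch of one plausible reading: consecutive sentences are joined with the delimiter, and a `(low, high)` tuple means each group size is drawn at random from that range. The function `group_sentences`, its default delimiter, and the random-size interpretation are assumptions for illustration, not the package's confirmed behaviour.

```python
import random


def group_sentences(sentences, group_size, delimiter=" ", seed=None):
    """Join consecutive sentences into groups (hypothetical illustration).

    group_size is either an int (fixed size) or a (low, high) tuple,
    in which case each group's size is drawn uniformly from that range.
    """
    rng = random.Random(seed)
    groups = []
    i = 0
    while i < len(sentences):
        if isinstance(group_size, tuple):
            low, high = group_size
            size = rng.randint(low, high)  # inclusive on both ends
        else:
            size = group_size
        # The final group may be shorter if the sentences run out.
        groups.append(delimiter.join(sentences[i:i + size]))
        i += size
    return groups
```

For example, `group_sentences(["a", "b", "c", "d", "e"], 2)` yields three groups, the last of which holds the single leftover sentence.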

Apply a custom preprocessing function.

opensubtites_dataset = OpenSubtitlesDataset('en', preprocess_function=my_preprocessing_function)
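The name `my_preprocessing_function` is left undefined above. A minimal sketch of what such a function might look like follows, assuming it receives one string and returns the cleaned string; that signature is an assumption, so check it against the package before relying on it.

```python
import re


def my_preprocessing_function(sentence: str) -> str:
    """Example cleanup for a subtitle line (assumed str -> str signature)."""
    sentence = sentence.strip().lower()
    # Subtitles often mark a change of speaker with a leading dash.
    sentence = re.sub(r"^-\s*", "", sentence)
    # Collapse runs of whitespace into a single space.
    sentence = re.sub(r"\s+", " ", sentence)
    return sentence
```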

Split for Training

train, valid, test = opensubtites_dataset.split()

Set the fraction of the original dataset assigned to each split.

train, valid, test = opensubtites_dataset.split([0.7, 0.15, 0.15])

Use a seed to make the split reproducible.

train, valid, test = opensubtites_dataset.split(seed=42)
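As an illustration of how fractions and a seed can determine a split, the sketch below shuffles the item indices with a seeded generator and cuts them into consecutive parts. `split_indices` and its default fractions are hypothetical; whether the package shuffles, and what its defaults are, is not stated here.

```python
import random


def split_indices(n_items, fractions=(0.8, 0.1, 0.1), seed=None):
    """Partition range(n_items) into shuffled parts sized by fractions.

    Hypothetical helper; the default fractions are an assumption.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    parts = []
    start = 0
    for k, fraction in enumerate(fractions):
        if k == len(fractions) - 1:
            end = n_items  # the last part takes whatever remains
        else:
            end = min(n_items, start + round(n_items * fraction))
        parts.append(indices[start:end])
        start = end
    return parts
```

Every index lands in exactly one part, and calling the function twice with the same seed returns identical parts.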

Access

Access by index.

train, valid, test = OpenSubtitlesDataset('en').split()
train[20_000]

Use with PyTorch.

from torch.utils.data import DataLoader
train, valid, test = OpenSubtitlesDataset('en').split()
train_loader = DataLoader(train, batch_size=16)
