
Open Speech Datasets


Open-speech

open-speech is a collection of popular speech datasets. Datasets included in the collection are:

  • Mozilla Common Voice (common_voice)
  • VoxForge (voxforge)
  • LibriSpeech (librispeech)

Datasets have been pre-processed as follows (a sketch of this kind of pipeline appears after the list):

  • Audio files have been resampled to 16kHz.
  • Audio files longer than 680kB (~21.25 seconds of 16-bit audio) have been discarded.
  • Data has been sharded into ~256MB TFRecord files.
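
A minimal sketch of such a preprocessing pass, assuming soundfile and librosa for reading and resampling; the feature names, record layout, and shard naming here are illustrative assumptions, not the project's actual pipeline:

import librosa
import soundfile as sf
import tensorflow as tf

TARGET_SR = 16000
MAX_BYTES = 680_000              # ~21.25 s of 16-bit audio at 16 kHz
SHARD_BYTES = 256 * 1024 * 1024  # ~256MB per TFRecord shard

def preprocess(paths, shard_prefix):
    shard, shard_size, writer = 0, 0, None
    for path in paths:
        audio, sr = sf.read(path, dtype="float32")
        if sr != TARGET_SR:  # resample to 16kHz
            audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
        if audio.size * 2 > MAX_BYTES:  # too long: discard
            continue
        example = tf.train.Example(features=tf.train.Features(feature={
            "uuid": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[path.encode()])),
            "audio": tf.train.Feature(
                float_list=tf.train.FloatList(value=audio.tolist())),
        }))
        record = example.SerializeToString()
        # start a new ~256MB shard when the current one fills up
        if writer is None or shard_size + len(record) > SHARD_BYTES:
            if writer is not None:
                writer.close()
            writer = tf.io.TFRecordWriter(f"{shard_prefix}-{shard:05d}.tfrecord")
            shard, shard_size = shard + 1, 0
        writer.write(record)
        shard_size += len(record)
    if writer is not None:
        writer.close()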

If you find this project useful, please consider a small donation to help me pay for data storage:

Donate with PayPal

Usage examples

open-speech can be used either as one large dataset, or each dataset can be accessed and used on its own.

Get data on each dataset:

import open_speech

for dataset in open_speech.datasets:

    print("         name:", dataset.name)
    print("  sample_rate:", dataset.sample_rate)
    print("        dtype:", dataset.dtype)
    print("   # of files:", len(dataset.files))
    print("# of examples:",
        "train=", len(dataset.train_labels),
        "valid=", len(dataset.valid_labels), "test=", len(dataset.test_labels)
    )
    print()

Output:

         name: common_voice
  sample_rate: 16000
        dtype: <dtype: 'float32'>
   # of files: 631
# of examples: train= 435943 valid= 16028 test= 16012

         name: voxforge
  sample_rate: 16000
        dtype: <dtype: 'float32'>
   # of files: 108
# of examples: train= 76348 valid= 9534 test= 9553

         name: librispeech
  sample_rate: 16000
        dtype: <dtype: 'float32'>
   # of files: 450
# of examples: train= 132542 valid= 2661 test= 2558

Use the entire collection as one large dataset:

import open_speech
import tensorflow as tf

print("  sample_rate:", open_speech.sample_rate)
print("        dtype:", open_speech.dtype)
print("   # of files:", len(open_speech.files))
print("# of examples:",
    "train=", len(open_speech.train_labels),
    "valid=", len(open_speech.valid_labels), "test=", len(open_speech.test_labels)
)
print()

# get a clean set of labels:
#    - convert unicode characters to their ascii equivalents
#    - strip leading and trailing whitespace
#    - convert to lower case
#    - strip all punctuation except for the apostrophe (')
#
clean_labels = {
    uuid: open_speech.clean(label) for uuid, label in open_speech.labels.items()
}

chars = set()
for label in clean_labels.values(): chars |= set(label)
print("alphabet:", sorted(chars))

max_len = len(max(clean_labels.values(), key=len))
print("longest sentence:", max_len, "chars")
print()

def transform(dataset):
    # use open_speech.parse_serial to de-serialize examples;
    # this function will return tuples of (uuid, audio)
    dataset = dataset.map(open_speech.parse_serial)

    # use open_speech.lookup_table to look up and replace uuids
    # with corresponding labels
    table = open_speech.lookup_table(clean_labels)
    dataset = dataset.map(lambda uuid, audio: (audio, table.lookup(uuid)))

    # ... do something ...

    return dataset

train_dataset = transform( open_speech.train_recordset )
valid_dataset = transform( open_speech.valid_recordset )

hist = model.fit(x=train_dataset, validation_data=valid_dataset,
    # ... other parameters ...
)

test_dataset = transform( open_speech.test_recordset )

loss, metrics = model.evaluate(x=test_dataset,
    # ... other parameters ...
)

Output:

  sample_rate: 16000
        dtype: <dtype: 'float32'>
   # of files: 1189
# of examples: train= 644833 valid= 28223 test= 28123

alphabet: [' ', "'", '0', '1', '2', '3', '4', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
longest sentence: 398 chars

...
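
For reference, open_speech.clean can be pictured as something like the following hypothetical re-implementation; it only mirrors the steps documented in the comments above and is not the package's actual code:

import string
import unicodedata

def clean_sketch(label):
    # convert unicode characters to their ascii equivalents
    label = unicodedata.normalize("NFKD", label)
    label = label.encode("ascii", "ignore").decode("ascii")
    # strip leading/trailing whitespace and convert to lower case
    label = label.strip().lower()
    # strip all punctuation except for the apostrophe (')
    drop = set(string.punctuation) - {"'"}
    return "".join(c for c in label if c not in drop)

print(clean_sketch("  Café, señor!  Don't stop. "))
# -> cafe senor  don't stop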

Use an individual dataset:

import open_speech
from open_speech import common_voice
import tensorflow as tf

print("name:", common_voice.name)
print("sample_rate:", common_voice.sample_rate)
print("dtype:", common_voice.dtype)

def transform(dataset):
    # use open_speech.parse_serial to de-serialize examples;
    # this function will return tuples of (uuid, audio)
    dataset = dataset.map(open_speech.parse_serial)

    # use open_speech.lookup_table to look up and replace uuids
    # with corresponding labels
    table = open_speech.lookup_table(common_voice.labels)
    dataset = dataset.map(lambda uuid, audio: (audio, table.lookup(uuid)))

    # ... do something ...

    return dataset

train_dataset = transform( common_voice.train_recordset )
valid_dataset = transform( common_voice.valid_recordset )

hist = model.fit(x=train_dataset, validation_data=valid_dataset,
    # ... other parameters ...
)

Output:

name: common_voice
sample_rate: 16000
dtype: <dtype: 'float32'>

...
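
The lookup tables in these examples can be thought of as in-graph hash tables mapping uuids to label strings. Here is a minimal sketch of that idea using tf.lookup.StaticHashTable; this is an assumption about the mechanism and may differ from how open_speech.lookup_table is actually implemented:

import tensorflow as tf

def lookup_table_sketch(labels, default=""):
    # build an in-graph uuid -> label string mapping
    keys = tf.constant(list(labels.keys()))
    values = tf.constant(list(labels.values()))
    return tf.lookup.StaticHashTable(
        tf.lookup.KeyValueTensorInitializer(keys, values), default)

table = lookup_table_sketch({"uuid-1": "hello world", "uuid-2": "good morning"})
print(table.lookup(tf.constant("uuid-1")).numpy())  # b'hello world'

A typical candidate for the "# ... do something ..." step in transform is padding and batching, e.g. dataset.padded_batch(batch_size), so that variable-length audio clips can be fed to model.fit.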

Authors

  • Dimitry Ishenko - dimitry (dot) ishenko (at) (gee) mail (dot) com

License

This project is distributed under the GNU GPL license. See the LICENSE.md file for details.
