Open Speech Datasets
Project description
Open-speech
open-speech is a collection of popular speech datasets. Datasets included in
the collection are:
- Mozilla Common Voice (common_voice)
- VoxForge (voxforge)
- LibriSpeech (librispeech)
Datasets have been pre-processed as follows:
- Audio files have been resampled to 16kHz.
- Audio files larger than ~680 kB (~21.25 seconds) have been discarded (see the quick size check after this list).
- Data has been sharded into ~256MB TFRecord files.
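As a rough sanity check on the size cap above (a sketch only; it assumes the cap refers to 16-bit source samples, which the description does not state):
sample_rate = 16000        # Hz, per the preprocessing notes above
max_duration = 21.25       # seconds
bytes_per_sample = 2       # assumption: 16-bit source audio

max_samples = int(sample_rate * max_duration)    # 340,000 samples
max_bytes = max_samples * bytes_per_sample       # 680,000 bytes, i.e. ~680 kB
print(max_samples, max_bytes)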
If you find this project useful, please consider a small donation to help me pay for data storage:
Usage examples
open-speech can be used either as one large combined dataset, or each dataset in
the collection can be accessed and used on its own.
Get data on each dataset:
import open_speech
for dataset in open_speech.datasets:
print(" name:", dataset.name)
print(" sample_rate:", dataset.sample_rate)
print(" dtype:", dataset.dtype)
print(" # of files:", len(dataset.files))
print("# of examples:",
"train=", len(dataset.train_labels),
"valid=", len(dataset.valid_labels), "test=", len(dataset.test_labels)
)
print()
Output:
name: common_voice
sample_rate: 16000
dtype: <dtype: 'float32'>
# of files: 631
# of examples: train= 435943 valid= 16028 test= 16012
name: voxforge
sample_rate: 16000
dtype: <dtype: 'float32'>
# of files: 108
# of examples: train= 76348 valid= 9534 test= 9553
name: librispeech
sample_rate: 16000
dtype: <dtype: 'float32'>
# of files: 450
# of examples: train= 132542 valid= 2661 test= 2558
Use entire collection as one large dataset:
import open_speech
import tensorflow as tf
print(" sample_rate:", open_speech.sample_rate)
print(" dtype:", open_speech.dtype)
print(" # of files:", len(open_speech.files))
print("# of examples:",
"train=", len(open_speech.train_labels),
"valid=", len(open_speech.valid_labels), "test=", len(open_speech.test_labels)
)
print()
# get a clean set of labels:
# - convert unicode characters to their ascii equivalents
# - strip leading and trailing whitespace
# - convert to lower case
# - strip all punctuation except for the apostrophe (')
#
clean_labels = {
    uuid: open_speech.clean(label) for uuid, label in open_speech.labels.items()
}
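# For reference, a rough stand-in for what open_speech.clean does (a sketch,
# not the library's actual implementation): fold unicode to ascii, trim
# whitespace, lower-case, and drop everything except letters, digits, spaces
# and the apostrophe.
import re, unicodedata

def clean_approx(label):
    label = unicodedata.normalize("NFKD", label)
    label = label.encode("ascii", "ignore").decode("ascii")
    label = label.strip().lower()
    return re.sub(r"[^a-z0-9' ]", "", label)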
chars = set()
for label in clean_labels.values(): chars |= set(label)
print("alphabet:", sorted(chars))
max_len = len(max(clean_labels.values(), key=len))
print("longest sentence:", max_len, "chars")
print()
def transform(dataset):
    # use open_speech.parse_serial to de-serialize examples;
    # this function will return tuples of (uuid, audio)
    dataset = dataset.map(open_speech.parse_serial)
    # use open_speech.lookup_table to look up and replace uuids
    # with corresponding labels
    table = open_speech.lookup_table(clean_labels)
    dataset = dataset.map(lambda uuid, audio: (audio, table.lookup(uuid)))
    # ... do something ...
    return dataset
train_dataset = transform( open_speech.train_recordset )
valid_dataset = transform( open_speech.valid_recordset )
hist = model.fit(x=train_dataset, validation_data=valid_dataset,
    # ... other parameters ...
)

test_dataset = transform( open_speech.test_recordset )

loss, metrics = model.evaluate(x=test_dataset,
    # ... other parameters ...
)
Output:
sample_rate: 16000
dtype: <dtype: 'float32'>
# of files: 1189
# of examples: train= 644833 valid= 28223 test= 28123
alphabet: [' ', "'", '0', '1', '2', '3', '4', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
longest sentence: 398 chars
...
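The "# ... do something ..." placeholder above is where batching would normally go. Below is a minimal sketch (batch_for_training and the batch size are hypothetical, not part of the open_speech API): it pads the variable-length audio clips in each batch, assuming the transformed dataset yields (audio, label) pairs as above and that model is an already-compiled tf.keras model.
def batch_for_training(dataset, batch_size=32):
    # audio is a variable-length float32 vector; the label is a scalar string
    return dataset.padded_batch(
        batch_size,
        padded_shapes=([None], []),
        drop_remainder=True,
    )

train_dataset = batch_for_training(transform(open_speech.train_recordset))
valid_dataset = batch_for_training(transform(open_speech.valid_recordset))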
Use an individual dataset:
import open_speech
from open_speech import common_voice
import tensorflow as tf
print("name:", common_voice.name)
print("sample_rate:", common_voice.sample_rate)
print("dtype:", common_voice.dtype)
def transform(dataset):
    # use open_speech.parse_serial to de-serialize examples;
    # this function will return tuples of (uuid, audio)
    dataset = dataset.map(open_speech.parse_serial)
    # use open_speech.lookup_table to look up and replace uuids
    # with corresponding labels
    table = open_speech.lookup_table(common_voice.labels)
    dataset = dataset.map(lambda uuid, audio: (audio, table.lookup(uuid)))
    # ... do something ...
    return dataset
train_dataset = transform( common_voice.train_recordset )
valid_dataset = transform( common_voice.valid_recordset )
hist = model.fit(x=train_dataset, validation_data=valid_dataset,
    # ... other parameters ...
)
Output:
name: common_voice
sample_rate: 16000
dtype: <dtype: 'float32'>
...
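As a quick sanity check (a sketch assuming TensorFlow 2.x eager execution), one example can be pulled from the transformed dataset and inspected:
for audio, label in train_dataset.take(1):
    print("audio shape:", audio.shape, "label:", label.numpy())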
Authors
- Dimitry Ishenko - dimitry (dot) ishenko (at) (gee) mail (dot) com
License
This project is distributed under the GNU GPL license. See the LICENSE.md file for details.
Download files
Source Distribution: open_speech-5.5.tar.gz (9.3 kB)
Built Distribution: open_speech-5.5-py3-none-any.whl (22.6 kB)
File details
Details for the file open_speech-5.5.tar.gz.
File metadata
- Download URL: open_speech-5.5.tar.gz
- Upload date:
- Size: 9.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e9f5f9f8c88c4355d909aae92bb6655b92cec2ef467ff6cef3d8ffea8c880dde |
| MD5 | eca6eed54bfc6a6754c6bf67b9f33b6c |
| BLAKE2b-256 | 36cb509d7124c93b8901335886a660c63f5847e1f424854c8c0835bba177a3e9 |
File details
Details for the file open_speech-5.5-py3-none-any.whl.
File metadata
- Download URL: open_speech-5.5-py3-none-any.whl
- Upload date:
- Size: 22.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0182b0c8f54189b2fcd7518b02b58d3aab454b13da976fe78afed4b7a7d80ac2 |
| MD5 | 6472d6e8b5acd7afd35db5b26ef8ce50 |
| BLAKE2b-256 | cffdde581fd910f161e3128da33ae1fcafad5e40cab28d1abc60413231a4b50f |