A Python library for using the Yandex SpeechKit API.
Project description
Yandex Speechkit Python SDK
A Python library for using the Yandex SpeechKit API.
For more information, please visit the Yandex SpeechKit API docs. This library supports short and long audio recognition, as well as speech synthesis.
Getting Started
Assuming that you have Python and virtualenv installed, set up your environment and install the required dependencies like this, or install the library from PyPI using pip:
$ git clone https://github.com/TikhonP/yandex-speechkit-lib-python.git
$ cd yandex-speechkit-lib-python
$ virtualenv venv
...
$ . venv/bin/activate
$ python -m pip install -r requirements.txt
$ python -m pip install .
$ python -m pip install speechkit
Using speechkit
The library supports recognition of long and short audio, as well as speech synthesis. For more information, please read the docs below.
For short audio
From a Python interpreter:
>>> import speechkit
>>> recognizeShortAudio = speechkit.RecognizeShortAudio('<yandex_passport_oauth_token>')
>>> with open('/Users/tikhon/Desktop/out.wav', 'rb') as f:
... data = f.read()
...
>>> recognizeShortAudio.recognize(data, folderId='<folder id>', format='lpcm', sampleRateHertz='48000')
'Текст который нужно распознать'
For synthesis
>>> import speechkit
>>> synthesizeAudio = speechkit.SynthesizeAudio('<yandex_passport_oauth_token>')
>>> synthesizeAudio.synthesize('/Users/tikhon/Desktop/outtt.wav', text='Текст который нужно синтезировать', voice='oksana', format='lpcm', sampleRateHertz='16000', folderId='<folder id>')
Read the documentation for more methods.
Speechkit documentation
Module contents
speechkit Python SDK for using Yandex Speech recognition and synthesis
exception speechkit.InvalidDataError()
Bases: ValueError
Exception raised when the given data is not valid
class speechkit.ObjectStorage(**kwargs)
Bases: object
Interact with AWS-compatible object storage.
Parameters
- aws_access_key_id (string) – AWS access key ID. This is entirely optional, and if not provided, the credentials configured for the session will automatically be used. You only need to provide this argument if you want to override the credentials used for this specific client.
- aws_secret_access_key (string) – AWS secret access key. Same semantics as aws_access_key_id above.
create_presigned_url(bucket_name, aws_file_name, expiration=3600)
Generate a presigned URL to share an S3 object
Parameters
- bucket_name (string) – Name of the bucket
- aws_file_name (string) – Name of file in object storage
- expiration (integer) – Time in seconds for the presigned URL to remain valid
Returns
Presigned URL as string.
delete_object(aws_file_name, bucket_name)
Delete an object in a bucket
Parameters
- aws_file_name (string) – Name of file in object storage
- bucket_name (string) – Name of the bucket
list_objects_in_bucket(bucket_name)
Get a list of all objects in the bucket
upload_file(file_path, baket_name, aws_file_name)
Upload a file to object storage
Parameters
- file_path (string) – Path to input file
- baket_name (string) – Name of the bucket
- aws_file_name (string) – Name of file in object storage
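Long-audio recognition expects a link to a file stored in Yandex Object Storage, so a typical flow is to upload the audio first and then pass its URI to RecognizeLongAudio.send_for_recognition(). A minimal sketch; the bucket name, file name, and the storage.yandexcloud.net endpoint are illustrative assumptions, and the upload calls are shown commented out because they require real credentials:

```python
# Sketch: build the Object Storage URI that send_for_recognition()
# expects for an uploaded file. Endpoint and names are assumptions.

def object_storage_uri(bucket_name, aws_file_name):
    # Yandex Object Storage conventionally exposes objects at this host.
    return "https://storage.yandexcloud.net/{}/{}".format(bucket_name, aws_file_name)

# obj_storage = speechkit.ObjectStorage(
#     aws_access_key_id='<id>', aws_secret_access_key='<secret>')
# obj_storage.upload_file('audio.ogg', 'my-bucket', 'audio.ogg')
uri = object_storage_uri('my-bucket', 'audio.ogg')
# recognizeLongAudio.send_for_recognition(uri)
```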
class speechkit.RecognizeLongAudio(api_key)
Bases: object
Long audio fragment recognition can be used
for multi-channel audio files up to 1 GB.
To recognize long audio fragments, you need to execute 2 requests:
* Send a file for recognition.
* Get recognition results.
```python
>>> recognizeLongAudio = RecognizeLongAudio('<Api-Key>')
>>> recognizeLongAudio.send_for_recognition('<object storage uri>')
>>> if recognizeLongAudio.get_recognition_results():
... data = recognizeLongAudio.get_data()
...
>>> recognizeLongAudio.get_raw_text()
'raw recognized text'
```
Initialize with an Api-Key for recognizing long audio
Parameters
- api_key (string) – The API key is a private key used for simplified authorization in the Yandex.Cloud API.
get_data()
Get the response. Use RecognizeLongAudio.get_recognition_results() first to store answer_data. Contains a list of recognition results (chunks[]).
Returns
Each result in the chunks[] list contains the following fields:
- alternatives[]: List of recognized text alternatives. Each alternative contains the following fields:
  * words[]: List of recognized words:
    * startTime: Time stamp of the beginning of the word in the recording. An error of 1-2 seconds is possible.
    * endTime: Time stamp of the end of the word. An error of 1-2 seconds is possible.
    * word: Recognized word. Recognized numbers are written in words (for example, twelve rather than 12).
    * confidence: This field currently isn’t supported. Don’t use it.
  * text: Full recognized text. By default, numbers are written in figures. To output the entire text in words, specify true in the raw_results field.
  * confidence: This field currently isn’t supported. Don’t use it.
- channelTag: Audio channel that recognition was performed for.
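The chunks[] structure above can be walked like this. The response dict below is a hand-made sample following the documented shape, not real API output:

```python
# Walk the documented chunks[] structure and collect the full
# recognized text, taking the first alternative of each chunk.
response = {
    "chunks": [
        {
            "channelTag": "1",
            "alternatives": [
                {
                    "text": "twelve",
                    "words": [
                        {"startTime": "0.4s", "endTime": "0.8s", "word": "twelve"},
                    ],
                },
            ],
        },
    ],
}

texts = []
for chunk in response["chunks"]:
    # The first alternative is the most likely transcription.
    best = chunk["alternatives"][0]
    texts.append(best["text"])

full_text = " ".join(texts)
print(full_text)  # twelve
```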
get_raw_text()
Get raw text from answer_data
Returns
Recognized text as a string
get_recognition_results()
Monitor the recognition results using the received ID. The number of result monitoring requests is limited, so consider the recognition speed: it takes about 10 seconds to recognize 1 minute of single-channel audio.
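Because monitoring requests are limited, it is worth polling at an interval matched to the audio length (roughly 10 seconds of processing per minute of single-channel audio). A generic polling sketch; the check is passed in as a callable, so it is not tied to this class, and the commented usage line is an assumption about how you would wire it up:

```python
import time

def poll(check, interval=10, max_attempts=30):
    """Call `check` every `interval` seconds until it returns a truthy
    value or `max_attempts` is exhausted. Returns the truthy result,
    or None if the attempts ran out."""
    for _ in range(max_attempts):
        result = check()
        if result:
            return result
        time.sleep(interval)
    return None

# Usage sketch (assumed wiring, requires a real recognition in flight):
# if poll(recognizeLongAudio.get_recognition_results, interval=10):
#     data = recognizeLongAudio.get_data()
```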
send_for_recognition(uri, **kwargs)
Send a file for recognition
Parameters
- uri (string) – The URI of the audio file for recognition. Supports only links to files stored in Yandex Object Storage.
- languageCode (string) – The language that recognition will be performed for. Only Russian is currently supported (ru-RU).
- model (string) – The language model to be used for recognition. Default value: general.
- profanityFilter (boolean) – The profanity filter.
- audioEncoding (string) – The format of the submitted audio. Acceptable values:
  * LINEAR16_PCM: LPCM with no WAV header.
  * OGG_OPUS (default): OggOpus format.
- sampleRateHertz (integer) – The sampling frequency of the submitted audio. Required if format is set to LINEAR16_PCM. Acceptable values:
  * 48000 (default): Sampling rate of 48 kHz.
  * 16000: Sampling rate of 16 kHz.
  * 8000: Sampling rate of 8 kHz.
- audioChannelCount (integer) – The number of channels in LPCM files. By default, 1. Don’t use this field for OggOpus files.
- rawResults (boolean) – Flag that indicates how to write numbers. true: in words. false (default): in figures.
class speechkit.RecognizeShortAudio(yandex_passport_oauth_token)
Bases: object
Short audio recognition ensures fast response time and is suitable for short single-channel audio.
Audio requirements:
* Maximum file size: 1 MB.
* Maximum length: 30 seconds.
* Maximum number of audio channels: 1.
Gets an IAM token and stores it in RecognizeShortAudio.token
Parameters
- yandex_passport_oauth_token (string) – OAuth token from Yandex.OAuth
recognize(data, **kwargs)
Recognize text from the given audio data
Parameters
- data (io.BytesIO) – Data with audio samples to recognize
- lang (string) – The language to use for recognition. Acceptable values:
  * ru-RU (by default) — Russian.
  * en-US — English.
  * tr-TR — Turkish.
- topic (string) – The language model to be used for recognition. Default value: general.
- profanityFilter (boolean) – This parameter controls the profanity filter in recognized speech.
- format (string) – The format of the submitted audio. Acceptable values:
  * lpcm — LPCM with no WAV header.
  * oggopus (default) — OggOpus.
- sampleRateHertz (string) – The sampling frequency of the submitted audio. Used if format is set to lpcm. Acceptable values:
  * 48000 (default) — Sampling rate of 48 kHz.
  * 16000 — Sampling rate of 16 kHz.
  * 8000 — Sampling rate of 8 kHz.
- folderId (string) – ID of the folder that you have access to. Required for authorization with a user account. Don’t specify this field if you make a request on behalf of a service account.
Returns
The recognized text as a string
exception speechkit.RequestError(answer: dict)
Bases: Exception
Exception raised for errors during a Yandex API request
class speechkit.SynthesizeAudio(yandex_passport_oauth_token)
Bases: object
Generates speech from received text.
Parameters
- yandex_passport_oauth_token (string) – OAuth token from Yandex.OAuth
synthesize(file_path, **kwargs)
Generates speech from received text and saves it to a file
Parameters
- file_path (string) – The path to the file where the data is stored
- text (string) – UTF-8 encoded text to be converted to speech. You can only use one of the text and ssml fields. For homographs, place a + before the stressed vowel. For example, contr+ol or def+ect. To indicate a pause between words, use -. Maximum string length: 5000 characters.
- ssml (string) – Text in SSML format to be converted into speech. You can only use one of the text and ssml fields.
- lang (string) – Language. Acceptable values:
  * ru-RU (default) — Russian.
  * en-US — English.
  * tr-TR — Turkish.
- voice (string) – Preferred speech synthesis voice from the list. Default value: oksana.
- speed (string) – Rate (speed) of synthesized speech, set as a decimal number in the range from 0.1 to 3.0, where:
  * 3.0 — Fastest rate.
  * 1.0 (default) — Average human speech rate.
  * 0.1 — Slowest speech rate.
- format (string) – The format of the synthesized audio. Acceptable values:
  * lpcm — Audio is synthesized in LPCM format with no WAV header. Audio properties: sampling of 8, 16, or 48 kHz depending on the value of the sampleRateHertz parameter; 16-bit depth; little-endian byte order; audio data stored as signed integers.
  * oggopus (default) — Data in the audio file is encoded using the OPUS audio codec and compressed using the OGG container format (OggOpus).
- sampleRateHertz (string) – The sampling frequency of the synthesized audio. Used if format is set to lpcm. Acceptable values:
  * 48000 (default): Sampling rate of 48 kHz.
  * 16000: Sampling rate of 16 kHz.
  * 8000: Sampling rate of 8 kHz.
- folderId (string) – ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don’t specify this field if you make a request on behalf of a service account.
synthesize_stream(**kwargs)
Generates speech from received text and returns an io.BytesIO object with the data.
Parameters
- text (string) – UTF-8 encoded text to be converted to speech. You can only use one of the text and ssml fields. For homographs, place a + before the stressed vowel. For example, contr+ol or def+ect. To indicate a pause between words, use -. Maximum string length: 5000 characters.
- ssml (string) – Text in SSML format to be converted into speech. You can only use one of the text and ssml fields.
- lang (string) – Language. Acceptable values:
  * ru-RU (default) — Russian.
  * en-US — English.
  * tr-TR — Turkish.
- voice (string) – Preferred speech synthesis voice from the list. Default value: oksana.
- speed (string) – Rate (speed) of synthesized speech, set as a decimal number in the range from 0.1 to 3.0, where:
  * 3.0 — Fastest rate.
  * 1.0 (default) — Average human speech rate.
  * 0.1 — Slowest speech rate.
- format (string) – The format of the synthesized audio. Acceptable values:
  * lpcm — Audio is synthesized in LPCM format with no WAV header. Audio properties: sampling of 8, 16, or 48 kHz depending on the value of the sampleRateHertz parameter; 16-bit depth; little-endian byte order; audio data stored as signed integers.
  * oggopus (default) — Data in the audio file is encoded using the OPUS audio codec and compressed using the OGG container format (OggOpus).
- sampleRateHertz (string) – The sampling frequency of the synthesized audio. Used if format is set to lpcm. Acceptable values:
  * 48000 (default): Sampling rate of 48 kHz.
  * 16000: Sampling rate of 16 kHz.
  * 8000: Sampling rate of 8 kHz.
- folderId (string) – ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don’t specify this field if you make a request on behalf of a service account.
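synthesize_stream is useful when you want the audio in memory instead of written straight to disk. A sketch of saving the returned io.BytesIO to a file; the stream below is a stand-in for a real API response, since a real call requires a token and folder ID:

```python
import io

def save_stream(stream, file_path):
    # `stream` is the io.BytesIO returned by synthesize_stream().
    with open(file_path, 'wb') as f:
        f.write(stream.getvalue())

# Stand-in for a real call such as:
# stream = synthesizeAudio.synthesize_stream(
#     text='...', voice='oksana', format='lpcm', folderId='<folder id>')
stream = io.BytesIO(b'\x00\x01fake-audio-bytes')
save_stream(stream, 'out.raw')
```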
Hashes for speechkit-1.3.4-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7a7ed2f3051e3736cb65ed8247c868e13c5df1503cf8850c50f9560905d62152 |
| MD5 | 75b6dd17763ed1cf0512be8d83ef9608 |
| BLAKE2b-256 | 7c92aada73269be2f4efd18bed5076df0f26440a951e4c535621f780733454e7 |