
Yandex Speechkit Python SDK


A Python library for the Yandex SpeechKit API.

For more information, please visit the Yandex SpeechKit API docs. This library supports both short and long audio recognition, as well as speech synthesis.

Getting Started

Assuming that you have Python and virtualenv installed, you can either set up your environment and install from source, or install the library from PyPI using pip. From source:

$ git clone https://github.com/TikhonP/yandex-speechkit-lib-python.git
$ cd yandex-speechkit-lib-python
$ virtualenv venv
...
$ . venv/bin/activate
$ python -m pip install -r requirements.txt
$ python -m pip install .
Or from PyPI:

$ python -m pip install speechkit

Using speechkit

Recognition of long and short audio, as well as synthesis, are supported. For more information, please read the docs below.

For short audio

From a Python interpreter:

>>> import speechkit
>>> recognizeShortAudio = speechkit.RecognizeShortAudio('<yandex_passport_oauth_token>')
>>> with open('/Users/tikhon/Desktop/out.wav', 'rb') as f:
...     data = f.read()
... 
>>> recognizeShortAudio.recognize(data, folderId='<folder id>', format='lpcm', sampleRateHertz='48000')
'Text that needs to be recognized'

For synthesis

>>> import speechkit
>>> synthesizeAudio = speechkit.SynthesizeAudio('<yandex_passport_oauth_token>')
>>> synthesizeAudio.synthesize('/Users/tikhon/Desktop/outtt.wav', text='Text that needs to be synthesized', voice='oksana', format='lpcm', sampleRateHertz='16000', folderId='<folder id>')

Read the documentation for more methods:

Speechkit documentation

Module contents

speechkit – Python SDK for Yandex speech recognition and synthesis

exception speechkit.InvalidDataError()

Bases: ValueError

Exception raised when the given data is not valid.

class speechkit.ObjectStorage(**kwargs)

Bases: object

Interact with AWS object storage.

  • Parameters

    • aws_access_key_id (string) – The access key to use when creating the client. This is entirely optional, and if not provided, the credentials configured for the session will automatically be used. You only need to provide this argument if you want to override the credentials used for this specific client.

    • aws_secret_access_key (string) – The secret key to use when creating the client. Same semantics as aws_access_key_id above.

create_presigned_url(bucket_name, aws_file_name, expiration=3600)

Generate a presigned URL to share an S3 object

  • Parameters

    • bucket_name (string) – Name of the bucket

    • aws_file_name (string) – Name of file in object storage

    • expiration (integer) – Time in seconds for the presigned URL to remain valid

  • Returns

Presigned URL as a string.

delete_object(aws_file_name, bucket_name)

Delete object in bucket

  • Parameters

    • aws_file_name (string) – Name of file in object storage

    • bucket_name (string) – Name of the bucket

list_objects_in_bucket(bucket_name)

Get a list of all objects in the bucket.

upload_file(file_path, baket_name, aws_file_name)

Upload a file to object storage

  • Parameters

    • file_path (string) – Path to input file

    • baket_name (string) – Name of the bucket

    • aws_file_name (string) – Name of file in object storage

class speechkit.RecognizeLongAudio(api_key)

Bases: object

Long audio fragment recognition can be used for multi-channel audio files up to 1 GB.

To recognize long audio fragments, you need to execute 2 requests:

* Send a file for recognition.

* Get recognition results.

```python
>>> recognizeLongAudio = RecognizeLongAudio('<Api-Key>')
>>> recognizeLongAudio.send_for_recognition('<object storage uri>')
>>> if recognizeLongAudio.get_recognition_results():
...     data = recognizeLongAudio.get_data()
...
>>> recognizeLongAudio.get_raw_text()
'raw recognized text'
```

Initialize Api-Key for recognizing long audio

  • Parameters

    api_key (string) – The API key is a private key used for simplified authorization in the Yandex.Cloud API.

get_data()

Get the response. Use RecognizeLongAudio.get_recognition_results() first to store answer_data.

Contains a list of recognition results (chunks[]).

  • Returns

    Each result in the chunks[] list contains the following fields:

    • alternatives[]: List of recognized text alternatives. Each alternative contains the following fields:

      • words[]: List of recognized words:

        • startTime: Time stamp of the beginning of the word in the recording. An error of 1-2 seconds is possible.

        • endTime: Time stamp of the end of the word. An error of 1-2 seconds is possible.

        • word: Recognized word. Recognized numbers are written in words (for example, twelve rather than 12).

        • confidence: This field currently isn’t supported. Don’t use it.

      • text: Full recognized text. By default, numbers are written in figures. To output the entire text in words, specify true in the raw_results field.

      • confidence: This field currently isn’t supported. Don’t use it.

    • channelTag: Audio channel that recognition was performed for.
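The chunks[] structure above is plain nested dicts and lists, so pulling the recognized text out is straightforward. A minimal sketch, assuming answer_data has the shape described above (the sample dict below is invented for illustration):

```python
def join_recognized_text(answer_data: dict) -> str:
    """Concatenate the top alternative's text from every chunk."""
    parts = []
    for chunk in answer_data.get('chunks', []):
        alternatives = chunk.get('alternatives', [])
        if alternatives:
            parts.append(alternatives[0].get('text', ''))
    return ' '.join(p for p in parts if p)

# Hypothetical response fragment, following the field layout above:
sample = {'chunks': [
    {'channelTag': '1', 'alternatives': [{'text': 'hello'}]},
    {'channelTag': '1', 'alternatives': [{'text': 'world'}]},
]}
```

Note that taking alternatives[0] keeps only the top-ranked alternative per chunk; the full list is available if you need to compare candidates.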

get_raw_text()

Get raw text from the stored answer_data.

  • Returns

    Text

get_recognition_results()

Monitor the recognition results using the received ID. The number of result monitoring requests is limited, so consider the recognition speed: it takes about 10 seconds to recognize 1 minute of single-channel audio.

send_for_recognition(uri, **kwargs)

Send a file for recognition

  • Parameters

    • uri (string) – The URI of the audio file for recognition. Supports only links to files stored in Yandex Object Storage.

    • languageCode (string) – The language that recognition will be performed for. Only Russian is currently supported (ru-RU).

    • model (string) – The language model to be used for recognition. Default value: general.

    • profanityFilter (boolean) – The profanity filter.

    • audioEncoding (string) – The format of the submitted audio. Acceptable values:

      • LINEAR16_PCM: LPCM with no WAV header.

      • OGG_OPUS (default): OggOpus format.

    • sampleRateHertz (integer) – The sampling frequency of the submitted audio. Required if format is set to LINEAR16_PCM. Acceptable values: * 48000 (default): Sampling rate of 48 kHz. * 16000: Sampling rate of 16 kHz. * 8000: Sampling rate of 8 kHz.

    • audioChannelCount (integer) – The number of channels in LPCM files. By default, 1. Don’t use this field for OggOpus files.

    • rawResults (boolean) – Flag that indicates how to write numbers. true: In words. false (default): In figures.
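The keyword arguments above end up in the recognition request body. As a sketch of how they fit together (the wire format shown is an assumption based on the SpeechKit long-running recognition docs, not taken from this library's source; build_recognition_body is a hypothetical helper):

```python
def build_recognition_body(uri, **kwargs):
    """Assemble a long-audio recognition request body from the
    documented parameters; unknown keys are rejected early."""
    allowed = {'languageCode', 'model', 'profanityFilter', 'audioEncoding',
               'sampleRateHertz', 'audioChannelCount', 'rawResults'}
    unknown = set(kwargs) - allowed
    if unknown:
        raise ValueError('unsupported parameters: %s' % sorted(unknown))
    # The specification carries the recognition options; the audio
    # section points at the file in Yandex Object Storage.
    return {'config': {'specification': kwargs}, 'audio': {'uri': uri}}
```

Rejecting unknown keys up front gives a clear local error instead of an opaque API response.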

class speechkit.RecognizeShortAudio(yandex_passport_oauth_token)

Bases: object

Short audio recognition ensures fast response time and is suitable for single-channel audio of small length.

Audio requirements:

* Maximum file size: 1 MB.

* Maximum length: 30 seconds.

* Maximum number of audio channels: 1.

Gets an IAM token and stores it in RecognizeShortAudio.token.

  • Parameters

    yandex_passport_oauth_token (string) – OAuth token from Yandex.OAuth

recognize(data, **kwargs)

Recognize text from the given audio data.

  • Parameters

    • data (io.BytesIO) – Data with audio samples to recognize

    • lang (string) – The language to use for recognition. Acceptable values: * ru-RU (by default) — Russian. * en-US — English. * tr-TR — Turkish.

    • topic (string) – The language model to be used for recognition. Default value: general.

    • profanityFilter (boolean) – This parameter controls the profanity filter in recognized speech.

    • format (string) – The format of the submitted audio. Acceptable values: * lpcm — LPCM with no WAV header. * oggopus (default) — OggOpus.

    • sampleRateHertz (string) – The sampling frequency of the submitted audio. Used if format is set to lpcm. Acceptable values: * 48000 (default) — Sampling rate of 48 kHz. * 16000 — Sampling rate of 16 kHz. * 8000 — Sampling rate of 8 kHz.

    • folderId (string) – ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don’t specify this field if you make a request on behalf of a service account.

  • Returns

    The recognized text, string
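Exceeding the short-audio limits (1 MB, 30 seconds, one channel) is a common source of failed requests, so it can help to check locally before calling recognize(). A minimal sketch for WAV input, using only the standard library (check_short_audio is a hypothetical helper, not part of this package):

```python
import io
import wave

MAX_BYTES = 1 * 1024 * 1024   # 1 MB
MAX_SECONDS = 30
MAX_CHANNELS = 1

def check_short_audio(data: bytes) -> list:
    """Return a list of problems that would make WAV `data` unsuitable
    for short-audio recognition; an empty list means it looks OK."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append('file larger than 1 MB')
    with wave.open(io.BytesIO(data), 'rb') as w:
        if w.getnchannels() > MAX_CHANNELS:
            problems.append('more than one audio channel')
        duration = w.getnframes() / w.getframerate()
        if duration > MAX_SECONDS:
            problems.append('longer than 30 seconds')
    return problems
```

For oggopus input the size check still applies, but you would need a different decoder to read duration and channel count.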

exception speechkit.RequestError(answer: dict)

Bases: Exception

Exception raised for errors during a Yandex API request.

class speechkit.SynthesizeAudio(yandex_passport_oauth_token)

Bases: object

Generates speech from received text.

  • Parameters

    yandex_passport_oauth_token (string) – OAuth token from Yandex.OAuth

synthesize(file_path, **kwargs)

Generates speech from received text and saves it to file

  • Parameters

    • file_path (string) – The path to the file where the data will be stored

    • text (string) – UTF-8 encoded text to be converted to speech. You can use only one of the text and ssml fields. For homographs, place a + before the stressed vowel. For example, contr+ol or def+ect. To indicate a pause between words, use -. Maximum string length: 5000 characters.

    • ssml (string) – Text in SSML format to be converted into speech. You can use only one of the text and ssml fields.

    • lang (string) – Language. Acceptable values: * ru-RU (default) — Russian. * en-US — English. * tr-TR — Turkish.

    • voice (string) – Preferred speech synthesis voice from the list. Default value: oksana.

    • speed (string) – Rate (speed) of synthesized speech. The rate of speech is set as a decimal number in the range from 0.1 to 3.0. Where: * 3.0 — Fastest rate. * 1.0 (default) — Average human speech rate. * 0.1 — Slowest speech rate.

    • format (string) – The format of the synthesized audio. Acceptable values:

      • lpcm — Audio is synthesized in LPCM format with no WAV header. Audio properties:

        • Sampling — 8, 16, or 48 kHz, depending on the value of the sampleRateHertz parameter.

        • Bit depth — 16-bit.

        • Byte order — Reversed (little-endian).

        • Audio data is stored as signed integers.

      • oggopus (default) — Data in the audio file is encoded using the OPUS audio codec and compressed using the OGG container format (OggOpus).

    • sampleRateHertz (string) – The sampling frequency of the synthesized audio. Used if format is set to lpcm. Acceptable values: * 48000 (default) — Sampling rate of 48 kHz. * 16000 — Sampling rate of 16 kHz. * 8000 — Sampling rate of 8 kHz.

    • folderId (string) – ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don’t specify this field if you make a request on behalf of a service account.
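Because lpcm output has no WAV header, most players will not open it directly. Wrapping it yourself is a few lines with the standard library, using the audio properties listed above (16-bit little-endian signed samples, mono assumed; lpcm_to_wav is a hypothetical helper, not part of this package):

```python
import wave

def lpcm_to_wav(lpcm_path, wav_path, sample_rate=48000):
    """Wrap headerless LPCM synthesis output in a WAV container."""
    with open(lpcm_path, 'rb') as f:
        pcm = f.read()
    with wave.open(wav_path, 'wb') as w:
        w.setnchannels(1)           # synthesis output assumed mono
        w.setsampwidth(2)           # 16-bit samples
        w.setframerate(sample_rate) # must match sampleRateHertz used
        w.writeframes(pcm)
```

The sample_rate argument must match the sampleRateHertz value you passed to synthesize, or playback will be pitch-shifted.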

synthesize_stream(**kwargs)

Generates speech from received text and returns an io.BytesIO object with the data.

  • Parameters

    • text (string) – UTF-8 encoded text to be converted to speech. You can use only one of the text and ssml fields. For homographs, place a + before the stressed vowel. For example, contr+ol or def+ect. To indicate a pause between words, use -. Maximum string length: 5000 characters.

    • ssml (string) – Text in SSML format to be converted into speech. You can use only one of the text and ssml fields.

    • lang (string) – Language. Acceptable values: * ru-RU (default) — Russian. * en-US — English. * tr-TR — Turkish.

    • voice (string) – Preferred speech synthesis voice from the list. Default value: oksana.

    • speed (string) – Rate (speed) of synthesized speech. The rate of speech is set as a decimal number in the range from 0.1 to 3.0. Where: * 3.0 — Fastest rate. * 1.0 (default) — Average human speech rate. * 0.1 — Slowest speech rate.

    • format (string) – The format of the synthesized audio. Acceptable values:

      • lpcm — Audio is synthesized in LPCM format with no WAV header. Audio properties:

        • Sampling — 8, 16, or 48 kHz, depending on the value of the sampleRateHertz parameter.

        • Bit depth — 16-bit.

        • Byte order — Reversed (little-endian).

        • Audio data is stored as signed integers.

      • oggopus (default) — Data in the audio file is encoded using the OPUS audio codec and compressed using the OGG container format (OggOpus).

    • sampleRateHertz (string) – The sampling frequency of the synthesized audio. Used if format is set to lpcm. Acceptable values: * 48000 (default): Sampling rate of 48 kHz. * 16000: Sampling rate of 16 kHz. * 8000: Sampling rate of 8 kHz.

    • folderId (string) – ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don’t specify this field if you make a request on behalf of a service account.

Download files


Source Distribution

speechkit-1.3.4.tar.gz (11.0 kB)

Uploaded Source

Built Distribution

speechkit-1.3.4-py3-none-any.whl (10.7 kB)

Uploaded Python 3
