
framework for synchronous batch speech-to-text transcription using backends like AWS, Watson, etc.


py-transcribe

Implementation-agnostic framework for synchronous batch speech-to-text transcription with backend services such as AWS, Watson, etc.

This module does NOT include a full implementation of, or an integration with, any transcription service. Instead, you are expected to include a specific implementation in your project. For example, for AWS Transcribe, use py-transcribe-aws (https://github.com/ICTLearningSciences/py-transcribe-aws).

Python Installation

pip install py-transcribe

Usage

You first need to install a concrete implementation of py-transcribe. If you are using AWS, you can install py-transcribe-aws like this:

pip install py-transcribe-aws

Once the implementation is installed, you can configure it in one of two ways:

Setting the implementation module path

Set the environment variable TRANSCRIBE_MODULE_PATH, e.g.

export TRANSCRIBE_MODULE_PATH=transcribe_aws

or pass the module path at service-creation time, e.g.

from transcribe import init_transcription_service


service = init_transcription_service(
    module_path="transcribe_aws"
)
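To make the two configuration routes concrete, here is a minimal sketch of how a module path could be resolved and loaded. It is not py-transcribe's actual code: the function names resolve_module_path and load_implementation are hypothetical, and the rule that an explicit argument takes precedence over the environment variable is an assumption.

```python
import importlib
import os
from types import ModuleType
from typing import Optional


def resolve_module_path(module_path: Optional[str] = None) -> str:
    """Return the explicit path if given, else fall back to the env var.

    Assumption: an explicit argument wins over TRANSCRIBE_MODULE_PATH.
    """
    path = module_path or os.environ.get("TRANSCRIBE_MODULE_PATH", "")
    if not path:
        raise ValueError(
            "no module path: pass module_path or set TRANSCRIBE_MODULE_PATH"
        )
    return path


def load_implementation(module_path: Optional[str] = None) -> ModuleType:
    """Import the module that the resolved path names (hypothetical helper)."""
    return importlib.import_module(resolve_module_path(module_path))
```

With a helper like this, a misconfigured environment fails early with a clear message rather than at the first transcription call.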

Basic usage

Once you're set up, basic usage looks like this:

from transcribe import (
    init_transcription_service,
    TranscribeJobRequest,
    TranscribeJobStatus
)


# Uses the implementation named by TRANSCRIBE_MODULE_PATH
service = init_transcription_service()
requests = [
    TranscribeJobRequest(
        jobId="j1",
        sourceFile="/some/path/j1.wav"
    ),
    TranscribeJobRequest(
        jobId="j2",
        sourceFile="/some/other/path/j2.wav"
    )
]
# Blocks until the whole batch has been processed
result = service.transcribe(requests)
for j in result.jobs():
    if j.status == TranscribeJobStatus.SUCCEEDED:
        print(j.transcript)
    else:
        print(j.error)
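For larger batches you will probably want to build the request list from files on disk rather than by hand. The sketch below shows one way to do that. To stay runnable without the library it uses a stand-in dataclass, JobRequest, carrying the same jobId and sourceFile fields as TranscribeJobRequest above. The helper name requests_from_dir and the choice of the file stem as the job id are assumptions for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class JobRequest:
    """Stand-in for TranscribeJobRequest (same two fields as used above)."""
    jobId: str
    sourceFile: str


def requests_from_dir(audio_dir: str, pattern: str = "*.wav") -> List[JobRequest]:
    """Create one request per matching file, using the file stem as the job id."""
    files = sorted(Path(audio_dir).glob(pattern))
    return [JobRequest(jobId=f.stem, sourceFile=str(f)) for f in files]
```

In real use you would construct TranscribeJobRequest objects instead of the stand-in and pass the resulting list to service.transcribe.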

Handling updates on large/long-running batch jobs

The main transcribe method is synchronous to hide the async/polling-based complexity of most transcription services. For any non-trivial batch of transcriptions, though, you will probably want to receive periodic updates, for example to save transcriptions as they complete. You can do that by passing an on_update callback as follows:

from transcribe import (
    init_transcription_service,
    TranscribeJobRequest,
    TranscribeJobStatus,
    TranscribeJobsUpdate
)


service = init_transcription_service()
requests = [
    TranscribeJobRequest(
        jobId="j1",
        sourceFile="/some/path/j1.wav"
    ),
    TranscribeJobRequest(
        jobId="j2",
        sourceFile="/some/other/path/j2.wav"
    )
]


# Called periodically while the batch is in progress
def _on_update(u: TranscribeJobsUpdate) -> None:
    for j in u.jobs():
        if j.status == TranscribeJobStatus.SUCCEEDED:
            print(f"save result: {j.transcript}")
        else:
            print(j.error)


result = service.transcribe(
    requests,
    on_update=_on_update
)
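To illustrate what "synchronous with updates" means, here is a toy sketch of a blocking call that polls an asynchronous backend and invokes a callback after each polling round. It does not reflect how py-transcribe or any implementation is actually written. Job, FakeBackend, transcribe_sync, and the plain-string statuses are all hypothetical stand-ins.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    """Hypothetical job record; status is a plain string in this sketch."""
    jobId: str
    status: str = "QUEUED"
    transcript: str = ""


class FakeBackend:
    """Toy async backend: each job finishes after a fixed number of polls."""

    def __init__(self, polls_needed: Dict[str, int]) -> None:
        self._remaining = dict(polls_needed)

    def poll(self, job: Job) -> None:
        self._remaining[job.jobId] -= 1
        if self._remaining[job.jobId] <= 0:
            job.status = "SUCCEEDED"
            job.transcript = f"transcript for {job.jobId}"


def transcribe_sync(
    backend: FakeBackend,
    jobs: List[Job],
    on_update: Optional[Callable[[List[Job]], None]] = None,
    poll_interval: float = 0.0,
) -> List[Job]:
    """Block until all jobs succeed, reporting progress after each round."""
    pending = list(jobs)
    while pending:
        for job in pending:
            backend.poll(job)
        if on_update is not None:
            on_update(list(jobs))
        pending = [j for j in pending if j.status != "SUCCEEDED"]
        if pending:
            time.sleep(poll_interval)
    return list(jobs)


# Usage: save each completed transcript once, even though the callback
# sees the full job list on every update.
saved: Dict[str, str] = {}


def save_completed(jobs: List[Job]) -> None:
    for j in jobs:
        if j.status == "SUCCEEDED" and j.jobId not in saved:
            saved[j.jobId] = j.transcript


done = transcribe_sync(
    FakeBackend({"j1": 1, "j2": 3}),
    [Job("j1"), Job("j2")],
    on_update=save_completed,
)
```

Tracking already-saved job ids in the callback matters because an update may report jobs that were already complete in an earlier update.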

Configuring the environment for your implementation

Most implementations also require other configuration, which you can either set in your environment or pass to init_transcription_service as config={}. See your implementation's docs for details.

Development

Run tests during development with

make test-all

Once ready to release, create a release tag. Tags currently use semver-ish numbering, e.g. 1.0.0(-alpha.1).
