A WhisperX-powered tool for automatically transcribing speech in historical Swedish newsreels

SweScribe

SweScribe has been developed and tested for the use case of transcribing speech from Swedish historical newsreels, making their content searchable, indexable, and easier to study within the broader framework of media history research. It is essentially a wrapper around WhisperX, which in turn combines Whisper and Wav2Vec2 to create transcriptions with timestamps.

The primary files used for this project are publicly available on Filmarkivet.se, a web resource containing curated parts of Swedish film archives.

Installation and usage

This repository is built around WhisperX, which is built and tested on Python 3.10.

Installation

  1. Download this repository and install the dependencies with:
git clone https://github.com/Modern36/swescribe.git
cd swescribe
python -m pip install .[whisperx]

or

python -m pip install swescribe
  2. Install ffmpeg:
    • MacOS (with Homebrew): brew install ffmpeg
    • Debian: sudo apt-get install ffmpeg

Usage

Once installed, swescribe can be run from the command line:

swescribe -i {input file/directory} -o {output file/directory}

Run the help command to get a list of available options:

swescribe --help

Contributing

We welcome contributions from the community! Please read our Contributing Guidelines to understand how to contribute to this project. We follow a code review process, so all pull requests will be reviewed before merging.

License

This project is licensed under the CC-BY-NC 4.0 license; see the LICENSE file for details.

Project structure

File structure for test data

/data
├── /wav_input   (Audio extracted from videos)
│   └── file_x.wav
│   └── file_y.wav
│   └── file_z.wav
├── /txt_output
│   └── file_x.txt
│   └── file_y.txt
│   └── file_z.txt
├── /srt_output
│   └── file_x.srt
│   └── file_y.srt
│   └── file_z.srt
├── /ground_truth
│   └── file_x.txt
│   └── file_y.txt
│   └── file_z.txt

The pipeline

  flowchart LR
  Videos@{ shape: docs}
  .srts@{ shape: procs, label:.srt}
  Filter{{Clean text from known errors}}

  Videos --> SweScribe --> .srts --> QA

  subgraph SweScribe
    Video -.-> audio.wav -.->  WhisperX -.-> raw.srt -.-> Filter

    Video -->  WhisperX --> Filter

    subgraph TemporaryFiles
      audio.wav
      raw.srt
    end
  end


  subgraph QA[Automated Quality Control]
    .srt --convert--> .txt --calculate--> WER
    GT.txt --calculate--> WER
  end

The intended final product of this repo is an audio transcription library suited to Swedish 1930s newsreels, shown as SweScribe in the diagram. It will process video files to produce transcribed subtitles in the .srt format.

The parts

In practice, this consists of several steps (following the solid lines in the flowchart):

  1. Extract the video's audio

  2. Pass the audio through WhisperX to get a transcription

  3. Clean the transcription of identified systematic errors

  4. Produce the cleaned .srt file
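
As a rough illustration of steps 1 and 3, the sketch below builds an ffmpeg command for audio extraction and applies a table of known errors to a transcription. The function names, the mono 16 kHz audio settings, and the shape of the error table are assumptions made for this example; they are not taken from the SweScribe codebase.

```python
from pathlib import Path


def build_ffmpeg_command(video: Path, wav: Path) -> list[str]:
    """Return an ffmpeg command that extracts the audio track of a video.

    The mono / 16 kHz settings are an assumption for this sketch.
    """
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", str(video),
        "-vn",           # drop the video stream
        "-ac", "1",      # mix down to a single channel
        "-ar", "16000",  # resample to 16 kHz
        str(wav),
    ]


def clean_text(text: str, known_errors: dict[str, str]) -> str:
    """Replace each known erroneous string with its correction.

    `known_errors` is a hypothetical mapping from wrong to corrected text.
    """
    for wrong, right in known_errors.items():
        text = text.replace(wrong, right)
    return text
```

The command list can be passed to `subprocess.run(command, check=True)` once ffmpeg is installed.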

Need for interim parts

Some of these steps are computationally intensive, and we do not want to rerun the whole pipeline every time we introduce a change. This is in tension with our need to run tests after each change to understand how it affects the quality of the pipeline. During development, we therefore need to save the output from each step so that we can rerun as few parts of the pipeline as needed, following the dashed lines in the flowchart.
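
One common way to reuse interim output is to skip a step whenever its output file already exists. The helper below is a minimal sketch of that idea; `run_step` and its parameters are hypothetical names and do not describe how SweScribe itself stores intermediate files.

```python
from pathlib import Path
from typing import Callable


def run_step(
    output: Path,
    produce: Callable[[Path], None],
    force: bool = False,
) -> bool:
    """Run `produce(output)` only if `output` is missing or `force` is set.

    Returns True when the step ran and False when the saved file was reused.
    """
    if output.exists() and not force:
        return False
    output.parent.mkdir(parents=True, exist_ok=True)
    produce(output)
    return True
```

Passing `force=True` reruns a single step after a change to that step, while earlier steps keep their saved output.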

Continuous testing

Quality control is carried out primarily through the word error rate (WER), calculated between each .srt file and its corresponding ground truth file. 27 of the ground truth files are manually transcribed. The rest were initially transcribed with WhisperX and then manually corrected.
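
For reference, WER is the number of word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of words in the reference. The function below is a self-contained sketch of that definition using word-level edit distance; it is not the implementation used in this repository, which may rely on a dedicated library and apply its own text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.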

When converting .srt files to .txt files, we also have a chance to remove segments (using timestamps) that are nearly impossible to understand, contain no speech, or contain speech in a different language. The quality control needs to be carried out before any change is merged into the codebase, and it is always run automatically on pull requests. The WER results for each file and the descriptive statistics are stored alongside the code so that we can quickly and easily see how the performance changes with each run (and with some effort we can later recreate a history of this metric).
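
The conversion with timestamp-based removal could look roughly like the sketch below. It assumes standard .srt blocks (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, text lines) and a hypothetical list of excluded time ranges in seconds; how SweScribe actually records the segments to drop is not shown here.

```python
import re
from collections.abc import Sequence

TIMESTAMP = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def _seconds(hours: str, minutes: str, seconds: str, millis: str) -> float:
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def srt_to_text(srt: str, excluded: Sequence[tuple[float, float]] = ()) -> str:
    """Return the subtitle text, one segment per line.

    Segments overlapping any (start, end) range in `excluded` are dropped.
    """
    kept = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        for position, line in enumerate(lines):
            match = TIMESTAMP.search(line)
            if match:
                break
        else:
            continue  # no timestamp line, so this is not a subtitle block
        groups = match.groups()
        start, end = _seconds(*groups[:4]), _seconds(*groups[4:])
        if any(start < high and end > low for low, high in excluded):
            continue
        kept.append(" ".join(part.strip() for part in lines[position + 1:]))
    return "\n".join(kept)
```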

xzy as a replacement token

Words or sequences that were imperceptible during manual correction of the automatically generated Ground Truth files were replaced by the token xzy. These replacement tokens are filtered out in the WER step of the pipeline; they primarily exist to enhance searchability.
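
A minimal sketch of such a filter is shown below, assuming the token appears as a standalone word, possibly with adjacent punctuation. The function name and the punctuation handling are assumptions for illustration only.

```python
import string

REPLACEMENT_TOKEN = "xzy"


def strip_replacement_tokens(text: str, token: str = REPLACEMENT_TOKEN) -> str:
    """Remove standalone occurrences of the replacement token from a text."""
    return " ".join(
        word
        for word in text.split()
        # Compare without surrounding punctuation so that "xzy," is also dropped.
        if word.strip(string.punctuation).lower() != token
    )
```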

Running the tests manually

The test suite can also be run manually with pytest:

python -m pytest

Research Context and Licensing

Modern Times 1936

SweScribe was developed for the Modern Times 1936 research project at Lund University, Sweden. The project investigates what software "sees," "hears," and "perceives" when pattern recognition technologies such as 'AI' are applied to media historical sources. The project is funded by Riksbankens Jubileumsfond.

License

SweScribe is licensed under the CC-BY-NC 4.0 International license.

References

@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}
@inproceedings{malmsten2022hearing,
  title={Hearing voices at the national library : a speech corpus and acoustic model for the Swedish language},
  author={Malmsten, Martin and Haffenden, Chris and B{\"o}rjeson, Love},
  booktitle={Proceeding of Fonetik 2022 : Speech, Music and Hearing Quarterly Progress and Status Report, TMH-QPSR},
  volume={3},
  year={2022}
}
