A WhisperX-powered tool for automatically transcribing speech in historical Swedish newsreels

SweScribe

SweScribe has been developed and tested for the use case of transcribing speech from Swedish historical newsreels, making their content searchable, indexable, and easier to study within the broader framework of media history research. It is essentially a wrapper around WhisperX, which in turn combines Whisper and Wav2Vec2 to create transcriptions with timestamps.

The primary files used for this project are publicly available on Filmarkivet.se, a web resource containing curated parts of Swedish film archives.

Installation and usage

SweScribe is built around WhisperX, which is built and tested on Python 3.10.

Installation

Clone this repository and install it together with its dependencies:

git clone https://github.com/Modern36/swescribe.git
cd swescribe
python -m pip install .[whisperx]

Alternatively, install the released package from PyPI:

python -m pip install swescribe

Install ffmpeg:
- macOS (with Homebrew): brew install ffmpeg
- Debian: sudo apt-get install ffmpeg

Usage

Once installed, swescribe can be run from the command line:

swescribe -i {input file/directory} -o {output file/directory}

Run the help command to get a list of available options:

swescribe --help
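
For batch jobs, the same command can be driven from Python. The following is a minimal sketch and not part of SweScribe itself: it assumes only that the `swescribe` executable is on the PATH and uses the `-i` and `-o` options shown above. The helper names `build_command` and `transcribe` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(input_path, output_path):
    """Build the argument list for the swescribe CLI (only -i and -o are used)."""
    return ["swescribe", "-i", str(Path(input_path)), "-o", str(Path(output_path))]


def transcribe(input_path, output_path):
    """Run swescribe as a subprocess; raises CalledProcessError on failure."""
    subprocess.run(build_command(input_path, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.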

Contributing

We welcome contributions from the community! Please read our Contributing Guidelines to understand how to contribute to this project. We follow a code review process, so all pull requests will be reviewed before merging.

License

This project is licensed under the CC-BY-NC 4.0 license; see the LICENSE file for details.

Project structure

File structure for test data

/data
├── /wav_input   (Audio extracted from videos)
│   ├── file_x.wav
│   ├── file_y.wav
│   └── file_z.wav
├── /txt_output
│   ├── file_x.txt
│   ├── file_y.txt
│   └── file_z.txt
├── /srt_output
│   ├── file_x.srt
│   ├── file_y.srt
│   └── file_z.srt
└── /ground_truth
    ├── file_x.txt
    ├── file_y.txt
    └── file_z.txt
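
Because the directories share file stems, transcripts and ground truth files can be matched by name. The sketch below illustrates this under the layout shown above; `pair_by_stem` is a hypothetical helper, not a function provided by SweScribe.

```python
from pathlib import Path


def pair_by_stem(data_dir):
    """Map each file stem to its (transcript, ground truth) paths.

    Only stems present in both txt_output and ground_truth are returned.
    """
    data_dir = Path(data_dir)
    outputs = {p.stem: p for p in (data_dir / "txt_output").glob("*.txt")}
    truths = {p.stem: p for p in (data_dir / "ground_truth").glob("*.txt")}
    shared = sorted(outputs.keys() & truths.keys())
    return {stem: (outputs[stem], truths[stem]) for stem in shared}
```

Files that lack a counterpart are silently skipped here; a stricter variant could raise an error instead.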

The pipeline

  flowchart LR
  Videos@{ shape: docs}
  .srts@{ shape: procs, label:.srt}
  Filter{{Clean text from known errors}}

  Videos --> SweScribe --> .srts --> QA

  subgraph SweScribe
    Video -.-> audio.wav -.->  WhisperX -.-> raw.srt -.-> Filter

    Video -->  WhisperX --> Filter

    subgraph TemporaryFiles
      audio.wav
      raw.srt
    end
  end


  subgraph QA[Automated Quality Control]
    .srt --convert--> .txt --calculate--> WER
    GT.txt --calculate--> WER
  end

The intended final product of this repo is an audio transcription library suitable for Swedish 1930s newsreels: SweScribe in the diagram. It will process video files to produce transcribed subtitles in the .srt format.

The parts

In practice, this consists of several steps (following the solid lines in the flowchart):

1. Extract the video's audio.
2. Pass the audio through WhisperX to get a transcription.
3. Clean the transcription of identified systematic errors.
4. Produce the cleaned .srt file.
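
The first and third steps can be sketched as follows. This is an illustration under stated assumptions, not SweScribe's actual code: the mono 16 kHz output is an assumption about what the transcription step expects, the entry in `KNOWN_ERRORS` is a hypothetical pattern, and the WhisperX call itself is left out.

```python
import re


def ffmpeg_extract_command(video_path, wav_path):
    """Return an ffmpeg command that extracts mono 16 kHz audio from a video."""
    return [
        "ffmpeg", "-y",       # overwrite the output file if it exists
        "-i", video_path,     # input video
        "-vn",                # drop the video stream
        "-ac", "1",           # downmix to one channel (assumption)
        "-ar", "16000",       # resample to 16 kHz (assumption)
        wav_path,
    ]


# Hypothetical example of a systematic error; the real list is not given here.
KNOWN_ERRORS = [
    re.compile(r"^\s*undertexter av .*$", re.IGNORECASE),
]


def clean_lines(lines):
    """Drop lines that match any known-error pattern and strip whitespace."""
    kept = []
    for line in lines:
        if any(pattern.search(line) for pattern in KNOWN_ERRORS):
            continue
        kept.append(line.strip())
    return kept
```

The command list can be passed to `subprocess.run(..., check=True)` once ffmpeg is installed.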

Need for interim parts

Some of these steps are computationally intensive, and we do not want to rerun the whole pipeline every time we introduce a change. This is in stark contrast to our need to run tests after each change to understand how it affects the quality of the pipeline. During development, we therefore need to save the output from each step so that we can rerun as few parts of the pipeline as possible, following the dashed lines down the flowchart.
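
One simple way to reuse interim output is to skip a step whenever its output file already exists. The sketch below shows the idea; `run_cached` is a hypothetical helper and does not detect stale files, so cached output must be deleted manually when an upstream step changes.

```python
from pathlib import Path


def run_cached(output_path, step, *args):
    """Run `step(*args)` only if `output_path` does not exist yet.

    `step` must return the text to store; otherwise the cached text is returned.
    """
    output_path = Path(output_path)
    if output_path.exists():
        return output_path.read_text(encoding="utf-8")
    result = step(*args)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(result, encoding="utf-8")
    return result
```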

Continuous testing

Quality control relies primarily on the word error rate (WER), calculated between each .srt file and its corresponding ground truth file. Of the ground truth files, 27 were transcribed manually. The rest were first transcribed with WhisperX and then manually corrected.
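
WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The source does not say which implementation SweScribe uses, so the following is only a minimal, dependency-free sketch of the metric.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the processed ref prefix and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many more words than the reference.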

When converting .srt files to .txt files, we can also use the timestamps to remove segments that are nearly impossible to understand, contain no speech, or contain speech in a different language. The quality control needs to be carried out before any change is merged into the codebase, and it runs automatically on every pull request. The WER results for each file and the descriptive statistics are stored alongside the code so that we can quickly see how performance changes with each run (and, with some effort, later recreate a history of this metric).
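
The conversion with timestamp-based filtering could look roughly like the sketch below. It is an assumption about the approach, not the project's code: `srt_to_text` and its `excluded` argument (a list of start and end times in seconds) are hypothetical.

```python
import re

TIMESTAMP = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the parts of an .srt timestamp to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def srt_to_text(srt, excluded=()):
    """Join subtitle text, skipping cues that overlap any (start, end) range."""
    kept = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = TIMESTAMP.search(line)
            if match:
                break
        else:
            continue  # no timestamp line, so this block is not a cue
        start = to_seconds(*match.groups()[:4])
        end = to_seconds(*match.groups()[4:])
        if any(start < ex_end and end > ex_start for ex_start, ex_end in excluded):
            continue
        kept.extend(text.strip() for text in lines[index + 1:] if text.strip())
    return " ".join(kept)
```

For example, `srt_to_text(srt, excluded=[(3.5, 6.5)])` drops every cue that overlaps the interval from 3.5 to 6.5 seconds.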

`xzy` as a replacement token

Words or sequences that were imperceptible during manual correction of the automatically generated Ground Truth files were replaced by the token xzy. These replacement tokens are filtered out in the WER step of the pipeline; they primarily exist to enhance searchability.
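
Filtering the token before the WER calculation can be as simple as the sketch below. The function name and the handling of attached punctuation are assumptions made for illustration.

```python
REPLACEMENT_TOKEN = "xzy"


def strip_replacement_tokens(text, token=REPLACEMENT_TOKEN):
    """Remove standalone replacement tokens, ignoring case and edge punctuation."""
    return " ".join(
        word for word in text.split()
        if word.strip(".,!?;:").lower() != token
    )
```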

Running the tests manually

The tests can also be run manually with pytest:

python -m pytest

Research Context and Licensing

Modern Times 1936

SweScribe was developed for the Modern Times 1936 research project at Lund University, Sweden. The project investigates what software "sees," "hears," and "perceives" when pattern recognition technologies such as 'AI' are applied to media historical sources. The project is funded by Riksbankens Jubileumsfond.

License

SweScribe is licensed under the CC-BY-NC 4.0 International license.

References

@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}

@inproceedings{malmsten2022hearing,
  title={Hearing voices at the national library : a speech corpus and acoustic model for the Swedish language},
  author={Malmsten, Martin and Haffenden, Chris and B{\"o}rjeson, Love},
  booktitle={Proceeding of Fonetik 2022 : Speech, Music and Hearing Quarterly Progress and Status Report, TMH-QPSR},
  volume={3},
  year={2022}
}


Current release: 0.1.0 (Jun 2, 2025)

