Transcriptions from the Swedish newsreel archive Journal Digital

Journal Digital Corpus

The Journal Digital Corpus is a curated, timestamped transcription corpus derived from Swedish historical newsreels. It combines speech-to-text transcriptions and intertitle OCR to enable scalable and searchable analysis of early-to-mid 20th-century audiovisual media.

The SF Veckorevy newsreels, screened weekly across Sweden for over five decades, form one of the most extensive audiovisual records of 20th-century Swedish life. Yet their research potential has remained largely untapped due to barriers to access and analysis. The Journal Digital Corpus offers the first comprehensive transcription of both speech and intertitles from this material.

This corpus is the result of two purpose-built libraries:

  • SweScribe – an ASR pipeline developed for transcription of speech in historical Swedish newsreels.
  • stum – an OCR tool for detecting and transcribing intertitles in silent film footage.

The corpus consists of 2,229,854 words transcribed from 204 hours of speech across 2,553 videos, and 302,342 words from 49,119 intertitles across 4,333 videos.
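
For a rough sense of scale, the totals above imply the averages computed below. This is plain arithmetic on the stated figures, not an additional measurement. The hour count is presumably rounded, so the per-minute rate is approximate, and the variable names are illustrative only.

```python
# Totals as stated in the corpus description.
speech_words = 2_229_854
speech_hours = 204
speech_videos = 2_553
intertitle_words = 302_342
intertitles = 49_119
intertitle_videos = 4_333

# Derived averages (approximate, since the inputs are aggregate totals).
words_per_minute = speech_words / (speech_hours * 60)
words_per_speech_video = speech_words / speech_videos
words_per_intertitle = intertitle_words / intertitles
intertitles_per_video = intertitles / intertitle_videos

print(f"{words_per_minute:.0f} spoken words per minute of speech")
print(f"{words_per_speech_video:.0f} spoken words per video with speech")
print(f"{words_per_intertitle:.1f} words per intertitle")
print(f"{intertitles_per_video:.1f} intertitles per video with intertitles")
```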

The primary files used for this project are publicly available on Filmarkivet.se, a web resource containing curated parts of Swedish film archives.

Installation

Clone the repository with git, cd into the directory, and run:

python -m pip install -e .

Alternatively, install the released package from PyPI:

python -m pip install journal_digital

2025-06-04

Created with SweScribe==v0.1.0 and stum==v.0.2.0 on 2025-06-04 without manual editing.

Files

/corpus
├── /intertitle
│   ├── /collection_1
│   ├── /collection_2
│   └── /collection_3
│       ├── /1920
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       ├── /1921
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       └── /1922
│           ├── video_1.srt
│           ├── video_2.srt
│           └── video_3.srt
├── /speech
│   ├── /collection_1
│   ├── /collection_2
│   └── /collection_3
│       ├── /1920
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       ├── /1921
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       └── /1922
│           ├── video_1.srt
│           ├── video_2.srt
│           └── video_3.srt
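
The layout above can be traversed with the standard library alone. The following is a minimal sketch, not part of the package: `parse_srt`, `iter_corpus`, and `Cue` are hypothetical helper names, and it assumes every collection follows the `/kind/collection/year/video.srt` pattern shown in the tree and that the files are conventional SRT (index line, timing line, text lines, blank-line separators).

```python
import re
from pathlib import Path
from typing import Iterator, NamedTuple


class Cue(NamedTuple):
    index: int
    start: str  # timestamp as written in the file, e.g. "00:00:01,000"
    end: str
    text: str


# Timing line of an SRT cue, e.g. "00:00:01,000 --> 00:00:04,500".
TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[,.]\d{3})"
)


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; malformed blocks are skipped."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue
        match = TIMING.match(lines[1].strip())
        if match is None:
            continue
        text = "\n".join(line.strip() for line in lines[2:])
        cues.append(Cue(int(lines[0]), match.group(1), match.group(2), text))
    return cues


def iter_corpus(root: Path) -> Iterator[tuple[str, str, str, Path]]:
    """Yield (kind, collection, year, path) for each .srt file under root."""
    for path in sorted(root.glob("*/*/*/*.srt")):
        kind, collection, year = path.relative_to(root).parts[:3]
        yield kind, collection, year, path
```

With a local copy of the corpus, usage might look like `for kind, collection, year, path in iter_corpus(Path("corpus")): cues = parse_srt(path.read_text(encoding="utf-8-sig"))`, where `kind` is either `intertitle` or `speech`. The `utf-8-sig` codec is chosen here only because it tolerates a leading byte order mark; the actual file encoding is an assumption.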

Development Setup

python -m pip install '.[dev]'
pre-commit install

Set JOURNAL_DIGITALROOT in .env to the path of your video files.
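
As an illustration, a `.env` file could contain a single line such as `JOURNAL_DIGITALROOT=/path/to/videos` (a placeholder path). How the project itself loads this file is not stated here, so the sketch below is only one possible way to read the value using the standard library; `read_env_file` and `video_root` are hypothetical helpers, not functions from the package.

```python
import os
from pathlib import Path


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are ignored."""
    values = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def video_root(env_path: Path = Path(".env")) -> Path:
    """Return the video directory, preferring the process environment over .env."""
    value = os.environ.get("JOURNAL_DIGITALROOT")
    if value is None and env_path.is_file():
        value = read_env_file(env_path).get("JOURNAL_DIGITALROOT")
    if not value:
        raise RuntimeError("JOURNAL_DIGITALROOT is not set")
    return Path(value)
```

Checking the process environment first lets a shell export override the file, which is a common convention rather than a documented behavior of this project.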

Research Context and Licensing

Modern Times 1936

The Journal Digital Corpus was developed for the Modern Times 1936 research project at Lund University, Sweden. The project investigates what software "sees," "hears," and "perceives" when pattern recognition technologies such as 'AI' are applied to media historical sources. The project is funded by Riksbankens Jubileumsfond.

License

The Journal Digital Corpus is licensed under the CC-BY-NC 4.0 International license.

References

@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}
@inproceedings{malmsten2022hearing,
  title={Hearing voices at the national library : a speech corpus and acoustic model for the Swedish language},
  author={Malmsten, Martin and Haffenden, Chris and B{\"o}rjeson, Love},
  booktitle={Proceeding of Fonetik 2022 : Speech, Music and Hearing Quarterly Progress and Status Report, TMH-QPSR},
  volume={3},
  year={2022}
}
@inproceedings{zhou2017east,
  title={East: an efficient and accurate scene text detector},
  author={Zhou, Xinyu and Yao, Cong and Wen, He and Wang, Yuzhi and Zhou, Shuchang and He, Weiran and Liang, Jiajun},
  booktitle={Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
  pages={5551--5560},
  year={2017}
}
