Transcriptions from the Swedish newsreel archive Journal Digital

Journal Digital Corpus

The Journal Digital Corpus is a curated, timestamped transcription corpus derived from Swedish historical newsreels. It combines speech-to-text transcriptions and intertitle OCR to enable scalable and searchable analysis of early-to-mid 20th-century audiovisual media.

The SF Veckorevy newsreels, screened weekly across Sweden for over five decades, form one of the most extensive audiovisual records of 20th-century Swedish life. Yet their research potential has remained largely untapped due to barriers to access and analysis. The Journal Digital Corpus offers the first comprehensive transcription of both speech and intertitles from this material.

This corpus is the result of two purpose-built libraries:

  • SweScribe – an ASR pipeline developed for transcription of speech in historical Swedish newsreels.
  • stum – an OCR tool for detecting and transcribing intertitles in silent film footage.
The corpus consists of 2,225,334 words transcribed from 204 hours of speech across 2,544 videos and 302,312 words from 49,107 intertitles from 4,327 videos.

The primary files used for this project are publicly available on Filmarkivet.se, a web resource containing curated parts of Swedish film archives.

Installation

Install from PyPI:

python -m pip install journal_digital

Alternatively, clone the repository, change into the directory, and install an editable copy:

python -m pip install -e .

Usage

Reading the corpus in text mode

from journal_digital import Corpus

# Iterate over speech transcriptions as plain text
corpus = Corpus(mode="txt")
for file, text in corpus:
    print(f"{file.stem}: {text[:100]}...")
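The text-mode loop above lends itself to simple corpus-level statistics. The following is a minimal sketch, assuming Corpus yields (path, text) pairs as shown; word_counts is a hypothetical helper, not part of the package:

```python
from collections import Counter

def word_counts(pairs):
    """Count whitespace-separated tokens per document.

    `pairs` is any iterable of (name, text) tuples, such as the
    (file, text) pairs yielded by Corpus(mode="txt") above.
    """
    counts = Counter()
    for name, text in pairs:
        counts[str(name)] = len(text.split())
    return counts

# Plugging in the corpus would look like:
#   counts = word_counts((f.stem, t) for f, t in Corpus(mode="txt"))
counts = word_counts([("video_1", "en svensk journalfilm"), ("video_2", "veckans bilder")])
print(counts)  # Counter({'video_1': 3, 'video_2': 2})
```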

Reading the corpus in SRT mode

from journal_digital import Corpus

# Iterate over speech transcriptions as timestamped segments
corpus = Corpus(mode="srt")
for file, segments in corpus:
    for segment in segments:
        print(f"[{segment.start} --> {segment.end}] {segment.text}")

Each segment is a SubtitleSegment namedtuple with fields:

  • idx: Segment index (starts at 1)
  • start: Start timestamp (format: HH:MM:SS,mmm)
  • end: End timestamp (format: HH:MM:SS,mmm)
  • text: Transcribed text
  • num_words: Word count (optional, None by default)
  • duration_seconds: Segment duration in seconds (optional, None by default)
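For illustration, the segment type described above can be approximated with a plain namedtuple and the two optional fields filled in by hand. This is a stdlib sketch, not the shipped class; to_seconds is a hypothetical helper:

```python
from collections import namedtuple

# Stdlib sketch of the segment type listed above; the shipped
# SubtitleSegment lives inside journal_digital and may differ in detail.
SubtitleSegment = namedtuple(
    "SubtitleSegment",
    ["idx", "start", "end", "text", "num_words", "duration_seconds"],
    defaults=(None, None),  # num_words and duration_seconds default to None
)

def to_seconds(ts):
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to float seconds."""
    hms, millis = ts.split(",")
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s + int(millis) / 1000

seg = SubtitleSegment(1, "00:00:01,500", "00:00:04,000", "Stockholms stadshus invigs.")
enriched = seg._replace(
    num_words=len(seg.text.split()),
    duration_seconds=to_seconds(seg.end) - to_seconds(seg.start),
)
print(enriched.num_words, enriched.duration_seconds)  # 3 2.5
```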

2025-06-04

Created with SweScribe==v0.1.0 and stum==v0.2.0 on 2025-06-04 without manual editing.

Files

/src/journal_digital/corpus
├── /intertitle
│   ├── /collection_1
│   ├── /collection_2
│   └── /collection_3
│       ├── /1920
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       ├── /1921
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       └── /1922
│           ├── video_1.srt
│           ├── video_2.srt
│           └── video_3.srt
├── /speech
│   ├── /collection_1
│   ├── /collection_2
│   └── /collection_3
│       ├── /1920
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       ├── /1921
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       └── /1922
│           ├── video_1.srt
│           ├── video_2.srt
│           └── video_3.srt
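Given the layout above, subsets of the corpus can be located directly with pathlib. This is a sketch assuming the tree shown; list_srt is a hypothetical helper and the root path is an example:

```python
from pathlib import Path

def list_srt(root, medium="speech", collection=None, year=None):
    """Yield sorted SRT paths beneath corpus/<medium>/<collection>/<year>/.

    `root` points at the corpus directory; the layout follows the tree
    shown above. Wildcards stand in for any level left unspecified.
    """
    pattern = "/".join([medium, collection or "*", str(year or "*"), "*.srt"])
    return sorted(Path(root).glob(pattern))

# e.g. all intertitle transcriptions from 1921:
# for path in list_srt("src/journal_digital/corpus", medium="intertitle", year=1921):
#     print(path)
```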

Development Setup

python -m pip install -e '.[dev]'
pre-commit install
pre-commit install --hook-type pre-push

Set the path to your videos as JOURNAL_DIGITALROOT in .env.
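The variable can be read without extra dependencies. Below is a minimal sketch of a .env reader, assuming simple KEY=VALUE lines; the python-dotenv package does this more robustly:

```python
import os

def load_env(path=".env"):
    """Minimal .env reader: copy KEY=VALUE lines into os.environ.

    A sketch only; comments and blank lines are skipped, and existing
    environment variables are not overwritten.
    """
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env present; rely on the ambient environment

load_env()
root = os.environ.get("JOURNAL_DIGITALROOT")
```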

Contributing Manual Corrections

The corpus uses standard git workflows to preserve manual edits when transcription pipelines are updated.

Making Corrections

Edit SRT files and commit to git:

# Fix typos, character encoding, or timing in any SRT file
vim src/journal_digital/corpus/speech/sf/1935/SF855B.1.mpg.srt

# Commit your changes
git add src/journal_digital/corpus/speech/sf/1935/SF855B.1.mpg.srt
git commit -m "Fix: Change 'C4' to 'Sefyr' (character encoding)"

Updating Transcription Pipelines

When the underlying transcription tools (SweScribe/stum) improve:

# 1. Run the pipeline manually
python -m journal_digital.transcribe

# 2. Commit and tag the pipeline output
git add src/journal_digital/corpus/
git commit -m "Run transcription pipeline (swescribe 2.1.0)"
git tag -a pipeline-2025-12-08 -m "Pipeline run with swescribe 2.1.0"

# 3. Cherry-pick manual edits made since the previous pipeline tag
#    (HEAD~1 excludes the fresh pipeline commit; --reverse applies
#    commits in chronological order)
COMMITS=$(git rev-list pipeline-2025-12-06..HEAD~1 --reverse --format='%H %D' |
  grep -v '^commit' |
  grep -v 'tag: pipeline-' |
  cut -d' ' -f1)

echo $COMMITS | xargs git cherry-pick

Git's conflict resolution will show you exactly where manual edits conflict with pipeline changes.

Research Context and Licensing

Modern Times 1936

The Journal Digital Corpus was developed for the Modern Times 1936 research project at Lund University, Sweden. The project investigates what software "sees," "hears," and "perceives" when pattern recognition technologies such as 'AI' are applied to media historical sources. The project is funded by Riksbankens Jubileumsfond.

License

The Journal Digital Corpus is licensed under the CC BY-NC 4.0 International license.

How to Cite

If you use this corpus in your research, please cite both the data paper and the repository:

@article{aspenskog2025journal,
  title={Journal Digital Corpus: Swedish Newsreel Transcriptions},
  author={Aspenskog, Robert and Johansson, Mathias and Snickars, Pelle},
  journal={Journal of Open Humanities Data},
  volume={11},
  number={1},
  pages={44},
  year={2025},
  publisher={Ubiquity Press},
  doi={10.5334/johd.344},
  url={https://doi.org/10.5334/johd.344}
}
@software{johansson2025corpus,
  author={Johansson, Mathias and Aspenskog, Robert},
  title={Modern36/journal\_digital\_corpus},
  year={2025},
  publisher={Zenodo},
  version={2025.10.13},
  doi={10.5281/zenodo.15596191},
  url={https://doi.org/10.5281/zenodo.15596191}
}

