
Tool for detecting and extracting text from intertitles in Swedish newsreels.

Project description


stum

stum is a tool for detecting and extracting text from intertitles in silent films. It was developed and tested specifically for Swedish newsreels.

Installation

  1. Install stum either:

    • pipx install stum
    • python -m pip install stum
    • python -m pip install -e '.[dev]' (for development)
  2. Install ffmpeg:

    • MacOS (with Homebrew): brew install ffmpeg
    • Debian: sudo apt-get install ffmpeg
  3. Install tesseract:

    • MacOS (with Homebrew): brew install tesseract
    • Debian: sudo apt-get install tesseract-ocr
  4. Install the Swedish OCR model for Tesseract:

    • Download the Swedish model
    • Place it in the appropriate tessdata folder:
      • MacOS (with Homebrew): /opt/homebrew/share/tessdata
      • Linux: /usr/share/tesseract-ocr/4.00/tessdata
  5. Download the EAST model for text detection:

    • EAST model
    • Place it in the ./models folder of this repository.
  6. For development, also run:

    • pre-commit install (to set up pre-commit hooks)

CLI Usage

stum -i input.mpg
stum -i input_dir_with_mp4s/

Input:

  • Video file.
  • Directory of video files.

Output:

  • One .srt file per input video.

Flags:

  -i, --input INPUT  Input video (.mpg) or directory of videos (.mpg)
  -s, --skip         Skip files that already have an .srt file.
  -d, --debug        Activate debug mode: saves the intertitle frames for each video.
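Each .srt cue pairs a frame range with its OCR text. A minimal sketch of how such a cue can be built from frame numbers (the 25 fps default is an illustrative assumption, and frame_to_srt_time / srt_entry are hypothetical helpers, not part of stum's API):

```python
def frame_to_srt_time(frame: int, fps: float = 25.0) -> str:
    """Convert a frame index to an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(frame / fps * 1000)
    ms = total_ms % 1000
    s = total_ms // 1000
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d},{ms:03d}"

def srt_entry(index: int, start_frame: int, end_frame: int,
              text: str, fps: float = 25.0) -> str:
    """Format one numbered SRT cue from a frame range and its OCR text."""
    return (
        f"{index}\n"
        f"{frame_to_srt_time(start_frame, fps)} --> {frame_to_srt_time(end_frame, fps)}\n"
        f"{text}\n"
    )

print(srt_entry(1, 0, 75, "MR G.\nhar invigt en ny tennisbana på Särö"))
```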

Test data

The test data (the frames and videos used to test the library) lives in a separate repository, included as a submodule. That repository is only needed for testing, not for using the library. Since some of the test data is of a somewhat sensitive nature, the repository is private.

Tesseract accuracy

The following frame is extracted from one of our sample videos, manually selected in an early experiment to ascertain the accuracy of raw Tesseract.

The English model yields:

C= PULMUNDESTRIS VECKOREVY 1038

MR G.
har invigt en ny tennisbana pa
Sard — ett led i 70-arsjubileets
hugfastande.

This is already quite good. The Swedish model yields:

(%nm- FILMINDUSTRIS VECKOREVY 1028

MR G.
har invigt en ny tennisbana på
Särö — ett led i 70-årsiubileets
hugfästande.

This is notably better, but it fails to recognize the 'j' in 'årsjubileets'.
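The gap between the two models can be put into numbers with a simple character-level similarity ratio from the standard library's difflib. The ground-truth string below is a hand-made transcription assumed for illustration; the OCR strings are the two outputs quoted above:

```python
import difflib

# Hand-corrected ground truth for the intertitle body (an assumption,
# transcribed from the frame discussed above).
ground_truth = "har invigt en ny tennisbana på Särö — ett led i 70-årsjubileets hugfästande."

outputs = {
    "eng": "har invigt en ny tennisbana pa Sard — ett led i 70-arsjubileets hugfastande.",
    "swe": "har invigt en ny tennisbana på Särö — ett led i 70-årsiubileets hugfästande.",
}

# SequenceMatcher.ratio() returns a similarity in [0, 1].
scores = {
    name: difflib.SequenceMatcher(None, ground_truth, text).ratio()
    for name, text in outputs.items()
}
for name, ratio in scores.items():
    print(f"{name}: {ratio:.3f}")
```

The Swedish model's single 'j'/'i' confusion scores far closer to 1.0 than the English model's several missed diacritics.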

How many frames? All the frames

Normally one would not need to extract all frames from a video. However, many of the digitized Swedish newsreels only contain a single, or very few, frames of every intertitle as shown by the following three sequential frames:

It also showcases another problem that needs to be corrected for: some intertitles are mirrored vertically.
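A mirrored intertitle can be handled by generating flipped variants of the frame and letting a later step keep whichever orientation OCRs best. A minimal sketch of the flipping with NumPy (the confidence-based selection is left out; candidate_orientations is a hypothetical helper, not part of stum's API):

```python
import numpy as np

def candidate_orientations(frame: np.ndarray) -> list[np.ndarray]:
    """Return the frame as-is plus its mirrored variants.

    A downstream step would OCR each candidate and keep the one
    with the highest recognition confidence.
    """
    return [
        frame,             # original orientation
        np.fliplr(frame),  # mirrored left-right
        np.flipud(frame),  # mirrored top-bottom
    ]

frame = np.arange(6).reshape(2, 3)
for cand in candidate_orientations(frame):
    print(cand)
```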

Workflow

By first grouping the frames into sequences/shots, the computational time is significantly reduced, since the OCR step is by far the most time-consuming. The currently implemented approach works according to the following diagram:

flowchart LR
  B@{ shape: procs, label: Frames}
  C@{shape: decision, label: filter}
  D@{label: Delete, shape: text}
  E@{shape: procs, label: keep}
  F1@{shape: procs, label: "...000.png\n...001.png\n...002.png\n...\n...010.png"}
  F3@{shape: procs, label: "...600.png\n...601.png\n...602.png\n...\n...610.png"}
  F4@{shape: procs, label: "...980.png\n...981.png\n...982.png\n...\n...980.png"}
  out@{shape: procs, label: "OCR w/ timestamp"}

  A[Video] --> B --> C
  C -->|intertitle| E
  C -->|not intertitle| D

  E --> F1
  E --> F3
  E --> F4


  subgraph Intertitles
    F1 --> O1[OCR w/ timestamp]
    F3 --> O3[OCR w/ timestamp]
    F4 --> O4[OCR w/ timestamp]
  end

  O1 --> out
  O3 --> out
  O4 --> out

  out --> srt[".srt"]

  1. Extract all the frames.
  2. Group frames into subdirectories based on image similarity (MSE).
  3. Only keep the sequences that pass the contour + OCR filters.
  4. Merge consecutive sequences that have very similar OCR output.
  5. Use the frame numbers and the OCR .txt output to generate an .srt string.
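Step 2 above can be sketched as follows: frames are compared with mean squared error, and a new group starts whenever the difference to the previous frame exceeds a threshold (the threshold value here is an illustrative assumption, not the one used by stum):

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two equally sized grayscale frames."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def group_frames(frames: list[np.ndarray], threshold: float = 500.0) -> list[list[int]]:
    """Group consecutive frame indices into shots.

    A new group starts whenever the MSE to the previous frame
    exceeds the threshold.
    """
    groups: list[list[int]] = []
    for i, frame in enumerate(frames):
        if i == 0 or mse(frames[i - 1], frame) > threshold:
            groups.append([i])        # shot boundary: start a new group
        else:
            groups[-1].append(i)      # similar frame: same shot
    return groups

# Three dark frames followed by two bright ones -> two groups.
dark = [np.zeros((8, 8), dtype=np.uint8)] * 3
bright = [np.full((8, 8), 255, dtype=np.uint8)] * 2
print(group_frames(dark + bright))  # → [[0, 1, 2], [3, 4]]
```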

Research Context and Licensing

Modern Times 1936

stum was developed for the Modern Times 1936 research project at Lund University, Sweden. The project investigates what software "sees," "hears," and "perceives" when pattern recognition technologies such as 'AI' are applied to media historical sources. The project is funded by Riksbankens Jubileumsfond.

License

stum is licensed under the CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International) license.

References

@inproceedings{zhou2017east,
  title={{EAST}: An Efficient and Accurate Scene Text Detector},
  author={Zhou, Xinyu and Yao, Cong and Wen, He and Wang, Yuzhi and Zhou, Shuchang and He, Weiran and Liang, Jiajun},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={5551--5560},
  year={2017}
}

Download files


Source Distribution

stum-0.2.0.tar.gz (15.5 kB)

Uploaded Source

Built Distribution


stum-0.2.0-py3-none-any.whl (17.0 kB)

Uploaded Python 3

File details

Details for the file stum-0.2.0.tar.gz.

File metadata

  • Download URL: stum-0.2.0.tar.gz
  • Upload date:
  • Size: 15.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for stum-0.2.0.tar.gz
Algorithm Hash digest
SHA256 9a3c6dc140d387c4df3c006c34fcab417e3e580c715a6b3c5d96b6d4a567f26c
MD5 3f3ee80ee78a10c1337cdcc16598b378
BLAKE2b-256 3292856b93c2e76b684738f8a839c38b6354aa73a819fd34ce1f3d868a253ee4


File details

Details for the file stum-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: stum-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 17.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for stum-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 09d974624641041b3a53d9cfd6400000846ee6fa52341426e4fe2f059afa76c3
MD5 7f0eaab0ec775977e3a6cc64896f9b53
BLAKE2b-256 1bfa47aae3d366d1dee82462b77310f573b3563d7eed06acc328edff3b625877

