CleanIt

Subtitles extremely clean.

CleanIt is a command-line tool that helps you keep your subtitles clean. You can define your own rules to detect entries to remove or patterns to replace, using either simple text matching or regular expressions. It comes with standard rules out of the box:

  • ocr: Fix common OCR errors
  • tidy: Fix common formatting issues (e.g., extra or missing spaces after punctuation)
  • no-sdh: Remove SDH descriptions
  • no-lyrics: Remove lyrics
  • no-spam: Remove ads and spam
  • no-style: Remove font style tags like <i> and <b>
  • minimal: Includes only the ocr and tidy rules
  • default: Includes all rules except no-style
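
The relationship between the two grouping tags and the individual ones can be pictured with plain sets. This is only an illustration of the list above, not how CleanIt resolves tags internally; `STANDARD_TAGS`, `GROUPS`, and `expand_tags` are hypothetical names introduced for this sketch.

```python
# Illustration only: models the tag list above with plain Python sets.
# None of these names come from CleanIt itself.
STANDARD_TAGS = {"ocr", "tidy", "no-sdh", "no-lyrics", "no-spam", "no-style"}

GROUPS = {
    "minimal": {"ocr", "tidy"},               # only ocr and tidy
    "default": STANDARD_TAGS - {"no-style"},  # everything except no-style
}


def expand_tags(requested):
    """Expand grouping tags into the individual tags they cover."""
    expanded = set()
    for tag in requested:
        # A grouping tag expands to its members; any other tag stands for itself.
        expanded |= GROUPS.get(tag, {tag})
    return expanded


print(sorted(expand_tags(["default"])))
print(sorted(expand_tags(["minimal", "no-style"])))
```

Under this reading, asking for `default` and then adding `no-style` would cover every standard rule.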

Usage

CLI

Clean subtitles:

$ cat mysubtitle.srt
1
00:00:46,464 --> 00:00:48,549
-And then what?
-| don't know.

2
00:49:07,278 --> 00:49:09,363
- If you cross the sea
with an army you bought ...


$ cleanit -t default mysubtitle.srt
1 subtitle collected / 0 subtitle filtered out / 0 path ignored
1 subtitle saved / 0 subtitle unchanged

$ cat mysubtitle.srt
1
00:00:46,464 --> 00:00:48,549
- And then what?
- I don't know.

2
00:49:07,278 --> 00:49:09,363
If you cross the sea
with an army you bought...


$ cleanit -t ocr -t no-sdh -t tidy -l en -l pt-BR ~/subtitles/
423 subtitles collected / 107 subtitles filtered out / 0 path ignored
Cleaning subtitles  [####################################]  100%
268 subtitles saved / 155 subtitles unchanged
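
When the command is launched from a script, the repeated `-t` and `-l` options can be assembled programmatically. The sketch below only builds the argument list from the flags shown above; `build_cleanit_command` is a hypothetical helper, not part of CleanIt, and it assumes the `cleanit` executable is on the `PATH`.

```python
import os


def build_cleanit_command(path, tags=(), languages=()):
    """Build an argument list for the cleanit CLI (hypothetical helper).

    Only the -t (tag) and -l (language) options shown in the examples
    above are used.
    """
    cmd = ["cleanit"]
    for tag in tags:
        cmd += ["-t", tag]
    for lang in languages:
        cmd += ["-l", lang]
    # Expand "~" here because no shell does it when the list is passed
    # directly to subprocess.run().
    cmd.append(os.path.expanduser(path))
    return cmd


cmd = build_cleanit_command(
    "~/subtitles/",
    tags=["ocr", "no-sdh", "tidy"],
    languages=["en", "pt-BR"],
)
print(cmd)
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.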

Using docker:

$ docker run -it --rm -v /medias:/medias -u $(id -u username):$(id -g username) ratoaq2/cleanit -t default /medias
1072 subtitles collected / 0 subtitle filtered out / 0 path ignored
Cleaning subtitles  [####################################]  100%
980 subtitle saved / 92 subtitles unchanged

API

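from cleanit import Config, Subtitle

# Load the subtitle file and the rule configuration
sub = Subtitle('/subtitle/path/subtitle.en.srt')
cfg = Config.from_path('/config/path')
# Select only the rules tagged 'ocr'
rules = cfg.select_rules(tags={'ocr'})
# Save only when clean() reports that something changed
if sub.clean(rules):
    sub.save()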
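
The same clean-then-save pattern extends to several files. The following is a minimal sketch that relies only on the calls shown above (`clean(rules)` and `save()`); `clean_many` and its `open_subtitle` parameter are hypothetical names, and the subtitle class is passed in so the loop itself does not depend on how CleanIt is installed.

```python
def clean_many(paths, rules, open_subtitle):
    """Clean every path and return the ones that were changed and saved.

    open_subtitle is any callable that turns a path string into an object
    with clean(rules) and save() methods, such as cleanit's Subtitle class.
    """
    changed = []
    for path in paths:
        sub = open_subtitle(str(path))
        # Mirror the single-file example: save only if clean() reports a change.
        if sub.clean(rules):
            sub.save()
            changed.append(str(path))
    return changed
```

With CleanIt installed, a call might look like `clean_many(Path('/subtitle/path').glob('*.srt'), rules, Subtitle)`, where `rules` comes from `cfg.select_rules(...)` as in the example above.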

YAML Configuration file

templates:
  - &ocr
    tags:
      - ocr
      - minimal
      - default
    priority: 10000
    languages: en

rules:
  replace-l-to-I-character[ocr:en]:
    <<: *ocr
    patterns: '\bl\b'
    replacement: 'I'
    examples:
      ? |
        And if l refuse?
      : |
        And if I refuse?
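
To see what the pattern in this rule matches, it can be tried directly with Python's `re` module. This is a standalone sketch of the regular expression only; it does not use CleanIt, and `apply_rule` is a hypothetical name.

```python
import re

# Same pattern and replacement as the rule above: a lowercase "l"
# standing alone between word boundaries becomes an uppercase "I".
PATTERN = re.compile(r"\bl\b")
REPLACEMENT = "I"


def apply_rule(text):
    """Apply the replace-l-to-I substitution to a piece of text."""
    return PATTERN.sub(REPLACEMENT, text)


print(apply_rule("And if l refuse?"))  # And if I refuse?
print(apply_rule("l'm sure it will"))  # I'm sure it will
```

The word boundaries are what keep the "l" letters inside words such as "will" untouched.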
