CleanIt
=======

Subtitles extremely clean.

.. image:: https://img.shields.io/pypi/v/cleanit.svg
    :target: https://pypi.python.org/pypi/cleanit
    :alt: Latest Version

.. image:: https://travis-ci.org/ratoaq2/cleanit.svg?branch=master
    :target: https://travis-ci.org/ratoaq2/cleanit
    :alt: Travis CI build status

.. image:: https://img.shields.io/github/license/ratoaq2/cleanit.svg
    :target: https://github.com/ratoaq2/cleanit/blob/master/LICENSE
    :alt: License

:Project page: https://github.com/ratoaq2/cleanit

CleanIt is a command-line tool that helps you keep your subtitles clean. You can specify your own rules to detect entries to be removed or patterns to be replaced, using either simple text matching or regular expressions. It comes with standard rules out of the box:

  • ocr: Fix common OCR errors
  • tidy: Fix common formatting issues (e.g.: extra/missing spaces after punctuation)
  • no-sdh: Remove SDH descriptions
  • no-lyrics: Remove lyrics
  • no-spam: Remove ads and spam
  • no-style: Remove font style tags like ``<i>`` and ``<b>``
  • minimal: includes only ocr and tidy rules
  • default: includes all rules except no-style
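
How the two grouping tags relate to the individual rule tags can be sketched with plain Python sets. This is only an illustration of the list above, not CleanIt's implementation; the names ``RULE_TAGS``, ``MINIMAL`` and ``DEFAULT`` are hypothetical.

```python
# Illustrative only: models the tag groups from the list above with sets.
# These names are hypothetical and are not part of the cleanit package.
RULE_TAGS = {"ocr", "tidy", "no-sdh", "no-lyrics", "no-spam", "no-style"}

# minimal: only the ocr and tidy rules
MINIMAL = {"ocr", "tidy"}

# default: every rule except no-style
DEFAULT = RULE_TAGS - {"no-style"}

print(sorted(MINIMAL))
print(sorted(DEFAULT))
```

Under this reading, everything selected by ``minimal`` is also selected by ``default``.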

Usage
-----

CLI
^^^

Clean subtitles::

    $ cat mysubtitle.en.srt
    1
    00:00:46,464 --> 00:00:48,549
    -And then what?
    -| don't know.

    2
    00:49:07,278 --> 00:49:09,363
    - If you cross the sea
    with an army you bought ...

    $ cleanit -t default mysubtitle.en.srt
    1 subtitle collected / 0 subtitle filtered out / 0 path ignored
    1 subtitle saved / 0 subtitle unchanged

    $ cat mysubtitle.en.srt
    1
    00:00:46,464 --> 00:00:48,549
    - And then what?
    - I don't know.

    2
    00:49:07,278 --> 00:49:09,363
    If you cross the sea
    with an army you bought...

    $ cleanit -t ocr -t no-sdh -t tidy -l en -l pt-BR ~/subtitles/
    423 subtitles collected / 107 subtitles filtered out / 0 path ignored
    Cleaning subtitles  [####################################]  100%
    268 subtitles saved / 155 subtitles unchanged
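
When the command is driven from a script, the repeated ``-t`` and ``-l`` options can be assembled programmatically. The sketch below is a minimal example that uses only the two flags shown above; ``build_cleanit_command`` is a hypothetical helper, not part of the cleanit package, and it only builds the argument list without running anything.

```python
import shlex


def build_cleanit_command(path, tags=("default",), languages=()):
    """Build a cleanit argument list from the -t and -l flags shown above.

    Hypothetical helper for illustration; it does not run cleanit.
    """
    cmd = ["cleanit"]
    for tag in tags:
        cmd += ["-t", tag]      # one -t per tag
    for lang in languages:
        cmd += ["-l", lang]     # one -l per language
    cmd.append(path)
    return cmd


cmd = build_cleanit_command(
    "/medias",
    tags=["ocr", "no-sdh", "tidy"],
    languages=["en", "pt-BR"],
)
print(shlex.join(cmd))
# cleanit -t ocr -t no-sdh -t tidy -l en -l pt-BR /medias
```

If cleanit is installed, the resulting list can be passed to ``subprocess.run(cmd, check=True)``.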

Using docker::

    $ docker run -it --rm -v /medias:/medias -u $(id -u username):$(id -g username) ratoaq2/cleanit -t default /medias
    1072 subtitles collected / 0 subtitle filtered out / 0 path ignored
    Cleaning subtitles  [####################################]  100%
    980 subtitle saved / 92 subtitles unchanged

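API
^^^

.. code:: python

    from cleanit import Config, Subtitle

    # Load the subtitle file and the rule configuration
    sub = Subtitle('/subtitle/path/subtitle.en.srt')
    cfg = Config.from_path('/config/path')

    # Keep only the rules tagged 'ocr'
    rules = cfg.select_rules(tags={'ocr'})

    # Save only if clean() reports a change
    if sub.clean(rules):
        sub.save()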
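
The same clean-then-save pattern extends to several subtitles. The sketch below relies only on the ``clean(rules)`` and ``save()`` calls shown above and assumes that ``clean`` returns a truthy value when the subtitle changed. ``clean_all`` and ``FakeSubtitle`` are hypothetical names; the fake class is there only so the example runs without cleanit installed.

```python
def clean_all(subtitles, rules):
    """Clean each subtitle and save only those that changed.

    Assumes clean(rules) is truthy when the subtitle was modified.
    Returns a (saved, unchanged) pair of counts.
    """
    saved = unchanged = 0
    for sub in subtitles:
        if sub.clean(rules):
            sub.save()
            saved += 1
        else:
            unchanged += 1
    return saved, unchanged


class FakeSubtitle:
    """Stand-in for cleanit.Subtitle so the sketch runs without the package."""

    def __init__(self, changed):
        self.changed = changed
        self.saved = False

    def clean(self, rules):
        return self.changed

    def save(self):
        self.saved = True


print(clean_all([FakeSubtitle(True), FakeSubtitle(False)], rules=None))
# (1, 1)
```

With the real package, the list would hold ``Subtitle`` objects and ``rules`` would come from ``cfg.select_rules(...)`` as in the example above.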

YAML Configuration file
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: yaml

    templates:
      - &ocr
        tags:
          - ocr
          - minimal
          - default
        priority: 10000
        languages: en

    rules:
      replace-l-to-I-character[ocr:en]:
        <<: *ocr
        patterns: '\bl\b'
        replacement: 'I'
        examples:
          ? |
            And if l refuse?
          : |
            And if I refuse?
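
The effect of the rule's pattern can be checked with Python's standard ``re`` module. This is a minimal sketch of the regex substitution only, not of how CleanIt applies rules internally; ``apply_rule`` is a hypothetical helper.

```python
import re

# Pattern and replacement copied from the YAML rule above.
PATTERN = re.compile(r"\bl\b")
REPLACEMENT = "I"


def apply_rule(text):
    """Replace a standalone lowercase 'l' with an uppercase 'I'."""
    return PATTERN.sub(REPLACEMENT, text)


print(apply_rule("And if l refuse?"))  # And if I refuse?
print(apply_rule("All is well."))      # All is well.
```

The ``\b`` word boundaries restrict the match to an ``l`` standing alone, so words that merely contain the letter are left untouched.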

Changelog
---------

0.4.2
^^^^^
release date: 2021-03-20

  • Fixes default configuration loading

0.4.1
^^^^^
release date: 2021-03-16

  • Fixes missing default configuration files

0.4.0
^^^^^
release date: 2021-03-16

  • Major refactoring
  • Drop python 2 support
  • Added support for languages and tags
  • Added default rules

0.3.0
^^^^^
release date: 2021-03-02

  • Python 3.x support

0.2.1
^^^^^
release date: 2016-02-28

  • Adding guess encoding back without python-magic dependency.

0.2
^^^^^
release date: 2016-02-27

  • Removing chardet and python-magic dependencies. Either encoding is specified or it should be guessed by pysrt

0.1
^^^^^
release date: 2015-10-16

  • Initial release
