youtube-clipper

Tool for finding clips in YouTube videos. Supports searching in multiple videos, e.g. in playlists or channels.

YouTube Clipper (YTC) downloads subtitles (not the videos themselves) using yt-dlp, parses them with the bundled parsers and converters, and searches them using whoosh.

Installation

From PyPI

  1. Create and activate a virtualenv: python3 -m venv .venv && source .venv/bin/activate.
  2. Install youtube-clipper using pip: pip install youtube-clipper.

From source

  1. Clone the repository: git clone git@github.com:MCPN/youtube-clipper.git.
  2. Install poetry: pip install poetry.
  3. Install youtube-clipper using poetry: cd youtube-clipper && poetry install.
  4. To install test dependencies, run poetry install --with tests.

Usage

Basic usage: youtube-clipper --url 'https://www.youtube.com/watch?v=dQw4w9WgXcQ' --query 'were no strangers to love' --allow-autogenerated. The URL is quoted because some shells treat the ? character as a glob pattern.

There are two groups of additional arguments:

  1. yt-dlp arguments are passed almost directly to the yt-dlp client and control how subtitles are downloaded.
  2. Searcher arguments are used after the download and control the search process and postprocessing.

For a full list of settings, run youtube-clipper --help.

Downloading subtitles

By default, yt-dlp only looks for manual subtitles. The --allow-autogenerated option lets yt-dlp download autogenerated subtitles as well, but manual ones are always preferred when they exist.

Although YouTube only shows autogenerated subtitles in one language, a yt-dlp result often includes many translated versions, so a search can be performed in a completely different language. For example, youtube-clipper --url 'https://www.youtube.com/watch?v=dQw4w9WgXcQ' --query 'нам не чужда любовь' --allow-autogenerated --language ru --search-limit 1 surprisingly outputs the correct result.

Note, however, that autogenerated subtitles can be faulty, often have obscene language censored, and contain marks like [Music], so exact searches over autogenerated subtitles are not recommended.

Query language

The query argument supports the whoosh query language. The full description can be found in the official whoosh documentation; here are some key features:

  • By default, the searcher doesn't require all words to appear in the result; the more words match, the better the overall match. To search for an exact phrase, wrap the query in "double quotes" (be careful with bash escaping!).
  • In a phrase search, one can allow the words to be a certain distance apart: "like this"~5 (now like can be within 5 words of this).
  • In a non-exact search, one can raise or lower the importance of a certain word using the ^ operator. For example, in the query i^2 love cookies^0.5 the word i is twice as important as the word love, while the word cookies is half as important.
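One way to sidestep the bash escaping issue mentioned above is to build the argument list in Python and let shlex do the quoting. This is only a sketch: build_command is a hypothetical helper, not part of YTC, and it uses just the --url and --query options shown earlier.

```python
import shlex
from typing import List


def build_command(url: str, query: str) -> List[str]:
    """Return an argv list for youtube-clipper (hypothetical helper).

    Each argument stays a separate list item, so no shell escaping is
    needed when the list is passed to subprocess.run without shell=True.
    """
    return ["youtube-clipper", "--url", url, "--query", query]


# A phrase query with a distance of 5; the double quotes belong to the
# whoosh query itself and must reach the program untouched.
command = build_command(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    '"strangers love"~5',
)

# shlex.join produces a line that is safe to paste into a POSIX shell.
print(shlex.join(command))
```

Passing command directly to subprocess.run(command) avoids the shell altogether; the printed line is only useful when the command has to be copied into a terminal.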

Pairwise grouping and deduplication

The searcher treats every separate subtitle (i.e. the phrase that appears on the screen at a single moment) as its own document. A problem occurs when a phrase is split between two subtitles: the exact search will fail, while the regular one might be inaccurate.

To deal with it, one can enable pairwise grouping for the subtitles with the --enable-pairwise-group option. When enabled, every successive overlapping subtitle pair will be merged into one subtitle, and a search will be performed on new subtitles. For example, given three subtitles with phrases a, b and c respectively, the pairwise grouping will generate two new subtitles with phrases a b and b c.
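The grouping described above can be sketched in a few lines. This is an illustration rather than YTC's actual code: the Subtitle dataclass and pairwise_group function are hypothetical names, and the sketch assumes a merged subtitle spans from the start of the first subtitle to the end of the second.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Subtitle:
    """Hypothetical stand-in for a parsed subtitle."""
    start: float  # seconds from the beginning of the video
    end: float
    text: str


def pairwise_group(subtitles: List[Subtitle]) -> List[Subtitle]:
    """Merge every successive pair of subtitles into one subtitle."""
    if len(subtitles) < 2:
        # Assumption: with nothing to pair, keep the input as-is.
        return list(subtitles)
    return [
        Subtitle(start=first.start, end=second.end,
                 text=f"{first.text} {second.text}")
        for first, second in zip(subtitles, subtitles[1:])
    ]


subs = [Subtitle(0, 1, "a"), Subtitle(1, 2, "b"), Subtitle(2, 3, "c")]
grouped = pairwise_group(subs)
print([s.text for s in grouped])  # ['a b', 'b c']
```

Because each merged subtitle keeps the start time of its first half, a match still points at or before the moment the phrase begins.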

Note that autogenerated subtitles already have overlaps in them, so the pairwise grouping might be excessive.

The downside of pairwise grouping is that the output might contain duplicates, because a search query can match the part shared by two new subtitles (in the previous example, a search for b will match twice). The --deduplication-mode option removes these duplicates:

  • When set to --keep-first, if two consecutive subtitles are in the output, only the first one will be kept. This also applies to a chain of more than two consecutive subtitles
  • When set to --keep-last, the last subtitle in a consecutive chain will be kept
  • When set to --disable, the deduplication is skipped
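The three modes can be illustrated with a small sketch. It is not YTC's implementation: deduplicate is a hypothetical function, hits are represented simply as subtitle positions, the mode strings are spelled without the leading dashes, and search scores are ignored.

```python
from typing import Iterable, List


def deduplicate(hits: Iterable[int], mode: str = "keep-first") -> List[int]:
    """Collapse chains of consecutive subtitle positions (hypothetical).

    hits: positions of matched subtitles within the video.
    mode: "keep-first", "keep-last" or "disable".
    """
    hits = list(hits)
    if mode == "disable":
        return hits  # deduplication skipped, input returned unchanged
    if mode not in ("keep-first", "keep-last"):
        raise ValueError(f"unknown mode: {mode}")

    kept: List[int] = []
    previous = None
    for position in sorted(set(hits)):
        if previous is None or position != previous + 1:
            kept.append(position)      # a new chain starts here
        elif mode == "keep-last":
            kept[-1] = position        # slide to the latest chain member
        previous = position
    return kept


print(deduplicate([3, 4, 5, 9], "keep-first"))  # [3, 9]
print(deduplicate([3, 4, 5, 9], "keep-last"))   # [5, 9]
```

In this sketch a chain of any length collapses to a single position, which matches the behaviour described for chains of more than two consecutive subtitles.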

By default, the pairwise grouping is disabled and deduplication with the keep first mode is enabled. When searching in manual subtitles, pairwise grouping is highly recommended, and as for the deduplication, the keep first mode is preferred because the keep last mode can result in a timestamp that points after the phrase.

Development

Parsers and converters

There is an easy API for adding new subtitle formats to YTC. To do so, one should implement a parser interface at youtube_clipper.parsers.model:SubtitleParser and add the implementation to the registry at youtube_clipper.parsers.model:PARSERS_REGISTRY.

Alternatively, one can implement a converter that turns a currently unsupported format into a file in a format that already has a parser. This is done similarly to parsers: one should implement the converter interface at youtube_clipper.converters.model:SubtitlesConverter and add the implementation to the registry at youtube_clipper.converters.model:CONVERTERS_REGISTRY. Tests ensure that every converter has a corresponding parser.
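The general shape of the "implement an interface, then register it" pattern looks roughly like the sketch below. It deliberately does not reproduce the real SubtitleParser interface or PARSERS_REGISTRY: ParserSketch, PARSERS_SKETCH, TsvParser and the parse method are all stand-ins invented for illustration, so check youtube_clipper.parsers.model for the actual method names and signatures.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Type

# Hypothetical cue shape: (start offset in seconds, text).
Cue = Tuple[float, str]


class ParserSketch(ABC):
    """Stand-in for a parser interface; not the real SubtitleParser."""

    @abstractmethod
    def parse(self, content: str) -> List[Cue]:
        ...


# Stand-in registry mapping a file extension to a parser class.
PARSERS_SKETCH: Dict[str, Type[ParserSketch]] = {}


class TsvParser(ParserSketch):
    """Toy format: one 'seconds<TAB>text' cue per line."""

    def parse(self, content: str) -> List[Cue]:
        cues: List[Cue] = []
        for line in content.splitlines():
            if not line.strip():
                continue  # skip blank lines
            start, text = line.split("\t", 1)
            cues.append((float(start), text))
        return cues


# Registering the implementation makes it discoverable by extension.
PARSERS_SKETCH["tsv"] = TsvParser


def parse_subtitles(extension: str, content: str) -> List[Cue]:
    """Look up the parser for an extension and run it."""
    return PARSERS_SKETCH[extension]().parse(content)
```

A converter registry would follow the same idea, with each converter declaring the format it outputs so that a matching parser can be looked up.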

Tests

Testing is done using pytest: pytest tests.

Type checking is done using mypy: mypy youtube_clipper tests. However, it's not very effective, as most of the dependencies don't have type stubs.
