Skip to main content

A sweet little collection of handy functions for video file downloading. 📼

Project description

text

(a Multi-processing Audiovisual CRAWLer collectiON)

supported versions   PyPi   Licence   Tweet


About

A package for crawling and downloading YouTube videos. As multiple datasets that are introduced only provide the ids of videos without a download script, obtaining the video files may be difficult. This project aims to provide a general solution is such cases by downloading either the video or audio from ids specified by a dataset. It also aims to speed up processing though enabling multiple threads to run in parallel. The video resolution is user set in order to speed-up downloading and to limit the on-disk dataset size.

Currently only video-only or audio-only files are downloaded (the next update/version will allow to also download videos with audio).


Package Requirements

This is the list of the required packages:

  • pandas
  • pafy
  • ffmpeg
  • youtube-dl
  • tqdm

They can all be downloaded with:

$ pip install pandas pafy tqdm

CSV Dataset file

The package assumes that the following headers are included in the .csv file that includes the YouTube ids:

youtube_id start end (or) duration

The name of the headers do not need to match exactly but the data needs to include the id, start time end time or duration.


Usage

The main function used to download files is called download() as is located at the youtube_audio_and_video_downloader.py. You can simply call it by first importing it:

from macrawlon import download
#or
from youtube_audio_and_video_downloader import download

download(
  csv_dir=my_csv_dir,
  download_dir=my_down_dir,
  modality='video',
  resolutions=my_res_list,
  id_idx = 0,
  start_idx = 1,
  end_idx = None,
  duration=10,
  workers=5
  )

The function takes the following arguments:

Argument About
csv_dir directory for the dataset .csv file.
download_dir directory for the location to download
modality video modality to download, can choose audio, video, audio-video for separate audio and video files or audio+video for video files with audio.
resolutions (optional) list of resolution qualities, with the first list elements being the preferred options.
id_idx (optional) The column index in the csv file that contains the youtube video ids. E.g. if 0 then the first column of the csv should have the youtube video ids.
start_idx (optional) The index for the starting location (in secs.) in the video.
end_idx (optional) The index for the ending location (in secs.) in the video.
duration (optional) The duration (in secs.) of the video. To be used if end_idx is not specified.
workers (optional) The number of sub-processes to run.

Installation through git

Please make sure, Git is installed in your machine:

$ sudo apt-get update
$ sudo apt-get install git
$ git clone https://github.com/macrawlon/macrawlon.git
$ cd macrawlon
$ pip install .

You can then use it as any other package installed through pip.


Installation through pip

The latest stable release is also available for download through pip

$ pip install macrawlon

Licence

MIT

Project details


Release history Release notifications | RSS feed

This version

0.1

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

macrawlon-0.1.tar.gz (7.0 kB view hashes)

Uploaded Source

Built Distribution

macrawlon-0.1-py3-none-any.whl (7.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page