A sweet little collection of handy functions for video file downloading. 📼
Project description
(a Multi-processing Audiovisual CRAWLer collectiON)
About
A package for crawling and downloading YouTube videos. Many datasets provide only the ids of their videos, without a download script, which can make obtaining the video files difficult. This project aims to provide a general solution in such cases by downloading either the video or the audio for the ids specified by a dataset. It also aims to speed up processing by running multiple workers in parallel. The video resolution is set by the user, both to speed up downloading and to limit the on-disk dataset size.
Currently, only video-only or audio-only files are downloaded (the next update/version will also allow downloading videos with audio).
Package Requirements
This is the list of the required packages:
pandas
pafy
ffmpeg
youtube-dl
tqdm
The command below installs pandas, pafy and tqdm; ffmpeg and youtube-dl are not covered by it and must be installed separately.
$ pip install pandas pafy tqdm
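Before starting a long download run, it can help to confirm that the requirements are actually available. The following is a minimal sketch, assuming that ffmpeg refers to the command-line executable on the PATH and that youtube-dl is importable as youtube_dl; missing_requirements is a hypothetical helper and not part of macrawlon.

```python
import importlib.util
import shutil

# Python modules from the requirements list.
# Assumption: the youtube-dl package is imported under the name `youtube_dl`.
PYTHON_MODULES = ["pandas", "pafy", "youtube_dl", "tqdm"]


def missing_requirements(modules=PYTHON_MODULES, binaries=("ffmpeg",)):
    """Return the names of modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on the PATH.
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing


print("Missing requirements:", missing_requirements() or "none")
```

An empty result only means the names were found; it does not check versions.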
CSV Dataset file
The package assumes that the .csv file listing the YouTube ids includes the following headers:
youtube_id | start | end (or) duration |
---|---|---|
The header names do not need to match exactly, but the data needs to include the id, the start time, and either the end time or the duration.
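As an illustration of this layout, the sketch below writes a small dataset file and reads it back by column index. It uses only the standard-library csv module; the ids are placeholders rather than real videos, and write_dataset and read_columns are hypothetical helpers, not functions provided by macrawlon.

```python
import csv
import io

HEADER = ["youtube_id", "start", "end"]
# Placeholder ids (not real videos); start and end are in seconds.
ROWS = [["VIDEO_ID_01", 0, 10], ["VIDEO_ID_02", 35, 45]]


def write_dataset(stream, header=HEADER, rows=ROWS):
    """Write the header row followed by one row per clip."""
    writer = csv.writer(stream)
    writer.writerow(header)
    writer.writerows(rows)


def read_columns(stream, id_idx=0, start_idx=1, end_idx=2):
    """Read (id, start, end) tuples, selecting columns by index."""
    reader = csv.reader(stream)
    next(reader)  # skip the header row; its names do not need to match
    return [(r[id_idx], float(r[start_idx]), float(r[end_idx])) for r in reader]


buffer = io.StringIO()
write_dataset(buffer)
buffer.seek(0)
print(read_columns(buffer))
# [('VIDEO_ID_01', 0.0, 10.0), ('VIDEO_ID_02', 35.0, 45.0)]
```

Because columns are selected by position, the header text itself is free to differ from the names shown above.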
Usage
The main function used to download files is called download() and is located in youtube_audio_and_video_downloader.py. You can call it by first importing it:
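from macrawlon import download
# or
from youtube_audio_and_video_downloader import download

download(
    csv_dir=my_csv_dir,        # directory of the dataset .csv file
    download_dir=my_down_dir,  # where the downloaded files are saved
    modality='video',
    resolutions=my_res_list,   # preferred resolutions first
    id_idx=0,                  # column holding the YouTube ids
    start_idx=1,               # column holding the start time (secs.)
    end_idx=None,              # no end column, so duration is used
    duration=10,
    workers=5
)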
The function takes the following arguments:
Argument | About |
---|---|
csv_dir | Directory of the dataset .csv file. |
download_dir | Directory in which to save the downloaded files. |
modality | Modality to download: audio, video, audio-video for separate audio and video files, or audio+video for video files with audio. |
resolutions | (optional) List of resolution qualities, with the first list elements being the preferred options. |
id_idx | (optional) The column index in the csv file that contains the YouTube video ids. E.g. if 0, the first column of the csv should contain the YouTube video ids. |
start_idx | (optional) The column index of the start time (in secs.) within the video. |
end_idx | (optional) The column index of the end time (in secs.) within the video. |
duration | (optional) The duration (in secs.) of the video to download. To be used if end_idx is not specified. |
workers | (optional) The number of sub-processes to run. |
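The interaction between start_idx, end_idx and duration, and the "preferred first" ordering of resolutions, can be sketched as follows. This is an illustration under stated assumptions rather than the package's actual implementation: clip_window and pick_resolution are hypothetical helpers, the "360p"-style strings are an assumed format for resolution values, and treating a missing start_idx as a start of 0 seconds is also an assumption.

```python
def clip_window(row, start_idx=1, end_idx=None, duration=10):
    """Return (start, end) in seconds for one CSV row."""
    # Assumption: with no start column, the clip starts at 0 seconds.
    start = float(row[start_idx]) if start_idx is not None else 0.0
    if end_idx is not None:
        end = float(row[end_idx])          # explicit end column wins
    elif duration is not None:
        end = start + float(duration)      # fall back to a fixed duration
    else:
        raise ValueError("either end_idx or duration must be given")
    if end <= start:
        raise ValueError(f"end ({end}) must be greater than start ({start})")
    return start, end


def pick_resolution(preferred, available):
    """Return the first preferred resolution that is available, else None."""
    for res in preferred:
        if res in available:
            return res
    return None


print(clip_window(["VIDEO_ID_01", "5"], start_idx=1, end_idx=None, duration=10))
# (5.0, 15.0)
print(pick_resolution(["720p", "360p"], {"360p", "240p"}))
# 360p
```

Listing smaller resolutions first in the preference list is what keeps downloads fast and the on-disk dataset small.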
Installation through git
Please make sure Git is installed on your machine:
$ sudo apt-get update
$ sudo apt-get install git
$ git clone https://github.com/macrawlon/macrawlon.git
$ cd macrawlon
$ pip install .
You can then use it as any other package installed through pip.
Installation through pip
The latest stable release is also available through pip:
$ pip install macrawlon
Licence
MIT
File details
Details for the file macrawlon-0.1.tar.gz.
File metadata
- Download URL: macrawlon-0.1.tar.gz
- Upload date:
- Size: 7.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.7
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 2c9a2cea60ddeb5bed4a0cba4e288b904d0bcf1632ea792530dc22683c4d669f |
MD5 | 2587e8ca324a82ca62486b9ae73d819a |
BLAKE2b-256 | 0b9fdd645e4b72eafff98952c6da8cdfdc6b220afeb0ad8d3201cf2417099249 |
File details
Details for the file macrawlon-0.1-py3-none-any.whl.
File metadata
- Download URL: macrawlon-0.1-py3-none-any.whl
- Upload date:
- Size: 7.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.7
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 8bd9909df3bad5ee5dff9b2a9fe8abf8a9c3d2e1f82c01d490898b300517c25d |
MD5 | f50f484a549698d2701495a70a951d55 |
BLAKE2b-256 | 3b4bbf10c5f2911f382e810ac01352aa2f8c6e1d846bcce94aad3b785736cf3c |