A scraper which will scrape out multimedia data from reddit.

Project description

Reddit Multimodal Crawler

This is a wrapper around the PRAW package that scrapes multimedia content from Reddit and saves it as CSV, JSON, TSV, or SQL files.

This repository helps you scrape various subreddits and returns their multimedia attributes.

You can pip install it to integrate with another application, or use it as a command-line application.

pip install reddit-multimodal-crawler

How to use the repository?

Before running the code, you should register with the Reddit API and create a sample project to obtain the client_id and client_secret and to set a user_agent. Then pass them in as arguments.

The easier way, though, is to install it with pip install reddit-multimodal-crawler.
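
Once you have the credentials, a quick sanity check is to instantiate the crawler directly (a minimal sketch based on the sample code below; the credential values are placeholders):

from reddit_multimodal_crawler.crawler import Crawler

# Placeholder credentials obtained from your Reddit API app registration.
crawler = Crawler(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="my_app 1.0 by /u/your_username",
)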

Functionalities

This package helps you scrape multiple subreddits just like PRAW, but it will also return and save datasets for them. It scrapes both the posts and the comments.

Sample Code

import argparse

import nltk

from reddit_multimodal_crawler.crawler import Crawler

# The crawler relies on NLTK's VADER sentiment lexicon, so download it up front.
nltk.download("vader_lexicon")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--subreddit_file_path",
        help="A path to the file which contains the subreddits to scrape from.",
        type=str,
    )
    parser.add_argument(
        "--limit", help="The limit on the number of articles to scrape.", type=int
    )
    parser.add_argument(
        "--client_id", help="The client ID provided by Reddit.", type=str
    )
    parser.add_argument(
        "--client_secret", help="The client secret provided by Reddit.", type=str
    )
    parser.add_argument(
        "--user_agent",
        help="The user agent in the form <APP_NAME> <VERSION> by /u/<REDDIT_USERNAME>",
        type=str,
    )
    # Boolean flags: present means True, absent means False.
    parser.add_argument(
        "--posts", action="store_true", help="Whether to scrape the posts."
    )
    parser.add_argument(
        "--comments",
        action="store_true",
        help="Whether to scrape the comments of the top posts of each subreddit.",
    )

    args = parser.parse_args()

    r = Crawler(
        client_id=args.client_id,
        client_secret=args.client_secret,
        user_agent=args.user_agent,
    )

    # One subreddit name per whitespace-separated token in the file.
    with open(args.subreddit_file_path, "r") as f:
        subreddit_list = f.read().split()

    print(subreddit_list)

    if args.posts:
        r.get_posts(subreddit_names=subreddit_list, sort_by="top", limit=args.limit)

    if args.comments:
        r.get_comments(subreddit_names=subreddit_list, sort_by="top", limit=args.limit)
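
For illustration, assuming the script above is saved as crawl.py and subreddits.txt contains whitespace-separated subreddit names (both file names are hypothetical), it could be invoked like this, with placeholder credentials:

python crawl.py \
    --subreddit_file_path subreddits.txt \
    --limit 100 \
    --client_id YOUR_CLIENT_ID \
    --client_secret YOUR_CLIENT_SECRET \
    --user_agent "my_app 1.0 by /u/your_username" \
    --posts \
    --comments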

Download files

Download the file for your platform.

Source Distribution

reddit_multimodal_crawler-1.3.1.tar.gz (4.9 kB)

Built Distribution

reddit_multimodal_crawler-1.3.1-py3-none-any.whl (5.4 kB)

File details

Details for the file reddit_multimodal_crawler-1.3.1.tar.gz.

File metadata

File hashes

Hashes for reddit_multimodal_crawler-1.3.1.tar.gz
Algorithm Hash digest
SHA256 a7dfe3ece16f9fe7acfc225426ecd5b03dd11df822aa9e62fe42258e3c12aaf1
MD5 a6d7d63dfe689001f38d2602e579cde4
BLAKE2b-256 6b3ba8a3d70e62530fafc2a7dc4cfb03da2aa1bca2e21e322b69caefd04d34c9


File details

Details for the file reddit_multimodal_crawler-1.3.1-py3-none-any.whl.

File metadata

File hashes

Hashes for reddit_multimodal_crawler-1.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 41161cce549f97022540434a79f8a1c1b2a12a3e81ac8ba3d824039cdcef06a5
MD5 ed20ab686d2c1222bf0dd2620cddc57c
BLAKE2b-256 e8edb0f7dff406335345d68626100e1f85573e1cd539db6eec4edf2ef1606328

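To check a downloaded file against the hashes above, you can compute its digest locally. Here is a minimal sketch using Python's standard hashlib, assuming the source distribution sits in the current directory:

import hashlib

# Expected SHA256 for reddit_multimodal_crawler-1.3.1.tar.gz, copied from the table above.
EXPECTED = "a7dfe3ece16f9fe7acfc225426ecd5b03dd11df822aa9e62fe42258e3c12aaf1"

with open("reddit_multimodal_crawler-1.3.1.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == EXPECTED else "MISMATCH")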
