A scraper that extracts multimedia data from Reddit.

Project description

Reddit Multimodal Crawler

This is a wrapper around the PRAW package that scrapes content from Reddit and saves it as CSV, JSON, TSV, or SQL files.

This repository helps you scrape various subreddits and returns their multimedia attributes.

You can pip install it to integrate with another application, or use it as a command-line application.

pip install reddit-multimodal-crawler

How to use the repository?

Before running the code, register with the Reddit API and create a sample project to obtain a client_id and a client_secret, and choose a user_agent. Then pass them in as arguments.

The easier way to get started, though, is to install the package with pip install reddit-multimodal-crawler.
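As a minimal sketch of preparing the three values, the helper below builds a user agent in the `<APP_NAME> <VERSION> by /u/<REDDIT_USERNAME>` form that the sample code's help text describes, and reads the two credentials from environment variables. The function names and the variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are assumptions made for illustration; they are not part of the package.

```python
import os


def build_user_agent(app_name, version, reddit_username):
    """Return '<APP_NAME> <VERSION> by /u/<REDDIT_USERNAME>'."""
    parts = {
        "app_name": app_name,
        "version": version,
        "reddit_username": reddit_username,
    }
    for label, value in parts.items():
        if not str(value).strip():
            raise ValueError(f"{label} must not be empty")
    return f"{app_name} {version} by /u/{reddit_username}"


def load_credentials(env=None):
    """Read the client ID and secret from environment variables.

    REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are hypothetical names chosen
    for this sketch; the package does not define them.
    """
    env = os.environ if env is None else env
    names = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
    }
```

Keeping the secret in the environment rather than on the command line avoids leaving it in shell history; the resulting values can then be passed on as the client_id, client_secret and user_agent arguments.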

Functionalities

This will help you scrape multiple subreddits just like PRAW, but it will also return and save datasets for them. It scrapes both posts and comments.
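How the package writes its datasets is not described here. Purely as an illustration of what the four output formats amount to, the sketch below writes the same records as CSV, TSV, JSON and a SQLite table using only the standard library. The records, field names, function names and the choice of SQLite for the "sql" case are all assumptions, not the package's schema or implementation.

```python
import csv
import json
import sqlite3
from pathlib import Path

# Hypothetical records; the field names are illustrative only.
ROWS = [
    {"id": "abc123", "title": "First post", "url": "https://example.com/a.jpg"},
    {"id": "def456", "title": "Second post", "url": "https://example.com/b.png"},
]


def save_delimited(rows, path, delimiter=","):
    """Write rows as CSV (default) or TSV (delimiter='\\t') with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]), delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)


def save_json(rows, path):
    """Write rows as a JSON array of objects."""
    Path(path).write_text(json.dumps(rows, indent=2), encoding="utf-8")


def save_sqlite(rows, path, table="posts"):
    """Write rows into a SQLite table with one TEXT column per field."""
    columns = list(rows[0])
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})',
                [tuple(row[c] for c in columns) for row in rows],
            )
    finally:
        conn.close()
```

For example, `save_delimited(ROWS, "posts.tsv", delimiter="\t")` produces a tab-separated file with the header `id`, `title`, `url`.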

Sample Code

import argparse

import nltk
from reddit_multimodal_crawler.crawler import Crawler

# Download the VADER lexicon up front.
nltk.download("vader_lexicon")

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    # Descriptions must be passed as help=...; a second positional string
    # would be treated as another option name and rejected by argparse.
    parser.add_argument(
        "--subreddit_file_path",
        help="A path to the file which contains the subreddits to scrape from.",
        type=str,
    )
    parser.add_argument(
        "--limit", help="The limit to number of articles to scrape.", type=int
    )
    parser.add_argument(
        "--client_id", help="The Client ID provided by Reddit.", type=str
    )
    parser.add_argument(
        "--client_secret", help="The client secret provided by Reddit.", type=str
    )
    parser.add_argument(
        "--user_agent",
        help="The User Agent in the form of <APP_NAME> <VERSION> by /u/<REDDIT_USERNAME>",
        type=str,
    )
    # type=bool would turn any non-empty string (even "False") into True,
    # so the two switches are plain flags instead.
    parser.add_argument(
        "--posts", help="Parse through the posts.", action="store_true"
    )
    parser.add_argument(
        "--comments",
        help="Parse through the comments of the top posts of the subreddit.",
        action="store_true",
    )

    args = parser.parse_args()

    # parse_args() returns a Namespace, so values are read as attributes.
    client_id = args.client_id
    client_secret = args.client_secret
    user_agent = args.user_agent
    file_path = args.subreddit_file_path
    limit = args.limit

    r = Crawler(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

    # Read the whole file and split on whitespace to get the subreddit names.
    with open(file_path, "r") as f:
        subreddit_list = f.read().split()

    print(subreddit_list)

    if args.posts:
        r.get_posts(subreddit_names=subreddit_list, sort_by="top", limit=limit)

    if args.comments:
        r.get_comments(subreddit_names=subreddit_list, sort_by="top", limit=limit)
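Two parts of the sample can be tried without Reddit credentials: reading the subreddit file and parsing the --posts and --comments options. The sketch below assumes the file holds subreddit names separated by whitespace or newlines (the page does not specify the format) and uses flag-style options, since argparse applies `type=bool` by calling `bool()` on the raw string, which is True for any non-empty value. The helper names are hypothetical.

```python
import argparse
import os
import tempfile


def read_subreddit_names(path):
    """Return the whitespace-separated names found in the file."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read().split()


def build_flag_parser():
    """Build a parser with --posts and --comments as on/off flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--posts", action="store_true", help="Scrape posts.")
    parser.add_argument("--comments", action="store_true", help="Scrape comments.")
    return parser


if __name__ == "__main__":
    # Write a small example file and read it back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "subreddits.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("python\nlearnpython pics\n")
        print(read_subreddit_names(path))

    # Only the flags that are present become True.
    print(build_flag_parser().parse_args(["--posts"]))

    # Why type=bool is unreliable: any non-empty string is truthy.
    print(bool("False"))
```

With flags, passing `--posts` enables post scraping and omitting it leaves it disabled; no value follows the option.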

Project details

Release history

1.3.2 (Dec 31, 2022)
1.3.1 (Dec 30, 2022)
1.2.0 (Dec 27, 2022, this version)
1.1.0 (Dec 27, 2022)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

reddit_multimodal_crawler-1.2.0.tar.gz (4.8 kB)

Uploaded Dec 27, 2022 Source

Built Distribution

reddit_multimodal_crawler-1.2.0-py3-none-any.whl (5.3 kB)

Uploaded Dec 27, 2022 Python 3

File details

Details for the file reddit_multimodal_crawler-1.2.0.tar.gz.

File metadata

Download URL: reddit_multimodal_crawler-1.2.0.tar.gz
Upload date: Dec 27, 2022
Size: 4.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.8

File hashes

Hashes for reddit_multimodal_crawler-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`ad7885b5b37a41a0eeb5f713a67eb4fa75fed485bf94a0622d54a7f646359a06`
MD5	`7d13651ac53ca10694caf3a61f224bc5`
BLAKE2b-256	`be2d1421e19602969be57ace8617357641c20b970124a042bb9daef43b0da5a1`
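A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: the function names are hypothetical, and the file is assumed to be in the current directory under its published name.

```python
import hashlib

# SHA256 digest listed above for reddit_multimodal_crawler-1.2.0.tar.gz.
EXPECTED_SHA256 = "ad7885b5b37a41a0eeb5f713a67eb4fa75fed485bf94a0622d54a7f646359a06"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with the published one, ignoring letter case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Usage: `matches_published_hash("reddit_multimodal_crawler-1.2.0.tar.gz", EXPECTED_SHA256)` returns True only if the local file is byte-for-byte identical to the one the digest was computed from.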

File details

Details for the file reddit_multimodal_crawler-1.2.0-py3-none-any.whl.

File metadata

Download URL: reddit_multimodal_crawler-1.2.0-py3-none-any.whl
Upload date: Dec 27, 2022
Size: 5.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.8

File hashes

Hashes for reddit_multimodal_crawler-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f5563de474f862dad0ddfbe9877f5ca5528304a1394cc21b387f1fd168eb2cb2`
MD5	`0a3717b1fdb57f6b5fb027a2ce1245ac`
BLAKE2b-256	`6927a93f211084f2183e0db5efad20f80a0716f277afa3a932d1bd41ca3fce53`

reddit-multimodal-crawler 1.2.0
