
Data Curation over Time

Project description

Redditflow.

Mission.

Scrape data from Reddit over a time period of your choice, filter it with AI assistants, and connect it to your ML pipelines!

Execution is as simple as this:

  • Create a config file with the required input details (see the argument tables below).
  • Run the API in a single line, passing the config as input (see Examples below).

Installation.

pip install redditflow

Docs.

1) Text API.

Argument                 Type            Description
sort_by                  str             Sort the results by the options Reddit offers, e.g. 'best', 'new', 'top', 'controversial'.
subreddit_text_limit     int             Number of rows to scrape per subreddit.
total_limit              int             Total number of rows to scrape.
start_time               DateTime        Start date and time in dd.mm.yyyy hh:mm:ss format.
end_time                 DateTime        End date and time in dd.mm.yyyy hh:mm:ss format.
subreddit_search_term    str             Search term used to filter the scraped output.
subreddit_object_type    str             What to scrape: 'submission' or 'comment'.
resume_task_timestamp    str, optional   If a task is interrupted, resume it by passing the timestamp from the created folder name.
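
The start_time and end_time strings follow the dd.mm.yyyy hh:mm:ss layout shown in the Examples below, so they can be generated with Python's datetime instead of being written by hand. A minimal sketch (the helper name is illustrative, not part of redditflow):

from datetime import datetime

def to_redditflow_time(dt: datetime) -> str:
    # Render a datetime as the dd.mm.yyyy hh:mm:ss string the config expects.
    return dt.strftime("%d.%m.%Y %H:%M:%S")

print(to_redditflow_time(datetime(2021, 3, 27, 11, 38, 42)))  # 27.03.2021 11:38:42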

2) Image API.

Argument                 Type       Description
sort_by                  str        Sort the results by the options Reddit offers, e.g. 'best', 'new', 'top', 'controversial'.
subreddit_image_limit    int        Number of images to scrape per subreddit.
total_limit              int        Total number of images to scrape.
start_time               DateTime   Start date and time in dd.mm.yyyy hh:mm:ss format.
end_time                 DateTime   End date and time in dd.mm.yyyy hh:mm:ss format.
subreddit_search_term    str        Search term used to filter the scraped output.
subreddit_object_type    str        What to scrape: 'submission' or 'comment'.
client_id                str        The Image API is built on PRAW, so the config needs a PRAW client ID.
client_secret            str        PRAW client secret.
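
Because the Image API drives PRAW under the hood, it can save time to confirm the credentials work with PRAW directly before starting a long scrape. A minimal sketch, assuming a registered Reddit script app (the user_agent string is an arbitrary placeholder):

import praw

# Build a read-only PRAW client with the same credentials the Image API uses.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="redditflow-credential-check",
)

# Fetch a single submission; this fails fast if Reddit rejects the credentials.
for submission in reddit.subreddit("cats").hot(limit=1):
    print(submission.title)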

Examples

Text Scraping and filtering

config = {
    "sort_by": "best",
    "subreddit_text_limit": 50,
    "total_limit": 200,
    "start_time": "27.03.2021 11:38:42",
    "end_time": "27.03.2022 11:38:42",
    "subreddit_search_term": "healthcare",
    "subreddit_object_type": "comment",
    "resume_task_timestamp": "1648613439",  # optional; taken from the created folder name
}
from redditflow import TextApi
TextApi(config)

Image Scraping and filtering

config = {
    "sort_by": "best",
    "subreddit_image_limit": 3,
    "total_limit": 10,
    "start_time": "13.11.2021 09:38:42",
    "end_time": "15.11.2021 11:38:42",
    "subreddit_search_term": "cats",
    "subreddit_object_type": "comment",
    "client_id": "$CLIENT_ID",  # get a client ID for PRAW
    "client_secret": "$CLIENT_SECRET",  # get a client secret for PRAW
}

from redditflow import ImageApi
ImageApi(config)
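
The $CLIENT_ID and $CLIENT_SECRET values above are placeholders. If you'd rather not hardcode the credentials, here is a sketch of reading them from the environment before the ImageApi(config) call (assuming both variables are exported in your shell):

import os

# Overwrite the placeholder credentials in the config above with values from
# the environment; this raises KeyError if either variable is unset.
config.update(
    client_id=os.environ["CLIENT_ID"],
    client_secret=os.environ["CLIENT_SECRET"],
)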


Download files

Download the file for your platform.

Source Distribution

redditflow-0.1.0.tar.gz (10.8 kB)


Built Distribution

redditflow-0.1.0-py3.8.egg (28.9 kB)


File details

Details for the file redditflow-0.1.0.tar.gz.

File metadata

  • Download URL: redditflow-0.1.0.tar.gz
  • Size: 10.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.8.13

File hashes

Hashes for redditflow-0.1.0.tar.gz

Algorithm     Hash digest
SHA256        7178fa8f0c714486e07dd219d03d70459399a98062116fdc20061b57d5ffdebe
MD5           e3c84bd57a6a45b83e5f1fe8273b2884
BLAKE2b-256   70b2784f99ac1d96954de9934d61c28f054d9411b25c499ff9233d6e5f9aa66c

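To verify a downloaded file against these digests, one option is Python's hashlib (a generic sketch; run it in the directory containing the download):

import hashlib

# Recompute the SHA256 digest of the sdist and compare it with the published value.
expected = "7178fa8f0c714486e07dd219d03d70459399a98062116fdc20061b57d5ffdebe"
with open("redditflow-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "MISMATCH")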

File details

Details for the file redditflow-0.1.0-py3.8.egg.

File metadata

  • Download URL: redditflow-0.1.0-py3.8.egg
  • Size: 28.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.8.13

File hashes

Hashes for redditflow-0.1.0-py3.8.egg

Algorithm     Hash digest
SHA256        6e8b83d74ee80a3efc17848a7b811bea8e11f4193bf588538e70edde7b9f3a80
MD5           96ed3e761c471b3d419b85c50ce590e8
BLAKE2b-256   9afcd19e6f4d06359abfda5ab4dd97e9b0f93695d0a8983dc57041a37f9806db

