Data Curation over Time
Project description
Redditflow
Mission
Scrape data from Reddit over a time period of your choice, filter it with AI assistants, and connect it to your ML pipelines!
Execution is as simple as this:
- Create a config file with the required input details.
- Run the API in a single line, passing the config as input.
Installation
pip install redditflow
Docs
1) Text API
Argument | Input | Description |
---|---|---|
sort_by | str | Sort the results by the options available from Reddit, e.g. 'best', 'new', 'top', 'controversial'. |
subreddit_text_limit | int | Number of rows to be scraped per subreddit |
total_limit | int | Total number of rows to be scraped |
start_time | DateTime | Start date and time in dd.mm.yyyy hh:mm:ss format |
end_time | DateTime | End date and time in dd.mm.yyyy hh:mm:ss format |
subreddit_search_term | str | Search term used to create filtered outputs |
subreddit_object_type | str | Available options for scraping are 'submission' and 'comment'. |
resume_task_timestamp | str, Optional | If a task is interrupted, the timestamp available from the created folder names can be used to resume it. |
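The start and stop timestamps in the config are plain strings in a day-first layout. A quick way to validate one before building the config is to round-trip it through `datetime.strptime` (a sketch: `parse_config_time` and `TIME_FORMAT` are illustrative names, not part of redditflow):

```python
from datetime import datetime

# Day-first layout matching the examples below, e.g. "27.03.2021 11:38:42"
TIME_FORMAT = "%d.%m.%Y %H:%M:%S"

def parse_config_time(value: str) -> datetime:
    """Return the parsed timestamp; raises ValueError if the layout is wrong."""
    return datetime.strptime(value, TIME_FORMAT)

print(parse_config_time("27.03.2021 11:38:42"))
```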
2) Image API
Argument | Input | Description |
---|---|---|
sort_by | str | Sort the results by the options available from Reddit, e.g. 'best', 'new', 'top', 'controversial'. |
subreddit_image_limit | int | Number of images to be scraped per subreddit |
total_limit | int | Total number of images to be scraped |
start_time | DateTime | Start date and time in dd.mm.yyyy hh:mm:ss format |
end_time | DateTime | End date and time in dd.mm.yyyy hh:mm:ss format |
subreddit_search_term | str | Search term used to create filtered outputs |
subreddit_object_type | str | Available options for scraping are 'submission' and 'comment'. |
client_id | str | The Image API uses praw, so the config requires a praw client ID. |
client_secret | str | praw client secret. |
Examples
Text scraping and filtering

from redditflow import TextApi

config = {
    "sort_by": "best",
    "subreddit_text_limit": 50,
    "total_limit": 200,
    "start_time": "27.03.2021 11:38:42",
    "end_time": "27.03.2022 11:38:42",
    "subreddit_search_term": "healthcare",
    "subreddit_object_type": "comment",
    "resume_task_timestamp": "1648613439",
}

TextApi(config)
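The `resume_task_timestamp` value in the example (1648613439) looks like a Unix epoch taken from the output folder names; if so, the standard library can make it human-readable (an assumption from the example value, not documented redditflow behaviour):

```python
from datetime import datetime, timezone

# Convert the epoch from the example above into a readable UTC datetime
resume_ts = 1648613439
resumed_at = datetime.fromtimestamp(resume_ts, tz=timezone.utc)
print(resumed_at)
```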
Image scraping and filtering

from redditflow import ImageApi

config = {
    "sort_by": "best",
    "subreddit_image_limit": 3,
    "total_limit": 10,
    "start_time": "13.11.2021 09:38:42",
    "end_time": "15.11.2021 11:38:42",
    "subreddit_search_term": "cats",
    "subreddit_object_type": "comment",
    "client_id": "$CLIENT_ID",  # praw client ID
    "client_secret": "$CLIENT_SECRET",  # praw client secret
}

ImageApi(config)
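The `$CLIENT_ID` and `$CLIENT_SECRET` values above are placeholders. One way to avoid hardcoding real praw credentials in the config is to read them from environment variables (a sketch: the variable names `CLIENT_ID` and `CLIENT_SECRET` are assumptions, not something redditflow defines):

```python
import os

# Pull the praw credentials from the environment instead of embedding
# them in the source; falls back to empty strings if they are unset.
credentials = {
    "client_id": os.environ.get("CLIENT_ID", ""),
    "client_secret": os.environ.get("CLIENT_SECRET", ""),
}

# Merge the credentials into the rest of the config
config = {"sort_by": "best", "subreddit_search_term": "cats", **credentials}
print(sorted(config))
```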
Project details
Download files
Download the file for your platform.
Source Distribution
redditflow-0.1.0.tar.gz (10.8 kB)
Built Distribution
redditflow-0.1.0-py3.8.egg (28.9 kB)
File details
Details for the file redditflow-0.1.0.tar.gz.
File metadata
- Download URL: redditflow-0.1.0.tar.gz
- Upload date:
- Size: 10.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.8.13
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 7178fa8f0c714486e07dd219d03d70459399a98062116fdc20061b57d5ffdebe |
MD5 | e3c84bd57a6a45b83e5f1fe8273b2884 |
BLAKE2b-256 | 70b2784f99ac1d96954de9934d61c28f054d9411b25c499ff9233d6e5f9aa66c |
File details
Details for the file redditflow-0.1.0-py3.8.egg.
File metadata
- Download URL: redditflow-0.1.0-py3.8.egg
- Upload date:
- Size: 28.9 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.8.13
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 6e8b83d74ee80a3efc17848a7b811bea8e11f4193bf588538e70edde7b9f3a80 |
MD5 | 96ed3e761c471b3d419b85c50ce590e8 |
BLAKE2b-256 | 9afcd19e6f4d06359abfda5ab4dd97e9b0f93695d0a8983dc57041a37f9806db |