
Data Curation over Time


Redditflow.

**Do everything, from collecting data from Reddit to training a machine learning model, in just two lines of Python code!**



Supports:

  • Text Data
  • Image Data

Execution is as simple as this:

  • Create a config with your required input details.
  • Run the API in a single line, passing the config as input.

Installation.

pip install redditflow

Latest installation from source.

pip install git+https://github.com/nfflow/redditflow

Docs.

1) Text API.

| Argument | Type | Description |
| --- | --- | --- |
| `sort_by` | str | Sort order for the results, e.g. 'best', 'new', 'top', 'controversial', as available from Reddit. |
| `subreddit_text_limit` | int | Number of rows to scrape per subreddit. |
| `total_limit` | int | Total number of rows to scrape. |
| `start_time` | DateTime | Start date and time in dd.mm.yyyy hh:mm:ss format. |
| `end_time` | DateTime | End date and time in dd.mm.yyyy hh:mm:ss format. |
| `subreddit_search_term` | str | Search term used to filter the output. |
| `subreddit_object_type` | str | What to scrape: 'submission' or 'comment'. |
| `resume_task_timestamp` | str, optional | If a task is interrupted, the timestamp available from the created folder names can be used to resume it. |
| `ml_pipeline` | dict, optional | Specify this to connect an ML pipeline at the end and obtain a trained model. See the ML pipeline arguments below. |
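The `start_time`/`end_time` values are plain day-first strings. As a minimal sketch (standard library only, not part of redditflow), this is how such strings can be built and validated:

```python
from datetime import datetime

# Day-first timestamp format matching the config examples below
FMT = "%d.%m.%Y %H:%M:%S"

# Build a config-ready string from a datetime
start = datetime(2021, 3, 27, 11, 38, 42).strftime(FMT)

# Round-trip a user-supplied string to validate it before scraping
parsed = datetime.strptime("27.03.2022 11:38:42", FMT)
```

A `strptime` call like this raises `ValueError` on malformed input, which makes it a cheap sanity check before launching a long scrape.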

ML pipeline arguments

The ML pipeline dict can have the following arguments.

| Argument | Type | Description |
| --- | --- | --- |
| `model_name` | str | Name of a pre-trained model, currently from the Sentence Transformers hub (https://www.sbert.net/). |
| `model_output_path` | str | Path where the trained model is saved. |
| `model_architecture` | str | Training architecture identifier, e.g. 'CT' as used in the example below. |
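As a sketch (the helper name and key check below are not part of redditflow), a malformed `ml_pipeline` dict can be caught before launching a long scrape:

```python
REQUIRED_KEYS = {"model_name", "model_output_path"}

def check_ml_pipeline(ml_pipeline):
    """Raise early if a required ml_pipeline key is missing."""
    missing = REQUIRED_KEYS - ml_pipeline.keys()
    if missing:
        raise ValueError(f"ml_pipeline is missing keys: {sorted(missing)}")
    return ml_pipeline

# Passes silently; an empty dict would raise ValueError
check_ml_pipeline({
    "model_name": "distilbert-base-uncased",
    "model_output_path": "healthcare_redditflow",
})
```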

2) Image API

| Argument | Type | Description |
| --- | --- | --- |
| `sort_by` | str | Sort order for the results, e.g. 'best', 'new', 'top', 'controversial', as available from Reddit. |
| `subreddit_image_limit` | int | Number of images to scrape per subreddit. |
| `total_limit` | int | Total number of images to scrape. |
| `start_time` | DateTime | Start date and time in dd.mm.yyyy hh:mm:ss format. |
| `end_time` | DateTime | End date and time in dd.mm.yyyy hh:mm:ss format. |
| `subreddit_search_term` | str | Search term used to filter the output. |
| `subreddit_object_type` | str | What to scrape: 'submission' or 'comment'. |
| `client_id` | str | praw client ID (the Image API uses praw). |
| `client_secret` | str | praw client secret. |
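Rather than pasting credentials into the config literal, they can be read from the environment. A sketch (the variable names `PRAW_CLIENT_ID` and `PRAW_CLIENT_SECRET` are an assumption, not a redditflow convention):

```python
import os

# Read praw credentials from the environment, with empty-string fallbacks
credentials = {
    "client_id": os.environ.get("PRAW_CLIENT_ID", ""),
    "client_secret": os.environ.get("PRAW_CLIENT_SECRET", ""),
}

config = {
    "sort_by": "best",
    "subreddit_image_limit": 3,
    "total_limit": 10,
    **credentials,  # merge the credential keys into the config
}
```

This keeps secrets out of version control; the rest of the config keys are filled in as in the full example below.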

Examples

Text data collection, followed by training a model:

```python
from redditflow import TextApi

config = {
    "sort_by": "best",
    "subreddit_text_limit": 50,
    "total_limit": 200,
    "start_time": "27.03.2021 11:38:42",
    "end_time": "27.03.2022 11:38:42",
    "subreddit_search_term": "healthcare",
    "subreddit_object_type": "comment",
    "ml_pipeline": {
        "model_name": "distilbert-base-uncased",
        "model_output_path": "healthcare_27.03.2021-27.03.2022_redditflow",
        "model_architecture": "CT",
    },
}

TextApi(config)
```


Image data collection:

```python
from redditflow import ImageApi

config = {
    "sort_by": "best",
    "subreddit_image_limit": 3,
    "total_limit": 10,
    "start_time": "13.11.2021 09:38:42",
    "end_time": "15.11.2021 11:38:42",
    "subreddit_search_term": "cats",
    "subreddit_object_type": "comment",
    "client_id": "$CLIENT_ID",  # praw client ID
    "client_secret": "$CLIENT_SECRET",  # praw client secret
}

ImageApi(config)
```


Since the Image API uses the Python praw library, a praw `client_id` and `client_secret` are required. See the praw documentation (https://praw.readthedocs.io/) for how to obtain them.

Project details

Source distribution: redditflow-1.0.0.tar.gz (11.6 kB), uploaded via twine/4.0.0 on CPython/3.8.13 (not via Trusted Publishing).

Hashes for redditflow-1.0.0.tar.gz:

  • SHA256: 11e1715898b5ff8fc53b342612bf314920392e11bb06c5c17445f35042b48f5f
  • MD5: 8d8417e59bcf1713e716b10a878cbfc1
  • BLAKE2b-256: 4bd7cb0b85f0cefdb978b63288e47972ee091b3129d86663e9fb65de3a3056b7
