A Python package that is used to download posts and comments from Reddit.

Reddit Data Collector

Reddit Data Collector is a Python package that allows a user to collect post and comment data from Reddit. It is built on top of the Python module PRAW, which stands for "The Python Reddit API Wrapper". It aims to make it very simple for a user to collect data from Reddit for further analysis (e.g. Natural Language Processing), without having to learn the inner workings of PRAW or the Reddit API.
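For context, the sketch below shows the kind of PRAW boilerplate that a wrapper like this is meant to hide. It is not Reddit Data Collector's own code or API. The helper names `post_to_row` and `collect_hot_posts` and the chosen fields are assumptions made for illustration; only `subreddit(...)`, `hot(limit=...)`, and the submission attributes read here come from PRAW. A small offline stand-in replaces the real client so the sketch runs without credentials.

```python
from types import SimpleNamespace

# Submission attributes to copy (an illustrative selection, not the package's).
POST_FIELDS = ("id", "title", "selftext", "score", "num_comments", "created_utc")


def post_to_row(post):
    """Copy a handful of attributes from a PRAW-style submission into a dict."""
    row = {field: getattr(post, field, None) for field in POST_FIELDS}
    row["subreddit"] = post.subreddit.display_name
    return row


def collect_hot_posts(reddit, subreddit_name, limit=10):
    """Return one dict per post from a subreddit's "hot" listing."""
    listing = reddit.subreddit(subreddit_name).hot(limit=limit)
    return [post_to_row(post) for post in listing]


# Offline stand-in so the sketch runs without credentials or network access.
fake_post = SimpleNamespace(
    id="abc123",
    title="Hello",
    selftext="First post",
    score=42,
    num_comments=3,
    created_utc=1640995200.0,
    subreddit=SimpleNamespace(display_name="python"),
)


class FakeSubreddit:
    def hot(self, limit=10):
        return [fake_post][:limit]


class FakeReddit:
    def subreddit(self, name):
        return FakeSubreddit()


rows = collect_hot_posts(FakeReddit(), "python", limit=5)
print(rows)
```

With real credentials, the `FakeReddit()` argument would be replaced by a `praw.Reddit(client_id=..., client_secret=..., user_agent=...)` instance, and the resulting list of dicts can be passed to `pandas.DataFrame` for analysis.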

It is currently maintained by Nico Van den Hooff.

Installation

Dependencies

Reddit Data Collector requires Python and:

  • pandas (>=1.3.5)
  • praw (>=7.5.0)
  • tqdm (>=4.62.3)
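As a quick sanity check, the minimum versions above can be compared against what is installed in the current environment. The following is a minimal sketch using only the standard library (Python 3.8 or later for `importlib.metadata`). The helper names are made up for this example, and the version comparison is deliberately simple: it looks only at the leading numeric components and ignores pre-release suffixes.

```python
import re
from importlib import metadata

# Minimum versions as listed in the Dependencies section.
MINIMUM_VERSIONS = {"pandas": "1.3.5", "praw": "7.5.0", "tqdm": "4.62.3"}


def release_tuple(version):
    """Turn '1.3.5' (or '2.0.0rc1') into a tuple of ints such as (1, 3, 5)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare numerically, so '4.9.0' is correctly older than '4.62.3'."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_dependencies(minimums=MINIMUM_VERSIONS):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report
```

Calling `check_dependencies()` returns a dictionary keyed by package name, which can be printed or inspected before installing.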

User installation

The recommended way to install Reddit Data Collector is using pip:

pip install reddit-data-collector

How to Use Reddit Data Collector

Please see the examples directory for step-by-step instructions on how to use Reddit Data Collector.

Development

Important links

Source code

You can check out the latest sources with the command:

git clone https://github.com/nicovandenhooff/reddit-data-collector.git

Contributing

To learn more about making a contribution to Reddit Data Collector, please see the contributing file.

Potential Ideas for Contribution

  • Add the ability to collect images from Reddit posts that contain them.
  • Add author information to post and comment data. The Reddit API is currently inconsistent about suspended and deleted authors, so this functionality has not been built yet.
  • Add a plotting module that creates useful visualizations of the collected data.
  • Add a preprocessing module that cleans up the collected post and/or comment data.

Testing

After installation, you can launch the test suite, which is contained in tests/tests.py. You will need pytest >= 6.2.5 and pytest-cov >= 3.0.0 installed. To run the suite, follow these steps from the project's root directory:

  1. Open tests.py in an editor, for example on macOS with the following command:
open tests/tests.py
  2. Comment out lines 24 to 30, then change the values passed to DataCollector() on line 32 to your Reddit credentials.
  3. Run the tests with the following command:
pytest tests/tests.py
  4. If desired, run the following command to show test coverage:
pytest --cov=src tests/tests.py
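Since the steps above involve typing credentials into a source file, one way to keep them out of version control is to read them from environment variables. This is a sketch of an alternative, not how tests.py is written. The `REDDIT_*` variable names are hypothetical, and the exact set of values that `DataCollector()` expects is an assumption (client ID, client secret, user agent, username, and password are the fields PRAW uses for a script application).

```python
import os

# Hypothetical environment variable names; rename to suit your setup.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USER_AGENT",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def load_reddit_credentials(env=None):
    """Return credentials as a dict, or raise if any variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Strip the prefix and lower-case the rest: REDDIT_CLIENT_ID -> client_id
    return {name[len("REDDIT_"):].lower(): env[name] for name in REQUIRED_VARS}
```

The returned dictionary has the keys `client_id`, `client_secret`, `user_agent`, `username`, and `password`, which can then be passed on wherever the credentials are needed.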

Project History

The project was started in January 2022 by Nico Van den Hooff as a side project while he was completing the UBC Master of Data Science program. Nico wanted to obtain a sample of posts and comments from Reddit, but noticed that although PRAW existed and provided seamless access to Reddit's API, no available package offered a simple way to collect this data.

Inspiration

Certain sections of this README file were inspired by the scikit-learn README.

Download files

Source Distribution

reddit-data-collector-1.1.0.tar.gz (11.6 kB)

Uploaded Source

Built Distribution

reddit_data_collector-1.1.0-py3-none-any.whl (11.4 kB)

Uploaded Python 3

File details

Details for the file reddit-data-collector-1.1.0.tar.gz.

File metadata

  • Download URL: reddit-data-collector-1.1.0.tar.gz
  • Size: 11.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.1

File hashes

Hashes for reddit-data-collector-1.1.0.tar.gz
Algorithm Hash digest
SHA256 335b8ff74ef3ead878ddc9cb4e2a4da9a6dd14411f1e0a07c029f4660d63cf55
MD5 4ab32319039a26091976b011dc24677a
BLAKE2b-256 375121262ece57a920ac1688c160d0ecea492cbf3c4a25a38d88c94a5dff11dd

File details

Details for the file reddit_data_collector-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: reddit_data_collector-1.1.0-py3-none-any.whl
  • Size: 11.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.1

File hashes

Hashes for reddit_data_collector-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e8261cd8691a7805ed59733c9a7220365bb342323fe5d85ee0e2d1fd8a60962c
MD5 4515d6ada1e67fa2b63d1ea0c4c6d564
BLAKE2b-256 629d3994784d7609692163b50a2682dbaaaf8a00387b287a515267af1d9f6d6b
