
Exports all accessible reddit comments for an account using pushshift.


Install

Requires Python 3.6+.

To install with pip, run:

pip install pushshift_comment_export

This installs the pushshift_comment_export script; it can also be run as python3 -m pushshift_comment_export.


Reddit (supposedly) only indexes the last 1000 items per query, so there are lots of comments I can't access through the official reddit API (I run rexport periodically to pick up any new data).

This downloads all the comments that pushshift has, which is typically more than the 1000-item query limit allows. It's only really meant to be run once per account, to recover old data that I otherwise can't access.

For more context see the comments here.

Reddit has recently added a data request which may let you get comments going further back, but pushshift's JSON response contains a bit more info than the GDPR export does.

Complies with the rate limit described here.

$ pushshift_comment_export <reddit_username> --to-file ./data.json
.....
[D 200903 19:51:49 __init__:43] Have 4700, now searching for comments before 2015-10-07 23:32:03...
[D 200903 19:51:49 __init__:17] Requesting https://api.pushshift.io/reddit/comment/search?author=username&limit=100&sort_type=created_utc&sort=desc&before=1444260723...
[D 200903 19:51:52 __init__:43] Have 4800, now searching for comments before 2015-09-22 13:55:00...
[D 200903 19:51:52 __init__:17] Requesting https://api.pushshift.io/reddit/comment/search?author=username&limit=100&sort_type=created_utc&sort=desc&before=1442930100...
[D 200903 19:51:57 __init__:43] Have 4860, now searching for comments before 2014-08-28 07:10:14...
[D 200903 19:51:57 __init__:17] Requesting https://api.pushshift.io/reddit/comment/search?author=username&limit=100&sort_type=created_utc&sort=desc&before=1409209814...
[I 200903 19:52:01 __init__:64] Done! writing 4860 comments to file ./data.json

pushshift doesn't require authentication; to preview what the output looks like, just go to https://api.pushshift.io/reddit/comment/search?author=
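
For illustration, here's a minimal sketch of the pagination loop reflected in the log above, using the requests library. The query parameters match the requests the tool logs; the fixed sleep is only an illustrative stand-in, not the tool's actual rate-limit handling.

import time
import requests

URL = "https://api.pushshift.io/reddit/comment/search"

def all_comments(author):
    before = None  # unix timestamp; None means "start from the newest comment"
    while True:
        params = {"author": author, "limit": 100,
                  "sort_type": "created_utc", "sort": "desc"}
        if before is not None:
            params["before"] = before
        data = requests.get(URL, params=params).json()["data"]
        if not data:
            return  # nothing older left -- we've paged through everything
        yield from data
        # next page: everything older than the oldest comment seen so far
        before = data[-1]["created_utc"]
        time.sleep(1)  # illustrative delay, not the exact rate limit the tool uses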

Usage in HPI

This has been merged into karlicoss/HPI, which combines the periodic results of rexport (to pick up new comments) with the older data exported by this tool. My config looks like:

class reddit:
    class rexport:
        export_path: Paths = "~/data/rexport/*.json"
    class pushshift:
        export_path: Paths = "~/data/pushshift/*.json"

Then importing from my.reddit.all combines the data from both of them:

>>> from my.reddit.rexport import comments as rcomments
>>> from my.reddit.pushshift import comments as pcomments
>>> from my.reddit.all import comments
>>> from more_itertools import ilen
>>> ilen(rcomments())
1020
>>> ilen(pcomments())
4891
>>> ilen(comments())
4914
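
The combined count (4914) is lower than 1020 + 4891 = 5911 because comments that appear in both exports are only counted once. A minimal sketch of that kind of merge, assuming each comment is a dict with an id field (a hypothetical helper for illustration, not HPI's actual implementation):

from itertools import chain

def merged(*sources):
    # dedupe by reddit comment id, which is shared across both exports
    seen = set()
    for comment in chain.from_iterable(sources):
        if comment["id"] not in seen:  # hypothetical: assumes a dict with an "id" key
            seen.add(comment["id"])
            yield comment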
