Exports all accessible reddit comments for an account using pushshift
Install
Requires Python 3.6+
To install with pip, run:
pip install pushshift_comment_export
It is accessible as the script pushshift_comment_export, or by running python3 -m pushshift_comment_export.
Reddit (supposedly) only indexes the last 1000 items per query, so there are lots of comments that I can't reach through the official reddit API. (I run rexport periodically to pick up any new data.)
This downloads all the comments that pushshift has, which is typically more than the 1000-item query limit. It is only really meant to be run once per account, to recover old data that I otherwise don't have access to.
For more context see the comments here.
Reddit has recently added a data request, which may let you get comments going further back, but pushshift's JSON response contains a bit more info than the GDPR request does.
Complies with the rate limit described here.
$ pushshift_comment_export <reddit_username> --to-file ./data.json
.....
[D 200903 19:51:49 __init__:43] Have 4700, now searching for comments before 2015-10-07 23:32:03...
[D 200903 19:51:49 __init__:17] Requesting https://api.pushshift.io/reddit/comment/search?author=username&limit=100&sort_type=created_utc&sort=desc&before=1444260723...
[D 200903 19:51:52 __init__:43] Have 4800, now searching for comments before 2015-09-22 13:55:00...
[D 200903 19:51:52 __init__:17] Requesting https://api.pushshift.io/reddit/comment/search?author=username&limit=100&sort_type=created_utc&sort=desc&before=1442930100...
[D 200903 19:51:57 __init__:43] Have 4860, now searching for comments before 2014-08-28 07:10:14...
[D 200903 19:51:57 __init__:17] Requesting https://api.pushshift.io/reddit/comment/search?author=username&limit=100&sort_type=created_utc&sort=desc&before=1409209814...
[I 200903 19:52:01 __init__:64] Done! writing 4860 comments to file ./data.json
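The log above shows the pattern: each request asks for up to 100 comments sorted newest-first, and the next request passes the oldest timestamp seen so far as before. Below is a minimal sketch of that loop, not the package's actual code. fetch_page is a hypothetical callable standing in for the HTTP request, and each comment is assumed to be a dict carrying a created_utc field.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Comment = Dict[str, Any]
# Hypothetical: takes an optional `before` timestamp and returns one page of
# comments, newest first (this stands in for the real HTTP call).
FetchPage = Callable[[Optional[int]], List[Comment]]


def iter_all_comments(fetch_page: FetchPage) -> Iterator[Comment]:
    """Walk backwards in time, one page per call, until a page comes back empty."""
    before: Optional[int] = None
    while True:
        page = fetch_page(before)
        if not page:
            return
        yield from page
        # The oldest comment on this page becomes the upper bound
        # for the next request.
        before = min(comment["created_utc"] for comment in page)
```

Passing the fetcher in as an argument keeps the loop testable without network access. A real fetcher would also need to wait between requests to stay within the rate limit.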
pushshift doesn't require authentication. If you want to preview what this looks like, just go to https://api.pushshift.io/reddit/comment/search?author=
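To build such a preview URL by hand, here is a small sketch using only the query parameters that appear in the debug log above (author, limit, sort_type, sort, before). The function name search_url is hypothetical and not part of the package.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/comment/search"


def search_url(author: str, before: Optional[int] = None, limit: int = 100) -> str:
    """Build a comment search URL in the same shape as the logged requests."""
    params: Dict[str, Union[str, int]] = {
        "author": author,
        "limit": limit,
        "sort_type": "created_utc",
        "sort": "desc",
    }
    if before is not None:
        # Epoch seconds; only comments older than this are requested.
        params["before"] = before
    return f"{BASE_URL}?{urlencode(params)}"


print(search_url("username", before=1444260723))
```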
Usage in HPI
This has been merged into karlicoss/HPI, which combines the periodic results of rexport (to pick up new comments) with older comments exported using this tool. My config looks like:
class reddit:
    class rexport:
        export_path: Paths = "~/data/rexport/*.json"
    class pushshift:
        export_path: Paths = "~/data/pushshift/*.json"
Then importing from my.reddit.all combines the data from both of them:
>>> from my.reddit.rexport import comments as rcomments
>>> from my.reddit.pushshift import comments as pcomments
>>> from my.reddit.all import comments
>>> from more_itertools import ilen
>>> ilen(rcomments())
1020
>>> ilen(pcomments())
4891
>>> ilen(comments())
4914
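The combined count (4914) is smaller than the sum of the two sources (1020 + 4891 = 5911), which suggests that comments present in both exports are counted only once. Here is a minimal sketch of that kind of merge, assuming each comment is a dict with a unique id field. HPI's own objects and merge logic may differ, and merge_unique is a hypothetical name.

```python
from itertools import chain
from typing import Any, Dict, Iterable, Iterator, Set

Comment = Dict[str, Any]


def merge_unique(*sources: Iterable[Comment]) -> Iterator[Comment]:
    """Yield comments from all sources in order, skipping ids already seen."""
    seen: Set[str] = set()
    for comment in chain.from_iterable(sources):
        comment_id = comment["id"]
        if comment_id in seen:
            continue
        seen.add(comment_id)
        yield comment
```

Earlier sources win when the same id appears more than once, so the order of the arguments decides which copy of a duplicated comment is kept.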