
Create a SQLite database containing data pulled from Reddit about a single user.


reddit-user-to-sqlite

Stores all the content from a specific user in a SQLite database. This includes their comments and their posts.

Install

The package is available on PyPI as reddit-user-to-sqlite. Install it globally using pipx:

pipx install reddit-user-to-sqlite

Usage

The CLI currently exposes two commands: user and archive. They let you archive a user's recent comments/posts from the API, or all of their comments/posts (as read from the CSV files in a Reddit data export).

user

Fetches all comments and posts for a specific user.

reddit-user-to-sqlite user your_username
reddit-user-to-sqlite user your_username --db my-reddit-data.db

Params

Note: the argument order is reversed from most dogsheep packages (which take db_path first). This ordering allows for a default db name, which I prefer.

  1. username: a case-insensitive string. The leading /u/ is optional (and ignored if supplied).
  2. (optional) --db: the path to a sqlite file, which will be created or updated as needed. Defaults to reddit.db.
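For a sense of what this command does under the hood, here's a minimal sketch (in Python) of paging through Reddit's public listing API. It's illustrative only, not the package's actual implementation; the fetch_comments helper and User-Agent string are invented for the example.

# Illustrative sketch, not this package's code: page through Reddit's
# public listing API the way the `user` command conceptually does.
import requests

USER_AGENT = "reddit-user-to-sqlite example"  # Reddit requires a descriptive UA

def fetch_comments(username: str) -> list[dict]:
    comments, after = [], None
    while True:
        resp = requests.get(
            f"https://www.reddit.com/user/{username}/comments.json",
            params={"limit": 100, "after": after, "raw_json": 1},
            headers={"User-Agent": USER_AGENT},
        )
        resp.raise_for_status()
        listing = resp.json()["data"]
        comments += [child["data"] for child in listing["children"]]
        after = listing["after"]
        if not after:  # the API stops paging at roughly 1k items
            break
    return comments

Posts work the same way via the user's /submitted.json listing.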

archive

Reads the output of a Reddit GDPR archive and fetches additional info from the Reddit API (where possible). This allows you to store more than 1k posts/comments.

FYI: this behavior is built on the assumption that the archive Reddit provides has the same format regardless of whether you select GDPR or CCPA as the request type. But, just to be on the safe side, I recommend selecting GDPR during the export process until I'm able to confirm.

Params

Note: the argument order is reversed from most dogsheep packages (which take db_path first). This ordering allows for a default db name, which I prefer.

  1. archive_path: the path to the (unzipped) archive directory on your machine. Don't rename/move the files that Reddit gives you.
  2. (optional) --db: the path to a sqlite file, which will be created or updated as needed. Defaults to reddit.db.
  3. (optional) --skip-saved: a flag to skip loading saved comments/posts from the archive.
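As a rough sketch of what this flow involves: read IDs out of the export's CSV, then hydrate them via Reddit's /api/info endpoint (which accepts up to 100 fullnames per request). This is illustrative, not the package's actual code; the helper names are invented, and it assumes the export's comments.csv has an id column of base36 IDs.

# Hedged sketch: collect comment IDs from the archive CSV, then fetch
# full data for them in batches from Reddit's /api/info endpoint.
import csv

import requests

def load_archive_comment_ids(csv_path: str) -> list[str]:
    # Assumption: the export's comments.csv has an "id" column.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row["id"] for row in csv.DictReader(f)]

def hydrate_comments(ids: list[str]) -> list[dict]:
    results = []
    for i in range(0, len(ids), 100):
        # t1_ is the fullname prefix for comments (t3_ for posts)
        fullnames = ",".join(f"t1_{c}" for c in ids[i : i + 100])
        resp = requests.get(
            "https://www.reddit.com/api/info.json",
            params={"id": fullnames, "raw_json": 1},
            headers={"User-Agent": "reddit-user-to-sqlite example"},
        )
        resp.raise_for_status()
        results += [child["data"] for child in resp.json()["data"]["children"]]
    return results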

Viewing Data

The resulting SQLite database pairs well with Datasette, a tool for browsing SQLite databases in the browser. Below is my recommended configuration.

First, install datasette:

pipx install datasette

Then, add the recommended plugins (for rendering timestamps and markdown):

pipx inject datasette datasette-render-markdown datasette-render-timestamps

Finally, create a metadata.json file next to your reddit.db with the following:

{
  "databases": {
    "reddit": {
      "tables": {
        "comments": {
          "sort_desc": "timestamp",
          "plugins": {
            "datasette-render-markdown": {
              "columns": ["text"]
            },
            "datasette-render-timestamps": {
              "columns": ["timestamp"]
            }
          }
        },
        "posts": {
          "sort_desc": "timestamp",
          "plugins": {
            "datasette-render-markdown": {
              "columns": ["text"]
            },
            "datasette-render-timestamps": {
              "columns": ["timestamp"]
            }
          }
        },
        "subreddits": {
          "sort": "name"
        }
      }
    }
  }
}

Now when you run

datasette reddit.db --metadata metadata.json

you'll get nicely formatted output.
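You can also poke at the same tables programmatically, since the output is a plain SQLite file. Here's a minimal sketch using Python's sqlite3; the table and column names match the metadata.json above, but treat the query as illustrative:

# Print the five most recent comments straight from the database.
import sqlite3

db = sqlite3.connect("reddit.db")
db.row_factory = sqlite3.Row

for row in db.execute(
    "SELECT timestamp, text FROM comments ORDER BY timestamp DESC LIMIT 5"
):
    print(row["timestamp"], row["text"][:80])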

Motivation

I got nervous when I saw Reddit's notification of upcoming API changes. To ensure I could always access the data I'd created, I wanted a backup in place before anything changed in a big way.

FAQs

Why does this tool only show 1k recent comments/posts?

Reddit's paging API only shows 1000 items (page 11 is an empty list). If you have more comments (or posts) than that, you can use the GDPR archive import feature to backfill your older data.

Why are my longer posts truncated in Datasette?

Datasette truncates long text fields by default. You can disable this behavior with the truncate_cells_html setting when running datasette (see the docs):

datasette reddit.db --setting truncate_cells_html 0

How do I store a username that starts with -?

By default, click (the argument parser this tool uses) interprets a leading dash on an argument as a flag. If you're fetching data for user -asdf, you'll get an error saying Error: No such option: -a. To ensure the last argument is interpreted positionally, put it after a --:

reddit-user-to-sqlite user -- -asdf
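To see why that works, here's a minimal click program (not this package's actual code) that behaves the same way; everything after -- is treated as a positional argument rather than an option:

# demo.py: minimal click CLI showing `--` handling (invented example)
import click

@click.command()
@click.argument("username")
def cli(username: str) -> None:
    click.echo(f"fetching /u/{username}")

if __name__ == "__main__":
    cli()  # `python demo.py -asdf` errors; `python demo.py -- -asdf` works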

Why do some of my posts say [removed] even though I can see them on the web?

If a post is removed, only the mods and the user who posted it can see its text. Since this tool currently runs without any authentication, those removed posts can't be fetched via the API.

To load data about your own removed posts, use the GDPR archive import feature.

Why is the database missing data returned by the Reddit API?

While most Dogsheep projects store the raw JSON output of their source APIs, Reddit's API returns a lot of junk, so I opted for a slimmed-down approach.

If there's a field missing that you think would be useful, feel free to open an issue!

Does this tool refetch old data?

When running the user command, yes. It fetches up to 1k each of comments and posts and updates the local copies.

When running the archive command, no. To cut down on API requests, it only fetches data about comments/posts that aren't yet in the database (since the archive may include many items).

Both of these may change in the future to be more in line with Reddit's per-subreddit archiving guidelines.
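To illustrate the difference between the two commands' write behavior, here's a hedged sketch using the sqlite-utils library; the function names are invented and this isn't necessarily the package's exact logic:

# Sketch of the two write strategies described above (not actual package code).
from sqlite_utils import Database

db = Database("reddit.db")

def user_style(rows: list[dict]) -> None:
    # `user` behavior: upsert everything, refreshing local copies in place.
    db["comments"].upsert_all(rows, pk="id")

def archive_style(rows: list[dict]) -> None:
    # `archive` behavior: only write rows we haven't stored yet, which
    # also means we can skip API requests for known items.
    existing = (
        {row["id"] for row in db["comments"].rows}
        if db["comments"].exists()
        else set()
    )
    db["comments"].insert_all(r for r in rows if r["id"] not in existing)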

Development

This section is for people making changes to this package.

When in a virtual environment, run the following:

pip install -e '.[test]'

This installs the package in editable (-e) mode and makes its dependencies available. You can now run reddit-user-to-sqlite to invoke the CLI.

Running Tests

In your virtual environment, a simple pytest should run the unit test suite. You can also run pyright for type checking.

Releasing New Versions

These notes are mostly for myself (and other contributors).

  1. Run just release while your venv is active
  2. Paste the stored API key (if you're getting an invalid password error, verify that ~/.pypirc is empty)
