Unofficial, community-made tool for downloading the Stack Exchange data dumps

Project description

SE Data Dump Downloader

For more comprehensive information, please read the main README on GitHub. This README is an abridged version of the main README, aimed specifically at PyPI users.

For usage problems not listed in this readme, see the main README. If no information exists, please open an issue on GitHub - keeping the tool accessible to everyone is a priority.


The SE Data Dump Downloader (abbreviated sedd) is a command-line, Selenium-based utility for downloading the entire Stack Exchange data dump in their new anti-community format, since they decided not to bother providing an official "download all" button. It's one of two components in the project that operate on the data dump, the other being the (non-Python-based) SE data dump transformer, which converts the data dump from the not-so-useful official .xml format to other formats. The PyPI package contains only the downloader and does not ship with a copy of the transformer. See the main README if you're looking for the transformer.

You can install the PyPI version with:

pip3 install sedd

Note that there are some additional steps, detailed in this README, before you can start using it.

Configuration

sedd requires a special config.json file in the current working directory. There's a template available on GitHub.

The only two fields you need to fill out in the template are the email and password fields, with credentials for a Stack Exchange account. You need to be logged in to download the data dumps, so the downloader needs the credentials to log in on your behalf. It doesn't matter if you're logged into SE elsewhere: Selenium creates a blank profile every time it starts, so no SE cookies carry over and a fresh login is always required.

[!tip]

The downloader can automatically create new accounts in the network for you, if you don't have all 180-whatever accounts on every site in the network already. You can also create these by hand if you prefer for some reason, but you are not required to have all 180+ accounts before using the downloader.
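As a quick sanity check before the first run, the configuration requirements above can be sketched in a few lines of Python. This is only an illustration and not part of sedd: the helper name check_config is hypothetical, and it assumes that email and password are top-level string keys in config.json. Check the template on GitHub for the actual layout.

```python
import json
from pathlib import Path


def check_config(path="config.json"):
    """Return a list of problems found in the config file (empty if none)."""
    config_path = Path(path)
    if not config_path.is_file():
        return [f"{config_path} was not found"]

    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{config_path} is not valid JSON: {exc}"]

    if not isinstance(config, dict):
        return [f"{config_path} does not contain a JSON object"]

    problems = []
    # Assumption: both fields are top-level string keys in the template.
    for field in ("email", "password"):
        value = config.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f'"{field}" is missing or empty')
    return problems
```

Calling check_config() from the directory you plan to run sedd in returns an empty list when both fields are filled in.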

System requirements and pitfalls

sedd is exclusively Firefox-based, due to Chromium completely gutting support for uBlock Origin and custom filters. You need Firefox installed on your system to use sedd.

[!note] On Linux and Windows-based systems, geckodriver is slightly modified. This is an anti-anti-bot measure meant to prevent Cloudflare loops. If you're on macOS and get sent in a captcha loop, it's recommended you switch to Windows or Linux - a Linux VM is also an option if you have no way out of Apple's closed-down ecosystem.

Note that Ubuntu users, or other people who (for whatever reason) choose to use the Snap version of Firefox, have to jump through some extra hoops. The native version of Firefox is strongly encouraged, but if you run into problems with the snap version of Firefox and can't or won't switch, you need to set the SE_GECKODRIVER environment variable with export SE_GECKODRIVER=/snap/bin/geckodriver. Selenium can and will find the snap version of geckodriver on its own, but for reasons I simply don't understand, it will still fail with several arbitrary errors unless the variable is set.
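If you launch sedd from a wrapper script rather than an interactive shell, the same workaround can be applied in Python. The sketch below is an assumption-laden illustration, not something sedd ships: the helper name env_with_snap_geckodriver is hypothetical, and it only sets SE_GECKODRIVER when the variable is unset and the snap driver path actually exists.

```python
import os
from pathlib import Path

# Path taken from the export command above.
SNAP_GECKODRIVER = "/snap/bin/geckodriver"


def env_with_snap_geckodriver(base_env=None, driver_path=SNAP_GECKODRIVER):
    """Return a copy of the environment with SE_GECKODRIVER set if applicable."""
    env = dict(os.environ if base_env is None else base_env)
    # Respect an existing value, and skip systems without the snap driver.
    if "SE_GECKODRIVER" not in env and Path(driver_path).exists():
        env["SE_GECKODRIVER"] = driver_path
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the downloader.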

Cloudflare issues or download issues

Stack Exchange has configured Cloudflare to be highly aggressive, especially toward certain countries. You will almost certainly run into captchas, and the downloader is designed to deal with this. After the downloader makes an initial attempt to solve the captcha on its own, you'll be notified (provided you don't disable the notification provider in config.json) and asked to solve it manually.

If, at this point, it appears to succeed, but you're redirected back to a full-screen Cloudflare captcha wall, you've likely run into a Cloudflare loop. See the main README for further help. If this doesn't help, please open an issue.

If the downloads start fine, but later suddenly fail for no good reason, you're likely running into general download instability. This especially applies to stackoverflow.com.7z, since its massive size means the download runs long enough that it's more likely to flake out. See the main README for further help.

The "Warnings" section in the main README may, in the future, contain additional information about failure modes not listed here.

Using the downloader

With ./config.json in the current working directory and Firefox installed, you can now run the downloader with:

python3 -m sedd 

For command line flags, see python3 -m sedd --help, or the main README.
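The prerequisites and the launch command can also be wrapped in a small Python script. This is a minimal sketch under stated assumptions, not part of sedd: the function names are hypothetical, and the Firefox check assumes the browser is discoverable on PATH as firefox, which is often not the case on Windows or macOS even when Firefox is installed.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def preflight(workdir=".", which=shutil.which):
    """Return a list of missing prerequisites (empty if everything was found)."""
    problems = []
    if not (Path(workdir) / "config.json").is_file():
        problems.append("config.json is missing from the working directory")
    # Assumption: Firefox is on PATH under the name "firefox".
    if which("firefox") is None:
        problems.append("firefox was not found on PATH")
    return problems


def sedd_command(extra_args=()):
    """Build the module invocation using the current Python interpreter."""
    return [sys.executable, "-m", "sedd", *extra_args]


def run_sedd(workdir=".", extra_args=()):
    """Run the downloader from workdir and return its exit code."""
    problems = preflight(workdir)
    if problems:
        raise RuntimeError("; ".join(problems))
    return subprocess.run(sedd_command(extra_args), cwd=workdir).returncode
```

Using sys.executable instead of a hard-coded python3 keeps the call inside whichever interpreter or virtual environment sedd was installed into.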

[!tip]

The python3 -m sedd format is required. A binary form is not included, as I frankly don't understand how PyPI binaries work, and I don't care enough to check.

Download files

Source Distribution

sedd-2.1.0rc8.tar.gz (27.2 kB view details)

Uploaded Source

Built Distribution

sedd-2.1.0rc8-py3-none-any.whl (18.0 kB view details)

Uploaded Python 3

File details

Details for the file sedd-2.1.0rc8.tar.gz.

File metadata

  • Download URL: sedd-2.1.0rc8.tar.gz
  • Upload date:
  • Size: 27.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for sedd-2.1.0rc8.tar.gz
Algorithm Hash digest
SHA256 1501fd9c86b5de0ffd7818b38419785472f3ba300b2e5552d119d0bae20c0628
MD5 1a46f705380d68296f9bbe380f342152
BLAKE2b-256 4c237019e3fce0f2da0b5b020208b462f996aa2030ee2435b7fdc7cfed841867
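If you download the archive by hand, the SHA256 digest listed above can be compared against the local file. The following is a generic sketch using Python's standard hashlib module; the helper names are hypothetical, and the local file path is whatever you saved the archive as.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 equals the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("sedd-2.1.0rc8.tar.gz", "1501fd9c86b5de0ffd7818b38419785472f3ba300b2e5552d119d0bae20c0628") should return True for an intact copy of the source distribution.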

Provenance

The following attestation bundles were made for sedd-2.1.0rc8.tar.gz:

Publisher: uploader.yml on LunarWatcher/se-data-dump-transformer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sedd-2.1.0rc8-py3-none-any.whl.

File metadata

  • Download URL: sedd-2.1.0rc8-py3-none-any.whl
  • Upload date:
  • Size: 18.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for sedd-2.1.0rc8-py3-none-any.whl
Algorithm Hash digest
SHA256 7e324f2389d6b79e727fa438b2cb988ef97500b389178b85a2a7d01b8503ce53
MD5 3c980b79dec4f37afeb233aeae251d84
BLAKE2b-256 e1a0d66a396b2556db51f5bf0ee77e803d5834ab52a6d3a190d0bd7efa06760b

Provenance

The following attestation bundles were made for sedd-2.1.0rc8-py3-none-any.whl:

Publisher: uploader.yml on LunarWatcher/se-data-dump-transformer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
