
Unofficial, community-made tool for downloading the Stack Exchange data dumps

SE Data Dump Downloader

For more comprehensive information, please read the main README on GitHub. This README is an abridged version of the main README, aimed specifically at PyPI users.

For usage problems not listed in this README, see the main README. If neither covers your problem, please open an issue on GitHub - keeping the tool accessible to everyone is a priority.


The SE Data Dump Downloader (abbreviated sedd) is a command-line, Selenium-based utility for downloading the entire Stack Exchange data dump in Stack Exchange's new anti-community format, since they decided not to bother providing an official "download all" button. It's one of two components in this project that operate on the data dump, the other being the (non-Python-based) SE data dump transformer - a project that converts the data dump from the not-so-useful official .xml format to other formats. The PyPI package contains only the downloader, and does not ship with a copy of the transformer. See the main README if you're looking for the transformer.

You can install the PyPI version with:

pip3 install sedd

Note that some additional steps are required before you can start using it. They are detailed in the rest of this README.

Configuration

sedd requires a special config.json file in the current working directory. There's a template available on GitHub.

The only two fields you need to fill out in the template are the email and password fields, with credentials for a Stack Exchange account. You need to be logged in to download the data dumps, so the downloader needs the credentials to log in on your behalf. It doesn't matter if you're logged into SE elsewhere: Selenium automatically creates a blank profile every time it starts, so no SE cookies are carried over and a fresh login is always required.
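Before starting a long run, it can help to confirm that the credentials are actually filled in. The following is a minimal sketch, not part of sedd: the helper names `find_key` and `missing_credentials` are hypothetical, and it assumes config.json is plain JSON with fields literally named email and password somewhere in the structure. The real template's layout may differ, which is why the lookup searches nested objects rather than assuming a fixed location.

```python
import json
from pathlib import Path


def find_key(node, key):
    """Return the first value stored under `key` anywhere in nested JSON data, or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def missing_credentials(config_path="config.json"):
    """Return the credential field names that are absent or empty in the config file.

    Assumption: the fields are called "email" and "password", as described above.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [
        field for field in ("email", "password")
        if not find_key(config, field)
    ]
```

Calling `missing_credentials()` from the directory you intend to run sedd in returns an empty list when both fields have values, and the names of the offending fields otherwise.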

[!tip] The downloader can automatically create new accounts in the network for you, if you don't have all 180-whatever accounts on every site in the network already. You can also create these by hand if you prefer for some reason, but you are not required to have all 180+ accounts before using the downloader.

System requirements and pitfalls

sedd is exclusively Firefox-based, due to Chromium completely gutting support for uBlock Origin and custom filters. You need Firefox installed on your system to use sedd.

[!note] On Linux and Windows-based systems, geckodriver is slightly modified. This is an anti-anti-bot measure meant to prevent Cloudflare loops. If you're on macOS and get stuck in a captcha loop, it's recommended that you switch to Windows or Linux - a Linux VM is also an option if you have no way out of Apple's closed-down ecosystem.

Note that Ubuntu users, and anyone else who (for whatever reason) chooses to use the Snap version of Firefox, have to jump through some extra hoops. The native version of Firefox is strongly encouraged, but if you run into problems with the Snap version of Firefox and can't or won't switch, you need to set the SE_GECKODRIVER environment variable before running sedd, using export SE_GECKODRIVER=/snap/bin/geckodriver. Selenium can and will find the Snap version of geckodriver on its own, but for reasons I simply don't understand, it will still fail with several arbitrary errors unless the variable is set.
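If you launch sedd from a script rather than an interactive shell, the same variable can be set for the child process only. This is a rough sketch under stated assumptions: `sedd_environment` and `run_sedd` are hypothetical helper names, not part of sedd, and the Snap workaround is an explicit opt-in because reliably detecting a Snap install of Firefox is outside what this README describes.

```python
import os
import subprocess

# Path quoted in this README for the Snap build of geckodriver.
SNAP_GECKODRIVER = "/snap/bin/geckodriver"


def sedd_environment(use_snap_geckodriver, base_env=None):
    """Return a copy of the environment, adding SE_GECKODRIVER when requested.

    An SE_GECKODRIVER value that is already set is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    if use_snap_geckodriver:
        env.setdefault("SE_GECKODRIVER", SNAP_GECKODRIVER)
    return env


def run_sedd(use_snap_geckodriver=False):
    """Run sedd in the current working directory and return its exit code."""
    completed = subprocess.run(["sedd"], env=sedd_environment(use_snap_geckodriver))
    return completed.returncode
```

Calling `run_sedd(use_snap_geckodriver=True)` is then roughly equivalent to running the export line followed by sedd in a shell, without changing the environment of the calling process.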

Cloudflare issues or download issues

Stack Exchange has configured Cloudflare to be highly aggressive, especially toward certain countries. You will almost certainly run into captchas, and the downloader is designed to deal with this. If its initial attempt to solve the captcha on its own doesn't succeed, you'll be notified (provided you haven't disabled the notification provider in config.json) and asked to solve it manually.

If, at this point, it appears to succeed, but you're redirected back to a full-screen Cloudflare captcha wall, you've likely run into a Cloudflare loop. See the main README for further help. If this doesn't help, please open an issue.

If the downloads start fine but later fail suddenly for no apparent reason, you're likely running into general download instability. This especially applies to stackoverflow.com.7z: its massive size means the download takes long enough that it is more likely to flake out partway through. See the main README for further help.

In the future, the "Warnings" section in the main README may contain additional information about failure modes not listed here.

Using the downloader

With ./config.json in the current working directory and Firefox installed, you can now run the downloader with:

sedd

For command-line flags, see sedd --help, or the main README.


Download files


Source Distribution

sedd-2.4.0.tar.gz (31.4 kB view details)

Uploaded Source

Built Distribution


sedd-2.4.0-py3-none-any.whl (22.1 kB view details)

Uploaded Python 3

File details

Details for the file sedd-2.4.0.tar.gz.

File metadata

  • Download URL: sedd-2.4.0.tar.gz
  • Upload date:
  • Size: 31.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for sedd-2.4.0.tar.gz
Algorithm Hash digest
SHA256 c5a79cad60ec26187bd42b9a0ddab4bef3df9328425f33be1d698aec75bf0afd
MD5 3332c5d7f996ab6b300994cf10e52950
BLAKE2b-256 eb7b419862d144a2001444607935baab76eb50c4afeff19b1f98a28fa8f857b3
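If you download the source distribution by hand, its digest can be compared against the SHA256 value listed above. This is a minimal sketch using only the standard library. The helper names `sha256_of` and `matches_expected` are hypothetical, and the local file path is whatever you saved the archive as.

```python
import hashlib

# SHA256 digest listed above for sedd-2.4.0.tar.gz.
EXPECTED_SHA256 = "c5a79cad60ec26187bd42b9a0ddab4bef3df9328425f33be1d698aec75bf0afd"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("sedd-2.4.0.tar.gz")` should return True for an intact copy of the file described here. A False result suggests a corrupted or different download.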


Provenance

The following attestation bundles were made for sedd-2.4.0.tar.gz:

Publisher: uploader.yml on LunarWatcher/se-data-dump-transformer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sedd-2.4.0-py3-none-any.whl.

File metadata

  • Download URL: sedd-2.4.0-py3-none-any.whl
  • Upload date:
  • Size: 22.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for sedd-2.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 72d41a0e83db402cc63e05e94096e021209e2c6748cb2fcdf3e11fc990b8fe4b
MD5 d5f77ab572d9ad45cdc28d4d685fc27e
BLAKE2b-256 75022bdfaf3a23e9c56709ac67561fc34b892f4ea8db7702a0d3e6afa4dde699


Provenance

The following attestation bundles were made for sedd-2.4.0-py3-none-any.whl:

Publisher: uploader.yml on LunarWatcher/se-data-dump-transformer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
