Software Heritage Datastore Scrubber

Project description

Tools to periodically check data integrity in swh-storage and swh-objstorage, report errors, and (try to) fix them.

This is a work in progress; some of the components described below do not exist yet (Cassandra storage checker, objstorage checker, recovery, and reinjection).

The Scrubber package is made of the following parts:

Checking

Highly parallel processes continuously read objects from a data store, compute their checksums, and record any failure in a database, along with the data of the corrupt object.
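The checking step can be illustrated with a minimal sketch. It is not swh-scrubber's actual code: the `CorruptObject` record, `git_blob_sha1`, and `check_objects` are hypothetical names, and the Git-style blob SHA-1 is used here only as an example of a checksum that doubles as an object identifier.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class CorruptObject:
    """Hypothetical failure record; not swh-scrubber's actual schema."""

    object_id: str    # identifier the object was stored under
    computed_id: str  # checksum actually computed from the stored data
    data: bytes       # the corrupt data, kept for later recovery attempts


def git_blob_sha1(data: bytes) -> str:
    """Git-style blob SHA-1: hash of a 'blob <len>' header, a NUL byte, then the data."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def check_objects(objects: Iterable[Tuple[str, bytes]]) -> List[CorruptObject]:
    """Recompute each object's checksum and collect the ones that do not match."""
    failures = []
    for object_id, data in objects:
        computed = git_blob_sha1(data)
        if computed != object_id:
            failures.append(CorruptObject(object_id, computed, data))
    return failures
```

In a real deployment the returned records would presumably be written to the scrubber's database rather than kept in a list.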

There is one “checker” for each datastore package: storage (PostgreSQL and Cassandra), journal (Kafka), and objstorage.

The journal is “crawled” using its native streaming; the others are crawled by range, reusing swh-storage’s backfiller utilities, with progress checkpointed from time to time to the scrubber’s database (in the checked_range table).
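Range-based crawling with checkpoints can be sketched as follows. This is an illustration under stated assumptions, not the backfiller's API: `split_ranges` and `crawl` are hypothetical helpers, the identifier space is modeled as plain integers, and an in-memory dictionary stands in for the checked_range table.

```python
from typing import Callable, Dict, List, Tuple

Range = Tuple[int, int]  # half-open interval [start, end)


def split_ranges(total: int, nb_ranges: int) -> List[Range]:
    """Split [0, total) into nb_ranges contiguous ranges of near-equal size."""
    step, remainder = divmod(total, nb_ranges)
    ranges, start = [], 0
    for i in range(nb_ranges):
        end = start + step + (1 if i < remainder else 0)
        ranges.append((start, end))
        start = end
    return ranges


def crawl(
    ranges: List[Range],
    check_range: Callable[[Range], None],
    checked: Dict[Range, bool],
) -> int:
    """Check every range not yet checkpointed; return how many were processed."""
    processed = 0
    for r in ranges:
        if checked.get(r):
            continue  # already checkpointed by an earlier run
        check_range(r)
        checked[r] = True  # checkpoint (stand-in for the checked_range table)
        processed += 1
    return processed
```

Because each range is checkpointed only after it has been checked, an interrupted run can resume without rechecking completed ranges, and independent workers can each take a disjoint subset of ranges.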

Recovery

Periodically, jobs go through the list of known corrupt objects and try to recover the original objects by various means:

  • Brute-forcing variations until they match their checksum

  • Recovering from another data store

  • As a last resort, recovering from known origins, if any
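As an illustration of the first strategy, the sketch below tries every single-bit flip of a corrupt byte string until one matches the expected checksum. The source does not say which variations are tried, so single-bit flips are an assumption here, and `recover_by_bit_flip` is a hypothetical name. Plain SHA-1 over the raw bytes is used for simplicity.

```python
import hashlib
from typing import Optional


def recover_by_bit_flip(corrupt: bytes, expected_sha1: str) -> Optional[bytes]:
    """Return the single-bit-flip variation matching expected_sha1, or None."""
    candidate = bytearray(corrupt)
    for byte_index in range(len(candidate)):
        for bit in range(8):
            candidate[byte_index] ^= 1 << bit  # flip one bit
            if hashlib.sha1(candidate).hexdigest() == expected_sha1:
                return bytes(candidate)
            candidate[byte_index] ^= 1 << bit  # undo the flip
    return None
```

The search costs one hash per bit of the object, so it is only practical for small objects or when the damage is known to be limited; when it returns `None`, the other strategies listed above remain.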

Reinjection

Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.
