Software Heritage Datastore Scrubber

Tools to periodically check data integrity in swh-storage and swh-objstorage, report errors, and (try to) fix them.

This is a work in progress; some of the components described below do not exist yet (Cassandra storage checker, objstorage checker, recovery, and reinjection).

The Scrubber package is made of the following parts:

Checking

Highly parallel processes continuously read objects from a data store, compute their checksums, and record any failure in a database, along with the data of the corrupt object.
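The checking step can be sketched as follows. This is a minimal illustration, not the scrubber's actual code: the `CorruptObject` record, the `compute_id` function, and the use of a plain SHA-1 over the raw bytes are all assumptions made for the example (the real identifiers are computed by Software Heritage's own model code).

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class CorruptObject:
    # Hypothetical record; the real scrubber database schema may differ.
    object_id: str
    data: bytes


def compute_id(data: bytes) -> str:
    """Illustrative checksum: plain SHA-1 of the raw bytes (an assumption)."""
    return hashlib.sha1(data).hexdigest()


def check_objects(objects: Iterable[Tuple[str, bytes]]) -> List[CorruptObject]:
    """Recompute each object's checksum and collect the mismatches."""
    failures = []
    for object_id, data in objects:
        if compute_id(data) != object_id:
            # Keep the corrupt data alongside the id so it can be examined later.
            failures.append(CorruptObject(object_id, data))
    return failures
```

In this sketch the failures are returned as a list; a real checker would write them to the scrubber's database instead.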

There is one “checker” for each datastore package: storage (PostgreSQL and Cassandra), journal (Kafka), and objstorage.

The journal is “crawled” using its native streaming; the other datastores are crawled by range, reusing swh-storage’s backfiller utilities, and progress is checkpointed from time to time to the scrubber’s database (in the checked_range table).
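The range-based crawl with checkpointing can be pictured roughly as below. This is a sketch under stated assumptions: `partition_bounds` and `crawl` are hypothetical names, the 160-bit identifier space is assumed (the width of a SHA-1), and an in-memory set stands in for the checked_range table. The real scrubber relies on swh-storage's backfiller utilities for the actual range computation.

```python
from typing import Callable, List, Set, Tuple


def partition_bounds(
    partition_id: int, nb_partitions: int, id_bits: int = 160
) -> Tuple[int, int]:
    """Return the [start, end) bounds of one partition of the id space."""
    space = 1 << id_bits
    start = partition_id * space // nb_partitions
    end = (partition_id + 1) * space // nb_partitions
    return start, end


def crawl(
    nb_partitions: int,
    check_range: Callable[[int, int], None],
    checkpoints: Set[int],
) -> List[int]:
    """Check every partition that is not yet checkpointed.

    `checkpoints` is a stand-in for the checked_range table: a partition
    is added to it only after its range has been fully checked.
    """
    done = []
    for partition_id in range(nb_partitions):
        if partition_id in checkpoints:
            continue  # already checked by a previous run or another worker
        check_range(*partition_bounds(partition_id, nb_partitions))
        checkpoints.add(partition_id)
        done.append(partition_id)
    return done
```

Because a partition is only recorded after it has been checked, a worker that is interrupted and restarted re-checks at most the partition it was working on.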

Storage

For the storage checker, a checking configuration must be created before any checkers can be spawned.

A new configuration is created using the swh scrubber check init tool:

$ swh scrubber check init --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql

One or more checking workers can then be spawned using the swh scrubber check storage command:

$ swh scrubber check storage chk-snp
[...]

Recovery

Then, from time to time, jobs go through the list of known corrupt objects and try to recover the original objects through various means:

  • Brute-forcing variations until they match their checksum

  • Recovering from another data store

  • As a last resort, recovering from known origins, if any
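As an illustration of the first approach, the sketch below tries every single-bit flip of a corrupt object until one matches the expected checksum. The README does not say which variations are attempted, so the single-bit-flip strategy, the `recover_single_bit_flip` name, and the plain SHA-1 checksum are all assumptions made for the example.

```python
import hashlib
from typing import Optional


def recover_single_bit_flip(corrupt: bytes, expected_sha1: str) -> Optional[bytes]:
    """Try every single-bit flip; return the variation matching the checksum."""
    candidate = bytearray(corrupt)
    for byte_index in range(len(candidate)):
        for bit in range(8):
            candidate[byte_index] ^= 1 << bit  # flip one bit
            if hashlib.sha1(candidate).hexdigest() == expected_sha1:
                return bytes(candidate)
            candidate[byte_index] ^= 1 << bit  # undo the flip
    return None  # no single-bit variation matches
```

This search costs one hash computation per bit of the object, so it is only practical for small objects or narrow classes of corruption; anything it cannot repair would fall through to the other recovery means listed above.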

Reinjection

Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.
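Since reinjection is listed above as not implemented yet, the following is only a sketch of what such a step could look like. The `reinject` function and the dictionary standing in for the data store are hypothetical, and the plain SHA-1 checksum is again an assumption.

```python
import hashlib
from typing import Dict


def reinject(store: Dict[str, bytes], object_id: str, recovered: bytes) -> bool:
    """Replace the stored object only if the recovered data matches its id."""
    if hashlib.sha1(recovered).hexdigest() != object_id:
        return False  # refuse to overwrite with data that is still wrong
    store[object_id] = recovered
    return True
```

Verifying the checksum once more before writing keeps a faulty recovery job from replacing one corrupt object with another.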
