Software Heritage datastore scrubber

Tools to periodically check data integrity in swh-storage, swh-objstorage and swh-journal, report errors, and (try to) fix them.

The Scrubber package is made of the following parts:

Checking

Highly parallel processes continuously read objects from a data store, compute their checksums, and write any failure to a database, along with the data of the corrupt object.

There is one “checker” for each datastore package: storage (PostgreSQL and Cassandra), journal (Kafka), and object storage (any backend).
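As a minimal sketch of the check described above, the following Python snippet recomputes a checksum and records a failure together with the corrupt bytes. The CorruptionLog class, the check_object function, and the use of a plain SHA-1 over the raw bytes are assumptions made for illustration; they are not the scrubber's actual API or hashing scheme.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CorruptionLog:
    """Hypothetical in-memory stand-in for the scrubber's database."""

    entries: List[Tuple[str, bytes]] = field(default_factory=list)

    def record(self, object_id: str, data: bytes) -> None:
        # Keep the corrupt bytes so that a later recovery job can use them.
        self.entries.append((object_id, data))


def check_object(object_id: str, data: bytes, log: CorruptionLog) -> bool:
    """Return True if the SHA-1 of `data` equals `object_id` (hex digest).

    Assumption: ids are plain SHA-1 digests of the raw bytes.
    """
    ok = hashlib.sha1(data).hexdigest() == object_id
    if not ok:
        log.record(object_id, data)
    return ok
```

Storing the corrupt data next to the expected id is what makes the later recovery step possible without re-reading the data store.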

The journal is “crawled” using its native streaming; others are crawled by range, reusing swh-storage’s backfiller utilities, and checkpointed from time to time to the scrubber’s database (in the checked_range table).
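Crawling by range with periodic checkpoints can be pictured roughly as follows. This is a sketch under stated assumptions: the 160-bit id space, the partition_bounds helper, and the Checkpoints class are hypothetical and do not reproduce swh-storage's backfiller utilities or the real checked_range schema.

```python
from typing import Optional, Set, Tuple


def partition_bounds(
    partition_id: int, nb_partitions: int, id_bits: int = 160
) -> Tuple[int, int]:
    """Return the [start, end) range of integer ids covered by a partition.

    Assumption: ids are uniformly distributed `id_bits`-bit integers.
    """
    if not 0 <= partition_id < nb_partitions:
        raise ValueError("partition_id out of range")
    space = 1 << id_bits
    start = partition_id * space // nb_partitions
    end = (partition_id + 1) * space // nb_partitions
    return start, end


class Checkpoints:
    """Hypothetical in-memory stand-in for the checked_range table."""

    def __init__(self) -> None:
        self.done: Set[int] = set()

    def next_partition(self, nb_partitions: int) -> Optional[int]:
        # Return the first partition not yet checked, or None when finished.
        for partition_id in range(nb_partitions):
            if partition_id not in self.done:
                return partition_id
        return None

    def mark_done(self, partition_id: int) -> None:
        self.done.add(partition_id)
```

Because progress is recorded per partition, a worker that is restarted can resume from the first unchecked partition instead of starting over.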

Storage

For the storage checker, a checking configuration must be created before any checkers can be spawned.

A new configuration is created using the swh scrubber check init tool:

$ swh scrubber check init storage --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql

One or more checking workers can then be spawned using the swh scrubber check run command:

$ swh scrubber check run chk-snp
[...]

Object storage

As with the storage checker, a checking configuration must be created before any checkers can be spawned.

A new configuration is created using the swh scrubber check init tool:

$ swh scrubber check init objstorage --object-type content --nb-partitions 65536 --name check-contents
Created configuration check-contents [3] for checking content in datastore objstorage remote

By default, an object storage checker detects both missing and corrupted contents. To disable detection of missing contents, pass the --no-check-references option to the swh scrubber check init command. To disable detection of corrupted contents, pass the --no-check-hashes option to the same command.
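The difference between the two kinds of failure can be sketched as follows. The check_contents function, the dict used as an object storage, and the plain SHA-1 comparison are assumptions for illustration only; the two keyword arguments merely mirror the intent of the --no-check-references and --no-check-hashes options and are not the scrubber's internal names.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def check_contents(
    expected_ids: Iterable[str],
    objstorage: Dict[str, bytes],
    check_references: bool = True,
    check_hashes: bool = True,
) -> Tuple[List[str], List[str]]:
    """Return (missing, corrupt) content ids.

    Assumption: a content id is the SHA-1 hex digest of the stored bytes.
    """
    missing: List[str] = []
    corrupt: List[str] = []
    for content_id in expected_ids:
        data = objstorage.get(content_id)
        if data is None:
            # Referenced elsewhere but absent from the object storage.
            if check_references:
                missing.append(content_id)
        elif check_hashes and hashlib.sha1(data).hexdigest() != content_id:
            # Present, but the bytes no longer match the id.
            corrupt.append(content_id)
    return missing, corrupt
```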

One or more checking workers can then be spawned using the swh scrubber check run command:

  • if the content ids must be read from a storage instance:

$ swh scrubber check run check-contents
[...]

  • if the content ids must be read from a Kafka content topic of swh-journal:

$ swh scrubber check run check-contents --use-journal
[...]

Journal

As with the other checkers, a checking configuration must be created before any checkers can be spawned.

A new configuration is created using the swh scrubber check init tool:

$ swh scrubber check init journal --object-type directory --name check-dirs-journal
Created configuration check-dirs-journal [4] for checking directory in datastore journal kafka

One or more checking workers can then be spawned using the swh scrubber check run command:

$ swh scrubber check run check-dirs-journal
[...]

Recovery

Then, from time to time, jobs go through the list of known corrupt objects and try to recover the original objects through various means:

  • Brute-forcing variations until they match their checksum

  • Recovering from another data store

  • As a last resort, recovering from known origins, if any
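The first strategy in the list above can be illustrated with a small sketch. The text does not say which variations are tried, so the choice of single-bit flips, the recover_by_bit_flip name, and the plain SHA-1 comparison are all assumptions made for the sake of the example.

```python
import hashlib
from typing import Optional


def recover_by_bit_flip(corrupt: bytes, expected_sha1: str) -> Optional[bytes]:
    """Try every single-bit flip of `corrupt` until the SHA-1 matches.

    Assumption: the corruption is exactly one flipped bit. Returns the
    recovered bytes, or None if no single-bit variation matches.
    """
    candidate = bytearray(corrupt)
    for index in range(len(candidate)):
        for bit in range(8):
            candidate[index] ^= 1 << bit  # flip one bit
            if hashlib.sha1(candidate).hexdigest() == expected_sha1:
                return bytes(candidate)
            candidate[index] ^= 1 << bit  # undo the flip before moving on
    return None
```

The search costs one hash per bit of the object, so it is only practical for small objects or narrow families of variations; larger damage has to fall back on the other strategies in the list.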

Reinjection

Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.
