swh.scrubber

Software Heritage datastore scrubber

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Tools to periodically check data integrity in swh-storage, swh-objstorage and swh-journal, report errors, and (try to) fix them.

The Scrubber package is made of the following parts:

Checking

Highly parallel processes continuously read objects from a data store, compute their checksums, and write any failure to a database, along with the data of the corrupt object.
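
A minimal sketch of this check-and-record loop, assuming objects are identified by the SHA-1 of their bytes and that failures go to a SQLite table. The `corrupt_object` schema and the `check_objects` helper are hypothetical and only illustrate the idea; they are not the scrubber's actual schema or API.

```python
import hashlib
import sqlite3


def check_objects(objects, db):
    """Record every object whose SHA-1 does not match its expected id.

    `objects` is an iterable of (expected_hex_id, raw_bytes) pairs.
    Returns the number of corrupt objects found.
    """
    # Hypothetical table, created on first use.
    db.execute(
        "CREATE TABLE IF NOT EXISTS corrupt_object (id TEXT PRIMARY KEY, data BLOB)"
    )
    failures = 0
    for expected_id, data in objects:
        if hashlib.sha1(data).hexdigest() != expected_id:
            # Keep the corrupt bytes next to the id so recovery can use them later.
            db.execute(
                "INSERT OR REPLACE INTO corrupt_object VALUES (?, ?)",
                (expected_id, data),
            )
            failures += 1
    db.commit()
    return failures


db = sqlite3.connect(":memory:")
good = b"hello"
objects = [
    (hashlib.sha1(good).hexdigest(), good),          # intact object
    (hashlib.sha1(b"world").hexdigest(), b"w0rld"),  # corrupted payload
]
n_corrupt = check_objects(objects, db)
```

Storing the corrupt payload with its expected id is what lets a later recovery job work from the scrubber database alone.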

There is one “checker” for each datastore package: storage (postgresql and cassandra), journal (kafka), and object storage (any backend).

The journal is “crawled” using its native streaming; others are crawled by range, reusing swh-storage’s backfiller utilities, and checkpointed from time to time to the scrubber’s database (in the checked_range table).
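
Crawling by range with checkpoints can be pictured as follows. This is a sketch under stated assumptions: it treats object ids as 160-bit integers and splits that space evenly, and it uses an in-memory set as a stand-in for the checked_range table. The helpers `partition_bounds` and `next_unchecked` are hypothetical and do not reproduce the backfiller's actual range computation.

```python
ID_BITS = 160  # assumption: 20-byte (SHA-1 sized) object ids


def partition_bounds(partition, nb_partitions):
    """Return the [start, end) id range covered by one partition, as integers."""
    if not 0 <= partition < nb_partitions:
        raise ValueError("partition out of range")
    space = 1 << ID_BITS
    start = partition * space // nb_partitions
    end = (partition + 1) * space // nb_partitions
    return start, end


def next_unchecked(nb_partitions, checked):
    """Return the first partition not yet checkpointed, or None when all are done."""
    for partition in range(nb_partitions):
        if partition not in checked:
            return partition
    return None


checked = set()  # stand-in for the checked_range table
while (p := next_unchecked(4, checked)) is not None:
    start, end = partition_bounds(p, 4)
    # ... read and check every object whose id falls in [start, end) ...
    checked.add(p)  # checkpoint, so a restarted worker skips this range
```

Because progress is recorded per range, an interrupted worker can resume at the first unchecked partition rather than starting over.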

Storage

For the storage checker, a checking configuration must be created before any checker can be spawned.

A new configuration is created using the swh scrubber check init tool:

$ swh scrubber check init storage --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql

Note

A configuration file is expected, as for most swh tools. This file must have a scrubber section with the configuration of the scrubber database. For storage checking operations, this configuration file must also have a storage configuration section. See the swh-storage documentation for more details on this. A typical configuration file could look like:

scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

storage:
  cls: postgresql
  db: service=swh
  objstorage:
    cls: noop

One or more checking workers can then be spawned using the swh scrubber check run command:

$ swh scrubber check run chk-snp
[...]

Object storage

As with the storage checker, a checking configuration must be created before any checker can be spawned.

A new configuration is created using the swh scrubber check init tool:

$ swh scrubber check init objstorage --object-type content --nb-partitions 65536 --name check-contents
Created configuration check-contents [3] for checking content in datastore objstorage remote

Note

A configuration file is expected, as for most swh tools. This file must have a scrubber section with the configuration of the scrubber database. For object storage checking operations, this configuration file must have:

- a storage configuration section if content ids are read from it (default)
- a journal configuration section if content ids are read from a kafka content topic (this requires the --use-journal flag of the swh scrubber check run command)
- an objstorage configuration section targeting the object storage to check

See the swh-storage documentation, swh-objstorage documentation and swh-journal documentation for more details on this. A typical configuration file could look like:

scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

storage:
  cls: postgresql
  db: service=swh
  objstorage:
    cls: noop

journal:
   cls: kafka
   brokers:
      - broker1.journal.softwareheritage.org:9093
      - broker2.journal.softwareheritage.org:9093
      - broker3.journal.softwareheritage.org:9093
      - broker4.journal.softwareheritage.org:9093
   group_id: swh.scrubber
   prefix: swh.journal.objects
   on_eof: stop

objstorage:
  cls: remote
  url: https://objstorage.softwareheritage.org/

By default, an object storage checker detects both missing and corrupted contents. To disable detection of missing contents, use the --no-check-references option of the swh scrubber check init command. To disable detection of corrupted contents, use its --no-check-hashes option.
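
The two detection modes can be sketched as below, assuming a content id is the hex SHA-1 of the content and using a plain dict as a stand-in for the object storage. The `check_content` function and its parameters are hypothetical names chosen to mirror the two options; this is not the scrubber's implementation.

```python
import hashlib


def check_content(content_id, objstorage, check_references=True, check_hashes=True):
    """Return "missing", "corrupt", or None for one content id.

    `objstorage` is any mapping from content id to bytes (a dict here).
    """
    data = objstorage.get(content_id)
    if data is None:
        # The id is known but the object storage has no such object.
        return "missing" if check_references else None
    if check_hashes and hashlib.sha1(data).hexdigest() != content_id:
        # The object exists but its bytes no longer match the id.
        return "corrupt"
    return None


ok_id = hashlib.sha1(b"ok").hexdigest()
bad_id = hashlib.sha1(b"original").hexdigest()
gone_id = hashlib.sha1(b"gone").hexdigest()
store = {ok_id: b"ok", bad_id: b"0riginal"}

results = {cid: check_content(cid, store) for cid in (ok_id, bad_id, gone_id)}
```

Disabling one mode simply silences that class of finding; the other check still runs.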

One or more checking workers can then be spawned using the swh scrubber check run command.

If the content ids must be read from a storage instance:

$ swh scrubber check run check-contents
[...]

If the content ids must be read from a kafka content topic of swh-journal:

$ swh scrubber check run check-contents --use-journal
[...]
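
Since several workers can run against the same checking configuration, launching them can be scripted. The sketch below only assembles the command line shown above and starts it several times with `subprocess`; the helper names `worker_command` and `spawn_workers` are hypothetical, and the configuration file is assumed to be supplied the same way as for a manual invocation.

```python
import subprocess


def worker_command(config_name, use_journal=False):
    """Build the argv for one checker worker, using only the flags shown above."""
    cmd = ["swh", "scrubber", "check", "run", config_name]
    if use_journal:
        cmd.append("--use-journal")
    return cmd


def spawn_workers(config_name, nb_workers, use_journal=False):
    """Start nb_workers checker processes and return their handles."""
    cmd = worker_command(config_name, use_journal)
    return [subprocess.Popen(cmd) for _ in range(nb_workers)]


# Example (requires the swh CLI to be installed, so it is left commented out):
# workers = spawn_workers("check-contents", 4, use_journal=True)
# for worker in workers:
#     worker.wait()
```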

Journal

As with the other checkers, a checking configuration must be created before any checker can be spawned.

A new configuration is created using the swh scrubber check init tool:

$ swh scrubber check init journal --object-type directory --name check-dirs-journal
Created configuration check-dirs-journal [4] for checking directory in datastore journal kafka

Note

A configuration file is expected, as for most swh tools. This file must have a scrubber section with the configuration of the scrubber database. For journal checking operations, this configuration file must also have a journal configuration section.

See the swh-journal documentation for more details on this. A typical configuration file could look like:

scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

journal:
   cls: kafka
   brokers:
      - broker1.journal.softwareheritage.org:9093
      - broker2.journal.softwareheritage.org:9093
      - broker3.journal.softwareheritage.org:9093
      - broker4.journal.softwareheritage.org:9093
   group_id: swh.scrubber
   prefix: swh.journal.objects
   on_eof: stop
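
The section requirements stated in the notes for the three checkers can be summarized in a small validation helper. This is a sketch that works on an already-parsed configuration (a plain dict); `missing_sections` is a hypothetical helper, not part of the swh tooling.

```python
def missing_sections(config, datastore, use_journal=False):
    """Return the sorted names of required top-level sections absent from `config`.

    `datastore` is one of "storage", "objstorage" or "journal".
    """
    required = {"scrubber"}  # always needed: the scrubber database
    if datastore == "storage":
        required.add("storage")
    elif datastore == "journal":
        required.add("journal")
    elif datastore == "objstorage":
        required.add("objstorage")
        # Content ids come from the journal or, by default, from the storage.
        required.add("journal" if use_journal else "storage")
    else:
        raise ValueError(f"unknown datastore: {datastore!r}")
    return sorted(required - set(config))


config = {
    "scrubber": {"cls": "postgresql"},
    "journal": {"cls": "kafka"},
}
```

With this configuration, a journal check has everything it needs, while an object storage check would still lack its objstorage section.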

One or more checking workers can then be spawned using the swh scrubber check run command:

$ swh scrubber check run check-dirs-journal
[...]

Recovery

Periodically, jobs go through the list of known corrupt objects and try to recover the original objects by various means:

- Brute-forcing variations until they match their checksum
- Recovering from another data store
- As a last resort, recovering from known origins, if any
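
To illustrate the brute-force approach listed first, the sketch below tries every single-bit flip of the corrupt bytes until one matches the expected SHA-1. Treating "variations" as single-bit flips is an assumption made for the example; the kinds of variations the scrubber actually tries are not described here, and `single_bit_flips` and `recover` are hypothetical helpers.

```python
import hashlib


def single_bit_flips(data):
    """Yield every variant of `data` that differs from it by exactly one bit."""
    for i in range(len(data)):
        for bit in range(8):
            variant = bytearray(data)
            variant[i] ^= 1 << bit
            yield bytes(variant)


def recover(corrupt, expected_sha1_hex):
    """Return the first variant whose SHA-1 matches the expected id, or None."""
    for candidate in single_bit_flips(corrupt):
        if hashlib.sha1(candidate).hexdigest() == expected_sha1_hex:
            return candidate
    return None


original = b"tree 1234\n"
expected = hashlib.sha1(original).hexdigest()
corrupt = bytes([original[0] ^ 0x04]) + original[1:]  # simulate one flipped bit
recovered = recover(corrupt, expected)
```

The search costs one hash per candidate (8 per byte here), which is why this kind of recovery is only practical for small, well-defined families of variations.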

Reinjection

Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.

Release history

- 4.0.0 (this version): Mar 21, 2025
- 3.1.1: Feb 19, 2025
- 3.1.0: Nov 5, 2024
- 3.0.0: Apr 11, 2024
- 2.3.0: Feb 2, 2024
- 2.2.0: Dec 5, 2023
- 2.1.0: Oct 16, 2023
- 2.0.3: Aug 24, 2023
- 2.0.2: Jul 26, 2023
- 2.0.1: Jul 26, 2023
- 2.0.0: Jul 12, 2023
- 1.0.3: Apr 18, 2023
- 1.0.2: Mar 28, 2023
- 1.0.1: Mar 22, 2023
- 1.0.0: Mar 22, 2023
- 0.1.2: Dec 20, 2022
- 0.1.1: Oct 17, 2022
- 0.1.0: Aug 18, 2022
- 0.0.6: May 31, 2022
- 0.0.5: May 30, 2022
- 0.0.4: May 30, 2022
- 0.0.3: May 30, 2022
- 0.0.2: May 30, 2022
- 0.0.1: Mar 31, 2022

Download files

Source Distribution

swh_scrubber-4.0.0.tar.gz (70.1 kB)

Uploaded Mar 21, 2025 Source

Built Distribution

swh_scrubber-4.0.0-py3-none-any.whl (80.2 kB)

Uploaded Mar 21, 2025 Python 3

File details

Details for the file swh_scrubber-4.0.0.tar.gz.

File metadata

Download URL: swh_scrubber-4.0.0.tar.gz
Upload date: Mar 21, 2025
Size: 70.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for swh_scrubber-4.0.0.tar.gz
Algorithm	Hash digest
SHA256	`8726d5c2750a3d26495a4670f3fa0a1783f92c2925cbfcce3bf7a9ebc7bcdfb8`
MD5	`410b8d0adf3aa584a0cbac00496dd602`
BLAKE2b-256	`f6f1264eef7ad43cf5dd17bcdf2aa7510f69423437eea7ff77f8569190610a17`

File details

Details for the file swh_scrubber-4.0.0-py3-none-any.whl.

File metadata

Download URL: swh_scrubber-4.0.0-py3-none-any.whl
Upload date: Mar 21, 2025
Size: 80.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for swh_scrubber-4.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c09455be99eb371da336249ae2d616a7f81bc0769a3cbba6d619157ed5218f5`
MD5	`dc0962c25ab869876f95fb953e418a9f`
BLAKE2b-256	`b3f333a3f208734f53279803e737e176447db127186c46f38af3983820863a9a`
