Software Heritage storage manager

Abstraction layer over the archive, allowing access to all stored source code artifacts as well as their metadata.

Quick start

Dependencies

The Python tests for this module include tests that cannot run without a local PostgreSQL database, so you need the PostgreSQL server executable on your machine (a running PostgreSQL server is not required). They also expect a Cassandra server.

Debian-like host

$ sudo apt install libpq-dev postgresql-11 cassandra

Non Debian-like host

The tests expect the path to the Cassandra executable either to be left unspecified, in which case it is looked up at /usr/sbin/cassandra, or to be specified through the environment variable SWH_CASSANDRA_BIN.
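The lookup rule for the Cassandra executable can be sketched in Python as follows. This is a minimal illustration, not the test suite's actual code: the helper name cassandra_bin_path is hypothetical, and only the variable name SWH_CASSANDRA_BIN and the default path /usr/sbin/cassandra come from this README.

```python
import os

# Default location named in this README.
DEFAULT_CASSANDRA_BIN = "/usr/sbin/cassandra"


def cassandra_bin_path(environ=None):
    """Return the path to the Cassandra executable (hypothetical helper).

    Uses SWH_CASSANDRA_BIN when it is set, otherwise the default path.
    """
    env = os.environ if environ is None else environ
    return env.get("SWH_CASSANDRA_BIN", DEFAULT_CASSANDRA_BIN)


if __name__ == "__main__":
    print(cassandra_bin_path())
```

Passing a plain dict instead of os.environ makes it easy to check both branches without touching the real environment.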

Optionally, you can skip the Cassandra tests:

(swh) :~/swh-storage$ tox -- -m 'not cassandra'

Installation

It is strongly recommended to use a virtualenv. In the following, we assume you work in a virtualenv named swh. See the developer setup guide for more details on how to set up a working environment.

You can install the package directly from PyPI:

(swh) :~$ pip install swh.storage
[...]

Or from sources:

(swh) :~$ git clone https://forge.softwareheritage.org/source/swh-storage.git
[...]
(swh) :~$ cd swh-storage
(swh) :~/swh-storage$ pip install .
[...]

Then you can check it’s properly installed:

(swh) :~$ swh storage --help
Usage: swh storage [OPTIONS] COMMAND [ARGS]...

  Software Heritage Storage tools.

Options:
  -h, --help  Show this message and exit.

Commands:
  rpc-serve  Software Heritage Storage RPC server.

Tests

The best way of running Python tests for this module is to use tox.

(swh) :~$ pip install tox

tox

From the sources directory, simply use tox:

(swh) :~/swh-storage$ tox
[...]
========= 315 passed, 6 skipped, 15 warnings in 40.86 seconds ==========
_______________________________ summary ________________________________
  flake8: commands succeeded
  py3: commands succeeded
  congratulations :)

Note: it is possible to set the JAVA_HOME environment variable to specify the version of the JVM to be used by Cassandra. For example, at the time of writing this, Cassandra is meant to be run with Java 11. On Debian bookworm, one needs to manually install openjdk-11-jre-headless from bullseye or unstable and set the appropriate environment variable:

(swh) :~/swh-storage$ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
(swh) :~/swh-storage$ tox
[...]

Development

The storage server can be started locally. It requires a configuration file and a running PostgreSQL database.

Sample configuration

A typical storage.yml configuration file is:

storage:
  cls: postgresql
  db: "dbname=softwareheritage-dev user=<user> password=<pwd>"
  objstorage:
    cls: pathslicing
    root: /tmp/swh-storage/
    slicing: 0:2/2:4/4:6

This configuration uses:

  • a local storage instance whose database connection points to the local softwareheritage-dev database,

  • a local objstorage instance whose:

    • root path is /tmp/swh-storage,

    • slicing scheme is 0:2/2:4/4:6. This means that each content is stored on disk under three directory levels derived from its identifier (sha1): the first level is named after the first 2 hex characters, the second level after the next 2 hex characters, and the third level after the next 2 hex characters. The file holding the raw content sits inside the third level and is named after the complete hash. For example, 00062f8bd330715c4f819373653d97b3cd34394c will be stored at 00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c

Note that the root path should exist on disk before starting the server.
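To make the slicing scheme concrete, here is a small Python sketch that computes the on-disk path for a content identifier. It only illustrates the rule described in this section and is not the objstorage implementation; the function names parse_slicing and sliced_path are hypothetical.

```python
import posixpath


def parse_slicing(slicing):
    """Turn a scheme such as '0:2/2:4/4:6' into [(0, 2), (2, 4), (4, 6)]."""
    bounds = []
    for part in slicing.split("/"):
        start, end = part.split(":")
        bounds.append((int(start), int(end)))
    return bounds


def sliced_path(root, hex_id, slicing="0:2/2:4/4:6"):
    """Return the path of hex_id under root (illustrative only).

    Each slice of the identifier names one directory level; the file
    itself is named after the complete identifier.
    """
    levels = [hex_id[start:end] for start, end in parse_slicing(slicing)]
    return posixpath.join(root, *levels, hex_id)


if __name__ == "__main__":
    print(sliced_path("/tmp/swh-storage/",
                      "00062f8bd330715c4f819373653d97b3cd34394c"))
```

With the sample configuration, the identifier from the example resolves to /tmp/swh-storage/00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c.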

Starting the storage server

If the Python package has been properly installed (e.g. in a virtualenv), you should be able to use the command:

(swh) :~/swh-storage$ swh storage -C storage.yml rpc-serve

This runs a local swh-storage API on port 5002. You can check that it responds:

(swh) :~/swh-storage$ curl http://127.0.0.1:5002
<html>
<head><title>Software Heritage storage server</title></head>
<body>
<p>You have reached the
<a href="https://www.softwareheritage.org/">Software Heritage</a>
storage server.<br />
See its
<a href="https://docs.softwareheritage.org/devel/swh-storage/">documentation
and API</a> for more information</p>
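The same check can be scripted. The sketch below, using only the Python standard library, fetches the root page and looks for the title shown in the curl output. The function names and the idea of matching on the page title are assumptions made for illustration; swh-storage does not document this as a health-check protocol.

```python
import urllib.error
import urllib.request

# Title string taken from the curl output shown in this README.
BANNER = "Software Heritage storage server"


def looks_like_storage_banner(html):
    """Return True if the page content contains the expected title."""
    return BANNER in html


def storage_server_is_up(url="http://127.0.0.1:5002", timeout=5.0):
    """Fetch url and report whether it answers with the storage banner."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, and similar failures.
        return False
    return looks_like_storage_banner(body)


if __name__ == "__main__":
    print("storage server reachable:", storage_server_is_up())
```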

And then what?

In your upper layer (loader-git, loader-svn, etc.), you can define a remote storage with this snippet of YAML configuration:

storage:
  cls: remote
  url: http://localhost:5002/

Alternatively, a pipeline configuration can place buffer and filter steps in front of the remote storage:

storage:
  cls: pipeline
  steps:
    - cls: buffer
      min_batch_size:
        content: 10000
        content_bytes: 104857600
        directory: 1000
        revision: 1000
    - cls: filter
    - cls: remote
      url: http://localhost:5002/
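From Python code, the same configurations can be expressed as dictionaries and handed to the get_storage factory of the swh.storage package. The sketch below assumes that get_storage accepts the YAML keys as keyword arguments and that the four size thresholds are nested under min_batch_size; the helper names pipeline_config and open_storage are hypothetical.

```python
# Python mirror of the remote storage configuration.
REMOTE_CONFIG = {"cls": "remote", "url": "http://localhost:5002/"}


def pipeline_config(url="http://localhost:5002/"):
    """Python mirror of the pipeline configuration (values copied from it)."""
    return {
        "cls": "pipeline",
        "steps": [
            {
                "cls": "buffer",
                # Assumption: the thresholds are nested under min_batch_size.
                "min_batch_size": {
                    "content": 10000,
                    "content_bytes": 104857600,
                    "directory": 1000,
                    "revision": 1000,
                },
            },
            {"cls": "filter"},
            {"cls": "remote", "url": url},
        ],
    }


def open_storage(config):
    """Instantiate a storage from a configuration dictionary.

    The import is done lazily so that this sketch can be loaded even
    when swh.storage is not installed; calling it requires the package.
    """
    from swh.storage import get_storage

    return get_storage(**config)
```

For example, open_storage(REMOTE_CONFIG) would return a client for the RPC server started earlier, provided that server is running.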

Cassandra

As an alternative to PostgreSQL, swh-storage can use Cassandra as a database backend. It can be used like this:

storage:
  cls: cassandra
  hosts:
    - localhost
  keyspace: swh
  objstorage:
    cls: pathslicing
    root: /home/storage/swh-storage/
    slicing: 0:2/2:4/4:6

The Cassandra swh-storage implementation supports both Cassandra >= 4.0-alpha2 and ScyllaDB >= 4.4 (and possibly earlier versions, but this is untested).

While the main code supports both transparently, running tests or configuring the schema requires specific code when using ScyllaDB, enabled by setting the SWH_USE_SCYLLADB=1 environment variable.
