Software Heritage Tarball Loader
SWH Tarball Loader
The Software Heritage Tarball Loader is a tool and a library that uncompresses a local tarball and injects its tree representation into the SWH dataset.
Configuration
This is the loader's (or task's) configuration file.
{/etc/softwareheritage | ~/.config/swh | ~/.swh}/loader/tar.yml:
extraction_dir: /home/storage/tmp/
storage:
  cls: remote
  args:
    url: http://localhost:5002/
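The lookup across the three candidate directories can be sketched as follows. This is only an illustration: the helper names `candidate_paths` and `find_config` are hypothetical, and the search order (system-wide first, then per-user) is an assumption rather than the loader's documented behaviour.

```python
import os

# Directories named in the configuration section; the precedence order
# used here is an assumption.
CONFIG_DIRS = ['/etc/softwareheritage', '~/.config/swh', '~/.swh']


def candidate_paths(relative_path, dirs=CONFIG_DIRS):
    """Return the expanded candidate locations for a configuration file."""
    return [os.path.join(os.path.expanduser(d), relative_path) for d in dirs]


def find_config(relative_path, dirs=CONFIG_DIRS):
    """Return the first candidate that exists on disk, or None."""
    for path in candidate_paths(relative_path, dirs):
        if os.path.isfile(path):
            return path
    return None


if __name__ == '__main__':
    # Prints the path of the loader configuration, or None if absent.
    print(find_config('loader/tar.yml'))
```

Passing `dirs` explicitly makes it easy to try the lookup against a scratch directory without touching the real configuration.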
API
Load a tarball directly from code or from the python3 toplevel:
import logging

from swh.loader.tar.tasks import LoadTarRepository

logging.basicConfig(level=logging.DEBUG)

# Fill in these values
repo = 'loader-tar.tgz'
tarpath = '/home/storage/tar/%s' % repo
origin = {'url': 'ftp://%s' % repo, 'type': 'tar'}
visit_date = 'Wed, 3 May 2017 17:16:32 +0200'
revision = {
    'author': {'name': 'some', 'fullname': 'one', 'email': 'something'},
    'committer': {'name': 'some', 'fullname': 'one', 'email': 'something'},
    'message': '1.0 Released',
    'date': None,
    'committer_date': None,
    'type': 'tar',
}

# Load the tarball into the archive
loader = LoadTarRepository()
loader.run_task(tar_path=tarpath, origin=origin, visit_date=visit_date,
                revision=revision, branch_name='master')
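The `visit_date` in the example is written by hand in RFC 2822 style. As a minimal sketch, such a string can be produced from a timezone-aware `datetime` with the standard library alone. The helper name `format_visit_date` is hypothetical, and which other date formats the loader accepts is not covered here.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def format_visit_date(dt):
    """Format a timezone-aware datetime as an RFC 2822 date string."""
    if dt.tzinfo is None:
        raise ValueError('visit date must be timezone-aware')
    return format_datetime(dt)


# Example: a visit recorded in a UTC+2 timezone.
utc_plus_2 = timezone(timedelta(hours=2))
visit_date = format_visit_date(
    datetime(2017, 5, 3, 17, 16, 32, tzinfo=utc_plus_2))
print(visit_date)
```

Generating the string this way keeps the weekday and the UTC offset consistent with the actual date, which is easy to get wrong by hand.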
Celery
Load a tarball using celery.
Provided you have a properly configured celery instance up and running, update the celery worker configuration file:
{/etc/softwareheritage | ~/.config/swh | ~/.swh}/worker.yml:
task_modules:
  - swh.loader.tar.tasks
task_queues:
  - swh_loader_tar
See https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md for more details.
Tar Producer
Its job is to build, from a file or a folder, a list of existing tarballs and then to compute from this list the corresponding messages to send to the broker.
Configuration
Message producer's configuration file (tar.yml):
# Mirror's root directory holding tarballs to load into swh
mirror_root_directory: /srv/storage/space/mirrors/gnu.org/gnu/
# Url scheme prefix used to create the origin url
url_scheme: http://ftp.gnu.org/gnu/
type: ftp
# File listing a subset of the tarballs from mirror_root_directory to load.
# The file's format is one absolute path to a tarball per line.
# NOTE:
# - This file must contain data consistent with the mirror_root_directory
# - If this option is not provided, the mirror_root_directory is scanned
#   completely as usual
# mirror_subset_archives: /home/storage/missing-archives
# Randomize blocks of messages and send for consumption
block_messages: 250
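One plausible reading of `block_messages` is that messages are grouped into blocks of that size and each block is shuffled before being sent. The sketch below illustrates that reading only. The function name `randomized_blocks` is hypothetical and the producer's actual batching logic may differ.

```python
import random


def randomized_blocks(messages, block_size, rng=random):
    """Yield lists of at most block_size messages, each one shuffled."""
    if block_size < 1:
        raise ValueError('block_size must be a positive integer')
    block = []
    for message in messages:
        block.append(message)
        if len(block) == block_size:
            rng.shuffle(block)
            yield block
            block = []
    if block:  # last, possibly shorter, block
        rng.shuffle(block)
        yield block


if __name__ == '__main__':
    # Seven messages in blocks of three give blocks of 3, 3 and 1 messages.
    for block in randomized_blocks(range(7), block_size=3):
        print(block)
```

Shuffling within bounded blocks spreads the load across the mirror's directories while keeping memory use independent of the total number of tarballs.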
Run
Trigger the message computations:
python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.yml
This will walk the mirror_root_directory folder and send a message for each tarball encountered, for the swh-loader-tar to uncompress (through celery).
If mirror_subset_archives is provided, the tarball messages will be computed from that file instead (the mirror_root_directory is still used, so please keep the two consistent).
If a problem arises during tarball message computation, a message will be output naming the tarball that caused the problem.
The number of tarball messages sent will be displayed at the end.
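The behaviour described in this section can be sketched roughly as below. Everything here is illustrative: the function names, the message fields (`tar_path`, `origin_url`), the list of tarball suffixes, and the way the origin url is derived (url_scheme followed by the path relative to the mirror root, assuming url_scheme ends with a slash) are assumptions, not the producer's actual code.

```python
import os

# Assumption: suffixes used to recognise tarballs; the real producer may differ.
TARBALL_SUFFIXES = ('.tar', '.tar.gz', '.tgz', '.tar.bz2', '.tar.xz')


def list_tarballs(root_dir, subset_file=None):
    """Yield tarball paths from subset_file if given, else by walking root_dir."""
    if subset_file is not None:
        with open(subset_file) as f:
            for line in f:
                path = line.strip()
                if path:
                    yield path
        return
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if name.endswith(TARBALL_SUFFIXES):
                yield os.path.join(dirpath, name)


def compute_messages(root_dir, url_scheme, subset_file=None):
    """Return (messages, errors) for the tarballs found."""
    messages, errors = [], []
    root = os.path.abspath(root_dir)
    for path in list_tarballs(root, subset_file):
        abs_path = os.path.abspath(path)
        # A subset entry outside the mirror root is inconsistent: report it.
        if os.path.commonpath([root, abs_path]) != root:
            errors.append(path)
            continue
        relative = os.path.relpath(abs_path, root).replace(os.sep, '/')
        messages.append({'tar_path': abs_path,
                         'origin_url': url_scheme + relative})
    return messages, errors


if __name__ == '__main__':
    messages, errors = compute_messages(
        '/srv/storage/space/mirrors/gnu.org/gnu/', 'http://ftp.gnu.org/gnu/')
    for path in errors:
        print('Problem with tarball: %s' % path)
    print('%d tarball messages computed' % len(messages))
```

The consistency check on subset entries shows why the mirror root still matters when a subset file is given: the origin url is computed relative to that root.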
Dry run
python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.yml --dry-run
This will do the same as previously described but only display the number of potential tarball messages computed.
Help
python3 -m swh.loader.tar.producer --help
Hashes for swh.loader.tar-0.0.36-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9dd9e42f26261cc0337c85b19d3f8b81e45e8436eb84e2315bf24bf27eb85c27
MD5 | e8c2a1751f38672b4de1636225042c57
BLAKE2b-256 | 520b820106b9e496815607f186170477ed0e32b89b69402f20dcc8fe99c05b1f