Create BagIt packages harvesting data from upstream sources
Project description
bagit-create
CLI tool to prepare BagIt (RFC) packages harvesting metadata and raw files, following the CERN Archival Information Packages (AIP) specification.
Data is taken from various upstream sources, such as CDS (CERN Document Service) and CERN Open Data.
Install
If you just need to run BagIt Create from the command line:
# Install from PyPi
pip install bagit-create
# Check installed version
bic --version
For development, you can clone this repository and then install it with the -e
flag:
# Clone the repository
git clone https://gitlab.cern.ch/digitalmemory/bagit-create
cd bagit-create
pip install -e .
# Check installed version
bic --version
Usage
CLI
# Show CLI Usage help
bic --help
bic --recid=2272168 --source=cds
# Generate JSON metadata for arkivum, running in a very verbose way
bic --recid 2766073 --source cds --ark-json -vv
# Deleted resource, running in a very verbose way
bic --recid 1 --source cds -vv
# Run tests
pytest
CLI options:
--recid TEXT
, Unique ID of the record in the upstream source [required]--source [cds|ilcdoc|cod]
, Select source pipeline [required]--skip-downloads
, Creates files but skip downloading the actual payloads--ark-json
, Generate a JSON metadata file for arkivum ingestions--ark-json-rel
, Generate a JSON metadata file for arkivum ingestions using relative paths-v
, Enable logging (verbose, 'info' level)-vv
, Enable logging (very verbose, 'debug' level)-b
,--bibdoc
, Get metadata for a CDS record from the bibdocfile utility. (/opt/cdsweb/bin/bibdocfile
must be available in the system and the resource must be from CDS). See bibdocfile.--bd-ssh-host TEXT
, SSH host to run bibdocfile. See bibdocfile.
Module
The BagIt-Create tool can be used from other python scripts easily:
from bagit_create.main import process
process(recid=2272168, source="cds")
Supported sources
Name | ID | URL | Notes |
---|---|---|---|
CERN Document Server | cds | https://cds.cern.ch/ | Invenio v1.1.3.1106-62468 |
ILC Document Server | ilcdoc | http://ilcdoc.linearcollider.org | CDS Invenio v1.0.7.2-5776 |
CERN Open Data | cod | https://opendata.cern.ch/ |
CERN Document Server (CDS)
To prepare a BagIt from a CDS Resource ID, using the CLI interface, run python cli.py --recid=2272168 --source=cds
> tree bagitexport_2272168
bagitexport_2272168
├── bagit.txt
├── 2272168_1605200583
│ ├── metadata.xml
│ └── references.txt
└── 2272168_bacc9427609e6509f172e6b2604659d6jfkob
└── 2272168.mp4
2 directories, 3 files
CDS metadata is XML/MARC21
bibdocfile
The bibdocfile
command line utility can be used to get metadata for CDS, exposing internal file paths and hashes normally not available through the CDS API.
If the executable is available in the path (i.e. you can run /opt/cdsweb/bin/bibdocfile
) just append --bibdoc
:
bic --recid 2751237 --source cds --ark-json --bibdoc -v
If this is not the case, you can pass a --bd-ssh-host
parameter specifying the name of an SSH configured connection pointing to a machine able to run the command for you. Be aware that your machine must be able to establish such connection without any user interaction (the script will run ssh <THE_PROVIDED_SSH_HOST> bibdocfile ..args
).
Since in a normal CERN scenario this can't be possible due to required ProxyJumps/OTP authentication steps, you can use the ControlMaster
feature of any recent version of OpenSSH, allowing to reuse sockets for connecting:
Add an entry in ~/.ssh/config
to set up the SSH connection to the remote machine able to run bibdocfile
for you in the following way:
Host <SSH_NAME>
User <YOUR_USER>
Hostname <HOSTNAME.cern.ch>
ProxyJump <LXPLUS_or_AIADM>
ControlMaster auto
ControlPath ~/.ssh/control:%h:%p:%r
Then, run ssh <SSH_NAME>
in a shell, authenticate and keep it open. OpenSSH will now reuse this socket everytime you run <SSH_NAME>
, allowing BagItCreate tool to run bibdocfile
over this ssh connection for you, if you pass the bd-ssh-host
parameter:
bic --recid 2751237 --source cds --ark-json --bibdoc --bd-ssh-host=<SSH_NAME> -v
CERN Open Data
To prepare a BagIt from a CERN Open Data Record ID, run ./cli.py --recid 1 --source cod
.
CERN Open Data metadata follows this schema.
Examples
- CDS
2272168
- DM entry - CDS
1000571
- bibdoc entry (merged results), hundreds of entries - COD
1
- packed in file lists - COD
5200
- non packed - COD
8884
- big record
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.