Data standard, storage and retrieval for leaks and document collections
leakrfc
An RFC for leaks
leakrfc provides a data standard and archive storage for leaked data and for private and public document collections. The concepts and implementations were originally inspired by mmmeta and Aleph's servicelayer archive.
leakrfc acts as a standardized storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various analysis platforms, such as ICIJ Datashare, Liquid Investigations, and Aleph.
It can act as a drop-in replacement for the underlying archive of Aleph.
install
pip install leakrfc
build a dataset
leakrfc stores metadata for each file; the metadata then refers to the actual source file.
List the files in a publicly accessible source (using anystore):
ANYSTORE_URI="https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/" anystore keys
Crawl these documents into this dataset:
leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"
The metadata and source files are now stored in the archive (./data by default). All metadata and other information lives in the ddos_patriotfront/.leakrfc subdirectory. Files are keyed and retrievable by their checksum (default: sha1).
Retrieve file metadata:
leakrfc -d ddos_patriotfront head "19338a97797bcc0eeb832cf7169cbbafc54ed255"
Retrieve actual file blob:
leakrfc -d ddos_patriotfront get "19338a97797bcc0eeb832cf7169cbbafc54ed255" > file.pdf
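Because files are keyed by their checksum, the key of a local file can be computed up front. The following is a minimal sketch using only the Python standard library. It assumes the key is the plain hexadecimal SHA-1 digest of the file content, which matches the 40-character key shown above; `sha1_key` is a hypothetical helper, not part of leakrfc.

```python
import hashlib


def sha1_key(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks.

    Assumption: the archive key equals this digest (default checksum: sha1).
    """
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting string can then be passed to `leakrfc -d <dataset> head` or `get` as shown above.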
api
run api
export LEAKRFC_ARCHIVE__URI=./data
uvicorn leakrfc.api:app --port 5000
request a file
For public files:
# metadata only via headers
curl -I "http://localhost:5000/<dataset>/<sha1>"
# bytes stream of file
curl -s "http://localhost:5000/<dataset>/<sha1>" > /tmp/file.lrfc
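The same two requests can be issued from Python. This is a sketch with the standard library only; it assumes the `/<dataset>/<sha1>` URL scheme and the `localhost:5000` address from the curl examples, and `file_request` and `fetch_metadata` are hypothetical helper names.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def file_request(base_url: str, dataset: str, key: str, method: str = "GET") -> Request:
    """Build a request for <base_url>/<dataset>/<key> without sending it."""
    url = f"{base_url.rstrip('/')}/{quote(dataset)}/{quote(key)}"
    return Request(url, method=method)


def fetch_metadata(base_url: str, dataset: str, key: str) -> dict:
    """Send a HEAD request and return the response headers as a dict."""
    with urlopen(file_request(base_url, dataset, key, method="HEAD")) as res:
        return dict(res.headers)
```

With the API running, `fetch_metadata("http://localhost:5000", "<dataset>", "<sha1>")` would correspond to the `curl -I` call, while a plain `GET` request streams the file bytes.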
Authorization expects an encrypted bearer token with the dataset and key lookup in the subject (token payload: {"sub": "<dataset>/<key>"}). Therefore, clients need to be able to create such tokens (knowing the secret key) and handle dataset permissions. Tokens should have a short expiration (via the exp property in the payload).
# token in Authorization header
curl -H 'Authorization: Bearer <token>' ...
# metadata only via headers
curl -I "http://localhost:5000/file"
# bytes stream of file
curl -s "http://localhost:5000/file" > /tmp/file.s
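A client-side sketch of creating such a token follows. The token algorithm is not stated here, so this example assumes a JWT signed with HS256 and builds it with the standard library only; `make_token`, its parameters, and the secret are hypothetical and need to match whatever the API is actually configured to accept.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_token(secret: str, dataset: str, key: str, ttl: int = 60,
               now: Optional[int] = None) -> str:
    """Create a short-lived token with payload {"sub": "<dataset>/<key>", "exp": ...}.

    Assumption: HS256-signed JWT; the algorithm is not confirmed by the README.
    """
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": f"{dataset}/{key}", "exp": issued + ttl}
    segments = [
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(segments)
    signature = hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"
```

The returned string goes into the `Authorization: Bearer <token>` header. A short `ttl` keeps the expiration close to the time of the request.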
configure storage
storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
pass through legacy aleph
storage_config:
  uri: gcs://aleph_archive/
  legacy_aleph: true
  copy_over: true # subsequently merge legacy archive data into `leakrfc`
layout
The RFC is reflected by the following layout structure for a Dataset:
./archive/
    my_dataset/

        # metadata maintained by `leakrfc`
        .leakrfc/
            index.json      # generated dataset metadata served for clients
            config.yml      # dataset configuration
            documents.csv   # document database (all metadata combined)
            keys.csv        # hash -> uri mapping for all files
            state/          # processing state
                logs/
                created_at
                updated_at
            entities/
                entities.ftm.json
            files/          # FILE METADATA STORAGE:
                a1/b1/a1b1c1.../info.json       # - file metadata as json REQUIRED
                a1/b1/a1b1c1.../txt             # - extracted plain text
                a1/b1/a1b1c1.../converted.pdf   # - converted file, e.g. from .docx to .pdf for better web display
                a1/b1/a1b1c1.../extracted/      # - extracted files from packages/archives
                    foo.txt
            export/
                my_dataset.img.zst      # dump as image
                my_dataset.leakrfc      # dump as zipfile

        # actual (read-only) data
        Arbitrary Folder/
            Source1.pdf
            Tables/
                Another_File.xlsx
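The file metadata paths in this layout can be derived from a checksum. The sketch below assumes, based on the `a1/b1/a1b1c1...` pattern above, that the two shard directories are the first two and the next two characters of the checksum; `metadata_dir` and `info_path` are hypothetical helpers, not leakrfc functions.

```python
from pathlib import PurePosixPath


def metadata_dir(dataset: str, checksum: str) -> PurePosixPath:
    """Return the per-file metadata directory inside <dataset>/.leakrfc/files/.

    Assumption: shards are checksum[0:2] and checksum[2:4], as the layout suggests.
    """
    if len(checksum) < 4:
        raise ValueError("checksum too short to derive shard directories")
    return PurePosixPath(dataset, ".leakrfc", "files",
                         checksum[:2], checksum[2:4], checksum)


def info_path(dataset: str, checksum: str) -> PurePosixPath:
    """Return the path of the required info.json for a file."""
    return metadata_dir(dataset, checksum) / "info.json"
```

For the checksum used in the retrieval examples, `info_path("ddos_patriotfront", "19338a97797bcc0eeb832cf7169cbbafc54ed255")` would point to `ddos_patriotfront/.leakrfc/files/19/33/19338a97797bcc0eeb832cf7169cbbafc54ed255/info.json`, relative to the archive root.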
dataset config.yml
Follows the specification in ftmq.model.Dataset:
name: my_dataset # also known as "foreign_id"
title: An awesome leak
description: >
  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name
  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
  similique asperiores quod et quae maiores. Et accusantium accusantium error
  et alias aut omnis eos. Omnis porro sit eum et.
updated_at: 2024-09-25
index_url: https://static.example.org/my_dataset/index.json
# add more metadata
leakrfc: # see above
Development
This package uses poetry for packaging and dependency management, so install poetry first.
Clone this repository to a local destination.
Within the repo directory, run
poetry install --with dev
This installs a few development dependencies, including pre-commit, which needs to be registered:
poetry run pre-commit install
Before each commit is created, pre-commit checks for correct code formatting (isort, black) and runs some other useful checks (see: .pre-commit-config.yaml).
testing
leakrfc uses pytest as the testing framework.
make test
License and Copyright
leakrfc, (C) 2024 investigativedata.io
leakrfc is licensed under the AGPLv3 or later license.