Data standard, storage and retrieval for leaks and document collections

Project description

leakrfc

An RFC for leaks

leak-rfc.org

leakrfc provides a data standard and archive storage for leaked data and for private and public document collections. The concepts and implementations were originally inspired by mmmeta and Aleph's servicelayer archive.

leakrfc acts as a standardized storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various analysis platforms, such as ICIJ Datashare, Liquid Investigations, and Aleph.

It can act as a drop-in replacement for the underlying archive of Aleph.

install

pip install leakrfc

build a dataset

leakrfc stores metadata for each file; the metadata in turn refers to the actual source file.

List the files in a publicly accessible source (using anystore):

ANYSTORE_URI="https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/" anystore keys

Crawl these documents into this dataset:

leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"

The metadata and source files are now stored in the archive (./data by default). All metadata and other information lives in the ddos_patriotfront/.leakrfc subdirectory. Files are keyed and retrievable by their checksum (default: sha1).
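Because files are keyed by checksum, the key for a local file can be computed ahead of time. The following is a minimal sketch, assuming the key is the plain hex SHA-1 digest of the file's bytes; the helper name `sha1_checksum` is hypothetical and not part of leakrfc.

```python
import hashlib


def sha1_checksum(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks.

    Hypothetical helper: assumes the archive key is the plain SHA-1
    of the file content (the documented default checksum).
    """
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are not loaded into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string could then be passed to the `head` or `get` commands shown below.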

Retrieve file metadata:

leakrfc -d ddos_patriotfront head "19338a97797bcc0eeb832cf7169cbbafc54ed255"

Retrieve actual file blob:

leakrfc -d ddos_patriotfront get "19338a97797bcc0eeb832cf7169cbbafc54ed255" > file.pdf

api

run api

export LEAKRFC_ARCHIVE__URI=./data
uvicorn leakrfc.api:app --port 5000

request a file

For public files:

# metadata only via headers
curl -I "http://localhost:5000/<dataset>/<sha1>"

# bytes stream of file
curl -s "http://localhost:5000/<dataset>/<sha1>" > /tmp/file.lrfc

Protected files require an Authorization header carrying an encrypted bearer token whose subject holds the dataset and key to look up (token payload: {"sub": "<dataset>/<key>"}). Clients therefore need to know the secret key to create such tokens, and they are responsible for handling dataset permissions.

Tokens should have a short expiration (via exp property in payload).
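A client could build such a token roughly as follows. This is a sketch only: it assumes the API accepts a standard JWT signed with HS256 using the shared secret, which is not stated here, so check the actual algorithm and settings before relying on it. The function name `make_token` and the 60-second default lifetime are illustrative choices.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_token(secret: str, dataset: str, key: str, ttl_seconds: int = 60) -> str:
    """Build a short-lived token with subject "<dataset>/<key>".

    Assumption: HS256-signed JWT; the algorithm is not confirmed by the docs.
    """
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "sub": f"{dataset}/{key}",
        "exp": int(time.time()) + ttl_seconds,  # short expiration
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

The resulting string goes into the `Authorization: Bearer <token>` header.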

# token in Authorization header
curl -H 'Authorization: Bearer <token>' ...

# metadata only via headers
curl -I "http://localhost:5000/file"

# bytes stream of file
curl -s "http://localhost:5000/file" > /tmp/file.s
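The curl calls above can be mirrored in Python with the standard library. The sketch below only builds the request objects; it assumes the URL shapes shown in the curl examples (`/<dataset>/<sha1>` for public files, `/file` with a bearer token otherwise). The helper name `build_request` is hypothetical.

```python
import urllib.request
from typing import Optional


def build_request(
    base_url: str,
    dataset: Optional[str] = None,
    key: Optional[str] = None,
    token: Optional[str] = None,
    metadata_only: bool = False,
) -> urllib.request.Request:
    """Build a HEAD (metadata) or GET (bytes) request for a file.

    Without a token, the public URL /<dataset>/<key> is used.
    With a token, the lookup is carried in the token and the path is /file.
    """
    base = base_url.rstrip("/")
    if token is None:
        if not dataset or not key:
            raise ValueError("dataset and key are required for public files")
        url = f"{base}/{dataset}/{key}"
        headers = {}
    else:
        url = f"{base}/file"
        headers = {"Authorization": f"Bearer {token}"}
    method = "HEAD" if metadata_only else "GET"
    return urllib.request.Request(url, headers=headers, method=method)
```

Passing the result to `urllib.request.urlopen` sends it; for a HEAD request, the metadata is then available in the response headers.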

configure storage

storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}

pass through legacy aleph

storage_config:
  uri: gcs://aleph_archive/
  legacy_aleph: true
  copy_over: true # subsequently merge legacy archive data into `leakrfc`

layout

The RFC is reflected by the following layout structure for a Dataset:

./archive/
    my_dataset/

        # metadata maintained by `leakrfc`
        .leakrfc/
            index.json      # generated dataset metadata served for clients
            config.yml      # dataset configuration
            documents.csv   # document database (all metadata combined)
            keys.csv        # hash -> uri mapping for all files
            state/          # processing state
                logs/
                created_at
                updated_at
            entities/
                entities.ftm.json
            files/                         # FILE METADATA STORAGE:
                a1/b1/a1b1c1.../info.json  # - file metadata as json REQUIRED
                a1/b1/a1b1c1.../txt        # - extracted plain text
                a1/b1/a1b1c1.../converted.pdf  # - converted file, e.g. from .docx to .pdf for better web display
                a1/b1/a1b1c1.../extracted/ # - extracted files from packages/archives
                    foo.txt
            export/
                my_dataset.img.zst         # dump as image
                my_dataset.leakrfc         # dump as zipfile

        # actual (read-only) data
        Arbitrary Folder/
            Source1.pdf
            Tables/
                Another_File.xlsx
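The file metadata paths in the layout above suggest that each checksum is sharded by its first two character pairs (`a1/b1/a1b1c1...`). A minimal sketch of that mapping, assuming exactly this sharding scheme; the function names are hypothetical and not part of leakrfc:

```python
from pathlib import PurePosixPath


def file_metadata_dir(dataset: str, checksum: str) -> PurePosixPath:
    """Return the metadata directory for a file, relative to the archive root.

    Assumes the sharding shown in the layout: the first two characters,
    the next two characters, then the full checksum.
    """
    if len(checksum) < 4:
        raise ValueError("checksum too short to shard")
    return PurePosixPath(
        dataset, ".leakrfc", "files", checksum[:2], checksum[2:4], checksum
    )


def info_json_path(dataset: str, checksum: str) -> PurePosixPath:
    """Return the path of the required info.json for a file."""
    return file_metadata_dir(dataset, checksum) / "info.json"
```

For example, `info_json_path("my_dataset", "19338a97797bcc0eeb832cf7169cbbafc54ed255")` resolves to a path under `my_dataset/.leakrfc/files/19/33/`.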

dataset config.yml

Follows the specification in ftmq.model.Dataset:

name: my_dataset #  also known as "foreign_id"
title: An awesome leak
description: >
  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name
  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
  similique asperiores quod et quae maiores. Et accusantium accusantium error
  et alias aut omnis eos. Omnis porro sit eum et.
updated_at: 2024-09-25
index_url: https://static.example.org/my_dataset/index.json
# add more metadata

leakrfc: # see above

Development

This package uses poetry for packaging and dependency management, so install poetry first.

Clone this repository to a local destination.

Within the repo directory, run

poetry install --with dev

This installs a few development dependencies, including pre-commit which needs to be registered:

poetry run pre-commit install

Before each commit, pre-commit checks for correct code formatting (isort, black) and runs a few other checks (see .pre-commit-config.yaml).

testing

leakrfc uses pytest as the testing framework.

make test

License and Copyright

leakrfc, (C) 2024 investigativedata.io

leakrfc is licensed under the AGPLv3 or later license.

See NOTICE and LICENSE.
