Data standard, storage and retrieval for leaks and document collections

leakrfc

An RFC for leaks

leak-rfc.org

leakrfc provides a data standard and archive storage for leaked data, private and public document collections. The concepts and implementations are originally inspired by mmmeta and Aleph's servicelayer archive.

leakrfc acts as a standardized storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and for importing them into various analysis platforms, such as ICIJ Datashare, Liquid Investigations, and Aleph.

It can act as a drop-in replacement for the underlying archive of Aleph.

install

pip install leakrfc

build a dataset

leakrfc stores metadata for each file; the metadata in turn refers to the actual source file.

List the files in a publicly accessible source (using anystore):

ANYSTORE_URI="https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/" anystore keys

Crawl these documents into this dataset:

leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"

The metadata and source files are now stored in the archive (./data by default). All metadata and other information lives in the ddos_patriotfront/.leakrfc subdirectory. Files are keyed and retrievable by their checksum (default: sha1).
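To find out which key a local file would be stored under, the checksum can be computed independently. The following is a minimal sketch, assuming the key is the plain hexadecimal SHA-1 digest of the file's bytes (the default named above). `sha1_key` is a hypothetical helper for illustration, not part of leakrfc.

```python
import hashlib
from pathlib import Path


def sha1_key(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-1 digest of a file (hypothetical helper, not leakrfc API)."""
    digest = hashlib.sha1()
    with Path(path).open("rb") as fh:
        # Read in fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string has the same shape as the 40-character keys passed to `leakrfc head` and `leakrfc get`.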

Retrieve file metadata:

leakrfc -d ddos_patriotfront head "19338a97797bcc0eeb832cf7169cbbafc54ed255"

Retrieve actual file blob:

leakrfc -d ddos_patriotfront get "19338a97797bcc0eeb832cf7169cbbafc54ed255" > file.pdf

api

run api

export LEAKRFC_ARCHIVE__URI=./data
uvicorn leakrfc.api:app --port 5000

request a file

For public files:

# metadata only via headers
curl -I "http://localhost:5000/<dataset>/<sha1>"

# bytes stream of file
curl -s "http://localhost:5000/<dataset>/<sha1>" > /tmp/file.lrfc

For non-public files, authorization expects an encrypted bearer token that carries the dataset and key lookup in its subject (token payload: {"sub": "<dataset>/<key>"}). Clients therefore need to create such tokens themselves (which requires knowing the secret key) and handle dataset permissions.

Tokens should have a short expiration (via exp property in payload).
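As an illustration of the payload described above, here is a sketch that builds a short-lived token using only the standard library. It assumes an HS256-signed JWT; the actual algorithm, the way the secret key is configured, and any further required claims are not stated here, so treat `make_token` and its parameters as hypothetical and check them against the server configuration.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_token(secret: str, dataset: str, key: str, ttl_seconds: int = 60) -> str:
    """Build a JWT whose subject is "<dataset>/<key>" (hypothetical helper)."""
    # Assumption: HS256 signing. The server may expect a different algorithm.
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "sub": f"{dataset}/{key}",
        "exp": int(time.time()) + ttl_seconds,  # short expiration
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

The result is the string that goes after `Bearer` in the `Authorization` header. In practice a maintained JWT library is preferable to hand-rolled signing; the sketch only shows which claims the token carries.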

# token in Authorization header
curl -H 'Authorization: Bearer <token>' ...

# metadata only via headers
curl -I "http://localhost:5000/file"

# bytes stream of file
curl -s "http://localhost:5000/file" > /tmp/file.s
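The curl calls above can be mirrored from Python. The sketch below only prepares the request object, so it can be inspected without a running server; `build_request` is a hypothetical helper, and the URL shapes are taken from the curl examples (`/<dataset>/<sha1>` for public files, `/file` when a token is sent).

```python
from typing import Optional
from urllib.request import Request


def build_request(
    base_url: str,
    dataset: Optional[str] = None,
    key: Optional[str] = None,
    token: Optional[str] = None,
    metadata_only: bool = False,
) -> Request:
    """Prepare a HEAD (metadata) or GET (bytes) request for a file."""
    base = base_url.rstrip("/")
    if token is not None:
        # Token case: dataset and key travel in the token subject.
        url = f"{base}/file"
        headers = {"Authorization": f"Bearer {token}"}
    else:
        if dataset is None or key is None:
            raise ValueError("dataset and key are required without a token")
        url = f"{base}/{dataset}/{key}"
        headers = {}
    return Request(url, headers=headers, method="HEAD" if metadata_only else "GET")
```

With the API running, passing the result to `urllib.request.urlopen` returns a response whose `headers` hold the metadata and whose `read()` yields the file bytes.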

configure storage

storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}

pass through legacy aleph

storage_config:
  uri: gcs://aleph_archive/
  legacy_aleph: true
  copy_over: true # subsequently merge legacy archive data into `leakrfc`

layout

The RFC is reflected by the following layout structure for a Dataset:

./archive/
    my_dataset/

        # metadata maintained by `leakrfc`
        .leakrfc/
            index.json      # generated dataset metadata served for clients
            config.yml      # dataset configuration
            documents.csv   # document database (all metadata combined)
            keys.csv        # hash -> uri mapping for all files
            state/          # processing state
                logs/
                created_at
                updated_at
            entities/
                entities.ftm.json
            files/                         # FILE METADATA STORAGE:
                a1/b1/a1b1c1.../info.json  # - file metadata as json REQUIRED
                a1/b1/a1b1c1.../txt        # - extracted plain text
                a1/b1/a1b1c1.../converted.pdf  # - converted file, e.g. from .docx to .pdf for better web display
                a1/b1/a1b1c1.../extracted/ # - extracted files from packages/archives
                    foo.txt
            export/
                my_dataset.img.zst         # dump as image
                my_dataset.leakrfc         # dump as zipfile

        # actual (read-only) data
        Arbitrary Folder/
            Source1.pdf
            Tables/
                Another_File.xlsx
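The per-file metadata location can be derived from a checksum. This is a sketch inferred from the `a1/b1/a1b1c1.../` pattern in the layout above, assuming two 2-character prefix levels followed by the full checksum; `metadata_dir` is a hypothetical helper, and the exact sharding should be confirmed against an actual archive.

```python
from pathlib import PurePosixPath


def metadata_dir(dataset: str, checksum: str) -> PurePosixPath:
    """Return the metadata directory for a file, relative to the archive root."""
    if len(checksum) < 4:
        raise ValueError("checksum too short to shard")
    # Assumption: two 2-character prefix levels, then the full checksum.
    return PurePosixPath(
        dataset, ".leakrfc", "files", checksum[:2], checksum[2:4], checksum
    )


# The required metadata file for the checksum used in the examples above.
info = metadata_dir("ddos_patriotfront", "19338a97797bcc0eeb832cf7169cbbafc54ed255") / "info.json"
```

Here `info` points at `ddos_patriotfront/.leakrfc/files/19/33/19338a97797bcc0eeb832cf7169cbbafc54ed255/info.json`.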

dataset config.yml

Follows the specification in ftmq.model.Dataset:

name: my_dataset #  also known as "foreign_id"
title: An awesome leak
description: >
  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name
  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
  similique asperiores quod et quae maiores. Et accusantium accusantium error
  et alias aut omnis eos. Omnis porro sit eum et.
updated_at: 2024-09-25
index_url: https://static.example.org/my_dataset/index.json
# add more metadata

leakrfc: # see above

Development

This package uses poetry for packaging and dependency management, so install poetry first.

Clone this repository to a local destination.

Within the repo directory, run

poetry install --with dev

This installs a few development dependencies, including pre-commit which needs to be registered:

poetry run pre-commit install

Before each commit, this checks the code formatting (isort, black) and runs a few other checks (see .pre-commit-config.yaml).

testing

leakrfc uses pytest as the testing framework.

make test

License and Copyright

leakrfc, (C) 2024 investigativedata.io

leakrfc is licensed under the AGPLv3 or later license.

see NOTICE and LICENSE
