Data standard, storage and retrieval for leaks and document collections

leakrfc

An RFC for leaks

leak-rfc.org

leakrfc provides a data standard and archive storage for leaked data and for private and public document collections. The concepts and implementations were originally inspired by mmmeta and Aleph's servicelayer archive.

leakrfc acts as a standardized storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various analysis platforms, such as ICIJ Datashare, Liquid Investigations, and Aleph.

It can act as a drop-in replacement for the underlying archive of Aleph.

install

pip install leakrfc

build a dataset

leakrfc stores metadata for each file; the metadata then refers to the actual source file.

List the files in a publicly accessible source (using anystore):

ANYSTORE_URI="https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/" anystore keys

Crawl these documents into this dataset:

leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"

The metadata and source files are now stored in the archive (./data by default). All metadata and other information lives in the ddos_patriotfront/.leakrfc subdirectory. Files are keyed and retrievable by their checksum (default: sha1).
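
To look up a local file in the archive, its key is needed first. A minimal sketch of computing that key, assuming the key is the plain hex SHA-1 digest of the file content (the default checksum named above); the function name `sha1_key` is hypothetical and not part of leakrfc:

```python
import hashlib


def sha1_key(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks.

    Assumption: leakrfc keys a file by the SHA-1 of its raw bytes.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large files do not fill memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned 40-character string can then be passed to the `head` and `get` commands.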

Retrieve file metadata:

leakrfc -d ddos_patriotfront head "19338a97797bcc0eeb832cf7169cbbafc54ed255"

Retrieve actual file blob:

leakrfc -d ddos_patriotfront get "19338a97797bcc0eeb832cf7169cbbafc54ed255" > file.pdf

api

run api

export LEAKRFC_ARCHIVE__URI=./data
uvicorn leakrfc.api:app --port 5000

request a file

For public files:

# metadata only via headers
curl -I "http://localhost:5000/<dataset>/<sha1>"

# bytes stream of file
curl -s "http://localhost:5000/<dataset>/<sha1>" > /tmp/file.lrfc
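
The same two requests can be prepared from Python with the standard library. This is a sketch only: the base URL is taken from the curl examples above, and the helper names `file_url`, `head_request` and `get_request` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: the API listens where the curl examples point.
BASE_URL = "http://localhost:5000"


def file_url(dataset: str, sha1: str, base_url: str = BASE_URL) -> str:
    """Build the public file URL: <base>/<dataset>/<sha1>."""
    return f"{base_url.rstrip('/')}/{quote(dataset)}/{quote(sha1)}"


def head_request(dataset: str, sha1: str) -> Request:
    """Prepare a HEAD request (metadata only, returned via headers)."""
    return Request(file_url(dataset, sha1), method="HEAD")


def get_request(dataset: str, sha1: str) -> Request:
    """Prepare a GET request (byte stream of the file)."""
    return Request(file_url(dataset, sha1), method="GET")
```

With the API running, passing either request to `urllib.request.urlopen` sends it; the response's `headers` attribute holds the metadata and `read()` returns the file bytes.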

Authorization expects an encrypted bearer token that carries the dataset and key to look up in its subject (token payload: {"sub": "<dataset>/<key>"}). Clients therefore need to be able to create such tokens (which requires knowing the secret key) and to handle dataset permissions themselves.

Tokens should have a short expiration (set via the exp property in the payload).

# token in Authorization header
curl -H 'Authorization: Bearer <token>' ...

# metadata only via headers
curl -I "http://localhost:5000/file"

# bytes stream of file
curl -s "http://localhost:5000/file" > /tmp/file.s
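
A client-side sketch of creating such a token. The text above only specifies the `sub` and `exp` payload properties; the token format and algorithm are assumptions here (a JWT-style token signed with HMAC-SHA256), and `make_token` is a hypothetical helper, so check the server configuration before relying on it.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """URL-safe base64 without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_token(
    dataset: str,
    key: str,
    secret: str,
    ttl_seconds: int = 60,
    now: Optional[int] = None,
) -> str:
    """Build a short-lived token with payload {"sub": "<dataset>/<key>", "exp": ...}.

    Assumption: the server accepts HS256-signed JWTs; the source does not
    name the algorithm.
    """
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": f"{dataset}/{key}", "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

The result goes into the `Authorization: Bearer <token>` header shown above. A small `ttl_seconds` keeps the expiration short, as recommended.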

configure storage

storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}

pass through legacy aleph

storage_config:
  uri: gcs://aleph_archive/
  legacy_aleph: true
  copy_over: true # subsequently merge legacy archive data into `leakrfc`

layout

The RFC is reflected by the following layout structure for a Dataset:

./archive/
    my_dataset/

        # metadata maintained by `leakrfc`
        .leakrfc/
            index.json      # generated dataset metadata served for clients
            config.yml      # dataset configuration
            documents.csv   # document database (all metadata combined)
            keys.csv        # hash -> uri mapping for all files
            state/          # processing state
                logs/
                created_at
                updated_at
            entities/
                entities.ftm.json
            files/                         # FILE METADATA STORAGE:
                a1/b1/a1b1c1.../info.json  # - file metadata as json REQUIRED
                a1/b1/a1b1c1.../txt        # - extracted plain text
                a1/b1/a1b1c1.../converted.pdf  # - converted file, e.g. from .docx to .pdf for better web display
                a1/b1/a1b1c1.../extracted/ # - extracted files from packages/archives
                    foo.txt
            export/
                my_dataset.img.zst         # dump as image
                my_dataset.leakrfc         # dump as zipfile

        # actual (read-only) data
        Arbitrary Folder/
            Source1.pdf
            Tables/
                Another_File.xlsx
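
The `a1/b1/a1b1c1...` entries in the layout suggest that file metadata is sharded by the first two pairs of checksum characters. A sketch of deriving those paths under that assumption; the exact prefix scheme is inferred from the listing, and the helper names `metadata_dir` and `info_path` are hypothetical:

```python
from pathlib import PurePosixPath


def metadata_dir(dataset: str, checksum: str) -> PurePosixPath:
    """Return the per-file metadata directory inside the dataset.

    Assumption: two prefix levels of two characters each, followed by
    the full checksum, as the layout listing suggests.
    """
    if len(checksum) < 4:
        raise ValueError("checksum too short to derive prefix directories")
    return PurePosixPath(
        dataset, ".leakrfc", "files", checksum[:2], checksum[2:4], checksum
    )


def info_path(dataset: str, checksum: str) -> PurePosixPath:
    """Return the path of the required info.json for a file."""
    return metadata_dir(dataset, checksum) / "info.json"
```

The paths are relative to the archive root (`./archive/` in the layout above).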

dataset config.yml

Follows the specification in ftmq.model.Dataset:

name: my_dataset #  also known as "foreign_id"
title: An awesome leak
description: >
  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name
  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
  similique asperiores quod et quae maiores. Et accusantium accusantium error
  et alias aut omnis eos. Omnis porro sit eum et.
updated_at: 2024-09-25
index_url: https://static.example.org/my_dataset/index.json
# add more metadata

leakrfc: # see above

Development

This package uses poetry for packaging and dependency management, so install poetry first.

Clone this repository to a local destination.

Within the repo directory, run

poetry install --with dev

This installs a few development dependencies, including pre-commit, which needs to be registered:

poetry run pre-commit install

Before each commit is created, pre-commit checks for correct code formatting (isort, black) and runs a few other checks (see .pre-commit-config.yaml).

testing

leakrfc uses pytest as the testing framework.

make test

License and Copyright

leakrfc, (C) 2024 investigativedata.io

leakrfc is licensed under the AGPLv3 or later license.

See NOTICE and LICENSE.

