Auto Archiver

Easily archive online media content.

Read the article about Auto Archiver on bellingcat.com.

Python tool to automatically archive social media posts, videos, and images from a Google Sheet, the console, and more. Uses different archivers depending on the platform, and can save content to local storage, an S3 bucket (Digital Ocean Spaces, AWS, ...), or Google Drive. If using Google Sheets as the source of links, the sheet will be updated with information about the archived content. It can be run manually or on an automated basis.

There are 3 ways to use the auto-archiver:

  1. (easiest installation) via docker
  2. (local python install) pip install auto-archiver
  3. (legacy/development) clone and manually install from repo (see legacy tutorial video)

But you always need a configuration/orchestration file, which is where you'll configure where/what/how to archive. Make sure you read the Orchestration section below.

How to run the auto-archiver

Option 1 - docker

Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple. Since it is an isolated environment, when you need to pass it your orchestration file or get downloaded media out of docker, you will need to connect folders on your machine with folders inside docker using the -v volume flag.

  1. install docker
  2. pull the auto-archiver docker image with docker pull bellingcat/auto-archiver
  3. run the docker image locally in a container: docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver -m auto_archiver --config secrets/orchestration.yaml. Breaking this command down (a Windows variant follows the breakdown):
    1. docker run tells docker to start a new container (an instance of the image)
    2. --rm makes sure this container is removed after execution (less garbage locally)
    3. -v $PWD/secrets:/app/secrets - your secrets folder
      1. -v is a volume flag which means a folder that you have on your computer will be connected to a folder inside the docker container
      2. $PWD/secrets points to a secrets/ folder in your current working directory (where your console points to); as a best practice, we use this folder to hold all the secrets/tokens/passwords/... you use
      3. /app/secrets is the path inside the docker container where that folder will be mounted
    4. -v $PWD/local_archive:/app/local_archive - (optional) if you use local_storage
      1. -v same as above, this is a volume instruction
      2. $PWD/local_archive is a local_archive/ folder in your current working directory, in case you want to archive locally and have the files accessible outside docker
      3. /app/local_archive is a folder inside docker that you can reference in your orchestration.yml file
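
If you're on Windows, note that $PWD expands in PowerShell but not in cmd.exe; a direct translation of the same command for cmd.exe (our assumption, not from the upstream docs) is:

REM same command, with %cd% in place of $PWD
docker run --rm -v %cd%/secrets:/app/secrets -v %cd%/local_archive:/app/local_archive bellingcat/auto-archiver -m auto_archiver --config secrets/orchestration.yaml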

Option 2 - python package

  1. make sure you have python 3.8 or higher installed
  2. install the package with pip (or pipenv/conda): pip install auto-archiver
  3. test that it's installed with auto-archiver --help
  4. run it with your orchestration file and pass any flags you want on the command line: auto-archiver --config secrets/orchestration.yaml (a consolidated example follows this list)
    1. this assumes your orchestration file is inside a secrets/ folder, which we advise
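
Putting the steps together, a minimal install-and-run session looks like this (assuming your orchestration file lives in secrets/):

# install, sanity-check, then run with your own orchestration file
pip install auto-archiver
auto-archiver --help
auto-archiver --config secrets/orchestration.yaml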

Option 3 - local installation

This can also be used for development.

Legacy instructions; only use these if docker or the python package is not an option.

Install the following locally (example install commands follow the list):

  1. ffmpeg, which must be installed locally for this tool to work.
  2. firefox and geckodriver, in a folder on your PATH such as /usr/local/bin.
  3. fonts-noto, to handle unicode characters in selenium/geckodriver's screenshots: sudo apt install fonts-noto -y.
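
On Debian/Ubuntu, items 1 and 3 can be installed roughly as follows (a sketch; package names vary by distro, and geckodriver usually needs a manual download from its releases page, https://github.com/mozilla/geckodriver/releases):

# ffmpeg and fonts for screenshots
sudo apt install ffmpeg fonts-noto -y
# firefox and geckodriver must then be reachable on your PATH, eg /usr/local/bin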

Clone and run:

  1. git clone https://github.com/bellingcat/auto-archiver
  2. pipenv install
  3. pipenv run python -m src.auto_archiver --config secrets/orchestration.yaml

Examples

Orchestration

The archiving work is orchestrated by the following workflow (we call each of these a step):

  1. Feeder gets the links (from a spreadsheet, from the console, ...)
  2. Archiver tries to archive the link (twitter, youtube, ...)
  3. Enricher adds more info to the content (hashes, thumbnails, ...)
  4. Formatter creates a report from all the archived content (HTML, PDF, ...)
  5. Database knows what's been archived and also stores the archive result (spreadsheet, CSV, or just the console)

To see all available steps (archivers, storages, databases, ...), check the example.orchestration.yaml.

The great thing is that you configure the whole workflow in your orchestration.yaml file, which we advise you to put into a secrets/ folder and not share with others, because it will contain passwords and other secrets.

The orchestration file is split into 2 parts: steps (what steps to use) and configurations (how those steps should behave); here's a simplification:

# orchestration.yaml content
steps:
  feeder: gsheet_feeder
  archivers: # order matters
    - youtubedl_archiver
  enrichers:
    - thumbnail_enricher
  formatter: html_formatter
  storages:
    - local_storage
  databases:
    - gsheet_db

configurations:
  gsheet_feeder:
    sheet: "your google sheet name"
    header: 2 # row with header for your sheet
  # ... configurations for the other steps here ...

All the configurations in the orchestration.yaml file (you can name it differently, but you need to pass it in the --config FILENAME argument) can be seen in the console by using the --help flag. They can also be overwritten from the command line; for example, if you are using the cli_feeder to archive from the command line and want to provide the URLs, you would do:

auto-archiver --config orchestration.yaml --cli_feeder.urls="url1,url2,url3"

Here's the complete workflow that the auto-archiver goes through:

graph TD
    s((start)) --> F(fa:fa-table Feeder)
    F -->|get and clean URL| D1{fa:fa-database Database}
    D1 -->|is already archived| e((end))
    D1 -->|not yet archived| a(fa:fa-download Archivers)
    a -->|got media| E(fa:fa-chart-line Enrichers)
    E --> S[fa:fa-box-archive Storages]
    E --> Fo(fa:fa-code Formatter)
    Fo --> S
    Fo -->|update database| D2(fa:fa-database Database)
    D2 --> e

Orchestration checklist

Use this checklist to make sure you did all the required steps:

  • you have a /secrets folder with all your configuration files (an example layout follows this list), including
    • an orchestration file, eg orchestration.yaml, pointing to the correct location of other files
    • (optional if you use GoogleSheets) a service_account.json (see how-to)
    • (optional for telegram) an anon.session file, which appears after the 1st run, where you log in to telegram
      • if you use private channels you need to add channel_invites and set join_channels=true at least once
    • (optional for VK) a vk_config.v2.json
    • (optional for using GoogleDrive storage) a gd-token.json (see help script)
    • (optional for instagram) an instaloader.session file, which appears after the 1st run where you log in to instagram
    • (optional for browsertrix) a profile.tar.gz file
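
As an illustration, a secrets/ folder with every optional file present would look something like this (only orchestration.yaml is always required):

secrets/
├── orchestration.yaml      # always required
├── service_account.json    # Google Sheets
├── anon.session            # Telegram
├── vk_config.v2.json       # VK
├── gd-token.json           # Google Drive storage
├── instaloader.session     # Instagram
└── profile.tar.gz          # browsertrix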

Example invocations

These assume you've installed with pipenv; see the docker section above for how to run through docker.

# all the configurations come from ./orchestration.yaml
auto-archiver
# all the configurations come from ./secrets/orchestration.yaml
auto-archiver --config secrets/orchestration.yaml
# uses the same configurations but for another google docs sheet
# with a header on row 2 and with some different column names
# notice that columns is a dictionary, so you need to pass it as JSON; it will override only the values provided
auto-archiver --config orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
# all the configurations come from orchestration.yaml and specifies that s3 files should be private
auto-archiver --s3_storage.private=1

Extra notes on configuration

Google Drive

To use Google Drive storage you need the ID of the shared folder in your orchestration file; that folder must be shared with the service account, eg autoarchiverservice@auto-archiver-111111.iam.gserviceaccount.com, and then you can use --storage=gd.
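
As a sketch, the relevant configuration might look like the following; the step and option names here are our assumptions, so check auto-archiver --help for the exact ones in your version:

# hypothetical snippet; verify option names with `auto-archiver --help`
configurations:
  gdrive_storage:
    root_folder_id: "the-id-from-the-shared-folder-url"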

Telethon (Telegram's API Library)

The first time you run, you will be prompted to authenticate with the phone number associated with your account; alternatively, you can put your anon.session file in the root.
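
A sketch of the related configuration, using the option names mentioned in the checklist above (api_id and api_hash come from https://my.telegram.org/apps; treat the exact keys as assumptions and verify with auto-archiver --help):

# hypothetical snippet; verify option names with `auto-archiver --help`
configurations:
  telethon_archiver:
    api_id: 123456
    api_hash: "your-api-hash"
    join_channels: true   # set at least once if you use private channels
    channel_invites:      # invite links for private channels
      - "https://t.me/+your-invite-code"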

Running on Google Sheets Feeder (gsheet_feeder)

The --gsheet_feeder.sheet property is the name of the Google Sheet to check for URLs. This sheet must have been shared with the Google Service account used by gspread. This sheet must also have specific columns (case-insensitive) in the header row - see Gsheet.configs for all their names.

For example, for use with this spreadsheet:

[Screenshot: a Google Spreadsheet with the column headers described above, and several Youtube and Twitter URLs in the "Media URL" column.]

When the auto archiver starts running, it updates the "Archive status" column.

[Screenshot: the same spreadsheet, with "archive in progress" added to one of the status columns.]

The links are downloaded and archived, and the spreadsheet is updated to the following:

[Screenshot: the spreadsheet with videos archived and metadata added per the description of the columns above.]

Note that the first row is skipped, as it is assumed to be a header row (--gsheet_feeder.header=1; you can change this if you use more rows above). Rows with an empty URL column or a non-empty archive column are also skipped. All sheets in the document will be checked.
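
To match a sheet like the one in the screenshots, a sketch of the feeder configuration might look like this; the column keys shown are illustrative, see Gsheet.configs for the real names:

# illustrative snippet; see Gsheet.configs for the full set of column keys
configurations:
  gsheet_feeder:
    sheet: "your google sheet name"
    header: 1                     # row containing the column names
    columns:
      url: "Media URL"            # column with the links to archive
      status: "Archive status"    # column the tool updates as it runs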


Development

Use python -m src.auto_archiver --config secrets/orchestration.yaml to run from the local development environment.

Docker development

  • working with docker locally:
    • docker build . -t auto-archiver to build a local image
    • docker run --rm -v $PWD/secrets:/app/secrets auto-archiver --config secrets/orchestration.yaml
      • to use the local archive, also create a volume for it by adding -v $PWD/local_archive:/app/local_archive
  • release to docker hub
    • docker image tag auto-archiver bellingcat/auto-archiver:latest
    • docker push bellingcat/auto-archiver

RELEASE

  • update version in version.py
  • run bash ./scripts/release.sh and confirm
  • package is automatically updated in pypi
  • docker image is automatically pushed to dockerhub
