A simple toolkit for managing local state against remote metadata.


mmmeta

Handle meta information and local state about files from remote (read-only) locations.

example use case

It’s better explained by a concrete example:

Server scrapes documents and stores them with metadata

Client1 wants to download all files with document_type="contract"

Client2 wants to import all documents scraped no longer than 1 week ago into a database, but only the ones that are not imported yet

synopsis

To clarify the terms used in this manual:

  • files: actual files (like pdfs…)

  • metadata files: json files that contain metadata for actual files

  • metadata db: sqlite database containing metadata for all files from the remote

  • remote: the “source of truth” where files, metadata files and the metadata db are stored. A remote can also be a local folder on the same machine…

  • client: a machine that has read-only access to the remote

  • state db: sqlite database, stored only on the client, containing local state for files

  • store: A simple implementation of a key-value store for additional information

  • metadir: a directory named _mmmeta that is synced between remote and client and contains metadata db, store, and (on the client) state db (see the layout sketch below)
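
For orientation, a synced metadir on a client might look like this; meta.db, _store and config.yml appear later in this manual, while the state db file name (state.db) is an assumption:

./path/_mmmeta/
├── config.yml   # optional config (see below)
├── meta.db      # metadata db, synced from the remote
├── state.db     # state db, client-only (file name assumed)
└── _store/      # key-value store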

how does this scenario work?

Server

  1. Stores a metadata json file for each file
  2. Generates (and updates) a metadir

Client1

  1. Syncs the remote metadir
  2. Merges the remote metadata db with the local state db
  3. Queries the state db for the given criteria
  4. For each result, downloads the actual file from the remote

Client2

  1. Syncs the remote metadir
  2. Merges the remote metadata db with the local state db
  3. Queries the state db for remote metadata retrieved_at=<date> and local state imported=False

mmmeta automates almost ;) all of this:

implementation of this scenario with mmmeta

Server

scrapes documents and stores them with metadata

a) Server stores files locally

If the files (and their metadata) are stored locally, metadata generation is as easy as looping through all of the json files and generating the database from them. This can be done via the command line inside the directory of the metadata files:

mmmeta generate

This will loop through all json files and create a sqlite database in ./_mmmeta/meta.db

For other path locations, see initialization

When new metadata files are added, simply re-run this command. It will just update the meta db without deleting existing entries, which means the old metadata files don’t need to stay on the server (see next situation).

b) Server downloads files locally but then pushes into a cloud

Here we don’t have all the files locally, only a subset (the newly downloaded ones).

First, synchronize the cloud metadir to local (i.e. to the server).

Then update metadata as described above:

mmmeta generate

Last, synchronize the updated metadir back to the cloud.

c) Server directly pushes files to cloud

Here, we don’t have any files or their metadata locally (on the server). Updating the meta db happens within the python code of the application:

First, synchronize the cloud metadir to local (i.e. to the server).

Then, run your application…

from mmmeta import mmmeta

m = mmmeta("./path/to/metadir")

for data in scraper:
    m.files.insert(data)
    # or upsert, if you want, with unique key(s) such as "content_hash":
    m.files.upsert(data, ["content_hash"])

This will update the meta db in the metadir

Last, synchronize the updated metadir back to the cloud.
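
For illustration, a data record passed to the loop above might look like this; apart from _file_name and content_hash (which appear in the config section below), the field names are made up:

data = {
    "_file_name": "contract_2021.pdf",
    "content_hash": "8c3d...",  # unique identifier (truncated example)
    "document_type": "contract",
    "retrieved_at": "2021-06-01",
}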

Client1

wants to download all files with document_type="contract"

First, synchronize remote metadir to local.

Then,

from mmmeta import mmmeta

m = mmmeta("./path/to/metadir")

def download(url):
    # implement download based on remote storage
    # url will be, based on storage, something like:
    # - file:///path/to/file.pdf (remote is local filesystem)
    # - s3://bucket/path/to/file.pdf (remote is aws cloud storage)
    # - https://remote.com/path/to/file.pdf
    ...

for file in m.files(document_type="contract"):
    download(file.remote.url)

See config on how to generate remote urls or uris
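
A minimal sketch of such a download function for the simplest case, a local filesystem remote with file:// urls; the target directory and the file:// handling are assumptions, not part of mmmeta:

import os
import shutil
from urllib.parse import unquote, urlparse

def download(url, target_dir="./downloads"):
    # minimal sketch: only handles file:// urls by copying the file locally
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise NotImplementedError(f"no handler for {parsed.scheme}:// urls")
    os.makedirs(target_dir, exist_ok=True)
    return shutil.copy(unquote(parsed.path), target_dir)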

Client2

wants to import all documents scraped no longer than 1 week ago into a database, but only the ones that are not imported yet

Therefore, the client uses the local state db in the metadir.

First, synchronize remote metadata db to local

Then, update the local state db from the remote metadata, via the command line:

MMMETA=./path/to/metadir mmmeta update

or programmatically:

from mmmeta import mmmeta

m = mmmeta("./path/to/metadir/")
m.update()

After that, remote metadata and local state are merged and easily usable like this:

for file in m.files.find(retrieved_at=<date>, imported=False):
    process_import(file)
    file["imported"] = True
    file.save()

The files object on a metadir is a wrapper around a dataset table with all of its functionality, except that it yields File objects that you can use to alter the state of the files in the database, as described in the example above.
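
Since files wraps a dataset table, dataset-style query arguments such as order_by and _limit should work as well (assuming the wrapper forwards them); a sketch using column names from the examples above:

from mmmeta import mmmeta

m = mmmeta("./path/to/metadir/")

# the 10 most recently retrieved contracts that are not imported yet
for file in m.files.find(
    document_type="contract",
    imported=False,
    order_by="-retrieved_at",
    _limit=10,
):
    print(file["_file_name"])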

Initialization

On the client:

When mmmeta is initialized with a path, the directory path/_mmmeta will be the metadir

path can be set via env var:

MMMETA=./path/ mmmeta update

or in scripts:

from mmmeta import mmmeta

m = mmmeta("./path/")

On the remote:

Same as on the client, but the metadata files are looked up recursively inside path, unless a different location is specified via the env var MMMETA_FILES_ROOT.

This means, on the remote the metadata files and the metadir don’t need to be in the same path location.

Or, speaking of clouds: metadir and actual files can exist in different buckets.
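
For example, to generate the metadir in one location from metadata files in another (both paths are illustrative):

MMMETA=./metadir_bucket MMMETA_FILES_ROOT=./files_bucket mmmeta generate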

Synchronization

This package is totally agnostic about the remote storage backend (could be a local filesystem location or cloud storage) and doesn’t handle any of the local <-> remote synchronization.

The synchronization of the metadir ./foo/_mmmeta is therefore up to you, with the tool of your choice.
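
For an s3 remote, this could be done with, e.g., the AWS CLI (bucket name and paths are illustrative):

# pull the remote metadir to the client
aws s3 sync s3://my_bucket/foo/_mmmeta ./foo/_mmmeta

# push an updated metadir back to the remote (server only)
aws s3 sync ./foo/_mmmeta s3://my_bucket/foo/_mmmeta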

Config

mmmeta can optionally have a config stored in ./foo/_mmmeta/config.yml

Example (all settings are optional):

metadata:
  file_name: _file_name  # key in json metadata for file name
  include:  # only include these keys from json metadata in meta db
  - reference
  - modified_at
  - title
  - originators
  - publisher:name  # nested keys are flattened with ":" between them
  unique: content_hash  # unique identifier for files
remote:  # simple string replacement to generate `File.remote.<attr>` attributes, like:
  url: https://my_bucket.s3.eu-central-1.amazonaws.com/foo/bar/{_file_name}
  uri: s3://my_bucket/foo/bar/{_file_name}

remote

The configuration section remote from above ensures that the file objects have attributes to access the actual files from the remote:

from mmmeta import mmmeta

m = mmmeta()

for file in m.files:
    print(file.remote.uri)
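
With the example config from above, a file whose _file_name is contract_2021.pdf (an illustrative value) would resolve to:

file.remote.url  # https://my_bucket.s3.eu-central-1.amazonaws.com/foo/bar/contract_2021.pdf
file.remote.uri  # s3://my_bucket/foo/bar/contract_2021.pdf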

Store

mmmeta ships with a simple key-value store that can be used by both the remote and the client to store additional data. The store lives in the metadir ./foo/_mmmeta/_store

You can store any values in it:

from mmmeta import mmmeta

m = mmmeta("./path/to/metadir/")
m.store["new_files"] = 17

Any machine that synchronizes the metadir can read these values:

from mmmeta import mmmeta

m = mmmeta("./path/to/metadir/")
new_files = m.store["new_files"]  # 17

For storing timestamps, there is a shorthand via the touch function:

m.touch("my_ts_key")

This will save the value of the current datetime.now() to the key my_ts_key. The values are typed (int, float or timestamp), so you can easily do something like this:

from mmmeta import mmmeta

m = mmmeta("./path/to/metadir/")

if m.store["remote_last_updated"] > m.store["local_last_updated"]:
    # run scraper
    ...
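
For example, the server could touch a timestamp after regenerating the metadir, and the client after each update, so that the comparison above works; the key names follow the snippet above, and m is a metadir handle as before:

# on the server, after `mmmeta generate`:
m.touch("remote_last_updated")

# on the client, after m.update():
m.touch("local_last_updated")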

Installation

Requires python3. Using a virtualenv is recommended.

Additional dependencies will be installed automatically:

pip install mmmeta

After this, you should be able to execute in your terminal:

mmmeta --help

You should also be able to import it in your python scripts:

from mmmeta import mmmeta

cli

Usage: mmmeta [OPTIONS] COMMAND [ARGS]...

Options:
  --metadir TEXT     Base path for reading meta info and storing state
                     [default: <current/working/dir>]
  --files-root TEXT  Base path for actual files to generate metadir from
                     [default: <current/working/dir>]
  --help             Show this message and exit.

Commands:
  generate
  inspect
  update
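
Options go before the command, matching the usage line above; for example (the path is illustrative):

mmmeta --metadir ./path/to/dir update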

development

Install testing requirements:

make install

Test:

make test
