Recap is a tiny metadata service built for engineers.
Recap crawls and serves metadata
There are a ton of data catalogs out there, but most are complex, bloated, and built for a wide audience. Recap is the opposite of that; it's tiny and built specifically for engineers.
You can use Recap to build data tools for:
- Observability
- Monitoring
- Debugging
- Security
- Compliance
- Governance
- Cost
Principles
Recap was designed to be small and flexible.
- Lightweight: Recap starts with a single command and doesn't require other infrastructure (not even Docker).
- CLI-first: Recap doesn't even have a GUI!
- RESTful: Recap comes with a REST API to ease integration with outside tools.
- Automated: Recap is not meant for manual taxonomy curation.
- Modular: Recap's storage and crawling layers are fully pluggable.
- Programmable: Recap is a Python library, so you can invoke it directly from your code.
Integrations
Recap uses SQLAlchemy to access databases, so any SQLAlchemy dialect should work. Recap has been tested with:
- Snowflake
- BigQuery
- PostgreSQL
Stream and filesystem crawling is in the works.
Usage
Quickstart
Start by installing Recap. Python 3.9 or above is required.
pip install recap-core
Now let's crawl a database:
recap refresh postgresql://username@localhost/some_db
You can use any SQLAlchemy connect string.
recap refresh bigquery://some-project-12345
recap refresh snowflake://some_user:some_pass@some_account_id
For Snowflake and BigQuery, you'll have to pip install snowflake-sqlalchemy or pip install sqlalchemy-bigquery, respectively.
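Connect strings follow SQLAlchemy's standard URL format. If a password contains characters like @ or /, percent-encode it first so the URL still parses. A small stdlib sketch (the pg_url helper is ours for illustration, not part of Recap):

```python
from urllib.parse import quote_plus

def pg_url(user: str, password: str, host: str, database: str) -> str:
    """Build a PostgreSQL SQLAlchemy URL, percent-encoding the password
    so characters like '@' or '/' don't break URL parsing."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}/{database}"

print(pg_url("username", "p@ss/word", "localhost", "some_db"))
# postgresql://username:p%40ss%2Fword@localhost/some_db
```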
Crawled metadata is stored in a directory structure. See what's available using:
recap list /
Recap will respond with a JSON list:
[
"databases"
]
Append children to the list path to browse around:
recap list /databases
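The catalog paths follow a predictable directory layout. A tiny helper that composes a table path (the layout is inferred from the examples in this README; table_path is not a Recap function):

```python
def table_path(database: str, instance: str, schema: str, table: str) -> str:
    """Compose a Recap catalog path for a table, following the
    /databases/.../instances/.../schemas/.../tables/... layout
    shown in the examples (inferred, not an official API)."""
    return (
        f"/databases/{database}/instances/{instance}"
        f"/schemas/{schema}/tables/{table}"
    )

print(table_path("postgresql", "localhost", "some_db", "some_table"))
# /databases/postgresql/instances/localhost/schemas/some_db/tables/some_table
```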
After you poke around, try reading some metadata. Every node in the path can have metadata, but right now only table and view nodes actually contain any. You can look at metadata using the recap read command:
recap read /databases/postgresql/instances/localhost/schemas/some_db/tables/some_table
Recap will print all of some_table's metadata to the CLI in JSON format:
{
"access": {
"username": {
"privileges": [
"INSERT",
"SELECT",
"UPDATE",
"DELETE",
"TRUNCATE",
"REFERENCES",
"TRIGGER"
],
"read": true,
"write": true
}
},
"columns": {
"email": {
"autoincrement": false,
"default": null,
"generic_type": "VARCHAR",
"nullable": false,
"type": "VARCHAR"
},
"id": {
"autoincrement": true,
"default": "nextval('\"some_db\".some_table_id_seq'::regclass)",
"generic_type": "BIGINT",
"nullable": false,
"type": "BIGINT"
}
},
"data_profile": {
"email": {
"count": 10,
"distinct": 10,
"empty_strings": 0,
"max_length": 32,
"min_length": 13,
"nulls": 0
},
"id": {
"average": 5.5,
"count": 10,
"max": 10,
"min": 1,
"negatives": 0,
"nulls": 0,
"sum": 55.0,
"zeros": 0
}
},
"indexes": {
"index_some_table_on_email": {
"columns": [
"email"
],
"unique": false
}
},
"location": {
"database": "postgresql",
"instance": "localhost",
"schema": "some_db",
"table": "some_table"
},
"primary_key": {
"constrained_columns": [
"id"
],
"name": "some_table_pkey"
}
}
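Since the output is plain JSON, it's easy to post-process with the standard library. For example, pulling each column's generic type out of the metadata shown above (the snippet embeds a trimmed copy of that JSON):

```python
import json

# A trimmed snippet of the metadata that `recap read` prints, as shown above.
metadata = json.loads("""
{
  "columns": {
    "email": {"generic_type": "VARCHAR", "nullable": false},
    "id": {"generic_type": "BIGINT", "nullable": false}
  }
}
""")

# Map each column name to its generic type.
types = {name: col["generic_type"] for name, col in metadata["columns"].items()}
print(types)  # {'email': 'VARCHAR', 'id': 'BIGINT'}
```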
You can search for metadata, too. Recap stores its metadata in DuckDB by default, so you can use DuckDB's JSON path syntax to search the catalog:
recap search "metadata->'$.location'->>'$.table' = 'some_table'"
If you wish to open a DuckDB client directly, the database file defaults to ~/.recap/catalog/recap.duckdb.
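The JSON path predicate above drills into metadata.location.table and compares it to a string. In plain Python terms it behaves like this filter (the sample rows are illustrative, not Recap's actual storage schema):

```python
# Illustrative rows shaped like {"metadata": {...}} documents.
rows = [
    {"metadata": {"location": {"schema": "some_db", "table": "some_table"}}},
    {"metadata": {"location": {"schema": "some_db", "table": "other_table"}}},
]

# Equivalent of: metadata->'$.location'->>'$.table' = 'some_table'
matches = [r for r in rows if r["metadata"]["location"]["table"] == "some_table"]
print(len(matches))  # 1
```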
API
Server
Recap comes with an API out of the box. You can start it with:
recap api
A uvicorn server will bind to http://localhost:8000 by default. You can browse the API endpoints at http://localhost:8000/docs.
You can pass custom uvicorn configuration by creating ~/.recap/settings.toml and setting parameters under the api namespace like:
api.host = "0.0.0.0"
Client
You can use Recap's CLI to query a remote Recap API server if you wish. Set catalog.url in settings.toml to point to your Recap API location:
catalog.url = "http://localhost:8000"
Configuration
Recap uses Dynaconf to manage configuration. By default, Recap can see anything you put into ~/.recap/settings.toml.
You can customize your settings.toml location using the SETTINGS_FILE_FOR_DYNACONF environment variable:
SETTINGS_FILE_FOR_DYNACONF=/tmp/api.toml recap list
You can also configure your catalog and crawlers in settings.toml. Here's an example:
[api]
host = "0.0.0.0"
[catalog]
url = "file:///tmp/recap.duckdb"
[[crawlers]]
url = "bigquery://some-project-12345"
[[crawlers]]
url = "postgresql://username@localhost/some_db"
[[crawlers]]
url = "snowflake://some_user:some_pass@some_account_id"
Warning
Recap is still a little baby application. It's going to wake up crying in the middle of the night. It's going to vomit on the floor once in a while. But if you give it some love and care, it'll be worth it. As time goes on, it'll grow up and be more mature. Bear with it.