Skip to main content

Python library for datannur catalog metadata management

Project description

datannurpy

PyPI version Python CI codecov License: MIT

Python library for datannur catalog metadata management.

Supported formats

A lightweight catalog compatible with most data sources:

Category Formats
Flat files CSV, Excel (.xlsx, .xls)
Columnar Parquet, Delta Lake, Apache Iceberg, Hive partitioned
Statistical SAS (.sas7bdat), SPSS (.sav), Stata (.dta)
Databases PostgreSQL, MySQL, Oracle, SQL Server, SQLite, DuckDB

All formats support automatic schema inference and statistics computation.

Installation

pip install datannurpy

Optional extras

# Databases
pip install datannurpy[postgres]  # PostgreSQL
pip install datannurpy[mysql]     # MySQL
pip install datannurpy[oracle]    # Oracle
pip install datannurpy[mssql]     # SQL Server

# File formats
pip install datannurpy[stat]      # SAS, SPSS, Stata
pip install datannurpy[delta]     # Delta Lake metadata extraction
pip install datannurpy[iceberg]   # Apache Iceberg metadata extraction

# Cloud storage
pip install datannurpy[s3]        # Amazon S3
pip install datannurpy[azure]     # Azure Blob Storage
pip install datannurpy[gcs]       # Google Cloud Storage
pip install datannurpy[cloud]     # All cloud providers

# Multiple extras
pip install datannurpy[postgres,stat,delta]

Note: The iceberg, s3, azure, gcs, and cloud extras require Python 3.10+.

SQL Server note: Requires an ODBC driver on the system:

  • macOS: brew install unixodbc freetds
  • Linux: apt install unixodbc-dev tdsodbc
  • Windows: Microsoft ODBC Driver

Quick start

from datannurpy import Catalog

catalog = Catalog()
catalog.add_folder("./data", include=["*.csv", "*.xlsx", "*.parquet"])
catalog.add_database("sqlite:///mydb.sqlite")
catalog.export_app("./my-catalog", open_browser=True)

Scan depth

The depth parameter controls how much metadata is extracted. Set it globally on Catalog or per-method call:

catalog = Catalog(depth="variable")         # global default
catalog.add_folder("./data")                # inherits "variable"
catalog.add_folder("./big", depth="stat")   # override for this call
catalog.add_database("sqlite:///db.sqlite", depth="dataset")
Feature dataset variable stat value (default)
Folders
Datasets (format, path, mtime)
Variables (names, types)
DB introspection (PK, FK, comments)
Row count, statistics
Modalities, frequency tables

Note: At depth="variable", CSV and Excel files only extract column names (types require reading data, available from depth="stat"). All other formats provide types at this level.

Typical use cases:

  • dataset — quick inventory of available files/tables without reading data
  • variable — lightweight schema discovery (column names and types)
  • stat — data profiling without modality detection (faster than value)
  • value — full catalog with frequency tables and modality assignment (default)

Configuration file

Alternative to Python scripts - define the catalog in YAML:

# catalog.yml
app_path: ./my-catalog
refresh: true
verbose: true          # show full tracebacks on errors
log_file: ./errors.log  # write tracebacks to file

add:
  - type: folder
    path: ./data
    include: ["*.csv", "*.parquet"]

  - type: dataset
    path: ./data/sales.parquet
    name: Sales Data
    description: Monthly sales figures

  - type: database
    uri: sqlite:///mydb.sqlite

  - type: metadata
    path: ./metadata

export_app:
  open_browser: true

Environment variables ($VAR or ${VAR}) are expanded in all values. Define reusable values with env:, load secrets from env_file, or place a .env next to the YAML:

env:
  data_dir: /shared/data
  db_host: db.example.com
env_file: /secure/path/.env # secrets: DB_USER, DB_PASSWORD

add:
  - type: folder
    path: ${data_dir}/sales
  - type: folder
    path: ${data_dir}/hr
  - type: database
    uri: oracle://${DB_USER}:${DB_PASSWORD}@${db_host}:1521/ORCL

SSH tunnel support in YAML:

add:
  - type: database
    uri: mysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}/${DB_NAME}
    ssh_tunnel:
      host: ${SSH_HOST}
      user: ${SSH_USER}

Priority: system env vars > env_file / .env > env: section.

Run with:

python -m datannurpy catalog.yml
python -m datannurpy --help     # show usage
python -m datannurpy --version  # show version

Incremental scan

Re-run with the same app_path to only rescan changed files (compares mtime) or tables (compares schema + row count):

catalog = Catalog(app_path="./my-catalog")
catalog.add_folder("./data")  # skips unchanged files
catalog.export_app()          # removes deleted entities, exports

Use refresh=True to force a full rescan.

Evolution tracking

Changes between exports are automatically tracked in evolution.json:

  • add: new folder, dataset, variable, modality, etc.
  • update: modified field (shows old and new value)
  • delete: removed entity

Cascade filtering: when a parent entity is added or deleted, its children are automatically filtered out to reduce noise. For example, adding a new dataset won't generate separate entries for each variable.

Disable tracking with track_evolution=False:

catalog.export_db(track_evolution=False)

Scanning files

from datannurpy import Catalog, Folder

catalog = Catalog()

# Scan a folder (CSV, Excel, SAS)
catalog.add_folder("./data")

# With custom folder metadata
catalog.add_folder("./data", Folder(id="prod", name="Production"))

# With filtering options
catalog.add_folder(
    "./data",
    include=["*.csv", "*.xlsx"],
    exclude=["**/tmp/**"],
    recursive=True,
    csv_encoding="utf-8",  # or "cp1252", "iso-8859-1" (auto-detected by default)
)

# Multiple folders with shared options
catalog.add_folder(["./data/sales", "./data/hr"], include=["*.csv"])

# Add a single file
catalog.add_dataset("./data/sales.csv")

# Multiple files
catalog.add_dataset(["./data/sales.csv", "./data/products.csv"])

Parquet formats

Supports simple Parquet files and partitioned datasets (Delta, Hive, Iceberg):

# add_folder auto-detects all formats
catalog.add_folder("./data")  # scans *.parquet + Delta/Hive/Iceberg directories

# add_dataset for a single partitioned directory with metadata override
catalog.add_dataset(
    "./data/sales_delta",
    name="Sales Data",
    description="Monthly sales",
    folder=Folder(id="sales", name="Sales"),
)

With extras [delta] and [iceberg], metadata (name, description, column docs) is extracted when available.

Remote storage

Scan files on SFTP servers or cloud storage (S3, Azure, GCS):

from datannurpy import Catalog, Folder

catalog = Catalog()

# SFTP (paramiko included by default)
catalog.add_folder(
    "sftp://user@host/path/to/data",
    storage_options={"password": "secret"},  # or key_filename="/path/to/key"
)

# Amazon S3 (requires: pip install datannurpy[s3])
catalog.add_folder(
    "s3://my-bucket/data",
    storage_options={"key": "...", "secret": "..."},
)

# Azure Blob (requires: pip install datannurpy[azure])
catalog.add_folder(
    "az://container/data",
    storage_options={"account_name": "...", "account_key": "..."},
)

# Google Cloud Storage (requires: pip install datannurpy[gcs])
catalog.add_folder(
    "gs://my-bucket/data",
    storage_options={"token": "/path/to/credentials.json"},
)

# Single remote file
catalog.add_dataset("s3://my-bucket/data/sales.parquet", storage_options={...})

# Remote SQLite / GeoPackage database
catalog.add_database("sftp://host/path/to/db.sqlite", storage_options={...})
catalog.add_database("s3://bucket/geodata.gpkg", storage_options={...})

The storage_options dict is passed directly to fsspec. See provider documentation for available options:

Scanning databases

# SQLite / GeoPackage
catalog.add_database("sqlite:///path/to/db.sqlite")
catalog.add_database("sqlite:///path/to/geodata.gpkg")  # GeoPackage is SQLite

# PostgreSQL / MySQL / Oracle / SQL Server
catalog.add_database("postgresql://user:pass@host:5432/mydb")
catalog.add_database("mysql://user:pass@host:3306/mydb")
catalog.add_database("oracle://user:pass@host:1521/service_name")
catalog.add_database("mssql://user:pass@host:1433/mydb")

# SSL/TLS connections
catalog.add_database("postgresql://user:pass@host/db?sslmode=require")

# SQL Server with Windows auth (requires proper Kerberos setup)
catalog.add_database("mssql://host/db?TrustedConnection=yes")

# With options
catalog.add_database(
    "postgresql://localhost/mydb",
    schema="public",
    include=["sales_*"],
    exclude=["*_tmp"],
    sample_size=10000,         # override catalog default (100_000)
    group_by_prefix=True,  # group tables by common prefix (default)
    prefix_min_tables=2,  # minimum tables to form a group
)

# Multiple schemas with shared options
catalog.add_database(
    "postgresql://localhost/mydb",
    schema=["public", "sales", "hr"],
)

# SSH tunnel (for databases behind a firewall)
catalog.add_database(
    "mysql://user:pass@dbhost/mydb",
    ssh_tunnel={"host": "ssh.example.com", "user": "sshuser"},
)
# Also supports: port, password, key_file
catalog.add_database(
    "postgresql://user:pass@dbhost/mydb",
    ssh_tunnel={
        "host": "bastion.example.com",
        "port": 2222,
        "user": "admin",
        "key_file": "~/.ssh/id_rsa",
    },
)

Database metadata enrichment (requires depth="variable" or higher):

Metadata Target field Backends
Primary keys Variable.key All 6
Foreign keys Variable.fk_var_id All 6
Table/column comments description All except SQLite
NOT NULL, UNIQUE, INDEX Auto tags (db---*) All 6
Auto-increment Auto tag All 6

This metadata is always refreshed, even when table data is unchanged (cache hit).

Sampling

By default, Catalog sets sample_size=100_000. All methods (add_folder, add_dataset, add_database) inherit this value. Override per-method with an explicit int, or pass sample_size=None to disable sampling for a single call:

catalog = Catalog(sample_size=100_000)              # default
catalog.add_folder("./data")                        # inherits 100_000
catalog.add_folder("./small", sample_size=None)     # no sampling
catalog.add_database("postgresql://localhost/mydb", sample_size=50_000)  # override

To disable sampling globally, pass sample_size=None to the Catalog:

catalog = Catalog(sample_size=None)

When a dataset has more rows than sample_size, a uniform random sample is used for frequency counts and modality detection. All other statistics (nb_row, nb_missing, nb_distinct, min, max, mean, std) are computed on the full dataset.

The actual number of sampled rows is recorded in Dataset.sample_size (None when no sampling was applied).

CSV options

Use csv_skip_copy=True on the Catalog to avoid the UTF-8 temp copy when files are already local and UTF-8 (auto-fallback if encoding detection fails):

catalog = Catalog(app_path="./output", csv_skip_copy=True)

Manual metadata

Load manually curated metadata from files or a database:

# Load from a folder containing metadata files
catalog.add_metadata("./metadata")

# Load from a database
catalog.add_metadata("sqlite:///metadata.db")

Can be used alone or combined with auto-scanned metadata (add_folder, add_database).

Expected structure: One file/table per entity, named after the entity type:

metadata/
├── variable.csv      # Variables (descriptions, tags...)
├── dataset.xlsx      # Datasets
├── institution.json  # Institutions (owners, managers)
├── tag.csv           # Tags
├── modality.csv      # Modalities
├── value.csv         # Modality values
└── ...

Supported formats: CSV, Excel (.xlsx), JSON, SAS (.sas7bdat), or database tables.

File format: Standard tabular structure following datannur schemas.

# variable.csv
id,description,tag_ids
source---employees_csv---salary,"Monthly gross salary in euros","finance,hr"
source---employees_csv---department,"Department code","hr"

Merge behavior:

  • Existing entities are updated (manual values override auto-scanned values)
  • New entities are created
  • List fields (tag_ids, doc_ids, etc.) are merged

Ordering: add_metadata should be called after add_folder, add_dataset, and add_database so manual values take precedence over auto-scanned metadata. In YAML configuration, this is enforced automatically regardless of declaration order. In Python, the caller controls the execution order.

Helper functions for building IDs in preprocessing scripts:

from datannurpy import sanitize_id, build_dataset_id, build_variable_id

sanitize_id("My File (v2)")  # → "My_File_v2"
build_dataset_id("source", "employees_csv")  # → "source---employees_csv"
build_variable_id("source", "employees_csv", "salary")  # → "source---employees_csv---salary"

Output

# JSON metadata only (for existing datannur instance)
catalog.export_db("./output")

# Complete standalone app
catalog.export_app("./my-catalog", open_browser=True)

API Reference

Catalog

Catalog(app_path=None, depth="value", refresh=False, freq_threshold=100, csv_encoding=None, sample_size=100_000, csv_skip_copy=False, app_config=None, quiet=False, verbose=False, log_file=None)
Attribute Type Description
app_path str | None Load existing catalog for incremental scan
depth "dataset" | "variable" | "stat" | "value" Default scan depth for add_folder
refresh bool Force full rescan ignoring cache (default: False)
freq_threshold int Max distinct values for modality detection
csv_encoding str | None Default CSV encoding (utf-8, cp1252, etc.)
sample_size int | None Default sample size for stats (default: 100_000)
csv_skip_copy bool Skip UTF-8 temp copy for local CSV (default: False)
app_config dict[str, str] | None Key-value config for the web app (see below)
quiet bool Suppress progress logging (default: False)
verbose bool Show full tracebacks on errors (default: False)
log_file str | Path | None Write error tracebacks to file (truncated each run)
folder Table[Folder] Folder table (.all(), .count, .get_by(...))
dataset Table[Dataset] Dataset table
variable Table[Variable] Variable table
modality Table[Modality] Modality table
value Table[Value] Modality value table
freq Table[Freq] Frequency table (computed)
institution Table[Institution] Institution table
tag Table[Tag] Tag table
doc Table[Doc] Document table

Catalog.add_folder()

catalog.add_folder(
    path,
    folder=None,
    *,
    depth=None,
    include=None,
    exclude=None,
    recursive=True,
    csv_encoding=None,
    sample_size=None,
    csv_skip_copy=None,
    storage_options=None,
    refresh=None,
    quiet=None,
    time_series=True,
)
Parameter Type Default Description
path str | Path | list[str | Path] required Directory or list of directories to scan
folder Folder | None None Custom folder metadata
depth "dataset" | "variable" | "stat" | "value" | None None Scan depth (uses catalog.depth if None)
include list[str] | None None Glob patterns to include
exclude list[str] | None None Glob patterns to exclude
recursive bool True Scan subdirectories
csv_encoding str | None None Override CSV encoding
sample_size int | None None Sample rows for stats (overrides catalog)
csv_skip_copy bool | None None Skip UTF-8 temp copy (overrides catalog)
storage_options dict | None None Options for remote storage (passed to fsspec)
refresh bool | None None Force rescan (overrides catalog setting)
quiet bool | None None Override catalog quiet setting
time_series bool True Group files with temporal patterns

Time series detection:

When time_series=True (default), files with temporal patterns in their names are automatically grouped into a single dataset:

data/
├── enquete_2020.csv    ─┐
├── enquete_2021.csv     ├─→ Single dataset "enquete" with nb_resources=3
├── enquete_2022.csv    ─┘
└── reference.csv       ─→ Separate dataset "reference"

Detected patterns: year (2024), quarter (2024Q1, 2024T2), month (2024-03, 202403), date (2024-03-15).

The resulting dataset includes:

  • nb_resources: number of resources in the series
  • start_date / end_date: temporal coverage
  • Variables track their own start_date / end_date based on presence across periods

Set time_series=False to treat each file as a separate dataset.

Catalog.add_dataset()

catalog.add_dataset(
    path,
    folder=None,
    *,
    folder_id=None,
    depth=None,
    csv_encoding=None,
    sample_size=None,
    csv_skip_copy=None,
    storage_options=None,
    refresh=None,
    quiet=None,
    name=None,
    description=None,
    ...,
)
Parameter Type Default Description
path str | Path | list[str | Path] required File(s) or partitioned directory (local/remote)
folder Folder | None None Parent folder
folder_id str | None None Parent folder ID (alternative to folder)
depth "dataset" | "variable" | "stat" | "value" | None None Scan depth (uses catalog.depth if None)
csv_encoding str | None None Override CSV encoding
sample_size int | None None Sample rows for stats (overrides catalog)
csv_skip_copy bool | None None Skip UTF-8 temp copy (overrides catalog)
storage_options dict | None None Options for remote storage (passed to fsspec)
refresh bool | None None Force rescan (overrides catalog setting)
quiet bool | None None Override catalog quiet setting
name str | None None Override dataset name
description str | None None Override dataset description

Additional metadata parameters: type, link, localisation, manager_id, owner_id, tag_ids, doc_ids, start_date, end_date, updating_each, no_more_update

Catalog.add_database()

catalog.add_database(
    connection,
    folder=None,
    *,
    depth=None,
    schema=None,
    include=None,
    exclude=None,
    sample_size=None,
    group_by_prefix=True,
    prefix_min_tables=2,
    time_series=True,
    storage_options=None,
    refresh=None,
    quiet=None,
    oracle_client_path=None,
    ssh_tunnel=None,
)
Parameter Type Default Description
connection str | ibis.BaseBackend required Connection string or ibis backend object
folder Folder | None None Custom root folder
depth "dataset" | "variable" | "stat" | "value" | None None Scan depth (uses catalog.depth if None)
schema str | list[str] | None None Schema(s) to scan
include list[str] | None None Table name patterns to include
exclude list[str] | None None Table name patterns to exclude
sample_size int | None None Sample rows for stats (overrides catalog)
group_by_prefix bool | str True Group tables by prefix into subfolders
prefix_min_tables int 2 Min tables to form a prefix group
time_series bool True Detect temporal table patterns
storage_options dict | None None Options for remote SQLite/GeoPackage
refresh bool | None None Force rescan (overrides catalog setting)
quiet bool | None None Override catalog quiet setting
oracle_client_path str | None None Path to Oracle Instant Client libraries
ssh_tunnel dict | None None SSH tunnel config (host, user, port, etc.)

Connection string formats:

  • SQLite: sqlite:///path/to/db.sqlite or sftp://host/path/db.sqlite (remote)
  • PostgreSQL: postgresql://user:pass@host:5432/database
  • MySQL: mysql://user:pass@host:3306/database
  • Oracle: oracle://user:pass@host:1521/service_name
  • SQL Server: mssql://user:pass@host:1433/database

Catalog.add_metadata()

catalog.add_metadata(path, depth=None, quiet=None)
Parameter Type Default Description
path str | Path required Folder or database containing metadata files
depth "dataset" | "variable" | "stat" | "value" | None None Filter which entities to load
quiet bool | None None Override catalog quiet setting

Supported entity files/tables: all catalog entities. The id column is not required for value and freq (composite key computed automatically).

Catalog.export_db()

catalog.export_db(output_dir=None, track_evolution=True, quiet=None)
Parameter Type Default Description
output_dir str | Path | None None Output directory (uses app_path if None)
track_evolution bool True Track changes between exports
quiet bool | None None Override catalog quiet setting

Exports JSON metadata files. Calls finalize() automatically when data has been scanned.

Catalog.finalize()

catalog.finalize()

Removes entities no longer seen during scan. Called automatically by export_db()/export_app().

Catalog.export_app()

catalog.export_app(output_dir=None, open_browser=False, quiet=None)

Exports complete standalone datannur app with data. Uses app_path by default if set at init.

app_config

Pass a dictionary to Catalog(app_config={...}) to configure the web app. Entries are written as config.json in the database directory.

catalog = Catalog(
    app_path="./my-catalog",
    app_config={
        "contact_email": "contact@example.com",
        "more_info": "Data from [open data portal](https://example.com).",
    },
)

If app_config is not provided, no config.json is generated.

Folder

Folder(id, name=None, description=None, parent_id=None, type=None, data_path=None)
Parameter Type Description
id str Unique identifier
name str | None Display name
description str | None Description
parent_id str | None Parent folder ID

ID helpers

from datannurpy import sanitize_id, build_dataset_id, build_variable_id
Function Description Example
sanitize_id(s) Clean string for use as ID "My File (v2)" → "My_File_v2"
build_dataset_id(folder_id, dataset_name) Build dataset ID ("src", "sales") → "src---sales"
build_variable_id(folder_id, dataset_name, var) Build variable ID ("src", "sales", "amount") → "src---sales---amount"

License

MIT License - see the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datannurpy-0.15.0.tar.gz (7.5 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

datannurpy-0.15.0-py3-none-any.whl (7.6 MB view details)

Uploaded Python 3

File details

Details for the file datannurpy-0.15.0.tar.gz.

File metadata

  • Download URL: datannurpy-0.15.0.tar.gz
  • Upload date:
  • Size: 7.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for datannurpy-0.15.0.tar.gz
Algorithm Hash digest
SHA256 fffd682be447a1f6a7845586fcd84f77a73f2e2a8a1fe68fce760a0a0d85c6a6
MD5 d83472bdc01d5cbc1f588499d08a94af
BLAKE2b-256 1b1f1b12bcd30b6461da78cb102289f4619596c8fa31debec3c75b0f78f0ea1f

See more details on using hashes here.

Provenance

The following attestation bundles were made for datannurpy-0.15.0.tar.gz:

Publisher: release.yml on datannur/datannurpy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file datannurpy-0.15.0-py3-none-any.whl.

File metadata

  • Download URL: datannurpy-0.15.0-py3-none-any.whl
  • Upload date:
  • Size: 7.6 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for datannurpy-0.15.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6d9f90590a1c835e91c6b46fe6557db57dca077874e58789d2516dfe5badf0e8
MD5 d568991741a32e536cd4f0b4d43e07b6
BLAKE2b-256 24f69226fdf62bc26249a32b3dcd304815166cbfd404686d70693d8d5ee0a145

See more details on using hashes here.

Provenance

The following attestation bundles were made for datannurpy-0.15.0-py3-none-any.whl:

Publisher: release.yml on datannur/datannurpy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page