
Data profiling with spatial column type annotation.


atlas-profiler

License: MIT. Python 3.10+.

Atlas Profiler is a dataset profiling library. Given a CSV/TSV, file-like object, or pandas DataFrame, it returns JSON-style metadata about the dataset, its columns, detected types, value ranges, optional plots, spatial/temporal coverage, and profiling runtime.

The package builds on the Datamart Profiler workflow and adds an ML-assisted spatial column classifier. That classifier is only one part of the profiler: non-spatial columns still go through the core rule-based type detection, statistics, plots, coverage, and dataset-summary pipeline.

What It Produces

process_dataset(...) returns a metadata dictionary with fields such as:

  • Dataset size, row count, profiled row count, and column count.
  • Per-column structural type, semantic types, missing/unclean value ratios, distinct counts, and optional plots.
  • Dataset-level type summary: numerical, categorical, spatial, and temporal.
  • Spatial coverage from lat/long pairs, WKT points, resolved addresses, and administrative areas.
  • Temporal coverage and temporal resolution for datetime columns.
  • Attribute keywords derived from column names.
  • Optional random sample rows and per-step profiling timings.
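As a rough illustration of how a caller might consume such a dictionary, the sketch below summarizes a hand-written sample. The key names (`nb_rows`, `columns`, `name`, `structural_type`, `semantic_types`) are assumptions modeled on Datamart Profiler-style output, not confirmed Atlas Profiler field names; inspect real output before relying on them.

```python
from collections import Counter

# Hand-written sample shaped like profiler output. Every key name here is an
# assumption for illustration, not a confirmed Atlas Profiler field name.
sample_metadata = {
    "nb_rows": 3,
    "columns": [
        {"name": "lat", "structural_type": "Float", "semantic_types": ["latitude"]},
        {"name": "lon", "structural_type": "Float", "semantic_types": ["longitude"]},
        {"name": "opened", "structural_type": "Text", "semantic_types": ["DateTime"]},
    ],
}


def summarize(metadata: dict) -> dict:
    """Count columns per structural type and group column names by semantic type."""
    columns = metadata.get("columns", [])
    by_semantic: dict[str, list[str]] = {}
    for column in columns:
        for semantic in column.get("semantic_types", []):
            by_semantic.setdefault(semantic, []).append(column["name"])
    return {
        "rows": metadata.get("nb_rows"),
        "structural_counts": dict(Counter(c["structural_type"] for c in columns)),
        "columns_by_semantic_type": by_semantic,
    }


summary = summarize(sample_metadata)
print(summary["structural_counts"])
```

The same function could be pointed at the return value of process_dataset once the actual field names have been checked.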

Core Type System

The profiler detects broad structural types for all columns:

| Structural type | Meaning |
|---|---|
| MissingData | Empty column. |
| Integer | Integer-like values. |
| Float | Floating point values. |
| Text | String/text values. |
| Boolean | Boolean-like values such as true/false, yes/no, 0/1. |
| GeoCoordinates | Point geometry or coordinate-pair strings. |
| GeoShape | Polygon-like geometry. |
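To make the non-geometric structural types concrete, here is a minimal rule-based sketch. It is not the profiler's detector: the function name `guess_structural_type`, the token set, and the all-values-must-match rule are assumptions for illustration, and the geometry types are left out.

```python
# Tokens treated as boolean-like, following the examples in the table above.
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "0", "1"}


def _parses_as(caster, text: str) -> bool:
    """Return True if caster(text) succeeds (caster is int or float)."""
    try:
        caster(text)
        return True
    except ValueError:
        return False


def guess_structural_type(values: list[str]) -> str:
    """Guess one structural type name for a column of raw string values.

    Simplification: every non-empty value must match. Python's int()/float()
    also accept forms like "1_000" or "nan", so a real detector would be stricter.
    """
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    if not cleaned:
        return "MissingData"
    if all(v.lower() in BOOLEAN_TOKENS for v in cleaned):
        return "Boolean"
    if all(_parses_as(int, v) for v in cleaned):
        return "Integer"
    if all(_parses_as(float, v) for v in cleaned):
        return "Float"
    return "Text"


print(guess_structural_type(["1.5", "2", ""]))
```

Checking booleans before integers is a design choice of this sketch: it means a column containing only 0 and 1 is reported as Boolean rather than Integer.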

It also annotates semantic types when evidence is available:

| Semantic type | Examples |
|---|---|
| DateTime | Dates, timestamps, and year columns. |
| latitude, longitude | Coordinate columns, paired after profiling. |
| address, AdministrativeArea | Address-like and admin-area text, optionally resolved with Nominatim or datamart_geo. |
| URL, FileName, identifier, Enumeration | URLs, file paths, IDs, and categorical columns. |

Spatial ML Classifier

When geo_classifier=True, Atlas Profiler creates a HybridGeoClassifier(GeoClassifier()). It samples values from each column, predicts spatial labels in one batch, validates sensitive predictions with rules, and passes accepted labels into the normal profiler type system.
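The rule-validation step can be pictured with a small sketch. The specific rules below (coordinate ranges, a five-digit ZIP pattern) and the names `RULES` and `validate_prediction` are illustrative assumptions; the source does not list which labels count as sensitive or which checks HybridGeoClassifier applies.

```python
import re

ZIP5_PATTERN = re.compile(r"^\d{5}$")


def _all_floats_within(values, low, high):
    """True if every value parses as a float inside [low, high]."""
    try:
        return all(low <= float(v) <= high for v in values)
    except ValueError:
        return False


# Hypothetical per-label checks; labels without an entry are not rule-checked.
RULES = {
    "latitude": lambda values: _all_floats_within(values, -90.0, 90.0),
    "longitude": lambda values: _all_floats_within(values, -180.0, 180.0),
    "zip5": lambda values: all(ZIP5_PATTERN.match(v.strip()) for v in values),
}


def validate_prediction(label: str, sampled_values: list[str]) -> bool:
    """Return True if a predicted label survives its rule on the sampled values."""
    rule = RULES.get(label)
    if rule is None:
        return True  # no rule registered for this label
    if not sampled_values:
        return False  # nothing to validate against
    return bool(rule(sampled_values))


print(validate_prediction("latitude", ["40.7", "140.2"]))
```

A rejected prediction would then be handled like a low-confidence one, with the column going through regular type detection.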

The classifier labels are not the full profiler type system. They are a spatial column type annotation (CTA) layer mapped into profiler structural and semantic types:

| Classifier label family | Mapped profiler behavior |
|---|---|
| latitude, longitude | Float columns with latitude/longitude semantic types, then paired for coverage. |
| x_coord, y_coord | Projected coordinate-like float columns. |
| point, line, polygon, multi-line, multi-polygon | Geometry columns mapped to point or shape structural types. |
| zip5, zip9, address | Text columns with address semantics. |
| borough, borough_code, city, state, state_code, country | Text columns with administrative-area semantics. |
| bbl, bin | NYC spatial identifiers mapped as integer identifiers. |
| non_spatial | Falls back to the core profiler's normal type detection. |

Manual column annotations take precedence over ML predictions. Low-confidence or rule-rejected ML predictions also fall back to the regular profiler workflow.
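The precedence described here (manual annotation, then an accepted ML prediction, then the regular detector) might be expressed as in the sketch below. `LABEL_MAP` covers only a few labels from the table above, and the tuple representation, the function name `choose_type_source`, and the `>=` comparison against the threshold are assumptions rather than the library's implementation.

```python
# A few classifier labels mapped to (structural type, semantic type).
# Simplified for illustration; the real mapping is richer.
LABEL_MAP = {
    "latitude": ("Float", "latitude"),
    "longitude": ("Float", "longitude"),
    "point": ("GeoCoordinates", None),
    "polygon": ("GeoShape", None),
    "zip5": ("Text", "address"),
    "city": ("Text", "AdministrativeArea"),
    "bbl": ("Integer", "identifier"),
}


def choose_type_source(manual, label, confidence, rule_ok, threshold=0.5):
    """Decide which source determines a column's type.

    Returns ("manual", annotation), ("ml", mapped types), or ("rules", None)
    when the column should go through the regular type detector.
    """
    if manual is not None:
        return ("manual", manual)  # manual annotations always win
    accepted = label in LABEL_MAP and confidence >= threshold and rule_ok
    if accepted:
        return ("ml", LABEL_MAP[label])
    return ("rules", None)  # non_spatial, low confidence, or rule-rejected


print(choose_type_source(None, "latitude", 0.9, True))
```

Because non_spatial has no entry in the mapping, it lands in the fallback branch without a special case.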

Pipeline

process_dataset runs the same high-level workflow for every dataset:

  1. Load data from a path, file object, or DataFrame.
  2. Compute cheap full-data stats and sample values for each column.
  3. Optionally run a single batch spatial ML prediction for all non-manual columns.
  4. Process every column with either an accepted geo prediction or the regular profiler type detector.
  5. Pair latitude/longitude columns and compute dataset-level type counts.
  6. Optionally compute numerical ranges, histograms, spatial coverage, temporal coverage, keywords, samples, and timing metadata.
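Step 5 can be illustrated with a name-based pairing sketch. The source does not describe how pairs are matched, so the approach here (strip the lat/long token from each column name and match on what remains) and the names `pair_lat_long` and `_pair_key` are assumptions.

```python
import re

LAT_PATTERN = re.compile(r"lat(itude)?", re.IGNORECASE)
LON_PATTERN = re.compile(r"lon(g(itude)?)?", re.IGNORECASE)


def _pair_key(name: str, pattern: re.Pattern) -> str:
    """Drop the first lat/long token and separators, leaving a comparable key."""
    remainder = pattern.sub("", name, count=1)
    return re.sub(r"[\s_\-]+", "", remainder).lower()


def pair_lat_long(latitude_columns, longitude_columns):
    """Pair columns already typed as latitude/longitude by their shared name part.

    Returns a list of (latitude_name, longitude_name) tuples; columns without
    a partner are simply left out.
    """
    lon_by_key = {}
    for name in longitude_columns:
        lon_by_key.setdefault(_pair_key(name, LON_PATTERN), name)
    pairs = []
    for name in latitude_columns:
        match = lon_by_key.pop(_pair_key(name, LAT_PATTERN), None)
        if match is not None:
            pairs.append((name, match))
    return pairs


print(pair_lat_long(["pickup_lat", "lat"], ["long", "pickup_lon"]))
```

Matching on the remainder lets a dataset with several coordinate pairs (for example pickup and dropoff locations) keep each pair separate.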

The regular type detector recognizes integers, floats, text, booleans, URLs, file paths, WKT points/polygons, categorical values, IDs, datetimes, latitude/longitude name patterns, and optional administrative areas.

Installation

pip install atlas-profiler

For source development:

git clone https://github.com/VIDA-NYU/atlas-profiler.git
cd atlas-profiler
pip install -e .

Basic Usage

from atlas_profiler import process_dataset

metadata = process_dataset("data.csv")

process_dataset also accepts a pandas DataFrame:

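import pandas as pd
from atlas_profiler import process_dataset

# Any DataFrame works; here one is loaded from the same CSV as above.
df = pd.read_csv("data.csv")

metadata = process_dataset(
    df,
    geo_classifier=True,
    geo_classifier_threshold=0.5,
    coverage=True,
    plots=False,
    include_sample=False,
)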

Key parameters:

| Parameter | Default | Description |
|---|---|---|
| data | required | Path, file-like object, or pandas DataFrame. |
| geo_classifier | True | Enable the default hybrid spatial classifier, disable with False, or pass a classifier instance. |
| geo_classifier_threshold | 0.5 | Confidence cutoff for spatial ML predictions. |
| coverage | True | Compute numerical ranges plus spatial/temporal coverage. |
| plots | False | Add compact histogram-style plot data to column metadata. |
| include_sample | False | Include a small deterministic CSV sample in the output. |
| indexes | True | Preserve non-default DataFrame indexes as columns. |
| load_max_size | 5000000 | Target bytes to profile; larger inputs are sampled. |
| metadata | None | Optional seed metadata, including manual annotations. |
| nominatim | None | Optional Nominatim endpoint for resolving address strings. |
| datamart_geo_data | None | True or a datamart_geo.GeoData instance for administrative-area resolution. |
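The load_max_size behavior can be pictured as sampling rows down to a byte budget. The sketch below uses seeded Bernoulli row sampling, which is an assumption: the source only says larger inputs are sampled, not how. The function name `sample_lines_to_budget` is hypothetical.

```python
import random


def sample_lines_to_budget(lines, load_max_size=5_000_000, seed=0):
    """Keep the header plus a random subset of rows sized near the byte budget.

    Each data row is kept with probability load_max_size / total_bytes, so the
    result is only approximately within the budget.
    """
    if not lines:
        return []
    total_bytes = sum(len(line.encode("utf-8")) for line in lines)
    if total_bytes <= load_max_size:
        return list(lines)  # small input: profile everything
    ratio = load_max_size / total_bytes
    rng = random.Random(seed)  # seeded so repeated runs pick the same rows
    header, rows = lines[0], lines[1:]
    return [header] + [row for row in rows if rng.random() < ratio]


# Build a small in-memory CSV and sample it down to roughly a tenth.
lines = ["id,value\n"] + [f"{i},{'x' * 20}\n" for i in range(1000)]
sampled = sample_lines_to_budget(lines, load_max_size=2500, seed=0)
print(len(lines), len(sampled))
```

Keeping the header row unconditionally means the sampled text still parses as a CSV with the original column names.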

Manual Annotations

Manual annotations can be supplied through the metadata argument. They are useful when a user or upstream discovery step already knows a column's type. Manually annotated columns skip the spatial ML classifier and are reconciled with observed values during normal column processing.
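A sketch of how manually annotated columns might be separated from ML candidates is shown below. The seed-metadata layout (`manual_annotations` containing a `columns` list with `name`, `structural_type`, and `semantic_types`) is an assumption modeled on Datamart Profiler conventions and is not confirmed by this document; the type names are shortened for readability and `split_columns` is a hypothetical helper.

```python
# Hypothetical seed metadata; key names and type names are assumptions.
seed_metadata = {
    "manual_annotations": {
        "columns": [
            {"name": "zip", "structural_type": "Text", "semantic_types": ["address"]},
        ]
    }
}


def split_columns(all_columns, metadata):
    """Split column names into manually annotated ones and ML candidates."""
    annotations = (metadata or {}).get("manual_annotations", {}).get("columns", [])
    manual_names = {column["name"] for column in annotations}
    manual = [name for name in all_columns if name in manual_names]
    ml_candidates = [name for name in all_columns if name not in manual_names]
    return manual, ml_candidates


manual, ml_candidates = split_columns(["zip", "lat", "lon"], seed_metadata)
print(manual, ml_candidates)
```

Only the second list would be sent to the spatial classifier; the first would go straight to normal column processing with its annotation attached.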

Model Files

GeoClassifier() first looks for bundled model files under profiler/model/. If they are not present, it uses a user cache directory and downloads missing files when auto_download=True.

Required model files:

  • model.pt
  • config.json
  • label_encoder.json
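The lookup order described above can be sketched with pathlib. This is not GeoClassifier's code: the function names are hypothetical, the directories are passed in by the caller, and the download step is left out.

```python
from pathlib import Path

REQUIRED_FILES = ("model.pt", "config.json", "label_encoder.json")


def missing_files(directory: Path) -> list[str]:
    """List the required model files that are absent from a directory."""
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


def resolve_model_dir(bundled_dir: Path, cache_dir: Path) -> tuple[Path, list[str]]:
    """Prefer a complete bundled directory, otherwise fall back to the cache.

    Returns the chosen directory and the files still missing from it; a caller
    could download those when auto_download is enabled.
    """
    if not missing_files(bundled_dir):
        return bundled_dir, []
    return cache_dir, missing_files(cache_dir)


# Example with placeholder paths; nonexistent directories report all files missing.
chosen, missing = resolve_model_dir(Path("profiler/model"), Path("model-cache"))
print(chosen, missing)
```

Returning the missing file list rather than raising lets the caller decide between downloading and failing with a clear message.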

CTA model training, synthetic data generation, and standalone CTA inference are documented in training/README.md.

Project Structure

atlas-profiler/
├── atlas_profiler/          # Public import shim: from atlas_profiler import process_dataset
├── profiler/                # Runtime profiling package
│   ├── core.py              # process_dataset, loading, column pipeline, coverage
│   ├── profile_types.py     # Rule-based structural/semantic type detection
│   ├── spatial.py           # Spatial coverage, geohashing, GeoClassifier integration
│   ├── temporal.py          # Date parsing and temporal resolution
│   ├── numerical.py         # Numeric summaries and ranges
│   └── types.py             # Type constants
├── training/                # CTA data generation, model training, standalone inference
├── tests/                   # Unit tests
├── examples/                # Example notebooks
├── README.md
└── pyproject.toml

Relationship To Datamart Profiler

This project reuses the structure and main profiling logic of Datamart Profiler, with additional spatial CTA model integration.

Credits:
