# atlas-profiler

Data profiling with spatial column type annotation.
Atlas Profiler is a dataset profiling library. Given a CSV/TSV file, a file-like object, or a pandas DataFrame, it returns JSON-style metadata about the dataset: its columns, detected types, value ranges, optional plots, spatial/temporal coverage, and profiling runtime.
The package builds on the Datamart Profiler workflow and adds an ML-assisted spatial column classifier. That classifier is only one part of the profiler: non-spatial columns still go through the core rule-based type detection, statistics, plots, coverage, and dataset-summary pipeline.
## What It Produces
process_dataset(...) returns a metadata dictionary with fields such as:
- Dataset size, row count, profiled row count, and column count.
- Per-column structural type, semantic types, missing/unclean value ratios, distinct counts, and optional plots.
- Dataset-level type summary: numerical, categorical, spatial, and temporal.
- Spatial coverage from lat/long pairs, WKT points, resolved addresses, and administrative areas.
- Temporal coverage and temporal resolution for datetime columns.
- Attribute keywords derived from column names.
- Optional random sample rows and per-step profiling timings.
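To make the shape of that output concrete, here is an illustrative sketch. The field names (`size`, `nb_rows`, `columns`, `structural_type`, `semantic_types`, ...) are assumptions based on the Datamart Profiler schema that this project builds on, not confirmed atlas-profiler output keys:

```python
# Illustrative shape of the metadata dict returned by process_dataset.
# Key names follow the Datamart Profiler conventions and are assumptions;
# the actual atlas-profiler keys may differ.
metadata = {
    "size": 125840,            # dataset size in bytes
    "nb_rows": 4321,           # total row count
    "nb_profiled_rows": 4321,  # rows actually profiled (may be sampled)
    "columns": [
        {
            "name": "pickup_latitude",
            "structural_type": "http://schema.org/Float",
            "semantic_types": ["http://schema.org/latitude"],
            "missing_values_ratio": 0.01,
            "num_distinct_values": 3980,
        },
    ],
}

# Navigating the result: list columns carrying coordinate semantics.
spatial_cols = [
    c["name"]
    for c in metadata["columns"]
    if any("latitude" in t or "longitude" in t for t in c["semantic_types"])
]
print(spatial_cols)
```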
## Core Type System

The profiler detects broad structural types for all columns:

| Structural type | Meaning |
|---|---|
| MissingData | Empty column. |
| Integer | Integer-like values. |
| Float | Floating point values. |
| Text | String/text values. |
| Boolean | Boolean-like values such as true/false, yes/no, 0/1. |
| GeoCoordinates | Point geometry or coordinate-pair strings. |
| GeoShape | Polygon-like geometry. |
It also annotates semantic types when evidence is available:

| Semantic type | Examples |
|---|---|
| DateTime | Dates, timestamps, and year columns. |
| latitude, longitude | Coordinate columns, paired after profiling. |
| address, AdministrativeArea | Address-like and admin-area text, optionally resolved with Nominatim or datamart_geo. |
| URL, FileName, identifier, Enumeration | URLs, file paths, IDs, and categorical columns. |
## Spatial ML Classifier

When `geo_classifier=True`, Atlas Profiler creates a `HybridGeoClassifier(GeoClassifier())`. It samples values from each column, predicts spatial labels in one batch, validates sensitive predictions with rules, and passes accepted labels into the normal profiler type system.

The classifier labels are not the full profiler type system; they are a spatial CTA layer mapped into profiler structural and semantic types:

| Classifier label family | Mapped profiler behavior |
|---|---|
| latitude, longitude | Float columns with latitude/longitude semantic types, then paired for coverage. |
| x_coord, y_coord | Projected coordinate-like float columns. |
| point, line, polygon, multi-line, multi-polygon | Geometry columns mapped to point or shape structural types. |
| zip5, zip9, address | Text columns with address semantics. |
| borough, borough_code, city, state, state_code, country | Text columns with administrative-area semantics. |
| bbl, bin | NYC spatial identifiers mapped as integer identifiers. |
| non_spatial | Falls back to the core profiler's normal type detection. |
Manual column annotations take precedence over ML predictions. Low-confidence or rule-rejected ML predictions also fall back to the regular profiler workflow.
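The accept-or-fall-back behavior above can be sketched as follows. The function names and the specific rule check are illustrative assumptions, not atlas-profiler's actual API:

```python
# Minimal sketch of how an ML spatial prediction might be accepted or
# rejected before it reaches the profiler type system. Names and rule
# checks here are illustrative assumptions, not atlas-profiler API.

def looks_like_latitude(values):
    """Rule check: all sampled values must be numeric and within [-90, 90]."""
    try:
        return all(-90.0 <= float(v) <= 90.0 for v in values)
    except (TypeError, ValueError):
        return False

def accept_geo_label(label, confidence, sample_values, threshold=0.5):
    """Return the label to use, or None to fall back to rule-based detection."""
    if label == "non_spatial" or confidence < threshold:
        return None  # regular profiler workflow takes over
    # Sensitive labels are validated with rules before acceptance.
    if label == "latitude" and not looks_like_latitude(sample_values):
        return None
    return label

print(accept_geo_label("latitude", 0.9, ["40.7", "40.8"]))  # accepted
print(accept_geo_label("latitude", 0.9, ["400", "-200"]))   # rule-rejected
print(accept_geo_label("city", 0.3, ["Albany"]))            # below threshold
```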
## Pipeline
process_dataset runs the same high-level workflow for every dataset:
- Load data from a path, file object, or DataFrame.
- Compute cheap full-data stats and sample values for each column.
- Optionally run a single batch spatial ML prediction for all non-manual columns.
- Process every column with either an accepted geo prediction or the regular profiler type detector.
- Pair latitude/longitude columns and compute dataset-level type counts.
- Optionally compute numerical ranges, histograms, spatial coverage, temporal coverage, keywords, samples, and timing metadata.
The regular type detector recognizes integers, floats, text, booleans, URLs, file paths, WKT points/polygons, categorical values, IDs, datetimes, latitude/longitude name patterns, and optional administrative areas.
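One step of the pipeline above, pairing latitude/longitude columns, can be sketched by matching a shared name prefix (e.g. `pickup_latitude` + `pickup_longitude`). The heuristic below is an assumption for illustration; the real profiler's pairing logic may differ:

```python
import re

# Illustrative sketch: pair latitude/longitude columns that share a name
# prefix. This heuristic is an assumption, not atlas-profiler's actual code.

def pair_lat_long(column_names):
    lat, lon = {}, {}
    for name in column_names:
        m = re.match(r"(.*?)(lat(itude)?|lon(gitude)?|lng)$", name, re.IGNORECASE)
        if not m:
            continue
        prefix, kind = m.group(1), m.group(2).lower()
        if kind.startswith("lat"):
            lat[prefix] = name
        else:
            lon[prefix] = name
    # Keep only prefixes that have both halves of the pair.
    return [(lat[p], lon[p]) for p in lat if p in lon]

cols = ["pickup_latitude", "pickup_longitude", "fare", "dropoff_lat", "dropoff_lng"]
print(pair_lat_long(cols))
```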
## Installation

```shell
pip install atlas-profiler
```

For source development:

```shell
git clone https://github.com/VIDA-NYU/atlas-profiler.git
cd atlas-profiler
pip install -e .
```
## Basic Usage

```python
from atlas_profiler import process_dataset

metadata = process_dataset("data.csv")
```

process_dataset also accepts a pandas DataFrame:

```python
metadata = process_dataset(
    df,
    geo_classifier=True,
    geo_classifier_threshold=0.5,
    coverage=True,
    plots=False,
    include_sample=False,
)
```
Key parameters:
| Parameter | Default | Description |
|---|---|---|
| data | required | Path, file-like object, or pandas DataFrame. |
| geo_classifier | True | Enable the default hybrid spatial classifier, disable with False, or pass a classifier instance. |
| geo_classifier_threshold | 0.5 | Confidence cutoff for spatial ML predictions. |
| coverage | True | Compute numerical ranges plus spatial/temporal coverage. |
| plots | False | Add compact histogram-style plot data to column metadata. |
| include_sample | False | Include a small deterministic CSV sample in the output. |
| indexes | True | Preserve non-default DataFrame indexes as columns. |
| load_max_size | 5000000 | Target bytes to profile; larger inputs are sampled. |
| metadata | None | Optional seed metadata, including manual annotations. |
| nominatim | None | Optional Nominatim endpoint for resolving address strings. |
| datamart_geo_data | None | True or a datamart_geo.GeoData instance for administrative-area resolution. |
## Manual Annotations
Manual annotations can be supplied through the metadata argument. They are useful when a user or upstream discovery step already knows a column's type. Manually annotated columns skip the spatial ML classifier and are reconciled with observed values during normal column processing.
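The precedence rule above can be sketched as a small filter: manually annotated columns never reach the ML classifier. The seed-metadata structure and key names here (`manual_annotations`, `semantic_types`) are illustrative assumptions, not a documented atlas-profiler schema:

```python
# Sketch of manual-annotation precedence. The metadata structure and key
# names are illustrative assumptions, not atlas-profiler's actual schema.

seed_metadata = {
    "manual_annotations": {
        "columns": [
            {"name": "zip", "semantic_types": ["address"]},
        ]
    }
}

def columns_for_ml(all_columns, seed):
    """Return the columns the ML classifier should see: all non-annotated ones."""
    annotated = {
        c["name"]
        for c in seed.get("manual_annotations", {}).get("columns", [])
    }
    return [name for name in all_columns if name not in annotated]

print(columns_for_ml(["zip", "name", "lat"], seed_metadata))
```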
## Model Files
GeoClassifier() first looks for bundled model files under profiler/model/. If they are not present, it uses a user cache directory and downloads missing files when auto_download=True.
Required model files:

- model.pt
- config.json
- label_encoder.json
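The lookup order (bundled files first, then the user cache, with optional download) can be sketched like this. The paths and function name are illustrative assumptions; the actual download step is only indicated by a comment:

```python
from pathlib import Path

# Sketch of the model-file lookup order described above: bundled model
# directory first, then a user cache, optionally downloading missing files.
# Paths, the function name, and the download step are assumptions.

REQUIRED = ["model.pt", "config.json", "label_encoder.json"]

def resolve_model_dir(bundled: Path, cache: Path, auto_download: bool = True) -> Path:
    # Prefer bundled files under e.g. profiler/model/ when all are present.
    if all((bundled / f).exists() for f in REQUIRED):
        return bundled
    missing = [f for f in REQUIRED if not (cache / f).exists()]
    if missing and not auto_download:
        raise FileNotFoundError(f"missing model files: {missing}")
    # With auto_download=True, the real library would fetch the missing
    # files into the cache directory at this point.
    return cache
```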
CTA model training, synthetic data generation, and standalone CTA inference are documented in training/README.md.
## Project Structure

```text
atlas-profiler/
├── atlas_profiler/        # Public import shim: from atlas_profiler import process_dataset
├── profiler/              # Runtime profiling package
│   ├── core.py            # process_dataset, loading, column pipeline, coverage
│   ├── profile_types.py   # Rule-based structural/semantic type detection
│   ├── spatial.py         # Spatial coverage, geohashing, GeoClassifier integration
│   ├── temporal.py        # Date parsing and temporal resolution
│   ├── numerical.py       # Numeric summaries and ranges
│   └── types.py           # Type constants
├── training/              # CTA data generation, model training, standalone inference
├── tests/                 # Unit tests
├── examples/              # Example notebooks
├── README.md
└── pyproject.toml
```
## Relationship To Datamart Profiler
This project reuses the structure and main profiling logic of Datamart Profiler, with additional spatial CTA model integration.
Credits:
- Datamart Profiler codebase: https://gitlab.com/ViDA-NYU/auctus/auctus
- Datamart Profiler on PyPI: https://pypi.org/project/datamart-profiler/