Data profiling with spatial column type annotation.
atlas-profiler
Atlas Profiler is a comprehensive dataset profiling library that automatically detects and annotates data types, including spatial and temporal features. Given a CSV/TSV, file-like object, or pandas DataFrame, it returns rich JSON-style metadata about your dataset, its columns, detected types, value ranges, optional plots, spatial/temporal coverage, and profiling runtime.
Quick Start
Installation
Install from PyPI:
pip install atlas-profiler
Or install from source for development:
git clone https://github.com/VIDA-NYU/atlas-profiler.git
cd atlas-profiler
pip install -e .
Basic Usage
from atlas_profiler import process_dataset
# Profile a CSV file
metadata = process_dataset("data.csv")
# Or profile a pandas DataFrame
import pandas as pd
df = pd.read_csv("data.csv")
metadata = process_dataset(
    df,
    geo_classifier=True,
    geo_classifier_threshold=0.5,
    coverage=True,
)
Documentation
For comprehensive guides, API reference, examples, and advanced configuration, visit the Complete Documentation.
Table of Contents
- Features
- What It Produces
- Type System
- Architecture
- Advanced Usage
- Project Structure
- Related Projects
Features
✨ Automatic Type Detection: Identifies structural types (Integer, Float, Text, Boolean, GeoCoordinates, GeoShape) and semantic types (DateTime, Address, URL, ID, etc.)
🌍 Spatial Intelligence: ML-powered spatial column classifier trained on synthetic data, recognizing coordinates, addresses, geospatial identifiers, and administrative areas
⏰ Temporal Analysis: Detects and analyzes temporal columns with coverage and resolution information
📊 Rich Metadata: Comprehensive dataset profiling including:
- Column-level statistics and distinct value counts
- Dataset-level type summaries
- Spatial and temporal coverage information
- Optional histograms and sample data
- Profiling performance metrics
What It Produces
process_dataset(...) returns a metadata dictionary with:
- Dataset metrics: row count, column count, profiled row count
- Per-column analysis: structural type, semantic types, missing value ratios, distinct counts, sample values
- Dataset summary: numerical, categorical, spatial, and temporal type counts
- Coverage information: spatial bounding boxes, temporal ranges, geohash coverage
- Attribute keywords: automatically extracted from column names
- Performance metrics: per-step profiling timings
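As a rough illustration of consuming this dictionary, the sketch below walks a hand-written sample. The `columns`, `name`, and `semantic_types` keys follow the manual-annotation format used elsewhere in this README; the `structural_type` key and the helper name `columns_by_structural_type` are assumptions, so inspect real output before relying on them.

```python
def columns_by_structural_type(metadata):
    """Group column names by their structural type."""
    groups = {}
    for column in metadata.get("columns", []):
        # "structural_type" is an assumed key name, not confirmed by this README.
        kind = column.get("structural_type", "unknown")
        groups.setdefault(kind, []).append(column["name"])
    return groups


# Hand-written sample shaped like the description above (not real profiler output).
sample_metadata = {
    "columns": [
        {"name": "latitude", "structural_type": "Float",
         "semantic_types": ["http://schema.org/latitude"]},
        {"name": "longitude", "structural_type": "Float",
         "semantic_types": ["http://schema.org/longitude"]},
        {"name": "name", "structural_type": "Text", "semantic_types": []},
    ]
}

groups = columns_by_structural_type(sample_metadata)
print(groups)
```

Using `.get()` with a fallback keeps the helper tolerant of columns that lack the assumed key.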
Type System
Structural Types
The profiler recognizes these broad structural types:
| Type | Meaning |
|---|---|
| Integer | Integer-like values |
| Float | Floating point values |
| Text | String/text values |
| Boolean | Boolean-like values (true/false, yes/no, 0/1) |
| GeoCoordinates | Point geometry or coordinate-pair strings |
| GeoShape | Polygon-like geometry |
| MissingData | Empty column |
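To make the non-spatial rows of this table concrete, here is a minimal sketch of rule-based structural typing over raw string values. It is not the library's detection logic: the function name `guess_structural_type`, the order of the checks, and the choice to treat a 0/1-only column as Boolean are all assumptions made for illustration, and GeoCoordinates and GeoShape are left out.

```python
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "0", "1"}


def _parses_as(converter, text):
    """Return True if converter(text) succeeds."""
    try:
        converter(text)
        return True
    except ValueError:
        return False


def guess_structural_type(values):
    """Guess a broad structural type from a list of raw string values."""
    # Drop None and blank entries; they count as missing.
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    if not cleaned:
        return "MissingData"
    # Assumption: Boolean is checked first, so a 0/1-only column is Boolean.
    if all(v.lower() in BOOLEAN_TOKENS for v in cleaned):
        return "Boolean"
    if all(_parses_as(int, v) for v in cleaned):
        return "Integer"
    # Note: float() also accepts strings such as "nan" and "inf".
    if all(_parses_as(float, v) for v in cleaned):
        return "Float"
    return "Text"


print(guess_structural_type(["1.5", "2", ""]))
```

A real profiler would typically tolerate a small share of non-conforming values rather than require every value to parse.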
Semantic Types
The profiler also annotates semantic meaning when evidence is available:
| Type | Examples |
|---|---|
| DateTime | Dates, timestamps, year columns |
| latitude, longitude | Coordinate columns (paired after profiling) |
| address, AdministrativeArea | Address text or admin areas (optionally resolved via Nominatim or datamart_geo) |
| URL, FileName, identifier, Enumeration | URLs, file paths, IDs, categorical values |
Architecture
Pipeline
process_dataset executes a consistent workflow for every dataset:
- Load data from path, file object, or DataFrame
- Compute statistics on full data and collect sample values per column
- Predict spatial labels (optional) using batch ML inference
- Process columns with geo predictions or rule-based type detection
- Pair lat/long columns and compute dataset-level type summaries
- Compute coverage (optional) for numerical, spatial, and temporal ranges
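The lat/long pairing step can be pictured with a small name-matching sketch. How Atlas Profiler actually pairs columns is not described here; the approach below (strip the coordinate token from each name and match what remains) and the names `pair_lat_long`, `LAT_TOKEN`, and `LON_TOKEN` are assumptions for illustration. It expects columns that were already typed as latitude or longitude.

```python
import re

# Longest alternatives first so "latitude" is not consumed as just "lat".
LAT_TOKEN = re.compile(r"latitude|lat", re.IGNORECASE)
LON_TOKEN = re.compile(r"longitude|long|lon|lng", re.IGNORECASE)


def _stem(name, token):
    """Remove the first coordinate token and normalize what is left of the name."""
    return token.sub("", name, count=1).strip("_- ").lower()


def pair_lat_long(lat_columns, lon_columns):
    """Pair latitude and longitude column names that share the same stem."""
    lon_by_stem = {}
    for name in lon_columns:
        # Keep the first longitude column seen for each stem.
        lon_by_stem.setdefault(_stem(name, LON_TOKEN), name)
    pairs = []
    for name in lat_columns:
        stem = _stem(name, LAT_TOKEN)
        if stem in lon_by_stem:
            # pop() ensures each longitude column is used at most once.
            pairs.append((name, lon_by_stem.pop(stem)))
    return pairs


print(pair_lat_long(["pickup_latitude", "dropoff_lat"],
                    ["dropoff_lng", "pickup_longitude"]))
```

Columns whose stems differ, such as `start_lat` and `end_lon`, are left unpaired by this sketch.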
Spatial ML Classifier
When geo_classifier=True, Atlas Profiler uses a HybridGeoClassifier that:
- Samples values from each column
- Predicts spatial labels in a single batch
- Validates predictions using rule-based checks
- Maps predictions to the profiler's type system
Supported spatial labels:
| Label Family | Mapped Type |
|---|---|
| latitude, longitude | Float + semantic types |
| x_coord, y_coord | Projected coordinates |
| point, polygon, line | Geometry types |
| address, zip5, zip9 | Address/postal codes |
| borough, city, state, country | Administrative areas |
| bbl, bin | NYC spatial identifiers |
| non_spatial | Falls back to standard detection |
Manual annotations take precedence over ML predictions. Low-confidence or rejected predictions fall back to rule-based detection.
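The precedence described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's code: the name `resolve_label`, its arguments, and the use of `>=` for the threshold comparison are all hypothetical.

```python
def resolve_label(manual_label, predicted_label, confidence, passes_rules,
                  threshold=0.5):
    """Pick the label source for one column.

    Returns (label, source). A label of None means rule-based detection
    should decide the type.
    """
    # Manual annotations always win.
    if manual_label is not None:
        return manual_label, "manual"
    accepted = (
        predicted_label is not None
        and predicted_label != "non_spatial"
        and confidence >= threshold  # assumed inclusive cutoff
        and passes_rules             # result of the rule-based validation
    )
    if accepted:
        return predicted_label, "classifier"
    # Low-confidence, rejected, or non-spatial predictions fall back.
    return None, "rule_based"


print(resolve_label(None, "zip5", 0.9, True))
```

Raising `geo_classifier_threshold` in this picture means more columns take the rule-based path.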
Advanced Usage
Configuration Parameters
Key parameters for process_dataset():
| Parameter | Default | Description |
|---|---|---|
| data | required | Path, file-like object, or pandas DataFrame |
| geo_classifier | True | Enable spatial ML classifier |
| geo_classifier_threshold | 0.5 | Confidence cutoff for predictions |
| coverage | True | Compute numerical ranges and spatial/temporal coverage |
| plots | False | Include histogram-style plot data |
| include_sample | False | Include sample rows in output |
| indexes | True | Preserve DataFrame indexes as columns |
| load_max_size | 5000000 | Target bytes to profile (larger inputs are sampled) |
| metadata | None | Optional seed metadata with manual annotations |
| nominatim | None | Nominatim endpoint for address resolution |
| datamart_geo_data | None | GeoData instance for admin-area resolution |
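To give a feel for what `load_max_size` implies, the sketch below computes the fraction of rows that would need to be kept to land near a byte budget. The sampling strategy the library actually uses is not documented here; the function name `sampling_ratio` and the proportional approach are assumptions.

```python
def sampling_ratio(total_bytes, load_max_size=5_000_000):
    """Fraction of rows to keep so the profiled data is roughly load_max_size bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Inputs at or below the budget are profiled in full.
    return min(1.0, load_max_size / total_bytes)


# A 50 MB file with the default 5 MB budget keeps about one row in ten.
print(sampling_ratio(50_000_000))
```

Under this assumption, raising `load_max_size` trades profiling time for statistics computed on more of the data.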
Manual Annotations
Supply manual type annotations through the metadata argument. Useful when upstream processes or domain knowledge already identifies column types:
metadata = {
    "columns": [
        {
            "name": "latitude",
            "semantic_types": ["http://schema.org/latitude"]
        },
        {
            "name": "longitude",
            "semantic_types": ["http://schema.org/longitude"]
        }
    ]
}
result = process_dataset(df, metadata=metadata)
Manually annotated columns skip the spatial ML classifier and are reconciled with observed values during processing.
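When annotations come from another system, it can be convenient to build the seed dictionary programmatically. The helper below is a small sketch that produces the same structure as the example above; `build_annotations` is a hypothetical name, not part of the library.

```python
def build_annotations(semantic_types_by_column):
    """Build a seed metadata dict from {column name: [semantic type URIs]}."""
    return {
        "columns": [
            {"name": name, "semantic_types": list(types)}
            for name, types in semantic_types_by_column.items()
        ]
    }


seed = build_annotations({
    "latitude": ["http://schema.org/latitude"],
    "longitude": ["http://schema.org/longitude"],
})
print(seed)
# The result can then be passed along as process_dataset(df, metadata=seed).
```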
Model Files
The spatial ML classifier uses these model files (automatically downloaded if missing):
- model.pt: PyTorch model weights
- config.json: Model configuration
- label_encoder.json: Label encoding
Files are cached locally; setting auto_download=True enables automatic retrieval.
For model training details, see training/README.md.
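In offline environments it may help to check whether the three model files are already present before profiling. The sketch below only inspects a directory; where Atlas Profiler caches its files is not stated here, so the `cache_dir` argument and the helper name `missing_model_files` are assumptions.

```python
from pathlib import Path

# File names as listed in the Model Files section.
MODEL_FILES = ("model.pt", "config.json", "label_encoder.json")


def missing_model_files(cache_dir):
    """Return the model file names that are not present in cache_dir."""
    cache = Path(cache_dir)
    return [name for name in MODEL_FILES if not (cache / name).is_file()]


# Usage with a hypothetical directory; a missing directory reports all files.
print(missing_model_files("path/to/model/cache"))
```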
Project Structure
atlas-profiler/
├── atlas_profiler/ # Public API: from atlas_profiler import process_dataset
├── profiler/ # Core profiling package
│ ├── core.py # process_dataset(), data loading, column pipeline
│ ├── profile_types.py # Rule-based type detection
│ ├── spatial.py # Spatial coverage & GeoClassifier
│ ├── temporal.py # Temporal analysis
│ ├── numerical.py # Numerical profiling
│ └── types.py # Type constants
├── training/ # Model training & synthetic data generation
├── tests/ # Unit tests
├── examples/ # Example notebooks
├── docs/ # Sphinx documentation
└── pyproject.toml # Project configuration
Related Projects
This project builds upon and extends Datamart Profiler with additional spatial intelligence via ML-assisted column type classification.
- Datamart Profiler: https://pypi.org/project/datamart-profiler/
- Research Background: Developed by the NYU Visualization and Data Analytics Lab
License
Atlas Profiler is released under the MIT License.
File details
Details for the file atlas_profiler-0.0.2b1.tar.gz.
File metadata
- Download URL: atlas_profiler-0.0.2b1.tar.gz
- Upload date:
- Size: 37.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ef22fc10ca327d3f25da478ad8fa0dd7d8d71373680b437304fd43cd19dcca03 |
| MD5 | adec98c3f0b21c863ae8fc275f4878c4 |
| BLAKE2b-256 | e15d4944cd49e3eca21eb4efb58b02a063e08bc5f66ad662b239ffbd446dc5c3 |
File details
Details for the file atlas_profiler-0.0.2b1-py3-none-any.whl.
File metadata
- Download URL: atlas_profiler-0.0.2b1-py3-none-any.whl
- Upload date:
- Size: 37.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7ec6e024a78244257914dc53537ada991afe0adfbd5904393b8c6cf8e7ec42de |
| MD5 | b40942055af9dfc913240b1cdc7b0390 |
| BLAKE2b-256 | 1a8bf76094c1dca4f2c3fd315a87bd9be17165e506a87e7830c65571db7538e3 |