Data profiling with spatial column type annotation.

atlas-profiler

License: MIT · Python 3.10+

Atlas Profiler is a dataset profiling library that automatically detects and annotates data types, including spatial and temporal features. Given a path to a CSV/TSV file, a file-like object, or a pandas DataFrame, it returns JSON-style metadata describing the dataset: its columns, detected types, value ranges, optional plots, spatial/temporal coverage, and profiling runtime.

Quick Start

Installation

Install from PyPI:

pip install atlas-profiler

Or install from source for development:

git clone https://github.com/VIDA-NYU/atlas-profiler.git
cd atlas-profiler
pip install -e .

Basic Usage

from atlas_profiler import process_dataset

# Profile a CSV file
metadata = process_dataset("data.csv")

# Or profile a pandas DataFrame
import pandas as pd
df = pd.read_csv("data.csv")
metadata = process_dataset(
    df,
    geo_classifier=True,
    geo_classifier_threshold=0.5,
    coverage=True,
)

Documentation

For guides, the API reference, examples, and advanced configuration, see the project documentation.

Features

Automatic Type Detection: Identifies structural types (Integer, Float, Text, Boolean, GeoCoordinates, GeoShape) and semantic types (DateTime, Address, URL, ID, etc.)

🌍 Spatial Intelligence: ML-powered spatial column classifier trained on synthetic data, recognizing coordinates, addresses, geospatial identifiers, and administrative areas

Temporal Analysis: Detects and analyzes temporal columns with coverage and resolution information

📊 Rich Metadata: Comprehensive dataset profiling including:

  • Column-level statistics and distinct value counts
  • Dataset-level type summaries
  • Spatial and temporal coverage information
  • Optional histograms and sample data
  • Profiling performance metrics

What It Produces

process_dataset(...) returns a metadata dictionary with:

  • Dataset metrics: row count, column count, profiled row count
  • Per-column analysis: structural type, semantic types, missing value ratios, distinct counts, sample values
  • Dataset summary: numerical, categorical, spatial, and temporal type counts
  • Coverage information: spatial bounding boxes, temporal ranges, geohash coverage
  • Attribute keywords: automatically extracted from column names
  • Performance metrics: per-step profiling timings
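As a rough illustration of how such a dictionary might be consumed, the sketch below summarizes per-column types from a hand-written stand-in for the result. Only "columns", "name", and "semantic_types" appear elsewhere in this README; the other key names ("nb_rows", "structural_type") are assumptions, so check them against real output before relying on them.

```python
from collections import Counter

# Hand-written stand-in for a process_dataset() result. The key names
# "nb_rows" and "structural_type" are assumptions made for illustration.
sample_metadata = {
    "nb_rows": 3,
    "columns": [
        {"name": "lat", "structural_type": "Float", "semantic_types": ["latitude"]},
        {"name": "lon", "structural_type": "Float", "semantic_types": ["longitude"]},
        {"name": "note", "structural_type": "Text", "semantic_types": []},
    ],
}


def summarize_columns(metadata):
    """Return (name, structural type, semantic types) rows plus type counts."""
    rows = [
        (
            col.get("name"),
            col.get("structural_type"),
            tuple(col.get("semantic_types", [])),
        )
        for col in metadata.get("columns", [])
    ]
    counts = Counter(structural for _, structural, _ in rows)
    return rows, dict(counts)


rows, counts = summarize_columns(sample_metadata)
for name, structural, semantic in rows:
    print(f"{name}: {structural} {list(semantic)}")
print(counts)
```

Using .get() throughout keeps the helper tolerant of columns that lack an optional field.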

Type System

Structural Types

The profiler recognizes these broad structural types:

| Type | Meaning |
|---|---|
| Integer | Integer-like values |
| Float | Floating point values |
| Text | String/text values |
| Boolean | Boolean-like values (true/false, yes/no, 0/1) |
| GeoCoordinates | Point geometry or coordinate-pair strings |
| GeoShape | Polygon-like geometry |
| MissingData | Empty column |
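To make the distinctions concrete, here is a minimal sketch of rule-based structural typing over raw string values. It is not the profiler's own detection logic: the function name, the token list, and the order of the checks are assumptions, and the geometry types (GeoCoordinates, GeoShape) are left out.

```python
# Tokens treated as Boolean-like, following the examples in the table above.
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "0", "1"}


def _parses_as(caster, text):
    """Return True if caster(text) succeeds (caster is int or float)."""
    try:
        caster(text)
        return True
    except ValueError:
        return False


def guess_structural_type(values):
    """Guess a structural type name for one column of raw string values."""
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    if not cleaned:
        return "MissingData"  # nothing but empty cells
    if all(v.lower() in BOOLEAN_TOKENS for v in cleaned):
        return "Boolean"
    if all(_parses_as(int, v) for v in cleaned):
        return "Integer"
    if all(_parses_as(float, v) for v in cleaned):
        return "Float"
    return "Text"


print(guess_structural_type(["1", "2", "3"]))
print(guess_structural_type(["1.5", "2"]))
print(guess_structural_type(["yes", "No"]))
```

Checking Boolean before Integer means a column holding only 0 and 1 is reported as Boolean in this sketch; a real profiler may resolve that ambiguity differently.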

Semantic Types

The profiler also annotates semantic meaning when evidence is available:

| Type | Examples |
|---|---|
| DateTime | Dates, timestamps, year columns |
| latitude, longitude | Coordinate columns (paired after profiling) |
| address, AdministrativeArea | Address text or admin areas (optionally resolved via Nominatim or datamart_geo) |
| URL, FileName, identifier, Enumeration | URLs, file paths, IDs, categorical values |
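The table notes that latitude and longitude columns are paired after profiling. One plausible way to do this is to match columns by the part of the name left over once the coordinate token is removed. The sketch below shows that idea only; the function names and regular expressions are hypothetical and are not taken from the library.

```python
import re

# Longest alternatives first so "longitude" is not matched as just "lon".
LAT_PATTERN = re.compile(r"latitude|lat", re.IGNORECASE)
LON_PATTERN = re.compile(r"longitude|long|lng|lon", re.IGNORECASE)


def _stem(name, pattern):
    """Drop the first coordinate token and any separators, leaving a stem."""
    stripped = pattern.sub("", name, count=1)
    return re.sub(r"[\s_\-]+", "", stripped).lower()


def pair_lat_lon(lat_columns, lon_columns):
    """Pair each latitude column with the longitude column sharing its stem."""
    lon_by_stem = {_stem(name, LON_PATTERN): name for name in lon_columns}
    pairs = []
    for lat in lat_columns:
        lon = lon_by_stem.get(_stem(lat, LAT_PATTERN))
        if lon is not None:
            pairs.append((lat, lon))
    return pairs


pairs = pair_lat_lon(
    ["pickup_latitude", "dropoff_lat"],
    ["pickup_longitude", "dropoff_lon", "other_lng"],
)
print(pairs)
```

The inputs are assumed to be columns already typed as latitude or longitude; names where "lat" appears inside another word would need a stricter pattern.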

Architecture

Pipeline

process_dataset executes a consistent workflow for every dataset:

  1. Load data from path, file object, or DataFrame
  2. Compute statistics on full data and collect sample values per column
  3. Predict spatial labels (optional) using batch ML inference
  4. Process columns with geo predictions or rule-based type detection
  5. Pair lat/long columns and compute dataset-level type summaries
  6. Compute coverage (optional) for numerical, spatial, and temporal ranges
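Step 6 can be pictured with a small sketch that derives a spatial bounding box and a temporal range from already-typed values. This is an assumed, simplified view of coverage; the helper names are hypothetical and the profiler's actual output (for example its geohash coverage) is richer.

```python
from datetime import datetime


def bounding_box(points):
    """Return (min_lat, min_lon, max_lat, max_lon), ignoring invalid points."""
    valid = [
        (lat, lon)
        for lat, lon in points
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
    ]
    if not valid:
        return None
    lats = [lat for lat, _ in valid]
    lons = [lon for _, lon in valid]
    return (min(lats), min(lons), max(lats), max(lons))


def temporal_range(timestamps):
    """Return (earliest, latest) for a list of ISO 8601 date strings."""
    parsed = [datetime.fromisoformat(text) for text in timestamps]
    return (min(parsed), max(parsed)) if parsed else None


box = bounding_box([(40.7, -74.0), (40.8, -73.9), (999.0, 0.0)])
span = temporal_range(["2024-03-01", "2023-12-31"])
print(box)
print(span)
```

Out-of-range coordinates are dropped before the box is computed so that a single bad row does not stretch the coverage.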

Spatial ML Classifier

When geo_classifier=True, Atlas Profiler uses a HybridGeoClassifier that:

  • Samples values from each column
  • Predicts spatial labels in a single batch
  • Validates predictions using rule-based checks
  • Maps predictions to the profiler's type system

Supported spatial labels:

| Label Family | Mapped Type |
|---|---|
| latitude, longitude | Float + semantic types |
| x_coord, y_coord | Projected coordinates |
| point, polygon, line | Geometry types |
| address, zip5, zip9 | Address/postal codes |
| borough, city, state, country | Administrative areas |
| bbl, bin | NYC spatial identifiers |
| non_spatial | Falls back to standard detection |

Manual annotations take precedence over ML predictions. Low-confidence or rejected predictions fall back to rule-based detection.
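The precedence described above can be sketched as a small decision function. This is an illustration under assumptions, not the HybridGeoClassifier code: the function name, the (label, confidence) tuple shape, and the use of >= for the threshold comparison are all hypothetical.

```python
def resolve_label(column, manual, predictions, threshold=0.5,
                  validate=lambda column, label: True):
    """Return (label, source) for one column.

    manual:      {column: label} supplied by the user
    predictions: {column: (label, confidence)} from the classifier
    validate:    rule-based check that may reject a predicted label
    """
    if column in manual:
        return manual[column], "manual"  # manual annotations always win
    label, confidence = predictions.get(column, ("non_spatial", 0.0))
    accepted = (
        label != "non_spatial"
        and confidence >= threshold
        and validate(column, label)
    )
    if accepted:
        return label, "classifier"
    return None, "rule_based"  # caller falls back to rule-based detection


manual = {"lat": "latitude"}
predictions = {
    "lat": ("zip5", 0.9),
    "zip": ("zip5", 0.8),
    "name": ("city", 0.3),
    "id": ("non_spatial", 0.99),
}
for column in ("lat", "zip", "name", "id"):
    print(column, resolve_label(column, manual, predictions))
```

Returning the source alongside the label makes it easy to audit which columns were decided by the classifier and which fell back.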

Advanced Usage

Configuration Parameters

Key parameters for process_dataset():

| Parameter | Default | Description |
|---|---|---|
| data | required | Path, file-like object, or pandas DataFrame |
| geo_classifier | True | Enable spatial ML classifier |
| geo_classifier_threshold | 0.5 | Confidence cutoff for predictions |
| coverage | True | Compute numerical ranges and spatial/temporal coverage |
| plots | False | Include histogram-style plot data |
| include_sample | False | Include sample rows in output |
| indexes | True | Preserve DataFrame indexes as columns |
| load_max_size | 5000000 | Target bytes to profile (larger inputs are sampled) |
| metadata | None | Optional seed metadata with manual annotations |
| nominatim | None | Nominatim endpoint for address resolution |
| datamart_geo_data | None | GeoData instance for admin-area resolution |
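The load_max_size entry says that larger inputs are sampled. A minimal sketch of one way such sampling could work is shown below, assuming each row is kept with probability load_max_size / total_bytes. The function name and the sampling strategy are assumptions; the profiler may sample differently.

```python
import random


def sample_rows(rows, total_bytes, load_max_size=5_000_000, seed=0):
    """Keep roughly load_max_size / total_bytes of the rows, in order."""
    if total_bytes <= load_max_size:
        return list(rows)  # small enough: profile everything
    ratio = load_max_size / total_bytes
    rng = random.Random(seed)  # seeded so repeated runs agree
    return [row for row in rows if rng.random() < ratio]


rows = list(range(1000))
everything = sample_rows(rows, total_bytes=1_000)
sampled = sample_rows(rows, total_bytes=10_000_000)
print(len(everything), len(sampled))
```

Sampling per row rather than truncating the file keeps the sample spread across the whole input, which matters when rows are sorted by time or location.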

Manual Annotations

Supply manual type annotations through the metadata argument. This is useful when upstream processes or domain knowledge already identify column types:

metadata = {
    "columns": [
        {
            "name": "latitude",
            "semantic_types": ["http://schema.org/latitude"]
        },
        {
            "name": "longitude", 
            "semantic_types": ["http://schema.org/longitude"]
        }
    ]
}

result = process_dataset(df, metadata=metadata)

Manually annotated columns skip the spatial ML classifier and are reconciled with observed values during processing.
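When annotations come from another system, the seed dictionary can be generated rather than written by hand. The helper below is a hypothetical convenience (build_annotations is not part of the library) that produces the same structure as the example above.

```python
def build_annotations(semantic_types_by_column):
    """Build seed metadata from a {column name: [semantic type, ...]} mapping."""
    return {
        "columns": [
            {"name": name, "semantic_types": list(types)}
            for name, types in semantic_types_by_column.items()
        ]
    }


seed_metadata = build_annotations({
    "latitude": ["http://schema.org/latitude"],
    "longitude": ["http://schema.org/longitude"],
})
print(seed_metadata)
```

The result can then be passed as process_dataset(df, metadata=seed_metadata).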

Model Files

The spatial ML classifier uses these model files (automatically downloaded if missing):

  • model.pt — PyTorch model weights
  • config.json — Model configuration
  • label_encoder.json — Label encoding

Downloaded files are cached locally. Setting auto_download=True enables automatic retrieval of any that are missing.

For model training details, see training/README.md.
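To check a local cache before running offline, a short sketch like the following can list which of the three files are absent. The directory argument is an assumption: this README does not state where the profiler stores its cache, and missing_model_files is a hypothetical helper.

```python
from pathlib import Path

# File names as listed in the Model Files section above.
MODEL_FILES = ("model.pt", "config.json", "label_encoder.json")


def missing_model_files(model_dir):
    """Return the expected model files that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in MODEL_FILES if not (model_dir / name).is_file()]


# Hypothetical cache location, used here only for demonstration.
print(missing_model_files("models"))
```

An empty list means all three files are present in the directory that was checked.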

Project Structure

atlas-profiler/
├── atlas_profiler/          # Public API: from atlas_profiler import process_dataset
├── profiler/                # Core profiling package
│   ├── core.py              # process_dataset(), data loading, column pipeline
│   ├── profile_types.py     # Rule-based type detection
│   ├── spatial.py           # Spatial coverage & GeoClassifier
│   ├── temporal.py          # Temporal analysis
│   ├── numerical.py         # Numerical profiling
│   └── types.py             # Type constants
├── training/                # Model training & synthetic data generation
├── tests/                   # Unit tests
├── examples/                # Example notebooks
├── docs/                    # Sphinx documentation
└── pyproject.toml           # Project configuration

Related Projects

This project builds upon and extends Datamart Profiler with additional spatial intelligence via ML-assisted column type classification.

License

Atlas Profiler is released under the MIT License.
