Skip to main content

C++ accelerated data preparation for pandas and the Python data stack

Project description


Arnio



Fast data preparation for the Python data stack.


Arnio is a compiled C++ data preparation engine for messy CSV and pandas workflows.
It parses, infers types, strips whitespace, deduplicates, validates, and profiles data —
then hands clean results back to the tools you already use.
Use Arnio before and alongside pandas, NumPy, scikit-learn, DuckDB, and Arrow.


PyPI  Python  CI  Coverage  MIT  GSSoC 2026  Join Discord PyPI Downloads



pip install arnio

Colab install smoke test: COLAB_SMOKE_TEST.md


Quickstart · Integrations · Why Arnio · Architecture · Benchmarks · Community · Contribute




⚡ Quickstart

A simple workflow in just a few steps.

New to Arnio? Start with the pandas workflow example below before exploring advanced pipelines.

import arnio as ar

# Load CSV directly through C++ — no Python parsing overhead
frame = ar.read_csv("messy_sales_data.csv")

# Declare what clean data looks like — arnio handles the rest
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# Out comes a standard pandas DataFrame — use it like you always have
df = ar.to_pandas(clean)

# Use copy=True when you need defensive pandas-owned buffers
safe_df = ar.to_pandas(clean, copy=True)

Already have a pandas DataFrame? Use Arnio in-place in your existing pandas workflow:

import pandas as pd
import arnio as ar

df = pd.read_csv("messy_sales_data.csv")

clean_df = df.arnio.clean([
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_duplicates",),
])

report = clean_df.arnio.profile()

Select specific columns

Use select_columns() to create a new ArFrame with only the required columns before converting to pandas.

selected = frame.select_columns(["name", "revenue"])

print(selected.columns)
# ['name', 'revenue']

Every step above executes in C++. Your Python code is a configuration — not the execution engine.


📸 Peek at a 100 GB file without loading it

scan_csv reads only the header + a sample to infer the schema. Zero data loaded.

schema = ar.scan_csv("100GB_file.csv")
# {'id': 'int64', 'name': 'string', 'is_active': 'bool', 'revenue': 'float64'}

Useful for exploring datasets before committing memory.

👀 Preview rows without pandas conversion or full-column Python list materialization

preview() reads only the first n rows directly from the C++ frame — no pandas conversion triggered.

frame = ar.read_csv("huge_file.csv")

print(frame.preview())      # first 5 rows (default)
print(frame.preview(n=10))  # first 10 rows

Raises ValueError for invalid n (zero, negative, or non-integer).

🧩 Add custom steps without touching C++

Register any Python function as a pipeline step. It receives a DataFrame, returns a DataFrame.

def remove_outliers(df, column="revenue", threshold=100_000):
    return df[df[column] <= threshold]

ar.register_step("remove_outliers", remove_outliers)

# Now use it in any pipeline alongside native C++ steps
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("remove_outliers", {"column": "revenue", "threshold": 50000}),
    ("drop_duplicates",),
])

Custom steps run through a pandas↔ArFrame conversion bridge. Prototype in Python, then optionally migrate hot paths to C++ for full speed.




🔗 Integrations

Arnio is designed to make the rest of the Python data stack more productive, not to replace it.

Workflow How Arnio helps
pandas Clean, validate, and profile messy DataFrames through df.arnio.
NumPy Prepare typed numeric data before array/modeling workflows.
scikit-learn Use Arnio cleaning as a preprocessing layer before model training.
DuckDB / Arrow Validate and prepare data before analytics and columnar exchange.
notebooks Inspect quality issues and cleaning suggestions before analysis.

Pandas accessor

df = pd.read_csv("raw_customers.csv")

clean_df = df.arnio.clean(drop_duplicates=True)
quality = clean_df.arnio.profile()
validation = clean_df.arnio.validate({
    "email": ar.Email(nullable=False),
    "age": ar.Int64(nullable=True, min=0),
})

This keeps pandas as the analysis tool while Arnio handles the preparation, quality, and validation layer.

Product direction: PROJECT_DIRECTION.md




🔍 Why Arnio exists

Every data project starts the same way:

df = pd.read_csv("data.csv")              # 💥 RAM spike — entire file as raw strings
df.columns = df.columns.str.strip()        # Why is this not automatic?
df["name"] = df["name"].str.strip()        # Python loop over every cell
df["name"] = df["name"].str.lower()        # Another Python loop
df = df.dropna()                           # Another pass
df = df.drop_duplicates()                  # Another pass

Six lines. Four full-data passes. All in interpreted Python. This is fine for a Jupyter demo — but it doesn't scale, it doesn't compose, and it definitely doesn't belong in production.

Arnio intercepts this entire pattern. It moves the preparation layer into a predictable pipeline, accelerates supported operations in C++, and gives you clean data for pandas, NumPy, scikit-learn, DuckDB, or notebooks.

Without Arnio

df = pd.read_csv(path)
df.columns = df.columns.str.strip()
for col in str_cols:
    df[col] = df[col].str.strip()
    df[col] = df[col].str.lower()
df = df.dropna(subset=["revenue"])
df = df.drop_duplicates()
# 6+ lines, multiple passes, pure Python

With Arnio

frame = ar.read_csv(path)
df = ar.to_pandas(ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_nulls", {"subset": ["revenue"]}),
    ("drop_duplicates",),
]))
# Declarative. Single pipeline. C++ execution.



🏗️ Architecture

Arnio is not a pandas wrapper. It's a separate runtime with its own data model.

┌──────────────────────────────────────────────────────────────┐
│  Your Python Code                                            │
│  frame = ar.read_csv("data.csv")                             │
│  clean = ar.pipeline(frame, [...])                           │
│  df = ar.to_pandas(clean)                                    │
└────────────────────────┬─────────────────────────────────────┘
                         │  pybind11 boundary
┌────────────────────────▼─────────────────────────────────────┐
│  C++ Runtime  (_arnio_cpp)                                   │
│                                                              │
│  ┌─────────────┐  ┌─────────────────┐  ┌──────────────────┐ │
│  │  CsvReader   │  │  Frame/Column   │  │  Cleaning Engine │ │
│  │  • RFC 4180  │  │  • Columnar     │  │  • drop_nulls    │ │
│  │  • BOM strip │  │  • std::variant │  │  • fill_nulls    │ │
│  │  • Type      │  │  • Bool null    │  │  • drop_dupes    │ │
│  │    inference │  │    masks        │  │  • strip_ws      │ │
│  │  • Quoted    │  │  • O(1) column  │  │  • normalize     │ │
│  │    fields    │  │    lookup       │  │  • rename/cast   │ │
│  └─────────────┘  └─────────────────┘  └──────────────────┘ │
│                                                              │
│  to_pandas() ──→ zero-copy NumPy buffer (numerics/bools)     │
└──────────────────────────────────────────────────────────────┘

Design decisions that matter

Decision What it means
Columnar storage Data lives in typed std::vectors — vector<int64_t>, vector<double>, vector<string> — not rows of variants. Cache-friendly and SIMD-ready.
Boolean null masks Nulls are tracked in a separate vector<bool>, keeping data vectors dense. No sentinel values, no NaN tricks.
Two-pass CSV read Pass 1 infers types across all rows. Pass 2 parses values directly into the correct typed column. No string→object→cast overhead.
Zero-copy bridge to_pandas() exposes C++ memory directly via NumPy's buffer protocol where supported. Numeric columns preserve the fast zero-copy path by default, while copy=True requests defensive pandas-owned buffers.
Step registry Pipeline steps map to C++ function pointers. Adding a new cleaning primitive is a single function + one registry entry.

Full architecture documentation: ARCHITECTURE.md




🏎️ Benchmarks

Reference environment: Ubuntu, Python 3.12, synthetic messy CSV inputs.
Reproduce: make benchmark — generates deterministic tall and wide datasets and runs both engines.

To reproduce the published numbers from a fresh checkout:

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
python benchmarks/generate_data.py
python benchmarks/benchmark_vs_pandas.py

benchmarks/generate_data.py uses deterministic NumPy seeds, so every run creates the same benchmarks/benchmark_1m.csv tall input and benchmarks/benchmark_wide.csv wide input. The benchmark then executes three pandas runs and three arnio runs for each case, printing average wall-clock time from time.perf_counter() and peak Python allocation from tracemalloc. For cleaner comparisons, close other memory-heavy processes and run the script from the repository root after installing the same Python, pandas, NumPy, compiler, and arnio commit you want to compare.

Expected output format:

Tall CSV (1,000,000 rows x 12 columns)
Metric                     pandas        arnio
────────────────────────────────────────────
Exec Time (avg)       4.73s         5.75s
Peak RAM               211MB         212MB
Speed: 0.8x | RAM: -1% reduction

Wide CSV (5,000 rows x 256 columns)
Metric                     pandas        arnio
────────────────────────────────────────────
Exec Time (avg)       ...s          ...s
Peak RAM              ...MB         ...MB
Speed: ...x | RAM: ...% reduction

Small differences are expected across CPUs, operating systems, compilers, Python builds, and pandas/NumPy versions. If you share benchmark results in an issue or PR, include your OS, Python version, CPU model, pandas/NumPy versions, arnio commit, and the full command output so maintainers can compare like for like.

Arnio is near memory parity in the reference benchmark while replacing ad-hoc Python string loops with a compiled, declarative pipeline. Validate memory and speed on your own workload. The execution time gap is a known, active optimization target — the current drop_duplicates and strip_whitespace implementations use unoptimized row-key serialization.

What's already won 🎯 What's being optimized
  • Native C++ parsing eliminates Python memory spikes
  • Columnar storage matches pandas' internal efficiency
  • Declarative API eliminates .apply() spaghetti
  • Zero-copy bridge for numeric conversions
  • drop_duplicates — replace string serialization with hash-based comparisons
  • strip_whitespace — in-place mutation instead of copy-on-write
  • Parallel column processing via std::thread
  • Help close the gap →

🧠 Auto Clean Memory Benchmark

To measure the peak memory and execution time of the auto_clean pipeline using realistic dataset sizes:

python benchmarks/benchmark_auto_clean_memory.py --rows 100000

This script generates a reproducible synthetic dataset with mixed column types (strings, ints, floats, booleans, nulls, and duplicates) and measures:

  • ar.read_csv performance
  • ar.auto_clean(mode="safe") performance (low-risk cleanup like whitespace trimming)
  • ar.auto_clean(mode="strict") performance (includes type casting and deduplication)

The dataset is regenerated deterministically unless --reuse-file is provided. Each auto_clean benchmark run reloads the dataset to avoid mutation or caching effects between runs.

Options:

  • --repeat N runs each operation multiple times and reports average (and min/max range).
  • --seed N changes the deterministic dataset seed.
  • --reuse-file reuses an existing dataset file instead of regenerating it.
  • --keep-file keeps the generated CSV (otherwise it is removed at the end).

Expected output format:

Operation                    Time(s)     Peak Py(MiB)
--------------------------------------------------------------------
ar.read_csv           0.042 (0.041-0.044)    4.52 (4.50-4.60)
ar.auto_clean(safe)   0.012 (0.011-0.013)    0.15 (0.14-0.16)
ar.auto_clean(strict) 0.035 (0.034-0.036)    1.20 (1.18-1.22)
--------------------------------------------------------------------
Total avg (Read+Strict)       0.077             4.52



🧰 Cleaning primitives

Most operations below run natively in C++. The current filter_rows step uses the Python pipeline backend and may be optimized in C++ later.

Primitive What it does Example
drop_nulls Remove rows with null/empty values ar.drop_nulls(frame, subset=["age"])
keep_rows_with_nulls Keep only rows that contain at least one null ar.keep_rows_with_nulls(frame, subset=["age"])
validate_columns_exist Fail early when required columns are missing ar.validate_columns_exist(frame, ["age"])
filter_rows Filter rows using comparison operators ar.filter_rows(frame, column="age", op=">", value=18)
fill_nulls Replace nulls with a scalar ar.fill_nulls(frame, 0, subset=["revenue"])
drop_duplicates Deduplicate rows (first/last/none) ar.drop_duplicates(frame, keep="first")
drop_constant_columns Remove columns with only one unique value ar.drop_constant_columns(frame)
clip_numeric Clip numeric values to lower and/or upper bounds ar.clip_numeric(frame, lower=0, upper=100)
strip_whitespace Trim leading/trailing spaces from strings ar.strip_whitespace(frame)
normalize_case Force lower/upper/title case ar.normalize_case(frame, case_type="title")
rename_columns Rename columns via mapping ar.rename_columns(frame, {"old": "new"})
cast_types Cast column types ar.cast_types(frame, {"age": "int64"})
round_numeric_columns Round numeric columns (non-numeric columns in subset ignored safely) ar.round_numeric_columns(frame, decimals=2)
clean Convenience shorthand ar.clean(frame, drop_nulls=True)
safe_divide_columns Divide one column by another, handling zero/null denominators ar.safe_divide_columns(frame, numerator="revenue", denominator="cost", output_column="ratio")

Or compose them all into a pipeline:

clean = ar.pipeline(frame, [
    ("validate_columns_exist", {"columns": ["name", "city", "revenue"]}),
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": "unknown", "subset": ["city"]}),
    ("drop_duplicates", {"keep": "first"}),
])

🔎 Filter rows inside pipelines

Use filter_rows to keep only rows matching a condition.

clean = ar.pipeline(frame, [
    ("filter_rows", {
        "column": "revenue",
        "op": ">=",
        "value": 1000
    }),
])

Supported operators:

  • >
  • <
  • >=
  • <=
  • ==
  • !=

Works with:

  • integers
  • floats
  • strings
  • booleans

🔎 Isolate rows with null values

Use keep_rows_with_nulls to audit incomplete data — keep only rows that have at least one null.

frame = ar.read_csv("data.csv")

# Keep all rows that have at least one null anywhere
nulls = ar.keep_rows_with_nulls(frame)

# Keep rows where specifically 'age' or 'score' is null
nulls = ar.keep_rows_with_nulls(frame, subset=["age", "score"])

# Works inside a pipeline too
result = ar.pipeline(frame, [
    ("keep_rows_with_nulls", {"subset": ["age"]}),
])

Useful for data auditing — inspect what's missing before deciding how to fill or drop.


### 🔢 Safe column division

Divide one column by another while handling division by zero and null denominators explicitly:

result = ar.safe_divide_columns(
    frame,
    numerator="revenue",
    denominator="cost",
    output_column="ratio",
    fill_value=0.0,  # used when denominator is zero or null
)

When the denominator is zero or null, the result is replaced with fill_value (default 0.0) instead of raising an error or producing NaN/Inf.



📊 Pandas Dtype Support Matrix

This table helps users understand which pandas dtypes and workflows are fully supported, partially supported, unsupported, or planned.

If a dtype is partially supported, users may need conversion before processing. Unsupported dtypes should raise clear errors where applicable.

Pandas Dtype Support Status Notes
int64 ✅ Supported Fully supported with native C++ columnar storage
float64 ✅ Supported Fully supported with zero-copy conversion where possible
bool ✅ Supported Native supported boolean type
string ✅ Supported Recommended over object dtype for text workflows
datetime64[ns] ❌ Unsupported No native datetime parsing or conversion support yet
category ⚠️ Limited Converted to string/object during processing
object (mixed columns) ⚠️ Limited Mixed object columns may coerce to string and reduce type inference reliability
nullable pandas dtypes (Int64, boolean) ⚠️ Limited Supported through pandas extension dtypes with null-mask handling
timedelta64[ns] ❌ Unsupported Not currently supported

Notes

  • Numeric columns are optimized for zero-copy conversion between C++ and pandas where supported.
  • Pass copy=True to to_pandas() when downstream pandas code needs defensive pandas-owned column buffers.
  • Boolean conversion is already copied by the binding because std::vector<bool> cannot be exposed as a zero-copy NumPy buffer in the current implementation.
  • Columns with null masks may require copies so pandas can apply nullable values safely.
  • String columns require Python string object creation during to_pandas() conversion.
  • Mixed object columns may reduce type inference accuracy and may require preprocessing.
  • Unsupported dtypes should raise clear user-facing errors instead of silent failures.



🧠 Data quality engine

Arnio now includes built-in dataset understanding before you analyze in pandas.

report = ar.profile(frame)
print(report.summary())

suggestions = ar.suggest_cleaning(frame)
clean = ar.pipeline(frame, suggestions)

For production data contracts:

schema = ar.Schema({
    "id": ar.Int64(nullable=False, unique=True),
    "email": ar.Email(nullable=False),
    "username": ar.String(min_length=3, max_length=20),
    "revenue": ar.Float64(nullable=True, min=0),
})

result = ar.validate(frame, schema)
if not result.passed:
    summary = result.summary()
    print(summary["issues_by_rule"])
    print(summary["issues_by_column"])
    print(summary["issues_by_column_and_rule"])
    print(result.to_pandas())
    print(result.to_markdown(max_issues=10))

ValidationResult.to_markdown() is useful in CI logs, GitHub comments, or data quality reports because it renders a compact validation summary plus a GitHub-friendly issue table. Severity counts are not included in summary() yet because ValidationIssue does not currently carry severity information.

For low-risk automatic cleanup:

clean, report = ar.auto_clean(frame, mode="strict", return_report=True)

This is the layer pandas does not try to own: profiling, data contracts, row-level validation issues, and safe cleaning suggestions for messy incoming datasets.


Beginner-friendly auto-clean tutorial

Use this workflow when you receive a small messy dataset and want to inspect what Arnio will change before applying it.

import arnio as ar
import pandas as pd

raw = pd.DataFrame(
    {
        "order_id": [1001, 1002, 1002, 1003, 1004],
        "customer": [" Ishan ", " Prasoon ", " Prasoon ", " Pranay ", " Dhruv "],
        "city": [" Paris ", "London", "London", " New York ", " Tokyo "],
    }
)

frame = ar.from_pandas(raw)

report = ar.profile(frame)
summary = report.summary()
print(summary)

suggestions = ar.suggest_cleaning(frame)
print(suggestions)
# [('strip_whitespace', {'subset': ['customer', 'city']}), ('drop_duplicates', {'keep': 'first'})]

safe = ar.auto_clean(frame)
strict = ar.auto_clean(frame, mode="strict")

Messy input:

order_id customer city
1001 Ishan Paris
1002 Prasoon London
1002 Prasoon London
1003 Pranay New York
1004 Dhruv Tokyo

Expected cleaned output with mode="strict":

order_id customer city
1001 Ishan Paris
1002 Prasoon London
1003 Pranay New York
1004 Dhruv Tokyo

mode="safe" only trims whitespace. Use mode="strict" when you also want deterministic built-in cleanup such as exact duplicate removal.

See examples/auto_clean_tutorial.py for a runnable version of this walkthrough.


Data Quality Reports

Arnio provides detailed profiling for datasets via ar.profile(). To generate the report shown in these examples, the following code was used:

import arnio as ar
import pandas as pd

# Sample dataset used for these examples
data = {
    "user_id": [101, 102, 103, 104],
    "email": ["test@arnio.ai", "invalid-email", None, "test@arnio.ai"],
    "score": [85.5, 90.0, None, 88.2]
}
df = ar.from_pandas(pd.DataFrame(data))
# Bounded profiling for large datasets (controls how many sample values are kept)
report = ar.profile(df, sample_size=5)

1. Terminal Representation (Simplified Example)

A simplified view of the standard string representation of the report object:

DataQualityReport(
    row_count=4,
    column_count=3,
    memory_usage=733,
    duplicate_rows=0,
    columns={
        'user_id': ColumnProfile(dtype='int64', semantic_type='identifier', unique_count=4),
        'email': ColumnProfile(dtype='string', semantic_type='categorical', null_count=1, unique_ratio=0.666667),
        'score': ColumnProfile(dtype='float64', semantic_type='numeric', mean=87.9, min=85.5, max=90.0)
    }
)

2. JSON Format (Excerpts from .to_dict())

Key fields from the structured JSON export for integration with APIs or dashboards:

{
  "row_count": 4,
  "column_count": 3,
  "memory_usage": 733,
  "duplicate_rows": 0,
  "duplicate_ratio": 0.0,
  "columns": {
    "user_id": {
      "dtype": "int64",
      "semantic_type": "identifier",
      "null_count": 0,
      "unique_ratio": 1.0
    },
    "email": {
      "dtype": "string",
      "semantic_type": "categorical",
      "null_count": 1,
      "unique_ratio": 0.666667,
      "warnings": ["contains_nulls"]
    },
    "score": {
      "dtype": "float64",
      "semantic_type": "numeric",
      "null_count": 1,
      "mean": 87.9,
      "min": 85.5,
      "max": 90.0,
      "warnings": ["contains_nulls"]
    }
  }
}

3. Example Summary Table

A manually formatted Markdown table representing the core metrics:

Metric Value
Row Count 4
Column Count 3
Memory Usage 733 bytes
Duplicates 0 (0.0%)

🗺️ Roadmap

Version Focus Status
v1.0 Stable release · cross-platform wheels · CI/CD · PyPI publishing · Google Colab support ✅ Shipped
v1.1 Production readiness · release hardening · docs/tooling ✅ Shipped
v1.2 C++ pipeline optimization · speed parity with pandas · hash-based deduplication 🔨 Active
v1.3 Chunked / streaming processing · Parquet & JSON readers 📋 Planned
v1.4 Parallel column processing · SIMD string operations 💭 Exploring



💬 Community

Join the Arnio Discord Community for quick setup help, contributor onboarding, GSSoC 2026 coordination, feature discussion, and community updates.

Discord is for fast conversation and support. GitHub remains the source of truth for issue assignment, PR reviews, bugs, roadmap decisions, and releases.

Join Arnio Discord




🤝 Contribute

Arnio is a GSSoC 2026 project with a structured contributor backlog across beginner, intermediate, and advanced tracks.

You don't need C++ to contribute

Most new features are pure Python pipeline steps:

# 1. Write a function that takes a DataFrame and returns a DataFrame
def remove_special_chars(df, columns=None):
    cols = columns or df.select_dtypes("object").columns
    for col in cols:
        df[col] = df[col].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)
    return df

# 2. Register it
ar.register_step("remove_special_chars", remove_special_chars)

# 3. Write tests, open a PR. That's it.

If you do know C++

The biggest performance wins are in:

  • drop_duplicates — replacing std::ostringstream row serialization with proper hash-based comparisons
  • strip_whitespace — converting from copy-on-write to in-place mutation
  • Parallel column processingstd::thread across independent columns

Getting started

# macOS / Linux
git clone https://github.com/im-anishraj/arnio.git && cd arnio
make install   # pip install -e ".[dev]" + pre-commit
make test      # pytest with coverage
make lint      # ruff + black

# Windows
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v

PR titles must follow Conventional Commitsfeat:, fix:, docs:, chore:. Our release pipeline auto-generates changelogs from these.

For GSSoC contributors, please read GSSOC_GUIDE.md before asking to be assigned. It explains issue claiming, contribution levels, review expectations, and what maintainers look for in a strong PR. If you want a quick onboarding refresher, see the GSSoC FAQ. If you are new to Arnio terms, see the contributor glossary.

📖 Full Contributing Guide ·  GSSoC Guide ·  🐛 Open Issues ·  💬 Discussions ·  Discord




🚢 Release process

Arnio releases are automated through Release Please and GitHub Actions.

  1. Merge user-facing changes with Conventional Commit PR titles (feat:, fix:, docs:, or chore:) so Release Please can choose the version bump and changelog entries.
  2. Review and merge the Release Please PR on main; this updates release metadata and creates the GitHub release and tag.
  3. Confirm the Build & Publish Wheels workflow succeeds for the release tag. It builds the sdist and wheels, then publishes to PyPI through Trusted Publishing.
  4. Smoke test the published package in a clean environment:
python -m venv /tmp/arnio-smoke
source /tmp/arnio-smoke/bin/activate
python -m pip install -U pip
python -m pip install arnio
printf 'name,revenue\n Ada,10\n' > /tmp/arnio-smoke.csv
python - <<'PY'
import arnio as ar
print(ar.__version__)
print(ar.scan_csv("/tmp/arnio-smoke.csv"))
PY
  1. Verify the GitHub release, PyPI project page, and install command all show the expected version before announcing the release.

If any publish or smoke-test step fails, leave the failed tag and GitHub release in place until maintainers agree on the recovery plan.




📐 Project structure

arnio/
├── cpp/
│   ├── include/arnio/      # C++ headers — types, column, frame, csv_reader, cleaning
│   └── src/                 # C++ implementations (~30 KB of compiled logic)
├── bindings/
│   └── bind_arnio.cpp       # pybind11 module — the Python↔C++ bridge
├── arnio/
│   ├── __init__.py          # Public API surface
│   ├── io.py                # read_csv, scan_csv
│   ├── cleaning.py          # Python wrappers for C++ cleaning functions
│   ├── pipeline.py          # Step registry + pipeline executor
│   ├── convert.py           # to_pandas (zero-copy), from_pandas
│   ├── frame.py             # ArFrame — lightweight C++ Frame wrapper
│   └── exceptions.py        # ArnioError, UnknownStepError, CsvReadError, TypeCastError
├── tests/                   # pytest suite — CSV, cleaning, pipeline, conversions
├── benchmarks/              # Reproducible arnio vs pandas benchmark
├── examples/                # basic_usage.py, auto_clean_tutorial.py, custom_step.py
└── website/                 # Project website — arnio.vercel.app



Arnio



Stop writing cleaning scripts. Declare clean data.


DownloadsStarsForksWebsiteDiscord


Built with C++ and pybind11 · Licensed under MIT · Maintained by @im-anishraj

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arnio-1.7.0.tar.gz (2.0 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

arnio-1.7.0-cp313-cp313-win_amd64.whl (203.6 kB view details)

Uploaded CPython 3.13Windows x86-64

arnio-1.7.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (251.8 kB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

arnio-1.7.0-cp313-cp313-macosx_11_0_arm64.whl (192.8 kB view details)

Uploaded CPython 3.13macOS 11.0+ ARM64

arnio-1.7.0-cp313-cp313-macosx_10_13_x86_64.whl (207.8 kB view details)

Uploaded CPython 3.13macOS 10.13+ x86-64

arnio-1.7.0-cp312-cp312-win_amd64.whl (203.6 kB view details)

Uploaded CPython 3.12Windows x86-64

arnio-1.7.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (251.7 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

arnio-1.7.0-cp312-cp312-macosx_11_0_arm64.whl (192.7 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

arnio-1.7.0-cp312-cp312-macosx_10_13_x86_64.whl (207.8 kB view details)

Uploaded CPython 3.12macOS 10.13+ x86-64

arnio-1.7.0-cp311-cp311-win_amd64.whl (201.2 kB view details)

Uploaded CPython 3.11Windows x86-64

arnio-1.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (250.4 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

arnio-1.7.0-cp311-cp311-macosx_11_0_arm64.whl (191.8 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

arnio-1.7.0-cp311-cp311-macosx_10_9_x86_64.whl (206.0 kB view details)

Uploaded CPython 3.11macOS 10.9+ x86-64

arnio-1.7.0-cp310-cp310-win_amd64.whl (200.5 kB view details)

Uploaded CPython 3.10Windows x86-64

arnio-1.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (249.1 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

arnio-1.7.0-cp310-cp310-macosx_11_0_arm64.whl (190.8 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

arnio-1.7.0-cp310-cp310-macosx_10_9_x86_64.whl (204.7 kB view details)

Uploaded CPython 3.10macOS 10.9+ x86-64

arnio-1.7.0-cp39-cp39-win_amd64.whl (207.1 kB view details)

Uploaded CPython 3.9Windows x86-64

arnio-1.7.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (249.3 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

arnio-1.7.0-cp39-cp39-macosx_11_0_arm64.whl (190.9 kB view details)

Uploaded CPython 3.9macOS 11.0+ ARM64

arnio-1.7.0-cp39-cp39-macosx_10_9_x86_64.whl (204.8 kB view details)

Uploaded CPython 3.9macOS 10.9+ x86-64

File details

Details for the file arnio-1.7.0.tar.gz.

File metadata

  • Download URL: arnio-1.7.0.tar.gz
  • Upload date:
  • Size: 2.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.7.0.tar.gz
Algorithm Hash digest
SHA256 1b7a06600a0717d66be1671695a8549e5cbf4d6b1192ce9625a36f157be6cf1d
MD5 bb937e08e111f367798fcbcdc9c19f9c
BLAKE2b-256 a6b5914a99dc4609e1a538ccf9eee853fa7b1ef8d9a86063502d52f08ff6ce92

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: arnio-1.7.0-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 203.6 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.7.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 df3c9bf3ea3dcb755f55d644d5f904f7fa733304a26c34bf434c6b8a8dcbb317
MD5 de8240939d060ac3c9c2652b920d6933
BLAKE2b-256 8b1de3eaddb39ced58288c1bb80bad999f347a04226d244905d954c365888339

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 47a3c3b23691ba5d6e9021106a97e8c4bc5370b1546bd6d8bcdd9d35bb69a59a
MD5 d765665c11f89f98f74741d403d5afe4
BLAKE2b-256 6dbb34f8357d2ce3921a74d485c995094cd9e858d59b46e56217e8bec5f30f7e

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 de1aaabdb622103e680609f7f34546df895870e58fac9832199dfdc2ebfacc66
MD5 470be202a180a4a0b9d1c14b4b140041
BLAKE2b-256 76925f3cd2959af1b0e36cf54d0f7dccfee9f9c3c6eaae8912ff11bcbce3910d

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp313-cp313-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp313-cp313-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 7655321d12c6a97e3358d49a157627b122eff25c54c222c60667709fcfa84372
MD5 ddeebde5e96b26b7eb8f2820175d3580
BLAKE2b-256 1ef651c9a5b7f295eff01bae6d0d9d00f8124ea971e6922a1190a3039595117b

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: arnio-1.7.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 203.6 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.7.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 1ac740d8cb81e36350e296d2014ee053036f43694c821005550465c03cfc7bdd
MD5 f992cb461db530df0dfb12abb0ce27fa
BLAKE2b-256 cb52415d352ff08c9d3038ffaa173543432bd9dab234e06753859d48c9bea3e4

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 591eeb538e3e78745c34bfbc27d174e5adbb4a7d1b16dd410343a9fd6dc3205d
MD5 425640947e2082afcfd82dbe8fc2e2d8
BLAKE2b-256 6fdd79878dbe3c763042f77a1a531864603b41604511fbc1ba269a3cd92cfea0

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 db8231a4dd0c202c7f26677c5b8fc2e4395813ed77cd74f5f3b33c2b200ace41
MD5 e95b29111398eeedb0326c85c0404d04
BLAKE2b-256 cd69f821355951bbe945230fb5daa6917cc4726e6049c67fa233891bfa5021bf

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp312-cp312-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 667a940aa97a4d0f9092e32814f7cec5006243f5aa4f59f5c6731a0f34fdbb95
MD5 a8513c2cd9685b1170af746fe7ceb4ba
BLAKE2b-256 b3533bd4061886df826dd250cb17b4659f0ba9eb5eebfaaf0d9007715f68e5f9

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: arnio-1.7.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 201.2 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.7.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 7fde911b4fb9da65cd0eb16d18965be379b05bf6b9dc117e3faa66a15fc62998
MD5 6953408c35e6e48e8a5e557035e84ab4
BLAKE2b-256 d79548c3ca495ee3adcbc84f1e317c173eafebc018afe47e33efcc850531b31c

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 3cecd3efef3a863b5f536eeae4f834884398aba2246ff4659620a68968b443c7
MD5 1bdd4ebff8bbe92669470fbe1c7262b5
BLAKE2b-256 7eb061a0b6e07765a75f9206b5f70cfde78125e23b7d51ffed85f05a8bf09d19

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8b8a1ae65615bf1b5b67e097572d18f6f779de5a8fc74bc739ffad830d7c7839
MD5 dbe7ca720e1230592b2e68f854554f99
BLAKE2b-256 9e82be6d4a80e9a65c8bb0b3a24b4c1ec260ae976eef3f503b5ea2758e313c38

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 2107ddbf94ba3ff14f231c518fbf9d14e8ef2701a30c9bc31c05e00d693ff3d4
MD5 388b9938100d85ca16a81ce6b71b901a
BLAKE2b-256 e97cd10ae2baba168cced6e919dbc4225e0840e438928697577c4cb8a3cf75de

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: arnio-1.7.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 200.5 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.7.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 e7ed118d9d3b429915a8def6aebab2d1feced0233cde27220e2146ceea04392d
MD5 ed663e2d24f20dba550c4b6dd13922c1
BLAKE2b-256 92ddf6af751f86acaab79302433bcad438fb33bc12b569dff507af6bed8a5b8c

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 df4c1bd03b998e2d07845e5c72d83fb23bc35d25fbaf021072e748d701b740f2
MD5 4976747c118f640b63ca5dce2dc59ddc
BLAKE2b-256 bf7e6ca3c56d3d3c41f04d899af0494db7e5017891b7a6e4a88552d771737557

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 c0169c4fb5924ab285bf1db798731ba634d8d0739447a59e5c26747b4219304e
MD5 e9c55185c999c966a500aa4f4ff7e7e8
BLAKE2b-256 5f4491e6566731142df86db9a806b9543d0bd08a5c8b7cedd7245067c8f86fe6

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 194bcb2aa9bc926d4d3436e1518333ea3b7c0ee3c4826c463e91136816bf5162
MD5 4c4bc5705f9ea57cbcc57896841901a2
BLAKE2b-256 a2b802241cc7b2c2cfaf1134667cf08ed531a111e64bef944884a9e53a4a29b5

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: arnio-1.7.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 207.1 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.7.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 588e18c0dfef5787244dde8ccbcb55a27ac0afa1168ca5baf704d45451fbe2c3
MD5 393add12e424e1bf9146a620f8f45034
BLAKE2b-256 701da15d74e0d1dcd70826d8564ca52edfcb775c132f7ef163fce1b75c80b082

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.7.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 37ed84855c72426b586a43baef27f2cc748d082d15e6a872fd25b8dcbd5442f8
MD5 610e077139686ba1cddae533aa05b053
BLAKE2b-256 69637e81f673305fdeff8505b2020ac23cb9f0f153d765c6f11251acbd368ad9

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

  • Download URL: arnio-1.7.0-cp39-cp39-macosx_11_0_arm64.whl
  • Upload date:
  • Size: 190.9 kB
  • Tags: CPython 3.9, macOS 11.0+ ARM64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.7.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a96c0ab8e424afb4622bba3e8771f3aaf55839b58abf1984f60396f176343355
MD5 c679e1e8cc4d6ff1a75641b75bf1a444
BLAKE2b-256 fca14562e65c1f1052614f7b41e5679e4aab3ef18a33b5c1c28ee0e751155042

See more details on using hashes here.

File details

Details for the file arnio-1.7.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: arnio-1.7.0-cp39-cp39-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 204.8 kB
  • Tags: CPython 3.9, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.7.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 e3c66e3715703f71c32fc10ffe4f36b4f2cf3764d1cd7a116f74655df0716bda
MD5 3eec4a65f8b13b236e3ab3c66ec5d610
BLAKE2b-256 2f304f1a8dab08ed632fdb593d260cff5be02dfed4941b9450597ff3f13b57f9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page