
C++ accelerated data preparation for pandas and the Python data stack



Arnio



Fast data preparation for the Python data stack.


Arnio is a compiled C++ data preparation engine for messy CSV and pandas workflows.
It parses, infers types, strips whitespace, deduplicates, validates, and profiles data —
then hands clean results back to the tools you already use.
Use Arnio before and alongside pandas, NumPy, scikit-learn, DuckDB, and Arrow.





pip install arnio

Colab install smoke test: COLAB_SMOKE_TEST.md


Quickstart · Integrations · Why Arnio · Architecture · Benchmarks · Community · Contribute




⚡ Quickstart

A simple workflow in just a few steps.

New to Arnio? Start with the pandas workflow example below before exploring advanced pipelines.

import arnio as ar

# Load CSV directly through C++ — no Python parsing overhead
frame = ar.read_csv("messy_sales_data.csv")

# Declare what clean data looks like — arnio handles the rest
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# Out comes a standard pandas DataFrame — use it like you always have
df = ar.to_pandas(clean)

# Use copy=True when you need defensive pandas-owned buffers
safe_df = ar.to_pandas(clean, copy=True)

Already have a pandas DataFrame? Use Arnio through the df.arnio accessor in your existing pandas workflow:

import pandas as pd
import arnio as ar

df = pd.read_csv("messy_sales_data.csv")

clean_df = df.arnio.clean([
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_duplicates",),
])

report = clean_df.arnio.profile()

Select specific columns

Use select_columns() to create a new ArFrame with only the required columns before converting to pandas.

selected = frame.select_columns(["name", "revenue"])

print(selected.columns)
# ['name', 'revenue']

Every step above executes in C++. Your Python code is a configuration — not the execution engine.


📸 Peek at a 100 GB file without loading it

scan_csv reads only the header and a sample of rows to infer the schema. The full file is never loaded into memory.

schema = ar.scan_csv("100GB_file.csv")
# {'id': 'int64', 'name': 'string', 'is_active': 'bool', 'revenue': 'float64'}

Useful for exploring datasets before committing memory.
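The idea behind a schema scan can be sketched with the standard library alone. The helper below, sketch_scan_csv, is hypothetical and is not Arnio's implementation; the sample_rows parameter and the inference order (int64, then float64, then bool, then string) are assumptions chosen for illustration.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Pick the narrowest type that fits every non-empty sample value."""
    non_empty = [v.strip() for v in values if v.strip() != ""]
    if not non_empty:
        return "null"

    def all_parse(parser):
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int64"
    if all_parse(float):
        return "float64"
    if all(v.lower() in ("true", "false") for v in non_empty):
        return "bool"
    return "string"


def sketch_scan_csv(handle, sample_rows=100):
    """Read the header plus at most `sample_rows` rows, never the whole file."""
    reader = csv.reader(handle)
    header = next(reader)
    sample = list(islice(reader, sample_rows))
    columns = {name: [] for name in header}
    for row in sample:
        for name, cell in zip(header, row):
            columns[name].append(cell)
    return {name: infer_type(cells) for name, cells in columns.items()}


text = "id,name,is_active,revenue\n1,Ada,true,10.5\n2,Linus,false,\n"
schema = sketch_scan_csv(io.StringIO(text))
print(schema)
```

Because only a bounded sample is inspected, a column whose later rows break the inferred type would be misclassified; a full read still has to handle that case.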

👀 Preview rows without pandas conversion or full-column Python list materialization

preview() reads only the first n rows directly from the C++ frame — no pandas conversion triggered.

frame = ar.read_csv("huge_file.csv")

print(frame.preview())      # first 5 rows (default)
print(frame.preview(n=10))  # first 10 rows

Raises ValueError for invalid n (zero, negative, or non-integer).

🧩 Add custom steps without touching C++

Register any Python function as a pipeline step. It receives a DataFrame, returns a DataFrame.

def remove_outliers(df, column="revenue", threshold=100_000):
    return df[df[column] <= threshold]

ar.register_step("remove_outliers", remove_outliers)

# Now use it in any pipeline alongside native C++ steps
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("remove_outliers", {"column": "revenue", "threshold": 50000}),
    ("drop_duplicates",),
])

Custom steps run through a pandas↔ArFrame conversion bridge. Prototype in Python, then optionally migrate hot paths to C++ for full speed.
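The registry pattern behind register_step and pipeline can be sketched in plain Python. Everything below (STEP_REGISTRY, register_step_sketch, run_pipeline_sketch) is a hypothetical stand-in that works on lists of dicts instead of DataFrames; it only illustrates how ("name",) and ("name", {kwargs}) tuples can be resolved to callables.

```python
STEP_REGISTRY = {}


def register_step_sketch(name, func):
    """Map a step name to a callable that takes rows and returns rows."""
    STEP_REGISTRY[name] = func


def run_pipeline_sketch(rows, steps):
    """Apply each (name,) or (name, kwargs) step in order."""
    for step in steps:
        name = step[0]
        kwargs = step[1] if len(step) > 1 else {}
        if name not in STEP_REGISTRY:
            raise KeyError(f"unknown step: {name!r}")
        rows = STEP_REGISTRY[name](rows, **kwargs)
    return rows


def strip_whitespace(rows):
    # Trim string cells; leave every other value untouched.
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def remove_outliers(rows, column="revenue", threshold=100_000):
    return [row for row in rows if row[column] <= threshold]


register_step_sketch("strip_whitespace", strip_whitespace)
register_step_sketch("remove_outliers", remove_outliers)

rows = [
    {"name": " Ada ", "revenue": 1_000},
    {"name": "Linus", "revenue": 75_000},
]
cleaned = run_pipeline_sketch(rows, [
    ("strip_whitespace",),
    ("remove_outliers", {"column": "revenue", "threshold": 50000}),
])
print(cleaned)
```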




🔗 Integrations

Arnio is designed to make the rest of the Python data stack more productive, not to replace it.

| Workflow | How Arnio helps |
| --- | --- |
| pandas | Clean, validate, and profile messy DataFrames through df.arnio. |
| NumPy | Prepare typed numeric data before array/modeling workflows. |
| scikit-learn | Use Arnio cleaning as a preprocessing layer before model training. |
| DuckDB / Arrow | Validate and prepare data before analytics and columnar exchange. |
| notebooks | Inspect quality issues and cleaning suggestions before analysis. |

Pandas accessor

df = pd.read_csv("raw_customers.csv")

clean_df = df.arnio.clean(drop_duplicates=True)
quality = clean_df.arnio.profile()
validation = clean_df.arnio.validate({
    "email": ar.Email(nullable=False),
    "age": ar.Int64(nullable=True, min=0),
})

This keeps pandas as the analysis tool while Arnio handles the preparation, quality, and validation layer.

Product direction: PROJECT_DIRECTION.md




🔍 Why Arnio exists

Every data project starts the same way:

df = pd.read_csv("data.csv")              # 💥 RAM spike — entire file as raw strings
df.columns = df.columns.str.strip()        # Why is this not automatic?
df["name"] = df["name"].str.strip()        # Python loop over every cell
df["name"] = df["name"].str.lower()        # Another Python loop
df = df.dropna()                           # Another pass
df = df.drop_duplicates()                  # Another pass

Six lines. Four full-data passes, each driven step by step from Python. This is fine for a Jupyter demo, but it is hard to compose and hard to keep consistent in production.

Arnio intercepts this entire pattern. It moves the preparation layer into a predictable pipeline, accelerates supported operations in C++, and gives you clean data for pandas, NumPy, scikit-learn, DuckDB, or notebooks.

Without Arnio

df = pd.read_csv(path)
df.columns = df.columns.str.strip()
for col in str_cols:
    df[col] = df[col].str.strip()
    df[col] = df[col].str.lower()
df = df.dropna(subset=["revenue"])
df = df.drop_duplicates()
# 6+ lines, multiple passes, pure Python

With Arnio

frame = ar.read_csv(path)
df = ar.to_pandas(ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_nulls", {"subset": ["revenue"]}),
    ("drop_duplicates",),
]))
# Declarative. Single pipeline. C++ execution.



🏗️ Architecture

Arnio is not a pandas wrapper. It's a separate runtime with its own data model.

┌──────────────────────────────────────────────────────────────┐
│  Your Python Code                                            │
│  frame = ar.read_csv("data.csv")                             │
│  clean = ar.pipeline(frame, [...])                           │
│  df = ar.to_pandas(clean)                                    │
└────────────────────────┬─────────────────────────────────────┘
                         │  pybind11 boundary
┌────────────────────────▼─────────────────────────────────────┐
│  C++ Runtime  (_arnio_cpp)                                   │
│                                                              │
│  ┌─────────────┐  ┌─────────────────┐  ┌──────────────────┐ │
│  │  CsvReader   │  │  Frame/Column   │  │  Cleaning Engine │ │
│  │  • RFC 4180  │  │  • Columnar     │  │  • drop_nulls    │ │
│  │  • BOM strip │  │  • std::variant │  │  • fill_nulls    │ │
│  │  • Type      │  │  • Bool null    │  │  • drop_dupes    │ │
│  │    inference │  │    masks        │  │  • strip_ws      │ │
│  │  • Quoted    │  │  • O(1) column  │  │  • normalize     │ │
│  │    fields    │  │    lookup       │  │  • rename/cast   │ │
│  └─────────────┘  └─────────────────┘  └──────────────────┘ │
│                                                              │
│  to_pandas() ──→ zero-copy NumPy buffer (numerics)           │
└──────────────────────────────────────────────────────────────┘

Design decisions that matter

| Decision | What it means |
| --- | --- |
| Columnar storage | Data lives in typed std::vectors (vector<int64_t>, vector<double>, vector<string>), not rows of variants. Cache-friendly and SIMD-ready. |
| Boolean null masks | Nulls are tracked in a separate vector<bool>, keeping data vectors dense. No sentinel values, no NaN tricks. |
| Two-pass CSV read | Pass 1 infers types across all rows. Pass 2 parses values directly into the correct typed column. No string→object→cast overhead. |
| Zero-copy bridge | to_pandas() exposes C++ memory directly via NumPy's buffer protocol where supported. Numeric columns preserve the fast zero-copy path by default, while copy=True requests defensive pandas-owned buffers. |
| Step registry | Pipeline steps map to C++ function pointers. Adding a new cleaning primitive is a single function + one registry entry. |

Full architecture documentation: ARCHITECTURE.md
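To make the "dense values plus separate null mask" decision concrete, here is a minimal Python analogue. Float64ColumnSketch is a hypothetical class, not part of Arnio; it uses the standard array module in place of a C++ vector<double> and a plain list in place of vector<bool>.

```python
from array import array


class Float64ColumnSketch:
    """Dense float values plus a parallel null mask (True means null)."""

    def __init__(self, raw_values):
        # Null slots still hold a placeholder so the data vector stays dense.
        # The mask, not the placeholder, is the source of truth for nullness.
        self.values = array("d", (0.0 if v is None else v for v in raw_values))
        self.is_null = [v is None for v in raw_values]

    def fill_nulls(self, value):
        """Write `value` into every null slot and clear its mask bit."""
        for i, null in enumerate(self.is_null):
            if null:
                self.values[i] = value
                self.is_null[i] = False

    def to_list(self):
        """Materialize Python objects, restoring None where the mask is set."""
        return [None if null else v for v, null in zip(self.values, self.is_null)]


revenue = Float64ColumnSketch([10.5, None, 3.0])
print(revenue.to_list())
revenue.fill_nulls(0.0)
print(revenue.to_list())
```

Keeping the mask separate means a legitimate 0.0 or NaN in the data is never confused with a missing value.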




🏎️ Benchmarks

Reference environment: Ubuntu, Python 3.12, synthetic messy CSV inputs.
Reproduce: make benchmark — generates deterministic tall and wide datasets and runs both engines.

To reproduce the published numbers from a fresh checkout:

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
python benchmarks/generate_data.py
python benchmarks/benchmark_vs_pandas.py

benchmarks/generate_data.py uses deterministic NumPy seeds, so every run creates the same benchmarks/benchmark_1m.csv tall input and benchmarks/benchmark_wide.csv wide input. The benchmark then executes three pandas runs and three arnio runs for each case, printing average wall-clock time from time.perf_counter() and peak Python allocation from tracemalloc. For cleaner comparisons, close other memory-heavy processes and run the script from the repository root after installing the same Python, pandas, NumPy, compiler, and arnio commit you want to compare.
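The measurement approach described above (repeated runs, average wall-clock time from time.perf_counter(), peak Python allocation from tracemalloc) can be sketched as follows. The measure helper and the workload function are hypothetical stand-ins, not the code in benchmarks/benchmark_vs_pandas.py.

```python
import time
import tracemalloc


def measure(func, repeats=3):
    """Return (average seconds, peak Python-allocated bytes) over `repeats` runs."""
    durations = []
    peak_bytes = 0
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak_bytes = max(peak_bytes, peak)
    return sum(durations) / len(durations), peak_bytes


def workload():
    # Stand-in for a pandas or arnio cleaning run.
    return [str(i).strip().lower() for i in range(100_000)]


avg_seconds, peak = measure(workload)
print(f"Exec Time (avg) {avg_seconds:.3f}s  Peak RAM {peak / 1_000_000:.1f}MB")
```

Note that tracemalloc only tracks memory allocated through Python's allocators, so memory allocated natively inside a compiled extension may not appear in this figure.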

Expected output format:

Tall CSV (1,000,000 rows x 12 columns)
Metric                     pandas        arnio
────────────────────────────────────────────
Exec Time (avg)       4.73s         5.75s
Peak RAM               211MB         212MB
Speed: 0.8x | RAM: -1% reduction

Wide CSV (5,000 rows x 256 columns)
Metric                     pandas        arnio
────────────────────────────────────────────
Exec Time (avg)       ...s          ...s
Peak RAM              ...MB         ...MB
Speed: ...x | RAM: ...% reduction

Small differences are expected across CPUs, operating systems, compilers, Python builds, and pandas/NumPy versions. If you share benchmark results in an issue or PR, include your OS, Python version, CPU model, pandas/NumPy versions, arnio commit, and the full command output so maintainers can compare like for like.

Arnio is near memory parity in the reference benchmark (212MB vs. 211MB in the sample output above) but currently slower than pandas on the tall CSV (5.75s vs. 4.73s), while replacing ad-hoc Python string loops with a compiled, declarative pipeline. Validate memory and speed on your own workload. The execution time gap is a known, active optimization target: the current drop_duplicates and strip_whitespace implementations use unoptimized row-key serialization.

What's already won

  • Native C++ parsing eliminates Python memory spikes
  • Columnar storage matches pandas' internal efficiency
  • Declarative API eliminates .apply() spaghetti
  • Zero-copy bridge for numeric conversions

🎯 What's being optimized

  • drop_duplicates: replace string serialization with hash-based comparisons
  • strip_whitespace: in-place mutation instead of copy-on-write
  • Parallel column processing via std::thread
  • Help close the gap →
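As a rough illustration of hash-based deduplication, the sketch below hashes whole rows (Python tuples) instead of serializing them to strings. drop_duplicates_sketch is hypothetical, and the meaning of keep="none" (drop every row that has a duplicate) is an assumption modeled on pandas' keep=False.

```python
def drop_duplicates_sketch(rows, keep="first"):
    """Deduplicate hashable rows; keep is 'first', 'last', or 'none'."""
    if keep not in ("first", "last", "none"):
        raise ValueError("keep must be 'first', 'last', or 'none'")

    # One pass to count occurrences and remember each row's last position.
    counts = {}
    last_index = {}
    for i, row in enumerate(rows):
        counts[row] = counts.get(row, 0) + 1
        last_index[row] = i

    # Second pass emits rows in their original order.
    result = []
    seen = set()
    for i, row in enumerate(rows):
        if keep == "first" and row not in seen:
            seen.add(row)
            result.append(row)
        elif keep == "last" and last_index[row] == i:
            result.append(row)
        elif keep == "none" and counts[row] == 1:
            result.append(row)
    return result


rows = [(1002, "Prasoon"), (1001, "Ishan"), (1002, "Prasoon")]
print(drop_duplicates_sketch(rows, keep="first"))
print(drop_duplicates_sketch(rows, keep="last"))
print(drop_duplicates_sketch(rows, keep="none"))
```

Both passes are linear in the number of rows on average, since set and dict lookups are hash-based.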

🧠 Auto Clean Memory Benchmark

To measure the peak memory and execution time of the auto_clean pipeline using realistic dataset sizes:

python benchmarks/benchmark_auto_clean_memory.py --rows 100000

This script generates a reproducible synthetic dataset with mixed column types (strings, ints, floats, booleans, nulls, and duplicates) and measures:

  • ar.read_csv performance
  • ar.auto_clean(mode="safe") performance (low-risk cleanup like whitespace trimming)
  • ar.auto_clean(mode="strict") performance (includes type casting and deduplication)

The dataset is regenerated deterministically unless --reuse-file is provided. Each auto_clean benchmark run reloads the dataset to avoid mutation or caching effects between runs.

Options:

  • --repeat N runs each operation multiple times and reports average (and min/max range).
  • --seed N changes the deterministic dataset seed.
  • --reuse-file reuses an existing dataset file instead of regenerating it.
  • --keep-file keeps the generated CSV (otherwise it is removed at the end).

Expected output format:

Operation                    Time(s)     Peak Py(MiB)
--------------------------------------------------------------------
ar.read_csv           0.042 (0.041-0.044)    4.52 (4.50-4.60)
ar.auto_clean(safe)   0.012 (0.011-0.013)    0.15 (0.14-0.16)
ar.auto_clean(strict) 0.035 (0.034-0.036)    1.20 (1.18-1.22)
--------------------------------------------------------------------
Total avg (Read+Strict)       0.077             4.52



🧰 Cleaning primitives

Most operations below run natively in C++. The current filter_rows step uses the Python pipeline backend and may be optimized in C++ later.

| Primitive | What it does | Example |
| --- | --- | --- |
| drop_nulls | Remove rows with null/empty values | ar.drop_nulls(frame, subset=["age"]) |
| keep_rows_with_nulls | Keep only rows that contain at least one null | ar.keep_rows_with_nulls(frame, subset=["age"]) |
| validate_columns_exist | Fail early when required columns are missing | ar.validate_columns_exist(frame, ["age"]) |
| filter_rows | Filter rows using comparison operators | ar.filter_rows(frame, column="age", op=">", value=18) |
| fill_nulls | Replace nulls with a scalar | ar.fill_nulls(frame, 0, subset=["revenue"]) |
| drop_duplicates | Deduplicate rows (first/last/none) | ar.drop_duplicates(frame, keep="first") |
| drop_constant_columns | Remove columns with only one unique value | ar.drop_constant_columns(frame) |
| clip_numeric | Clip numeric values to lower and/or upper bounds | ar.clip_numeric(frame, lower=0, upper=100) |
| strip_whitespace | Trim leading/trailing spaces from strings | ar.strip_whitespace(frame) |
| normalize_case | Force lower/upper/title case | ar.normalize_case(frame, case_type="title") |
| rename_columns | Rename columns via mapping | ar.rename_columns(frame, {"old": "new"}) |
| cast_types | Cast column types | ar.cast_types(frame, {"age": "int64"}) |
| round_numeric_columns | Round numeric columns (non-numeric columns in subset ignored safely) | ar.round_numeric_columns(frame, decimals=2) |
| clean | Convenience shorthand | ar.clean(frame, drop_nulls=True) |
| safe_divide_columns | Divide one column by another, handling zero/null denominators | ar.safe_divide_columns(frame, numerator="revenue", denominator="cost", output_column="ratio") |
| trim_column_names | Strip leading/trailing whitespace from column names | ar.trim_column_names(frame) |

ArFrame.select_dtypes — type-based column selection

Returns a new ArFrame containing only the columns whose dtype matches the filter. Raises ValueError if no columns match.

frame = ar.read_csv("data.csv")

# Keep only numeric columns
numeric = frame.select_dtypes(include=["int64", "float64"])

# Drop string columns
without_strings = frame.select_dtypes(exclude="string")

Valid dtype strings: "int64", "float64", "string", "bool", "null"

  • At least one of include or exclude must be given — raises ValueError otherwise.
  • include and exclude must not overlap — raises ValueError if they share a dtype.
  • Unknown dtype strings raise ValueError with a list of valid options.
  • Raises ValueError when no columns match (never returns an empty frame silently).
  • Column order in the result always matches the original frame.
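The rules above can be expressed as a small pure-Python sketch. select_dtypes_sketch is hypothetical and operates on a name-to-dtype mapping instead of an ArFrame; the order of the validation checks and the error messages are assumptions.

```python
VALID_DTYPES = ("int64", "float64", "string", "bool", "null")


def select_dtypes_sketch(schema, include=None, exclude=None):
    """Return column names from `schema` (name -> dtype) that pass the filter."""

    def as_set(arg):
        # Accept a single dtype string or any iterable of dtype strings.
        if arg is None:
            return set()
        return {arg} if isinstance(arg, str) else set(arg)

    include, exclude = as_set(include), as_set(exclude)
    if not include and not exclude:
        raise ValueError("at least one of include or exclude must be given")
    unknown = (include | exclude) - set(VALID_DTYPES)
    if unknown:
        raise ValueError(f"unknown dtypes {sorted(unknown)}; valid: {VALID_DTYPES}")
    if include & exclude:
        raise ValueError("include and exclude must not overlap")

    # Iterating the mapping preserves the original column order.
    selected = [
        name for name, dtype in schema.items()
        if (not include or dtype in include) and dtype not in exclude
    ]
    if not selected:
        raise ValueError("no columns match the requested dtypes")
    return selected


schema = {"id": "int64", "name": "string", "revenue": "float64"}
print(select_dtypes_sketch(schema, include=["int64", "float64"]))
print(select_dtypes_sketch(schema, exclude="string"))
```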

Or compose the cleaning primitives into a pipeline:

clean = ar.pipeline(frame, [
    ("validate_columns_exist", {"columns": ["name", "city", "revenue"]}),
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": "unknown", "subset": ["city"]}),
    ("drop_duplicates", {"keep": "first"}),
])

🔎 Filter rows inside pipelines

Use filter_rows to keep only rows matching a condition.

clean = ar.pipeline(frame, [
    ("filter_rows", {
        "column": "revenue",
        "op": ">=",
        "value": 1000
    }),
])

Supported operators:

  • >
  • <
  • >=
  • <=
  • ==
  • !=

Works with:

  • integers
  • floats
  • strings
  • booleans
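A comparison-operator filter like this is commonly built on a lookup table from operator strings to functions. The sketch below uses Python's operator module; filter_rows_sketch is hypothetical, and dropping rows whose cell is null is an assumption, since the null behavior of filter_rows is not described here.

```python
import operator

OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def filter_rows_sketch(rows, column, op, value):
    """Keep rows where `row[column] <op> value` holds; null cells never match."""
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator {op!r}; use one of {list(OPERATORS)}")
    compare = OPERATORS[op]
    return [
        row for row in rows
        if row[column] is not None and compare(row[column], value)
    ]


rows = [
    {"name": "a", "revenue": 500},
    {"name": "b", "revenue": 1000},
    {"name": "c", "revenue": None},
]
print(filter_rows_sketch(rows, column="revenue", op=">=", value=1000))
print(filter_rows_sketch(rows, column="name", op="!=", value="a"))
```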

🔎 Isolate rows with null values

Use keep_rows_with_nulls to audit incomplete data — keep only rows that have at least one null.

frame = ar.read_csv("data.csv")

# Keep all rows that have at least one null anywhere
nulls = ar.keep_rows_with_nulls(frame)

# Keep rows where specifically 'age' or 'score' is null
nulls = ar.keep_rows_with_nulls(frame, subset=["age", "score"])

# Works inside a pipeline too
result = ar.pipeline(frame, [
    ("keep_rows_with_nulls", {"subset": ["age"]}),
])

Useful for data auditing — inspect what's missing before deciding how to fill or drop.
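The selection rule itself is simple enough to sketch in a few lines. keep_rows_with_nulls_sketch below is a hypothetical pure-Python equivalent over a list of dicts, with None standing in for a null cell.

```python
def keep_rows_with_nulls_sketch(rows, subset=None):
    """Keep rows with at least one None in `subset` (all columns if None)."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        if any(row[col] is None for col in columns):
            kept.append(row)
    return kept


rows = [
    {"age": 30, "score": 1.0},
    {"age": None, "score": 2.0},
    {"age": 41, "score": None},
]
print(keep_rows_with_nulls_sketch(rows))
print(keep_rows_with_nulls_sketch(rows, subset=["age"]))
```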


🔢 Safe column division

Divide one column by another while handling division by zero and null denominators explicitly:

result = ar.safe_divide_columns(
    frame,
    numerator="revenue",
    denominator="cost",
    output_column="ratio",
    fill_value=0.0,  # used when denominator is zero or null
)

When the denominator is zero or null, the result is replaced with fill_value (default 0.0) instead of raising an error or producing NaN/Inf.
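The fill_value rule can be sketched element by element. safe_divide_sketch is a hypothetical helper over two Python lists; it covers only zero and null denominators as described above and leaves null numerators out of scope.

```python
def safe_divide_sketch(numerators, denominators, fill_value=0.0):
    """Element-wise division; zero or null denominators yield fill_value."""
    result = []
    for num, den in zip(numerators, denominators):
        if den is None or den == 0:
            result.append(fill_value)
        else:
            result.append(num / den)
    return result


revenue = [100.0, 50.0, 30.0]
cost = [20.0, 0.0, None]
print(safe_divide_sketch(revenue, cost))
```

Choosing a fill_value that cannot occur as a real ratio (for example a negative number for non-negative inputs) makes the replaced rows easy to find later.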



📊 Pandas Dtype Support Matrix

This table helps users understand which pandas dtypes and workflows are fully supported, partially supported, unsupported, or planned.

If a dtype is partially supported, users may need conversion before processing. Unsupported dtypes should raise clear errors where applicable.

| Pandas Dtype | Support Status | Notes |
| --- | --- | --- |
| int64 | ✅ Supported | Fully supported with native C++ columnar storage |
| float64 | ✅ Supported | Fully supported with zero-copy conversion where possible |
| bool | ✅ Supported | Natively supported boolean type |
| string | ✅ Supported | Recommended over object dtype for text workflows |
| datetime64[ns] | ❌ Unsupported | No native datetime parsing or conversion support yet |
| category | ⚠️ Limited | Converted to string/object during processing |
| object (mixed columns) | ⚠️ Limited | Mixed object columns may coerce to string and reduce type inference reliability |
| nullable pandas dtypes (Int64, boolean) | ⚠️ Limited | Supported through pandas extension dtypes with null-mask handling |
| timedelta64[ns] | ❌ Unsupported | Not currently supported |

Notes

  • Numeric columns are optimized for zero-copy conversion between C++ and pandas where supported.
  • Pass copy=True to to_pandas() when downstream pandas code needs defensive pandas-owned column buffers.
  • Boolean conversion is already copied by the binding because std::vector<bool> cannot be exposed as a zero-copy NumPy buffer in the current implementation.
  • Columns with null masks may require copies so pandas can apply nullable values safely.
  • String columns require Python string object creation during to_pandas() conversion.
  • Mixed object columns may reduce type inference accuracy and may require preprocessing.
  • Unsupported dtypes should raise clear user-facing errors instead of silent failures.



🧠 Data quality engine

Arnio includes built-in profiling and cleaning suggestions, so you can understand a dataset before you analyze it in pandas.

report = ar.profile(frame)
print(report.summary())

suggestions = ar.suggest_cleaning(frame)
clean = ar.pipeline(frame, suggestions)

For production data contracts:

schema = ar.Schema({
    "id": ar.Int64(nullable=False, unique=True),
    "email": ar.Email(nullable=False),
    "username": ar.String(min_length=3, max_length=20),
    "revenue": ar.Float64(nullable=True, min=0),
})

result = ar.validate(frame, schema)
if not result.passed:
    summary = result.summary()
    print(summary["issues_by_rule"])
    print(summary["issues_by_column"])
    print(summary["issues_by_column_and_rule"])
    print(result.to_pandas())
    print(result.to_markdown(max_issues=10))

ValidationResult.to_markdown() is useful in CI logs, GitHub comments, or data quality reports because it renders a compact validation summary plus a GitHub-friendly issue table. Severity counts are not included in summary() yet because ValidationIssue does not currently carry severity information.

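For automatic cleanup with a change report (mode="safe" is the low-risk option; mode="strict" also applies type casting and deduplication):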

clean, report = ar.auto_clean(frame, mode="strict", return_report=True)

This is the layer pandas does not try to own: profiling, data contracts, row-level validation issues, and safe cleaning suggestions for messy incoming datasets.


Beginner-friendly auto-clean tutorial

Use this workflow when you receive a small messy dataset and want to inspect what Arnio will change before applying it.

import arnio as ar
import pandas as pd

raw = pd.DataFrame(
    {
        "order_id": [1001, 1002, 1002, 1003, 1004],
        "customer": [" Ishan ", " Prasoon ", " Prasoon ", " Pranay ", " Dhruv "],
        "city": [" Paris ", "London", "London", " New York ", " Tokyo "],
    }
)

frame = ar.from_pandas(raw)

report = ar.profile(frame)
summary = report.summary()
print(summary)

suggestions = ar.suggest_cleaning(frame)
print(suggestions)
# [('strip_whitespace', {'subset': ['customer', 'city']}), ('drop_duplicates', {'keep': 'first'})]

safe = ar.auto_clean(frame)
strict = ar.auto_clean(frame, mode="strict")

Messy input:

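| order_id | customer | city |
| --- | --- | --- |
| 1001 | " Ishan " | " Paris " |
| 1002 | " Prasoon " | "London" |
| 1002 | " Prasoon " | "London" |
| 1003 | " Pranay " | " New York " |
| 1004 | " Dhruv " | " Tokyo " |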

Expected cleaned output with mode="strict":

| order_id | customer | city |
| --- | --- | --- |
| 1001 | Ishan | Paris |
| 1002 | Prasoon | London |
| 1003 | Pranay | New York |
| 1004 | Dhruv | Tokyo |

mode="safe" only trims whitespace. Use mode="strict" when you also want deterministic built-in cleanup such as exact duplicate removal.

See examples/auto_clean_tutorial.py for a runnable version of this walkthrough.
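For readers who want to see the difference between the two modes without installing anything, here is a plain-Python approximation on the same tutorial data. auto_clean_sketch is hypothetical: it models "safe" as whitespace trimming and "strict" as trimming plus exact duplicate removal, and it omits the type casting that strict mode is described as performing elsewhere in this document.

```python
rows = [
    (1001, " Ishan ", " Paris "),
    (1002, " Prasoon ", "London"),
    (1002, " Prasoon ", "London"),
    (1003, " Pranay ", " New York "),
    (1004, " Dhruv ", " Tokyo "),
]


def strip_row(row):
    """Trim string cells and leave other values unchanged."""
    return tuple(cell.strip() if isinstance(cell, str) else cell for cell in row)


def auto_clean_sketch(rows, mode="safe"):
    if mode not in ("safe", "strict"):
        raise ValueError("mode must be 'safe' or 'strict'")
    cleaned = [strip_row(row) for row in rows]
    if mode == "strict":
        # dict.fromkeys keeps the first occurrence and preserves order.
        cleaned = list(dict.fromkeys(cleaned))
    return cleaned


safe = auto_clean_sketch(rows)
strict = auto_clean_sketch(rows, mode="strict")
print(len(safe), len(strict))
for row in strict:
    print(row)
```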


Data Quality Reports

Arnio provides detailed profiling for datasets via ar.profile(). To generate the report shown in these examples, the following code was used:

import arnio as ar
import pandas as pd

# Sample dataset used for these examples
data = {
    "user_id": [101, 102, 103, 104],
    "email": ["test@arnio.ai", "invalid-email", None, "test@arnio.ai"],
    "score": [85.5, 90.0, None, 88.2]
}
df = ar.from_pandas(pd.DataFrame(data))
# Bounded profiling for large datasets (controls how many sample values are kept)
report = ar.profile(df, sample_size=5)

1. Terminal Representation (Simplified Example)

A simplified view of the standard string representation of the report object:

DataQualityReport(
    row_count=4,
    column_count=3,
    memory_usage=733,
    duplicate_rows=0,
    columns={
        'user_id': ColumnProfile(dtype='int64', semantic_type='identifier', unique_count=4),
        'email': ColumnProfile(dtype='string', semantic_type='categorical', null_count=1, unique_ratio=0.666667),
        'score': ColumnProfile(dtype='float64', semantic_type='numeric', mean=87.9, min=85.5, max=90.0)
    }
)

2. JSON Format (Excerpts from .to_dict())

Key fields from the structured JSON export for integration with APIs or dashboards:

{
  "row_count": 4,
  "column_count": 3,
  "memory_usage": 733,
  "duplicate_rows": 0,
  "duplicate_ratio": 0.0,
  "columns": {
    "user_id": {
      "dtype": "int64",
      "semantic_type": "identifier",
      "null_count": 0,
      "unique_ratio": 1.0
    },
    "email": {
      "dtype": "string",
      "semantic_type": "categorical",
      "null_count": 1,
      "unique_ratio": 0.666667,
      "warnings": ["contains_nulls"]
    },
    "score": {
      "dtype": "float64",
      "semantic_type": "numeric",
      "null_count": 1,
      "mean": 87.9,
      "min": 85.5,
      "max": 90.0,
      "warnings": ["contains_nulls"]
    }
  }
}

3. Example Summary Table

A manually formatted Markdown table representing the core metrics:

| Metric | Value |
| --- | --- |
| Row Count | 4 |
| Column Count | 3 |
| Memory Usage | 733 bytes |
| Duplicates | 0 (0.0%) |
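The per-column numbers in the examples above can be reproduced by hand. The sketch below is not ar.profile(); profile_column_sketch is a hypothetical helper that assumes unique_ratio is the number of distinct non-null values divided by the number of non-null values, which is consistent with the 0.666667 reported for email (two distinct values over three non-null entries).

```python
from statistics import mean

data = {
    "user_id": [101, 102, 103, 104],
    "email": ["test@arnio.ai", "invalid-email", None, "test@arnio.ai"],
    "score": [85.5, 90.0, None, 88.2],
}


def profile_column_sketch(values):
    """Compute null_count and unique_ratio, plus mean/min/max for numeric columns."""
    non_null = [v for v in values if v is not None]
    profile = {
        "null_count": len(values) - len(non_null),
        "unique_ratio": round(len(set(non_null)) / len(non_null), 6) if non_null else 0.0,
    }
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        profile.update(
            mean=round(mean(non_null), 6),
            min=min(non_null),
            max=max(non_null),
        )
    return profile


report = {name: profile_column_sketch(values) for name, values in data.items()}
for name, profile in report.items():
    print(name, profile)
```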

🗺️ Roadmap

| Version | Focus | Status |
| --- | --- | --- |
| v1.0 | Stable release · cross-platform wheels · CI/CD · PyPI publishing · Google Colab support | ✅ Shipped |
| v1.1 | Production readiness · release hardening · docs/tooling | ✅ Shipped |
| v1.2 | C++ pipeline optimization · speed parity with pandas · hash-based deduplication | 🔨 Active |
| v1.3 | Chunked / streaming processing · Parquet & JSON readers | 📋 Planned |
| v1.4 | Parallel column processing · SIMD string operations | 💭 Exploring |



💬 Community

Join the Arnio Discord Community for quick setup help, contributor onboarding, GSSoC 2026 coordination, feature discussion, and community updates.

Discord is for fast conversation and support. GitHub remains the source of truth for issue assignment, PR reviews, bugs, roadmap decisions, and releases.





🤝 Contribute

Arnio is a GSSoC 2026 project with a structured contributor backlog across beginner, intermediate, and advanced tracks.

You don't need C++ to contribute

Most new features are pure Python pipeline steps:

# 1. Write a function that takes a DataFrame and returns a DataFrame
def remove_special_chars(df, columns=None):
    cols = columns or df.select_dtypes("object").columns
    for col in cols:
        df[col] = df[col].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)
    return df

# 2. Register it
ar.register_step("remove_special_chars", remove_special_chars)

# 3. Write tests, open a PR. That's it.

If you do know C++

The biggest performance wins are in:

  • drop_duplicates — replacing std::ostringstream row serialization with proper hash-based comparisons
  • strip_whitespace — converting from copy-on-write to in-place mutation
  • Parallel column processing: std::thread across independent columns

Getting started

# macOS / Linux
git clone https://github.com/im-anishraj/arnio.git && cd arnio
make install   # pip install -e ".[dev]" + pre-commit
make test      # pytest with coverage
make lint      # ruff + black

# Windows
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v

PR titles must follow Conventional Commits: feat:, fix:, docs:, chore:. Our release pipeline auto-generates changelogs from these.

For GSSoC contributors, please read GSSOC_GUIDE.md before asking to be assigned. It explains issue claiming, contribution levels, review expectations, and what maintainers look for in a strong PR. If you want a quick onboarding refresher, see the GSSoC FAQ. If you are new to Arnio terms, see the contributor glossary.

📖 Full Contributing Guide ·  GSSoC Guide ·  🐛 Open Issues ·  💬 Discussions ·  Discord




🚢 Release process

Arnio releases are automated through Release Please and GitHub Actions.

  1. Merge user-facing changes with Conventional Commit PR titles (feat:, fix:, docs:, or chore:) so Release Please can choose the version bump and changelog entries.
  2. Review and merge the Release Please PR on main; this updates release metadata and creates the GitHub release and tag.
  3. Confirm the Build & Publish Wheels workflow succeeds for the release tag. It builds the sdist and wheels, then publishes to PyPI through Trusted Publishing.
  4. Smoke test the published package in a clean environment:
python -m venv /tmp/arnio-smoke
source /tmp/arnio-smoke/bin/activate
python -m pip install -U pip
python -m pip install arnio
printf 'name,revenue\n Ada,10\n' > /tmp/arnio-smoke.csv
python - <<'PY'
import arnio as ar
print(ar.__version__)
print(ar.scan_csv("/tmp/arnio-smoke.csv"))
PY
  5. Verify the GitHub release, PyPI project page, and install command all show the expected version before announcing the release.

If any publish or smoke-test step fails, leave the failed tag and GitHub release in place until maintainers agree on the recovery plan.




📐 Project structure

arnio/
├── cpp/
│   ├── include/arnio/      # C++ headers — types, column, frame, csv_reader, cleaning
│   └── src/                 # C++ implementations (~30 KB of compiled logic)
├── bindings/
│   └── bind_arnio.cpp       # pybind11 module — the Python↔C++ bridge
├── arnio/
│   ├── __init__.py          # Public API surface
│   ├── io.py                # read_csv, scan_csv
│   ├── cleaning.py          # Python wrappers for C++ cleaning functions
│   ├── pipeline.py          # Step registry + pipeline executor
│   ├── convert.py           # to_pandas (zero-copy), from_pandas
│   ├── frame.py             # ArFrame — lightweight C++ Frame wrapper
│   └── exceptions.py        # ArnioError, UnknownStepError, CsvReadError, TypeCastError
├── tests/                   # pytest suite — CSV, cleaning, pipeline, conversions
├── benchmarks/              # Reproducible arnio vs pandas benchmark
├── examples/                # basic_usage.py, auto_clean_tutorial.py, custom_step.py
└── website/                 # Project website — arnio.vercel.app



Arnio



Stop writing cleaning scripts. Declare clean data.




Built with C++ and pybind11 · Licensed under MIT · Maintained by @im-anishraj

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arnio-1.8.0.tar.gz (2.0 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

arnio-1.8.0-cp313-cp313-win_amd64.whl (205.2 kB view details)

Uploaded CPython 3.13Windows x86-64

arnio-1.8.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (253.3 kB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

arnio-1.8.0-cp313-cp313-macosx_11_0_arm64.whl (194.3 kB view details)

Uploaded CPython 3.13macOS 11.0+ ARM64

arnio-1.8.0-cp313-cp313-macosx_10_13_x86_64.whl (209.4 kB view details)

Uploaded CPython 3.13macOS 10.13+ x86-64

arnio-1.8.0-cp312-cp312-win_amd64.whl (205.2 kB view details)

Uploaded CPython 3.12Windows x86-64

arnio-1.8.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (253.3 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

arnio-1.8.0-cp312-cp312-macosx_11_0_arm64.whl (194.3 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

arnio-1.8.0-cp312-cp312-macosx_10_13_x86_64.whl (209.4 kB view details)

Uploaded CPython 3.12macOS 10.13+ x86-64

arnio-1.8.0-cp311-cp311-win_amd64.whl (202.8 kB view details)

Uploaded CPython 3.11Windows x86-64

arnio-1.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (252.0 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

arnio-1.8.0-cp311-cp311-macosx_11_0_arm64.whl (193.3 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

arnio-1.8.0-cp311-cp311-macosx_10_9_x86_64.whl (207.6 kB view details)

Uploaded CPython 3.11macOS 10.9+ x86-64

arnio-1.8.0-cp310-cp310-win_amd64.whl (202.1 kB view details)

Uploaded CPython 3.10Windows x86-64

arnio-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (250.6 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

arnio-1.8.0-cp310-cp310-macosx_11_0_arm64.whl (192.4 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

arnio-1.8.0-cp310-cp310-macosx_10_9_x86_64.whl (206.3 kB view details)

Uploaded CPython 3.10macOS 10.9+ x86-64

arnio-1.8.0-cp39-cp39-win_amd64.whl (208.7 kB view details)

Uploaded CPython 3.9Windows x86-64

arnio-1.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (250.9 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

arnio-1.8.0-cp39-cp39-macosx_11_0_arm64.whl (192.5 kB view details)

Uploaded CPython 3.9macOS 11.0+ ARM64

arnio-1.8.0-cp39-cp39-macosx_10_9_x86_64.whl (206.4 kB view details)

Uploaded CPython 3.9macOS 10.9+ x86-64

File details

Details for the file arnio-1.8.0.tar.gz.

File metadata

  • Download URL: arnio-1.8.0.tar.gz
  • Upload date:
  • Size: 2.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.8.0.tar.gz
Algorithm Hash digest
SHA256 119bf66b2cf4e9c300af835b2069b5cf9ddf4e215acdced80e8f2911d6064e82
MD5 9f642131666e5ebc64a758d8495baba0
BLAKE2b-256 fa4d3eb7f147f69a246696b5884832b72c67fb97de473b4f44d044fd6a888760

See more details on using hashes here.

File details

Details for the file arnio-1.8.0-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: arnio-1.8.0-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 205.2 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.8.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 d7ca26787915f6e1bc000bebf247f29b38095d2b2cf17989d6a0be77a824aa56
MD5 0368c2849e79279cfab48b83dd4d1322
BLAKE2b-256 6f0b9033a81bfea72d16410733f623493959d6e6df8bbfbbf63ff9ca5c9f0567

See more details on using hashes here.

File details

Details for the file arnio-1.8.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 bec1f2e05ea80da0ec3e84f62172fb88db1228428835c3796df9f1187e02377d
MD5 437337e1cd84b71469e4b5cc1bacd87f
BLAKE2b-256 dabfe65ab42ae1809e37996b8cc6be0400acb4dc591f1d1992b989b0a6d9592b

File details

Details for the file arnio-1.8.0-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7733c4b30eb89803ee2e07fba25054bc9881bbda2c08b1f3cc087359c7a02af1
MD5 984a52e93f37715d07847ddb0d24acdc
BLAKE2b-256 fa97ccd3399e5c5e08613c7be222504ad624ca53e8cf0a3cfd86318b81d2d119

File details

Details for the file arnio-1.8.0-cp313-cp313-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp313-cp313-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 652eca30e090e591d77da2d489b31b2fcc3e61125ecdecdda7d642728fee6de5
MD5 943b11757b029fb1d63946b37190518c
BLAKE2b-256 f1b5a1542c2a8b709d6158d424c17bc5130aeb9b566e7e3e2880aad17c1f1ccd

File details

Details for the file arnio-1.8.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: arnio-1.8.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 205.2 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.8.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 71e2075777f8edf09de6ef5627c6160c2c212821e370fa1eff56aaf5e6b486a4
MD5 ce149f0dca3aec11b771995660b9b650
BLAKE2b-256 bbb0ec1ad1dd31d1bf79782cb5a344a687bc343fc71575b714e39cd6cd0f014a

File details

Details for the file arnio-1.8.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c4d3c54072c0a0500fb8510ee1c3775a52720b3a95057cf70d49376e73c02cd5
MD5 7ecce1be28ff8b74051201e855724e7e
BLAKE2b-256 cad645e169338c4f990d956e8763465cb1a29f568f5bdf94f2260b25554c5011

File details

Details for the file arnio-1.8.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fc140e57f9c8d93ed69fe35e34b4592a74a8071f81b393e27f25592848b06997
MD5 d5d2b18022c0dda9cffcb6185a0b878e
BLAKE2b-256 864234a25a53e1fbd407a0024cff1d4da05d477dfb57f5dbd7840b23e5f4a750

File details

Details for the file arnio-1.8.0-cp312-cp312-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 5773824a8fe6e32c9f72259f73d5f7894fe5472b9afdea3fdec79ec46bd17b6f
MD5 454420ed470ef5494066e1d3ff18d087
BLAKE2b-256 00a1b5149f05e35dfb741540d938cfd0f9495b66fc0819ee99445cee4ebb2f9a

File details

Details for the file arnio-1.8.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: arnio-1.8.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 202.8 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.8.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 4e375b7f5b7732644289f10604536ec423916e1d22e18a549028ecd9cf0bff16
MD5 772c0b78ec89d04a8633900f3bba12ce
BLAKE2b-256 9920b3a1f407c730d5b5da7883a84d3c5602e8729b652dd2926c0e7a01897cee

File details

Details for the file arnio-1.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 dc688dfd8a0b99e2d52fe48125740857509b3f5f22db3431e9e7c4591aed146d
MD5 c855893ec3b40dc2e2278cb18be723af
BLAKE2b-256 cf798ffa35a05886708f1aa6706bcff41c9256daf26e22235291ab48cf8ba5a0

File details

Details for the file arnio-1.8.0-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 cfef1efe7907b20547983409cdc512b4a23c1a4373a574fb96759852f54b4d15
MD5 9bcc7d99d22c88f4222a20416553511e
BLAKE2b-256 2651a3ad703eed0dd9fe100133be48069df9690ae1c77561978a9d7c91da225e

File details

Details for the file arnio-1.8.0-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 3c49d20db2ea233376687980650ed963a6f9fe9d7297340a89462c3f321691bf
MD5 e9d061271868e6fc44b20a81ec321cca
BLAKE2b-256 9eec0ae5809d8abcc80d986a68dd3088f27dadb83827f8268442826564380ae1

File details

Details for the file arnio-1.8.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: arnio-1.8.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 202.1 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.8.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 7140f10cab81f08e0568248f358529bb3a26002c9d5bf0da1c878cd63d8770a3
MD5 f01af28fa4337a77039bb0b640c06e7a
BLAKE2b-256 5b36ba0c8cf2f7ac56b36aa181523409a7801710ea317d39aca8cfcb0aa1ec93

File details

Details for the file arnio-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 8d9ae60cc57aab115db20125ae50eee9b560d52f192ee1144c31809ed9bc6d4a
MD5 1221a4205755159a52d7f3ec4b5a267d
BLAKE2b-256 4261204309fcb9591588a854cf9cb5231828d2bd849b8169c28214008d12aea2

File details

Details for the file arnio-1.8.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2fb2023916c79b85c7f47f21309cf4e94f4a7ecad5a62ee1646f11f12d6f78bf
MD5 9830d1804240f33ed5f0084e9f0b13e2
BLAKE2b-256 11b5f3b8a82d9ec65e56108d6bd4115e2d1867030560efd67b97065f7227a9d2

File details

Details for the file arnio-1.8.0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 2d03a9c8fc6d107bcd42d133f0e0416102b63d4b2857476d06562419efa4c2c0
MD5 57d759f0f451ff2cfe1e1c8876807350
BLAKE2b-256 7cb66dfbcc2a2ac3f8051a65e64ca6f56573fd2e218e52a7b441a17230937467

File details

Details for the file arnio-1.8.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: arnio-1.8.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 208.7 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.8.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 b5225792e3cb88e63c6f2bcabe2d9aace810b647c17565c7a09754f06e4257c4
MD5 9aefc13dedac7585eae38390aca85e41
BLAKE2b-256 8b9ca630199d532b9ffb478aa1c7b5dadd4fdc9039f52541b3b0ef50b3ae93e2

File details

Details for the file arnio-1.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d6b9c0db9f7f037f99179a4679afacc0ddf62b153adb54bf6ee84ac579ae83a3
MD5 0582c7cb7fa587cf380041c93539261d
BLAKE2b-256 61fc52aa095b3127b76d10ffdc5769f68ae0a5d58c5b0be2fe0b68e26d3ae764

File details

Details for the file arnio-1.8.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

  • Download URL: arnio-1.8.0-cp39-cp39-macosx_11_0_arm64.whl
  • Upload date:
  • Size: 192.5 kB
  • Tags: CPython 3.9, macOS 11.0+ ARM64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.8.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2cf04942e15e596a679047f0c3a9377ab64bf5dc7238d43cb4fbd624ecc4f162
MD5 d6b8c9e7f7d01882b6bbe4152187e3b3
BLAKE2b-256 bab59dcd8d3f5b1b5b526c34edccb74961606ae703bd1866c4912ed56a37232b

File details

Details for the file arnio-1.8.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: arnio-1.8.0-cp39-cp39-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 206.4 kB
  • Tags: CPython 3.9, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.8.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 482743d612e6e381ea0ae26bb180bfcb5964b4e6c7b755c94a96aa0125bb59d2
MD5 a14afead3a0926a3e870ccea8cb0e2bf
BLAKE2b-256 9505c3a12bf64bf9f7a4cdca167bb9862e316e39209e2ba17619306f031fa027
