
C++ accelerated data preparation for pandas and the Python data stack



Arnio



Fast data preparation for the Python data stack.


Arnio is a compiled C++ data preparation engine for messy CSV and pandas workflows.
It parses, infers types, strips whitespace, deduplicates, validates, and profiles data —
then hands clean results back to the tools you already use.
Use Arnio before and alongside pandas, NumPy, scikit-learn, DuckDB, and Arrow.





pip install arnio

Colab install smoke test: COLAB_SMOKE_TEST.md


Quickstart · Integrations · Why Arnio · Architecture · Benchmarks · Community · Contribute




⚡ Quickstart

A simple workflow in just a few steps.

New to Arnio? Start with the pandas workflow example below before exploring advanced pipelines.

import arnio as ar

# Load CSV directly through C++ — no Python parsing overhead
frame = ar.read_csv("messy_sales_data.csv")

# Declare what clean data looks like — arnio handles the rest
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# Out comes a standard pandas DataFrame — use it like you always have
df = ar.to_pandas(clean)

# Use copy=True when you need defensive pandas-owned buffers
safe_df = ar.to_pandas(clean, copy=True)

Already have a pandas DataFrame? Use Arnio directly in your existing pandas workflow through the df.arnio accessor:

import pandas as pd
import arnio as ar

df = pd.read_csv("messy_sales_data.csv")

clean_df = df.arnio.clean([
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_duplicates",),
])

report = clean_df.arnio.profile()

Select specific columns

Use select_columns() to create a new ArFrame with only the required columns before converting to pandas.

selected = frame.select_columns(["name", "revenue"])

print(selected.columns)
# ['name', 'revenue']

Every step above executes in C++. Your Python code is a configuration — not the execution engine.


📸 Peek at a 100 GB file without loading it

scan_csv reads only the header and a sample of rows to infer the schema. The full file is never loaded into memory.

schema = ar.scan_csv("100GB_file.csv")
# {'id': 'int64', 'name': 'string', 'is_active': 'bool', 'revenue': 'float64'}

Useful for exploring datasets before committing memory.
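
To make the idea of sample-based schema inference concrete, here is a minimal pure-Python sketch. It is not Arnio's implementation: the helper names infer_type and scan_sample, the sample size, and the inference rules are assumptions chosen for illustration.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "null"
    if all(v.lower() in ("true", "false") for v in non_empty):
        return "bool"
    for name, caster in (("int64", int), ("float64", float)):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def scan_sample(stream, sample_rows=100):
    """Parse the header plus at most `sample_rows` rows; later rows are never parsed."""
    reader = csv.reader(stream)
    header = next(reader)
    sample = list(islice(reader, sample_rows))
    return {
        name: infer_type([row[i] for row in sample if i < len(row)])
        for i, name in enumerate(header)
    }


text = "id,name,is_active,revenue\n1,Ada,true,10.5\n2,Grace,false,20\n"
schema = scan_sample(io.StringIO(text))
print(schema)
```

Because only a bounded number of rows is parsed, the cost of the scan does not grow with file size. The trade-off is that a value appearing after the sample can contradict the inferred type.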

👀 Preview rows without pandas conversion or full-column Python list materialization

preview() reads only the first n rows directly from the C++ frame — no pandas conversion triggered.

frame = ar.read_csv("huge_file.csv")

print(frame.preview())      # first 5 rows (default)
print(frame.preview(n=10))  # first 10 rows

Raises ValueError for invalid n (zero, negative, or non-integer).

🧩 Add custom steps without touching C++

Register any Python function as a pipeline step. It receives a DataFrame, returns a DataFrame.

def remove_outliers(df, column="revenue", threshold=100_000):
    return df[df[column] <= threshold]

ar.register_step("remove_outliers", remove_outliers)

# Now use it in any pipeline alongside native C++ steps
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("remove_outliers", {"column": "revenue", "threshold": 50000}),
    ("drop_duplicates",),
])

Custom steps run through a pandas↔ArFrame conversion bridge. Prototype in Python, then optionally migrate hot paths to C++ for full speed.
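
The registry idea can be sketched in a few lines of plain Python. This is an illustration only, under the assumption that a step is a named callable applied in order. It works on a list of dicts rather than a DataFrame or ArFrame, and the names _STEPS and run_pipeline are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical registry: step name -> callable(rows, **kwargs) -> rows
_STEPS: Dict[str, Callable] = {}


def register_step(name: str, func: Callable) -> None:
    _STEPS[name] = func


def run_pipeline(rows: List[dict], steps: list) -> List[dict]:
    """Apply each (name,) or (name, kwargs) step in order."""
    for step in steps:
        name = step[0]
        kwargs = step[1] if len(step) > 1 else {}
        if name not in _STEPS:
            raise KeyError(f"unknown step: {name}")
        rows = _STEPS[name](rows, **kwargs)
    return rows


def strip_whitespace(rows):
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def remove_outliers(rows, column="revenue", threshold=100_000):
    return [row for row in rows if row[column] <= threshold]


register_step("strip_whitespace", strip_whitespace)
register_step("remove_outliers", remove_outliers)

rows = [{"name": " Ada ", "revenue": 10}, {"name": "Bob", "revenue": 90_000}]
result = run_pipeline(rows, [
    ("strip_whitespace",),
    ("remove_outliers", {"column": "revenue", "threshold": 50000}),
])
print(result)
```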




🔗 Integrations

Arnio is designed to make the rest of the Python data stack more productive, not to replace it.

Workflow | How Arnio helps
pandas | Clean, validate, and profile messy DataFrames through df.arnio.
NumPy | Prepare typed numeric data before array/modeling workflows.
scikit-learn | Use Arnio cleaning as a preprocessing layer before model training.
DuckDB / Arrow | Validate and prepare data before analytics and columnar exchange.
notebooks | Inspect quality issues and cleaning suggestions before analysis.

Pandas accessor

df = pd.read_csv("raw_customers.csv")

clean_df = df.arnio.clean(drop_duplicates=True)
quality = clean_df.arnio.profile()
validation = clean_df.arnio.validate({
    "email": ar.Email(nullable=False),
    "age": ar.Int64(nullable=True, min=0),
})

This keeps pandas as the analysis tool while Arnio handles the preparation, quality, and validation layer.

Product direction: PROJECT_DIRECTION.md




🔍 Why Arnio exists

Every data project starts the same way:

df = pd.read_csv("data.csv")              # 💥 RAM spike — entire file as raw strings
df.columns = df.columns.str.strip()        # Why is this not automatic?
df["name"] = df["name"].str.strip()        # Python loop over every cell
df["name"] = df["name"].str.lower()        # Another Python loop
df = df.dropna()                           # Another pass
df = df.drop_duplicates()                  # Another pass

Six lines. Four full-data passes. All in interpreted Python. This is fine for a Jupyter demo — but it doesn't scale, it doesn't compose, and it definitely doesn't belong in production.

Arnio intercepts this entire pattern. It moves the preparation layer into a predictable pipeline, accelerates supported operations in C++, and gives you clean data for pandas, NumPy, scikit-learn, DuckDB, or notebooks.

Without Arnio

df = pd.read_csv(path)
df.columns = df.columns.str.strip()
for col in str_cols:
    df[col] = df[col].str.strip()
    df[col] = df[col].str.lower()
df = df.dropna(subset=["revenue"])
df = df.drop_duplicates()
# 6+ lines, multiple passes, pure Python

With Arnio

frame = ar.read_csv(path)
df = ar.to_pandas(ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_nulls", {"subset": ["revenue"]}),
    ("drop_duplicates",),
]))
# Declarative. Single pipeline. C++ execution.



🏗️ Architecture

Arnio is not a pandas wrapper. It's a separate runtime with its own data model.

┌──────────────────────────────────────────────────────────────┐
│  Your Python Code                                            │
│  frame = ar.read_csv("data.csv")                             │
│  clean = ar.pipeline(frame, [...])                           │
│  df = ar.to_pandas(clean)                                    │
└────────────────────────┬─────────────────────────────────────┘
                         │  pybind11 boundary
┌────────────────────────▼─────────────────────────────────────┐
│  C++ Runtime  (_arnio_cpp)                                   │
│                                                              │
│  ┌─────────────┐  ┌─────────────────┐  ┌──────────────────┐ │
│  │  CsvReader   │  │  Frame/Column   │  │  Cleaning Engine │ │
│  │  • RFC 4180  │  │  • Columnar     │  │  • drop_nulls    │ │
│  │  • BOM strip │  │  • std::variant │  │  • fill_nulls    │ │
│  │  • Type      │  │  • Bool null    │  │  • drop_dupes    │ │
│  │    inference │  │    masks        │  │  • strip_ws      │ │
│  │  • Quoted    │  │  • O(1) column  │  │  • normalize     │ │
│  │    fields    │  │    lookup       │  │  • rename/cast   │ │
│  └─────────────┘  └─────────────────┘  └──────────────────┘ │
│                                                              │
│  to_pandas() ──→ zero-copy NumPy buffer (numerics only)      │
└──────────────────────────────────────────────────────────────┘

Design decisions that matter

Decision | What it means
Columnar storage | Data lives in typed std::vectors (vector<int64_t>, vector<double>, vector<string>), not rows of variants. Cache-friendly and SIMD-ready.
Boolean null masks | Nulls are tracked in a separate vector<bool>, keeping data vectors dense. No sentinel values, no NaN tricks.
Two-pass CSV read | Pass 1 infers types across all rows. Pass 2 parses values directly into the correct typed column. No string→object→cast overhead.
Zero-copy bridge | to_pandas() exposes C++ memory directly via NumPy's buffer protocol where supported. Numeric columns preserve the fast zero-copy path by default, while copy=True requests defensive pandas-owned buffers.
Step registry | Pipeline steps map to C++ function pointers. Adding a new cleaning primitive is a single function + one registry entry.
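
The columnar layout with a separate null mask can be illustrated in Python. This is a sketch of the data model only, not the C++ code: the Column class and this drop_nulls helper are hypothetical stand-ins for the typed vectors and masks described in the table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Column:
    values: List[float]  # dense data; a slot is meaningless where is_null is True
    is_null: List[bool]  # separate mask, so no sentinel values are needed


def drop_nulls(columns: Dict[str, Column]) -> Dict[str, Column]:
    """Keep only the row positions where no column is null."""
    n = len(next(iter(columns.values())).values)
    keep = [i for i in range(n) if not any(c.is_null[i] for c in columns.values())]
    return {
        name: Column([c.values[i] for i in keep], [False] * len(keep))
        for name, c in columns.items()
    }


frame = {
    "age": Column([30.0, 0.0, 41.0], [False, True, False]),
    "score": Column([1.5, 2.5, 0.0], [False, False, True]),
}
clean = drop_nulls(frame)
print(clean["age"].values, clean["score"].values)
```

Keeping the mask apart from the data means a legitimate 0.0 is never confused with a missing value.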

Full architecture documentation: ARCHITECTURE.md




🏎️ Benchmarks

Reference environment: Ubuntu, Python 3.12, synthetic messy CSV inputs.
Reproduce: make benchmark — generates deterministic tall and wide datasets and runs both engines.

To reproduce the published numbers from a fresh checkout:

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
python benchmarks/generate_data.py
python benchmarks/benchmark_vs_pandas.py

benchmarks/generate_data.py uses deterministic NumPy seeds, so every run creates the same benchmarks/benchmark_1m.csv tall input and benchmarks/benchmark_wide.csv wide input. The benchmark then executes three pandas runs and three arnio runs for each case, printing average wall-clock time from time.perf_counter() and peak Python allocation from tracemalloc. For cleaner comparisons, close other memory-heavy processes and run the script from the repository root after installing the same Python, pandas, NumPy, compiler, and arnio commit you want to compare.
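
The measurement approach described above can be sketched with the standard library. This is not the project's benchmark script: the measure helper and the toy workload are assumptions, and only time.perf_counter() and tracemalloc come from the description.

```python
import time
import tracemalloc


def measure(func, runs=3):
    """Return (average wall-clock seconds, highest peak traced bytes) over `runs` calls."""
    times, peaks = [], []
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        tracemalloc.stop()
        peaks.append(peak)
    return sum(times) / runs, max(peaks)


def workload():
    # Toy stand-in for a load-and-clean run
    return [str(i).strip().lower() for i in range(50_000)]


avg_seconds, peak_bytes = measure(workload)
print(f"avg {avg_seconds:.4f}s, peak {peak_bytes / 1_048_576:.2f} MiB")
```

Note that tracemalloc only traces memory allocated through Python's allocators, so buffers allocated directly by native code are not included in the peak figure.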

Expected output format:

Tall CSV (1,000,000 rows x 12 columns)
Metric                     pandas        arnio
────────────────────────────────────────────
Exec Time (avg)       4.73s         5.75s
Peak RAM               211MB         212MB
Speed: 0.8x | RAM: -1% reduction

Wide CSV (5,000 rows x 256 columns)
Metric                     pandas        arnio
────────────────────────────────────────────
Exec Time (avg)       ...s          ...s
Peak RAM              ...MB         ...MB
Speed: ...x | RAM: ...% reduction

Small differences are expected across CPUs, operating systems, compilers, Python builds, and pandas/NumPy versions. If you share benchmark results in an issue or PR, include your OS, Python version, CPU model, pandas/NumPy versions, arnio commit, and the full command output so maintainers can compare like for like.

Arnio is near memory parity in the reference benchmark while replacing ad-hoc Python string loops with a compiled, declarative pipeline. Validate memory and speed on your own workload. The execution time gap (about 0.8x pandas speed on the tall CSV) is a known, active optimization target: the current drop_duplicates implementation serializes each row into a string key instead of hashing it, and strip_whitespace copies data instead of mutating it in place.

What's already won

  • Native C++ parsing eliminates Python memory spikes
  • Columnar storage matches pandas' internal efficiency
  • Declarative API eliminates .apply() spaghetti
  • Zero-copy bridge for numeric conversions

🎯 What's being optimized

  • drop_duplicates: replace string serialization with hash-based comparisons
  • strip_whitespace: in-place mutation instead of copy-on-write
  • Parallel column processing via std::thread

Help close the gap →
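
For readers unfamiliar with the hash-based approach, here is a Python sketch of deduplication that hashes row tuples directly instead of building a string key per row. It illustrates the technique, not the planned C++ code. The meaning of keep="none" (drop every copy of a duplicated row) is an assumption modeled on pandas' keep=False.

```python
from collections import Counter


def drop_duplicates(rows, keep="first"):
    """Deduplicate a list of hashable row tuples."""
    if keep == "first":
        seen, out = set(), []
        for row in rows:
            if row not in seen:  # set lookup hashes the tuple, no serialization
                seen.add(row)
                out.append(row)
        return out
    if keep == "last":
        # Deduplicate from the end, then restore the original order
        return drop_duplicates(rows[::-1], keep="first")[::-1]
    if keep == "none":
        counts = Counter(rows)
        return [row for row in rows if counts[row] == 1]
    raise ValueError(f"keep must be 'first', 'last', or 'none', got {keep!r}")


rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c")]
print(drop_duplicates(rows, keep="first"))
print(drop_duplicates(rows, keep="last"))
print(drop_duplicates(rows, keep="none"))
```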

🧠 Auto Clean Memory Benchmark

To measure the peak memory and execution time of the auto_clean pipeline using realistic dataset sizes:

python benchmarks/benchmark_auto_clean_memory.py --rows 100000

This script generates a reproducible synthetic dataset with mixed column types (strings, ints, floats, booleans, nulls, and duplicates) and measures:

  • ar.read_csv performance
  • ar.auto_clean(mode="safe") performance (low-risk cleanup like whitespace trimming)
  • ar.auto_clean(mode="strict") performance (includes type casting and deduplication)

The dataset is regenerated deterministically unless --reuse-file is provided. Each auto_clean benchmark run reloads the dataset to avoid mutation or caching effects between runs.

Options:

  • --repeat N runs each operation multiple times and reports average (and min/max range).
  • --seed N changes the deterministic dataset seed.
  • --reuse-file reuses an existing dataset file instead of regenerating it.
  • --keep-file keeps the generated CSV (otherwise it is removed at the end).

Expected output format:

Operation                    Time(s)     Peak Py(MiB)
--------------------------------------------------------------------
ar.read_csv           0.042 (0.041-0.044)    4.52 (4.50-4.60)
ar.auto_clean(safe)   0.012 (0.011-0.013)    0.15 (0.14-0.16)
ar.auto_clean(strict) 0.035 (0.034-0.036)    1.20 (1.18-1.22)
--------------------------------------------------------------------
Total avg (Read+Strict)       0.077             4.52



🧰 Cleaning primitives

Most operations below run natively in C++. Currently, filter_rows and replace_values run via the Python (pandas) backend and may be optimized in C++ later.

Primitive | What it does | Example
drop_nulls | Remove rows with null/empty values | ar.drop_nulls(frame, subset=["age"])
keep_rows_with_nulls | Keep only rows that contain at least one null | ar.keep_rows_with_nulls(frame, subset=["age"])
validate_columns_exist | Fail early when required columns are missing | ar.validate_columns_exist(frame, ["age"])
filter_rows | Filter rows using comparison operators | ar.filter_rows(frame, column="age", op=">", value=18)
fill_nulls | Replace nulls with a scalar | ar.fill_nulls(frame, 0, subset=["revenue"])
drop_duplicates | Deduplicate rows (first/last/none) | ar.drop_duplicates(frame, keep="first")
drop_constant_columns | Remove columns with only one unique value | ar.drop_constant_columns(frame)
clip_numeric | Clip numeric values to lower and/or upper bounds | ar.clip_numeric(frame, lower=0, upper=100)
strip_whitespace | Trim leading/trailing spaces from strings | ar.strip_whitespace(frame)
normalize_case | Force lower/upper/title case | ar.normalize_case(frame, case_type="title")
rename_columns | Rename columns via mapping | ar.rename_columns(frame, {"old": "new"})
cast_types | Cast column types | ar.cast_types(frame, {"age": "int64"})
round_numeric_columns | Round numeric columns (non-numeric columns in subset ignored safely) | ar.round_numeric_columns(frame, decimals=2)
replace_values | Replace values using a mapping (column or whole-frame). Handles None/NaN. | ar.replace_values(frame, {"active": "A", "inactive": "I"}, column="status")
clean | Convenience shorthand | ar.clean(frame, drop_nulls=True)
safe_divide_columns | Divide one column by another, handling zero/null denominators | ar.safe_divide_columns(frame, numerator="revenue", denominator="cost", output_column="ratio")
trim_column_names | Strip leading/trailing whitespace from column names | ar.trim_column_names(frame)

ArFrame.select_dtypes — type-based column selection

Returns a new ArFrame containing only the columns whose dtype matches the filter. Raises ValueError if no columns match.

frame = ar.read_csv("data.csv")

# Keep only numeric columns
numeric = frame.select_dtypes(include=["int64", "float64"])

# Drop string columns
without_strings = frame.select_dtypes(exclude="string")

Valid dtype strings: "int64", "float64", "string", "bool", "null"

  • At least one of include or exclude must be given — raises ValueError otherwise.
  • include and exclude must not overlap — raises ValueError if they share a dtype.
  • Unknown dtype strings raise ValueError with a list of valid options.
  • Raises ValueError when no columns match (never returns an empty frame silently).
  • Column order in the result always matches the original frame.
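
The rules listed above can be expressed as a short selection function. This sketch operates on a plain name-to-dtype mapping rather than an ArFrame, and select_columns_by_dtype is a hypothetical helper written only to mirror the documented behavior.

```python
VALID_DTYPES = ("int64", "float64", "string", "bool", "null")


def _as_set(spec):
    """Normalize None, a single dtype string, or a list of dtype strings to a set."""
    if spec is None:
        return set()
    names = [spec] if isinstance(spec, str) else list(spec)
    unknown = [n for n in names if n not in VALID_DTYPES]
    if unknown:
        raise ValueError(f"unknown dtype(s) {unknown}; valid options: {list(VALID_DTYPES)}")
    return set(names)


def select_columns_by_dtype(schema, include=None, exclude=None):
    inc, exc = _as_set(include), _as_set(exclude)
    if not inc and not exc:
        raise ValueError("at least one of include or exclude must be given")
    if inc & exc:
        raise ValueError(f"include and exclude overlap: {sorted(inc & exc)}")
    # dicts preserve insertion order, so the result follows the original column order
    selected = [
        name for name, dtype in schema.items()
        if (not inc or dtype in inc) and dtype not in exc
    ]
    if not selected:
        raise ValueError("no columns match the requested dtypes")
    return selected


schema = {"id": "int64", "name": "string", "revenue": "float64", "active": "bool"}
print(select_columns_by_dtype(schema, include=["int64", "float64"]))
print(select_columns_by_dtype(schema, exclude="string"))
```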

Cleaning primitives can also be composed into a single pipeline:

clean = ar.pipeline(frame, [
    ("validate_columns_exist", {"columns": ["name", "city", "revenue"]}),
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": "unknown", "subset": ["city"]}),
    ("drop_duplicates", {"keep": "first"}),
])

🔁 Replace values

Use replace_values to substitute values using a mapping. It works as a pipeline step (Python backend) and can operate on a single column or the whole frame when column is omitted. It also understands null semantics: using None (or np.nan) as a mapping key targets existing nulls, and mapping a value to None creates real nulls.

Column-specific example:

clean = ar.pipeline(frame, [
    ("replace_values", {"mapping": {"active": "A", "inactive": "I"}, "column": "status"}),
])

Whole-frame example (no column):

clean = ar.pipeline(frame, [
    ("replace_values", {"mapping": {None: "MISSING", "active": "A", "inactive": "I"}}),
])

Direct API:

frame2 = ar.replace_values(frame, {"active": "A", "inactive": "I"})
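
The null semantics are easier to see on plain Python data. The sketch below models a frame as a dict of lists with None standing in for null. It is an illustration under that assumption, not the pandas-backed implementation, and it does not cover np.nan keys.

```python
def replace_values(frame, mapping, column=None):
    """Return a new dict-of-lists frame with values substituted via `mapping`."""
    targets = [column] if column is not None else list(frame)
    return {
        name: [mapping.get(v, v) for v in values] if name in targets else list(values)
        for name, values in frame.items()
    }


frame = {"status": ["active", None, "inactive"], "plan": ["active", "free", None]}

# Single column: mapping a value to None creates a real null
by_column = replace_values(frame, {"active": "A", "inactive": None}, column="status")

# Whole frame: a None key targets existing nulls in every column
whole = replace_values(frame, {None: "MISSING", "active": "A"})

print(by_column)
print(whole)
```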

🔎 Filter rows inside pipelines

Use filter_rows to keep only rows matching a condition.

clean = ar.pipeline(frame, [
    ("filter_rows", {
        "column": "revenue",
        "op": ">=",
        "value": 1000
    }),
])

Supported operators:

  • >
  • <
  • >=
  • <=
  • ==
  • !=

Works with:

  • integers
  • floats
  • strings
  • booleans
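
One common way to implement an operator string like this is a lookup table over the standard operator module. The sketch below shows that pattern on a list of dicts. It is not Arnio's code, and treating null values as never matching is an assumption.

```python
import operator

_OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def filter_rows(rows, column, op, value):
    """Keep rows where `row[column] <op> value` holds; nulls never match (assumption)."""
    if op not in _OPS:
        raise ValueError(f"unsupported operator {op!r}; expected one of {list(_OPS)}")
    compare = _OPS[op]
    return [
        row for row in rows
        if row[column] is not None and compare(row[column], value)
    ]


rows = [
    {"name": "Ada", "revenue": 1500},
    {"name": "Bob", "revenue": 200},
    {"name": "Cy", "revenue": None},
]
print(filter_rows(rows, column="revenue", op=">=", value=1000))
print(filter_rows(rows, column="name", op="!=", value="Ada"))
```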

🔎 Isolate rows with null values

Use keep_rows_with_nulls to audit incomplete data — keep only rows that have at least one null.

frame = ar.read_csv("data.csv")

# Keep all rows that have at least one null anywhere
nulls = ar.keep_rows_with_nulls(frame)

# Keep rows where specifically 'age' or 'score' is null
nulls = ar.keep_rows_with_nulls(frame, subset=["age", "score"])

# Works inside a pipeline too
result = ar.pipeline(frame, [
    ("keep_rows_with_nulls", {"subset": ["age"]}),
])

Useful for data auditing — inspect what's missing before deciding how to fill or drop.


🔢 Safe column division

Divide one column by another while handling division by zero and null denominators explicitly:

result = ar.safe_divide_columns(
    frame,
    numerator="revenue",
    denominator="cost",
    output_column="ratio",
    fill_value=0.0,  # used when denominator is zero or null
)

When the denominator is zero or null, the result is replaced with fill_value (default 0.0) instead of raising an error or producing NaN/Inf.
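
The per-row rule can be written out in plain Python as follows. This is a sketch of the documented semantics on two lists, with None standing in for null. Returning None when the numerator is null is an assumption, since the text above only specifies the denominator cases.

```python
def safe_divide(numerators, denominators, fill_value=0.0):
    """Element-wise division that substitutes `fill_value` for zero or null denominators."""
    result = []
    for num, den in zip(numerators, denominators):
        if den is None or den == 0:
            result.append(fill_value)
        elif num is None:
            result.append(None)  # assumption: a null numerator stays null
        else:
            result.append(num / den)
    return result


revenue = [10.0, 5.0, 8.0, None]
cost = [2.0, 0.0, None, 4.0]
print(safe_divide(revenue, cost))
print(safe_divide(revenue, cost, fill_value=-1.0))
```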



📊 Pandas Dtype Support Matrix

This table helps users understand which pandas dtypes and workflows are fully supported, partially supported, unsupported, or planned.

If a dtype is partially supported, users may need conversion before processing. Unsupported dtypes should raise clear errors where applicable.

Pandas Dtype | Support Status | Notes
int64 | ✅ Supported | Fully supported with native C++ columnar storage
float64 | ✅ Supported | Fully supported with zero-copy conversion where possible
bool | ✅ Supported | Native supported boolean type
string | ✅ Supported | Recommended over object dtype for text workflows
datetime64[ns] | ❌ Unsupported | No native datetime parsing or conversion support yet
category | ⚠️ Limited | Converted to string/object during processing
object (mixed columns) | ⚠️ Limited | Mixed object columns may coerce to string and reduce type inference reliability
nullable pandas dtypes (Int64, boolean) | ⚠️ Limited | Supported through pandas extension dtypes with null-mask handling
timedelta64[ns] | ❌ Unsupported | Not currently supported

Notes

  • Numeric columns are optimized for zero-copy conversion between C++ and pandas where supported.
  • Pass copy=True to to_pandas() when downstream pandas code needs defensive pandas-owned column buffers.
  • Boolean columns are always copied by the binding because std::vector<bool> cannot be exposed as a zero-copy NumPy buffer in the current implementation.
  • Columns with null masks may require copies so pandas can apply nullable values safely.
  • String columns require Python string object creation during to_pandas() conversion.
  • Mixed object columns may reduce type inference accuracy and may require preprocessing.
  • Unsupported dtypes should raise clear user-facing errors instead of silent failures.
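
The difference between a zero-copy view and a defensive copy can be demonstrated with Python's own buffer protocol. This is an analogy using the standard library only. The array object stands in for a C++ numeric vector and is not how Arnio exposes memory.

```python
from array import array

buffer = array("d", [1.0, 2.0, 3.0])  # stands in for a C++ vector<double>
view = memoryview(buffer)             # zero-copy: shares the underlying memory
copy = array("d", buffer)             # defensive copy: owns its own memory

buffer[0] = 99.0  # mutate the original storage

print(view[0])  # the view observes the change
print(copy[0])  # the copy does not
```

A shared view avoids duplicating data but ties the consumer to the lifetime and mutations of the source. A copy costs memory and time but is isolated, which is the trade-off behind copy=True.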

Note: pandas DataFrame indexes are currently not preserved during from_pandas() conversion. Converted frames receive a default RangeIndex when converted back via to_pandas().




🧠 Data quality engine

Arnio now includes built-in dataset understanding before you analyze in pandas.

report = ar.profile(frame)
print(report.summary())

suggestions = ar.suggest_cleaning(frame)
clean = ar.pipeline(frame, suggestions)

For production data contracts:

schema = ar.Schema({
    "id": ar.Int64(nullable=False, unique=True),
    "email": ar.Email(nullable=False),
    # CountryCode expects uppercase ISO alpha-2 values, for example IN, US, GB.
    "country": ar.CountryCode(nullable=False),
    "username": ar.String(min_length=3, max_length=20),
    "revenue": ar.Float64(nullable=True, min=0),
})

result = ar.validate(frame, schema)
if not result.passed:
    summary = result.summary()
    print(summary["issues_by_rule"])
    print(summary["issues_by_column"])
    print(summary["issues_by_column_and_rule"])
    print(result.to_pandas())
    print(result.to_markdown(max_issues=10))

ValidationResult.to_markdown() is useful in CI logs, GitHub comments, or data quality reports because it renders a compact validation summary plus a GitHub-friendly issue table.

For multi-column uniqueness (composite keys):

schema = ar.Schema({
    "user_id": ar.Int64(nullable=False),
    "course_id": ar.Int64(nullable=False),
}, unique=["user_id", "course_id"])

result = ar.validate(frame, schema)
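
Conceptually, a composite-key check counts how often each combination of key values occurs. The sketch below shows that idea on a list of dicts. The helper find_composite_duplicates is hypothetical and says nothing about how Arnio reports the resulting issues.

```python
from collections import Counter


def find_composite_duplicates(rows, key_columns):
    """Return indexes of rows whose combined key appears more than once."""
    keys = [tuple(row[col] for col in key_columns) for row in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] > 1]


enrollments = [
    {"user_id": 1, "course_id": 10},
    {"user_id": 1, "course_id": 11},  # same user, different course: allowed
    {"user_id": 1, "course_id": 10},  # repeats the first pair
    {"user_id": 2, "course_id": 10},
]
print(find_composite_duplicates(enrollments, ["user_id", "course_id"]))
```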

Severity counts are not included in summary() yet because ValidationIssue does not currently carry severity information.

For low-risk automatic cleanup:

clean, report = ar.auto_clean(frame, mode="safe", return_report=True)

This is the layer pandas does not try to own: profiling, data contracts, row-level validation issues, and safe cleaning suggestions for messy incoming datasets.


Beginner-friendly auto-clean tutorial

Use this workflow when you receive a small messy dataset and want to inspect what Arnio will change before applying it.

import arnio as ar
import pandas as pd

raw = pd.DataFrame(
    {
        "order_id": [1001, 1002, 1002, 1003, 1004],
        "customer": [" Ishan ", " Prasoon ", " Prasoon ", " Pranay ", " Dhruv "],
        "city": [" Paris ", "London", "London", " New York ", " Tokyo "],
    }
)

frame = ar.from_pandas(raw)

report = ar.profile(frame)
summary = report.summary()
print(summary)

suggestions = ar.suggest_cleaning(frame)
print(suggestions)
# [('strip_whitespace', {'subset': ['customer', 'city']}), ('drop_duplicates', {'keep': 'first'})]

safe = ar.auto_clean(frame)
strict = ar.auto_clean(frame, mode="strict")

Messy input:

order_id | customer | city
1001 | " Ishan " | " Paris "
1002 | " Prasoon " | "London"
1002 | " Prasoon " | "London"
1003 | " Pranay " | " New York "
1004 | " Dhruv " | " Tokyo "

Expected cleaned output with mode="strict":

order_id | customer | city
1001 | "Ishan" | "Paris"
1002 | "Prasoon" | "London"
1003 | "Pranay" | "New York"
1004 | "Dhruv" | "Tokyo"

mode="safe" only trims whitespace. Use mode="strict" when you also want deterministic built-in cleanup such as exact duplicate removal.
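
The difference between the two modes on this dataset can be reproduced in plain Python. This is a sketch of the two effects described here (whitespace trimming, then exact duplicate removal), not Arnio's auto_clean. Type casting in strict mode is left out.

```python
rows = [
    (1001, " Ishan ", " Paris "),
    (1002, " Prasoon ", "London"),
    (1002, " Prasoon ", "London"),
    (1003, " Pranay ", " New York "),
    (1004, " Dhruv ", " Tokyo "),
]


def strip_strings(rows):
    """Trim leading and trailing whitespace from every string cell."""
    return [tuple(v.strip() if isinstance(v, str) else v for v in row) for row in rows]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    return list(dict.fromkeys(rows))


safe_rows = strip_strings(rows)                 # like mode="safe": trim only
strict_rows = drop_exact_duplicates(safe_rows)  # like mode="strict": trim, then dedupe

print(len(safe_rows), len(strict_rows))
print(strict_rows)
```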

See examples/auto_clean_tutorial.py for a runnable version of this walkthrough.


Data Quality Reports

Arnio provides detailed profiling for datasets via ar.profile(). To generate the report shown in these examples, the following code was used:

import arnio as ar
import pandas as pd

# Sample dataset used for these examples
data = {
    "user_id": [101, 102, 103, 104],
    "email": ["test@arnio.ai", "invalid-email", None, "test@arnio.ai"],
    "score": [85.5, 90.0, None, 88.2]
}
df = ar.from_pandas(pd.DataFrame(data))
# Bounded profiling for large datasets (controls how many sample values are kept)
report = ar.profile(df, sample_size=5)

1. Terminal Representation (Simplified Example)

A simplified view of the standard string representation of the report object:

DataQualityReport(
    row_count=4,
    column_count=3,
    memory_usage=733,
    duplicate_rows=0,
    columns={
        'user_id': ColumnProfile(dtype='int64', semantic_type='identifier', unique_count=4),
        'email': ColumnProfile(dtype='string', semantic_type='categorical', null_count=1, unique_ratio=0.666667),
        'score': ColumnProfile(dtype='float64', semantic_type='numeric', mean=87.9, min=85.5, max=90.0)
    }
)

2. JSON Format (Excerpts from .to_dict())

Key fields from the structured JSON export for integration with APIs or dashboards:

{
  "row_count": 4,
  "column_count": 3,
  "memory_usage": 733,
  "duplicate_rows": 0,
  "duplicate_ratio": 0.0,
  "columns": {
    "user_id": {
      "dtype": "int64",
      "semantic_type": "identifier",
      "null_count": 0,
      "unique_ratio": 1.0
    },
    "email": {
      "dtype": "string",
      "semantic_type": "categorical",
      "null_count": 1,
      "unique_ratio": 0.666667,
      "warnings": ["contains_nulls"]
    },
    "score": {
      "dtype": "float64",
      "semantic_type": "numeric",
      "null_count": 1,
      "mean": 87.9,
      "min": 85.5,
      "max": 90.0,
      "warnings": ["contains_nulls"]
    }
  }
}

3. Example Summary Table

A manually formatted Markdown table representing the core metrics:

Metric | Value
Row Count | 4
Column Count | 3
Memory Usage | 733 bytes
Duplicates | 0 (0.0%)
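
The column metrics in the report can be cross-checked by hand. The sketch below recomputes them from the sample dataset in plain Python. It assumes unique_ratio means distinct non-null values divided by non-null count, which matches the 0.666667 shown for email. The helper profile_column is hypothetical.

```python
from statistics import fmean

data = {
    "user_id": [101, 102, 103, 104],
    "email": ["test@arnio.ai", "invalid-email", None, "test@arnio.ai"],
    "score": [85.5, 90.0, None, 88.2],
}


def profile_column(values):
    non_null = [v for v in values if v is not None]
    stats = {
        "null_count": len(values) - len(non_null),
        "unique_ratio": round(len(set(non_null)) / len(non_null), 6) if non_null else 0.0,
    }
    # Summary statistics only for float columns, as in the report above
    if non_null and all(isinstance(v, float) for v in non_null):
        stats.update(mean=round(fmean(non_null), 6), min=min(non_null), max=max(non_null))
    return stats


profiles = {name: profile_column(values) for name, values in data.items()}
rows = list(zip(*data.values()))
duplicate_rows = len(rows) - len(set(rows))

for name, stats in profiles.items():
    print(name, stats)
print("duplicate_rows:", duplicate_rows)
```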

🗺️ Roadmap

Version | Focus | Status
v1.0 | Stable release · cross-platform wheels · CI/CD · PyPI publishing · Google Colab support | ✅ Shipped
v1.1 | Production readiness · release hardening · docs/tooling | ✅ Shipped
v1.2 | C++ pipeline optimization · speed parity with pandas · hash-based deduplication | 🔨 Active
v1.3 | Chunked / streaming processing · Parquet & JSON readers | 📋 Planned
v1.4 | Parallel column processing · SIMD string operations | 💭 Exploring



💬 Community

Join the Arnio Discord Community for quick setup help, contributor onboarding, GSSoC 2026 coordination, feature discussion, and community updates.

Discord is for fast conversation and support. GitHub remains the source of truth for issue assignment, PR reviews, bugs, roadmap decisions, and releases.

Join Arnio Discord




🤝 Contribute

Arnio is a GSSoC 2026 project with a structured contributor backlog across beginner, intermediate, and advanced tracks.

You don't need C++ to contribute

Most new features are pure Python pipeline steps:

# 1. Write a function that takes a DataFrame and returns a DataFrame
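def remove_special_chars(df, columns=None):
    df = df.copy()  # avoid mutating the caller's DataFrame
    # `is not None` also works when columns is a pandas Index
    cols = columns if columns is not None else df.select_dtypes("object").columns
    for col in cols:
        df[col] = df[col].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)
    return df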

# 2. Register it
ar.register_step("remove_special_chars", remove_special_chars)

# 3. Write tests, open a PR. That's it.

If you do know C++

The biggest performance wins are in:

  • drop_duplicates — replacing std::ostringstream row serialization with proper hash-based comparisons
  • strip_whitespace — converting from copy-on-write to in-place mutation
  • Parallel column processing: std::thread across independent columns

Getting started

# macOS / Linux
git clone https://github.com/im-anishraj/arnio.git && cd arnio
make install   # pip install -e ".[dev]" + pre-commit
make test      # pytest with coverage
make lint      # ruff + black

# Windows
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v

PR titles must follow Conventional Commits (feat:, fix:, docs:, chore:). Our release pipeline auto-generates changelogs from these.
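
If you want to check a title locally before opening a PR, a small regular expression is enough. This is an illustrative sketch, not the project's CI check. It accepts only the four prefixes named above, plus the optional scope and "!" marker from the Conventional Commits format.

```python
import re

# type, optional (scope), optional "!", then ": " and a non-empty description
TITLE_RE = re.compile(r"^(feat|fix|docs|chore)(\([\w./-]+\))?!?: \S.*$")


def is_conventional(title: str) -> bool:
    return bool(TITLE_RE.match(title))


for title in [
    "feat: add clip_numeric step",
    "fix(csv): handle BOM in quoted headers",
    "Add new feature",
    "feat:missing space",
]:
    print(f"{title!r}: {is_conventional(title)}")
```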

For GSSoC contributors, please read GSSOC_GUIDE.md before asking to be assigned. It explains issue claiming, contribution levels, review expectations, and what maintainers look for in a strong PR. If you want a quick onboarding refresher, see the GSSoC FAQ. If you are new to Arnio terms, see the contributor glossary.

📖 Full Contributing Guide ·  GSSoC Guide ·  🐛 Open Issues ·  💬 Discussions ·  Discord

💖 Contributors

Thanks to everyone who contributes to Arnio and helps improve the project.



🚢 Release process

Arnio releases are automated through Release Please and GitHub Actions.

  1. Merge user-facing changes with Conventional Commit PR titles (feat:, fix:, docs:, or chore:) so Release Please can choose the version bump and changelog entries.
  2. Review and merge the Release Please PR on main; this updates release metadata and creates the GitHub release and tag.
  3. Confirm the Build & Publish Wheels workflow succeeds for the release tag. It builds the sdist and wheels, then publishes to PyPI through Trusted Publishing.
  4. Smoke test the published package in a clean environment:
python -m venv /tmp/arnio-smoke
source /tmp/arnio-smoke/bin/activate
python -m pip install -U pip
python -m pip install arnio
printf 'name,revenue\n Ada,10\n' > /tmp/arnio-smoke.csv
python - <<'PY'
import arnio as ar
print(ar.__version__)
print(ar.scan_csv("/tmp/arnio-smoke.csv"))
PY
  5. Verify the GitHub release, PyPI project page, and install command all show the expected version before announcing the release.

If any publish or smoke-test step fails, leave the failed tag and GitHub release in place until maintainers agree on the recovery plan.




📐 Project structure

arnio/
├── cpp/
│   ├── include/arnio/      # C++ headers — types, column, frame, csv_reader, cleaning
│   └── src/                 # C++ implementations (~30 KB of compiled logic)
├── bindings/
│   └── bind_arnio.cpp       # pybind11 module — the Python↔C++ bridge
├── arnio/
│   ├── __init__.py          # Public API surface
│   ├── io.py                # read_csv, scan_csv
│   ├── cleaning.py          # Python wrappers for C++ cleaning functions
│   ├── pipeline.py          # Step registry + pipeline executor
│   ├── convert.py           # to_pandas (zero-copy), from_pandas
│   ├── frame.py             # ArFrame — lightweight C++ Frame wrapper
│   └── exceptions.py        # ArnioError, UnknownStepError, CsvReadError, TypeCastError
├── tests/                   # pytest suite — CSV, cleaning, pipeline, conversions
├── benchmarks/              # Reproducible arnio vs pandas benchmark
├── examples/                # basic_usage.py, auto_clean_tutorial.py, custom_step.py
└── website/                 # Project website — arnio.vercel.app



Arnio



Stop writing cleaning scripts. Declare clean data.




Built with C++ and pybind11 · Licensed under MIT · Maintained by @im-anishraj

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arnio-1.9.0.tar.gz (2.0 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

arnio-1.9.0-cp313-cp313-win_amd64.whl (206.8 kB view details)

Uploaded CPython 3.13Windows x86-64

arnio-1.9.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (254.9 kB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

arnio-1.9.0-cp313-cp313-macosx_11_0_arm64.whl (196.0 kB view details)

Uploaded CPython 3.13macOS 11.0+ ARM64

arnio-1.9.0-cp313-cp313-macosx_10_13_x86_64.whl (211.0 kB view details)

Uploaded CPython 3.13macOS 10.13+ x86-64

arnio-1.9.0-cp312-cp312-win_amd64.whl (206.8 kB view details)

Uploaded CPython 3.12Windows x86-64

arnio-1.9.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (254.9 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

arnio-1.9.0-cp312-cp312-macosx_11_0_arm64.whl (195.9 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

arnio-1.9.0-cp312-cp312-macosx_10_13_x86_64.whl (211.0 kB view details)

Uploaded CPython 3.12macOS 10.13+ x86-64

arnio-1.9.0-cp311-cp311-win_amd64.whl (204.4 kB view details)

Uploaded CPython 3.11Windows x86-64

arnio-1.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (253.6 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

arnio-1.9.0-cp311-cp311-macosx_11_0_arm64.whl (195.0 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

arnio-1.9.0-cp311-cp311-macosx_10_9_x86_64.whl (209.2 kB view details)

Uploaded CPython 3.11macOS 10.9+ x86-64

arnio-1.9.0-cp310-cp310-win_amd64.whl (203.7 kB view details)

Uploaded CPython 3.10Windows x86-64

arnio-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (252.2 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

arnio-1.9.0-cp310-cp310-macosx_11_0_arm64.whl (194.0 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

arnio-1.9.0-cp310-cp310-macosx_10_9_x86_64.whl (207.9 kB view details)

Uploaded CPython 3.10macOS 10.9+ x86-64

arnio-1.9.0-cp39-cp39-win_amd64.whl (210.3 kB view details)

Uploaded CPython 3.9Windows x86-64

arnio-1.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (252.5 kB)

Uploaded CPython 3.9, manylinux: glibc 2.17+ x86-64

arnio-1.9.0-cp39-cp39-macosx_11_0_arm64.whl (194.1 kB)

Uploaded CPython 3.9, macOS 11.0+ ARM64

arnio-1.9.0-cp39-cp39-macosx_10_9_x86_64.whl (208.0 kB)

Uploaded CPython 3.9, macOS 10.9+ x86-64
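
The wheel filenames listed above follow the standard wheel naming layout: distribution, version, Python tag, ABI tag, and platform tag, separated by dashes. As a rough sketch, the helper below (`split_wheel_name`, a hypothetical name that is not part of Arnio) separates those parts with plain string operations. It assumes no optional build tag is present, which holds for every file in this list.

```python
from typing import NamedTuple


class WheelName(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def split_wheel_name(filename: str) -> WheelName:
    """Split a wheel filename into its five dash-separated components.

    Illustrative helper only; assumes the filename has no build tag.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 components, got {len(parts)}")
    return WheelName(*parts)


info = split_wheel_name(
    "arnio-1.9.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(info.python_tag)               # cp313
print(info.platform_tag.split("."))  # ['manylinux_2_17_x86_64', 'manylinux2014_x86_64']
```

In practice pip picks the matching wheel automatically; a helper like this is only useful when inspecting or mirroring the files by hand.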

File details

Details for the file arnio-1.9.0.tar.gz.

File metadata

  • Download URL: arnio-1.9.0.tar.gz
  • Upload date:
  • Size: 2.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.9.0.tar.gz
Algorithm Hash digest
SHA256 9af18fdc79836909508aea00acce90a9bb0569abe2077de9e084d40b9c706c21
MD5 1a12829bd0b030edb14fb3c0c62c456a
BLAKE2b-256 f609839a1f6af56852997db84b33ef13891f7ff204f3d5272860796e2b862d83

See more details on using hashes here.
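
One way to use the published digests is to recompute the SHA256 of a downloaded file and compare it with the value listed above. The sketch below uses only the standard library. The local path `arnio-1.9.0.tar.gz` is an assumption about where the file was saved, and `sha256_of` is an illustrative helper name.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest published above for arnio-1.9.0.tar.gz.
EXPECTED = "9af18fdc79836909508aea00acce90a9bb0569abe2077de9e084d40b9c706c21"

archive = Path("arnio-1.9.0.tar.gz")  # assumed local download location
if archive.exists():
    actual = sha256_of(archive)
    print("OK" if actual == EXPECTED else f"MISMATCH: {actual}")
else:
    print(f"{archive} not found; download it first")
```

The same comparison works for any wheel listed here by swapping in its filename and SHA256 value. pip can also enforce digests at install time through its hash-checking mode (`--require-hashes` with a requirements file).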

File details

Details for the file arnio-1.9.0-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: arnio-1.9.0-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 206.8 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.9.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 997ed918239a783c32f2fc8261de07f02c352d2e238a9528ee5943ebab2b41c3
MD5 4d42222db835dc5b12b6921055845577
BLAKE2b-256 9da06b2b6622a0c88edb19255727b70897719d1194b77b8e479aa4c40e224a59

File details

Details for the file arnio-1.9.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5553e38ed213899c43c815d2e2953c60e9bf0f6518d3c86bf417f5b9e85b2b82
MD5 fb0b1c2b485d6232a9cd6dc93608050c
BLAKE2b-256 b17d048946047ed9881262d54d96181452038b7fd84c2eac2e75c294f7cf7988

File details

Details for the file arnio-1.9.0-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a08db637ff93076bf1c307c005190b6b7c9f87d62c18129fa287813779e134d8
MD5 ec015cf0a14f68e5d04dd413147b00f2
BLAKE2b-256 d314fb4fd0adebc624449e00e7be89d2506af4a9dc2ddda347e15213f4417f6c

File details

Details for the file arnio-1.9.0-cp313-cp313-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp313-cp313-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 b35de717d4a06b7af3f972999befd0748ae2a8c3da318fa2f81f3d5b67226532
MD5 9e61b5f6c3d0d2da66505b80b5534a58
BLAKE2b-256 4adbcd0abf24441dc7d1afa7c9329571abad570888c10b160ab35d0d995552a4

File details

Details for the file arnio-1.9.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: arnio-1.9.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 206.8 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.9.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 b910abea1616cf31381f3862f3b7c499d87cd98f519ec10b91576f38b6ead248
MD5 8299f3988788a76b8b5682b611a63444
BLAKE2b-256 bac32001ffa8fb85d9ed747c4d32b910418db2739b2993fdeb099db95f83a987

File details

Details for the file arnio-1.9.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9edecac46b202ca30d9fd02b65a8c315e18da8a936ac7de0154294b062c047d2
MD5 5a2c61e46db64932d47cce6f827449ff
BLAKE2b-256 182fba3d8acc55b270332f2841421d5004c25823a649aa562aad3605675d9727

File details

Details for the file arnio-1.9.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 20419f6c5cf59375a0b38ba50c07c70263335bd7091e1965603b97eae4e0e6d7
MD5 d2880e998de0d11b2c236d0ce14bcbab
BLAKE2b-256 649dd809c701985160df1d2f94287c4cb9164fb83e1e9527adfc0cbfe6302031

File details

Details for the file arnio-1.9.0-cp312-cp312-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 e75bc93cb7bc835a25401f25dffdfe7ecdc0c0de60d28b4cde00bc5cb4f17261
MD5 b0516eb00a60d20adf8feda7a47a82f4
BLAKE2b-256 db65a9e3ce123e0fd2f209119404bb0bfd5c1d5e50a4c7a753a4bf59f346aa02

File details

Details for the file arnio-1.9.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: arnio-1.9.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 204.4 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.9.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 dc7aa6c5f1299ddf236458bb4ea59c3f5729775ca2cd30df3e8b7fe21867906b
MD5 627a914675a782fc8e9d2e253c3fce16
BLAKE2b-256 e852669a65c71c03961912caa73fc61796eb3e9063c957913ce7ab26385f2477

File details

Details for the file arnio-1.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 54bfe3bc9b674248515d77eb7030c7792b3a5396f1c9a5b3ea0f0973767b8ba4
MD5 8fba530ec31e097e7c29a3ad3b9e5b0e
BLAKE2b-256 522b077b8cc2592a5d39b966daecaa8b4fb1e9042e7f3fa2a7ba249a6271780e

File details

Details for the file arnio-1.9.0-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8df076e0b726fb7afe2145d4540dbcbd5fe3c7ea2b82072b6947360fde41db9b
MD5 57683a0987c718f8699aecec13852e25
BLAKE2b-256 8b5e95e7684dcea9239fc60190a5d28e1b559168e1e001a1bfc598eee768695e

File details

Details for the file arnio-1.9.0-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 c3cf7352de8d9da905cb6d4136b0b1e7a8134f8d1138603c69bfac8ebf352d93
MD5 1b9bd2b0f48beb64f249810a56b1ee84
BLAKE2b-256 f11833adceafdfb6a6e043af9e2f1bbd84efecffad64ceab5c6e51b5ac1a3f39

File details

Details for the file arnio-1.9.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: arnio-1.9.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 203.7 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.9.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 508fd4f4172741249da04fdc2ef91ad553c0e2a6f1b9fd907771449f4e1b0716
MD5 52162598de5fedd3c34e7bfe1e74e16f
BLAKE2b-256 6b8a79eebfa1805ed0f1872ca782f828e94fed61694ef9c66a2e3820988ad94c

File details

Details for the file arnio-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1439ecae15a0f113bfa7101c7d8dde4f4182ee1d7acd99352abada683f833e33
MD5 0b31e2d2f76dfd20fffec4aecc6ea60a
BLAKE2b-256 3fe1540e0a9420ec81d8399ac1b08a6eeeb2c9e00a6fea4f277fb6b7690a02fd

File details

Details for the file arnio-1.9.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 53f6c23ec5b5425685d2fed1ff59aba7b09a35505656bd3ec5a8928b71cf06f3
MD5 4e8cb894b2ee2a2ce30351d21ad03a4f
BLAKE2b-256 83e47b23645fc02d8de9006e9f8f7238717bc7097737bbe2960437bd42162bf6

File details

Details for the file arnio-1.9.0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 9e17dab2add22d712b888125585d8c02e8ce9b3d7fa4c63d0e34b9e99f9c7e88
MD5 81cb380b4a130809cf5b6e4b03066f3d
BLAKE2b-256 1fe63f4816144370acbbbceec65af8ee74513ecbcc703c29166bf378ace9756f

File details

Details for the file arnio-1.9.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: arnio-1.9.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 210.3 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.9.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 9feaa99848590fd6f5c32ca577f6e77e110068eb58e289f0db5613cbf1c790bd
MD5 15a7e2b09f5939b10ccf48ca6bebb156
BLAKE2b-256 770aa55c8651ad8f57abd2f2fc84d9a20e3ab92abc75cdaba3470a5343cc46f3

File details

Details for the file arnio-1.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5a1a1d7d58ab030b1595f117d926c4d5881a54d6e21936d2ae5774793a4405d5
MD5 6a0bba3c5251c5a3c786f857b6c6cad2
BLAKE2b-256 216b0f64a723d594a5a8f7a3861800b45d06603a76aefa676211aebabffcae26

File details

Details for the file arnio-1.9.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

  • Download URL: arnio-1.9.0-cp39-cp39-macosx_11_0_arm64.whl
  • Upload date:
  • Size: 194.1 kB
  • Tags: CPython 3.9, macOS 11.0+ ARM64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.9.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f0142c1b0fdb8f603407a134e199c845ce3c823ba3a358f637094d9a71d7ba3d
MD5 19562dcc61460596534954e2461cfa37
BLAKE2b-256 04a82b1008a72cdf857e5c9247c2dc41fe17dd54d6bfff8a3ed2e2126ab443fa

File details

Details for the file arnio-1.9.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: arnio-1.9.0-cp39-cp39-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 208.0 kB
  • Tags: CPython 3.9, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.9.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 15cb2d41da09ed4b94eaa54b86026360e39fbe61197bd5c495b56c439df114ea
MD5 eef74895befd5967e2df3dc3603bc03a
BLAKE2b-256 bcc5730ef646961eaf0c863ce5162d71fceac2a654b6ed44a39dcc70874fad2b
