
C++ accelerated data preparation for pandas and the Python data stack



Arnio



Fast data preparation for the Python data stack.


Arnio is a compiled C++ data preparation engine for messy CSV and pandas workflows.
It parses, infers types, strips whitespace, deduplicates, validates, and profiles data —
then hands clean results back to the tools you already use.
Use Arnio before and alongside pandas, NumPy, scikit-learn, DuckDB, and Arrow.





pip install arnio

Colab install smoke test: COLAB_SMOKE_TEST.md


Quickstart · Integrations · Why Arnio · Architecture · Benchmarks · Community · Contribute




⚡ Quickstart

A simple workflow in just a few steps.

New to Arnio? Start with the pandas workflow example below before exploring advanced pipelines.

import arnio as ar

# Load CSV directly through C++ — no Python parsing overhead
frame = ar.read_csv("messy_sales_data.csv")

# Declare what clean data looks like — arnio handles the rest
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# Out comes a standard pandas DataFrame — use it like you always have
df = ar.to_pandas(clean)

# Use copy=True when you need defensive pandas-owned buffers
safe_df = ar.to_pandas(clean, copy=True)

Already have a pandas DataFrame? Use Arnio directly inside your existing pandas workflow:

import pandas as pd
import arnio as ar

df = pd.read_csv("messy_sales_data.csv")

clean_df = df.arnio.clean([
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_duplicates",),
])

report = clean_df.arnio.profile()

Select specific columns

Use select_columns() to create a new ArFrame with only the required columns before converting to pandas.

selected = frame.select_columns(["name", "revenue"])

print(selected.columns)
# ['name', 'revenue']

Every step above executes in C++. Your Python code is a configuration — not the execution engine.


📸 Peek at a 100 GB file without loading it

scan_csv reads only the header and a sample of rows to infer the schema. The rest of the file is never loaded.

schema = ar.scan_csv("100GB_file.csv")
# {'id': 'int64', 'name': 'string', 'is_active': 'bool', 'revenue': 'float64'}

Useful for exploring datasets before committing memory.
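To make the idea concrete, here is a minimal standard-library sketch of sample-based schema inference. It is not Arnio's implementation: scan_schema, infer_type, and the sample_rows parameter are hypothetical names, and the inference rules (bool before int64 before float64, empty cells ignored) are assumptions chosen for illustration.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v.strip() for v in values if v.strip() != ""]
    if not non_empty:
        return "null"
    if all(v.lower() in ("true", "false") for v in non_empty):
        return "bool"
    for name, caster in (("int64", int), ("float64", float)):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def scan_schema(handle, sample_rows=100):
    """Read the header plus at most `sample_rows` rows, never the full file."""
    reader = csv.reader(handle)
    header = next(reader)
    sample = list(islice(reader, sample_rows))
    columns = list(zip(*sample)) if sample else [()] * len(header)
    return {name.strip(): infer_type(col) for name, col in zip(header, columns)}


text = "id,name,is_active,revenue\n1,Ada,true,10.5\n2,Bob,false,\n"
schema = scan_schema(io.StringIO(text))
print(schema)
```

Because only a bounded sample is inspected, a value deep in the file can still contradict the inferred type, which is the usual trade-off of this approach.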

👀 Preview rows without pandas conversion or full-column Python list materialization

preview() reads only the first n rows directly from the C++ frame — no pandas conversion triggered.

frame = ar.read_csv("huge_file.csv")

print(frame.preview())      # first 5 rows (default)
print(frame.preview(n=10))  # first 10 rows

Raises ValueError for invalid n (zero, negative, or non-integer).
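The validation rule above can be sketched in plain Python. This is an illustration only: validate_preview_count and preview_rows are hypothetical helpers, and rejecting bool values is an assumption, not documented Arnio behavior.

```python
def validate_preview_count(n=5):
    """Return n if it is a positive integer, otherwise raise ValueError."""
    # Assumption: bool is rejected even though it subclasses int in Python.
    if isinstance(n, bool) or not isinstance(n, int) or n <= 0:
        raise ValueError(f"n must be a positive integer, got {n!r}")
    return n


def preview_rows(rows, n=5):
    """Slice the first n rows without touching the rest of the data."""
    return rows[:validate_preview_count(n)]


rows = [{"id": i} for i in range(100)]
first_five = preview_rows(rows)
first_ten = preview_rows(rows, n=10)

# Zero, negative, and non-integer counts are all rejected.
rejected = []
for bad in (0, -3, 2.5, "10"):
    try:
        preview_rows(rows, n=bad)
    except ValueError:
        rejected.append(bad)
print(len(first_five), len(first_ten), rejected)
```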

🧩 Add custom steps without touching C++

Register any Python function as a pipeline step. It receives a DataFrame, returns a DataFrame.

def remove_outliers(df, column="revenue", threshold=100_000):
    return df[df[column] <= threshold]

ar.register_step("remove_outliers", remove_outliers)

# Now use it in any pipeline alongside native C++ steps
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("remove_outliers", {"column": "revenue", "threshold": 50000}),
    ("drop_duplicates",),
])

Custom steps run through a pandas↔ArFrame conversion bridge. Prototype in Python, then optionally migrate hot paths to C++ for full speed.
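The registration pattern itself is small enough to sketch without Arnio. The version below is a simplified stand-in that works on lists of dicts rather than DataFrames; STEP_REGISTRY and run_pipeline are hypothetical names, and the real library additionally routes custom steps through its pandas conversion bridge.

```python
STEP_REGISTRY = {}


def register_step(name, func):
    """Map a step name to a callable that takes rows and returns rows."""
    if not callable(func):
        raise TypeError(f"step {name!r} must be callable")
    STEP_REGISTRY[name] = func


def run_pipeline(rows, steps):
    """Apply each ("name",) or ("name", {kwargs}) step in order."""
    for step in steps:
        name = step[0]
        kwargs = step[1] if len(step) > 1 else {}
        if name not in STEP_REGISTRY:
            raise KeyError(f"unknown step: {name!r}")
        rows = STEP_REGISTRY[name](rows, **kwargs)
    return rows


def strip_whitespace(rows):
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def remove_outliers(rows, column="revenue", threshold=100_000):
    return [row for row in rows if row[column] <= threshold]


register_step("strip_whitespace", strip_whitespace)
register_step("remove_outliers", remove_outliers)

rows = [{"name": " Ada ", "revenue": 10}, {"name": "Bob", "revenue": 90_000}]
result = run_pipeline(rows, [
    ("strip_whitespace",),
    ("remove_outliers", {"column": "revenue", "threshold": 50_000}),
])
print(result)
```

Keeping steps as plain (name, kwargs) tuples is what lets a pipeline be stored, compared, or generated as data.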




🔗 Integrations

Arnio is designed to make the rest of the Python data stack more productive, not to replace it.

| Workflow | How Arnio helps |
| --- | --- |
| pandas | Clean, validate, and profile messy DataFrames through df.arnio. |
| NumPy | Prepare typed numeric data before array/modeling workflows. |
| scikit-learn | Use Arnio cleaning as a preprocessing layer before model training. |
| DuckDB / Arrow | Validate and prepare data before analytics and columnar exchange. |
| notebooks | Inspect quality issues and cleaning suggestions before analysis. |

Pandas accessor

import pandas as pd
import arnio as ar

df = pd.read_csv("raw_customers.csv")

clean_df = df.arnio.clean(drop_duplicates=True)
quality = clean_df.arnio.profile()
validation = clean_df.arnio.validate({
    "email": ar.Email(nullable=False),
    "user_code": ar.Regex(r"^USR-\d{4}$", nullable=False),
    "age": ar.Int64(nullable=True, min=0),
})

This keeps pandas as the analysis tool while Arnio handles the preparation, quality, and validation layer.

Product direction: PROJECT_DIRECTION.md

📘 Examples

These examples demonstrate how Arnio integrates with the Python data ecosystem.

They follow a simple workflow:

clean/validate data with Arnio → analyze with other tools

🔹 Interoperability Examples

  • Arnio + pandas: Clean and normalize messy tabular data using Arnio, then analyze it using pandas. Run:
  python examples/arnio_with_pandas.py
  • Arnio + NumPy: Prepare numeric data safely using Arnio, then perform computations using NumPy. Run:
  python examples/arnio_with_numpy.py
  • Arnio + scikit-learn: Prepare messy data with Arnio, then train a model with scikit-learn. Run:
  python examples/arnio_with_sklearn.py
  • Arnio + DuckDB: Clean data with Arnio, then run SQL queries using DuckDB. Run:
  python examples/arnio_with_duckdb.py



🔍 Why Arnio exists

Every data project starts the same way:

df = pd.read_csv("data.csv")              # 💥 RAM spike — entire file as raw strings
df.columns = df.columns.str.strip()        # Why is this not automatic?
df["name"] = df["name"].str.strip()        # Python loop over every cell
df["name"] = df["name"].str.lower()        # Another Python loop
df = df.dropna()                           # Another pass
df = df.drop_duplicates()                  # Another pass

Six lines and four full passes over the data, with every string operation touching each cell one at a time. This is fine for a Jupyter demo, but it doesn't scale, it doesn't compose, and it definitely doesn't belong in production.

Arnio intercepts this entire pattern. It moves the preparation layer into a predictable pipeline, accelerates supported operations in C++, and gives you clean data for pandas, NumPy, scikit-learn, DuckDB, or notebooks.

Without Arnio

df = pd.read_csv(path)
df.columns = df.columns.str.strip()
for col in str_cols:
    df[col] = df[col].str.strip()
    df[col] = df[col].str.lower()
df = df.dropna(subset=["revenue"])
df = df.drop_duplicates()
# 6+ lines, multiple passes, pure Python

With Arnio

frame = ar.read_csv(path)
df = ar.to_pandas(ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_nulls", {"subset": ["revenue"]}),
    ("drop_duplicates",),
]))
# Declarative. Single pipeline. C++ execution.



🏗️ Architecture

Arnio is not a pandas wrapper. It's a separate runtime with its own data model.

flowchart LR
  subgraph python["Your Python Code"]
    PY["frame = ar.read_csv('data.csv')\nclean = ar.pipeline(frame, [...])\ndf = ar.to_pandas(clean)"]
  end

  python -->|"pybind11 boundary"| cpp

  subgraph cpp["C++ Runtime (_arnio_cpp)"]
    direction TB
    CSV["CsvReader\n• RFC 4180\n• BOM strip\n• Type inference\n• Quoted fields"]
    FRAME["Frame / Column\n• Columnar\n• std::variant\n• Bool null masks\n• O(1) column lookup"]
    CLEAN["Cleaning Engine\n• drop_nulls\n• fill_nulls\n• drop_dupes\n• strip_ws\n• normalize\n• rename/cast"]
    CSV --> FRAME --> CLEAN
  end

  cpp -->|"to_pandas() → zero-copy NumPy buffer (numerics/bools)"| OUT["pandas DataFrame"]

Design decisions that matter

| Decision | What it means |
| --- | --- |
| Columnar storage | Data lives in typed std::vectors (vector<int64_t>, vector<double>, vector<string>), not rows of variants. Cache-friendly and SIMD-ready. |
| Boolean null masks | Nulls are tracked in a separate vector<bool>, keeping data vectors dense. No sentinel values, no NaN tricks. |
| Two-pass CSV read | Pass 1 infers types across all rows. Pass 2 parses values directly into the correct typed column. No string→object→cast overhead. |
| Zero-copy bridge | to_pandas() exposes C++ memory directly via NumPy's buffer protocol where supported. Numeric columns preserve the fast zero-copy path by default, while copy=True requests defensive pandas-owned buffers. |
| Step registry | Pipeline steps map to C++ function pointers. Adding a new cleaning primitive is a single function + one registry entry. |

Full architecture documentation: ARCHITECTURE.md
API reference guide: Arnio API Reference
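The two-pass read and the null-mask layout described above can be sketched in a few lines of Python. This is a teaching model, not the C++ reader: read_two_pass and try_cast are hypothetical names, the placeholder stored in a null slot (0, 0.0, or "") is an assumption, and unlike a real reader the sketch holds all rows in memory at once.

```python
import csv
import io


def try_cast(value, caster):
    """Report whether `caster` accepts `value` without raising."""
    try:
        caster(value)
        return True
    except ValueError:
        return False


def read_two_pass(text):
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    columns = {}
    for idx, name in enumerate(header):
        raw = [row[idx] for row in body]
        present = [v for v in raw if v != ""]
        # Pass 1: choose a single type for the whole column.
        if present and all(try_cast(v, int) for v in present):
            caster, placeholder = int, 0
        elif present and all(try_cast(v, float) for v in present):
            caster, placeholder = float, 0.0
        else:
            caster, placeholder = str, ""
        # Pass 2: parse straight into a dense typed list plus a null mask.
        data = [caster(v) if v != "" else placeholder for v in raw]
        nulls = [v == "" for v in raw]
        columns[name] = {"data": data, "nulls": nulls}
    return columns


frame = read_two_pass("id,revenue,city\n1,10.5,Paris\n2,,\n")
print(frame["revenue"])
```

The data list stays dense and uniformly typed, while the separate mask records which slots are null, so no sentinel value has to be reserved.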




🏎️ Benchmarks

Reference environment: Ubuntu, Python 3.12, synthetic messy CSV inputs.
Reproduce: make benchmark — generates deterministic tall and wide datasets and runs both engines.

To reproduce the published numbers from a fresh checkout:

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
python benchmarks/generate_data.py
python benchmarks/benchmark_vs_pandas.py

benchmarks/generate_data.py uses deterministic NumPy seeds, so every run creates the same benchmarks/benchmark_1m.csv tall input and benchmarks/benchmark_wide.csv wide input. The benchmark then executes three pandas runs and three arnio runs for each case, printing average wall-clock time from time.perf_counter() and peak Python allocation from tracemalloc. For cleaner comparisons, close other memory-heavy processes and run the script from the repository root after installing the same Python, pandas, NumPy, compiler, and arnio commit you want to compare.
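The measurement approach described above (repeated runs, time.perf_counter() for wall-clock time, tracemalloc for peak Python allocation) can be sketched as follows. This is not the project's benchmark script: measure and workload are hypothetical names, and the workload is a stand-in for a real pandas or arnio cleaning run.

```python
import time
import tracemalloc


def measure(func, runs=3):
    """Return (average seconds, highest peak traced bytes) over `runs` calls."""
    durations, peaks = [], []
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    return sum(durations) / runs, max(peaks)


def workload():
    # Hypothetical stand-in for a cleaning run.
    return [str(i).strip() for i in range(50_000)]


avg_seconds, peak_bytes = measure(workload)
print(f"Exec Time (avg) {avg_seconds:.4f}s | Peak {peak_bytes / 1e6:.1f}MB")
```

Note that tracemalloc only traces memory allocated through Python's allocators, so memory a native extension allocates on its own heap is not reflected in the peak figure.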

Expected output format:

Tall CSV (1,000,000 rows x 12 columns)
Metric                     pandas        arnio
────────────────────────────────────────────
Exec Time (avg)       4.73s         5.75s
Peak RAM               211MB         212MB
Speed: 0.8x | RAM: -1% reduction

Wide CSV (5,000 rows x 256 columns)
Metric                     pandas        arnio
────────────────────────────────────────────
Exec Time (avg)       ...s          ...s
Peak RAM              ...MB         ...MB
Speed: ...x | RAM: ...% reduction

Small differences are expected across CPUs, operating systems, compilers, Python builds, and pandas/NumPy versions. If you share benchmark results in an issue or PR, include your OS, Python version, CPU model, pandas/NumPy versions, arnio commit, and the full command output so maintainers can compare like for like.

Arnio is near memory parity in the reference benchmark while replacing ad-hoc Python string loops with a compiled, declarative pipeline. Validate memory and speed on your own workload. The execution time gap is a known, active optimization target — the current drop_duplicates and strip_whitespace implementations use unoptimized row-key serialization.

What's already won

  • Native C++ parsing eliminates Python memory spikes
  • Columnar storage matches pandas' internal efficiency
  • Declarative API eliminates .apply() spaghetti
  • Zero-copy bridge for numeric conversions

🎯 What's being optimized

  • drop_duplicates: replace string serialization with hash-based comparisons
  • strip_whitespace: in-place mutation instead of copy-on-write
  • Parallel column processing via std::thread
  • Help close the gap →
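As a rough illustration of the hash-based direction, the sketch below deduplicates rows by hashing each row as a tuple instead of joining its cells into one string. It is a Python model, not the planned C++ code, and treating keep="none" as "drop every row that has a duplicate" is an assumption.

```python
def drop_duplicates(rows, keep="first"):
    """Deduplicate rows by hashing each row as a tuple of its values."""
    if keep not in ("first", "last", "none"):
        raise ValueError("keep must be 'first', 'last', or 'none'")
    keys = [tuple(row) for row in rows]
    counts, last_index = {}, {}
    for i, key in enumerate(keys):
        counts[key] = counts.get(key, 0) + 1
        last_index[key] = i
    seen, kept = set(), []
    for i, key in enumerate(keys):
        if keep == "first" and key not in seen:
            kept.append(rows[i])
        elif keep == "last" and last_index[key] == i:
            kept.append(rows[i])
        elif keep == "none" and counts[key] == 1:
            kept.append(rows[i])
        seen.add(key)
    return kept


rows = [("a,b", "c"), ("a", "b,c"), ("a,b", "c"), ("x", "y")]
# A naive ",".join(row) key cannot tell the first two rows apart.
naive_collision = ",".join(rows[0]) == ",".join(rows[1])
first = drop_duplicates(rows, keep="first")
last = drop_duplicates(rows, keep="last")
none = drop_duplicates(rows, keep="none")
print(naive_collision, first, last, none)
```

Hashing the tuple compares cell boundaries as well as cell contents, which a delimiter-joined string key only does if the delimiter is escaped.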

🧠 Auto Clean Memory Benchmark

To measure the peak memory and execution time of the auto_clean pipeline using realistic dataset sizes:

python benchmarks/benchmark_auto_clean_memory.py --rows 100000

This script generates a reproducible synthetic dataset with mixed column types (strings, ints, floats, booleans, nulls, and duplicates) and measures:

  • ar.read_csv performance
  • ar.auto_clean(mode="safe") performance (low-risk cleanup like whitespace trimming)
  • ar.auto_clean(mode="strict") performance (includes type casting and deduplication)

The dataset is regenerated deterministically unless --reuse-file is provided. Each auto_clean benchmark run reloads the dataset to avoid mutation or caching effects between runs.

Options:

  • --repeat N runs each operation multiple times and reports average (and min/max range).
  • --seed N changes the deterministic dataset seed.
  • --reuse-file reuses an existing dataset file instead of regenerating it.
  • --keep-file keeps the generated CSV (otherwise it is removed at the end).

Expected output format:

Operation                    Time(s)     Peak Py(MiB)
--------------------------------------------------------------------
ar.read_csv           0.042 (0.041-0.044)    4.52 (4.50-4.60)
ar.auto_clean(safe)   0.012 (0.011-0.013)    0.15 (0.14-0.16)
ar.auto_clean(strict) 0.035 (0.034-0.036)    1.20 (1.18-1.22)
--------------------------------------------------------------------
Total avg (Read+Strict)       0.077             4.52



🧰 Cleaning primitives

Most operations below run natively in C++. Currently, filter_rows and replace_values run via the Python (pandas) backend and may be optimized in C++ later.

| Primitive | What it does | Example |
| --- | --- | --- |
| drop_nulls | Remove rows with null/empty values | ar.drop_nulls(frame, subset=["age"]) |
| keep_rows_with_nulls | Keep only rows that contain at least one null | ar.keep_rows_with_nulls(frame, subset=["age"]) |
| validate_columns_exist | Fail early when required columns are missing | ar.validate_columns_exist(frame, ["age"]) |
| filter_rows | Filter rows using comparison operators | ar.filter_rows(frame, column="age", op=">", value=18) |
| fill_nulls | Replace nulls with a scalar | ar.fill_nulls(frame, 0, subset=["revenue"]) |
| drop_duplicates | Deduplicate rows (first/last/none) | ar.drop_duplicates(frame, keep="first") |
| drop_constant_columns | Remove columns with only one unique value | ar.drop_constant_columns(frame) |
| clip_numeric | Clip numeric values to lower and/or upper bounds | ar.clip_numeric(frame, lower=0, upper=100) |
| strip_whitespace | Trim leading/trailing spaces from strings | ar.strip_whitespace(frame) |
| normalize_case | Force lower/upper/title case | ar.normalize_case(frame, case_type="title") |
| rename_columns | Rename columns via mapping | ar.rename_columns(frame, {"old": "new"}) |
| cast_types | Cast column types | ar.cast_types(frame, {"age": "int64"}) |
| round_numeric_columns | Round numeric columns (non-numeric columns in subset ignored safely) | ar.round_numeric_columns(frame, decimals=2) |
| replace_values | Replace values using a mapping (column or whole-frame). Handles None/NaN. | ar.replace_values(frame, {"active": "A", "inactive": "I"}, column="status") |
| clean | Convenience shorthand | ar.clean(frame, drop_nulls=True) |
| safe_divide_columns | Divide one column by another, handling zero/null denominators | ar.safe_divide_columns(frame, numerator="revenue", denominator="cost", output_column="ratio") |
| trim_column_names | Strip leading/trailing whitespace from column names | ar.trim_column_names(frame) |

ArFrame.select_dtypes — type-based column selection

Returns a new ArFrame containing only the columns whose dtype matches the filter. Raises ValueError if no columns match.

frame = ar.read_csv("data.csv")

# Keep only numeric columns
numeric = frame.select_dtypes(include=["int64", "float64"])

# Drop string columns
without_strings = frame.select_dtypes(exclude="string")

Valid dtype strings: "int64", "float64", "string", "bool", "null"

  • At least one of include or exclude must be given — raises ValueError otherwise.
  • include and exclude must not overlap — raises ValueError if they share a dtype.
  • Unknown dtype strings raise ValueError with a list of valid options.
  • Raises ValueError when no columns match (never returns an empty frame silently).
  • Column order in the result always matches the original frame.
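The rules in the list above can be expressed as a short selection function. The sketch below operates on a plain {column: dtype} dict rather than an ArFrame; select_columns_by_dtype is a hypothetical helper written to mirror the documented rules, not the library's code.

```python
VALID_DTYPES = ("int64", "float64", "string", "bool", "null")


def _as_list(value):
    """Accept None, a single dtype string, or an iterable of dtype strings."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def select_columns_by_dtype(schema, include=None, exclude=None):
    """`schema` is an ordered {column: dtype} dict; returns matching names."""
    inc, exc = _as_list(include), _as_list(exclude)
    if not inc and not exc:
        raise ValueError("at least one of include or exclude must be given")
    unknown = [d for d in inc + exc if d not in VALID_DTYPES]
    if unknown:
        raise ValueError(f"unknown dtype(s) {unknown}; valid: {list(VALID_DTYPES)}")
    overlap = set(inc) & set(exc)
    if overlap:
        raise ValueError(f"include and exclude overlap: {sorted(overlap)}")
    # Iterating the dict preserves the original column order.
    selected = [
        name for name, dtype in schema.items()
        if (not inc or dtype in inc) and dtype not in exc
    ]
    if not selected:
        raise ValueError("no columns match the requested dtypes")
    return selected


schema = {"id": "int64", "name": "string", "revenue": "float64"}
numeric = select_columns_by_dtype(schema, include=["int64", "float64"])
without_strings = select_columns_by_dtype(schema, exclude="string")
print(numeric, without_strings)
```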

Or compose the cleaning primitives into a single pipeline:

clean = ar.pipeline(frame, [
    ("validate_columns_exist", {"columns": ["name", "city", "revenue"]}),
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": "unknown", "subset": ["city"]}),
    ("drop_duplicates", {"keep": "first"}),
])

🔁 Replace values

Use replace_values to substitute values using a mapping. It works as a pipeline step (Python backend) and can operate on a single column or the whole frame when column is omitted. It also understands null semantics: using None (or np.nan) as a mapping key targets existing nulls, and mapping a value to None creates real nulls.

Column-specific example:

clean = ar.pipeline(frame, [
    ("replace_values", {"mapping": {"active": "A", "inactive": "I"}, "column": "status"}),
])

Whole-frame example (no column):

clean = ar.pipeline(frame, [
    ("replace_values", {"mapping": {None: "MISSING", "active": "A", "inactive": "I"}}),
])

Direct API:

frame2 = ar.replace_values(frame, {"active": "A", "inactive": "I"})
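The null semantics described above are easy to misread, so here is a small pure-Python model of them. It works on lists of dicts and is only a sketch of the documented behavior: is_null and this replace_values are illustrative stand-ins, not Arnio's pandas-backed implementation.

```python
import math


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def replace_values(rows, mapping, column=None):
    """Replace values per `mapping`; a None/NaN key targets existing nulls."""
    null_rules = [target for key, target in mapping.items() if is_null(key)]
    plain = {key: target for key, target in mapping.items() if not is_null(key)}

    def convert(value):
        if is_null(value):
            return null_rules[0] if null_rules else value
        return plain.get(value, value)

    return [
        {name: convert(v) if column in (None, name) else v for name, v in row.items()}
        for row in rows
    ]


rows = [{"status": "active", "note": None}, {"status": "inactive", "note": "ok"}]
by_column = replace_values(rows, {"active": "A", "inactive": "I"}, column="status")
whole_frame = replace_values(rows, {None: "MISSING", "active": "A", "inactive": "I"})
made_null = replace_values(rows, {"ok": None}, column="note")
print(by_column, whole_frame, made_null)
```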

🔎 Filter rows inside pipelines

Use filter_rows to keep only rows matching a condition.

clean = ar.pipeline(frame, [
    ("filter_rows", {
        "column": "revenue",
        "op": ">=",
        "value": 1000
    }),
])

Supported operators:

  • >
  • <
  • >=
  • <=
  • ==
  • !=

Works with:

  • integers
  • floats
  • strings
  • booleans
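One common way to implement an operator-string filter like this is a lookup table over the standard operator module. The sketch below shows that pattern on lists of dicts; it is not Arnio's code, and skipping rows whose cell is None is an assumption about null handling.

```python
import operator

OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def filter_rows(rows, column, op, value):
    """Keep rows where `row[column] <op> value` holds."""
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator {op!r}; use one of {list(OPERATORS)}")
    compare = OPERATORS[op]
    # Assumption: a null cell never satisfies a comparison.
    return [
        row for row in rows
        if row[column] is not None and compare(row[column], value)
    ]


rows = [
    {"name": "Ada", "revenue": 1500},
    {"name": "Bob", "revenue": 400},
    {"name": "Cy", "revenue": None},
]
high = filter_rows(rows, column="revenue", op=">=", value=1000)
not_ada = filter_rows(rows, column="name", op="!=", value="Ada")
print(high, not_ada)
```

Validating the operator string against a fixed table also avoids evaluating user-supplied expressions.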

🔎 Isolate rows with null values

Use keep_rows_with_nulls to audit incomplete data — keep only rows that have at least one null.

frame = ar.read_csv("data.csv")

# Keep all rows that have at least one null anywhere
nulls = ar.keep_rows_with_nulls(frame)

# Keep rows where specifically 'age' or 'score' is null
nulls = ar.keep_rows_with_nulls(frame, subset=["age", "score"])

# Works inside a pipeline too
result = ar.pipeline(frame, [
    ("keep_rows_with_nulls", {"subset": ["age"]}),
])

Useful for data auditing — inspect what's missing before deciding how to fill or drop.

Boolean string normalization

clean = ar.parse_bool_strings(frame)

This normalizes values such as "yes", "no", "true", "false", "y", "n", "1", and "0" into boolean values while preserving unsupported values unchanged.

Columns containing both parsed boolean values and unsupported string values may round-trip as strings because of ArFrame column typing semantics.
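The mixed-column caveat is easier to see with a value-level sketch. The code below is a plain-Python model, not the library's implementation: parse_bool_value and column_dtype are hypothetical helpers, and matching case-insensitively after trimming whitespace is an assumption.

```python
TRUE_STRINGS = {"yes", "true", "y", "1"}
FALSE_STRINGS = {"no", "false", "n", "0"}


def parse_bool_value(value):
    """Map a recognised boolean string to bool; return anything else unchanged."""
    if isinstance(value, str):
        token = value.strip().lower()  # assumption: case/whitespace-insensitive
        if token in TRUE_STRINGS:
            return True
        if token in FALSE_STRINGS:
            return False
    return value


def column_dtype(values):
    """A single-typed column is 'bool' only if every non-null value parsed."""
    if all(v is None or isinstance(v, bool) for v in values):
        return "bool"
    return "string"


clean_column = [parse_bool_value(v) for v in ["yes", "N", "1", None]]
mixed_column = [parse_bool_value(v) for v in ["true", "maybe", "0"]]
print(clean_column, column_dtype(clean_column))
print(mixed_column, column_dtype(mixed_column))
```

In the mixed case one leftover string is enough to keep the whole column from being typed as bool, which is why such columns may come back as strings.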


🔢 Safe column division

Divide one column by another while handling division by zero and null denominators explicitly:

result = ar.safe_divide_columns(
    frame,
    numerator="revenue",
    denominator="cost",
    output_column="ratio",
    fill_value=0.0,  # used when denominator is zero or null
)

When the denominator is zero or null, the result is replaced with fill_value (default 0.0) instead of raising an error or producing NaN/Inf.
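The documented rule fits in a few lines of plain Python. The sketch below operates on two lists instead of frame columns; safe_divide is a hypothetical helper, and returning None for a null numerator is an assumption, since the text above only specifies the denominator cases.

```python
def safe_divide(numerators, denominators, fill_value=0.0):
    """Element-wise division that substitutes fill_value for bad denominators."""
    result = []
    for num, den in zip(numerators, denominators):
        if den is None or den == 0:
            # Zero or null denominator: use fill_value instead of raising.
            result.append(fill_value)
        elif num is None:
            # Assumption: a null numerator stays null.
            result.append(None)
        else:
            result.append(num / den)
    return result


revenue = [100.0, 50.0, 10.0, None]
cost = [20.0, 0.0, None, 5.0]
ratio = safe_divide(revenue, cost)
print(ratio)
```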



📊 Pandas Dtype Support Matrix

This table helps users understand which pandas dtypes and workflows are fully supported, partially supported, unsupported, or planned.

If a dtype is partially supported, users may need conversion before processing. Unsupported dtypes should raise clear errors where applicable.

| Pandas Dtype | Support Status | Notes |
| --- | --- | --- |
| int64 | ✅ Supported | Fully supported with native C++ columnar storage |
| float64 | ✅ Supported | Fully supported with zero-copy conversion where possible |
| bool | ✅ Supported | Natively supported boolean type |
| string | ✅ Supported | Recommended over object dtype for text workflows |
| datetime64[ns] | ❌ Unsupported for native storage | No native datetime parsing or conversion support yet. Use ar.DateTime() for schema validation of string timestamp columns. |
| category | ⚠️ Limited | Converted to string/object during processing |
| object (mixed columns) | ⚠️ Limited | Mixed object columns may coerce to string and reduce type inference reliability |
| nullable pandas dtypes (Int64, boolean) | ⚠️ Limited | Supported through pandas extension dtypes with null-mask handling |
| timedelta64[ns] | ❌ Unsupported | Not currently supported |

Notes

  • Numeric columns are optimized for zero-copy conversion between C++ and pandas where supported.
  • Pass copy=True to to_pandas() when downstream pandas code needs defensive pandas-owned column buffers.
  • Boolean columns are always copied by the binding because std::vector<bool> cannot be exposed as a zero-copy NumPy buffer in the current implementation.
  • Columns with null masks may require copies so pandas can apply nullable values safely.
  • String columns require Python string object creation during to_pandas() conversion.
  • ar.DateTime() validates string timestamp columns with optional format, min, and max; it does not add native datetime64[ns] storage or automatic datetime conversion.
  • Mixed object columns may reduce type inference accuracy and may require preprocessing.
  • Unsupported dtypes should raise clear user-facing errors instead of silent failures.
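The difference between a zero-copy view and a defensive copy can be demonstrated with the standard library's buffer protocol, without NumPy or Arnio. This is an analogy for what copy=True protects against, not a description of the binding's internals.

```python
from array import array

# A contiguous float64 buffer, standing in for a native numeric column.
values = array("d", [1.0, 2.0, 3.0])

view = memoryview(values)     # zero-copy: shares the same memory
copied = array("d", values)   # defensive copy: owns its own buffer

# Mutating the original is visible through the view but not the copy.
values[0] = 99.0
print(view[0], copied[0])
```

A shared buffer is cheaper to hand over, but both sides see each other's writes; a copy costs memory and time up front in exchange for isolation.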

Note: pandas DataFrame indexes are currently not preserved during from_pandas() conversion. Converted frames receive a default RangeIndex when converted back via to_pandas().




🧠 Data quality engine

Arnio includes built-in profiling so you can understand a dataset before analyzing it in pandas.

report = ar.profile(frame)
print(report.summary())

suggestions = ar.suggest_cleaning(frame)
clean = ar.pipeline(frame, suggestions)

For production data contracts:

schema = ar.Schema({
    "id": ar.Int64(nullable=False, unique=True),
    "email": ar.Email(nullable=False),
    # CountryCode expects uppercase ISO alpha-2 values, for example IN, US, GB.
    "country": ar.CountryCode(nullable=False),
    "username": ar.String(min_length=3, max_length=20),
    "user_code": ar.Regex(r"^USR-\d{4}$", nullable=False),
    "revenue": ar.Float64(nullable=True, min=0),
    "created_at": ar.DateTime(nullable=False, format="%Y-%m-%d"),
})

result = ar.validate(frame, schema)
if not result.passed:
    summary = result.summary()
    print(summary["issues_by_rule"])
    print(summary["issues_by_column"])
    print(summary["issues_by_column_and_rule"])
    print(result.to_pandas())
    print(result.to_markdown(max_issues=10))

ValidationResult.to_markdown() is useful in CI logs, GitHub comments, or data quality reports because it renders a compact validation summary plus a GitHub-friendly issue table.

For multi-column uniqueness (composite keys):

schema = ar.Schema({
    "user_id": ar.Int64(nullable=False),
    "course_id": ar.Int64(nullable=False),
}, unique=["user_id", "course_id"])

result = ar.validate(frame, schema)
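Conceptually, a composite-key uniqueness check groups rows by the tuple of key values and reports every group with more than one member. The sketch below shows that idea on lists of dicts; find_composite_duplicates is a hypothetical helper, not part of the Arnio API.

```python
from collections import defaultdict


def find_composite_duplicates(rows, key_columns):
    """Return {key tuple: [row indexes]} for keys that occur more than once."""
    positions = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[column] for column in key_columns)
        positions[key].append(index)
    return {key: idx for key, idx in positions.items() if len(idx) > 1}


rows = [
    {"user_id": 1, "course_id": 10},
    {"user_id": 1, "course_id": 11},  # same user, different course: allowed
    {"user_id": 1, "course_id": 10},  # repeats the first pair: violation
]
duplicates = find_composite_duplicates(rows, ["user_id", "course_id"])
print(duplicates)
```

Neither column has to be unique on its own; only the combination is constrained.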

Severity counts are not included in summary() yet because ValidationIssue does not currently carry severity information.

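For automatic cleanup with a change report (mode="safe" is the low-risk option; mode="strict" also casts types and removes duplicates):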

clean, report = ar.auto_clean(frame, mode="strict", return_report=True)

This is the layer pandas does not try to own: profiling, data contracts, row-level validation issues, and safe cleaning suggestions for messy incoming datasets.


Beginner-friendly auto-clean tutorial

Use this workflow when you receive a small messy dataset and want to inspect what Arnio will change before applying it.

import arnio as ar
import pandas as pd

raw = pd.DataFrame(
    {
        "order_id": [1001, 1002, 1002, 1003, 1004],
        "customer": [" Ishan ", " Prasoon ", " Prasoon ", " Pranay ", " Dhruv "],
        "city": [" Paris ", "London", "London", " New York ", " Tokyo "],
    }
)

frame = ar.from_pandas(raw)

report = ar.profile(frame)
summary = report.summary()
print(summary)

suggestions = ar.suggest_cleaning(frame)
print(suggestions)
# [('strip_whitespace', {'subset': ['customer', 'city']}), ('drop_duplicates', {'keep': 'first'})]

safe = ar.auto_clean(frame)
strict = ar.auto_clean(frame, mode="strict")

Messy input:

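| order_id | customer | city |
| --- | --- | --- |
| 1001 | " Ishan " | " Paris " |
| 1002 | " Prasoon " | "London" |
| 1002 | " Prasoon " | "London" |
| 1003 | " Pranay " | " New York " |
| 1004 | " Dhruv " | " Tokyo " |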

Expected cleaned output with mode="strict":

| order_id | customer | city |
| --- | --- | --- |
| 1001 | Ishan | Paris |
| 1002 | Prasoon | London |
| 1003 | Pranay | New York |
| 1004 | Dhruv | Tokyo |

mode="safe" only trims whitespace. Use mode="strict" when you also want deterministic built-in cleanup such as exact duplicate removal.

See examples/auto_clean_tutorial.py for a runnable version of this walkthrough.
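For readers who want to check the tables by hand, the following plain-Python sketch reproduces the same transformation on the tutorial data. It models only the two effects shown here (whitespace trimming, then exact duplicate removal); trim and dedupe_first are hypothetical helpers, and the type casting that strict mode also performs is left out.

```python
def trim(rows):
    """Strip leading/trailing whitespace from every string cell."""
    return [
        tuple(v.strip() if isinstance(v, str) else v for v in row)
        for row in rows
    ]


def dedupe_first(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, kept = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept


raw = [
    (1001, " Ishan ", " Paris "),
    (1002, " Prasoon ", "London"),
    (1002, " Prasoon ", "London"),
    (1003, " Pranay ", " New York "),
    (1004, " Dhruv ", " Tokyo "),
]

safe_like = trim(raw)                  # models mode="safe"
strict_like = dedupe_first(trim(raw))  # models mode="strict" without casting
print(len(safe_like), strict_like)
```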


Data Quality Reports

Arnio provides detailed profiling for datasets via ar.profile(). To generate the report shown in these examples, the following code was used:

import arnio as ar
import pandas as pd

# Sample dataset used for these examples
data = {
    "user_id": [101, 102, 103, 104],
    "email": ["test@arnio.ai", "invalid-email", None, "test@arnio.ai"],
    "score": [85.5, 90.0, None, 88.2]
}
df = ar.from_pandas(pd.DataFrame(data))
# Bounded profiling for large datasets (controls how many sample values are kept)
report = ar.profile(df, sample_size=5)
safe_report = report.to_dict(redact_sample_values=True)

Use report.to_dict(redact_sample_values=True) when sharing reports outside your team and you want to avoid exposing raw example/sample values.

1. Terminal Representation (Simplified Example)

A simplified view of the standard string representation of the report object:

DataQualityReport(
    row_count=4,
    column_count=3,
    memory_usage=733,
    duplicate_rows=0,
    columns={
        'user_id': ColumnProfile(dtype='int64', semantic_type='identifier', unique_count=4),
        'email': ColumnProfile(dtype='string', semantic_type='categorical', null_count=1, unique_ratio=0.666667, min=13, max=13, mean=13.0),
        'score': ColumnProfile(dtype='float64', semantic_type='numeric', mean=87.9, min=85.5, max=90.0)
    }
)

2. JSON Format (Excerpts from .to_dict())

Key fields from the structured JSON export for integration with APIs or dashboards. The city entry illustrates the top_values field and is not a column of the sample dataset above:

{
  "row_count": 4,
  "column_count": 3,
  "memory_usage": 733,
  "duplicate_rows": 0,
  "duplicate_ratio": 0.0,
  "columns": {
    "user_id": {
      "dtype": "int64",
      "semantic_type": "identifier",
      "null_count": 0,
      "unique_ratio": 1.0
    },
    "email": {
      "dtype": "string",
      "semantic_type": "categorical",
      "null_count": 1,
      "unique_ratio": 0.666667,
      "min": 13,
      "max": 13,
      "mean": 13.0,
      "warnings": ["contains_nulls"]
    },
    "score": {
      "dtype": "float64",
      "semantic_type": "numeric",
      "null_count": 1,
      "mean": 87.9,
      "min": 85.5,
      "max": 90.0,
      "warnings": ["contains_nulls"]
    },
    "city": {
      "dtype": "string",
      "semantic_type": "categorical",
      "null_count": 0,
      "top_values": [
        {"value": "London", "count": 3, "ratio": 0.5},
        {"value": "Paris", "count": 2, "ratio": 0.333}
      ]
    }
  }
}

3. Example Summary Table

A manually formatted Markdown table representing the core metrics:

| Metric | Value |
| --- | --- |
| Row Count | 4 |
| Column Count | 3 |
| Memory Usage | 733 bytes |
| Duplicates | 0 (0.0%) |

🗺️ Roadmap

| Version | Focus | Status |
| --- | --- | --- |
| v1.0 | Stable release · cross-platform wheels · CI/CD · PyPI publishing · Google Colab support | ✅ Shipped |
| v1.1 | Production readiness · release hardening · docs/tooling | ✅ Shipped |
| v1.2 | C++ pipeline optimization · speed parity with pandas · hash-based deduplication | 🔨 Active |
| v1.3 | Chunked / streaming processing · Parquet & JSON readers | 📋 Planned |
| v1.4 | Parallel column processing · SIMD string operations | 💭 Exploring |



💬 Community

Join the Arnio Discord Community for quick setup help, contributor onboarding, GSSoC 2026 coordination, feature discussion, and community updates.

Discord is for fast conversation and support. GitHub remains the source of truth for issue assignment, PR reviews, bugs, roadmap decisions, and releases.

Join Arnio Discord




🤝 Contribute

Arnio is a GSSoC 2026 project with a structured contributor backlog across beginner, intermediate, and advanced tracks.

You don't need C++ to contribute

Most new features are pure Python pipeline steps:

# 1. Write a function that takes a DataFrame and returns a DataFrame
def remove_special_chars(df, columns=None):
    cols = columns or df.select_dtypes("object").columns
    for col in cols:
        df[col] = df[col].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)
    return df

# 2. Register it
ar.register_step("remove_special_chars", remove_special_chars)

# 3. Write tests, open a PR. That's it.

If you do know C++

The biggest performance wins are in:

  • drop_duplicates — replacing std::ostringstream row serialization with proper hash-based comparisons
  • strip_whitespace — converting from copy-on-write to in-place mutation
  • Parallel column processing with std::thread across independent columns

Getting started

# macOS / Linux
git clone https://github.com/im-anishraj/arnio.git && cd arnio
make install   # pip install -e ".[dev]" + pre-commit
make test      # pytest with coverage
make lint      # ruff + black

# Windows
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v

PR titles must follow Conventional Commits (feat:, fix:, docs:, chore:). Our release pipeline auto-generates changelogs from these.
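If you want to check a title locally before opening a PR, a regular expression is enough. The sketch below is not the project's CI check: TITLE_PATTERN and is_conventional_title are hypothetical, and accepting an optional scope such as fix(csv): or a breaking-change ! follows the general Conventional Commits format rather than a rule stated in this README.

```python
import re

# Types taken from the list above; scope and "!" are assumptions.
TITLE_PATTERN = re.compile(r"^(feat|fix|docs|chore)(\([a-z0-9_./-]+\))?!?: \S.*$")


def is_conventional_title(title):
    """Return True when the PR title matches the expected prefix format."""
    return TITLE_PATTERN.match(title) is not None


titles = [
    "feat: add clip_numeric step",
    "fix(csv): handle BOM in header",
    "Update README",
    "feat:missing space",
]
results = {title: is_conventional_title(title) for title in titles}
print(results)
```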

For GSSoC contributors, please read GSSOC_GUIDE.md before asking to be assigned. It explains issue claiming, contribution levels, review expectations, and what maintainers look for in a strong PR. If you want a quick onboarding refresher, see the GSSoC FAQ. If you are new to Arnio terms, see the contributor glossary.

📖 Full Contributing Guide ·  GSSoC Guide ·  🐛 Open Issues ·  💬 Discussions ·  Discord

💖 Contributors

Thanks to everyone who contributes to Arnio and helps improve the project.



🚢 Release process

Arnio releases are automated through Release Please and GitHub Actions.

  1. Merge user-facing changes with Conventional Commit PR titles (feat:, fix:, docs:, or chore:) so Release Please can choose the version bump and changelog entries.
  2. Review and merge the Release Please PR on main; this updates release metadata and creates the GitHub release and tag.
  3. Confirm the Build & Publish Wheels workflow succeeds for the release tag. It builds the sdist and wheels, then publishes to PyPI through Trusted Publishing.
  4. Smoke test the published package in a clean environment:
python -m venv /tmp/arnio-smoke
source /tmp/arnio-smoke/bin/activate
python -m pip install -U pip
python -m pip install arnio
printf 'name,revenue\n Ada,10\n' > /tmp/arnio-smoke.csv
python - <<'PY'
import arnio as ar
print(ar.__version__)
print(ar.scan_csv("/tmp/arnio-smoke.csv"))
PY
  5. Verify the GitHub release, PyPI project page, and install command all show the expected version before announcing the release.

If any publish or smoke-test step fails, leave the failed tag and GitHub release in place until maintainers agree on the recovery plan.




📐 Project structure

arnio/
├── cpp/
│   ├── include/arnio/      # C++ headers — types, column, frame, csv_reader, cleaning
│   └── src/                 # C++ implementations (~30 KB of compiled logic)
├── bindings/
│   └── bind_arnio.cpp       # pybind11 module — the Python↔C++ bridge
├── arnio/
│   ├── __init__.py          # Public API surface
│   ├── io.py                # read_csv, scan_csv
│   ├── cleaning.py          # Python wrappers for C++ cleaning functions
│   ├── pipeline.py          # Step registry + pipeline executor
│   ├── convert.py           # to_pandas (zero-copy), from_pandas
│   ├── frame.py             # ArFrame — lightweight C++ Frame wrapper
│   └── exceptions.py        # ArnioError, UnknownStepError, CsvReadError, TypeCastError
├── tests/                   # pytest suite — CSV, cleaning, pipeline, conversions
├── benchmarks/              # Reproducible arnio vs pandas benchmark
├── examples/                # basic_usage.py, auto_clean_tutorial.py, custom_step.py
└── website/                 # Project website — arnio.vercel.app



Arnio



Stop writing cleaning scripts. Declare clean data.




Built with C++ and pybind11 · Licensed under MIT · Maintained by @im-anishraj



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arnio-1.14.0.tar.gz (2.1 MB) · Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

arnio-1.14.0-cp313-cp313-win_amd64.whl (213.4 kB) · CPython 3.13 · Windows x86-64
arnio-1.14.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (261.5 kB) · CPython 3.13 · manylinux: glibc 2.17+ x86-64
arnio-1.14.0-cp313-cp313-macosx_11_0_arm64.whl (201.9 kB) · CPython 3.13 · macOS 11.0+ ARM64
arnio-1.14.0-cp313-cp313-macosx_10_13_x86_64.whl (217.4 kB) · CPython 3.13 · macOS 10.13+ x86-64
arnio-1.14.0-cp312-cp312-win_amd64.whl (213.4 kB) · CPython 3.12 · Windows x86-64
arnio-1.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (261.4 kB) · CPython 3.12 · manylinux: glibc 2.17+ x86-64
arnio-1.14.0-cp312-cp312-macosx_11_0_arm64.whl (201.9 kB) · CPython 3.12 · macOS 11.0+ ARM64
arnio-1.14.0-cp312-cp312-macosx_10_13_x86_64.whl (217.3 kB) · CPython 3.12 · macOS 10.13+ x86-64
arnio-1.14.0-cp311-cp311-win_amd64.whl (211.0 kB) · CPython 3.11 · Windows x86-64
arnio-1.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (260.3 kB) · CPython 3.11 · manylinux: glibc 2.17+ x86-64
arnio-1.14.0-cp311-cp311-macosx_11_0_arm64.whl (201.0 kB) · CPython 3.11 · macOS 11.0+ ARM64
arnio-1.14.0-cp311-cp311-macosx_10_9_x86_64.whl (215.4 kB) · CPython 3.11 · macOS 10.9+ x86-64
arnio-1.14.0-cp310-cp310-win_amd64.whl (210.3 kB) · CPython 3.10 · Windows x86-64
arnio-1.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (258.6 kB) · CPython 3.10 · manylinux: glibc 2.17+ x86-64
arnio-1.14.0-cp310-cp310-macosx_11_0_arm64.whl (200.0 kB) · CPython 3.10 · macOS 11.0+ ARM64
arnio-1.14.0-cp310-cp310-macosx_10_9_x86_64.whl (214.2 kB) · CPython 3.10 · macOS 10.9+ x86-64
arnio-1.14.0-cp39-cp39-win_amd64.whl (217.2 kB) · CPython 3.9 · Windows x86-64
arnio-1.14.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (258.9 kB) · CPython 3.9 · manylinux: glibc 2.17+ x86-64
arnio-1.14.0-cp39-cp39-macosx_11_0_arm64.whl (200.2 kB) · CPython 3.9 · macOS 11.0+ ARM64
arnio-1.14.0-cp39-cp39-macosx_10_9_x86_64.whl (214.3 kB) · CPython 3.9 · macOS 10.9+ x86-64

File details

Details for the file arnio-1.14.0.tar.gz.

File metadata

  • Download URL: arnio-1.14.0.tar.gz
  • Upload date:
  • Size: 2.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.14.0.tar.gz
Algorithm Hash digest
SHA256 2f1e6315f0511ff9eda23cb89f3a66ff7fa54986acbf28abc6c685ee63a7864d
MD5 829bb8d33f29ec4b0a68ed2917135546
BLAKE2b-256 670e0febdec0c71b0814c44b64d75261ec77f9938eb2de6bff1a200a16dd2a44
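
A published digest like the SHA256 value above can be compared against a file you have downloaded before installing it. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `matches_published_digest` are hypothetical, and the local path `arnio-1.14.0.tar.gz` is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_sha256):
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_sha256.strip().lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the sdist was saved.
    sdist = Path("arnio-1.14.0.tar.gz")
    published = "2f1e6315f0511ff9eda23cb89f3a66ff7fa54986acbf28abc6c685ee63a7864d"
    if sdist.exists():
        print("digest matches:", matches_published_digest(sdist, published))
    else:
        print(f"{sdist} not found; download it first")
```

pip can enforce the same kind of check at install time through its hash-checking mode, where a requirements file pins each package with a `--hash=sha256:...` entry and pip is run with `--require-hashes`.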

See more details on using hashes here.

File details

Details for the file arnio-1.14.0-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: arnio-1.14.0-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 213.4 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.14.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 679865b05132f86b94a7ffdf3ab762ebab20b54f91b3b0b5389cfd58dc3d7760
MD5 76cb3848c14e9df81bd345f2a4c6c8fa
BLAKE2b-256 e34d60ae716ad9093436e2144e3c812094193d2a3ab92234ce5cb0741a65eb31

File details

Details for the file arnio-1.14.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e643a7a6b2a40c8fe1b6c00e202f20a2d591c6c8f48f5fe6977e67f828577964
MD5 13f5e7570db708dc5c90f99b0d06305f
BLAKE2b-256 0f272bacb8b6d1b7009fd50e245f84e7b6c934461dcf35c0ebcf5059290fb78b

File details

Details for the file arnio-1.14.0-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e64efb29aad6787f654555f6e6b76cfbaf22dd3fefdfa41d8372129c73196048
MD5 04dc2ab57db4388b2bf33d4b9be5cecf
BLAKE2b-256 96fb1673b147821eec366e69f024b52836e49f55df46c86ac2a12c58d3d3518b

File details

Details for the file arnio-1.14.0-cp313-cp313-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp313-cp313-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 771d02ee24e9ac0c81d106d74338baf81ae3da8c6169c1ce1774cab4d1bfcefa
MD5 9635d6e5a8f41b7dfd8839929c276bf6
BLAKE2b-256 3c4d39051edecdfa1641b751e9e71abdf46fc5f826419281a522931c45922705

File details

Details for the file arnio-1.14.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: arnio-1.14.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 213.4 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.14.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 742316690afa021fb53157b11323957436637e3b73394d21b6b381f27cd454f2
MD5 d13f3c2f90eeaf4b049b482aa3f156c1
BLAKE2b-256 edc2088efa97f97cb81a3b0ec7cb91544d444cc817154e29252a84b4ccd8bd02

File details

Details for the file arnio-1.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5b0322ec27ba360b4382b81c634e577e3477921ae425f395cc15295a52dadadd
MD5 dc53c44a84bf83e2393c1a44cf821877
BLAKE2b-256 c304e88c05776bd8e6b7e7ff3cc8dfffcb4d4bfde02a14f0f8290821f8ce8d64

File details

Details for the file arnio-1.14.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a000352d9c791c1ed544c63ffdeebc2a98a61db0dc4eacb75c7bcf4b43d7d17f
MD5 066d7c109b82e05d4ff1b512920e45a9
BLAKE2b-256 ae3e9329c3a681431dff496db9fe6abcacfdc891ce085142eb2b7050ef0842d7

File details

Details for the file arnio-1.14.0-cp312-cp312-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 ce85aa08467a9cec1d085af256a4bd15aab99f459be6e2ce14c673d6e4aa8682
MD5 a26ec7db5cf4a41cc5163f1f5c944994
BLAKE2b-256 88c59f47e8594a107d488eadea34bdb10b616ffe4e16a85acf28e7d7c979810f

File details

Details for the file arnio-1.14.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: arnio-1.14.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 211.0 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.14.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 71f9dfa0d08c6a45edb7b8d98c10c4ca0e068c9eecd15bbccbc51ede2e71c0b8
MD5 bd918cf3014c4dc2adda24a165e36bfc
BLAKE2b-256 25635dd547d38ee0e0ef841bc7ab37c106f10b7d526d27a9fac58ee3add59642

File details

Details for the file arnio-1.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a8d9baf943ffbd061d6eda7e72799fabd177e5371b7ccc992357591a2ca8ffdc
MD5 fe4328902d0d45a9cf07d17c3d10a335
BLAKE2b-256 ac7e1cbd140e9918ea45ceecf646b92c88d05cafbd95027ec6e3dc344858a2db

File details

Details for the file arnio-1.14.0-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ee7c9adbb97a6993a4c77dde5208032deada6596a942e95f77d24e0822190e3f
MD5 34b4d3fdc10468664605b58daeb225e4
BLAKE2b-256 ee9a5cc8333ac7d6043a7e1859eb3da8f95d720855478e424b0f158aecc9c4fa

File details

Details for the file arnio-1.14.0-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 aa38a914fb7691f1d2cb13bf29e7e5e02b448ffcfb17df002d8b3174832812a1
MD5 0f8d09239d1c15a39b052cdd214e0e36
BLAKE2b-256 6210c4cb3f43303be2af85a99e01d91890eb1e24e4eaac62f304abafbcf7319e

File details

Details for the file arnio-1.14.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: arnio-1.14.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 210.3 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.14.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 cd72678ff7dbfff170ae3162c0ca020e1bc26ffbb2c06c975387e665a548c487
MD5 63d04aab2a32249905a6df2a876d0c64
BLAKE2b-256 960f8e01c1f9ece33820dbf04fe53c1bdf954e31fb8d5085d192218e82f911b3

File details

Details for the file arnio-1.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c99a5c48d3cdaa5b17ad7811e7f946135e45e3e301a4b5ec1a4113125bacefaf
MD5 4f65be9605ce45a03ef8c5c7b0eeeaa8
BLAKE2b-256 15a6bc8e215823d7e400c0d07cbf93d1900f264a66a4ce3eefa171e23c437d25

File details

Details for the file arnio-1.14.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e1962547dc7ece3f74a2633de36831bcfb6a3f0992793fa243847a9a647a3189
MD5 784caed6e696b9539050278118be037d
BLAKE2b-256 eeb8f29153573a0c09248b4245d133678f10a4ca10a7a188981d54862598a4b3

File details

Details for the file arnio-1.14.0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 8e8d825846db54f4feb65b5df5df3bc3c037ed5f4f3a97fd398920c2631009cd
MD5 7ec7df765aa7a38932f7e8fef59c5a29
BLAKE2b-256 6f43b1790b1e8846ba277127af306099eb0a4951bbf813e3012c6b18313fc16e

File details

Details for the file arnio-1.14.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: arnio-1.14.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 217.2 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arnio-1.14.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 9d6e6a974d68f11806be51ff29218ee585ad5c2ad7fbc23343c5810a1a5c47f0
MD5 228c69f25d62d62ce066c28841e09429
BLAKE2b-256 bbf93e4cb32928bc6c8c0370c121a99ea0473a2b62d1c3eba9755c621cc6ec81

File details

Details for the file arnio-1.14.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 75bd64b29c0954855175ed7f04793f12191b794278677e9fe7797f3be0ec8702
MD5 b3b9faf65bf5751a705865bf3cef8705
BLAKE2b-256 eb5247f2221d38b694b77140c51d85d84191f1ce2ecddae5920b82c8e3856c15

File details

Details for the file arnio-1.14.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 c7eaad45bf6a3dec2342a1bc19c6e6667e68491d8dfb770c25079035bb149892
MD5 bccdf17375dd7881f86da73b62bd5f84
BLAKE2b-256 fa9297602415feb2cfc450331bca8c517ad5927db47b975b1d30b1075aa3ed6e

File details

Details for the file arnio-1.14.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for arnio-1.14.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 abecd0a4ba3aebfb6d1e3d9a8cf1481b9342b6e49c4759187de603939a0816f3
MD5 ef32a7de7eba60877ceb56b7becd9d25
BLAKE2b-256 8f933b7e689b65c0bc018e27c4dad888691c96d981a5947a5d768586d5d0950c
