C++ accelerated CSV preprocessing and data cleaning for pandas

⚡ arnio

Arnio is an open-source, C++-accelerated data preprocessing library
for Python. It is built for speed and memory efficiency, and is being actively optimized during GSSoC 2026.

Pandas is incredible for analysis, but it is notoriously slow and memory-hungry for ingesting and cleaning raw CSVs.
Arnio exists to do exactly one thing: intercept your messy CSVs, clean them natively in C++, and hand you a pristine pandas DataFrame. The target is to do so in half the time; as of v0.1.3, the benchmarks below show that arnio has not yet reached speed parity with pandas.

🧨 The Problem

Every data project starts the same way. You load a CSV. It exhausts your RAM. You load it again in chunks. You find random nulls, inconsistent capitalization, and trailing whitespace. You write a 15-line script chaining .apply(), .dropna(), and .str.strip(). You copy-paste this script into your next 5 Jupyter notebooks.

It's slow. It's unreadable. It's error-prone.
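
For illustration, the kind of ad-hoc script described above might look like the following in pandas. The column names and sample rows are invented for this sketch; the pandas calls themselves are standard.

```python
import io

import pandas as pd

# Invented sample data: stray whitespace, mixed case, nulls, and a duplicate row.
RAW = """name,city,revenue
  Alice ,NEW YORK,100.5
Bob,  boston,
  Alice ,NEW YORK,100.5
,Chicago,42
"""

df = pd.read_csv(io.StringIO(RAW))

# Trim and lower-case the text columns; missing values stay missing.
for col in ["name", "city"]:
    df[col] = df[col].str.strip().str.lower()

# Patch missing revenue, then drop rows that still contain nulls.
df["revenue"] = df["revenue"].fillna(0.0)
df = df.dropna()

# Remove exact duplicate rows and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```

Each step is a separate statement that has to be kept in the right order by hand, which is the maintenance burden the section describes.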

✨ The Solution: Arnio

Arnio replaces your messy ingestion script with a high-performance, declarative pipeline powered by pybind11 and C++.

❌ The Old Way (Pandas)	⚡ The Arnio Way
Memory Spikes: Python loads the entire raw string file before casting.	C++ Native: Parses and infers types directly into columnar memory.
Spaghetti Code: `.apply()` lambda functions scattered across cells.	Declarative: A strict, readable list of cleaning steps.
Slow Execution: Python loops over strings to strip whitespace.	Compiled: Cleaning primitives run as native C++ code (speed parity with pandas is targeted for v0.2.0).

🚀 Getting Started

If you have Python 3.9+, installation takes a single command.

pip install arnio
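
To confirm the install from Python, the standard library's `importlib.metadata` can report the installed version of a distribution. The helper name `installed_version` is only for this sketch.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if arnio is installed, otherwise None.
print(installed_version("arnio"))
```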

The 3-Step Workflow

Drop Arnio into the very top of your Jupyter Notebook or Python script.

import arnio as ar

# 1. Load the raw file using the C++ core (no Python overhead)
frame = ar.read_csv("messy_sales_data.csv")

# 2. Define a strict, readable cleaning pipeline
clean_frame = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# 3. Export to a clean pandas DataFrame and start your analysis!
df = ar.to_pandas(clean_frame)

# -> Now, use `df` exactly like you always have.
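
To make the shape of the step list concrete, here is a plain-Python sketch of how a list of `(name, options)` tuples can be dispatched to cleaning functions. It is not arnio's implementation, which runs in C++. `run_pipeline`, `STEPS`, and the rows-as-dicts representation are hypothetical stand-ins, and the behavior of each step is assumed from its name.

```python
# Hypothetical pure-Python model of a declarative cleaning pipeline.
# Rows are dicts and None marks a null. Not arnio's actual implementation.

def strip_whitespace(rows):
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]

def normalize_case(rows, case_type="lower"):
    convert = str.lower if case_type == "lower" else str.upper
    return [{k: convert(v) if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]

def fill_nulls(rows, value, subset=None):
    # Only columns named in `subset` are filled; all columns if subset is None.
    return [{k: value if v is None and (subset is None or k in subset) else v
             for k, v in r.items()} for r in rows]

def drop_nulls(rows):
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_duplicates(rows):
    # Keep the first occurrence of each exact row, preserving order.
    seen, unique = set(), []
    for r in rows:
        key = tuple(r.items())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Registry mapping step names to functions.
STEPS = {f.__name__: f for f in (
    strip_whitespace, normalize_case, fill_nulls, drop_nulls, drop_duplicates)}

def run_pipeline(rows, steps):
    # Each step is ("name",) or ("name", {options}).
    for name, *rest in steps:
        options = rest[0] if rest else {}
        rows = STEPS[name](rows, **options)
    return rows

rows = [
    {"name": "  Alice ", "revenue": 100.5},
    {"name": "BOB", "revenue": None},
    {"name": "alice", "revenue": 100.5},
    {"name": None, "revenue": 42.0},
]

cleaned = run_pipeline(rows, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])
print(cleaned)
```

Because the steps are plain data, the order of operations is visible at a glance: here the two "alice" rows only become duplicates after stripping and lower-casing, so `drop_duplicates` has to come last.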

🏎️ Benchmarks

Tested on Ubuntu, Python 3.12, with a 1M-row CSV.
Run `make benchmark` to reproduce on your machine.

Metric	pandas	arnio v0.1.3
Execution Time	4.73s	5.75s
Peak RAM	211MB	212MB

Current state: arnio's C++ CSV reader matches pandas on peak memory (212MB vs. 211MB)
but is about 22% slower end to end (5.75s vs. 4.73s).
Speed parity is the active engineering goal for v0.2.0. Specifically,
drop_duplicates and strip_whitespace are not yet optimized in C++ and are
the primary contributors to the gap.
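
The `make benchmark` target itself is not reproduced here. As a rough sketch of how the two metrics in the table can be collected for any Python callable, the hypothetical helper `measure` below uses only the standard library. One caveat: `tracemalloc` tracks allocations made through Python's allocator, so memory allocated inside a C++ extension may not be counted. Process-level peak RSS would be the better metric for that case.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once; return (result, elapsed_seconds, peak_python_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def build_rows(n):
    # Stand-in workload: create and clean n small strings.
    return [f"  ROW-{i} ".strip().lower() for i in range(n)]


rows, seconds, peak = measure(build_rows, 100_000)
print(f"{len(rows)} rows in {seconds:.3f}s, peak {peak / 1e6:.1f} MB")
```

Tracing slows execution, so for real comparisons it is safer to time and to measure memory in separate runs.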

Help close the gap →

🔍 Want to peek at a massive file without loading it?

Arnio lets you instantly scan a massive CSV to infer its schema without loading the data into memory.

import arnio as ar

schema = ar.scan_csv("100GB_file.csv")
print(schema) 
# {'id': 'INT64', 'name': 'STRING', 'is_active': 'BOOL'}
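
arnio's inference rules are not documented in this section. As a minimal sketch of the general idea, the hypothetical `scan_schema` below reads only a bounded sample of rows and guesses one type per column. The type names mirror the output above; `FLOAT64`, the sample size, and the precedence of the checks are assumptions.

```python
import csv
import io
from itertools import islice


def _parses(cast, text):
    """Return True if cast(text) succeeds."""
    try:
        cast(text)
        return True
    except ValueError:
        return False


def infer_type(values):
    # Empty cells are treated as nulls and ignored for inference.
    cells = [v.strip() for v in values if v.strip()]
    if not cells:
        return "STRING"
    if all(c.lower() in ("true", "false") for c in cells):
        return "BOOL"
    if all(_parses(int, c) for c in cells):
        return "INT64"
    if all(_parses(float, c) for c in cells):
        return "FLOAT64"
    return "STRING"


def scan_schema(stream, sample_rows=1000):
    """Infer a {column: type} mapping from the first sample_rows data rows."""
    reader = csv.reader(stream)
    header = next(reader)
    columns = [[] for _ in header]
    for row in islice(reader, sample_rows):
        for i, cell in enumerate(row[:len(header)]):
            columns[i].append(cell)
    return {name: infer_type(col) for name, col in zip(header, columns)}


sample = io.StringIO("id,name,is_active,score\n1,Ann,true,9.5\n2,Bob,false,7\n")
schema = scan_schema(sample)
print(schema)
```

With a real file, the same function can be called as `scan_schema(open(path, newline=""))`. Only the sampled rows are ever held in memory, which is why the cost does not grow with file size; the trade-off is that a value outside the sample can contradict the inferred type.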

🛠️ What's Inside?

Arnio ships with a growing library of C++ cleaning primitives:

drop_nulls: Rip out bad rows instantly.
fill_nulls: Patch holes with scalar values.
drop_duplicates: Deduplicate rows based on exact matches.
strip_whitespace: Trim invisible spaces from string columns.
normalize_case: Force upper or lower case instantly.
rename_columns & cast_types: Shape your data exactly how you need it.
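
The arguments that rename_columns and cast_types take are not shown here. For comparison, the same reshaping in plain pandas uses `DataFrame.rename` and `DataFrame.astype`, which can serve as a reference when checking arnio's output. The column names and values are invented for this sketch.

```python
import pandas as pd

# Invented sample: awkward column names and numbers stored as text.
df = pd.DataFrame({"Cust ID": ["1", "2", "3"], "Rev": ["10.5", "0", "7.25"]})

# Rename columns with an old-name -> new-name mapping.
df = df.rename(columns={"Cust ID": "customer_id", "Rev": "revenue"})

# Cast each column to an explicit dtype.
df = df.astype({"customer_id": "int64", "revenue": "float64"})

print(df.dtypes)
```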

🤝 Contributing

Arnio is a GSSoC 2026 project. We welcome contributors of all levels.

No C++ required: Add pipeline steps in pure Python
C++ contributors: Help optimize drop_duplicates and strip_whitespace, the current performance bottlenecks
Docs & examples: Always needed

Read the Contribution Guide → | Browse open issues →

🗺️ Roadmap

Version	Focus	Status
v0.1.3	Bug fixes, cross-platform wheels, contributor infrastructure	✅ Released
v0.2.0	C++ pipeline optimization, speed parity with pandas	🔨 Active
v0.3.0	Chunked processing, Parquet/JSON support	📋 Planned

Stop fighting your data. Let Arnio clean it.
