
C++ accelerated CSV preprocessing and data cleaning for pandas



⚡ arnio

Arnio is an open-source C++ accelerated data preprocessing library
for Python. Built for speed and memory efficiency — and actively being optimized during GSSoC 2026.




Pandas is incredible for analysis, but it can be slow and memory-hungry when ingesting and cleaning raw CSVs.
Arnio exists to do exactly one thing: intercept your messy CSVs, clean them natively in C++, and hand you a pristine Pandas DataFrame. The goal is to do this faster than pandas alone; as the benchmarks below show, v0.1.3 has not reached that goal yet.


🧨 The Problem

Every data project starts the same way. You load a CSV. It crashes your RAM. You load it again in chunks. You find random nulls, weird capitalization, and trailing whitespace. You write a 15-line script chaining .apply(), .dropna(), and .str.strip(). You copy-paste this script into your next 5 Jupyter notebooks.

It's slow. It's unreadable. It's error-prone.
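
For reference, the kind of hand-written script described above might look like the sketch below. It uses only pandas; the column names (`region`, `revenue`), the `text_columns` parameter, and the sample data are made up for illustration.

```python
import pandas as pd


def clean_with_pandas(df: pd.DataFrame, text_columns: list[str]) -> pd.DataFrame:
    """Hand-rolled cleanup of a raw frame (pandas only, illustrative)."""
    out = df.copy()
    # Trim and lower-case the columns the caller says contain text.
    for col in text_columns:
        out[col] = out[col].str.strip().str.lower()
    # Patch missing revenue values, then drop rows that still contain nulls.
    out["revenue"] = out["revenue"].fillna(0.0)
    out = out.dropna()
    # Remove exact duplicate rows and renumber the index.
    return out.drop_duplicates().reset_index(drop=True)


# Hypothetical sample data: messy casing, padding, and missing values.
raw = pd.DataFrame({
    "region": ["  North ", "north", None, "South  "],
    "revenue": [10.0, 10.0, 5.0, None],
})
cleaned = clean_with_pandas(raw, text_columns=["region"])
print(cleaned)
```

Every notebook that needs this logic ends up with its own slightly different copy, which is the maintenance problem the rest of this README is about.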

✨ The Solution: Arnio

Arnio replaces your messy ingestion script with a high-performance, declarative pipeline powered by pybind11 and C++.

| ❌ The Old Way (Pandas) | ⚡ The Arnio Way |
| --- | --- |
| Memory Spikes: Python loads the entire raw string file before casting. | C++ Native: Parses and infers types directly into columnar memory. |
| Spaghetti Code: .apply() lambda functions scattered across cells. | Declarative: A strict, readable list of cleaning steps. |
| Slow Execution: Python loops over strings to strip whitespaces. | Blazing Fast: Cleaning primitives run at near metal speeds. |

🚀 Getting Started

If you have Python 3.9+, you are 5 seconds away from faster data pipelines.

pip install arnio

The 3-Step Workflow

Drop Arnio into the very top of your Jupyter Notebook or Python script.

import arnio as ar

# 1. Load the raw file using the C++ core (no Python overhead)
frame = ar.read_csv("messy_sales_data.csv")

# 2. Define a strict, readable cleaning pipeline
clean_frame = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# 3. Export to a clean pandas DataFrame and start your analysis!
df = ar.to_pandas(clean_frame)

# -> Now, use `df` exactly like you always have.
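
To make the step-list format concrete, here is a rough pandas-only interpreter for the same kind of list. It is a sketch of the declarative idea, not Arnio's implementation: `run_steps`, `STEPS`, and the helper functions are hypothetical names, and the real C++ primitives may handle edge cases differently.

```python
import pandas as pd


def _text_columns(df):
    # Columns that have at least one value and whose non-null values are all str.
    return [c for c in df.columns
            if df[c].notna().any()
            and df[c].dropna().map(lambda v: isinstance(v, str)).all()]


def strip_whitespace(df):
    out = df.copy()
    for col in _text_columns(out):
        out[col] = out[col].str.strip()
    return out


def normalize_case(df, case_type="lower"):
    if case_type not in ("lower", "upper"):
        raise ValueError(f"unsupported case_type: {case_type}")
    out = df.copy()
    for col in _text_columns(out):
        text = out[col].str
        out[col] = text.upper() if case_type == "upper" else text.lower()
    return out


def fill_nulls(df, value, subset=None):
    out = df.copy()
    cols = list(out.columns) if subset is None else subset
    out[cols] = out[cols].fillna(value)
    return out


# Step name -> plain function; each takes a frame and returns a new frame.
STEPS = {
    "strip_whitespace": strip_whitespace,
    "normalize_case": normalize_case,
    "fill_nulls": fill_nulls,
    "drop_nulls": lambda df: df.dropna(),
    "drop_duplicates": lambda df: df.drop_duplicates(),
}


def run_steps(df, steps):
    """Apply (name,) or (name, kwargs) tuples in order."""
    for step in steps:
        name = step[0]
        kwargs = step[1] if len(step) > 1 else {}
        if name not in STEPS:
            raise ValueError(f"unknown step: {name}")
        df = STEPS[name](df, **kwargs)
    return df.reset_index(drop=True)


# Hypothetical sample data, cleaned with the same step list as the workflow above.
raw = pd.DataFrame({
    "customer": [" Alice ", "ALICE", "Bob", None],
    "revenue": [100.0, 100.0, None, 5.0],
})
result = run_steps(raw, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])
print(result)
```

Because the steps are plain data, the order of operations is visible at a glance. Here the padded and upper-cased "Alice" rows only collapse into one because stripping and case normalization run before `drop_duplicates`.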

🏎️ Benchmarks

Tested on Ubuntu, Python 3.12, 1M row CSV.
Run make benchmark to reproduce on your machine.

| Metric | pandas | arnio v0.1.3 |
| --- | --- | --- |
| Execution Time | 4.73 s | 5.75 s |
| Peak RAM | 211 MB | 212 MB |

Current state: arnio's C++ CSV reader matches pandas on memory (212 MB vs. 211 MB),
but arnio is about 22% slower in this benchmark (5.75 s vs. 4.73 s).
Speed parity is the active engineering goal for v0.2.0. Specifically,
drop_duplicates and strip_whitespace are unoptimized C++ and are
the primary contributors to the gap.
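
If you want a quick number without the Makefile, a small timing helper along these lines can serve as a starting point. This is a sketch, not the project's `make benchmark` script: `measure` is a hypothetical helper, and `tracemalloc` only sees allocations made through Python's allocators, so memory allocated directly inside a C++ extension would need a process-level measurement (for example, resident set size) instead.

```python
import io
import time
import tracemalloc

import pandas as pd


def measure(fn, *args, **kwargs):
    """Call fn once and return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Hypothetical workload: parse a small in-memory CSV with pandas.
csv_text = "id,name\n" + "\n".join(f"{i}, user{i} " for i in range(10_000))
df, seconds, peak = measure(pd.read_csv, io.StringIO(csv_text))
print(f"{len(df)} rows in {seconds:.3f}s, peak {peak / 1e6:.1f} MB (Python-side)")
```

A single run is noisy; repeating the call several times and reporting the median gives a fairer comparison between two implementations.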

Help close the gap →

🔍 Want to peek at a massive file without loading it?

Arnio lets you instantly scan a massive CSV to infer its schema without loading the data into memory.

import arnio as ar

schema = ar.scan_csv("100GB_file.csv")
print(schema) 
# {'id': 'INT64', 'name': 'STRING', 'is_active': 'BOOL'}

🛠️ What's Inside?

Arnio ships with a growing library of C++ cleaning primitives (optimization is ongoing; see the benchmarks above):

  • drop_nulls: Rip out bad rows instantly.
  • fill_nulls: Patch holes with scalar values.
  • drop_duplicates: Deduplicate rows based on exact matches.
  • strip_whitespace: Trim invisible spaces from string columns.
  • normalize_case: Force upper or lower case instantly.
  • rename_columns & cast_types: Shape your data exactly how you need it.

🤝 Contributing

Arnio is a GSSoC 2026 project. We welcome contributors of all levels.

  • No C++ required: Add pipeline steps in pure Python
  • C++ contributors: Help optimize drop_duplicates and strip_whitespace
    — these are the current performance bottleneck
  • Docs & examples: Always needed

Read the Contribution Guide → | Browse open issues →


🗺️ Roadmap

| Version | Focus | Status |
| --- | --- | --- |
| v0.1.3 | Bug fixes, cross-platform wheels, contributor infrastructure | ✅ Released |
| v0.2.0 | C++ pipeline optimization, speed parity with pandas | 🔨 Active |
| v0.3.0 | Chunked processing, Parquet/JSON support | 📋 Planned |

Stop fighting your data. Let Arnio clean it.
