Skip to main content

High-performance CSV parser with SIMD optimizations

Project description

CISV Python Bindings (nanobind)

High-performance Python bindings for the CISV CSV parser using nanobind.

Performance

These bindings are 10-100x faster than the ctypes-based bindings because they:

  1. Use the batch API: All data is parsed in C and returned at once, eliminating millions of per-field callbacks
  2. Use nanobind: Much lower overhead than ctypes or pybind11
  3. Release the GIL: Parallel parsing runs without holding the Python GIL
File Size ctypes nanobind Speedup
142MB (1M rows × 10 cols) ~20s <0.8s 25x+

Installation

From PyPI (recommended)

pip install cisv

From source

cd cisv
pip install .

Development install

cd cisv
pip install -e .

Usage

import cisv

# Parse a file
rows = cisv.parse_file('data.csv')

# Parse with options
rows = cisv.parse_file(
    'data.csv',
    delimiter=';',
    quote="'",
    trim=True,
    skip_empty_lines=True
)

# Parse large files in parallel (faster on multi-core systems)
rows = cisv.parse_file('large.csv', parallel=True)

# Parse a string
rows = cisv.parse_string("a,b,c\n1,2,3")

# Count rows quickly (SIMD-accelerated)
count = cisv.count_rows('data.csv')

# Row-by-row iteration (memory efficient, supports early exit)
with cisv.CisvIterator('large.csv') as reader:
    for row in reader:
        print(row)  # List[str]
        if row[0] == 'stop':
            break  # Early exit - no wasted work

# Or use the convenience function
for row in cisv.open_iterator('data.csv', delimiter=',', trim=True):
    process(row)

API Reference

parse_file(path, delimiter=',', quote='"', *, trim=False, skip_empty_lines=False, parallel=False, num_threads=0)

Parse a CSV file and return all rows.

Parameters:

  • path: Path to the CSV file
  • delimiter: Field delimiter character (default: ',')
  • quote: Quote character (default: '"')
  • trim: Whether to trim whitespace from fields
  • skip_empty_lines: Whether to skip empty lines
  • parallel: Use multi-threaded parsing (faster for large files)
  • num_threads: Number of threads for parallel parsing (0 = auto-detect)

Returns: List of rows, where each row is a list of field values.

parse_string(content, delimiter=',', quote='"', *, trim=False, skip_empty_lines=False)

Parse a CSV string and return all rows.

Parameters:

  • content: CSV content as a string
  • delimiter: Field delimiter character (default: ',')
  • quote: Quote character (default: '"')
  • trim: Whether to trim whitespace from fields
  • skip_empty_lines: Whether to skip empty lines

Returns: List of rows, where each row is a list of field values.

count_rows(path)

Count the number of rows in a CSV file without full parsing.

This is very fast as it only scans for newlines using SIMD instructions.

Parameters:

  • path: Path to the CSV file

Returns: Number of rows in the file.

CisvIterator(path, delimiter=',', quote='"', *, trim=False, skip_empty_lines=False)

Row-by-row iterator for streaming CSV parsing with minimal memory footprint.

Provides fgetcsv-style iteration that supports early exit - breaking out of iteration stops parsing immediately with no wasted work.

Parameters:

  • path: Path to the CSV file
  • delimiter: Field delimiter character (default: ',')
  • quote: Quote character (default: '"')
  • trim: Whether to trim whitespace from fields
  • skip_empty_lines: Whether to skip empty lines

Methods:

  • next(): Get the next row as List[str], or None if at end of file
  • close(): Close the iterator and release resources
  • closed: Property indicating whether the iterator has been closed

Protocols:

  • Iterator protocol: Use in for loops with for row in iterator
  • Context manager: Use with with statement for automatic cleanup

Example:

# Context manager (recommended)
with cisv.CisvIterator('data.csv') as reader:
    for row in reader:
        if row[0] == 'target':
            print(f"Found: {row}")
            break  # Early exit

# Manual iteration
reader = cisv.CisvIterator('data.csv')
try:
    while True:
        row = reader.next()
        if row is None:
            break
        process(row)
finally:
    reader.close()

open_iterator(path, delimiter=',', quote='"', *, trim=False, skip_empty_lines=False)

Convenience function that returns a CisvIterator. Same parameters as CisvIterator.

Example:

for row in cisv.open_iterator('data.csv'):
    print(row)

Running Tests

cd cisv
pip install -e ".[test]"
pytest

Benchmarking

pip install -e ".[benchmark]"
python -c "
import cisv
import time

# Create test file
with open('/tmp/test.csv', 'w') as f:
    f.write('col1,col2,col3\n')
    for i in range(100000):
        f.write(f'value{i}_1,value{i}_2,value{i}_3\n')

# Benchmark
start = time.time()
rows = cisv.parse_file('/tmp/test.csv')
print(f'Parsed {len(rows)} rows in {time.time()-start:.3f}s')
"

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cisv-0.6.0.tar.gz (92.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cisv-0.6.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (137.6 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.27+ x86-64manylinux: glibc 2.28+ x86-64

File details

Details for the file cisv-0.6.0.tar.gz.

File metadata

  • Download URL: cisv-0.6.0.tar.gz
  • Upload date:
  • Size: 92.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cisv-0.6.0.tar.gz
Algorithm Hash digest
SHA256 e783864fc10ce118b3512719fa3a2ddf180f408645a8708b8640df5c910b686d
MD5 8cffb1bc2649bebb68eb904ef0a55720
BLAKE2b-256 58f1984abe51354ea1f6e86696f6b1cd5e138831090163d5615a375f5d4a30db

See more details on using hashes here.

File details

Details for the file cisv-0.6.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for cisv-0.6.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 426b41c3be8001dc5fda8d241d96a4a2dd68539cb2fd707995edbae2bce6e62f
MD5 a3ed61dc58529b0e3f437cfae11ede90
BLAKE2b-256 da07b232cbceecf4478c4d40117381c7bef002029a29fc411bc8ce2324c32709

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page