Iterable Data
Iterable Data is a Python library for reading and writing data files row by row through a consistent, iterator-based interface. It provides a unified API for many data formats (CSV, JSON, Parquet, XML, etc.), similar to csv.DictReader but covering many more formats.
The library simplifies data processing and conversion between formats while preserving complex nested data structures, unlike pandas DataFrames, which require flattening.
Features
- Unified API: Single interface for reading/writing multiple data formats
- Automatic Format Detection: Detects file type and compression from filename
- Support for Compression: Works seamlessly with compressed files
- Preserves Nested Data: Handles complex nested structures as Python dictionaries
- DuckDB Integration: Optional DuckDB engine for high-performance queries
- Pipeline Processing: Built-in pipeline support for data transformation
- Encoding Detection: Automatic encoding and delimiter detection for text files
- Bulk Operations: Efficient batch reading and writing
Supported File Types
- BSON - Binary JSON format
- JSON - Standard JSON files
- JSONL/NDJSON - JSON Lines format (one JSON object per line)
- XML - XML files with configurable tag parsing
- CSV/TSV - Comma and tab-separated values
- XLS/XLSX - Microsoft Excel files
- Parquet - Apache Parquet columnar format
- ORC - Optimized Row Columnar format
- Avro - Apache Avro binary format
- Pickle - Python pickle format
Supported Compression Codecs
- GZip (.gz)
- BZip2 (.bz2)
- LZMA (.xz, .lzma)
- LZ4 (.lz4)
- ZIP (.zip)
- Brotli (.br)
- ZStandard (.zst, .zstd)
Requirements
Python 3.10+
Installation
pip install iterabledata
Or install from source:
git clone https://github.com/apicrafter/pyiterable.git
cd pyiterable
pip install .
Quick Start
Basic Reading
from iterable.helpers.detect import open_iterable
# Automatically detects format and compression
source = open_iterable('data.csv.gz')
for row in source:
    print(row)
    # Process your data here
source.close()
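If the loop body raises an exception, the explicit source.close() call is never reached. One way to guard against that, sketched here with only the standard library, is contextlib.closing, which calls close() on any object when the block exits. The DemoSource class is a hypothetical stand-in so the snippet runs without a data file; whether the library's iterables can be used in a with statement directly is not stated in this README, so the sketch does not assume it.

```python
from contextlib import closing


class DemoSource:
    """Hypothetical stand-in for an object returned by open_iterable()."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.closed = False

    def __iter__(self):
        return iter(self._rows)

    def close(self):
        self.closed = True


def count_rows(source):
    """Iterate over a source and always close it, even on errors."""
    n = 0
    with closing(source) as rows:
        for _ in rows:
            n += 1
    return n


demo = DemoSource([{"id": 1}, {"id": 2}])
total_rows = count_rows(demo)
print(total_rows, demo.closed)
```

With the real library, closing(open_iterable('data.csv.gz')) should work the same way, since the returned object exposes close() as shown in the examples.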
Writing Data
from iterable.helpers.detect import open_iterable
# Write compressed JSONL file
dest = open_iterable('output.jsonl.zst', mode='w')
for item in my_data:
    dest.write(item)
dest.close()
Usage Examples
Reading Compressed CSV Files
from iterable.helpers.detect import open_iterable
# Read compressed CSV file (supports .gz, .bz2, .xz, .zst, .lz4, .br)
source = open_iterable('data.csv.xz')
n = 0
for row in source:
    n += 1
    # Process row data
    if n % 1000 == 0:
        print(f'Processed {n} rows')
source.close()
Reading Different Formats
from iterable.helpers.detect import open_iterable
# Read JSONL file
jsonl_file = open_iterable('data.jsonl')
for row in jsonl_file:
    print(row)
jsonl_file.close()
# Read Parquet file
parquet_file = open_iterable('data.parquet')
for row in parquet_file:
    print(row)
parquet_file.close()
# Read XML file (specify tag name)
xml_file = open_iterable('data.xml', iterableargs={'tagname': 'item'})
for row in xml_file:
    print(row)
xml_file.close()
# Read Excel file
xlsx_file = open_iterable('data.xlsx')
for row in xlsx_file:
    print(row)
xlsx_file.close()
Format Detection and Encoding
from iterable.helpers.detect import open_iterable, detect_file_type
from iterable.helpers.utils import detect_encoding, detect_delimiter
# Detect file type and compression
result = detect_file_type('data.csv.gz')
print(f"Type: {result['datatype']}, Codec: {result['codec']}")
# Detect encoding for CSV files
encoding_info = detect_encoding('data.csv')
print(f"Encoding: {encoding_info['encoding']}, Confidence: {encoding_info['confidence']}")
# Detect delimiter for CSV files
delimiter = detect_delimiter('data.csv', encoding=encoding_info['encoding'])
# Open with detected settings
source = open_iterable('data.csv', iterableargs={
    'encoding': encoding_info['encoding'],
    'delimiter': delimiter
})
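For comparison, the standard library can guess a delimiter on its own. The sketch below runs csv.Sniffer on an in-memory sample. It does not show how detect_delimiter is implemented (this README does not say), and guess_delimiter is a hypothetical helper name.

```python
import csv


def guess_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter of a CSV text sample with csv.Sniffer."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


sample = "name;age;city\nAnn;30;Oslo\nBob;25;Rome\n"
print(repr(guess_delimiter(sample)))
```

Restricting the candidates keeps the sniffer from picking a letter or digit that happens to occur equally often on every line.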
Format Conversion
from iterable.helpers.detect import open_iterable
from iterable.convert.core import convert
# Simple format conversion
convert('input.jsonl.gz', 'output.parquet')
# Convert with options
convert(
    'input.csv.xz',
    'output.jsonl.zst',
    iterableargs={'delimiter': ';', 'encoding': 'utf-8'},
    batch_size=10000
)
# Convert and flatten nested structures
convert(
    'input.jsonl',
    'output.csv',
    is_flatten=True,
    batch_size=50000
)
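To make the flattening step concrete: a CSV row has no place for nested dictionaries, so each nested key has to become its own column. The sketch below shows one common convention, dot-joined keys. It only illustrates the idea; the key naming that is_flatten=True actually produces is not documented here, and flatten_record is a hypothetical helper.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key path along.
            flat.update(flatten_record(value, full_key, sep))
        else:
            # Scalars and lists are kept as they are.
            flat[full_key] = value
    return flat


nested = {"id": 1, "user": {"name": "Ann", "address": {"city": "Oslo"}}}
print(flatten_record(nested))
# {'id': 1, 'user.name': 'Ann', 'user.address.city': 'Oslo'}
```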
Using Pipeline for Data Processing
from iterable.helpers.detect import open_iterable
from iterable.pipeline.core import pipeline
source = open_iterable('input.parquet')
destination = open_iterable('output.jsonl.xz', mode='w')
def transform_record(record, state):
    """Transform each record"""
    # Add processing logic
    out = {}
    for key in ['name', 'email', 'age']:
        if key in record:
            out[key] = record[key]
    return out
def progress_callback(stats, state):
    """Called every trigger_on records"""
    print(f"Processed {stats['rec_count']} records, "
          f"Duration: {stats.get('duration', 0):.2f}s")
def final_callback(stats, state):
    """Called when processing completes"""
    print(f"Total records: {stats['rec_count']}")
    print(f"Total time: {stats['duration']:.2f}s")
pipeline(
    source=source,
    destination=destination,
    process_func=transform_record,
    trigger_func=progress_callback,
    trigger_on=1000,
    final_func=final_callback,
    start_state={}
)
source.close()
destination.close()
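The roles of the three callbacks can be pictured with a plain loop. This is a simplified sketch of the control flow under stated assumptions, not the library's implementation: run_pipeline is a hypothetical name, the destination is a plain list, and only the rec_count and duration keys used by the example callbacks are modeled.

```python
import time


def run_pipeline(source, destination, process_func, trigger_func=None,
                 trigger_on=1000, final_func=None, start_state=None):
    """Apply process_func to each record and report progress via callbacks."""
    state = {} if start_state is None else start_state
    stats = {"rec_count": 0, "duration": 0.0}
    started = time.monotonic()
    for record in source:
        destination.append(process_func(record, state))
        stats["rec_count"] += 1
        stats["duration"] = time.monotonic() - started
        # Fire the progress callback every trigger_on records.
        if trigger_func and stats["rec_count"] % trigger_on == 0:
            trigger_func(stats, state)
    stats["duration"] = time.monotonic() - started
    if final_func:
        final_func(stats, state)
    return stats


def keep_name(record, state):
    """Keep only the 'name' field of each record."""
    return {"name": record["name"]}


progress = []
rows = [{"name": "Ann", "age": 30}, {"name": "Bob", "age": 25}]
out = []
stats = run_pipeline(rows, out, keep_name,
                     trigger_func=lambda s, st: progress.append(s["rec_count"]),
                     trigger_on=1)
print(out, stats["rec_count"], progress)
```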
Manual Format and Codec Usage
from iterable.datatypes.jsonl import JSONLinesIterable
from iterable.datatypes.bsonf import BSONIterable
from iterable.codecs.gzipcodec import GZIPCodec
from iterable.codecs.lzmacodec import LZMACodec
# Read gzipped JSONL
read_codec = GZIPCodec('input.jsonl.gz', mode='r', open_it=True)
reader = JSONLinesIterable(codec=read_codec)
# Write LZMA compressed BSON
write_codec = LZMACodec('output.bson.xz', mode='wb', open_it=False)
writer = BSONIterable(codec=write_codec, mode='w')
for row in reader:
    writer.write(row)
reader.close()
writer.close()
Using DuckDB Engine
from iterable.helpers.detect import open_iterable
# Use DuckDB engine for CSV, JSON, JSONL files
# Supported formats: csv, jsonl, ndjson, json
# Supported codecs: gz, zstd, zst
source = open_iterable(
    'data.csv.gz',
    engine='duckdb'
)
# DuckDB engine supports totals
total = source.totals()
print(f"Total records: {total}")
for row in source:
    print(row)
source.close()
Bulk Operations
from iterable.helpers.detect import open_iterable
source = open_iterable('input.jsonl')
destination = open_iterable('output.parquet', mode='w')
# Read and write in batches for better performance
batch = []
for row in source:
    batch.append(row)
    if len(batch) >= 10000:
        destination.write_bulk(batch)
        batch = []
# Write remaining records
if batch:
    destination.write_bulk(batch)
source.close()
destination.close()
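The accumulate-and-flush loop can be factored into a small generator. iter_batches is a hypothetical helper built on itertools.islice; it accepts any iterable, so each list it yields could be handed to write_bulk.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The final, shorter batch is yielded automatically, so no separate "write remaining records" step is needed.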
Working with Excel Files
from iterable.helpers.detect import open_iterable
# Read Excel file (specify sheet or page)
xls_file = open_iterable('data.xlsx', iterableargs={'page': 0})
for row in xls_file:
    print(row)
xls_file.close()
# Read specific sheet in XLSX
xlsx_file = open_iterable('data.xlsx', iterableargs={'page': 'Sheet2'})
for row in xlsx_file:
    print(row)
xlsx_file.close()
XML Processing
from iterable.helpers.detect import open_iterable
# Parse XML with specific tag name
xml_file = open_iterable(
    'data.xml',
    iterableargs={
        'tagname': 'book',
        'prefix_strip': True  # Strip XML namespace prefixes
    }
)
for item in xml_file:
    print(item)
xml_file.close()
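For intuition about what tag-based XML iteration involves, here is a stdlib-only sketch using xml.etree.ElementTree.iterparse. It is not the library's parser: iter_xml_records is a hypothetical helper, it maps only direct child elements to their text, and the dictionaries the library yields may be shaped differently. Namespace handling here removes ElementTree's {uri} qualifier, which is only loosely analogous to prefix_strip.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' qualifier from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_xml_records(stream, tagname):
    """Yield one dict per element named tagname, streaming through the file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == tagname:
            yield {local_name(child.tag): child.text for child in elem}
            elem.clear()  # free the element's children once it is emitted


xml_bytes = (
    b'<catalog xmlns="urn:example">'
    b"<book><title>Dune</title><year>1965</year></book>"
    b"<book><title>Emma</title><year>1815</year></book>"
    b"</catalog>"
)
books = list(iter_xml_records(io.BytesIO(xml_bytes), "book"))
print(books)
```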
Advanced: Converting Compressed XML to Parquet
from iterable.datatypes.xml import XMLIterable
from iterable.datatypes.parquet import ParquetIterable
from iterable.codecs.bz2codec import BZIP2Codec
# Read compressed XML
read_codec = BZIP2Codec('data.xml.bz2', mode='r')
reader = XMLIterable(codec=read_codec, tagname='page')
# Write to Parquet with schema adaptation
writer = ParquetIterable(
    'output.parquet',
    mode='w',
    use_pandas=False,
    adapt_schema=True,
    batch_size=10000
)
batch = []
for row in reader:
    batch.append(row)
    if len(batch) >= 10000:
        writer.write_bulk(batch)
        batch = []
if batch:
    writer.write_bulk(batch)
reader.close()
writer.close()
API Reference
Main Functions
open_iterable(filename, mode='r', engine='internal', codecargs={}, iterableargs={})
Opens a file and returns an iterable object.
Parameters:
- filename (str): Path to the file
- mode (str): File mode ('r' for read, 'w' for write)
- engine (str): Processing engine ('internal' or 'duckdb')
- codecargs (dict): Arguments for codec initialization
- iterableargs (dict): Arguments for iterable initialization
Returns: Iterable object for the detected file type
detect_file_type(filename)
Detects file type and compression codec from filename.
Returns: Dictionary with success, datatype, and codec keys
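A filename-based detector of this kind can be sketched in a few lines. The code below is an illustration, not the library's implementation: sketch_detect_file_type and its two lookup tables are hypothetical, they cover only a subset of the formats and codecs listed earlier, and the values the real function stores under datatype and codec may be of a different type.

```python
from pathlib import Path

# Hypothetical lookup tables covering a subset of the supported extensions.
DATATYPES = {"csv", "tsv", "json", "jsonl", "ndjson", "xml", "parquet"}
CODECS = {"gz", "bz2", "xz", "lzma", "lz4", "zip", "br", "zst", "zstd"}


def sketch_detect_file_type(filename):
    """Infer data type and compression codec from a filename's suffixes."""
    suffixes = [s.lstrip(".").lower() for s in Path(filename).suffixes]
    codec = None
    # A trailing codec extension (e.g. ".gz") wraps the real data type.
    if suffixes and suffixes[-1] in CODECS:
        codec = suffixes.pop()
    datatype = suffixes[-1] if suffixes and suffixes[-1] in DATATYPES else None
    return {"success": datatype is not None, "datatype": datatype, "codec": codec}


print(sketch_detect_file_type("data.csv.gz"))
# {'success': True, 'datatype': 'csv', 'codec': 'gz'}
print(sketch_detect_file_type("notes.txt"))
# {'success': False, 'datatype': None, 'codec': None}
```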
convert(fromfile, tofile, iterableargs={}, scan_limit=1000, batch_size=50000, silent=True, is_flatten=False)
Converts data between formats.
Parameters:
- fromfile (str): Source file path
- tofile (str): Destination file path
- iterableargs (dict): Options for iterable
- scan_limit (int): Number of records to scan for schema detection
- batch_size (int): Batch size for bulk operations
- silent (bool): Suppress progress output
- is_flatten (bool): Flatten nested structures
Iterable Methods
All iterable objects support:
- read() - Read single record
- read_bulk(num) - Read multiple records
- write(record) - Write single record
- write_bulk(records) - Write multiple records
- reset() - Reset iterator to beginning
- close() - Close file handles
Engines
Internal Engine (Default)
The internal engine uses pure Python implementations for all formats. It supports all file types and compression codecs.
DuckDB Engine
The DuckDB engine provides high-performance querying capabilities for supported formats:
- Formats: CSV, JSONL, NDJSON, JSON
- Codecs: GZIP, ZStandard (.zst)
- Features: Fast querying, totals counting, SQL-like operations
Use engine='duckdb' when opening files:
source = open_iterable('data.csv.gz', engine='duckdb')
Examples Directory
See the examples directory for more complete examples:
- simplewiki/ - Processing Wikipedia XML dumps
More Examples and Tests
See the tests directory for comprehensive usage examples and test cases.
Related Projects
This library is used in:
- undatum - Command line data processing tool
- datacrafter - Data processing ETL engine
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit pull requests or open issues.
Changelog
Version 1.0.7 (2025-12-15)
- Performance Analysis: Added comprehensive performance optimization analysis document
- Development Documentation: Enhanced project documentation with performance guides
Version 1.0.6
- Comprehensive Documentation: Enhanced README.md with detailed usage examples, API reference, and comprehensive guides
- GitHub Actions Release Workflow: Automatic release generation with version verification, testing, and PyPI publishing support
- Improved Examples: Added examples for all major use cases including format conversion, pipeline processing, and DuckDB integration
- Documentation Structure: Better organized README with clear sections for quick start, usage examples, and API reference
Version 1.0.5
- DuckDB engine support
- Enhanced format detection
- Improved compression codec handling
- Pipeline processing framework
- Bulk operations support