Upgrade your Pandas ETL process.
Abraxos
Abraxos is a lightweight Python toolkit for robust, row-aware data processing using Pandas and Pydantic. It helps you build resilient ETL pipelines that gracefully handle errors at the row level.
✨ Why Abraxos?
Many Pandas pipelines abort an entire batch when they hit a single bad row. Abraxos changes that:
- 🛡️ Fault-tolerant by design - isolate and recover from row-level errors
- 🔍 Full error visibility - see exactly which rows failed and why
- 🔄 Automatic retry logic - recursive splitting to isolate problem rows
- 📊 Production-ready - 118 tests, 92% coverage, type-safe
🚀 Features
- 📄 CSV Ingestion with Bad Line Recovery
  Read CSVs in full or in chunks, automatically capturing malformed lines separately.
- 🔁 Transform DataFrames Resiliently
  Apply transformation functions and automatically isolate rows that fail.
- 🧪 Pydantic-Based Row Validation
  Validate each row using Pydantic models, separating valid and invalid records.
- 🛢️ SQL Insertion with Error Splitting
  Insert DataFrames into SQL databases with automatic retry and chunking for failed rows.
📦 Installation
pip install abraxos
With optional dependencies:
# For SQL support
pip install "abraxos[sql]"
# For Pydantic validation
pip install "abraxos[validate]"
# For development
pip install "abraxos[dev]"
# Everything
pip install "abraxos[all]"
Requirements:
- Python 3.10+
- pandas >= 1.5.0
- numpy >= 1.23.0
- Optional: sqlalchemy >= 2.0.0
- Optional: pydantic >= 2.0.0
📖 Documentation
Full documentation is available at: https://abraxos.readthedocs.io
🎯 Quick Start
Here are real, tested examples showing Abraxos in action:
🔍 Example 1: Read CSVs with Error Recovery
Abraxos captures malformed lines instead of crashing your pipeline:
from abraxos import read_csv
# Read a CSV that has some malformed lines
result = read_csv("data.csv")
print("Bad lines:", result.bad_lines)
print("\nClean data:")
print(result.dataframe)
Output:
Bad lines: [['TOO', 'MANY', 'COLUMNS', 'HERE']]
Clean data:
id name age
0 1 Joe 28
1 2 Alice 35
2 3 Marcus 40
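The idea behind bad-line capture can be sketched with the standard library alone. This is a minimal illustration, not Abraxos's implementation: the helper name `read_rows` is hypothetical, it works on plain dicts rather than a DataFrame, and it assumes that any line whose field count differs from the header counts as malformed.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text and return (bad_lines, rows) instead of raising."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    bad_lines, rows = [], []
    for fields in reader:
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
        else:
            # Field count does not match the header: keep the raw fields
            # so the line can be inspected or repaired later.
            bad_lines.append(fields)
    return bad_lines, rows


sample = (
    "id,name,age\n"
    "1,Joe,28\n"
    "2,Alice,35\n"
    "TOO,MANY,COLUMNS,HERE\n"
    "3,Marcus,40\n"
)
bad_lines, rows = read_rows(sample)
print("Bad lines:", bad_lines)
print("Clean rows:", len(rows))
```

Keeping the rejected fields as a list, rather than discarding them, is what makes it possible to report or reprocess them after the clean rows have moved on.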
🧼 Example 2: Transform with Fault Isolation
Apply transformations that automatically isolate problematic rows:
import pandas as pd
from abraxos import transform
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': [' Joe ', ' Alice ', ' Marcus '],
    'age': [28, 35, 40]
})
def clean_data(df):
    df = df.copy()
    df["name"] = df["name"].str.strip().str.lower()
    return df
result = transform(df, clean_data)
print("Errors:", result.errors)
print("\nSuccess DataFrame:")
print(result.success_df)
Output:
Errors: []
Success DataFrame:
id name age
0 1 joe 28
1 2 alice 35
2 3 marcus 40
⚡ Example 3: Automatic Error Isolation
When a transformation fails on some rows, Abraxos automatically isolates them:
import pandas as pd
from abraxos import transform
df = pd.DataFrame({'value': [1, 2, 0, 3, 4]})
def divide_by_value(df):
    df = df.copy()
    if (df['value'] == 0).any():
        raise ValueError('Cannot divide by zero')
    df['result'] = 100 / df['value']
    return df
result = transform(df, divide_by_value)
print(f"Errors encountered: {len(result.errors)}")
print(f"\nSuccessful rows ({len(result.success_df)}):")
print(result.success_df)
print(f"\nFailed rows ({len(result.errored_df)}):")
print(result.errored_df)
Output:
Errors encountered: 1
Successful rows (4):
value result
0 1 100.000000
1 2 50.000000
3 3 33.333333
4 4 25.000000
Failed rows (1):
value
2 0
Notice how Abraxos automatically isolated the problematic row (value=0) and processed the rest!
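Recursive splitting, the technique named in the feature list, can be sketched roughly as follows. This is a dependency-free illustration on plain lists, not the library's code; `isolate` and `divide` are hypothetical names, and halving the batch on each failure is an assumption about the split strategy.

```python
def isolate(rows, fn):
    """Apply fn to a whole batch; on failure, halve the batch and retry.

    Returns (successes, failed_rows, errors).
    """
    if not rows:
        return [], [], []
    try:
        return fn(rows), [], []
    except Exception as exc:
        if len(rows) == 1:
            # A single row that still fails is the culprit.
            return [], list(rows), [exc]
        mid = len(rows) // 2
        ok_left, bad_left, err_left = isolate(rows[:mid], fn)
        ok_right, bad_right, err_right = isolate(rows[mid:], fn)
        return ok_left + ok_right, bad_left + bad_right, err_left + err_right


def divide(batch):
    # Raises ZeroDivisionError for the whole batch if any value is 0.
    return [{"value": v, "result": 100 / v} for v in batch]


ok, bad, errors = isolate([1, 2, 0, 3, 4], divide)
print("Successful values:", [r["value"] for r in ok])
print("Failed values:", bad)
print("Errors:", len(errors))
```

Because the function is first tried on the whole batch, a clean batch costs a single call; the extra calls are only paid when something fails.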
✅ Example 4: Validate with Pydantic
Validate each row and separate valid from invalid data:
import pandas as pd
from abraxos import validate
from pydantic import BaseModel
class Person(BaseModel):
    name: str
    age: int
df = pd.DataFrame({
    'name': ['Joe', 'Alice', 'Marcus'],
    'age': [28, 'invalid', 40]
})
result = validate(df, Person)
print("Valid rows:")
print(result.success_df)
print(f"\nNumber of validation errors: {len(result.errors)}")
print("\nInvalid rows:")
print(result.errored_df)
Output:
Valid rows:
name age
0 Joe 28
2 Marcus 40
Number of validation errors: 1
Invalid rows:
name age
1 Alice invalid
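The valid/invalid split shown above amounts to validating one record at a time and collecting the failures. The sketch below illustrates that pattern with a standard-library dataclass standing in for the Pydantic model, so it runs without extra dependencies; `partition` is a hypothetical helper, and the type check in `__post_init__` is a deliberately crude substitute for real schema validation.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int

    def __post_init__(self):
        # Stand-in for schema validation: reject non-integer ages.
        if not isinstance(self.age, int) or isinstance(self.age, bool):
            raise TypeError(f"age must be an int, got {self.age!r}")


def partition(records, model):
    """Validate records one by one; return (valid, invalid, errors)."""
    valid, invalid, errors = [], [], []
    for record in records:
        try:
            model(**record)
        except (TypeError, ValueError) as exc:
            invalid.append(record)
            errors.append(exc)
        else:
            valid.append(record)
    return valid, invalid, errors


records = [
    {"name": "Joe", "age": 28},
    {"name": "Alice", "age": "invalid"},
    {"name": "Marcus", "age": 40},
]
valid, invalid, errors = partition(records, Person)
print("Valid:", [r["name"] for r in valid])
print("Invalid:", invalid)
```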
🗃️ Example 5: SQL Insertion with Retry Logic
Insert data into SQL with automatic error handling:
import pandas as pd
from abraxos import to_sql
from sqlalchemy import create_engine
engine = create_engine("sqlite:///example.db")
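df = pd.DataFrame({
    'name': ['Joe', 'Alice', 'Marcus'],
    'age': [28, 35, 40]
})
result = to_sql(df, "people", engine)
print(f"Successful inserts: {result.success_df.shape[0]}")
print(f"Failed rows: {result.errored_df.shape[0]}")
# Read the table back to confirm what was stored
print("Data in database:")
print(pd.read_sql("SELECT * FROM people", engine))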
Output:
Successful inserts: 3
Failed rows: 0
Data in database:
name age
0 Joe 28
1 Alice 35
2 Marcus 40
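For insertion, retrying a split batch is only safe if the failed attempt left nothing behind. The sketch below shows one way to get that property with the standard-library `sqlite3` module: each attempt runs in its own transaction, which is rolled back on error before the batch is halved. It is an illustration under assumptions, not Abraxos's `to_sql`; the `insert_rows` helper and the `NOT NULL` schema are made up for the example.

```python
import sqlite3


def insert_rows(conn, rows):
    """Insert (name, age) tuples; return (inserted, rejected)."""
    if not rows:
        return [], []
    try:
        # The connection context manager commits on success and
        # rolls the whole batch back if any row raises.
        with conn:
            conn.executemany(
                "INSERT INTO people (name, age) VALUES (?, ?)", rows
            )
        return list(rows), []
    except sqlite3.Error:
        if len(rows) == 1:
            return [], list(rows)
        mid = len(rows) // 2
        ok_left, bad_left = insert_rows(conn, rows[:mid])
        ok_right, bad_right = insert_rows(conn, rows[mid:])
        return ok_left + ok_right, bad_left + bad_right


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT NOT NULL, age INTEGER NOT NULL)")

rows = [("Joe", 28), ("Alice", None), ("Marcus", 40)]
inserted, rejected = insert_rows(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print("Inserted:", inserted)
print("Rejected:", rejected)
print("Rows in table:", count)
```

Without the rollback, rows that were written before the failing one would be inserted a second time when their half of the batch is retried.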
📚 Example 6: Process Large Files in Chunks
Read and process large CSV files efficiently:
from abraxos import read_csv
# Read in chunks of 1000 rows
for chunk_result in read_csv("large_file.csv", chunksize=1000):
    print(f"Processing chunk with {len(chunk_result.dataframe)} rows")
    print(f"Bad lines in this chunk: {len(chunk_result.bad_lines)}")
    # Process the chunk
    # ... your processing logic here
Output (from a sample run with chunksize=2 on a five-row file, printing each chunk's DataFrame):
Reading in chunks of 2 rows:
Chunk 1:
id value
0 1 10
1 2 20
Chunk 2:
id value
2 3 30
3 4 40
Chunk 3:
id value
4 5 50
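Chunked reading keeps memory use bounded because only one chunk is materialized at a time. A minimal generator-based sketch of that pattern, using only the standard library, might look like this; `read_chunks` is a hypothetical name, and the five-row sample mirrors the output above.

```python
import csv
import io
from itertools import islice


def read_chunks(text, chunksize):
    """Yield lists of row dicts, at most `chunksize` rows per list."""
    reader = csv.DictReader(io.StringIO(text))
    while True:
        # islice pulls the next `chunksize` rows without reading further.
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


sample = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"
sizes = [len(chunk) for chunk in read_chunks(sample, chunksize=2)]
print("Chunk sizes:", sizes)
```

The last chunk is simply shorter when the row count is not a multiple of the chunk size.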
🔄 Complete ETL Pipeline Example
Here's a complete example combining multiple features:
import pandas as pd
from abraxos import read_csv, transform, validate, to_sql
from pydantic import BaseModel
from sqlalchemy import create_engine
# 1. Extract: Read CSV with error recovery
csv_result = read_csv("messy_data.csv")
print(f"Captured {len(csv_result.bad_lines)} bad lines")
# 2. Transform: Clean the data
def clean_data(df):
    df = df.copy()
    df['name'] = df['name'].str.strip().str.title()
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    return df.dropna()
transform_result = transform(csv_result.dataframe, clean_data)
print(f"Transformed {len(transform_result.success_df)} rows successfully")
# 3. Validate: Ensure data quality
class Person(BaseModel):
    name: str
    age: int
validate_result = validate(transform_result.success_df, Person)
print(f"Validated {len(validate_result.success_df)} rows")
print(f"Validation failed for {len(validate_result.errored_df)} rows")
# 4. Load: Insert into database
engine = create_engine("sqlite:///clean_data.db")
load_result = to_sql(validate_result.success_df, "people", engine)
print(f"Loaded {len(load_result.success_df)} rows to database")
# Rejected rows from every stage remain available for error reports
csv_result.bad_lines # Malformed CSV lines
transform_result.errored_df # Rows that failed transformation
validate_result.errored_df # Rows that failed validation
load_result.errored_df # Rows that failed to insert
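One way to turn the per-stage rejects into a single error report is to tag each rejected row with the stage that produced it. The sketch below uses plain lists and dicts so that it runs on its own; `build_report` and the sample rejects are hypothetical, and in a real pipeline the inputs would come from the result objects above.

```python
import json


def build_report(rejected_by_stage):
    """Flatten {stage: rejected_rows} into one list of tagged entries."""
    return [
        {"stage": stage, "row": row}
        for stage, rows in rejected_by_stage.items()
        for row in rows
    ]


# Hypothetical rejects, shaped like the examples earlier in this README.
rejected = {
    "extract": [["TOO", "MANY", "COLUMNS", "HERE"]],
    "validate": [{"name": "Alice", "age": "invalid"}],
    "load": [],
}
report = build_report(rejected)
report_json = json.dumps(report, indent=2)
print(report_json)
```

When the rejects are DataFrames, pandas' own `DataFrame.to_csv` is a simple alternative for writing each `errored_df` to its own file.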
🏗️ API Reference
Core Functions
read_csv(path, *, chunksize=None, **kwargs) -> ReadCsvResult | Generator
Read CSV files with automatic bad line recovery.
Returns: ReadCsvResult(bad_lines, dataframe) or generator of results if chunked.
transform(df, transformer, chunks=2) -> TransformResult
Apply a transformation function with automatic error isolation.
Returns: TransformResult(errors, errored_df, success_df)
validate(df, model) -> ValidateResult
Validate DataFrame rows using a Pydantic model.
Returns: ValidateResult(errors, errored_df, success_df)
to_sql(df, name, con, *, if_exists='append', chunks=2, **kwargs) -> ToSqlResult
Insert DataFrame into SQL database with retry logic.
Returns: ToSqlResult(errors, errored_df, success_df)
Utility Functions
- split(df, n=2) - Split DataFrame into n parts
- clear(df) - Create empty DataFrame with same schema
- to_records(df) - Convert DataFrame to list of dicts with None for NaN
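To make the behavior of the splitting and record-conversion utilities concrete, here are rough pure-Python analogues that operate on lists instead of DataFrames. They are illustrations only: `split_rows` and `nan_to_none` are hypothetical names, and the choice to give the earlier parts the extra rows is an assumption about how uneven splits are handled.

```python
import math


def split_rows(rows, n=2):
    """Split a list into n contiguous parts whose sizes differ by at most one."""
    size, extra = divmod(len(rows), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def nan_to_none(records):
    """Replace float NaN values with None so records serialize cleanly."""
    return [
        {
            key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in record.items()
        }
        for record in records
    ]


parts = split_rows([1, 2, 3, 4, 5], n=2)
cleaned = nan_to_none([{"name": "Joe", "age": float("nan")}])
print(parts)
print(cleaned)
```

Converting NaN to None matters mostly at the boundaries of a pipeline, where database drivers and JSON encoders expect a null rather than a float.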
🧪 Testing & Development
Abraxos is thoroughly tested and type-safe:
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests with coverage (118 tests, 92% coverage)
pytest
# Run type checking
mypy abraxos # Success: no issues found
# Run linting and formatting
ruff check . # All checks passed
ruff format .
Test Coverage:
- 118 tests passing
- 92% code coverage
- All major code paths tested
- Type-safe with mypy
🤝 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Quick checklist:
- ✅ Add tests for new features
- ✅ Maintain 90%+ coverage
- ✅ Pass all type checks (mypy abraxos)
- ✅ Pass all lints (ruff check .)
- ✅ Update documentation
📝 Changelog
See CHANGELOG.md for version history and migration guides.
📄 License
MIT License © 2024 Odos Matthews
🧙‍♂️ Author
Crafted by Odos Matthews to bring resilience and magic to data workflows.
⭐ Support
If Abraxos helps your project, consider:
- ⭐ Starring the repo
- 🐛 Reporting issues
- 🤝 Contributing improvements
- 📢 Sharing with others
Happy data processing! 🚀