Upgrade your Pandas ETL process.

Abraxos

Abraxos is a lightweight Python toolkit for robust, row-aware data processing using Pandas and Pydantic. It helps you build resilient ETL pipelines that gracefully handle errors at the row level.

✨ Why Abraxos?

Many data pipelines abort an entire batch when they hit a single bad row. Abraxos takes a different approach:

  • 🛡️ Fault-tolerant by design - isolate and recover from row-level errors
  • 🔍 Full error visibility - see exactly which rows failed and why
  • 🔄 Automatic retry logic - recursive splitting to isolate problem rows
  • 📊 Production-ready - 118 tests, 92% coverage, type-safe

🚀 Features

  • 📄 CSV Ingestion with Bad Line Recovery
    Read CSVs in full or in chunks, automatically capturing malformed lines separately.

  • 🔁 Transform DataFrames Resiliently
    Apply transformation functions and automatically isolate rows that fail.

  • 🧪 Pydantic-Based Row Validation
    Validate each row using Pydantic models, separating valid and invalid records.

  • 🛢️ SQL Insertion with Error Splitting
    Insert DataFrames into SQL databases with automatic retry and chunking for failed rows.


📦 Installation

pip install abraxos

With optional dependencies:

# For SQL support (quotes keep shells such as zsh from expanding the brackets)
pip install "abraxos[sql]"

# For Pydantic validation
pip install "abraxos[validate]"

# For development
pip install "abraxos[dev]"

# Everything
pip install "abraxos[all]"

Requirements:

  • Python 3.10+
  • pandas >= 1.5.0
  • numpy >= 1.23.0
  • Optional: sqlalchemy >= 2.0.0
  • Optional: pydantic >= 2.0.0

📖 Documentation

Full documentation is available at: https://abraxos.readthedocs.io


🎯 Quick Start

Here are real, tested examples showing Abraxos in action:

🔍 Example 1: Read CSVs with Error Recovery

Abraxos captures malformed lines instead of crashing your pipeline:

from abraxos import read_csv

# Read a CSV that has some malformed lines
result = read_csv("data.csv")

print("Bad lines:", result.bad_lines)
print("\nClean data:")
print(result.dataframe)

Output:

Bad lines: [['TOO', 'MANY', 'COLUMNS', 'HERE']]

Clean data:
   id    name  age
0   1     Joe   28
1   2   Alice   35
2   3  Marcus   40

🧼 Example 2: Transform with Fault Isolation

Apply transformations that automatically isolate problematic rows:

import pandas as pd
from abraxos import transform

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['  Joe  ', '  Alice  ', '  Marcus  '],
    'age': [28, 35, 40]
})

def clean_data(df):
    df = df.copy()
    df["name"] = df["name"].str.strip().str.lower()
    return df

result = transform(df, clean_data)
print("Errors:", result.errors)
print("\nSuccess DataFrame:")
print(result.success_df)

Output:

Errors: []

Success DataFrame:
   id    name  age
0   1     joe   28
1   2   alice   35
2   3  marcus   40

⚡ Example 3: Automatic Error Isolation

When transformation fails on some rows, Abraxos automatically isolates them:

import pandas as pd
from abraxos import transform

df = pd.DataFrame({'value': [1, 2, 0, 3, 4]})

def divide_by_value(df):
    df = df.copy()
    if (df['value'] == 0).any():
        raise ValueError('Cannot divide by zero')
    df['result'] = 100 / df['value']
    return df

result = transform(df, divide_by_value)

print(f"Errors encountered: {len(result.errors)}")
print(f"\nSuccessful rows ({len(result.success_df)}):")
print(result.success_df)
print(f"\nFailed rows ({len(result.errored_df)}):")
print(result.errored_df)

Output:

Errors encountered: 1

Successful rows (4):
   value      result
0      1  100.000000
1      2   50.000000
3      3   33.333333
4      4   25.000000

Failed rows (1):
   value
2      0

Notice how Abraxos automatically isolated the problematic row (value=0) and processed the rest!
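
The recursive splitting mentioned earlier can be pictured with a small standard-library sketch. It works on plain lists of dicts rather than DataFrames, and the function name `isolate_failures` is hypothetical; treat it as an illustration of the general technique, not as the code Abraxos runs.

```python
def isolate_failures(rows, func, parts=2):
    """Apply func to a batch; on failure, split the batch and retry each part."""
    if not rows:
        return [], [], []
    try:
        return func(rows), [], []
    except Exception as exc:
        if len(rows) == 1:
            # A single failing row cannot be split further: record it.
            return [], list(rows), [exc]
        success, errored, errors = [], [], []
        size = -(-len(rows) // parts)  # ceiling division
        for start in range(0, len(rows), size):
            ok, bad, errs = isolate_failures(rows[start:start + size], func, parts)
            success.extend(ok)
            errored.extend(bad)
            errors.extend(errs)
        return success, errored, errors


def divide(rows):
    # Raises ZeroDivisionError for the whole batch if any value is 0.
    return [{"value": r["value"], "result": 100 / r["value"]} for r in rows]


rows = [{"value": v} for v in [1, 2, 0, 3, 4]]
success, errored, errors = isolate_failures(rows, divide)
print([r["value"] for r in success])  # [1, 2, 3, 4]
print(errored)                        # [{'value': 0}]
print(len(errors))                    # 1
```

Batches that succeed are processed in one call, so the extra work is concentrated around the rows that actually fail.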


✅ Example 4: Validate with Pydantic

Validate each row and separate valid from invalid data:

import pandas as pd
from abraxos import validate
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

df = pd.DataFrame({
    'name': ['Joe', 'Alice', 'Marcus'],
    'age': [28, 'invalid', 40]
})

result = validate(df, Person)

print("Valid rows:")
print(result.success_df)
print(f"\nNumber of validation errors: {len(result.errors)}")
print("\nInvalid rows:")
print(result.errored_df)

Output:

Valid rows:
     name  age
0     Joe   28
2  Marcus   40

Number of validation errors: 1

Invalid rows:
    name      age
1  Alice  invalid

🗃️ Example 5: SQL Insertion with Retry Logic

Insert data into SQL with automatic error handling:

import pandas as pd
from abraxos import to_sql
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")

df = pd.DataFrame({
    'name': ['Joe', 'Alice', 'Marcus'],
    'age': [28, 35, 40]
})

result = to_sql(df, "people", engine)

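print(f"Successful inserts: {result.success_df.shape[0]}")
print(f"Failed rows: {result.errored_df.shape[0]}")

# Read the table back to confirm what was written
# (matches the output below when the table was empty beforehand)
print("\nData in database:")
print(pd.read_sql("SELECT * FROM people", engine))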

Output:

Successful inserts: 3
Failed rows: 0

Data in database:
     name  age
0     Joe   28
1   Alice   35
2  Marcus   40

📚 Example 6: Process Large Files in Chunks

Read and process large CSV files efficiently:

from abraxos import read_csv

# Read in chunks of 1000 rows
for chunk_result in read_csv("large_file.csv", chunksize=1000):
    print(f"Processing chunk with {len(chunk_result.dataframe)} rows")
    print(f"Bad lines in this chunk: {len(chunk_result.bad_lines)}")
    
    # Process the chunk
    # ... your processing logic here

Output (illustrative, from a run with chunksize=2 on a five-row file that prints each chunk's dataframe):

Reading in chunks of 2 rows:

Chunk 1:
   id  value
0   1     10
1   2     20

Chunk 2:
   id  value
2   3     30
3   4     40

Chunk 3:
   id  value
4   5     50

🔄 Complete ETL Pipeline Example

Here's a complete example combining multiple features:

import pandas as pd
from abraxos import read_csv, transform, validate, to_sql
from pydantic import BaseModel
from sqlalchemy import create_engine

# 1. Extract: Read CSV with error recovery
csv_result = read_csv("messy_data.csv")
print(f"Captured {len(csv_result.bad_lines)} bad lines")

# 2. Transform: Clean the data
def clean_data(df):
    df = df.copy()
    df['name'] = df['name'].str.strip().str.title()
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    return df.dropna()

transform_result = transform(csv_result.dataframe, clean_data)
print(f"Transformed {len(transform_result.success_df)} rows successfully")

# 3. Validate: Ensure data quality
class Person(BaseModel):
    name: str
    age: int

validate_result = validate(transform_result.success_df, Person)
print(f"Validated {len(validate_result.success_df)} rows")
print(f"Validation failed for {len(validate_result.errored_df)} rows")

# 4. Load: Insert into database
engine = create_engine("sqlite:///clean_data.db")
load_result = to_sql(validate_result.success_df, "people", engine)
print(f"Loaded {len(load_result.success_df)} rows to database")

# Error reports: each stage keeps its rejects, ready to inspect or save
csv_result.bad_lines  # Malformed CSV lines
transform_result.errored_df  # Rows that failed transformation
validate_result.errored_df  # Rows that failed validation
load_result.errored_df  # Rows that failed to insert
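
One way to persist the malformed lines is to write them out as CSV rows. The sketch below assumes `bad_lines` is a list of lists of strings, as in the Example 1 output; `write_bad_lines` is a hypothetical helper, not part of Abraxos.

```python
import csv
import io


def write_bad_lines(bad_lines, handle):
    """Write each malformed line as one CSV row so it can be reviewed later."""
    writer = csv.writer(handle)
    writer.writerows(bad_lines)


# Hypothetical stand-in for csv_result.bad_lines from the pipeline above.
bad_lines = [["TOO", "MANY", "COLUMNS", "HERE"]]

# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("bad_lines.csv", "w", newline="") instead.
buffer = io.StringIO()
write_bad_lines(bad_lines, buffer)
report = buffer.getvalue()
print(report)
```

The `errored_df` attributes are listed as DataFrames in the API reference, so the usual pandas `DataFrame.to_csv` method should cover those reports.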

🏗️ API Reference

Core Functions

read_csv(path, *, chunksize=None, **kwargs) -> ReadCsvResult | Generator

Read CSV files with automatic bad line recovery.

Returns: ReadCsvResult(bad_lines, dataframe) or generator of results if chunked.

transform(df, transformer, chunks=2) -> TransformResult

Apply a transformation function with automatic error isolation.

Returns: TransformResult(errors, errored_df, success_df)

validate(df, model) -> ValidateResult

Validate DataFrame rows using a Pydantic model.

Returns: ValidateResult(errors, errored_df, success_df)

to_sql(df, name, con, *, if_exists='append', chunks=2, **kwargs) -> ToSqlResult

Insert DataFrame into SQL database with retry logic.

Returns: ToSqlResult(errors, errored_df, success_df)
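
To illustrate what insert-with-retry can look like, here is a sketch built on the standard-library `sqlite3` module. It inserts a batch in one transaction and, if that fails, rolls back and retries smaller batches until the failing rows are isolated. The function name `insert_with_split`, the table layout, and the splitting rule are assumptions for the example; the actual `to_sql` behavior may differ.

```python
import sqlite3


def insert_with_split(conn, rows, parts=2):
    """Insert rows as one batch; if the batch fails, roll back, split, and retry."""
    if not rows:
        return [], []
    try:
        with conn:  # commits on success, rolls back on exception
            conn.executemany("INSERT INTO people (name, age) VALUES (?, ?)", rows)
        return list(rows), []
    except sqlite3.Error:
        if len(rows) == 1:
            # A single row that fails on its own is reported as failed.
            return [], list(rows)
        inserted, failed = [], []
        size = -(-len(rows) // parts)  # ceiling division
        for start in range(0, len(rows), size):
            ok, bad = insert_with_split(conn, rows[start:start + size], parts)
            inserted.extend(ok)
            failed.extend(bad)
        return inserted, failed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT NOT NULL, age INTEGER NOT NULL)")

rows = [("Joe", 28), ("Alice", None), ("Marcus", 40)]  # None violates NOT NULL
inserted, failed = insert_with_split(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(inserted)  # [('Joe', 28), ('Marcus', 40)]
print(failed)    # [('Alice', None)]
print(count)     # 2
```

Because each attempt runs in its own transaction, a failed batch leaves no partial rows behind before the retry.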

Utility Functions

  • split(df, n=2) - Split DataFrame into n parts
  • clear(df) - Create empty DataFrame with same schema
  • to_records(df) - Convert DataFrame to list of dicts with None for NaN
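
The sketch below shows the ideas behind `split` and `to_records` on plain Python lists, using only the standard library. The names `split_rows` and `nan_to_none` are hypothetical, and how Abraxos divides a DataFrame whose length is not a multiple of n is not stated here, so the near-equal contiguous split is an assumption.

```python
import math


def split_rows(rows, n=2):
    """Split a list into at most n contiguous parts of near-equal size."""
    size = max(1, -(-len(rows) // n))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def nan_to_none(records):
    """Replace float NaN values with None in a list of dicts."""
    return [
        {k: (None if isinstance(v, float) and math.isnan(v) else v) for k, v in rec.items()}
        for rec in records
    ]


print(split_rows([1, 2, 3, 4, 5]))
# [[1, 2, 3], [4, 5]]
print(nan_to_none([{"name": "Joe", "age": float("nan")}]))
# [{'name': 'Joe', 'age': None}]
```

Converting NaN to None matters when records are handed to a validator or a database driver, both of which typically treat None as a missing value while NaN is just a float.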

🧪 Testing & Development

Abraxos is thoroughly tested and type-safe:

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests with coverage (118 tests, 92% coverage)
pytest

# Run type checking
mypy abraxos  # Success: no issues found

# Run linting and formatting
ruff check .  # All checks passed
ruff format .

Test Coverage:

  • 118 tests passing
  • 92% code coverage
  • All major code paths tested
  • Type-safe with mypy

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Quick checklist:

  • ✅ Add tests for new features
  • ✅ Maintain 90%+ coverage
  • ✅ Pass all type checks (mypy abraxos)
  • ✅ Pass all lints (ruff check .)
  • ✅ Update documentation

📝 Changelog

See CHANGELOG.md for version history and migration guides.


📄 License

MIT License © 2024 Odos Matthews


🧙‍♂️ Author

Crafted by Odos Matthews to bring resilience and magic to data workflows.


⭐ Support

If Abraxos helps your project, consider:

  • ⭐ Starring the repo
  • 🐛 Reporting issues
  • 🤝 Contributing improvements
  • 📢 Sharing with others

Happy data processing! 🚀
