Financial Data Analyst utility toolkit for data cleaning, validation, profiling, and pipelines.
Project description
📊 FDA Toolkit
Financial Data Analysis Made Simple — A production-grade Python toolkit for loading, cleaning, validating, and analyzing financial data with one-line pipelines.
Why FDA Toolkit?
Financial data analysis is messy. You spend 80% of your time cleaning, validating, and transforming data instead of analyzing it. FDA Toolkit eliminates that pain by providing:
- 67 production-ready functions grouped into 8 intelligent modules
- One-line pipelines for common workflows (e.g.,
ftk.quick_clean_finance()) - Finance-aware validation — understand sign conventions, entity names, currency formats
- Audit trail — every operation logged for compliance and debugging
- Type-safe — full type hints and IDE autocomplete throughout
- Memory efficient — optimize dtypes, handle large files with chunking
- Professional API — pandas-like, intuitive, well-documented
Module Overview
| Module | Functions | Purpose |
|---|---|---|
| core | 17 | Column cleaning, types, duplicates, missing, outliers, text |
| features | 7 | Date & categorical feature engineering |
| finance | 11 | Currency parsing, entity standardization, financial validation |
| validation | 9 | Schema, ranges, integrity, reconciliation |
| reporting | 10 | Profiling, snapshots, delta reports, quick checks |
| io | 5 | Safe CSV/Excel reading, chunked processing, parquet export |
| pipelines | 2 | Pre-built quick_clean() and quick_clean_finance() |
| utils | 6 | Logging, security, memory optimization |
| TOTAL | 67 | Production-ready functions |
Quick Start
Install
pip install -e .
Use in 3 Lines
import fda_toolkit as ftk
df = ftk.read_csv_safely("data/transactions.csv")
df_clean = ftk.quick_clean_finance(df, primary_key="transaction_id",
date_cols=["date"], currency_cols=["amount"])
ftk.quick_check(df_clean) # Profile results
Discover All Functions
# See what's available
ftk.info() # Browse by category
# Filter by domain
ftk.info(category="Finance")
📚 What's Inside?
Core Data Cleaning (17 functions)
Handle the fundamentals with confidence:
from fda_toolkit.core import columns, duplicates, missing, outliers, text, types
df = columns.clean_column_headers(df) # 'Name ' → 'name'
df = types.clean_numeric_column(df['amount']) # '$1,234.56' → 1234.56
df = missing.fill_missing(df, strategy='mean') # Handle NaN intelligently
df = duplicates.remove_duplicates(df, subset=['id'])
df = outliers.flag_outliers(df, 'amount') # Flag statistical outliers
Finance-Specific (11 functions)
Domain expertise built-in:
from fda_toolkit.finance import parsing, entities, rules
df['amount'] = parsing.parse_currency(df['amount']) # Handle $, €, £
df['vendor'] = entities.strip_legal_suffixes(df['vendor']) # ACME Ltd → ACME
rules.validate_sign_conventions(df, rules_config) # Verify debit/credit
Feature Engineering (7 functions)
Prepare data for ML in seconds:
from fda_toolkit.features import datetime, categorical
df = datetime.extract_date_features(df, 'date') # Add year, month, quarter
df['category'] = categorical.limit_cardinality(df['category'], top_n=10)
Validation Suite (9 functions)
Catch issues before they become problems:
from fda_toolkit.validation import schema, ranges, integrity
schema.validate_required_fields(df, ['id', 'date', 'amount'])
violations = ranges.validate_data_ranges(df, {'amount': (0, 1_000_000)})
integrity.reconciliation_check(original_df, clean_df, value_cols=['amount'])
Smart Pipelines (2 functions)
Pre-built, battle-tested workflows:
# Generic pipeline
df_clean = ftk.quick_clean(df)
# Finance pipeline (smart defaults for financial data)
df_clean = ftk.quick_clean_finance(
df,
primary_key="invoice_id",
date_cols=["invoice_date", "due_date"],
currency_cols=["amount", "tax"]
)
Reporting & Profiling (10 functions)
Understand your data instantly:
# Quick diagnosis
ftk.quick_check(df)
# Detailed profile
profile = ftk.profile_report(df) # Types, missingness, memory, outliers
# Track changes
snapshot_v1 = ftk.snapshot_dataset(df_before, name="before_clean")
snapshot_v2 = ftk.snapshot_dataset(df_after, name="after_clean")
delta = ftk.compare_snapshots(snapshot_v1, snapshot_v2)
Secure I/O (5 functions)
Read and write without surprises:
# Safe reading with encoding detection
df = ftk.read_csv_safely("messy_file.csv")
df = ftk.read_excel_safely("workbook.xlsx", sheet_name="Data")
# Process huge files in chunks
for chunk in ftk.chunked_processing("huge_file.csv", chunksize=50_000):
process(chunk)
# Export in optimized formats
ftk.export_parquet(df, "output.parquet") # Fast, compressed
Architecture: Dynamic & Scalable
Every function self-registers via decorator — no manual __all__ lists:
from fda_toolkit.registry import register_function
@register_function(
name="detect_fraud",
category="Validation",
module="custom.fraud"
)
def detect_fraud(df: pd.DataFrame) -> pd.DataFrame:
"""Your custom logic here."""
result = df[df['amount'] > threshold]
audit_log("detect_fraud", before=len(df), after=len(result))
return result
# Automatically appears in ftk.info()!
Audit Trail (Compliance Ready)
Every operation is logged automatically:
from fda_toolkit.utils.logging import get_global_audit_log
log = get_global_audit_log()
for event in log.events:
print(f"✓ {event.name} at {event.timestamp_utc}")
# Export for compliance teams
audit_json = log.to_dict() # JSON-ready
💡 Real-World Example
import fda_toolkit as ftk
# 1. Load and diagnose
df = ftk.read_csv_safely("sales_transactions_2024.csv")
ftk.quick_check(df)
# → Reports: types, missing %, duplicates, outliers, memory usage
# 2. Clean for analysis
df_clean = ftk.quick_clean_finance(
df,
primary_key="transaction_id",
date_cols=["date", "due_date"],
currency_cols=["amount", "tax"]
)
# 3. Validate
from fda_toolkit.validation import integrity
integrity.reconciliation_check(
original=df,
cleaned=df_clean,
value_cols=["amount"],
group_cols=["vendor_id"]
)
# 4. Engineer features for ML
df_ml = ftk.extract_date_features(df_clean, "date")
df_ml = ftk.limit_cardinality(df_ml, "vendor", top_n=20)
# 5. Export and log
ftk.export_parquet(df_ml, "ready_for_ml.parquet")
print("✅ Pipeline complete with full audit trail!")
Testing
# Run all tests
pytest
# Run specific module
pytest tests/test_core/
# Verbose output
pytest -v
Example test:
import pandas as pd
from fda_toolkit.core.columns import clean_column_headers
def test_clean_headers():
df = pd.DataFrame({'Name ': [1], 'Age (years)': [2]})
result = clean_column_headers(df)
assert result.columns.tolist() == ['name', 'age_years']
Installation & Development
From Source
# Clone or download
cd fda_toolkit_project
# Install in editable mode (dev)
pip install -e .
# With dev dependencies (if available)
pip install -e ".[dev]"
Requirements
- Python 3.9+
- pandas (data manipulation)
- numpy (numerical operations)
Security & Compliance
- Audit logging — Every operation tracked with timestamps
- Data masking —
mask_sensitive_fields()for PII protection - Type safety — Full type hints prevent common errors
- Error handling — Clear, actionable error messages
- Memory optimization — Control data footprint
📖 API Reference
Explore the full API:
ftk.info() # List all functions
ftk.info(category="Finance") # Filter by domain
ftk.get_data_summary(df) # Profile a dataset
ftk.profile_report(df) # Detailed analysis
For detailed docs on each function:
from fda_toolkit.core.outliers import detect_outliers_iqr
help(detect_outliers_iqr) # Full docstring with examples
See QUICK_REFERENCE.md for common patterns.
🎯 Use Cases
✅ Financial Reporting — Prepare data for compliance audits
✅ ML Pipelines — Clean & engineer features for models
✅ Data Migration — Validate and transform during transfers
✅ Anomaly Detection — Flag outliers in transactions
✅ Time Series Analysis — Extract date features automatically
✅ Data Quality Monitoring — Profile and compare snapshots
🚀 Next Steps
- Explore functions:
ftk.info() - Try examples: See examples/01_quick_check.py
- Read docs: docs/function_reference.md
- Run tests:
pytest - Extend: Add your own functions using
@register_function
📝 License
MIT License — see LICENSE for details.
🤝 Contributing
Found a bug? Have an idea? Open an issue or PR!
Built for financial analysts who value time, accuracy, and peace of mind. 📊✨
FDA Toolkit: Where data cleaning stops being painful and starts being productive.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fda_toolkit-0.2.3.tar.gz.
File metadata
- Download URL: fda_toolkit-0.2.3.tar.gz
- Upload date:
- Size: 41.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0e1777ebec6ef4a3fc226fd74a04631acfa4c580c1193c9f5bf85e2a2acc7910
|
|
| MD5 |
74f6ceb1db233e7778d47f92c951ed44
|
|
| BLAKE2b-256 |
1d9e1a941c3aab3e3e8c8056d2345d6e581b220c740f1d9431fbe4b24a7f1500
|
File details
Details for the file fda_toolkit-0.2.3-py3-none-any.whl.
File metadata
- Download URL: fda_toolkit-0.2.3-py3-none-any.whl
- Upload date:
- Size: 54.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4dbfde459eae5496f1681f4d87f74afe17ba0b9a411777001708bc97ef6a4969
|
|
| MD5 |
eebbb897f562cbc87d0c3ed81ba11801
|
|
| BLAKE2b-256 |
c81c3da32bb4cab7fbcb50dcb9f0dfb89a72f4c3543eacfa56255f2654689c33
|