Financial Data Analyst utility toolkit for data cleaning, validation, profiling, and pipelines.

📊 FDA Toolkit

Financial Data Analysis Made Simple — A production-grade Python toolkit for loading, cleaning, validating, and analyzing financial data with one-line pipelines.

Python 3.9+ | License: MIT | Code style: black

Why FDA Toolkit?

Financial data analysis is messy. Much of an analyst's time goes to cleaning, validating, and transforming data instead of analyzing it. FDA Toolkit reduces that overhead by providing:

  • 67 production-ready functions grouped into 8 intelligent modules
  • One-line pipelines for common workflows (e.g., ftk.quick_clean_finance())
  • Finance-aware validation — understand sign conventions, entity names, currency formats
  • Audit trail — every operation logged for compliance and debugging
  • Type-safe — full type hints and IDE autocomplete throughout
  • Memory efficient — optimize dtypes, handle large files with chunking
  • Professional API — pandas-like, intuitive, well-documented

Module Overview

Module      Functions  Purpose
core        17         Column cleaning, types, duplicates, missing, outliers, text
features    7          Date & categorical feature engineering
finance     11         Currency parsing, entity standardization, financial validation
validation  9          Schema, ranges, integrity, reconciliation
reporting   10         Profiling, snapshots, delta reports, quick checks
io          5          Safe CSV/Excel reading, chunked processing, parquet export
pipelines   2          Pre-built quick_clean() and quick_clean_finance()
utils       6          Logging, security, memory optimization
TOTAL       67         Production-ready functions

Quick Start

Install

pip install -e .  # editable install; run from the project root

Use in 3 Lines

import fda_toolkit as ftk

df = ftk.read_csv_safely("data/transactions.csv")
df_clean = ftk.quick_clean_finance(df, primary_key="transaction_id", 
                                   date_cols=["date"], currency_cols=["amount"])
ftk.quick_check(df_clean)  # Profile results

Discover All Functions

# See what's available
ftk.info()  # Browse by category

# Filter by domain
ftk.info(category="Finance")

📚 What's Inside?

Core Data Cleaning (17 functions)

Handle the fundamentals with confidence:

from fda_toolkit.core import columns, duplicates, missing, outliers, text, types

df = columns.clean_column_headers(df)           # 'Name ' → 'name'
df['amount'] = types.clean_numeric_column(df['amount'])  # '$1,234.56' → 1234.56
df = missing.fill_missing(df, strategy='mean')  # Handle NaN intelligently
df = duplicates.remove_duplicates(df, subset=['id'])
df = outliers.flag_outliers(df, 'amount')       # Flag statistical outliers
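
To make the header-cleaning step concrete, here is a minimal pure-Python sketch of the kind of normalization it implies. It is not the toolkit's implementation: the helper names clean_header and clean_headers are hypothetical, and the rule (lowercase, then collapse non-alphanumeric runs into underscores) is an assumption chosen to reproduce the 'Name ' → 'name' mapping shown above.

```python
import re


def clean_header(name: str) -> str:
    """Normalize one column header to lowercase snake_case (illustrative only)."""
    name = name.strip().lower()
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9a-z]+", "_", name)
    # Drop underscores left over at either end, e.g. from a closing parenthesis.
    return name.strip("_")


def clean_headers(names: list[str]) -> list[str]:
    """Apply clean_header to every header in order."""
    return [clean_header(n) for n in names]


print(clean_headers(["Name ", "Age (years)"]))  # ['name', 'age_years']
```

With pandas, the same function could be applied through df.rename(columns=clean_header).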

Finance-Specific (11 functions)

Domain expertise built-in:

from fda_toolkit.finance import parsing, entities, rules

df['amount'] = parsing.parse_currency(df['amount'])        # Handle $, €, £
df['vendor'] = entities.strip_legal_suffixes(df['vendor']) # ACME Ltd → ACME
rules.validate_sign_conventions(df, rules_config)          # Verify debit/credit
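
The two parsing steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the toolkit's code: the symbol set, the treatment of commas as thousands separators, the accounting-style parentheses for negatives, and the LEGAL_SUFFIXES list are all hypothetical choices.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

CURRENCY_SYMBOLS = "$€£"
# Hypothetical suffix list; a real one would be longer and locale-aware.
LEGAL_SUFFIXES = ("ltd", "inc", "llc", "gmbh", "plc", "corp")


def parse_currency(text: str) -> Optional[Decimal]:
    """Parse strings such as '$1,234.56'; return None when parsing fails."""
    cleaned = text.strip()
    # Assumed accounting notation: '(500.00)' means -500.00.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    cleaned = cleaned.strip("()")
    for symbol in CURRENCY_SYMBOLS:
        cleaned = cleaned.replace(symbol, "")
    # Assumes ',' is a thousands separator, so '1.234,56' is NOT handled.
    cleaned = cleaned.replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return -value if negative else value


def strip_legal_suffix(name: str) -> str:
    """Drop one trailing legal suffix, e.g. 'ACME Ltd' -> 'ACME'."""
    pattern = r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?\s*$"
    return re.sub(pattern, "", name.strip(), flags=re.IGNORECASE)


print(parse_currency("$1,234.56"))     # 1234.56
print(strip_legal_suffix("ACME Ltd"))  # ACME
```

Decimal is used here instead of float so that amounts such as 0.10 are represented exactly, which matters when totals are reconciled later.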

Feature Engineering (7 functions)

Prepare data for ML in seconds:

from fda_toolkit.features import datetime, categorical

df = datetime.extract_date_features(df, 'date')  # Add year, month, quarter
df['category'] = categorical.limit_cardinality(df['category'], top_n=10)
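
As a rough picture of what these two feature steps compute, the sketch below derives year, month, and quarter from a date and folds rare categories into a single bucket. The function names and the "Other" label are assumptions for illustration; the toolkit's own output columns and defaults may differ.

```python
from collections import Counter
from datetime import date


def date_features(d: date) -> dict:
    """Return the calendar features mentioned above for one date."""
    return {"year": d.year, "month": d.month, "quarter": (d.month - 1) // 3 + 1}


def limit_cardinality(values: list[str], top_n: int, other: str = "Other") -> list[str]:
    """Keep the top_n most frequent labels and replace the rest with `other`."""
    keep = {label for label, _ in Counter(values).most_common(top_n)}
    return [v if v in keep else other for v in values]


print(date_features(date(2024, 5, 17)))
# {'year': 2024, 'month': 5, 'quarter': 2}
print(limit_cardinality(["a", "a", "b", "b", "c"], top_n=2))
# ['a', 'a', 'b', 'b', 'Other']
```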

Validation Suite (9 functions)

Catch issues before they become problems:

from fda_toolkit.validation import schema, ranges, integrity

schema.validate_required_fields(df, ['id', 'date', 'amount'])
violations = ranges.validate_data_ranges(df, {'amount': (0, 1_000_000)})
integrity.reconciliation_check(original_df, clean_df, value_cols=['amount'])
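
The range and reconciliation checks can be thought of as follows. This sketch works on plain lists of dicts rather than DataFrames, and the function names, return shapes, and the 0.01 tolerance are assumptions made for the example, not the toolkit's API.

```python
from decimal import Decimal


def out_of_range(rows: list[dict], column: str, low, high) -> list[int]:
    """Return the indexes of rows whose value falls outside [low, high]."""
    return [i for i, row in enumerate(rows) if not (low <= row[column] <= high)]


def reconcile(original: list[dict], cleaned: list[dict], column: str,
              tolerance: Decimal = Decimal("0.01")) -> dict:
    """Compare the column total before and after cleaning."""
    before = sum((r[column] for r in original), Decimal("0"))
    after = sum((r[column] for r in cleaned), Decimal("0"))
    diff = after - before
    return {"before": before, "after": after, "diff": diff,
            "ok": abs(diff) <= tolerance}


original = [
    {"amount": Decimal("100.00")},
    {"amount": Decimal("-5.00")},
    {"amount": Decimal("250.50")},
]
# Simulate a cleaning step that silently dropped the negative row.
cleaned = [original[0], original[2]]

print(out_of_range(original, "amount", 0, 1_000_000))  # [1]
print(reconcile(original, cleaned, "amount")["ok"])    # False
```

A failed reconciliation like this one is the signal that cleaning changed the financial totals, which is usually worth investigating before any analysis.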

Smart Pipelines (2 functions)

Pre-built, battle-tested workflows:

# Generic pipeline
df_clean = ftk.quick_clean(df)

# Finance pipeline (smart defaults for financial data)
df_clean = ftk.quick_clean_finance(
    df,
    primary_key="invoice_id",
    date_cols=["invoice_date", "due_date"],
    currency_cols=["amount", "tax"]
)

Reporting & Profiling (10 functions)

Understand your data instantly:

# Quick diagnosis
ftk.quick_check(df)

# Detailed profile
profile = ftk.profile_report(df)  # Types, missingness, memory, outliers

# Track changes
snapshot_v1 = ftk.snapshot_dataset(df_before, name="before_clean")
snapshot_v2 = ftk.snapshot_dataset(df_after, name="after_clean")
delta = ftk.compare_snapshots(snapshot_v1, snapshot_v2)
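
One way to picture snapshot comparison is to store a small summary of each dataset and diff the summaries. The sketch below is hypothetical: the fields it records (row count, column names, a SHA-256 checksum) are assumptions, and the toolkit's snapshots may capture more or different metadata.

```python
import hashlib
import json


def snapshot(rows: list[dict], name: str) -> dict:
    """Summarize a dataset as row count, column names, and a content checksum."""
    columns = sorted({key for row in rows for key in row})
    payload = json.dumps(rows, sort_keys=True, default=str).encode("utf-8")
    return {
        "name": name,
        "row_count": len(rows),
        "columns": columns,
        "checksum": hashlib.sha256(payload).hexdigest(),
    }


def compare(a: dict, b: dict) -> dict:
    """Describe what changed between snapshot a and snapshot b."""
    return {
        "row_delta": b["row_count"] - a["row_count"],
        "added_columns": sorted(set(b["columns"]) - set(a["columns"])),
        "removed_columns": sorted(set(a["columns"]) - set(b["columns"])),
        "content_changed": a["checksum"] != b["checksum"],
    }


before = snapshot([{"id": 1, "Amount ": "$5"}, {"id": 1, "Amount ": "$5"}], "before_clean")
after = snapshot([{"id": 1, "amount": 5.0}], "after_clean")
print(compare(before, after))
```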

Secure I/O (5 functions)

Read and write without surprises:

# Safe reading with encoding detection
df = ftk.read_csv_safely("messy_file.csv")
df = ftk.read_excel_safely("workbook.xlsx", sheet_name="Data")

# Process huge files in chunks
for chunk in ftk.chunked_processing("huge_file.csv", chunksize=50_000):
    process(chunk)

# Export in optimized formats
ftk.export_parquet(df, "output.parquet")  # Fast, compressed
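
The idea behind chunked processing is to hold only a bounded number of rows in memory at a time. Here is a standard-library sketch of that pattern; iter_chunks is a hypothetical helper, and the in-memory StringIO stands in for a large file on disk.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def iter_chunks(handle: TextIO, chunksize: int) -> Iterator[list[dict]]:
    """Yield lists of at most `chunksize` rows from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


sample = io.StringIO("id,amount\n1,10\n2,20\n3,30\n")
total = 0
for chunk in iter_chunks(sample, chunksize=2):
    # Each chunk is aggregated and then discarded, so memory stays bounded.
    total += sum(int(row["amount"]) for row in chunk)
print(total)  # 60
```

With pandas, passing chunksize to pandas.read_csv gives a similar iterator of DataFrames.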

Architecture: Dynamic & Scalable

Every function self-registers via decorator — no manual __all__ lists:

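import pandas as pd
from fda_toolkit.registry import register_function

@register_function(
    name="detect_fraud",
    category="Validation",
    module="custom.fraud"
)
def detect_fraud(df: pd.DataFrame, threshold: float = 10_000.0) -> pd.DataFrame:
    """Your custom logic here."""
    # threshold default is illustrative, not a toolkit constant
    result = df[df['amount'] > threshold]
    # audit_log is the toolkit's audit helper; its import is not shown here
    audit_log("detect_fraud", before=len(df), after=len(result))
    return result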

# Automatically appears in ftk.info()!
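
For readers curious how decorator-based self-registration can work, here is a minimal stand-alone sketch. It reuses the names register_function and info from the text, but these are local stand-ins written for illustration. The actual internals of fda_toolkit.registry are not shown in this README and may differ.

```python
from typing import Callable, Dict, Optional

# Module-level registry: function name -> metadata (hypothetical structure).
REGISTRY: Dict[str, dict] = {}


def register_function(name: str, category: str, module: str) -> Callable:
    """Return a decorator that records the function and leaves it unchanged."""
    def decorator(func: Callable) -> Callable:
        REGISTRY[name] = {"category": category, "module": module, "func": func}
        return func
    return decorator


def info(category: Optional[str] = None) -> list[str]:
    """List registered names, optionally filtered by category."""
    return sorted(
        n for n, meta in REGISTRY.items()
        if category is None or meta["category"] == category
    )


@register_function(name="double_amount", category="Finance", module="custom.demo")
def double_amount(x: float) -> float:
    return x * 2


print(info())                    # ['double_amount']
print(info(category="Finance"))  # ['double_amount']
```

Registration happens at import time, when the decorator runs, which is why no separate list of exported names has to be maintained.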

Audit Trail (Compliance Ready)

Every operation is logged automatically:

from fda_toolkit.utils.logging import get_global_audit_log

log = get_global_audit_log()

for event in log.events:
    print(f"✓ {event.name} at {event.timestamp_utc}")

# Export for compliance teams
audit_json = log.to_dict()  # JSON-ready
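
A simple audit log of this shape can be modeled with two dataclasses. The sketch below mirrors the attribute names used above (events, name, timestamp_utc, to_dict), but the classes, the record method, and the details field are assumptions for illustration, not the toolkit's implementation.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    name: str
    timestamp_utc: str
    details: dict = field(default_factory=dict)


@dataclass
class AuditLog:
    events: list = field(default_factory=list)

    def record(self, name: str, **details) -> AuditEvent:
        """Append one event stamped with the current UTC time (ISO 8601)."""
        event = AuditEvent(name, datetime.now(timezone.utc).isoformat(), details)
        self.events.append(event)
        return event

    def to_dict(self) -> dict:
        """Return a plain structure that json.dumps can serialize."""
        return {"events": [asdict(e) for e in self.events]}


log = AuditLog()
log.record("remove_duplicates", before=10, after=8)
print(json.dumps(log.to_dict(), indent=2))
```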

💡 Real-World Example

import fda_toolkit as ftk

# 1. Load and diagnose
df = ftk.read_csv_safely("sales_transactions_2024.csv")
ftk.quick_check(df)
# → Reports: types, missing %, duplicates, outliers, memory usage

# 2. Clean for analysis
df_clean = ftk.quick_clean_finance(
    df,
    primary_key="transaction_id",
    date_cols=["date", "due_date"],
    currency_cols=["amount", "tax"]
)

# 3. Validate
from fda_toolkit.validation import integrity
integrity.reconciliation_check(
    original=df, 
    cleaned=df_clean,
    value_cols=["amount"],
    group_cols=["vendor_id"]
)

# 4. Engineer features for ML
df_ml = ftk.extract_date_features(df_clean, "date")
df_ml["vendor"] = ftk.limit_cardinality(df_ml["vendor"], top_n=20)

# 5. Export and log
ftk.export_parquet(df_ml, "ready_for_ml.parquet")
print("✅ Pipeline complete with full audit trail!")


Testing

# Run all tests
pytest

# Run specific module
pytest tests/test_core/

# Verbose output
pytest -v

Example test:

import pandas as pd
from fda_toolkit.core.columns import clean_column_headers

def test_clean_headers():
    df = pd.DataFrame({'Name ': [1], 'Age (years)': [2]})
    result = clean_column_headers(df)
    assert result.columns.tolist() == ['name', 'age_years']

Installation & Development

From Source

# Clone or download
cd fda_toolkit_project

# Install in editable mode (dev)
pip install -e .

# With dev dependencies (if available)
pip install -e ".[dev]"

Requirements

  • Python 3.9+
  • pandas (data manipulation)
  • numpy (numerical operations)

Security & Compliance

  • Audit logging — Every operation tracked with timestamps
  • Data masking: mask_sensitive_fields() for PII protection
  • Type safety — Full type hints prevent common errors
  • Error handling — Clear, actionable error messages
  • Memory optimization — Control data footprint
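
As an illustration of the data-masking item in the list above, the sketch below hides all but the last few characters of selected fields. The helper names, the default of four visible characters, and the "*" fill are hypothetical; the toolkit's own masking function may behave differently.

```python
def mask_value(value: str, visible: int = 4, fill: str = "*") -> str:
    """Replace all but the last `visible` characters with `fill`."""
    if len(value) <= visible:
        # Short values are masked entirely so nothing leaks.
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


def mask_fields(row: dict, fields: set[str]) -> dict:
    """Return a copy of `row` with the named fields masked."""
    return {k: mask_value(str(v)) if k in fields else v for k, v in row.items()}


print(mask_fields({"id": 1, "iban": "GB82WEST"}, {"iban"}))
# {'id': 1, 'iban': '****WEST'}
```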

📖 API Reference

Explore the full API:

ftk.info()                           # List all functions
ftk.info(category="Finance")         # Filter by domain
ftk.get_data_summary(df)            # Profile a dataset
ftk.profile_report(df)              # Detailed analysis

For detailed docs on each function:

from fda_toolkit.core.outliers import detect_outliers_iqr
help(detect_outliers_iqr)  # Full docstring with examples
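
For context on what an IQR-based outlier check typically computes, here is a standard-library sketch. It is not the body of detect_outliers_iqr. The function name, the 1.5 multiplier default, and the inclusive quantile method are assumptions based on the common textbook rule.

```python
from statistics import quantiles


def iqr_outlier_flags(values: list[float], k: float = 1.5) -> list[bool]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


amounts = [10, 12, 11, 13, 12, 500]
print(iqr_outlier_flags(amounts))
# [False, False, False, False, False, True]
```

For these sample values Q1 is 11.25 and Q3 is 12.75, so the accepted range is 9.0 to 15.0 and only 500 is flagged.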

See QUICK_REFERENCE.md for common patterns.


🎯 Use Cases

Financial Reporting — Prepare data for compliance audits
ML Pipelines — Clean & engineer features for models
Data Migration — Validate and transform during transfers
Anomaly Detection — Flag outliers in transactions
Time Series Analysis — Extract date features automatically
Data Quality Monitoring — Profile and compare snapshots


🚀 Next Steps

  1. Explore functions: ftk.info()
  2. Try examples: See examples/01_quick_check.py
  3. Read docs: docs/function_reference.md
  4. Run tests: pytest
  5. Extend: Add your own functions using @register_function

📝 License

MIT License — see LICENSE for details.


🤝 Contributing

Found a bug? Have an idea? Open an issue or PR!


Built for financial analysts who value time, accuracy, and peace of mind. 📊✨

FDA Toolkit: Where data cleaning stops being painful and starts being productive.
