Find data concentration patterns and dataspots. Built for fraud detection and risk analysis.

Dataspot 🔥

Find data concentration patterns and dataspots in your datasets

License: MIT | Maintained by Frauddi | Python 3.9+

Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.

✨ Why Dataspot?

  • 🎯 Purpose-built for finding data concentrations, not just clustering
  • 🔍 Fraud detection ready - spot suspicious behavior patterns
  • ⚡ Simple API - get insights in 3 lines of code
  • 📊 Hierarchical analysis - understand data at multiple levels
  • 🔧 Flexible filtering - customize analysis with powerful options
  • 📈 Production tested - battle-tested in real fraud detection systems

🚀 Quick Start

pip install dataspot

from dataspot import Dataspot

# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": 150, "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": 200, "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": 50, "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": 300, "user_type": "premium"},
    # ... more data
]

# Find concentration dataspots
# (importing the class directly avoids shadowing the dataspot module with the instance)
dataspot = Dataspot()
concentrations = dataspot.find(data, fields=["country", "device", "user_type"])

# Results show where data concentrates
for pattern in concentrations[:5]:
    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")

# Example output (with the full dataset):
# country=US > device=mobile > user_type=premium → 45.2% (127 records)
# country=US > device=mobile → 52.1% (146 records)
# device=mobile → 67.8% (190 records)

🎯 Real-World Use Cases

🚨 Fraud Detection

# Find suspicious transaction patterns
suspicious = dataspot.find(
    transactions,
    fields=["country", "payment_method", "time_of_day"],
    min_percentage=15  # Only significant concentrations
)

# Spot unusual concentrations that might indicate fraud
for pattern in suspicious:
    if pattern.percentage > 30:
        print(f"⚠️ High concentration: {pattern.path}")

📊 Business Intelligence

# Discover customer behavior patterns
insights = dataspot.analyze(
    customer_data,
    fields=["region", "device", "product_category", "tier"]
)

print(f"📈 Found {len(insights.patterns)} concentration patterns")
print(f"🎯 Top opportunity: {insights.top_patterns[0].path}")

🔍 Data Quality Analysis

# Find data quality issues
concentrations = dataspot.find(user_logs, ["source", "event", "status"])

# Look for unusual concentrations that might indicate data issues
anomalies = [p for p in concentrations if p.percentage > 80]
for anomaly in anomalies:
    print(f"⚠️ Possible data quality issue: {anomaly.path}")

🛠️ Advanced Usage

Flexible Filtering

# Complex analysis with multiple criteria
results = dataspot.query(
    min_percentage=10,        # Only patterns with >10% concentration
    max_depth=3,              # Limit hierarchy depth
    contains="mobile",        # Must contain "mobile" in pattern
    min_count=50,             # At least 50 records
    sort_by="concentration",  # Sort by concentration strength
)
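
To make the meaning of these criteria concrete, here is a minimal pure-Python sketch that applies the same kinds of filters to a list of pattern records. It does not call Dataspot: the `Pattern` dataclass and `filter_patterns` function are hypothetical stand-ins, the sample values are illustrative, and the library's own matching rules (for example, whether thresholds are inclusive) may differ.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    """Hypothetical stand-in for a concentration pattern (not Dataspot's class)."""
    path: str
    percentage: float
    count: int

    @property
    def depth(self) -> int:
        # Depth is the number of "field=value" segments in the path.
        return len(self.path.split(" > "))


def filter_patterns(patterns, min_percentage=0.0, max_depth=None,
                    contains=None, min_count=0):
    """Keep patterns meeting every criterion, sorted by percentage (descending)."""
    kept = [
        p for p in patterns
        if p.percentage >= min_percentage
        and p.count >= min_count
        and (max_depth is None or p.depth <= max_depth)
        and (contains is None or contains in p.path)
    ]
    return sorted(kept, key=lambda p: p.percentage, reverse=True)


# Illustrative sample values, not real results.
patterns = [
    Pattern("country=US > device=mobile", 52.1, 146),
    Pattern("device=mobile", 67.8, 190),
    Pattern("country=EU > device=desktop", 8.0, 22),
]
selected = filter_patterns(patterns, min_percentage=10, max_depth=3,
                           contains="mobile", min_count=50)
for p in selected:
    print(p.path, p.percentage, p.count)
```

Combining the criteria with a logical AND is a design assumption of this sketch; it keeps each filter independent and easy to reason about.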

Builder Pattern for Complex Queries

from dataspot import QueryBuilder

# Fluent interface for complex filtering
high_value_patterns = (
    QueryBuilder(dataspot)
    .field("country", "US")
    .min_percentage(20)
    .exclude(["test", "internal"])
    .sort_by("percentage")
    .limit(10)
    .execute()
)

Custom Analysis

# Add custom preprocessing
def extract_hour(timestamp):
    return timestamp.split("T")[1][:2]  # Extract hour from ISO timestamp

dataspot.add_preprocessor("timestamp", extract_hour)

# Now timestamp field will be analyzed by hour
patterns = dataspot.find(events, ["user_type", "timestamp", "action"])
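
The string-splitting preprocessor above assumes every timestamp contains a "T" separator and raises an error otherwise. A more defensive variant can be sketched with the standard library's `datetime.fromisoformat`; the "unknown" fallback bucket is an assumption of this sketch, not something Dataspot requires.

```python
from datetime import datetime


def extract_hour(timestamp):
    """Return the two-digit hour of an ISO 8601 timestamp, or "unknown"."""
    try:
        return f"{datetime.fromisoformat(timestamp).hour:02d}"
    except (TypeError, ValueError):
        # Non-string or malformed values go into a single fallback bucket.
        return "unknown"


print(extract_hour("2024-03-01T14:35:00"))  # 14
print(extract_hour("2024-03-01 09:05:00"))  # 09
print(extract_hour("not-a-timestamp"))      # unknown
```

Grouping unparseable values under one label keeps them visible in the analysis instead of failing the whole run.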

⚡ Performance

Dataspot is built for speed and scale. Processing time and memory usage grow roughly linearly with the number of records, as the benchmarks below show.

🚀 Blazing Fast Performance

Dataset Size  | Processing Time | Memory Usage
--------------|-----------------|-------------
1K records    | ~3ms            | ~2MB
10K records   | ~30ms           | ~15MB
100K records  | ~300ms          | ~150MB
1M records    | ~3s             | ~1.5GB

📊 Algorithm Complexity

  • Time Complexity: O(n × f) where n = records, f = fields
  • Space Complexity: O(n × f) linear memory usage
  • Scalability: Linear scaling - predictable performance growth
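
As a rough illustration of where an O(n × f) bound can come from, the sketch below counts hierarchical paths in one pass over the records, with one counter update per field. It is a simplified stand-in, not Dataspot's actual algorithm: `count_paths` is a hypothetical helper, it only follows the field order it is given, and it ignores filtering and sorting.

```python
from collections import Counter


def count_paths(records, fields):
    """Count every hierarchical prefix path: n records x f counter updates."""
    counts = Counter()
    for record in records:
        parts = []
        for field in fields:
            parts.append(f"{field}={record.get(field)}")
            # Building the key string adds a small extra cost per update.
            counts[" > ".join(parts)] += 1
    return counts


records = [
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "mobile"},
    {"country": "EU", "device": "desktop"},
    {"country": "US", "device": "desktop"},
]
counts = count_paths(records, ["country", "device"])
total = len(records)
for path, count in counts.most_common():
    print(f"{path}: {count / total:.1%} ({count} records)")
```

Each record contributes exactly one update per field, so the total number of updates is n × f, which matches the linear scaling described above.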

🔥 Built for Production

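import time
from dataspot import Dataspot

# Large dataset example
# generate_transactions is not defined in this snippet; substitute your own data source
data = generate_transactions(100_000)  # 100K records
fields = ["country", "device", "payment_method", "user_tier"]

dataspot = Dataspot()
start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
patterns = dataspot.find(data, fields, min_percentage=5)
duration = time.perf_counter() - start

print(f"Analyzed 100K records in {duration:.2f}s")
# Example output: Analyzed 100K records in 0.31s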
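
The benchmark above calls `generate_transactions` without defining it. One possible stand-in, sketched here with the standard library only, builds synthetic records with the four fields used in the benchmark. The function body, the value lists, and the `seed` parameter are all assumptions for illustration and are not part of Dataspot.

```python
import random
import time


def generate_transactions(n, seed=0):
    """Hypothetical helper: build n synthetic transaction records."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    countries = ["US", "EU", "CA", "BR"]
    devices = ["mobile", "desktop", "tablet"]
    methods = ["card", "wallet", "bank_transfer"]
    tiers = ["free", "premium"]
    return [
        {
            "country": rng.choice(countries),
            "device": rng.choice(devices),
            "payment_method": rng.choice(methods),
            "user_tier": rng.choice(tiers),
        }
        for _ in range(n)
    ]


start = time.perf_counter()
data = generate_transactions(100_000)
elapsed = time.perf_counter() - start
print(f"Generated {len(data):,} records in {elapsed:.2f}s")
```

Uniformly random values produce few strong concentrations, so timings on such data may differ from timings on real, skewed data.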

💡 Performance Tips

Optimize for Speed:

# Use filtering to reduce pattern count
patterns = dataspot.find(
    data,
    fields,
    min_percentage=10,    # Skip low-concentration patterns
    max_depth=3,         # Limit hierarchy depth
    limit=100           # Cap results
)

Memory Efficiency:

# Process large datasets in chunks
def analyze_large_dataset(data, fields, chunk_size=50_000):
    results = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        # Each pattern's percentage is relative to its chunk, not the full dataset
        patterns = dataspot.find(chunk, fields)
        results.extend(patterns)
    return results
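
Because each chunk is analyzed on its own, the percentages in the combined list are relative to a chunk rather than to the whole dataset. One way around this, sketched here without Dataspot and for a single field only, is to sum raw counts across chunks and compute percentages once at the end. `merged_percentages` is a hypothetical helper.

```python
from collections import Counter


def merged_percentages(chunks, field):
    """Sum per-chunk value counts, then compute percentages over all records."""
    counts = Counter()
    total = 0
    for chunk in chunks:
        counts.update(record[field] for record in chunk)
        total += len(chunk)
    return {value: 100 * count / total for value, count in counts.items()}


chunks = [
    [{"device": "mobile"}] * 3 + [{"device": "desktop"}],  # 75% mobile in this chunk
    [{"device": "mobile"}] + [{"device": "desktop"}] * 3,  # 25% mobile in this chunk
]
print(merged_percentages(chunks, "device"))  # {'mobile': 50.0, 'desktop': 50.0}
```

Only the counters are held across chunks, so memory stays proportional to the number of distinct values rather than the number of records.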

🎯 When to Use Dataspot

✅ Perfect for:

  • Real-time fraud detection (millisecond response times)
  • Large-scale business intelligence
  • High-frequency pattern analysis
  • Production systems with strict performance requirements

⚠️ Consider alternatives for:

  • Simple data grouping (use pandas DataFrame.groupby())
  • One-time data exploration (any tool works)
  • Very small datasets (<100 records)

Benchmarks run on standard hardware (Intel i7, 16GB RAM). Your results may vary.

📈 What Makes Dataspot Different?

Traditional Clustering     | Dataspot Analysis
---------------------------|----------------------------------
Groups similar data points | Finds concentration patterns
Equal-sized clusters       | Identifies where data accumulates
Distance-based             | Percentage and count based
Hard to interpret          | Business-friendly hierarchy
Generic approach           | Built for real-world analysis

Dataspot in action

[Figure: Dataspot in action, discovering data concentration patterns and dataspots in real time]

🔧 Installation & Requirements

# Install from PyPI
pip install dataspot

# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"

Requirements:

  • Python 3.9+
  • No heavy dependencies (just standard library + optional speedups)

🛠️ Development Commands

The project includes a Makefile with useful development commands:

Command           | Description
------------------|--------------------------------------------------------
make lint         | Check code for style and quality issues
make lint-fix     | Automatically fix linting issues where possible
make tests        | Run all tests with coverage reporting
make check        | Run both linting and tests
make clean        | Remove cache files, build artifacts, and temporary files
make venv-clean   | Remove the virtual environment
make venv-create  | Create a new virtual environment with Python 3.9+
make venv-install | Install the uv package manager
make install      | Create virtual environment and install the dependencies

📚 Documentation

🌟 Why Open Source?

Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors. By open-sourcing Dataspot, we hope to:

  • 🎯 Advance fraud detection across the industry
  • 🤝 Enable collaboration on pattern analysis techniques
  • 🔍 Help companies spot issues in their data
  • 📈 Improve data quality everywhere

🤝 Contributing

We welcome contributions! Whether you're:

  • 🐛 Reporting bugs
  • 💡 Suggesting features
  • 📝 Improving documentation
  • 🔧 Adding new analysis methods

See our Contributing Guide for details.

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Created by @eliosf27 - Original algorithm and implementation
  • Sponsored by Frauddi - Production testing and open source support
  • Inspired by real fraud detection challenges - Built to solve actual problems

🔗 Links

Find your data's dataspots. Discover what others miss. Built with ❤️ by Frauddi
