Find data concentration patterns and dataspots. Built for fraud detection and risk analysis.

Project description

Dataspot 🔥

Find data concentration patterns and dataspots in your datasets

PyPI version | License: MIT | Maintained by Frauddi | Python 3.9+

Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.

✨ Why Dataspot?

  • 🎯 Purpose-built for finding data concentrations, not just clustering
  • 🔍 Fraud detection ready - spot suspicious behavior patterns
  • ⚡ Simple API - get insights in 3 lines of code
  • 📊 Hierarchical analysis - understand data at multiple levels
  • 🔧 Flexible filtering - customize analysis with powerful options
  • 📈 Production tested - battle-tested in real fraud detection systems

🚀 Quick Start

pip install dataspot

import dataspot

# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": 150, "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": 200, "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": 50, "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": 300, "user_type": "premium"},
    # ... more data
]

# Find concentration dataspots
# Note: this rebinds the name `dataspot` from the module to a Dataspot instance;
# the examples that follow appear to call methods on this instance
dataspot = dataspot.Dataspot()
concentrations = dataspot.find(data, fields=["country", "device", "user_type"])

# Results show where data concentrates
for pattern in concentrations[:5]:
    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")

# Example output (with the full dataset, not just the four sample rows):
# country=US > device=mobile > user_type=premium → 45.2% (127 records)
# country=US > device=mobile → 52.1% (146 records)
# device=mobile → 67.8% (190 records)
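
To make the idea of a concentration pattern concrete, here is a minimal standard-library sketch that counts every hierarchical path in field order and reports its share of the records. It is an illustration only, not Dataspot's implementation: the function name `concentration_paths` is hypothetical, and the real library may consider paths this sketch does not (for example, paths that start at a later field).

```python
from collections import Counter


def concentration_paths(records, fields):
    """Count each hierarchical prefix path and return (path, percentage, count) rows.

    Hypothetical helper for illustration; not part of the dataspot package.
    """
    total = len(records)
    if total == 0:
        return []
    counts = Counter()
    for record in records:
        parts = []
        for field in fields:
            # Extend the path one field at a time: a, a > b, a > b > c
            parts.append(f"{field}={record.get(field)}")
            counts[" > ".join(parts)] += 1
    rows = [(path, round(100 * n / total, 1), n) for path, n in counts.items()]
    # Highest share first; ties are broken alphabetically by path
    return sorted(rows, key=lambda row: (-row[1], row[0]))


sample = [
    {"country": "US", "device": "mobile", "user_type": "premium"},
    {"country": "US", "device": "mobile", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "user_type": "free"},
    {"country": "US", "device": "mobile", "user_type": "premium"},
]
top = concentration_paths(sample, ["country", "device", "user_type"])
for path, percentage, count in top[:3]:
    print(f"{path} → {percentage}% ({count} records)")
```

With these four rows, three of the four records share the US / mobile / premium path, so each prefix of that path covers 75% of the data.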

🎯 Real-World Use Cases

🚨 Fraud Detection

# Find suspicious transaction patterns
suspicious = dataspot.find(
    transactions,
    fields=["country", "payment_method", "time_of_day"],
    min_percentage=15  # Only significant concentrations
)

# Spot unusual concentrations that might indicate fraud
for pattern in suspicious:
    if pattern.percentage > 30:
        print(f"⚠️ High concentration: {pattern.path}")

📊 Business Intelligence

# Discover customer behavior patterns
insights = dataspot.analyze(
    customer_data,
    fields=["region", "device", "product_category", "tier"]
)

print(f"📈 Found {len(insights.patterns)} concentration patterns")
print(f"🎯 Top opportunity: {insights.top_patterns[0].path}")

🔍 Data Quality Analysis

# Find data quality issues
concentrations = dataspot.find(user_logs, ["source", "event", "status"])

# Look for unusual concentrations that might indicate data issues
anomalies = [p for p in concentrations if p.percentage > 80]
for anomaly in anomalies:
    print(f"⚠️ Possible data quality issue: {anomaly.path}")

🛠️ Advanced Usage

Flexible Filtering

# Complex analysis with multiple criteria
results = dataspot.query(
    min_percentage=10,        # Only patterns with >10% concentration
    max_depth=3,              # Limit hierarchy depth
    contains="mobile",        # Must contain "mobile" in pattern
    min_count=50,             # At least 50 records
    sort_by="concentration"   # Sort by concentration strength
)
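
This README does not spell out how each filter option behaves, so the following is only a sketch of one plausible reading, written against plain tuples rather than the library. The `Pattern` type and `filter_patterns` function are hypothetical; they mirror the `path`, `percentage`, and `count` attributes used in the examples above, and they assume depth means the number of `field=value` segments in a path.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    """Hypothetical stand-in for a result row; not the library's class."""
    path: str
    percentage: float
    count: int


def filter_patterns(patterns, min_percentage=0.0, min_count=0,
                    max_depth=None, contains=None):
    """Keep patterns that satisfy every supplied criterion, highest share first."""
    kept = []
    for pattern in patterns:
        depth = pattern.path.count(" > ") + 1  # number of field=value segments
        if pattern.percentage < min_percentage or pattern.count < min_count:
            continue
        if max_depth is not None and depth > max_depth:
            continue
        if contains is not None and contains not in pattern.path:
            continue
        kept.append(pattern)
    return sorted(kept, key=lambda p: p.percentage, reverse=True)


candidates = [
    Pattern("country=US > device=mobile", 52.1, 146),
    Pattern("device=mobile", 67.8, 190),
    Pattern("country=EU > device=desktop", 8.0, 22),
    Pattern("country=US > device=mobile > user_type=premium > tier=gold", 45.2, 127),
]
selected = filter_patterns(candidates, min_percentage=10, max_depth=3,
                           contains="mobile", min_count=50)
```

Here the EU pattern is dropped by the percentage threshold and the four-level pattern by the depth limit, leaving the two shallower mobile patterns.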

Builder Pattern for Complex Queries

from dataspot import QueryBuilder

# Fluent interface for complex filtering
high_value_patterns = (
    QueryBuilder(dataspot)
    .field("country", "US")
    .min_percentage(20)
    .exclude(["test", "internal"])
    .sort_by("percentage")
    .limit(10)
    .execute()
)
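
For readers unfamiliar with fluent builders, the sketch below shows the general technique: each method records a criterion and returns `self`, and nothing is evaluated until `execute()`. The `PatternQuery` class is hypothetical and works on plain `(path, percentage, count)` tuples; it is not how `QueryBuilder` is implemented and supports only a few of the methods shown above.

```python
class PatternQuery:
    """Hypothetical fluent filter over (path, percentage, count) tuples."""

    def __init__(self, patterns):
        self._patterns = list(patterns)
        self._predicates = []
        self._limit = None

    def min_percentage(self, value):
        self._predicates.append(lambda p: p[1] >= value)
        return self  # returning self is what allows chaining

    def exclude(self, terms):
        self._predicates.append(lambda p: not any(t in p[0] for t in terms))
        return self

    def limit(self, n):
        self._limit = n
        return self

    def execute(self):
        # Apply every recorded predicate, sort by percentage, then truncate
        kept = [p for p in self._patterns
                if all(predicate(p) for predicate in self._predicates)]
        kept.sort(key=lambda p: p[1], reverse=True)
        return kept[: self._limit]


rows = [
    ("country=US", 60.0, 6),
    ("country=US > source=test", 30.0, 3),
    ("country=EU", 10.0, 1),
]
result = (
    PatternQuery(rows)
    .min_percentage(20)
    .exclude(["test", "internal"])
    .limit(10)
    .execute()
)
```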

Custom Analysis

# Add custom preprocessing
def extract_hour(timestamp):
    return timestamp.split("T")[1][:2]  # Extract hour from ISO timestamp

dataspot.add_preprocessor("timestamp", extract_hour)

# Now timestamp field will be analyzed by hour
patterns = dataspot.find(events, ["user_type", "timestamp", "action"])
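
The effect of a preprocessor can be previewed without the library. The sketch below assumes timestamps are ISO 8601 strings and uses `datetime.fromisoformat` instead of string splitting, so a malformed value raises an error rather than being silently mis-sliced. The `apply_preprocessors` helper is hypothetical and only illustrates the idea of transforming a field before it is counted.

```python
from collections import Counter
from datetime import datetime


def extract_hour(timestamp: str) -> str:
    """Return the zero-padded hour of an ISO 8601 timestamp."""
    return f"{datetime.fromisoformat(timestamp).hour:02d}"


def apply_preprocessors(record, preprocessors):
    """Return a copy of record with each registered field transformed.

    Hypothetical helper; not part of the dataspot package.
    """
    return {
        field: preprocessors[field](value) if field in preprocessors else value
        for field, value in record.items()
    }


events = [
    {"user_type": "free", "timestamp": "2024-03-01T09:15:00", "action": "login"},
    {"user_type": "free", "timestamp": "2024-03-01T09:48:30", "action": "login"},
    {"user_type": "premium", "timestamp": "2024-03-01T17:02:10", "action": "purchase"},
]
prepared = [apply_preprocessors(e, {"timestamp": extract_hour}) for e in events]
hour_counts = Counter(e["timestamp"] for e in prepared)
```

After preprocessing, the two morning logins fall into the same hour bucket, which is what lets them show up as one concentration instead of two distinct timestamps.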

⚡ Performance

Dataspot is built for speed and scale. In the benchmarks below, processing time and memory usage grow roughly linearly with the number of records, from about 3 ms for 1K records to about 3 s for 1M records.

🚀 Blazing Fast Performance

| Dataset Size | Processing Time | Memory Usage |
|--------------|-----------------|--------------|
| 1K records   | ~3ms            | ~2MB         |
| 10K records  | ~30ms           | ~15MB        |
| 100K records | ~300ms          | ~150MB       |
| 1M records   | ~3s             | ~1.5GB       |

📊 Algorithm Complexity

  • Time Complexity: O(n × f) where n = records, f = fields
  • Space Complexity: O(n × f) linear memory usage
  • Scalability: Linear scaling - predictable performance growth
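
As a quick arithmetic check, the figures in the benchmark table above can be divided by the record count to see whether the per-record cost stays constant, which is what linear scaling implies. The numbers below are copied from that table (which gives them as approximations), and 1 MB is taken as 1,000 kB.

```python
# (records, time in ms, memory in MB), copied from the benchmark table
benchmarks = [
    (1_000, 3, 2),
    (10_000, 30, 15),
    (100_000, 300, 150),
    (1_000_000, 3_000, 1_500),
]

# Convert to microseconds per record and kilobytes per record
per_record = [(n, ms * 1_000 / n, mb * 1_000 / n) for n, ms, mb in benchmarks]

for n, microseconds, kilobytes in per_record:
    print(f"{n:>9,} records: {microseconds:.1f} µs/record, {kilobytes:.1f} kB/record")
```

The time works out to about 3 µs per record at every size, and memory to roughly 1.5 to 2 kB per record, consistent with the linear-scaling claim.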

🔥 Built for Production

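import time
import dataspot

# Large dataset example
# generate_transactions is a placeholder for your own data loader; it is not defined here
data = generate_transactions(100_000)  # 100K records
fields = ["country", "device", "payment_method", "user_tier"]

# Create an analyzer instance, as in the Quick Start
analyzer = dataspot.Dataspot()

start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
patterns = analyzer.find(data, fields, min_percentage=5)
duration = time.perf_counter() - start

print(f"Analyzed 100K records in {duration:.2f}s")
# Output: Analyzed 100K records in 0.31s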
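
The example above calls `generate_transactions`, which this README does not define. A hypothetical stand-in might look like the sketch below; the field values and weights are invented purely so that the synthetic data has some concentration to find, and a fixed seed keeps runs repeatable.

```python
import random


def generate_transactions(n, seed=0):
    """Return n synthetic transaction dicts (hypothetical helper, invented values)."""
    rng = random.Random(seed)  # private generator so the global RNG is untouched
    countries = ["US", "EU", "BR", "IN"]
    devices = ["mobile", "desktop", "tablet"]
    methods = ["card", "wallet", "bank_transfer"]
    tiers = ["free", "premium"]
    return [
        {
            # Weighted choice skews the data toward US so a concentration exists
            "country": rng.choices(countries, weights=[6, 2, 1, 1])[0],
            "device": rng.choice(devices),
            "payment_method": rng.choice(methods),
            "user_tier": rng.choice(tiers),
        }
        for _ in range(n)
    ]


transactions = generate_transactions(1_000)
```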

💡 Performance Tips

Optimize for Speed:

# Use filtering to reduce pattern count
patterns = dataspot.find(
    data,
    fields,
    min_percentage=10,    # Skip low-concentration patterns
    max_depth=3,          # Limit hierarchy depth
    limit=100             # Cap results
)

Memory Efficiency:

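# Process large datasets in chunks
def analyze_large_dataset(data, fields, chunk_size=50_000):
    """Run find() on successive slices of data so only one chunk is analyzed at a time."""
    results = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        # Note: each result describes this chunk only, not the full dataset
        patterns = dataspot.find(chunk, fields)
        results.extend(patterns)
    return results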
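
One caveat with chunking is that results computed per chunk are not directly comparable: the same path can appear once per chunk, each time with a percentage relative to that chunk alone. A minimal standard-library sketch of one way around this is to merge raw counts across chunks and compute percentages once at the end. The helpers `count_paths` and `merge_chunked_counts` are hypothetical and do not use Dataspot's API.

```python
from collections import Counter


def count_paths(records, fields):
    """Count every hierarchical prefix path in one chunk (hypothetical helper)."""
    counts = Counter()
    for record in records:
        parts = []
        for field in fields:
            parts.append(f"{field}={record.get(field)}")
            counts[" > ".join(parts)] += 1
    return counts


def merge_chunked_counts(data, fields, chunk_size=50_000):
    """Sum path counts chunk by chunk, then compute percentages over all records."""
    totals = Counter()
    for i in range(0, len(data), chunk_size):
        totals.update(count_paths(data[i:i + chunk_size], fields))
    n = len(data)
    if n == 0:
        return {}
    return {path: (count, round(100 * count / n, 1)) for path, count in totals.items()}


rows = (
    [{"country": "US", "device": "mobile"}] * 3
    + [{"country": "EU", "device": "desktop"}] * 2
)
merged = merge_chunked_counts(rows, ["country", "device"], chunk_size=2)
```

Because only the counters are kept between chunks, the result is the same whatever chunk size is used.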

🎯 When to Use Dataspot

✅ Perfect for:

  • Real-time fraud detection (roughly 3 ms per 1K records in the benchmarks above)
  • Large-scale business intelligence
  • High-frequency pattern analysis
  • Production systems with strict performance requirements

⚠️ Consider alternatives for:

  • Simple data grouping (use pandas DataFrame.groupby())
  • One-time data exploration (any tool works)
  • Very small datasets (<100 records)

Benchmarks run on standard hardware (Intel i7, 16GB RAM). Your results may vary.

📈 What Makes Dataspot Different?

| Traditional Clustering     | Dataspot Analysis                 |
|----------------------------|-----------------------------------|
| Groups similar data points | Finds concentration patterns      |
| Equal-sized clusters       | Identifies where data accumulates |
| Distance-based             | Percentage and count based        |
| Hard to interpret          | Business-friendly hierarchy       |
| Generic approach           | Built for real-world analysis     |

[Figure: Dataspot in action, discovering data concentration patterns and dataspots in real time]

🔧 Installation & Requirements

# Install from PyPI
pip install dataspot

# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"

Requirements:

  • Python 3.9+
  • No heavy dependencies (just standard library + optional speedups)

🛠️ Development Commands

The project includes a Makefile with useful development commands:

| Command           | Description                                               |
|-------------------|-----------------------------------------------------------|
| make lint         | Check code for style and quality issues                   |
| make lint-fix     | Automatically fix linting issues where possible           |
| make tests        | Run all tests with coverage reporting                     |
| make check        | Run both linting and tests                                |
| make clean        | Remove cache files, build artifacts, and temporary files  |
| make venv-clean   | Remove the virtual environment                            |
| make venv-create  | Create a new virtual environment with Python 3.9+         |
| make venv-install | Install the uv package manager                            |
| make install      | Create virtual environment and install the dependencies   |

🌟 Why Open Source?

Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors. By open-sourcing Dataspot, we hope to:

  • 🎯 Advance fraud detection across the industry
  • 🤝 Enable collaboration on pattern analysis techniques
  • 🔍 Help companies spot issues in their data
  • 📈 Improve data quality everywhere

🤝 Contributing

We welcome contributions! Whether you're:

  • 🐛 Reporting bugs
  • 💡 Suggesting features
  • 📝 Improving documentation
  • 🔧 Adding new analysis methods

See our Contributing Guide for details.

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Created by @eliosf27 - Original algorithm and implementation
  • Sponsored by Frauddi - Production testing and open source support
  • Inspired by real fraud detection challenges - Built to solve actual problems

Find your data's dataspots. Discover what others miss. Built with ❤️ by Frauddi

Download files

Source Distribution

dataspot-0.3.1.tar.gz (302.6 kB)

Uploaded Source

Built Distribution

dataspot-0.3.1-py3-none-any.whl (33.2 kB)

Uploaded Python 3

File details

Details for the file dataspot-0.3.1.tar.gz.

File metadata

  • Download URL: dataspot-0.3.1.tar.gz
  • Upload date:
  • Size: 302.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for dataspot-0.3.1.tar.gz
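| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 66f7fc288b19d9056b70ea6572dbf011f88c7360905b5340a7eb9d7ac4b6e197 |
| MD5         | 1cb837acfbcc85c2c3f331ffcb656321                                 |
| BLAKE2b-256 | 5a726023f5305b58b56af7fbd8ed7ef516955d6c4889c5fa053e75d500c36062 |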

File details

Details for the file dataspot-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: dataspot-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 33.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for dataspot-0.3.1-py3-none-any.whl
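| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 3b8c85d7a8f807a5594f695e68af7473cac3cea81bdf4b602f9966a21c44a7fc |
| MD5         | d7e7cb9f2186a48e4283f5ffe3ad05df                                 |
| BLAKE2b-256 | f92bf6ef6e3ab25cb3c9df338754e51f4ae827ecb3e2a9cdc3ba47717d5051dd |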
