Skip to main content

Find data concentration patterns and dataspots. Built for fraud detection and risk analysis.

Project description

Dataspot ๐Ÿ”ฅ

Find data concentration patterns and dataspots in your datasets Created by @eliosf27 at Frauddi

PyPI version License: MIT Maintained by Frauddi Python 3.8+

Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.

โœจ Why Dataspot?

  • ๐ŸŽฏ Purpose-built for finding data concentrations, not just clustering
  • ๐Ÿ” Fraud detection ready - spot suspicious behavior patterns
  • โšก Simple API - get insights in 3 lines of code
  • ๐Ÿ“Š Hierarchical analysis - understand data at multiple levels
  • ๐Ÿ”ง Flexible filtering - customize analysis with powerful options
  • ๐Ÿ“ˆ Production tested - battle-tested in real fraud detection systems

๐Ÿš€ Quick Start

pip install dataspot
import dataspot

# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": 150, "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": 200, "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": 50, "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": 300, "user_type": "premium"},
    # ... more data
]

# Find concentration dataspots
dataspot = dataspot.Dataspot()
concentrations = dataspot.find(data, fields=["country", "device", "user_type"])

# Results show where data concentrates
for pattern in concentrations[:5]:
    print(f"{pattern.path} โ†’ {pattern.percentage}% ({pattern.count} records)")

# Output:
# country=US > device=mobile > user_type=premium โ†’ 45.2% (127 records)
# country=US > device=mobile โ†’ 52.1% (146 records)
# device=mobile โ†’ 67.8% (190 records)

๐ŸŽฏ Real-World Use Cases

๐Ÿšจ Fraud Detection

# Find suspicious transaction patterns
suspicious = dataspot.find(
    transactions,
    fields=["country", "payment_method", "time_of_day"],
    min_percentage=15  # Only significant concentrations
)

# Spot unusual concentrations that might indicate fraud
for pattern in suspicious:
    if pattern.percentage > 30:
        print(f"โš ๏ธ High concentration: {pattern.path}")

๐Ÿ“Š Business Intelligence

# Discover customer behavior patterns
insights = dataspot.analyze(
    customer_data,
    fields=["region", "device", "product_category", "tier"]
)

print(f"๐Ÿ“ˆ Found {len(insights.patterns)} concentration patterns")
print(f"๐ŸŽฏ Top opportunity: {insights.top_patterns[0].path}")

๐Ÿ” Data Quality Analysis

# Find data quality issues
concentrations = dataspot.find(user_logs, ["source", "event", "status"])

# Look for unusual concentrations that might indicate data issues
anomalies = [p for p in concentrations if p.percentage > 80]
for anomaly in anomalies:
    print(f"โš ๏ธ Possible data quality issue: {anomaly.path}")

๐Ÿ› ๏ธ Advanced Usage

Flexible Filtering

# Complex analysis with multiple criteria
results = dataspot.query(
    min_percentage=10,          # Only patterns with >10% concentration
    max_depth=3,               # Limit hierarchy depth
    contains="mobile",         # Must contain "mobile" in pattern
    min_count=50,             # At least 50 records
    sort_by="concentration"   # Sort by concentration strength
)

Builder Pattern for Complex Queries

from dataspot import QueryBuilder

# Fluent interface for complex filtering
high_value_patterns = QueryBuilder(dataspot) \
    .field("country", "US") \
    .min_percentage(20) \
    .exclude(["test", "internal"]) \
    .sort_by("percentage") \
    .limit(10) \
    .execute()

Custom Analysis

# Add custom preprocessing
def extract_hour(timestamp):
    return timestamp.split("T")[1][:2]  # Extract hour from ISO timestamp

dataspot.add_preprocessor("timestamp", extract_hour)

# Now timestamp field will be analyzed by hour
patterns = dataspot.find(events, ["user_type", "timestamp", "action"])

โšก Performance

Dataspot is built for speed and scale. Our optimized algorithm delivers exceptional performance across datasets of any size.

๐Ÿš€ Blazing Fast Performance

Dataset Size Processing Time Memory Usage
1K records ~3ms ~2MB
10K records ~30ms ~15MB
100K records ~300ms ~150MB
1M records ~3s ~1.5GB

๐Ÿ“Š Algorithm Complexity

  • Time Complexity: O(n ร— f) where n = records, f = fields
  • Space Complexity: O(n ร— f) linear memory usage
  • Scalability: Linear scaling - predictable performance growth

๐Ÿ”ฅ Built for Production

import time
import dataspot

# Large dataset example
data = generate_transactions(100_000)  # 100K records
fields = ["country", "device", "payment_method", "user_tier"]

start = time.time()
patterns = dataspot.find(data, fields, min_percentage=5)
duration = time.time() - start

print(f"Analyzed 100K records in {duration:.2f}s")
# Output: Analyzed 100K records in 0.31s

๐Ÿ’ก Performance Tips

Optimize for Speed:

# Use filtering to reduce pattern count
patterns = dataspot.find(
    data,
    fields,
    min_percentage=10,    # Skip low-concentration patterns
    max_depth=3,         # Limit hierarchy depth
    limit=100           # Cap results
)

Memory Efficiency:

# Process large datasets in chunks
def analyze_large_dataset(data, chunk_size=50000):
    results = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        patterns = dataspot.find(chunk, fields)
        results.extend(patterns)
    return results

๐ŸŽฏ When to Use Dataspot

โœ… Perfect for:

  • Real-time fraud detection (millisecond response times)
  • Large-scale business intelligence
  • High-frequency pattern analysis
  • Production systems with strict performance requirements

โš ๏ธ Consider alternatives for:

  • Simple data grouping (use pandas.groupby())
  • One-time data exploration (any tool works)
  • Very small datasets (<100 records)

Benchmarks run on standard hardware (Intel i7, 16GB RAM). Your results may vary.

๐Ÿ“ˆ What Makes Dataspot Different?

Traditional Clustering Dataspot Analysis
Groups similar data points Finds concentration patterns
Equal-sized clusters Identifies where data accumulates
Distance-based Percentage and count based
Hard to interpret Business-friendly hierarchy
Generic approach Built for real-world analysis

๐Ÿ”ง Installation & Requirements

# Install from PyPI
pip install dataspot

# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"

Requirements:

  • Python 3.8+
  • No heavy dependencies (just standard library + optional speedups)

๐Ÿ› ๏ธ Development Commands

The project includes a Makefile with useful development commands:

Command Description
make lint Check code for style and quality issues
make lint-fix Automatically fix linting issues where possible
make tests Run all tests with coverage reporting
make check Run both linting and tests
make clean Remove cache files, build artifacts, and temporary files
make venv-clean Remove the virtual environment
make venv-create Create a new virtual environment with Python 3.11
make venv-install Install the uv package manager
make install Create virtual environment and install the dependencies

๐Ÿ“š Documentation

๐ŸŒŸ Why Open Source?

Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors. By open-sourcing Dataspot, we hope to:

  • ๐ŸŽฏ Advance fraud detection across the industry
  • ๐Ÿค Enable collaboration on pattern analysis techniques
  • ๐Ÿ” Help companies spot issues in their data
  • ๐Ÿ“ˆ Improve data quality everywhere

๐Ÿค Contributing

We welcome contributions! Whether you're:

  • ๐Ÿ› Reporting bugs
  • ๐Ÿ’ก Suggesting features
  • ๐Ÿ“ Improving documentation
  • ๐Ÿ”ง Adding new analysis methods

See our Contributing Guide for details.

๐Ÿ“„ License

MIT License - see LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Created by @eliosf27 - Original algorithm and implementation
  • Sponsored by Frauddi - Production testing and open source support
  • Inspired by real fraud detection challenges - Built to solve actual problems

๐Ÿ”— Links



Find your data's dataspots. Discover what others miss. Built with โค๏ธ by Frauddi

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataspot-0.1.0.tar.gz (69.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dataspot-0.1.0-py3-none-any.whl (12.9 kB view details)

Uploaded Python 3

File details

Details for the file dataspot-0.1.0.tar.gz.

File metadata

  • Download URL: dataspot-0.1.0.tar.gz
  • Upload date:
  • Size: 69.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for dataspot-0.1.0.tar.gz
Algorithm Hash digest
SHA256 443f27636ee0667fa47ff3cce14f46c7e1cc458ed32b5f0dc63d4058f0a5f2a8
MD5 a0f77d97344d90f762e1f1c4a51d34c9
BLAKE2b-256 01266c81c346dd878301ba537c343d4823315f669a8de1220cbf44cc444609d3

See more details on using hashes here.

File details

Details for the file dataspot-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: dataspot-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 12.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for dataspot-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 af7d60c986e24955a0171c2d5d5bb57a0e8d54387543d8ceb1619460c9defb67
MD5 29aaab65a3d28f51d2bedd5b2ab7e46f
BLAKE2b-256 6b1d7213e5df3193e34ef0ec136764bb2854560002c7a11af6e74cc76d44cdf0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page