Find data concentration patterns and dataspots. Built for fraud detection and risk analysis.

Dataspot 🔥

Find data concentration patterns and dataspots in your datasets

PyPI version License: MIT Maintained by Frauddi Python 3.9+

Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.

✨ Why Dataspot?

  • 🎯 Purpose-built for finding data concentrations, not just clustering
  • 🔍 Fraud detection ready - spot suspicious behavior patterns
  • ⚡ Simple API - get insights in 3 lines of code
  • 📊 Hierarchical analysis - understand data at multiple levels
  • 🔧 Flexible filtering - customize analysis with powerful options
  • 📈 Field-tested - validated in real fraud detection systems

🚀 Quick Start

pip install dataspot

from dataspot import Dataspot
from dataspot.models.finder import FindInput, FindOptions

# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": "medium", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": "low", "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
]

# Find concentration patterns
dataspot = Dataspot()
result = dataspot.find(
    FindInput(data=data, fields=["country", "device", "user_type"]),
    FindOptions(min_percentage=10.0, limit=5)
)

# Results show where data concentrates
for pattern in result.patterns:
    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")

# Output:
# country=US > device=mobile > user_type=premium → 75.0% (3 records)
# country=US > device=mobile → 75.0% (3 records)
# device=mobile → 75.0% (3 records)
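
To make the idea of a concentration pattern concrete, here is a minimal standard-library sketch that counts every hierarchical prefix path and reports its share of the records. It is an illustration of the concept only, not Dataspot's implementation; the function name `concentration_patterns` and the returned dictionary keys are hypothetical.

```python
from collections import Counter


def concentration_patterns(records, fields, min_percentage=0.0):
    """Return each hierarchical path with its count and share of all records.

    Hypothetical helper for illustration; not part of the dataspot package.
    """
    counts = Counter()
    for record in records:
        parts = []
        for field in fields:
            # Extend the path one field at a time: a, a > b, a > b > c
            parts.append(f"{field}={record.get(field)}")
            counts[" > ".join(parts)] += 1

    total = len(records)
    patterns = [
        {"path": path, "count": count, "percentage": round(100.0 * count / total, 2)}
        for path, count in counts.items()
        if 100.0 * count / total >= min_percentage
    ]
    # Highest concentration first; ties are broken by path for stable output.
    patterns.sort(key=lambda p: (-p["percentage"], p["path"]))
    return patterns
```

Calling it with the sample `data` above and `fields=["country", "device", "user_type"]` yields one entry per path prefix, which is a convenient way to sanity-check percentages by hand on small inputs.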

🎯 Real-World Use Cases

🚨 Fraud Detection

from dataspot.models.finder import FindInput, FindOptions

# Find suspicious transaction patterns
result = dataspot.find(
    FindInput(
        data=transactions,
        fields=["country", "payment_method", "time_of_day"]
    ),
    FindOptions(min_percentage=15.0, contains="crypto")
)

# Spot unusual concentrations that might indicate fraud
for pattern in result.patterns:
    if pattern.percentage > 30:
        print(f"⚠️ High concentration: {pattern.path}")

📊 Business Intelligence

from dataspot.models.analyzer import AnalyzeInput, AnalyzeOptions

# Discover customer behavior patterns
insights = dataspot.analyze(
    AnalyzeInput(
        data=customer_data,
        fields=["region", "device", "product_category", "tier"]
    ),
    AnalyzeOptions(min_percentage=10.0)
)

print(f"📈 Found {len(insights.patterns)} concentration patterns")
if insights.patterns:  # Guard against an empty result before indexing
    print(f"🎯 Top opportunity: {insights.patterns[0].path}")

🔍 Temporal Analysis

from dataspot.models.compare import CompareInput, CompareOptions

# Compare patterns between time periods
comparison = dataspot.compare(
    CompareInput(
        current_data=this_month_data,
        baseline_data=last_month_data,
        fields=["country", "payment_method"]
    ),
    CompareOptions(
        change_threshold=0.20,
        statistical_significance=True
    )
)

print(f"📊 Changes detected: {len(comparison.changes)}")
print(f"🆕 New patterns: {len(comparison.new_patterns)}")
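
As a rough picture of what a period-over-period comparison involves, the sketch below computes each pattern's share in both datasets and flags relative changes at or above a threshold. It assumes `change_threshold=0.20` means a 20% relative change in share, which is a guess at the semantics; the helper names are hypothetical, and no statistical significance test is attempted.

```python
from collections import Counter


def share_by_path(records, fields):
    """Return {path: fraction of records} for the full field combination."""
    total = len(records)
    if total == 0:
        return {}
    counts = Counter(
        " > ".join(f"{field}={record.get(field)}" for field in fields)
        for record in records
    )
    return {path: count / total for path, count in counts.items()}


def compare_shares(current, baseline, fields, change_threshold=0.20):
    """Flag patterns whose share moved by at least change_threshold (relative)."""
    cur = share_by_path(current, fields)
    base = share_by_path(baseline, fields)

    # Patterns seen now but absent from the baseline.
    new_patterns = sorted(set(cur) - set(base))

    changes = []
    for path in sorted(set(cur) & set(base)):
        relative = (cur[path] - base[path]) / base[path]
        if abs(relative) >= change_threshold:
            changes.append({"path": path, "relative_change": round(relative, 4)})
    return {"changes": changes, "new_patterns": new_patterns}
```

Working with shares rather than raw counts keeps the comparison meaningful when the two periods contain different numbers of records.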

🤖 Auto Discovery

from dataspot.models.discovery import DiscoverInput, DiscoverOptions

# Automatically discover important patterns
discovery = dataspot.discover(
    DiscoverInput(data=transaction_data),
    DiscoverOptions(max_fields=3, min_percentage=15.0)
)

print(f"🎯 Top patterns discovered: {len(discovery.top_patterns)}")
for field_ranking in discovery.field_ranking[:3]:
    print(f"📈 {field_ranking.field}: {field_ranking.score:.2f}")
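
One simple way to think about ranking fields is to score each one by how dominant its most common value is. The sketch below does exactly that with the standard library. The scoring rule is an assumption made for illustration and is not necessarily how Dataspot computes `field_ranking`; `rank_fields` is a hypothetical name.

```python
from collections import Counter


def rank_fields(records, fields=None):
    """Rank fields by the share of records held by their most common value.

    Illustrative scoring only; the real discovery score may differ.
    """
    if not records:
        return []
    if fields is None:
        # Consider every key that appears in any record.
        fields = sorted({key for record in records for key in record})

    ranking = []
    for field in fields:
        counts = Counter(record.get(field) for record in records)
        top_value, top_count = counts.most_common(1)[0]
        ranking.append(
            {"field": field, "top_value": top_value, "score": top_count / len(records)}
        )
    # Most concentrated field first; ties broken alphabetically.
    ranking.sort(key=lambda item: (-item["score"], item["field"]))
    return ranking
```

A field where one value covers most records scores close to 1.0, while a field with evenly spread values scores close to `1 / number_of_values`.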

🛠️ Core Methods

| Method | Purpose | Input Model | Options Model | Output Model |
|---|---|---|---|---|
| find() | Find concentration patterns | FindInput | FindOptions | FindOutput |
| analyze() | Statistical analysis | AnalyzeInput | AnalyzeOptions | AnalyzeOutput |
| compare() | Temporal comparison | CompareInput | CompareOptions | CompareOutput |
| discover() | Auto pattern discovery | DiscoverInput | DiscoverOptions | DiscoverOutput |
| tree() | Hierarchical visualization | TreeInput | TreeOptions | TreeOutput |

Advanced Filtering Options

# Complex analysis with multiple criteria
result = dataspot.find(
    FindInput(
        data=data,
        fields=["country", "device", "payment"],
        query={"country": ["US", "EU"]}  # Pre-filter data
    ),
    FindOptions(
        min_percentage=10.0,      # Only patterns with >10% concentration
        max_depth=3,             # Limit hierarchy depth
        contains="mobile",       # Must contain "mobile" in pattern
        min_count=50,           # At least 50 records
        sort_by="percentage",   # Sort by concentration strength
        limit=20                # Top 20 patterns
    )
)
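
The `query` pre-filter can be pictured as a plain record filter applied before any counting. The sketch below assumes each query key maps to either a single allowed value or a list of allowed values, and that all keys must match. That reading of the semantics is an assumption, and `matches_query` and `prefilter` are hypothetical helpers rather than Dataspot functions.

```python
def matches_query(record, query):
    """Return True if the record satisfies every field constraint in query."""
    for field, allowed in query.items():
        # Accept either a single value or a collection of allowed values.
        if isinstance(allowed, (list, tuple, set)):
            allowed_values = allowed
        else:
            allowed_values = [allowed]
        if record.get(field) not in allowed_values:
            return False
    return True


def prefilter(records, query):
    """Keep only the records that match the query; an empty query keeps all."""
    return [record for record in records if matches_query(record, query)]
```

Because percentages are computed after filtering under this assumption, a pattern's share is relative to the filtered subset, not the full dataset.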

⚡ Performance

In the benchmarks below, processing time scales roughly linearly with dataset size, while memory usage grows much more slowly.

🚀 Real-World Performance

| Dataset Size | Processing Time | Memory Usage |
|---|---|---|
| 1K records | ~4ms | ~1MB |
| 10K records | ~40ms | ~2MB |
| 100K records | ~400ms | ~3MB |
| 1M records | ~4s | ~10MB |

Benchmark details: performance was measured on standard hardware with realistic datasets (multiple fields, mixed data types). Times are averages over multiple runs, and memory usage stays low relative to dataset size.
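
To check figures like these on your own hardware, a small harness along the following lines may help. It uses only `time.perf_counter` and `tracemalloc` from the standard library; the `measure` helper is hypothetical and is not how the numbers above were produced.

```python
import time
import tracemalloc


def measure(func, *args, repeats=3, **kwargs):
    """Return mean wall-clock time and peak traced memory for func(*args, **kwargs)."""
    timings = []
    for _ in range(repeats):
        started = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - started)

    # Measure memory in a separate run so tracing overhead does not skew timings.
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()

    return {"mean_seconds": sum(timings) / len(timings), "peak_bytes": peak_bytes}
```

Pass any callable, for example a `lambda` wrapping a `dataspot.find(...)` call on your own dataset. Note that `tracemalloc` only sees allocations made through Python's allocator, so the reported peak is an approximation.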

💡 Performance Tips

# Optimize for speed
result = dataspot.find(
    FindInput(data=large_dataset, fields=fields),
    FindOptions(
        min_percentage=10.0,    # Skip low-concentration patterns
        max_depth=3,           # Limit hierarchy depth
        limit=100             # Cap results
    )
)

# Memory efficient processing
from dataspot.models.tree import TreeInput, TreeOptions

tree = dataspot.tree(
    TreeInput(data=data, fields=["country", "device"]),
    TreeOptions(min_value=10, top=5)  # Simplified tree
)
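
For intuition about what a hierarchical tree of counts looks like, the sketch below builds one from plain records. The node layout (`name`, `value`, `children`) and the meaning of `top` as "keep the N largest children per node" are assumptions for illustration; the actual `TreeOutput` structure may differ, and `build_tree` is a hypothetical helper.

```python
def build_tree(records, fields, top=None):
    """Build a nested count tree: one level per field, largest children first."""
    root = {"name": "root", "value": len(records), "children": {}}
    for record in records:
        node = root
        for field in fields:
            key = f"{field}={record.get(field)}"
            child = node["children"].setdefault(
                key, {"name": key, "value": 0, "children": {}}
            )
            child["value"] += 1
            node = child

    def finalize(node):
        # Convert the child mapping to a sorted list and optionally trim it.
        children = sorted(
            node["children"].values(), key=lambda c: (-c["value"], c["name"])
        )
        if top is not None:
            children = children[:top]
        return {
            "name": node["name"],
            "value": node["value"],
            "children": [finalize(child) for child in children],
        }

    return finalize(root)
```

Trimming with `top` bounds the size of the output at every level, which is why a simplified tree is cheaper to render and inspect.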

📈 What Makes Dataspot Different?

| Traditional Clustering | Dataspot Analysis |
|---|---|
| Groups similar data points | Finds concentration patterns |
| Equal-sized clusters | Identifies where data accumulates |
| Distance-based | Percentage and count based |
| Hard to interpret | Business-friendly hierarchy |
| Generic approach | Built for real-world analysis |

🎬 Dataspot in Action

Dataspot in action - Finding data concentration patterns

See Dataspot discover concentration patterns and dataspots in real-time with hierarchical analysis and statistical insights.

📊 API Structure

Input Models

  • FindInput - Data and fields for pattern finding
  • AnalyzeInput - Statistical analysis configuration
  • CompareInput - Current vs baseline data comparison
  • DiscoverInput - Automatic pattern discovery
  • TreeInput - Hierarchical tree visualization

Options Models

  • FindOptions - Filtering and sorting for patterns
  • AnalyzeOptions - Statistical analysis parameters
  • CompareOptions - Change detection thresholds
  • DiscoverOptions - Auto-discovery constraints
  • TreeOptions - Tree structure customization

Response Models

All methods return structured responses with:

  • patterns - Found concentration patterns
  • statistics - Analysis metrics
  • metadata - Processing information

🔧 Installation & Requirements

# Install from PyPI
pip install dataspot

# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"

Requirements:

  • Python 3.9+
  • No heavy dependencies (just standard library + optional speedups)
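
After installing, a quick check such as the following can confirm that the package is visible to the interpreter and that the Python version meets the 3.9+ requirement. It relies only on the standard-library `importlib.metadata` and `sys` modules; the helper names are hypothetical.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="dataspot"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def python_is_supported(minimum=(3, 9)):
    """Return True if the running interpreter is at least the minimum version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("dataspot version:", installed_version() or "not installed")
    print("Python 3.9+:", python_is_supported())
```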

🛠️ Development Commands

| Command | Description |
|---|---|
| make lint | Check code for style and quality issues |
| make lint-fix | Automatically fix linting issues where possible |
| make tests | Run all tests with coverage reporting |
| make check | Run both linting and tests |
| make clean | Remove cache files, build artifacts, and temporary files |
| make install | Create virtual environment and install dependencies |

📚 Documentation & Examples

  • 📖 User Guide - Complete usage documentation
  • 💡 Examples - Real-world usage examples:
    • 01_basic_query_filtering.py - Query and filtering basics
    • 02_pattern_filtering_basic.py - Pattern-based filtering
    • 06_real_world_scenarios.py - Business use cases
    • 08_auto_discovery.py - Automatic pattern discovery
    • 09_temporal_comparison.py - A/B testing and change detection
    • 10_stats.py - Statistical analysis
  • 🤝 Contributing - How to contribute

🌟 Why Open Source?

Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors. By open-sourcing Dataspot, we hope to:

  • 🎯 Advance fraud detection across the industry
  • 🤝 Enable collaboration on pattern analysis techniques
  • 🔍 Help companies spot issues in their data
  • 📈 Improve data quality everywhere

🤝 Contributing

We welcome contributions! Whether you're:

  • 🐛 Reporting bugs
  • 💡 Suggesting features
  • 📝 Improving documentation
  • 🔧 Adding new analysis methods

See our Contributing Guide for details.

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Created by @eliosf27 - Original algorithm and implementation
  • Sponsored by Frauddi - Field testing and open source support
  • Inspired by real fraud detection challenges - Built to solve actual problems

Find your data's dataspots. Discover what others miss. Built with ❤️ by Frauddi
