Find data concentration patterns and dataspots. Built for fraud detection and risk analysis.

Dataspot 🔥

Find data concentration patterns and dataspots in your datasets

Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.

✨ Why Dataspot?

🎯 Purpose-built for finding data concentrations, not just clustering
🔍 Fraud detection ready - spot suspicious behavior patterns
⚡ Simple API - get insights in 3 lines of code
📊 Hierarchical analysis - understand data at multiple levels
🔧 Flexible filtering - customize analysis with powerful options
📈 Production tested - battle-tested in real fraud detection systems

🚀 Quick Start

pip install dataspot

from dataspot import Dataspot

# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": 150, "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": 200, "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": 50, "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": 300, "user_type": "premium"},
    # ... more data
]

# Find concentration dataspots
dataspot = Dataspot()  # importing the class avoids shadowing the dataspot module
concentrations = dataspot.find(data, fields=["country", "device", "user_type"])

# Results show where data concentrates
for pattern in concentrations[:5]:
    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")

# Output:
# country=US > device=mobile > user_type=premium → 45.2% (127 records)
# country=US > device=mobile → 52.1% (146 records)
# device=mobile → 67.8% (190 records)
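
To make the `path`, `percentage`, and `count` values above more concrete, here is a minimal standard-library sketch of hierarchical counting. It is not Dataspot's implementation: the function name `concentration_paths` is hypothetical, and it only counts paths that start at the first field, whereas the library may enumerate paths differently.

```python
from collections import Counter


def concentration_paths(records, fields):
    """Count every field-prefix path and report its share of all records."""
    counts = Counter()
    for record in records:
        path = []
        for field in fields:
            # Extend the path one field at a time: "a=1", then "a=1 > b=2", ...
            path.append(f"{field}={record.get(field)}")
            counts[" > ".join(path)] += 1

    total = len(records)
    rows = [
        (path, round(100 * count / total, 1), count)
        for path, count in counts.items()
    ]
    # Highest share first; ties are broken by path for a stable order
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


records = [
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "mobile"},
    {"country": "EU", "device": "desktop"},
    {"country": "US", "device": "mobile"},
]

for path, percentage, count in concentration_paths(records, ["country", "device"]):
    print(f"{path} → {percentage}% ({count} records)")
```

With these four records, `country=US` and `country=US > device=mobile` each cover 75.0% (3 records), and the EU paths cover 25.0% (1 record).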

🎯 Real-World Use Cases

🚨 Fraud Detection

# Find suspicious transaction patterns
suspicious = dataspot.find(
    transactions,
    fields=["country", "payment_method", "time_of_day"],
    min_percentage=15  # Only significant concentrations
)

# Spot unusual concentrations that might indicate fraud
for pattern in suspicious:
    if pattern.percentage > 30:
        print(f"⚠️ High concentration: {pattern.path}")

📊 Business Intelligence

# Discover customer behavior patterns
insights = dataspot.analyze(
    customer_data,
    fields=["region", "device", "product_category", "tier"]
)

print(f"📈 Found {len(insights.patterns)} concentration patterns")
print(f"🎯 Top opportunity: {insights.top_patterns[0].path}")

🔍 Data Quality Analysis

# Find data quality issues
concentrations = dataspot.find(user_logs, ["source", "event", "status"])

# Look for unusual concentrations that might indicate data issues
anomalies = [p for p in concentrations if p.percentage > 80]
for anomaly in anomalies:
    print(f"⚠️ Possible data quality issue: {anomaly.path}")

🛠️ Advanced Usage

Flexible Filtering

# Complex analysis with multiple criteria
results = dataspot.query(
    min_percentage=10,        # Only patterns with >10% concentration
    max_depth=3,              # Limit hierarchy depth
    contains="mobile",        # Must contain "mobile" in pattern
    min_count=50,             # At least 50 records
    sort_by="concentration"   # Sort by concentration strength
)

Builder Pattern for Complex Queries

from dataspot import QueryBuilder

# Fluent interface for complex filtering
high_value_patterns = (
    QueryBuilder(dataspot)
    .field("country", "US")
    .min_percentage(20)
    .exclude(["test", "internal"])
    .sort_by("percentage")
    .limit(10)
    .execute()
)

Custom Analysis

# Add custom preprocessing
def extract_hour(timestamp):
    return timestamp.split("T")[1][:2]  # Extract hour from ISO timestamp

dataspot.add_preprocessor("timestamp", extract_hour)

# Now timestamp field will be analyzed by hour
patterns = dataspot.find(events, ["user_type", "timestamp", "action"])
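
The `split("T")` version above raises an `IndexError` when a timestamp has no `T` separator. A slightly more defensive variant might look like the sketch below. It assumes the preprocessor receives the raw field value and returns the value to group by, as in the example above; the `"unknown"` fallback bucket is an illustrative choice, not something Dataspot prescribes.

```python
from datetime import datetime


def extract_hour(timestamp):
    """Return the zero-padded hour ("00".."23") of an ISO 8601 timestamp.

    Values that cannot be parsed fall into a single "unknown" bucket
    instead of raising, so one bad record does not abort the analysis.
    """
    try:
        return f"{datetime.fromisoformat(timestamp).hour:02d}"
    except (TypeError, ValueError):
        return "unknown"


print(extract_hour("2025-06-26T14:30:00"))
print(extract_hour("2025-06-26 09:05:00"))
print(extract_hour("not a timestamp"))
```

It can be registered the same way as before, with `dataspot.add_preprocessor("timestamp", extract_hour)`. Note that `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward; on older versions such timestamps would land in the `"unknown"` bucket.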

⚡ Performance

Dataspot is built for speed and scale. Processing time and memory usage grow linearly with the number of records, as the benchmarks below show.

🚀 Blazing Fast Performance

Dataset Size	Processing Time	Memory Usage
1K records	~3ms	~2MB
10K records	~30ms	~15MB
100K records	~300ms	~150MB
1M records	~3s	~1.5GB

📊 Algorithm Complexity

Time Complexity: O(n × f) where n = records, f = fields
Space Complexity: O(n × f) linear memory usage
Scalability: Linear scaling - predictable performance growth
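
If scaling really is linear, the benchmark table can be used for rough capacity planning. The sketch below simply scales the 100K-record row (~300ms, ~150MB) to another dataset size. The function name `estimate_cost` is hypothetical, the number of fields is assumed to stay the same, and the result is a ballpark figure rather than a guarantee.

```python
# Reference point taken from the benchmark table: 100K records, ~300 ms, ~150 MB
REF_RECORDS = 100_000
REF_SECONDS = 0.300
REF_MEMORY_MB = 150


def estimate_cost(n_records):
    """Linearly scale the reference figures to n_records (rough estimate only)."""
    scale = n_records / REF_RECORDS
    return REF_SECONDS * scale, REF_MEMORY_MB * scale


seconds, memory_mb = estimate_cost(500_000)
print(f"~{seconds:.1f}s, ~{memory_mb:.0f}MB")
```

For 500K records this gives about 1.5 seconds and 750MB, which sits between the 100K and 1M rows of the table as expected.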

🔥 Built for Production

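import time
from dataspot import Dataspot

# Large dataset example
# generate_transactions() is a placeholder for your own data loader
data = generate_transactions(100_000)  # 100K records
fields = ["country", "device", "payment_method", "user_tier"]

dataspot = Dataspot()
start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
patterns = dataspot.find(data, fields, min_percentage=5)
duration = time.perf_counter() - start

print(f"Analyzed 100K records in {duration:.2f}s")
# Example output: Analyzed 100K records in 0.31s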

💡 Performance Tips

Optimize for Speed:

# Use filtering to reduce pattern count
patterns = dataspot.find(
    data,
    fields,
    min_percentage=10,  # Skip low-concentration patterns
    max_depth=3,        # Limit hierarchy depth
    limit=100           # Cap results
)

Memory Efficiency:

# Process large datasets in chunks
def analyze_large_dataset(data, fields, chunk_size=50000):
    results = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        # Percentages are relative to each chunk, not to the full dataset
        patterns = dataspot.find(chunk, fields)
        results.extend(patterns)
    return results
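
Because the chunked function concatenates per-chunk results, the same path can appear once per chunk, each time with a percentage relative to that chunk. One way to combine them is to sum the counts per path and recompute the percentage against the full dataset. The sketch below assumes only the `.path` and `.count` attributes shown in the Quick Start; `Pattern` is a stand-in type and `merge_chunk_patterns` is a hypothetical helper, not part of Dataspot.

```python
from collections import Counter, namedtuple

# Stand-in for Dataspot's pattern objects; only .path and .count are used here
Pattern = namedtuple("Pattern", ["path", "count"])


def merge_chunk_patterns(chunk_results, total_records):
    """Sum counts per path across chunks and recompute percentages globally."""
    counts = Counter()
    for patterns in chunk_results:
        for pattern in patterns:
            counts[pattern.path] += pattern.count

    merged = [
        (path, count, round(100 * count / total_records, 1))
        for path, count in counts.items()
    ]
    # Largest count first; ties are broken by path for a stable order
    merged.sort(key=lambda row: (-row[1], row[0]))
    return merged


chunk_a = [Pattern("country=US", 30), Pattern("country=EU", 20)]  # 50 records
chunk_b = [Pattern("country=US", 45), Pattern("country=EU", 5)]   # 50 records
merged = merge_chunk_patterns([chunk_a, chunk_b], total_records=100)
print(merged)
```

This only gives exact totals when the per-chunk calls return every pattern. If a chunk was analyzed with a filter such as `min_percentage`, paths dropped in that chunk are missing from the sum.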

🎯 When to Use Dataspot

✅ Perfect for:

Real-time fraud detection (millisecond response times on small batches, per the benchmarks above)
Large-scale business intelligence
High-frequency pattern analysis
Production systems with strict performance requirements

⚠️ Consider alternatives for:

Simple data grouping (use pandas.groupby())
One-time data exploration (any tool works)
Very small datasets (<100 records)

Benchmarks run on standard hardware (Intel i7, 16GB RAM). Your results may vary.

📈 What Makes Dataspot Different?

Traditional Clustering	Dataspot Analysis
Groups similar data points	Finds concentration patterns
Equal-sized clusters	Identifies where data accumulates
Distance-based	Percentage and count based
Hard to interpret	Business-friendly hierarchy
Generic approach	Built for real-world analysis

Dataspot in action

[Image: Dataspot in action, discovering data concentration patterns and dataspots in real time]

🔧 Installation & Requirements

# Install from PyPI
pip install dataspot

# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"

Requirements:

Python 3.9+
No heavy dependencies (just standard library + optional speedups)

🛠️ Development Commands

The project includes a Makefile with useful development commands:

Command	Description
`make lint`	Check code for style and quality issues
`make lint-fix`	Automatically fix linting issues where possible
`make tests`	Run all tests with coverage reporting
`make check`	Run both linting and tests
`make clean`	Remove cache files, build artifacts, and temporary files
`make venv-clean`	Remove the virtual environment
`make venv-create`	Create a new virtual environment with Python 3.9+
`make venv-install`	Install the uv package manager
`make install`	Create virtual environment and install the dependencies

📚 Documentation

📖 User Guide - Complete usage documentation
💡 Examples - Real-world usage examples
🤝 Contributing - How to contribute

🌟 Why Open Source?

Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors. By open-sourcing Dataspot, we hope to:

🎯 Advance fraud detection across the industry
🤝 Enable collaboration on pattern analysis techniques
🔍 Help companies spot issues in their data
📈 Improve data quality everywhere

🤝 Contributing

We welcome contributions, whether you're:

🐛 Reporting bugs
💡 Suggesting features
📝 Improving documentation
🔧 Adding new analysis methods

See our Contributing Guide for details.

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Created by @eliosf27 - Original algorithm and implementation
Sponsored by Frauddi - Production testing and open source support
Inspired by real fraud detection challenges - Built to solve actual problems

🔗 Links

🏠 Homepage
📦 PyPI Package
🐛 Issue Tracker

Find your data's dataspots. Discover what others miss. Built with ❤️ by Frauddi

Release history

Version	Release date
0.4.6	Jul 24, 2025
0.4.5	Jul 24, 2025
0.4.4	Jul 24, 2025
0.4.3	Jul 1, 2025
0.4.2	Jun 30, 2025
0.4.1	Jun 30, 2025
0.4.0	Jun 28, 2025
0.3.1	Jun 27, 2025
0.3.0 (this version)	Jun 26, 2025
0.2.0	Jun 25, 2025
0.1.0	Jun 24, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataspot-0.3.0.tar.gz (287.5 kB)

Uploaded Jun 26, 2025 Source

Built Distribution

dataspot-0.3.0-py3-none-any.whl (25.2 kB)

Uploaded Jun 26, 2025 Python 3

File details

Details for the file dataspot-0.3.0.tar.gz.

File metadata

Download URL: dataspot-0.3.0.tar.gz
Upload date: Jun 26, 2025
Size: 287.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for dataspot-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`d161ec9564ed1df7e7cef0dcb3f37204d02d638aea345bf66f726f0c4b311ab6`
MD5	`1ba65d6b516da808820b59d95f5e8d7c`
BLAKE2b-256	`8583eae3cf6aa0db3cf81f4630b302685bb44de8ad84ebc1c977a7f69490eb87`

File details

Details for the file dataspot-0.3.0-py3-none-any.whl.

File metadata

Download URL: dataspot-0.3.0-py3-none-any.whl
Upload date: Jun 26, 2025
Size: 25.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for dataspot-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1aca97ce9bd458ae89c77180a96ed153f0727839e3a6d747668126c67c8b9e93`
MD5	`a08f73a6ae6f2e624df529b0d3d6ec26`
BLAKE2b-256	`33a6bf412a54e87b2119ec135f420b730fe608eb48ee55e761bbfdd44ca9df55`
