Find data concentration patterns and dataspots. Built for fraud detection and risk analysis.
Dataspot
Find data concentration patterns and dataspots in your datasets
Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.
Why Dataspot?
- Purpose-built for finding data concentrations, not just clustering
- Fraud detection ready - spot suspicious behavior patterns
- Simple API - get insights in 3 lines of code
- Hierarchical analysis - understand data at multiple levels
- Flexible filtering - customize analysis with powerful options
- Production tested - battle-tested in real fraud detection systems
Quick Start
pip install dataspot
from dataspot import Dataspot
# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": 150, "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": 200, "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": 50, "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": 300, "user_type": "premium"},
    # ... more data
]
# Find concentration dataspots
dataspot = Dataspot()
concentrations = dataspot.find(data, fields=["country", "device", "user_type"])
# Results show where data concentrates
for pattern in concentrations[:5]:
    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")
# Example output (from a larger dataset):
# country=US > device=mobile > user_type=premium → 45.2% (127 records)
# country=US > device=mobile → 52.1% (146 records)
# device=mobile → 67.8% (190 records)
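To make the output format above concrete, here is a minimal standard-library sketch of how hierarchical concentration percentages can be computed by counting every prefix of the field path. It is an illustration only, not Dataspot's actual implementation; the function name `concentration_paths` and the sample records are hypothetical.

```python
from collections import Counter

def concentration_paths(records, fields):
    """Count every prefix of the field path and report its share of all records."""
    if not records:
        return []
    counts = Counter()
    for record in records:
        path = []
        for field in fields:
            path.append(f"{field}={record.get(field)}")
            counts[" > ".join(path)] += 1
    total = len(records)
    rows = [
        (path, round(100 * count / total, 1), count)
        for path, count in counts.items()
    ]
    # Highest concentration first; ties broken by path for stable output
    return sorted(rows, key=lambda row: (-row[1], row[0]))

sample = [
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "desktop"},
    {"country": "EU", "device": "desktop"},
]
for path, percentage, count in concentration_paths(sample, ["country", "device"]):
    print(f"{path} -> {percentage}% ({count} records)")
```

Each record contributes one count per hierarchy level, which is why a parent path such as `country=US` always has at least the percentage of any of its children.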
Real-World Use Cases
Fraud Detection
# Find suspicious transaction patterns
suspicious = dataspot.find(
    transactions,
    fields=["country", "payment_method", "time_of_day"],
    min_percentage=15  # Only significant concentrations
)
# Spot unusual concentrations that might indicate fraud
for pattern in suspicious:
    if pattern.percentage > 30:
        print(f"⚠️ High concentration: {pattern.path}")
Business Intelligence
# Discover customer behavior patterns
insights = dataspot.analyze(
    customer_data,
    fields=["region", "device", "product_category", "tier"]
)
print(f"Found {len(insights.patterns)} concentration patterns")
print(f"Top opportunity: {insights.top_patterns[0].path}")
Data Quality Analysis
# Find data quality issues
concentrations = dataspot.find(user_logs, ["source", "event", "status"])
# Look for unusual concentrations that might indicate data issues
anomalies = [p for p in concentrations if p.percentage > 80]
for anomaly in anomalies:
    print(f"⚠️ Possible data quality issue: {anomaly.path}")
Advanced Usage
Flexible Filtering
# Complex analysis with multiple criteria
results = dataspot.query(
    min_percentage=10,       # Only patterns with >10% concentration
    max_depth=3,             # Limit hierarchy depth
    contains="mobile",       # Must contain "mobile" in pattern
    min_count=50,            # At least 50 records
    sort_by="concentration"  # Sort by concentration strength
)
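The options above can be read as successive filters over a list of patterns. The sketch below illustrates that reading with a hypothetical `Pattern` dataclass and `filter_patterns` helper. It is an assumption about the semantics, not Dataspot's code: whether thresholds are inclusive and how depth is measured are guesses, and the sample patterns are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    path: str          # e.g. "country=US > device=mobile"
    percentage: float
    count: int

    @property
    def depth(self):
        # Depth is assumed to be the number of field=value segments in the path
        return self.path.count(" > ") + 1

def filter_patterns(patterns, min_percentage=0, max_depth=None,
                    contains=None, min_count=0):
    """Keep patterns that pass every criterion, sorted by percentage (descending)."""
    kept = [
        p for p in patterns
        if p.percentage >= min_percentage
        and p.count >= min_count
        and (max_depth is None or p.depth <= max_depth)
        and (contains is None or contains in p.path)
    ]
    return sorted(kept, key=lambda p: p.percentage, reverse=True)

patterns = [
    Pattern("device=mobile", 67.8, 190),
    Pattern("country=US > device=mobile", 52.1, 146),
    Pattern("country=EU > device=desktop", 8.0, 22),
]
selected = filter_patterns(patterns, min_percentage=10, contains="mobile", min_count=50)
print([p.path for p in selected])
```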
Builder Pattern for Complex Queries
from dataspot import QueryBuilder
# Fluent interface for complex filtering
high_value_patterns = QueryBuilder(dataspot) \
    .field("country", "US") \
    .min_percentage(20) \
    .exclude(["test", "internal"]) \
    .sort_by("percentage") \
    .limit(10) \
    .execute()
Custom Analysis
# Add custom preprocessing
def extract_hour(timestamp):
    return timestamp.split("T")[1][:2]  # Extract hour from ISO timestamp
dataspot.add_preprocessor("timestamp", extract_hour)
# Now timestamp field will be analyzed by hour
patterns = dataspot.find(events, ["user_type", "timestamp", "action"])
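String slicing works when every timestamp has the exact `YYYY-MM-DDTHH:MM:SS` shape. If formats vary, a preprocessor based on `datetime.fromisoformat` may be more robust. The sketch below is a hypothetical alternative (the name `extract_hour_safe` and the `"unknown"` fallback are assumptions, not part of Dataspot); it could be registered with `add_preprocessor` in the same way as `extract_hour`.

```python
from datetime import datetime

def extract_hour_safe(timestamp):
    """Return the zero-padded hour of an ISO 8601 timestamp, or "unknown"."""
    try:
        return f"{datetime.fromisoformat(timestamp).hour:02d}"
    except (TypeError, ValueError):
        # Non-string input or an unparseable string
        return "unknown"

print(extract_hour_safe("2024-05-01T14:30:00"))  # 14
print(extract_hour_safe("2024-05-01 09:05:00"))  # 09
print(extract_hour_safe("not a timestamp"))      # unknown
```

Mapping bad values to a single bucket keeps malformed records visible in the analysis instead of raising midway through a run.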
Performance
Dataspot is built for speed and scale. In the benchmarks below, processing time and memory usage grow roughly linearly with the number of records.
Blazing Fast Performance
| Dataset Size | Processing Time | Memory Usage |
|---|---|---|
| 1K records | ~3ms | ~2MB |
| 10K records | ~30ms | ~15MB |
| 100K records | ~300ms | ~150MB |
| 1M records | ~3s | ~1.5GB |
Algorithm Complexity
- Time Complexity: O(n × f), where n = records, f = fields
- Space Complexity: O(n × f), linear memory usage
- Scalability: Linear scaling - predictable performance growth
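Linear scaling means the benchmark table can be extrapolated with simple arithmetic. The sketch below uses rough per-record costs read off that table (about 3 µs and 1.5 KB per record). It is only an estimate: the helper name `estimate_cost` is hypothetical, and real numbers depend on hardware, field count, and the data itself.

```python
def estimate_cost(records, us_per_record=3, bytes_per_record=1500):
    """Extrapolate processing time (ms) and memory (MB) assuming linear scaling."""
    time_ms = records * us_per_record / 1000
    memory_mb = records * bytes_per_record / 1_000_000
    return time_ms, memory_mb

for n in (1_000, 100_000, 1_000_000):
    time_ms, memory_mb = estimate_cost(n)
    print(f"{n:>9,} records: ~{time_ms:g} ms, ~{memory_mb:g} MB")
```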
Built for Production
import time
from dataspot import Dataspot
# Large dataset example (generate_transactions stands in for your own data source)
data = generate_transactions(100_000)  # 100K records
fields = ["country", "device", "payment_method", "user_tier"]
dataspot = Dataspot()
start = time.perf_counter()
patterns = dataspot.find(data, fields, min_percentage=5)
duration = time.perf_counter() - start
print(f"Analyzed 100K records in {duration:.2f}s")
# Example output: Analyzed 100K records in 0.31s
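The benchmark snippet calls `generate_transactions` without defining it. A minimal, hypothetical stand-in that produces random records with the same four fields might look like this; the value lists and weights are invented for illustration and are not taken from Dataspot.

```python
import random

def generate_transactions(n, seed=42):
    """Build n synthetic transaction dicts (illustrative values only)."""
    rng = random.Random(seed)  # Seeded so repeated benchmark runs see the same data
    countries = ["US", "EU", "BR", "IN"]
    devices = ["mobile", "desktop", "tablet"]
    methods = ["card", "bank_transfer", "wallet"]
    tiers = ["free", "premium"]
    return [
        {
            # Skewed weights so some concentration exists to be found
            "country": rng.choices(countries, weights=[5, 3, 1, 1])[0],
            "device": rng.choice(devices),
            "payment_method": rng.choice(methods),
            "user_tier": rng.choice(tiers),
        }
        for _ in range(n)
    ]

data = generate_transactions(1_000)
print(len(data), data[0])
```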
Performance Tips
Optimize for Speed:
# Use filtering to reduce pattern count
patterns = dataspot.find(
    data,
    fields,
    min_percentage=10,  # Skip low-concentration patterns
    max_depth=3,        # Limit hierarchy depth
    limit=100           # Cap results
)
Memory Efficiency:
# Process large datasets in chunks
# Note: each chunk's percentages are relative to that chunk, not the full dataset
def analyze_large_dataset(data, fields, chunk_size=50000):
    results = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        patterns = dataspot.find(chunk, fields)
        results.extend(patterns)
    return results
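Percentages computed per chunk are relative to that chunk, so concatenated results are not directly comparable to a single pass over the full dataset. One way around this, sketched below with the standard library only (not Dataspot's API; `count_paths` and `merged_percentages` are hypothetical helpers), is to accumulate raw counts per chunk and compute percentages once at the end.

```python
from collections import Counter

def count_paths(chunk, fields):
    """Count each hierarchical field-path prefix within one chunk."""
    counts = Counter()
    for record in chunk:
        path = []
        for field in fields:
            path.append(f"{field}={record.get(field)}")
            counts[" > ".join(path)] += 1
    return counts

def merged_percentages(data, fields, chunk_size=50_000):
    """Accumulate counts chunk by chunk, then derive global percentages."""
    totals = Counter()
    for i in range(0, len(data), chunk_size):
        totals.update(count_paths(data[i:i + chunk_size], fields))
    return {path: 100 * count / len(data) for path, count in totals.items()}

records = [{"country": "US"}] * 3 + [{"country": "EU"}]
print(merged_percentages(records, ["country"], chunk_size=2))
```

Only the counters need to stay in memory between chunks, so the same approach would also work when chunks are read from disk one at a time.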
When to Use Dataspot
Perfect for:
- Real-time fraud detection (millisecond response times)
- Large-scale business intelligence
- High-frequency pattern analysis
- Production systems with strict performance requirements
Consider alternatives for:
- Simple data grouping (use pandas DataFrame.groupby())
- One-time data exploration (any tool works)
- Very small datasets (<100 records)
Benchmarks run on standard hardware (Intel i7, 16GB RAM). Your results may vary.
What Makes Dataspot Different?
| Traditional Clustering | Dataspot Analysis |
|---|---|
| Groups similar data points | Finds concentration patterns |
| Equal-sized clusters | Identifies where data accumulates |
| Distance-based | Percentage and count based |
| Hard to interpret | Business-friendly hierarchy |
| Generic approach | Built for real-world analysis |
Dataspot in action
See Dataspot in action as it discovers data concentration patterns and dataspots in real-time
Installation & Requirements
# Install from PyPI
pip install dataspot
# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"
Requirements:
- Python 3.9+
- No heavy dependencies (just standard library + optional speedups)
Development Commands
The project includes a Makefile with useful development commands:
| Command | Description |
|---|---|
| make lint | Check code for style and quality issues |
| make lint-fix | Automatically fix linting issues where possible |
| make tests | Run all tests with coverage reporting |
| make check | Run both linting and tests |
| make clean | Remove cache files, build artifacts, and temporary files |
| make venv-clean | Remove the virtual environment |
| make venv-create | Create a new virtual environment with Python 3.9+ |
| make venv-install | Install the uv package manager |
| make install | Create virtual environment and install the dependencies |
Documentation
- User Guide - Complete usage documentation
- Examples - Real-world usage examples
- Contributing - How to contribute
Why Open Source?
Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors. By open-sourcing Dataspot, we hope to:
- Advance fraud detection across the industry
- Enable collaboration on pattern analysis techniques
- Help companies spot issues in their data
- Improve data quality everywhere
Contributing
We welcome contributions of all kinds:
- Reporting bugs
- Suggesting features
- Improving documentation
- Adding new analysis methods
See our Contributing Guide for details.
License
MIT License - see LICENSE file for details.
Acknowledgments
- Created by @eliosf27 - Original algorithm and implementation
- Sponsored by Frauddi - Production testing and open source support
- Inspired by real fraud detection challenges - Built to solve actual problems
Links
- Homepage
- PyPI Package
- Issue Tracker
Find your data's dataspots. Discover what others miss. Built with ❤️ by Frauddi