Find data concentration patterns and dataspots. Built for fraud detection and risk analysis.
Dataspot
Find data concentration patterns and dataspots in your datasets
Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.
Why Dataspot?
- Purpose-built for finding data concentrations, not just clustering
- Fraud detection ready - spot suspicious behavior patterns
- Simple API - get insights in 3 lines of code
- Hierarchical analysis - understand data at multiple levels
- Flexible filtering - customize analysis with powerful options
- Field-tested - validated in real fraud detection systems
Quick Start
pip install dataspot
from dataspot import Dataspot
from dataspot.models.finder import FindInput, FindOptions
# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": "medium", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": "low", "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
]

# Find concentration patterns
dataspot = Dataspot()
result = dataspot.find(
    FindInput(data=data, fields=["country", "device", "user_type"]),
    FindOptions(min_percentage=10.0, limit=5)
)

# Results show where data concentrates
for pattern in result.patterns:
    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")

# Output:
# country=US > device=mobile > user_type=premium → 75.0% (3 records)
# country=US > device=mobile → 75.0% (3 records)
# device=mobile → 75.0% (3 records)
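To make the percentages above concrete, here is a minimal standard-library sketch of hierarchical concentration counting. It is an illustration only, not Dataspot's implementation: the helper name `concentration_paths` is hypothetical, and it counts only prefixes in the given field order, so the set of paths it reports may differ from what `find()` returns.

```python
from collections import Counter


def concentration_paths(records, fields):
    """Count each hierarchical prefix path; return (path, percentage, count) rows."""
    counts = Counter()
    for record in records:
        parts = []
        for field in fields:
            parts.append(f"{field}={record.get(field)}")
            counts[" > ".join(parts)] += 1
    total = len(records)
    rows = [(path, round(100.0 * n / total, 2), n) for path, n in counts.items()]
    # Highest concentration first; ties are broken by path for a stable order.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


sample = [
    {"country": "US", "device": "mobile", "user_type": "premium"},
    {"country": "US", "device": "mobile", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "user_type": "free"},
    {"country": "US", "device": "mobile", "user_type": "premium"},
]

paths = concentration_paths(sample, ["country", "device", "user_type"])
for path, percentage, count in paths:
    print(f"{path} -> {percentage}% ({count} records)")
```

Three of the four sample records share `country=US`, `device=mobile`, and `user_type=premium`, which is where the 75.0% figure comes from.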
Real-World Use Cases
Fraud Detection
from dataspot.models.finder import FindInput, FindOptions

# Find suspicious transaction patterns
result = dataspot.find(
    FindInput(
        data=transactions,
        fields=["country", "payment_method", "time_of_day"]
    ),
    FindOptions(min_percentage=15.0, contains="crypto")
)

# Spot unusual concentrations that might indicate fraud
for pattern in result.patterns:
    if pattern.percentage > 30:
        print(f"⚠️ High concentration: {pattern.path}")
Business Intelligence
from dataspot.models.analyzer import AnalyzeInput, AnalyzeOptions

# Discover customer behavior patterns
insights = dataspot.analyze(
    AnalyzeInput(
        data=customer_data,
        fields=["region", "device", "product_category", "tier"]
    ),
    AnalyzeOptions(min_percentage=10.0)
)

print(f"Found {len(insights.patterns)} concentration patterns")
if insights.patterns:  # Guard against an empty result before indexing
    print(f"Top opportunity: {insights.patterns[0].path}")
Temporal Analysis
from dataspot.models.compare import CompareInput, CompareOptions

# Compare patterns between time periods
comparison = dataspot.compare(
    CompareInput(
        current_data=this_month_data,
        baseline_data=last_month_data,
        fields=["country", "payment_method"]
    ),
    CompareOptions(
        change_threshold=0.20,
        statistical_significance=True
    )
)

print(f"Changes detected: {len(comparison.changes)}")
print(f"New patterns: {len(comparison.new_patterns)}")
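One way to read a change threshold of 0.20 is as a relative change in a value's share between the two periods. The sketch below works under that assumption with the standard library only. The helper `relative_changes` is hypothetical and not part of Dataspot; it handles a single field, skips significance testing, and ignores values that disappear from the current period.

```python
from collections import Counter


def relative_changes(current, baseline, field, change_threshold=0.20):
    """Return ({value: relative change in share}, [values absent from baseline])."""
    cur = Counter(record.get(field) for record in current)
    base = Counter(record.get(field) for record in baseline)
    changes, new_values = {}, []
    for value, n in cur.items():
        if value not in base:
            new_values.append(value)  # No baseline share to compare against
            continue
        cur_share = n / len(current)
        base_share = base[value] / len(baseline)
        change = cur_share / base_share - 1.0
        if abs(change) >= change_threshold:
            changes[value] = change
    return changes, new_values


last_month = (
    [{"payment_method": "card"}] * 6
    + [{"payment_method": "paypal"}] * 2
    + [{"payment_method": "bank"}] * 2
)
this_month = (
    [{"payment_method": "card"}] * 3
    + [{"payment_method": "paypal"}] * 4
    + [{"payment_method": "bank"}] * 2
    + [{"payment_method": "crypto"}] * 1
)

changes, new_values = relative_changes(this_month, last_month, "payment_method")
print("Changes:", changes)
print("New values:", new_values)
```

In this sample, `card` halves its share and `paypal` doubles it, so both cross the threshold. `bank` is unchanged and is not reported, and `crypto` appears only in the current period.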
Auto Discovery
from dataspot.models.discovery import DiscoverInput, DiscoverOptions

# Automatically discover important patterns
discovery = dataspot.discover(
    DiscoverInput(data=transaction_data),
    DiscoverOptions(max_fields=3, min_percentage=15.0)
)

print(f"Top patterns discovered: {len(discovery.top_patterns)}")
for field_ranking in discovery.field_ranking[:3]:
    print(f"{field_ranking.field}: {field_ranking.score:.2f}")
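How fields are scored is not described here. As a rough mental model only, the sketch below ranks fields by the share of records held by each field's most common value. Both the scoring rule and the helper `rank_fields` are assumptions made for illustration, not Dataspot's algorithm.

```python
from collections import Counter


def rank_fields(records, max_fields=3):
    """Rank fields by the share of records taken by each field's top value."""
    fields = sorted({key for record in records for key in record})
    scores = []
    for field in fields:
        counts = Counter(record.get(field) for record in records)
        _, top_count = counts.most_common(1)[0]
        scores.append((field, top_count / len(records)))
    # Highest score first; ties are broken alphabetically by field name.
    return sorted(scores, key=lambda item: (-item[1], item[0]))[:max_fields]


transactions_sample = [
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": "medium", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": "low", "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
]

ranking = rank_fields(transactions_sample, max_fields=4)
for field, score in ranking:
    print(f"{field}: {score:.2f}")
```

A score like this favors near-constant fields, so a real ranking would likely need to balance concentration against how informative a field is.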
Core Methods
| Method | Purpose | Input Model | Options Model | Output Model |
|---|---|---|---|---|
| find() | Find concentration patterns | FindInput | FindOptions | FindOutput |
| analyze() | Statistical analysis | AnalyzeInput | AnalyzeOptions | AnalyzeOutput |
| compare() | Temporal comparison | CompareInput | CompareOptions | CompareOutput |
| discover() | Auto pattern discovery | DiscoverInput | DiscoverOptions | DiscoverOutput |
| tree() | Hierarchical visualization | TreeInput | TreeOptions | TreeOutput |
Advanced Filtering Options
# Complex analysis with multiple criteria
result = dataspot.find(
    FindInput(
        data=data,
        fields=["country", "device", "payment"],
        query={"country": ["US", "EU"]}  # Pre-filter data
    ),
    FindOptions(
        min_percentage=10.0,   # Only patterns with >10% concentration
        max_depth=3,           # Limit hierarchy depth
        contains="mobile",     # Must contain "mobile" in pattern
        min_count=50,          # At least 50 records
        sort_by="percentage",  # Sort by concentration strength
        limit=20               # Top 20 patterns
    )
)
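The options above can be pictured as a two-stage pipeline: pre-filter the records, then filter and sort the resulting patterns. The plain-Python sketch below shows that idea. The helpers `matches_query` and `filter_patterns` are hypothetical, the pattern dictionaries are made-up sample values, and the exact comparison semantics (for example inclusive thresholds) are assumptions rather than documented Dataspot behavior.

```python
def matches_query(record, query):
    """True when every queried field holds one of the allowed values."""
    for field, allowed in query.items():
        allowed_values = allowed if isinstance(allowed, (list, tuple, set)) else [allowed]
        if record.get(field) not in allowed_values:
            return False
    return True


def filter_patterns(patterns, min_percentage=0.0, min_count=0, contains=None, limit=None):
    """Keep patterns passing every criterion, sorted by percentage, capped at limit."""
    kept = [
        p for p in patterns
        if p["percentage"] >= min_percentage
        and p["count"] >= min_count
        and (contains is None or contains in p["path"])
    ]
    kept.sort(key=lambda p: p["percentage"], reverse=True)
    return kept[:limit]  # A limit of None keeps everything


records = [
    {"country": "US", "device": "mobile"},
    {"country": "EU", "device": "desktop"},
    {"country": "BR", "device": "mobile"},
]
selected = [r for r in records if matches_query(r, {"country": ["US", "EU"]})]

sample_patterns = [
    {"path": "country=US", "percentage": 60.0, "count": 120},
    {"path": "country=US > device=mobile", "percentage": 45.0, "count": 90},
    {"path": "country=EU > device=desktop", "percentage": 25.0, "count": 50},
    {"path": "country=EU > device=mobile", "percentage": 5.0, "count": 10},
]
top = filter_patterns(sample_patterns, min_percentage=10.0, min_count=50,
                      contains="mobile", limit=20)
print([p["path"] for p in top])
```

Filtering records before counting changes the totals that percentages are computed against, which is why the pre-filter is shown as a separate step from the pattern filters.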
Performance
Processing time scales roughly linearly with dataset size (about 4 ms per 1,000 records in the measurements below), while memory usage grows far more slowly than the data.
Real-World Performance
| Dataset Size | Processing Time | Memory Usage |
|---|---|---|
| 1K records | ~4ms | ~1MB |
| 10K records | ~40ms | ~2MB |
| 100K records | ~400ms | ~3MB |
| 1M records | ~4s | ~10MB |
Benchmark details: measured on standard hardware with realistic datasets (multiple fields, mixed data types). Times are averaged over multiple runs. Memory stays low, at about 10 MB for 1M records.
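To take comparable measurements on your own hardware, a small harness built on `time.perf_counter` and `tracemalloc` is enough. The sketch below times a stand-in counting function rather than Dataspot itself, so its numbers will not match the table. The names `make_records`, `count_paths`, and `benchmark` are hypothetical, and the synthetic data is an assumption.

```python
import random
import time
import tracemalloc
from collections import Counter


def make_records(n, seed=0):
    """Generate n synthetic records with two categorical fields."""
    rng = random.Random(seed)
    countries = ["US", "EU", "BR", "IN"]
    devices = ["mobile", "desktop", "tablet"]
    return [{"country": rng.choice(countries), "device": rng.choice(devices)}
            for _ in range(n)]


def count_paths(records, fields):
    """Stand-in workload: count every hierarchical prefix path."""
    counts = Counter()
    for record in records:
        parts = []
        for field in fields:
            parts.append(f"{field}={record[field]}")
            counts[" > ".join(parts)] += 1
    return counts


def benchmark(n, fields=("country", "device"), runs=3):
    """Average wall-clock time over several runs, plus peak traced memory."""
    records = make_records(n)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        count_paths(records, fields)
        timings.append(time.perf_counter() - start)
    # Trace only the counting step, so the dataset itself is excluded.
    tracemalloc.start()
    counts = count_paths(records, fields)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"records": n, "avg_seconds": sum(timings) / runs,
            "peak_bytes": peak_bytes, "paths": len(counts)}


for size in (1_000, 10_000):
    stats = benchmark(size)
    print(f"{stats['records']} records: {stats['avg_seconds'] * 1000:.2f} ms, "
          f"peak {stats['peak_bytes'] / 1024:.1f} KiB")
```

Swapping `count_paths` for a call to the method you care about gives timings that reflect your own fields and data shape.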
Performance Tips
# Optimize for speed
result = dataspot.find(
    FindInput(data=large_dataset, fields=fields),
    FindOptions(
        min_percentage=10.0,  # Skip low-concentration patterns
        max_depth=3,          # Limit hierarchy depth
        limit=100             # Cap results
    )
)

# Memory efficient processing
from dataspot.models.tree import TreeInput, TreeOptions

tree = dataspot.tree(
    TreeInput(data=data, fields=["country", "device"]),
    TreeOptions(min_value=10, top=5)  # Simplified tree
)
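For intuition about what a simplified tree looks like, the sketch below builds a nested count tree and then prunes it. It assumes that `min_value` drops nodes with fewer records and that `top` keeps only the largest children at each level. The helpers `build_tree` and `prune`, along with the node keys `name`, `value`, and `children`, are hypothetical and not Dataspot's output format.

```python
def build_tree(records, fields):
    """Nest record counts by field order: root -> field1=value -> field2=value ..."""
    root = {"name": "root", "value": len(records), "children": {}}
    for record in records:
        node = root
        for field in fields:
            key = f"{field}={record.get(field)}"
            child = node["children"].setdefault(
                key, {"name": key, "value": 0, "children": {}}
            )
            child["value"] += 1
            node = child
    return root


def prune(node, min_value=1, top=None):
    """Keep at most `top` children per node, each with at least `min_value` records."""
    children = sorted(node["children"].values(), key=lambda c: -c["value"])
    kept = [c for c in children if c["value"] >= min_value][:top]
    return {
        "name": node["name"],
        "value": node["value"],
        "children": [prune(c, min_value, top) for c in kept],
    }


tree_sample = [
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "mobile"},
    {"country": "EU", "device": "desktop"},
    {"country": "US", "device": "mobile"},
]

full_tree = build_tree(tree_sample, ["country", "device"])
small_tree = prune(full_tree, min_value=2, top=5)
print(small_tree)
```

Pruning shrinks the structure you hold and render, which is one way a simplified tree can help with large inputs.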
What Makes Dataspot Different?
| Traditional Clustering | Dataspot Analysis |
|---|---|
| Groups similar data points | Finds concentration patterns |
| Equal-sized clusters | Identifies where data accumulates |
| Distance-based | Percentage and count based |
| Hard to interpret | Business-friendly hierarchy |
| Generic approach | Built for real-world analysis |
Dataspot in Action
See Dataspot discover concentration patterns and dataspots in real time, with hierarchical analysis and statistical insights.
API Structure
Input Models
- FindInput - Data and fields for pattern finding
- AnalyzeInput - Statistical analysis configuration
- CompareInput - Current vs baseline data comparison
- DiscoverInput - Automatic pattern discovery
- TreeInput - Hierarchical tree visualization
Options Models
- FindOptions - Filtering and sorting for patterns
- AnalyzeOptions - Statistical analysis parameters
- CompareOptions - Change detection thresholds
- DiscoverOptions - Auto-discovery constraints
- TreeOptions - Tree structure customization
Response Models
All methods return structured responses with:
- patterns - Found concentration patterns
- statistics - Analysis metrics
- metadata - Processing information
Installation & Requirements
# Install from PyPI
pip install dataspot
# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"
Requirements:
- Python 3.9+
- No heavy dependencies (just standard library + optional speedups)
Development Commands
| Command | Description |
|---|---|
| make lint | Check code for style and quality issues |
| make lint-fix | Automatically fix linting issues where possible |
| make tests | Run all tests with coverage reporting |
| make check | Run both linting and tests |
| make clean | Remove cache files, build artifacts, and temporary files |
| make install | Create virtual environment and install dependencies |
Documentation & Examples
- User Guide - Complete usage documentation
- Examples - Real-world usage examples:
  - 01_basic_query_filtering.py - Query and filtering basics
  - 02_pattern_filtering_basic.py - Pattern-based filtering
  - 06_real_world_scenarios.py - Business use cases
  - 08_auto_discovery.py - Automatic pattern discovery
  - 09_temporal_comparison.py - A/B testing and change detection
  - 10_stats.py - Statistical analysis
- Contributing - How to contribute
Why Open Source?
Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors. By open-sourcing Dataspot, we hope to:
- Advance fraud detection across the industry
- Enable collaboration on pattern analysis techniques
- Help companies spot issues in their data
- Improve data quality everywhere
Contributing
We welcome contributions of every kind, including:
- Reporting bugs
- Suggesting features
- Improving documentation
- Adding new analysis methods
See our Contributing Guide for details.
License
MIT License - see LICENSE file for details.
Acknowledgments
- Created by @eliosf27 - Original algorithm and implementation
- Sponsored by Frauddi - Field testing and open source support
- Inspired by real fraud detection challenges - Built to solve actual problems
Links
- Homepage
- PyPI Package
- Issue Tracker
Find your data's dataspots. Discover what others miss. Built with ❤️ by Frauddi