OpenML Crawler
A unified framework for crawling and preparing ML-ready datasets from various sources including web APIs, open data portals, and custom data sources.
Features
Connectors (Free APIs + Curated Data Sources)
- Weather: Open-Meteo, OpenWeather, NOAA, Weather Underground
- Social Media: Twitter/X API, Reddit API, Facebook Graph API, Instagram
- Government Data: US data.gov, EU Open Data, UK data.gov.uk, Indian data.gov.in
- Finance: Yahoo Finance, Alpha Vantage, FRED, CoinMarketCap
- Knowledge: Wikipedia, Wikidata
- News: NewsAPI, Google News, Bing News, NY Times
- Social/Dev: GitHub, Stack Exchange
- Health: CDC, WHO, PubMed, ClinicalTrials.gov
- Agriculture: FAO, USDA, Government open data portals
- Energy: EIA, IEA
Generic Web Crawling
- Support for CSV, JSON, XML, HTML parsing
- PDF parsing with pdfplumber/PyPDF2
- Async crawling with aiohttp
- Headless browser mode with Playwright/Selenium
- Auto format detection (mimetype, file extension)
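Auto format detection can be pictured as a two-step lookup: trust the server's Content-Type header when it is recognizable, otherwise fall back to the file extension in the URL. The sketch below is only an illustration of that idea; `detect_format` and both lookup tables are hypothetical names, not part of the openmlcrawler API.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup tables for illustration (not the library's own mapping).
MIME_TO_FORMAT = {
    "text/csv": "csv",
    "application/json": "json",
    "application/xml": "xml",
    "text/xml": "xml",
    "text/html": "html",
    "application/pdf": "pdf",
}
EXT_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".xml": "xml",
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
}


def detect_format(url: str, content_type: Optional[str] = None) -> Optional[str]:
    """Guess a parser name from the Content-Type header, then from the URL extension."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in MIME_TO_FORMAT:
            return MIME_TO_FORMAT[mime]
    # Use only the path component so query strings do not hide the extension.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXT_TO_FORMAT.get(suffix)
```

Returning `None` for unknown inputs leaves the caller free to choose a default parser or raise an error.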
Data Cleaning & Processing
- Deduplication and anomaly detection
- Missing value handling
- Auto type detection (int, float, datetime, category)
- Text cleaning (stopwords, stemming, lemmatization)
- NLP utilities: language detection, translation, NER
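As a rough illustration of auto type detection, one common approach is to try progressively looser parsers on a column's raw values and fall back to a categorical or free-text label. This is a sketch under that assumption; `infer_column_type` and its `max_categories` threshold are hypothetical and may differ from what the library does.

```python
from datetime import datetime
from typing import Callable, List


def infer_column_type(values: List[str], max_categories: int = 10) -> str:
    """Classify a column of raw strings as int, float, datetime, category, or text."""
    # Ignore blanks so missing values do not block numeric detection.
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "empty"

    def all_parse(parser: Callable[[str], object]) -> bool:
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    # Try the strictest parser first: every int is also a valid float.
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    distinct = set(present)
    # Few distinct values that repeat suggest a categorical column.
    if len(distinct) <= max_categories and len(distinct) < len(present):
        return "category"
    return "text"
```

Only ISO 8601 dates are recognized here; a real implementation would likely accept more date formats.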
ML-Ready Dataset Preparation
- Schema detection (features/labels)
- Feature/target separation (X, y)
- Train/validation/test split
- Normalization & encoding (optional)
- Export to CSV, JSON, Parquet
- Ready-made loaders for scikit-learn, PyTorch, TensorFlow
- Streaming mode for big data (generator-based)
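Generator-based streaming means rows are produced in batches on demand, so only one batch lives in memory at a time. The following is a minimal standard-library sketch of that pattern; `stream_batches` is a hypothetical helper, not the library's streaming API.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO


def stream_batches(handle: TextIO, batch_size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of row dicts, holding at most one batch in memory."""
    reader = csv.DictReader(handle)
    batch: List[Dict[str, str]] = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch


# Usage: an in-memory file stands in for a large CSV on disk.
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
for rows in stream_batches(sample, batch_size=2):
    print(len(rows), "rows")
```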
Advanced Data Quality & Privacy
- Data Quality Assessment: Missing data analysis, duplicate detection, outlier analysis, trust scoring
- PII Detection: Automatic detection of personally identifiable information
- Data Anonymization: Hash, mask, redact methods for privacy protection
- Compliance Checking: GDPR, HIPAA compliance validation
- Quality Scoring: Automated data quality metrics and reporting
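The three anonymization methods named above differ in what they preserve: hashing keeps values joinable, masking keeps a recognizable tail, and redaction keeps nothing. The sketch below illustrates them on single string values. `anonymize_value`, `find_emails`, and the email pattern are hypothetical helpers written for this example, not the library's implementation.

```python
import hashlib
import re
from typing import List

# Simplified email pattern for illustration; real PII detection needs broader rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text: str) -> List[str]:
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)


def anonymize_value(value: str, method: str = "hash", salt: str = "") -> str:
    """Anonymize one value by hashing, masking, or redacting it."""
    if method == "hash":
        # Same input and salt always give the same digest, so joins still work.
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    if method == "mask":
        # Keep the last four characters only when the value is longer than that.
        visible = value[-4:] if len(value) > 4 else ""
        return "*" * (len(value) - len(visible)) + visible
    if method == "redact":
        return "[REDACTED]"
    raise ValueError(f"Unknown method: {method!r}")
```

An unsalted hash of a low-entropy value (a phone number, for example) can be reversed by brute force, so a secret salt is advisable whenever hashing is used for privacy.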
Smart Search & Discovery
- AI-Powered Search: Vector embeddings and semantic matching
- Dataset Indexing: Automatic indexing with metadata and quality metrics
- Multi-Platform Search: Kaggle, Google Dataset Search, Zenodo, DataCite integration
- Relevance Ranking: Similarity scoring and quality-based ranking
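Similarity scoring combined with quality-based ranking can be sketched as a weighted blend of cosine similarity and a per-dataset quality score. The code below assumes embeddings are already computed and that quality lies in [0, 1]; `rank_datasets` and the `quality_weight` parameter are illustrative assumptions, not the library's ranking formula.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_datasets(
    query_vec: Sequence[float],
    index: Dict[str, Tuple[Sequence[float], float]],
    quality_weight: float = 0.2,
) -> List[Tuple[str, float]]:
    """Rank indexed datasets by a blend of similarity and quality, best first.

    `index` maps a dataset name to (embedding, quality score in [0, 1]).
    """
    scored = []
    for name, (vec, quality) in index.items():
        similarity = cosine_similarity(query_vec, vec)
        score = (1 - quality_weight) * similarity + quality_weight * quality
        scored.append((name, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```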
Cloud Integration
- Multi-Provider Support: AWS S3, Google Cloud Storage, Azure Blob Storage
- Unified API: Single interface for all cloud providers
- Auto-Detection: Automatic provider detection from URLs
- Batch Operations: Upload/download multiple files
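Provider auto-detection from a URL usually comes down to inspecting the scheme and hostname. The sketch below shows one way this could work for the three providers listed; `detect_provider` and the exact matching rules are assumptions made for illustration and are not taken from the library.

```python
from typing import Optional
from urllib.parse import urlparse


def detect_provider(url: str) -> Optional[str]:
    """Return "aws", "gcs", or "azure" based on the URL scheme or hostname."""
    parsed = urlparse(url)
    scheme = parsed.scheme.lower()
    host = (parsed.hostname or "").lower()

    # s3://bucket/key, or an HTTPS endpoint with an "s3..." label under amazonaws.com.
    if scheme == "s3":
        return "aws"
    if host.endswith(".amazonaws.com") and any(
        label.startswith("s3") for label in host.split(".")
    ):
        return "aws"

    # gs://bucket/key, or the storage.googleapis.com endpoint.
    if scheme == "gs" or host == "storage.googleapis.com" or host.endswith(
        ".storage.googleapis.com"
    ):
        return "gcs"

    # Azure Blob Storage endpoints look like <account>.blob.core.windows.net.
    if host.endswith(".blob.core.windows.net"):
        return "azure"

    return None
```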
Workflow Orchestration
- YAML-Based Pipelines: Declarative workflow configuration
- Conditional Branching: Dynamic execution based on data conditions
- Error Handling: Robust error recovery and retry mechanisms
- Async Execution: Parallel workflow execution
Active Learning & Sampling
- Intelligent Sampling: Diversity, uncertainty, anomaly-based sampling
- Stratified Sampling: Maintain class/label distributions
- Quality-Based Sampling: Focus on data that improves quality
- Active Learning: Iterative model improvement through targeted sampling
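Stratified sampling keeps class proportions intact by sampling the same fraction from every label group. This is a minimal standard-library sketch of that idea; `stratified_sample` is a hypothetical function and not the signature of the library's sampler.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List


def stratified_sample(
    rows: List[Dict[str, Any]], label_key: str, fraction: float, seed: int = 0
) -> List[Dict[str, Any]]:
    """Sample the same fraction from each label group so class ratios are preserved."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded for reproducible samples

    # Group rows by their label value.
    groups: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    sample: List[Dict[str, Any]] = []
    for members in groups.values():
        # Take at least one row per class so rare labels are not dropped.
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample
```

Forcing at least one row per class slightly over-represents very rare labels, which is often the desired trade-off for downstream training.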
Distributed Processing
- Ray Integration: Distributed computing with Ray
- Dask Support: Large dataset processing with Dask
- Parallel Pipelines: Concurrent data processing
- Scalable Loading: Memory-efficient large file processing
ML Pipeline Integration
- AutoML: Automated model selection and training
- Feature Store: Centralized feature management
- ML Data Preparation: One-click ML-ready data preparation
- Model Evaluation: Automated model performance assessment
Developer & User Tools
- CLI tool (openmlcrawler fetch ...)
- Config-driven pipelines (YAML/JSON configs)
- Local caching system
- Rate-limit + retry handling
- Logging + progress bars
- Dataset search: search_open_data("air quality")
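Retry handling for rate-limited APIs is typically built on exponential backoff: wait longer after each failed attempt, then give up after a fixed number of retries. The sketch below shows that pattern in isolation. `retry_with_backoff` and its parameters are hypothetical and do not describe the library's internal retry logic.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            # Out of retries: re-raise the last error to the caller.
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; production code would normally leave the default.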
Installation
# Install from PyPI
pip install openmlcrawler
# Or install from source
git clone https://github.com/krish567366/openmlcrawler.git
cd openmlcrawler
pip install -e .
Quick Start
Load Built-in Dataset
from openmlcrawler import load_dataset
# Weather data
df = load_dataset("weather", location="Delhi", days=7)
print(df.head())
# Twitter data
df = load_dataset("twitter", query="machine learning", max_results=50)
print(df.head())
# Reddit data
df = load_dataset("reddit", subreddit="MachineLearning", limit=25)
print(df.head())
# US Government data
df = load_dataset("us_gov", query="climate change", limit=20)
print(df.head())
Crawl Open Dataset
from openmlcrawler import crawl_and_prepare
# Crawl CSV dataset
df = crawl_and_prepare(
    source="https://datahub.io/core/covid-19/countries.csv",
    type="csv",
    label_column="Country"
)
print(f"Loaded {len(df)} records")
Search Open Data
from openmlcrawler import search_open_data
# Search for datasets
results = search_open_data("climate change")
for result in results:
    print(f"{result['title']}: {result['url']}")
ML-Ready Preparation
from openmlcrawler import prepare_for_ml
# Prepare for ML
X, y, X_train, X_test, y_train, y_test = prepare_for_ml(
    df,
    target_column="Confirmed",
    test_size=0.2,
    normalize=True
)
print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")
Export Dataset
from openmlcrawler import export_dataset
# Export to different formats
export_dataset(df, "data.csv", format="csv")
export_dataset(df, "data.json", format="json")
export_dataset(df, "data.parquet", format="parquet")
Advanced Usage Examples
Data Quality Assessment
from openmlcrawler import assess_data_quality
# Assess data quality
quality_report = assess_data_quality(df)
print(f"Completeness: {quality_report['completeness_score']:.2f}")
print(f"Missing rate: {quality_report['missing_rate']:.2%}")
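The report keys printed above suggest ratio-style metrics. As a rough guide to what such numbers usually mean, the sketch below computes them over plain Python rows using only the standard library. The definitions, the `quality_summary` name, and the `duplicate_rate` key are assumptions for illustration; the library may calculate its scores differently.

```python
from typing import Any, Dict, List


def quality_summary(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute missing rate, completeness, and duplicate rate for a list of row dicts."""
    total_cells = sum(len(row) for row in rows)
    # Treat None and empty strings as missing.
    missing = sum(1 for row in rows for v in row.values() if v is None or v == "")

    # Count rows whose full contents were already seen.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted((k, repr(v)) for k, v in row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    missing_rate = missing / total_cells if total_cells else 0.0
    return {
        "completeness_score": 1.0 - missing_rate,
        "missing_rate": missing_rate,
        "duplicate_rate": duplicates / len(rows) if rows else 0.0,
    }
```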
Privacy & PII Detection
from openmlcrawler import detect_pii, anonymize_data
# Detect PII
pii_report = detect_pii(df)
print("PII found in columns:", list(pii_report.keys()))
# Anonymize data
anonymized_df = anonymize_data(df, method='hash')
Smart Dataset Search
from openmlcrawler import SmartSearchEngine
# Index and search datasets
search_engine = SmartSearchEngine()
search_engine.index_dataset(df, "my_dataset")
# Search for similar datasets
results = search_engine.search_datasets("machine learning datasets")
for result in results:
    print(f"Found: {result['dataset_id']} (similarity: {result['similarity_score']:.3f})")
Cloud Storage Integration
from openmlcrawler import create_aws_connector, create_gcs_connector
# AWS S3
aws_conn = create_aws_connector(bucket_name="my-bucket")
url = aws_conn.upload_dataset(df, "my_dataset")
# Google Cloud Storage
gcs_conn = create_gcs_connector(bucket_name="my-bucket")
url = gcs_conn.upload_dataset(df, "my_dataset")
Workflow Orchestration
from openmlcrawler import execute_workflow_from_file
# Execute YAML workflow
result = execute_workflow_from_file("workflow.yaml", input_data=df)
print(f"Workflow status: {result['status']}")
Intelligent Sampling
from openmlcrawler import smart_sample_dataset
# Sample diverse data points
sampled_df = smart_sample_dataset(df, sample_size=1000, strategy='diversity')
# Uncertainty-based sampling for active learning
uncertainty_sample = smart_sample_dataset(
    df, sample_size=500, strategy='uncertainty', target_column='target'
)
ML Pipeline Integration
from openmlcrawler import prepare_dataset_for_ml, create_automl_pipeline
# Prepare data for ML
X_processed, y = prepare_dataset_for_ml(df, target_column='price')
# Run AutoML
automl = create_automl_pipeline()
results = automl.run_automl(X_processed, y)
print(f"Best model: {results['best_model'].__class__.__name__}")
External Data Platform Integration
from openmlcrawler import create_kaggle_connector, create_zenodo_connector
# Search Kaggle datasets
kaggle_conn = create_kaggle_connector()
results = kaggle_conn.search_datasets("machine learning")
# Search Zenodo research data
zenodo_conn = create_zenodo_connector()
results = zenodo_conn.search_datasets("climate data")
CLI Usage
# Load weather data
openmlcrawler load weather --location "Delhi" --days 7 --output weather.csv
# Crawl dataset
openmlcrawler crawl https://example.com/data.csv --type csv --output data.csv
# Search datasets
openmlcrawler search "climate change" --max-results 5
# Export dataset
openmlcrawler export data.csv --format json --output data.json
# NEW: Assess data quality
openmlcrawler quality data.csv --format text
# NEW: Check data privacy
openmlcrawler privacy data.csv --action detect
# NEW: Generate EDA report
openmlcrawler report data.csv --output report.html
# NEW: Smart search datasets
openmlcrawler smart-search "machine learning datasets"
# NEW: Sample dataset
openmlcrawler sample data.csv --method diversity --size 1000 --output sample.csv
# NEW: Prepare data for ML
openmlcrawler ml prepare data.csv --target price --output ml_data.csv
# NEW: Run AutoML
openmlcrawler ml automl data.csv --target price --output results.json
Configuration
Create a YAML configuration file for pipeline automation:
# config/pipeline.yaml
datasets:
  - name: weather_delhi
    connector: weather
    params:
      location: "Delhi"
      days: 7
    output: "weather_delhi.csv"
  - name: covid_data
    source: "https://datahub.io/core/covid-19/countries.csv"
    type: csv
    cleaning:
      remove_duplicates: true
      handle_missing: "drop"
    output: "covid_clean.csv"
Advanced Features
Async Crawling
import asyncio
from openmlcrawler.core.crawler import Crawler
async def crawl_multiple():
    crawler = Crawler()
    urls = ["url1", "url2", "url3"]
    # Schedule all crawls concurrently and wait for every result
    tasks = [crawler.crawl_async(url) for url in urls]
    results = await asyncio.gather(*tasks)
    for result in results:
        print(f"Crawled {len(result)} characters")
asyncio.run(crawl_multiple())
Custom Connectors
# openmlcrawler/connectors/custom.py
import pandas as pd
def load_custom_dataset(api_key, **kwargs):
    # Your custom connector logic goes here; return the result as a DataFrame
    return pd.DataFrame()
# Use it
from openmlcrawler.connectors.custom import load_custom_dataset
df = load_custom_dataset(api_key="your_key")
NLP Processing
from openmlcrawler.core.nlp import TextProcessor, extract_text_features
processor = TextProcessor()
# Process text column
df = processor.process_text_column(df, "description", lowercase=True, remove_stopwords=True)
# Extract features
df_features = extract_text_features(df, "text_column")
Real-time Data Monitoring
Monitor data streams with automated alerting, anomaly detection, and performance tracking.
from openmlcrawler.core.monitoring import create_real_time_monitor, setup_email_alerts
# Create monitor
monitor = create_real_time_monitor()
# Configure email alerts
email_config = setup_email_alerts(
    smtp_server="smtp.gmail.com",
    smtp_port=587,
    username="your-email@gmail.com",
    password="your-password",
    from_email="your-email@gmail.com",
    to_emails=["admin@example.com"]
)
monitor.configure_alerts(email_config=email_config)
# Set feature columns for anomaly detection
monitor.set_feature_columns(['feature1', 'feature2', 'feature3'])
# Start monitoring
monitor.start_monitoring()
# Process data points (data_stream is a placeholder for your own iterable of records)
for data_point in data_stream:
    result = monitor.process_data_point(data_point)
    print(f"Processed: {result}")
# Get monitoring status
status = monitor.get_monitoring_status()
print(f"Active alerts: {status['active_alerts']}")
# Stop monitoring
monitor.stop_monitoring()
**CLI Usage:**
```bash
# Start monitoring with email alerts
openmlcrawler monitor start --features col1 col2 col3 \
  --email-smtp smtp.gmail.com --email-user user@gmail.com \
  --email-pass password --email-from user@gmail.com \
  --email-to admin@example.com
# Start with Slack alerts
openmlcrawler monitor start --slack-webhook https://hooks.slack.com/... \
  --features feature1 feature2
# Get status
openmlcrawler monitor status
# View recent alerts
openmlcrawler monitor alerts --hours 24
```
Federated Learning
Train across multiple datasets without centralizing the data; model updates are combined with secure FedAvg aggregation. This suits healthcare, finance, and multi-organization collaborations.
import asyncio
import numpy as np
from openmlcrawler.core.federated import (
    create_federated_coordinator, create_federated_client,
    FederatedConfig, FederatedNode, load_federated_config
)
# Create federated configuration
config = FederatedConfig(
    coordinator_host="localhost",
    coordinator_port=8080,
    num_rounds=10,
    min_clients=3,
    max_clients=5,
    secure_aggregation=True
)
# Create coordinator
coordinator = create_federated_coordinator(config)
# Describe the participating nodes (hospitals, clinics, etc.)
nodes_config = [
    {
        "node_id": "hospital_a",
        "host": "192.168.1.100",
        "port": 8081,
        "dataset_info": {
            "name": "patient_data_a",
            "size": 10000,
            "features": ["age", "blood_pressure", "cholesterol"],
            "target": "heart_disease"
        }
    }
]
async def main():
    # Register each node with the coordinator
    for node_data in nodes_config:
        node = FederatedNode(**node_data)
        await coordinator.register_node(node)
    # Start federated training
    initial_model = {"weights": np.random.randn(10, 1), "bias": np.random.randn(1)}
    await coordinator.start_federated_training(initial_model)
    # Get training status
    status = coordinator.get_training_status()
    print(f"Round: {status['current_round']}/{status['total_rounds']}")
# await is only valid inside a coroutine, so run the steps through asyncio
asyncio.run(main())
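For readers unfamiliar with FedAvg, the aggregation step is a weighted average: each client's parameters count in proportion to the number of samples it trained on. The sketch below shows only that arithmetic on plain lists. `fed_avg` is a hypothetical helper, and the secure aggregation layer mentioned above is not modeled here.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]


def fed_avg(updates: List[Tuple[Params, int]]) -> Params:
    """Average client parameters, weighting each client by its sample count.

    `updates` is a list of (parameters, num_samples) pairs, one per client.
    """
    if not updates:
        raise ValueError("at least one client update is required")
    total = sum(num_samples for _, num_samples in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    reference = updates[0][0]
    averaged: Params = {}
    for key, values in reference.items():
        # Weighted mean of every coordinate across all clients.
        averaged[key] = [
            sum(params[key][i] * n for params, n in updates) / total
            for i in range(len(values))
        ]
    return averaged
```

With two clients holding 100 and 300 samples, the second client's parameters carry three times the weight of the first.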
**CLI Usage:**
```bash
# Start federated learning
openmlcrawler federated start --nodes config/nodes.json \
  --model logistic_regression --rounds 10 --min-clients 3
# Get federated learning status
openmlcrawler federated status
# Stop federated learning
openmlcrawler federated stop
```
Architecture
openmlcrawler/
├── __init__.py              # Main API with all advanced features
├── core/
│   ├── crawler.py           # Sync + async crawling
│   ├── parsers.py           # Data format parsers
│   ├── cleaners.py          # Data cleaning utilities
│   ├── schema.py            # Schema detection & ML prep
│   ├── exporter.py          # Export functions
│   ├── nlp.py               # NLP utilities
│   ├── utils.py             # Utilities & caching
│   ├── quality.py           # Data quality assessment
│   ├── privacy.py           # PII detection & anonymization
│   ├── reporting.py         # EDA reports & visualization
│   ├── search.py            # Smart search & discovery
│   ├── cloud.py             # Cloud storage integration
│   ├── workflow.py          # Workflow orchestration
│   ├── external.py          # External platform integration
│   ├── sampling.py          # Active learning & sampling
│   ├── distributed.py       # Distributed processing
│   └── ml_pipeline.py       # ML pipeline integration
├── connectors/              # Built-in connectors
│   ├── weather.py
│   ├── finance.py
│   └── ...
├── plugins/                 # Community plugins
├── datasets/                # Local cache
├── cli.py                   # Enhanced CLI with all commands
├── config/                  # Pipeline configs
└── ...
Contributing
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors
- Krishna Bajpai
- Vedanshi Gupta
Acknowledgments
- Open data providers and API maintainers
- Python data science community
- Contributors and users
Roadmap
Completed Features
- Plugin system for custom connectors
- Advanced NLP features (translation, NER)
- HuggingFace Datasets integration
- Cloud storage integration (S3, GCS, Azure)
- Data quality assessment and validation
- Privacy & PII detection/anonymization
- Smart search & discovery with AI
- Workflow orchestration with YAML
- Active learning & intelligent sampling
- Distributed processing (Ray, Dask)
- ML pipeline integration & AutoML
- External platform integration (Kaggle, Zenodo, DataCite)
- Enhanced CLI with all advanced commands
- Comprehensive data visualization & reporting
- Web UI for dataset exploration
- Streaming data processing
- Advanced ML model training pipelines
- Real-time data monitoring
- Social media connectors (Twitter/X, Reddit, Facebook)
- Government portal connectors (US, EU, UK, India)
- Federated learning support
- More built-in connectors (social media, government portals)
- Advanced time series analysis
- Automated data lineage tracking
- Integration with MLflow and other MLOps tools
- Support for graph databases and knowledge graphs