OpenML Crawler

Python 3.8+ | License: MIT | Documentation | DOI

A unified framework for crawling and preparing ML-ready datasets from various sources including web APIs, open data portals, and custom data sources.

Features

🔌 Connectors (Free APIs + Curated Data Sources)

  • Weather: Open-Meteo, OpenWeather, NOAA, Weather Underground
  • Social Media: Twitter/X API, Reddit API, Facebook Graph API, Instagram
  • Government Data: US data.gov, EU Open Data, UK data.gov.uk, Indian data.gov.in
  • Finance: Yahoo Finance, Alpha Vantage, FRED, CoinMarketCap
  • Knowledge: Wikipedia, Wikidata
  • News: NewsAPI, Google News, Bing News, NY Times
  • Social/Dev: GitHub, Stack Exchange
  • Health: CDC, WHO, PubMed, ClinicalTrials.gov
  • Agriculture: FAO, USDA, Government open data portals
  • Energy: EIA, IEA

🕷️ Generic Web Crawling

  • Support for CSV, JSON, XML, HTML parsing
  • PDF parsing with pdfplumber/PyPDF2
  • Async crawling with aiohttp
  • Headless browser mode with Playwright/Selenium
  • Auto format detection (mimetype, file extension)
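Auto format detection of this kind can be sketched as a two-step lookup: trust the Content-Type header when it names a known format, otherwise fall back to the file extension. The function name `detect_format` and both lookup tables below are hypothetical illustrations, not the library's own code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup tables for illustration; not taken from openmlcrawler.
MIME_TO_FORMAT = {
    "text/csv": "csv",
    "application/json": "json",
    "application/xml": "xml",
    "text/xml": "xml",
    "text/html": "html",
    "application/pdf": "pdf",
}
EXT_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".xml": "xml",
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
}


def detect_format(url, content_type=None):
    """Guess a parser name from a Content-Type header, then from the URL extension."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in MIME_TO_FORMAT:
            return MIME_TO_FORMAT[mime]
    # Only the path is inspected, so query strings do not confuse the extension check.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXT_TO_FORMAT.get(suffix, "unknown")
```

A generic header such as `application/octet-stream` falls through to the extension check, which is why the header is consulted first but not treated as final.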

🧹 Data Cleaning & Processing

  • Deduplication and anomaly detection
  • Missing value handling
  • Auto type detection (int, float, datetime, category)
  • Text cleaning (stopwords, stemming, lemmatization)
  • NLP utilities: language detection, translation, NER
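As a rough illustration of auto type detection, the sketch below tries progressively looser parsers on a column of raw strings. The function name `infer_column_type`, the `max_categories` threshold, and the rule used for "category" are assumptions made for this example; the library may use different heuristics.

```python
from datetime import datetime


def infer_column_type(values, max_categories=10):
    """Infer 'int', 'float', 'datetime', 'category', or 'text' for a list of strings."""
    # Missing values (None or blank strings) do not take part in the decision.
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "unknown"

    def all_parse(parser):
        # True only if every present value is accepted by the parser.
        for v in present:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    # Assumed rule: few distinct values, with at least one repeat, means categorical.
    distinct = set(present)
    if len(distinct) <= max_categories and len(distinct) < len(present):
        return "category"
    return "text"
```

The order matters: every integer string also parses as a float, so the stricter parser has to run first.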

🤖 ML-Ready Dataset Preparation

  • Schema detection (features/labels)
  • Feature/target separation (X, y)
  • Train/validation/test split
  • Normalization & encoding (optional)
  • Export to CSV, JSON, Parquet
  • Ready-made loaders for scikit-learn, PyTorch, TensorFlow
  • Streaming mode for big data (generator-based)
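Generator-based streaming generally means yielding fixed-size batches instead of loading a whole file into memory. The following is a minimal standard-library sketch of that idea; `stream_batches` is a hypothetical helper, not part of the openmlcrawler API.

```python
import csv
import io
from itertools import islice


def stream_batches(lines, batch_size):
    """Yield lists of row dicts from an iterable of CSV lines, batch_size rows at a time."""
    reader = csv.DictReader(lines)
    while True:
        # islice pulls at most batch_size rows, so only one batch is held in memory.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# An in-memory file stands in for a large CSV on disk.
sample = io.StringIO("city,temp\nDelhi,31\nMumbai,29\nPune,27\n")
sizes = [len(batch) for batch in stream_batches(sample, batch_size=2)]
print(sizes)  # [2, 1]
```

Because the function accepts any iterable of lines, the same code works with an open file handle.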

🔒 Advanced Data Quality & Privacy

  • Data Quality Assessment: Missing data analysis, duplicate detection, outlier analysis, trust scoring
  • PII Detection: Automatic detection of personally identifiable information
  • Data Anonymization: Hash, mask, redact methods for privacy protection
  • Compliance Checking: GDPR, HIPAA compliance validation
  • Quality Scoring: Automated data quality metrics and reporting
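The three anonymization methods named above (hash, mask, redact) differ in what they preserve. The sketch below shows one plausible reading of each using only the standard library; the helper names and the email pattern are assumptions for illustration and do not reflect how `anonymize_data` is implemented.

```python
import hashlib
import re

# A deliberately simple email pattern for illustration; real PII detection needs more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_value(value, salt=""):
    """Hash: replace a value with a salted SHA-256 digest, so equal inputs stay joinable."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_value(value, visible=2):
    """Mask: keep the last `visible` characters and replace the rest with asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def redact_emails(text, placeholder="[REDACTED]"):
    """Redact: remove every email address from free text."""
    return EMAIL_RE.sub(placeholder, text)
```

Hashing keeps rows linkable across tables, masking keeps a recognizable suffix for support workflows, and redaction removes the value entirely.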

📊 Smart Search & Discovery

  • AI-Powered Search: Vector embeddings and semantic matching
  • Dataset Indexing: Automatic indexing with metadata and quality metrics
  • Multi-Platform Search: Kaggle, Google Dataset Search, Zenodo, DataCite integration
  • Relevance Ranking: Similarity scoring and quality-based ranking
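Similarity-based ranking usually comes down to comparing a query vector against one vector per dataset. The sketch below uses a toy bag-of-words vector and cosine similarity to show the mechanics; a real semantic search would substitute learned embeddings, and none of these function names come from the library.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, catalog):
    """Return (score, name) pairs for each dataset description, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), name) for name, desc in catalog.items()]
    return sorted(scored, reverse=True)


catalog = {
    "air": "air quality measurements by city",
    "stocks": "daily stock prices",
}
print(rank("air quality", catalog)[0][1])  # air
```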

☁️ Cloud Integration

  • Multi-Provider Support: AWS S3, Google Cloud Storage, Azure Blob Storage
  • Unified API: Single interface for all cloud providers
  • Auto-Detection: Automatic provider detection from URLs
  • Batch Operations: Upload/download multiple files
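Provider detection from a URL can be done by inspecting the scheme and hostname. This is a minimal sketch under the assumption that `s3://` and `gs://` schemes and the standard public endpoints are the signals of interest; `detect_provider` is a hypothetical name and the library's actual rules may differ.

```python
from urllib.parse import urlparse


def detect_provider(url):
    """Map a storage URL to 'aws', 'gcs', 'azure', or 'unknown'."""
    parsed = urlparse(url)
    scheme = parsed.scheme.lower()
    host = (parsed.hostname or "").lower()
    # s3://bucket/key, or an HTTPS endpoint under amazonaws.com
    if scheme == "s3" or host.endswith(".amazonaws.com"):
        return "aws"
    # gs://bucket/object, or the public Google Cloud Storage endpoint
    if scheme == "gs" or host == "storage.googleapis.com":
        return "gcs"
    # Azure Blob Storage accounts live under <account>.blob.core.windows.net
    if host.endswith(".blob.core.windows.net"):
        return "azure"
    return "unknown"
```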

⚙️ Workflow Orchestration

  • YAML-Based Pipelines: Declarative workflow configuration
  • Conditional Branching: Dynamic execution based on data conditions
  • Error Handling: Robust error recovery and retry mechanisms
  • Async Execution: Parallel workflow execution

🎯 Active Learning & Sampling

  • Intelligent Sampling: Diversity, uncertainty, anomaly-based sampling
  • Stratified Sampling: Maintain class/label distributions
  • Quality-Based Sampling: Focus on data that improves quality
  • Active Learning: Iterative model improvement through targeted sampling
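To make the stratified option concrete: the idea is to draw the same fraction from every label group so that class proportions survive the sampling. The standard-library sketch below assumes rows are dicts and that labels are sortable; `stratified_sample` is an illustrative helper, not the implementation behind `smart_sample_dataset`.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, fraction, seed=0):
    """Sample the same fraction from each label group to preserve class proportions."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Keep at least one row per class so rare labels are never dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample
```

With 80 rows of one class and 20 of another, a fraction of 0.1 yields 8 and 2 rows respectively, the same 4:1 ratio as the input.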

🚀 Distributed Processing

  • Ray Integration: Distributed computing with Ray
  • Dask Support: Large dataset processing with Dask
  • Parallel Pipelines: Concurrent data processing
  • Scalable Loading: Memory-efficient large file processing

🧠 ML Pipeline Integration

  • AutoML: Automated model selection and training
  • Feature Store: Centralized feature management
  • ML Data Preparation: One-click ML-ready data preparation
  • Model Evaluation: Automated model performance assessment

🛠️ Developer & User Tools

  • CLI tool (openmlcrawler fetch ...)
  • Config-driven pipelines (YAML/JSON configs)
  • Local caching system
  • Rate-limit + retry handling
  • Logging + progress bars
  • Dataset search: search_open_data("air quality")
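Retry handling for rate-limited APIs is commonly built on exponential backoff: wait a little after the first failure and double the wait after each subsequent one. The sketch below shows that pattern in isolation; the `retry` helper and its parameters are hypothetical and are not the library's retry mechanism.

```python
import time


def retry(func, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and try again.

    At most `attempts` calls are made; the last exception is re-raised.
    The `sleep` argument is injectable so the delays can be tested without waiting.
    """
    for n in range(attempts):
        try:
            return func()
        except Exception:
            if n == attempts - 1:
                raise
            sleep(base_delay * 2 ** n)
```

In practice the `except` clause would be narrowed to the errors worth retrying, such as connection failures or HTTP 429 responses.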

Installation

# Install from PyPI
pip install openmlcrawler

# Or install from source
git clone https://github.com/krish567366/openmlcrawler.git
cd openmlcrawler
pip install -e .

Quick Start

Load Built-in Dataset

from openmlcrawler import load_dataset

# Weather data
df = load_dataset("weather", location="Delhi", days=7)
print(df.head())

# Twitter data
df = load_dataset("twitter", query="machine learning", max_results=50)
print(df.head())

# Reddit data
df = load_dataset("reddit", subreddit="MachineLearning", limit=25)
print(df.head())

# US Government data
df = load_dataset("us_gov", query="climate change", limit=20)
print(df.head())

Crawl Open Dataset

from openmlcrawler import crawl_and_prepare

# Crawl CSV dataset
df = crawl_and_prepare(
    source="https://datahub.io/core/covid-19/countries.csv",
    type="csv",
    label_column="Country"
)
print(f"Loaded {len(df)} records")

Search Open Data

from openmlcrawler import search_open_data

# Search for datasets
results = search_open_data("climate change")
for result in results:
    print(f"{result['title']}: {result['url']}")

ML-Ready Preparation

from openmlcrawler import prepare_for_ml

# Prepare for ML
X, y, X_train, X_test, y_train, y_test = prepare_for_ml(
    df,
    target_column="Confirmed",
    test_size=0.2,
    normalize=True
)

print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")

Export Dataset

from openmlcrawler import export_dataset

# Export to different formats
export_dataset(df, "data.csv", format="csv")
export_dataset(df, "data.json", format="json")
export_dataset(df, "data.parquet", format="parquet")

Advanced Usage Examples

Data Quality Assessment

from openmlcrawler import assess_data_quality

# Assess data quality
quality_report = assess_data_quality(df)
print(f"Completeness: {quality_report['completeness_score']:.2f}")
print(f"Missing rate: {quality_report['missing_rate']:.2%}")

Privacy & PII Detection

from openmlcrawler import detect_pii, anonymize_data

# Detect PII
pii_report = detect_pii(df)
print("PII found in columns:", list(pii_report.keys()))

# Anonymize data
anonymized_df = anonymize_data(df, method='hash')

Smart Dataset Search

from openmlcrawler import SmartSearchEngine

# Index and search datasets
search_engine = SmartSearchEngine()
search_engine.index_dataset(df, "my_dataset")

# Search for similar datasets
results = search_engine.search_datasets("machine learning datasets")
for result in results:
    print(f"Found: {result['dataset_id']} (similarity: {result['similarity_score']:.3f})")

Cloud Storage Integration

from openmlcrawler import create_aws_connector, create_gcs_connector

# AWS S3
aws_conn = create_aws_connector(bucket_name="my-bucket")
url = aws_conn.upload_dataset(df, "my_dataset")

# Google Cloud Storage
gcs_conn = create_gcs_connector(bucket_name="my-bucket")
url = gcs_conn.upload_dataset(df, "my_dataset")

Workflow Orchestration

from openmlcrawler import execute_workflow_from_file

# Execute YAML workflow
result = execute_workflow_from_file("workflow.yaml", input_data=df)
print(f"Workflow status: {result['status']}")

Intelligent Sampling

from openmlcrawler import smart_sample_dataset

# Sample diverse data points
sampled_df = smart_sample_dataset(df, sample_size=1000, strategy='diversity')

# Uncertainty-based sampling for active learning
uncertainty_sample = smart_sample_dataset(
    df, sample_size=500, strategy='uncertainty', target_column='target'
)

ML Pipeline Integration

from openmlcrawler import prepare_dataset_for_ml, create_automl_pipeline

# Prepare data for ML
X_processed, y = prepare_dataset_for_ml(df, target_column='price')

# Run AutoML
automl = create_automl_pipeline()
results = automl.run_automl(X_processed, y)
print(f"Best model: {results['best_model'].__class__.__name__}")

External Data Platform Integration

from openmlcrawler import create_kaggle_connector, create_zenodo_connector

# Search Kaggle datasets
kaggle_conn = create_kaggle_connector()
results = kaggle_conn.search_datasets("machine learning")

# Search Zenodo research data
zenodo_conn = create_zenodo_connector()
results = zenodo_conn.search_datasets("climate data")

CLI Usage

# Load weather data
openmlcrawler load weather --location "Delhi" --days 7 --output weather.csv

# Crawl dataset
openmlcrawler crawl https://example.com/data.csv --type csv --output data.csv

# Search datasets
openmlcrawler search "climate change" --max-results 5

# Export dataset
openmlcrawler export data.csv --format json --output data.json

# NEW: Assess data quality
openmlcrawler quality data.csv --format text

# NEW: Check data privacy
openmlcrawler privacy data.csv --action detect

# NEW: Generate EDA report
openmlcrawler report data.csv --output report.html

# NEW: Smart search datasets
openmlcrawler smart-search "machine learning datasets"

# NEW: Sample dataset
openmlcrawler sample data.csv --method diversity --size 1000 --output sample.csv

# NEW: Prepare data for ML
openmlcrawler ml prepare data.csv --target price --output ml_data.csv

# NEW: Run AutoML
openmlcrawler ml automl data.csv --target price --output results.json

Configuration

Create a YAML configuration file for pipeline automation:

# config/pipeline.yaml
datasets:
  - name: weather_delhi
    connector: weather
    params:
      location: "Delhi"
      days: 7
    output: "weather_delhi.csv"

  - name: covid_data
    source: "https://datahub.io/core/covid-19/countries.csv"
    type: csv
    cleaning:
      remove_duplicates: true
      handle_missing: "drop"
    output: "covid_clean.csv"

Advanced Features

Async Crawling

import asyncio
from openmlcrawler.core.crawler import Crawler

async def crawl_multiple():
    crawler = Crawler()
    urls = ["url1", "url2", "url3"]

    tasks = [crawler.crawl_async(url) for url in urls]
    results = await asyncio.gather(*tasks)

    for result in results:
        print(f"Crawled {len(result)} characters")

asyncio.run(crawl_multiple())

Custom Connectors

# openmlcrawler/connectors/custom.py
import pandas as pd

def load_custom_dataset(api_key, **kwargs):
    # Your custom connector logic goes here; return the result as a DataFrame
    return pd.DataFrame()

# Use it
from openmlcrawler.connectors.custom import load_custom_dataset
df = load_custom_dataset(api_key="your_key")

NLP Processing

from openmlcrawler.core.nlp import TextProcessor, extract_text_features

processor = TextProcessor()

# Process text column
df = processor.process_text_column(df, "description", lowercase=True, remove_stopwords=True)

# Extract features
df_features = extract_text_features(df, "text_column")

Real-time Data Monitoring

Monitor data streams with automated alerting, anomaly detection, and performance tracking.

from openmlcrawler.core.monitoring import create_real_time_monitor, setup_email_alerts

# Create monitor
monitor = create_real_time_monitor()

# Configure email alerts
email_config = setup_email_alerts(
    smtp_server="smtp.gmail.com",
    smtp_port=587,
    username="your-email@gmail.com",
    password="your-password",
    from_email="your-email@gmail.com",
    to_emails=["admin@example.com"]
)
monitor.configure_alerts(email_config=email_config)

# Set feature columns for anomaly detection
monitor.set_feature_columns(['feature1', 'feature2', 'feature3'])

# Start monitoring
monitor.start_monitoring()

# Process data points (data_stream is a placeholder for any iterable of incoming records)
for data_point in data_stream:
    result = monitor.process_data_point(data_point)
    print(f"Processed: {result}")

# Get monitoring status
status = monitor.get_monitoring_status()
print(f"Active alerts: {status['active_alerts']}")

# Stop monitoring
monitor.stop_monitoring()

CLI Usage:

# Start monitoring with email alerts
openmlcrawler monitor start --features col1 col2 col3 \
  --email-smtp smtp.gmail.com --email-user user@gmail.com \
  --email-pass password --email-from user@gmail.com \
  --email-to admin@example.com

# Start with Slack alerts
openmlcrawler monitor start --slack-webhook https://hooks.slack.com/... \
  --features feature1 feature2

# Get status
openmlcrawler monitor status

# View recent alerts
openmlcrawler monitor alerts --hours 24

Federated Learning

Train a shared model across multiple datasets without centralizing the data. This suits healthcare, finance, and multi-organization collaborations, with model updates combined through secure FedAvg aggregation.

import asyncio

import numpy as np

from openmlcrawler.core.federated import (
    create_federated_coordinator, create_federated_client,
    FederatedConfig, FederatedNode, load_federated_config
)

# Create federated configuration
config = FederatedConfig(
    coordinator_host="localhost",
    coordinator_port=8080,
    num_rounds=10,
    min_clients=3,
    max_clients=5,
    secure_aggregation=True
)

# Create coordinator
coordinator = create_federated_coordinator(config)

# Describe the participating nodes (hospitals, clinics, etc.)
nodes_config = [
    {
        "node_id": "hospital_a",
        "host": "192.168.1.100",
        "port": 8081,
        "dataset_info": {
            "name": "patient_data_a",
            "size": 10000,
            "features": ["age", "blood_pressure", "cholesterol"],
            "target": "heart_disease"
        }
    }
]

async def main():
    # Registration and training are awaited, so they must run inside a coroutine
    for node_data in nodes_config:
        node = FederatedNode(**node_data)
        await coordinator.register_node(node)

    # Start federated training
    initial_model = {"weights": np.random.randn(10, 1), "bias": np.random.randn(1)}
    await coordinator.start_federated_training(initial_model)

    # Get training status
    status = coordinator.get_training_status()
    print(f"Round: {status['current_round']}/{status['total_rounds']}")

asyncio.run(main())
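For readers unfamiliar with FedAvg, the aggregation step is a weighted average of the clients' model parameters, with each client weighted by the number of samples it trained on. The plain-Python sketch below illustrates only that arithmetic; `fed_avg` is a hypothetical helper, it leaves out secure aggregation, and it is not the coordinator's actual code.

```python
def fed_avg(client_weights, client_sizes):
    """Average per-client weight vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two clients: the second holds three times as much data, so it pulls the average harder.
weights = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
print(fed_avg(weights, sizes))  # [2.5, 3.5]
```

Only the parameter vectors and sample counts leave each client, which is what lets training proceed without moving the raw records.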

CLI Usage:

# Start federated learning
openmlcrawler federated start --nodes config/nodes.json \
  --model logistic_regression --rounds 10 --min-clients 3

# Get federated learning status
openmlcrawler federated status

# Stop federated learning
openmlcrawler federated stop

Architecture

openmlcrawler/
├── __init__.py          # Main API with all advanced features
├── core/
│   ├── crawler.py       # Sync + async crawling
│   ├── parsers.py       # Data format parsers
│   ├── cleaners.py      # Data cleaning utilities
│   ├── schema.py        # Schema detection & ML prep
│   ├── exporter.py      # Export functions
│   ├── nlp.py           # NLP utilities
│   ├── utils.py         # Utilities & caching
│   ├── quality.py       # Data quality assessment
│   ├── privacy.py       # PII detection & anonymization
│   ├── reporting.py     # EDA reports & visualization
│   ├── search.py        # Smart search & discovery
│   ├── cloud.py         # Cloud storage integration
│   ├── workflow.py      # Workflow orchestration
│   ├── external.py      # External platform integration
│   ├── sampling.py      # Active learning & sampling
│   ├── distributed.py   # Distributed processing
│   └── ml_pipeline.py   # ML pipeline integration
├── connectors/          # Built-in connectors
│   ├── weather.py
│   ├── finance.py
│   └── ...
├── plugins/             # Community plugins
├── datasets/            # Local cache
├── cli.py               # Enhanced CLI with all commands
├── config/              # Pipeline configs
└── ...

Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

  • Krishna Bajpai
  • Vedanshi Gupta

Acknowledgments

  • Open data providers and API maintainers
  • Python data science community
  • Contributors and users

Roadmap

✅ Completed Features

  • Plugin system for custom connectors
  • Advanced NLP features (translation, NER)
  • HuggingFace Datasets integration
  • Cloud storage integration (S3, GCS, Azure)
  • Data quality assessment and validation
  • Privacy & PII detection/anonymization
  • Smart search & discovery with AI
  • Workflow orchestration with YAML
  • Active learning & intelligent sampling
  • Distributed processing (Ray, Dask)
  • ML pipeline integration & AutoML
  • External platform integration (Kaggle, Zenodo, DataCite)
  • Enhanced CLI with all advanced commands
  • Comprehensive data visualization & reporting
  • Web UI for dataset exploration
  • Streaming data processing
  • Advanced ML model training pipelines
  • Real-time data monitoring
  • Social media connectors (Twitter/X, Reddit, Facebook)
  • Government portal connectors (US, EU, UK, India)
  • Federated learning support
  • More built-in connectors (social media, government portals)
  • Advanced time series analysis
  • Automated data lineage tracking
  • Integration with MLflow and other MLOps tools
  • Support for graph databases and knowledge graphs
