An AI-powered dataset search engine across Hugging Face, Kaggle, and Google Dataset Search

DataScouter

DataScouter is a Python library for searching datasets across Hugging Face, Kaggle, and Google Dataset Search using semantic similarity and fuzzy matching.

Features

  • Multi-Source Search: Fetch datasets from Hugging Face, Kaggle, and Google Dataset Search.
  • Semantic Search: Leverages NLP embeddings for improved relevance.
  • Fuzzy Matching: Enhances search results using string similarity techniques.
  • Optimized Performance: Uses PyTorch tensor operations for similarity calculations.
  • API Key Support: Accepts a Kaggle API key for authenticated access to Kaggle datasets.

Installation

Install DataScouter via pip:

pip install datascouter

Ensure you have the required dependencies:

pip install requests beautifulsoup4 transformers fuzzywuzzy torch numpy

Usage

Basic Search

from datascouter import DataScouter

# Initialize DataScouter
search_engine = DataScouter(kaggle_api_key="your_kaggle_api_key")

# Search for datasets related to 'climate change'
results = search_engine.search_datasets("climate change")

# Print results
for dataset in results:
    print(f"{dataset['source']}: {dataset['name']} - {dataset['description']} (Score: {dataset['score']})")

Using Environment Variable for Kaggle API Key

Instead of passing the API key directly, set it as an environment variable:

export KAGGLE_API_KEY="your_kaggle_api_key"

Then initialize DataScouter without the API key:

search_engine = DataScouter()
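
The fallback described above could look roughly like the following. This is a minimal sketch, not DataScouter's actual code: resolve_kaggle_key is a hypothetical helper, and the rule that an explicitly passed key takes precedence over KAGGLE_API_KEY is an assumption.

```python
import os


def resolve_kaggle_key(explicit_key=None, env=None):
    """Return the Kaggle API key, or None if none is configured.

    Hypothetical helper: prefers an explicitly passed key and falls back
    to the KAGGLE_API_KEY environment variable (assumed precedence).
    """
    # Allow a custom mapping so the lookup can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    if explicit_key:
        return explicit_key
    # Treat an empty variable the same as a missing one.
    return env.get("KAGGLE_API_KEY") or None
```

Calling resolve_kaggle_key() with no arguments reads the real environment; passing env={...} keeps the lookup easy to test.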

How It Works

  1. Fetches dataset metadata from Hugging Face, Kaggle, and Google Dataset Search.
  2. Encodes the query and each dataset's metadata as sentence-transformers embeddings and compares them for semantic similarity.
  3. Applies fuzzy matching to refine search results.
  4. Discards datasets that score below the relevance threshold and ranks the rest by score.
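
Steps 2 to 4 can be illustrated with a small standard-library sketch. It is not the library's implementation: a bag-of-words cosine similarity stands in for real sentence embeddings, difflib stands in for the fuzzy matcher, and the weighted sum (semantic_weight) is an assumed way of combining the two scores. The function names and sample records are hypothetical.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity of bag-of-words vectors (stand-in for embeddings)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def fuzzy_ratio(a, b):
    """Character-level similarity in [0, 1] (stand-in for fuzzy matching)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank(query, datasets, relevance_threshold=0.3, semantic_weight=0.7):
    """Score each dataset, drop those below the threshold, sort best first."""
    scored = []
    for ds in datasets:
        text = f"{ds['name']} {ds['description']}"
        # Assumed combination: weighted sum of semantic and fuzzy scores.
        score = (semantic_weight * cosine_similarity(query, text)
                 + (1 - semantic_weight) * fuzzy_ratio(query, ds["name"]))
        if score >= relevance_threshold:
            scored.append({**ds, "score": round(score, 3)})
    return sorted(scored, key=lambda d: d["score"], reverse=True)


# Illustrative records only; real metadata comes from the three sources.
sample = [
    {"source": "Kaggle", "name": "Climate Change Indicators",
     "description": "Global climate change data"},
    {"source": "Hugging Face", "name": "MNIST",
     "description": "Handwritten digits"},
]
ranked = rank("climate change", sample)
```

With this sample, only the climate dataset clears the default threshold of 0.3; the unrelated one is filtered out.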

API Reference

DataScouter

Initialization

DataScouter(kaggle_api_key=None, relevance_threshold=0.3, source_filter=None)
  • kaggle_api_key (str, optional): API key for accessing Kaggle datasets.
  • relevance_threshold (float, default=0.3): Minimum similarity score required for results.
  • source_filter (str, optional): Filter results by a specific source ("Hugging Face", "Kaggle", "Google Dataset Search").
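
To make the three parameters concrete, here is a hypothetical configuration object that mirrors the constructor signature. ScouterConfig, its validation rules (a threshold between 0 and 1, a source name from the documented list), and the accepts method are assumptions for illustration; the library may handle invalid values differently.

```python
from dataclasses import dataclass
from typing import Optional

# Source names as listed for source_filter above.
KNOWN_SOURCES = ("Hugging Face", "Kaggle", "Google Dataset Search")


@dataclass(frozen=True)
class ScouterConfig:
    """Hypothetical mirror of DataScouter's constructor arguments."""
    kaggle_api_key: Optional[str] = None
    relevance_threshold: float = 0.3
    source_filter: Optional[str] = None

    def __post_init__(self):
        # Assumed validation: similarity scores are treated as lying in [0, 1].
        if not 0.0 <= self.relevance_threshold <= 1.0:
            raise ValueError("relevance_threshold must be between 0 and 1")
        if self.source_filter is not None and self.source_filter not in KNOWN_SOURCES:
            raise ValueError(f"source_filter must be one of {KNOWN_SOURCES}")

    def accepts(self, dataset):
        """True if a scored dataset passes both the source and score filters."""
        if self.source_filter is not None and dataset["source"] != self.source_filter:
            return False
        return dataset["score"] >= self.relevance_threshold


kaggle_only = ScouterConfig(relevance_threshold=0.5, source_filter="Kaggle")
```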

search_datasets

search_datasets(query: str) -> list
  • query (str): Search term for datasets.
  • Returns: A list of matching datasets sorted by relevance.
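
The returned list can be post-processed like any list of dictionaries. The sketch below assumes each entry carries the source, name, and score keys used in the Basic Search example; best_per_source is a hypothetical helper and the sample data is illustrative, not real search output.

```python
def best_per_source(results):
    """Return the highest-scoring dataset for each source."""
    best = {}
    for ds in results:
        current = best.get(ds["source"])
        if current is None or ds["score"] > current["score"]:
            best[ds["source"]] = ds
    return best


# Illustrative stand-in for the list returned by search_datasets().
example_results = [
    {"source": "Kaggle", "name": "climate-a", "description": "", "score": 0.82},
    {"source": "Hugging Face", "name": "climate-b", "description": "", "score": 0.74},
    {"source": "Kaggle", "name": "climate-c", "description": "", "score": 0.55},
]
top = best_per_source(example_results)
```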

Contributing

We welcome contributions! To contribute:

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature-xyz.
  3. Commit changes: git commit -m 'Added new feature'.
  4. Push to your branch: git push origin feature-xyz.
  5. Submit a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Support

If you find DataScouter useful, consider starring the repository on GitHub. For issues, create a GitHub issue or contact us at your.email@example.com.
