An AI-powered dataset search engine across Hugging Face, Kaggle, and Google Dataset Search

DataScouter

DataScouter is a Python library for searching datasets across Hugging Face, Kaggle, and Google Dataset Search using semantic similarity and fuzzy matching.

Features

  • Multi-Source Search: Fetch datasets from Hugging Face, Kaggle, and Google Dataset Search.
  • Semantic Search: Leverages NLP embeddings for improved relevance.
  • Fuzzy Matching: Enhances search results using string similarity techniques.
  • Optimized Performance: Utilizes PyTorch for efficient similarity calculations.
  • API Key Support: Authenticates Kaggle dataset access with an API key, passed directly or read from an environment variable.

Installation

Install DataScouter via pip:

pip install datascouter

If pip does not pull in the required dependencies automatically, install them manually:

pip install requests beautifulsoup4 transformers fuzzywuzzy torch numpy

Usage

Basic Search

from datascouter import DataScouter

# Initialize DataScouter
search_engine = DataScouter(kaggle_api_key="your_kaggle_api_key")

# Search for datasets related to 'climate change'
results = search_engine.search_datasets("climate change")

# Print results
for dataset in results:
    print(f"{dataset['source']}: {dataset['name']} - {dataset['description']} (Score: {dataset['score']})")

Using Environment Variable for Kaggle API Key

Instead of passing the API key directly, set it as an environment variable:

export KAGGLE_API_KEY="your_kaggle_api_key"

Then initialize DataScouter without the API key:

search_engine = DataScouter()
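
The fallback from an explicit argument to the environment variable could look roughly like the sketch below. This is an illustration, not DataScouter's actual code: the helper name resolve_kaggle_key is hypothetical, and only the KAGGLE_API_KEY variable name comes from the text above.

```python
import os
from typing import Optional


def resolve_kaggle_key(explicit_key: Optional[str] = None) -> Optional[str]:
    """Return the explicit key if given, else KAGGLE_API_KEY, else None.

    Hypothetical helper for illustration; not part of the DataScouter API.
    """
    if explicit_key:
        return explicit_key
    # An unset or empty environment variable is treated as "no key".
    return os.environ.get("KAGGLE_API_KEY") or None
```

With this ordering, a key passed in code always wins over the environment, which keeps one-off overrides simple.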

How It Works

  1. Fetches dataset metadata from Hugging Face, Kaggle, and Google Dataset Search.
  2. Uses sentence-transformers to generate embeddings for semantic similarity.
  3. Applies fuzzy matching to refine search results.
  4. Drops datasets that score below the relevance threshold and ranks the rest by score.
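
The scoring, filtering, and ranking steps above can be sketched with the standard library alone. This is not DataScouter's implementation: a bag-of-words cosine similarity stands in for the transformer embeddings, difflib stands in for the fuzzy-matching library, and the function names, the weighted blend, and the semantic_weight parameter are all assumptions made for illustration. Only the 0.3 default threshold and the result keys come from this README.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def _tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity of word-count vectors (stand-in for embeddings)."""
    va, vb = Counter(_tokens(a)), Counter(_tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuzzy_ratio(a, b):
    """Character-level similarity in [0, 1] (stand-in for fuzzy matching)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank(query, datasets, relevance_threshold=0.3, semantic_weight=0.7):
    """Score each dataset, drop those below the threshold, sort best first."""
    scored = []
    for d in datasets:
        text = f"{d['name']} {d['description']}"
        # Assumed blend: weighted sum of the semantic and fuzzy scores.
        score = (semantic_weight * cosine_similarity(query, text)
                 + (1 - semantic_weight) * fuzzy_ratio(query, d["name"]))
        if score >= relevance_threshold:
            scored.append({**d, "score": round(score, 3)})
    return sorted(scored, key=lambda d: d["score"], reverse=True)
```

Each input item is a dict with at least 'name' and 'description' keys. Raising relevance_threshold trades recall for precision, which is the same knob the constructor exposes.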

API Reference

DataScouter

Initialization

DataScouter(kaggle_api_key=None, relevance_threshold=0.3, source_filter=None)
  • kaggle_api_key (str, optional): API key for accessing Kaggle datasets.
  • relevance_threshold (float, default=0.3): Minimum similarity score required for results.
  • source_filter (str, optional): Filter results by a specific source ("Hugging Face", "Kaggle", "Google Dataset Search").

search_datasets

search_datasets(query: str) -> list
  • query (str): Search term for datasets.
  • Returns: A list of matching datasets sorted by relevance.
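
Because the combined list mixes all three sources, it can be handy to regroup it. The sketch below assumes each returned item is a dict with 'source' and 'score' keys, as the Basic Search example suggests; the helper top_by_source and its per_source parameter are hypothetical and not part of the library.

```python
from collections import defaultdict


def top_by_source(results, per_source=3):
    """Group result dicts by 'source' and keep the best-scoring ones of each."""
    grouped = defaultdict(list)
    for item in results:
        grouped[item["source"]].append(item)
    return {
        source: sorted(items, key=lambda d: d["score"], reverse=True)[:per_source]
        for source, items in grouped.items()
    }
```

A call such as top_by_source(search_engine.search_datasets("climate change")) would then yield a dict keyed by source name.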

Contributing

We welcome contributions! To contribute:

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature-xyz.
  3. Commit changes: git commit -m 'Added new feature'.
  4. Push to your branch: git push origin feature-xyz.
  5. Submit a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Support

If you find DataScouter useful, consider starring the repository on GitHub. For issues, create a GitHub issue or contact us at your.email@example.com.
