An AI-powered dataset search engine across Hugging Face, Kaggle, and Google Dataset Search

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

DataScouter

DataScouter is a Python library for searching datasets across Hugging Face, Kaggle, and Google Dataset Search using semantic similarity and fuzzy matching.

Features

- Multi-Source Search: Fetches datasets from Hugging Face, Kaggle, and Google Dataset Search.
- Semantic Search: Uses NLP embeddings to rank results by meaning rather than exact keyword overlap.
- Fuzzy Matching: Refines results with string-similarity scoring.
- Optimized Performance: Uses PyTorch for the similarity calculations.
- API Key Support: Accepts a Kaggle API key for authenticated dataset access.
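To make the fuzzy matching feature concrete, here is a minimal sketch of string-similarity scoring. It uses `difflib.SequenceMatcher` from the standard library as a stand-in; the dependency list names `fuzzywuzzy`, and DataScouter's own scoring may differ. The helper name `fuzzy_score` and the sample dataset names are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_score(query: str, candidate: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings.

    Hypothetical helper: a stdlib stand-in, not DataScouter's internals.
    """
    # Normalize case and surrounding whitespace so they do not affect the score.
    a = query.lower().strip()
    b = candidate.lower().strip()
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # Made-up dataset names, including one with typos.
    names = ["Climate Change Indicators", "Global Climte Chnge Data", "Movie Reviews"]
    for name in names:
        print(f"{name}: {fuzzy_score('climate change', name):.2f}")
```

A ratio near 1 means the strings are nearly identical, so a misspelled name can still score well against the query even though an exact keyword match would miss it.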

Installation

Install DataScouter via pip. The release on this page is published under the distribution name dataset-search-engine:

pip install dataset-search-engine

Ensure you have the required dependencies:

pip install requests beautifulsoup4 transformers fuzzywuzzy torch numpy

Usage

Basic Search

from datascouter import DataScouter

# Initialize DataScouter
search_engine = DataScouter(kaggle_api_key="your_kaggle_api_key")

# Search for datasets related to 'climate change'
results = search_engine.search_datasets("climate change")

# Print results
for dataset in results:
    print(f"{dataset['source']}: {dataset['name']} - {dataset['description']} (Score: {dataset['score']})")

Using Environment Variable for Kaggle API Key

Instead of passing the API key directly, set it as an environment variable:

export KAGGLE_API_KEY="your_kaggle_api_key"

Then initialize DataScouter without the API key:

search_engine = DataScouter()
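The lookup order can be pictured with a small sketch. The helper `resolve_kaggle_key` is hypothetical, and the precedence it shows (an explicit argument wins over the `KAGGLE_API_KEY` environment variable) is an assumption rather than documented DataScouter behavior.

```python
import os
from typing import Optional


def resolve_kaggle_key(explicit_key: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly passed key, else fall back to KAGGLE_API_KEY.

    Hypothetical helper illustrating one plausible lookup order;
    it is not taken from DataScouter's source.
    """
    if explicit_key:
        return explicit_key
    # os.environ.get returns None when the variable is unset.
    return os.environ.get("KAGGLE_API_KEY")


if __name__ == "__main__":
    key = resolve_kaggle_key()
    print("Kaggle key found" if key else "No Kaggle key configured")
```

Keeping the key in the environment avoids committing it to source control alongside the search script.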

How It Works

1. Fetches dataset metadata from Hugging Face, Kaggle, and Google Dataset Search.
2. Uses sentence-transformers to generate embeddings for semantic similarity.
3. Applies fuzzy matching to refine search results.
4. Drops datasets that score below the relevance threshold and ranks the rest by relevance.
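The scoring, filtering, and ranking steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not DataScouter's implementation: word-count vectors stand in for sentence-transformers embeddings, `difflib` stands in for the fuzzy matcher, and the equal-weight blend of the two scores is invented for the example. The names `embed`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: str, datasets: list, threshold: float = 0.3,
         fuzzy_weight: float = 0.5) -> list:
    """Score each dataset, drop those below the threshold, sort best first."""
    q_vec = embed(query)
    scored = []
    for ds in datasets:
        semantic = cosine(q_vec, embed(f"{ds['name']} {ds['description']}"))
        fuzzy = SequenceMatcher(None, query.lower(), ds["name"].lower()).ratio()
        # Assumed blend: a weighted average of the two scores.
        score = (1 - fuzzy_weight) * semantic + fuzzy_weight * fuzzy
        if score >= threshold:
            scored.append({**ds, "score": round(score, 3)})
    return sorted(scored, key=lambda d: d["score"], reverse=True)


if __name__ == "__main__":
    # Made-up metadata in the shape used by the Basic Search example.
    sample = [
        {"source": "Kaggle", "name": "Climate Change Indicators",
         "description": "Global temperature and climate change records"},
        {"source": "Hugging Face", "name": "Movie Reviews",
         "description": "Sentiment labels for film reviews"},
    ]
    for item in rank("climate change", sample):
        print(item["source"], item["name"], item["score"])
```

With real embeddings the semantic score would also reward paraphrases that share no words with the query, which word counts cannot capture.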

API Reference

`DataScouter`

Initialization

DataScouter(kaggle_api_key=None, relevance_threshold=0.3, source_filter=None)

kaggle_api_key (str, optional): API key for accessing Kaggle datasets.
relevance_threshold (float, default=0.3): Minimum similarity score required for results.
source_filter (str, optional): Filter results by a specific source ("Hugging Face", "Kaggle", "Google Dataset Search").
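As a rough picture of how these two options narrow a result list, the sketch below applies them to plain dictionaries shaped like the entries printed in Basic Search. The function `filter_results` is hypothetical, and whether DataScouter filters before or after scoring is not stated here.

```python
from typing import Optional


def filter_results(results: list, relevance_threshold: float = 0.3,
                   source_filter: Optional[str] = None) -> list:
    """Keep results at or above the threshold and, if set, from one source.

    Hypothetical helper mirroring the constructor options; not library code.
    """
    return [
        r for r in results
        if r["score"] >= relevance_threshold
        and (source_filter is None or r["source"] == source_filter)
    ]


if __name__ == "__main__":
    # Made-up results for illustration only.
    results = [
        {"source": "Kaggle", "name": "a", "description": "", "score": 0.9},
        {"source": "Hugging Face", "name": "b", "description": "", "score": 0.4},
        {"source": "Kaggle", "name": "c", "description": "", "score": 0.1},
    ]
    print(filter_results(results, source_filter="Kaggle"))
```

Raising `relevance_threshold` trades recall for precision: fewer results come back, but each one scored higher against the query.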

`search_datasets`

search_datasets(query: str) -> list

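query (str): Search term for datasets.
Returns: A list of matching datasets sorted by relevance. As the Basic Search example shows, each item is a dict with the keys source, name, description, and score.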
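Because the return value is a plain list, ordinary Python is enough for post-processing. The sketch below keeps the best-scoring entries per source; it assumes each item is a dict with `source` and `score` keys, as in the Basic Search example, and `top_per_source` is a hypothetical helper rather than part of the library.

```python
from collections import defaultdict


def top_per_source(results: list, k: int = 1) -> dict:
    """Group results by source and keep the k highest-scoring items in each."""
    grouped = defaultdict(list)
    for item in results:
        grouped[item["source"]].append(item)
    return {
        source: sorted(items, key=lambda d: d["score"], reverse=True)[:k]
        for source, items in grouped.items()
    }


if __name__ == "__main__":
    # Made-up results for illustration only.
    results = [
        {"source": "Kaggle", "name": "a", "description": "", "score": 0.5},
        {"source": "Kaggle", "name": "b", "description": "", "score": 0.8},
        {"source": "Hugging Face", "name": "c", "description": "", "score": 0.6},
    ]
    for source, items in top_per_source(results).items():
        print(source, [d["name"] for d in items])
```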

Contributing

We welcome contributions! To contribute:

1. Fork the repository.
2. Create a feature branch: `git checkout -b feature-xyz`.
3. Commit changes: `git commit -m 'Added new feature'`.
4. Push to your branch: `git push origin feature-xyz`.
5. Submit a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Support

If you find DataScouter useful, consider starring the repository on GitHub. For issues, create a GitHub issue or contact us at your.email@example.com.

Version 1.0.0 (Mar 15, 2025)

Download files

Source Distribution

dataset-search-engine-1.0.0.tar.gz (4.3 kB view details)

Uploaded Mar 15, 2025 Source

File details

Details for the file dataset-search-engine-1.0.0.tar.gz.

File metadata

Download URL: dataset-search-engine-1.0.0.tar.gz
Upload date: Mar 15, 2025
Size: 4.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for dataset-search-engine-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`290ce591d2cdb60cf4cd588d488c952214d264227d36fc58df572bdedfd7981f`
MD5	`d8d21e906c372058cb404c86448382fb`
BLAKE2b-256	`2baf83d63eae0056dbc4d0ef3eb53291b2885d5af49701325ae3b1fa0cbf6773`
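A downloaded archive can be checked against the SHA256 digest listed above with `hashlib`. This is a minimal sketch: the helper names `read_chunks`, `sha256_hex`, and `verify` are hypothetical, and the local path in the demo assumes the file was saved to the current directory.

```python
import hashlib
import hmac
import os

# SHA256 digest listed for dataset-search-engine-1.0.0.tar.gz.
EXPECTED_SHA256 = "290ce591d2cdb60cf4cd588d488c952214d264227d36fc58df572bdedfd7981f"


def read_chunks(path: str, chunk_size: int = 65536):
    """Yield a file's contents in fixed-size binary chunks."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            yield chunk


def sha256_hex(chunks) -> str:
    """Return the hex SHA256 digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def verify(chunks, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the computed digest with the expected one."""
    return hmac.compare_digest(sha256_hex(chunks), expected.lower())


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was saved.
    path = "dataset-search-engine-1.0.0.tar.gz"
    if os.path.exists(path):
        print("hash OK" if verify(read_chunks(path)) else "hash mismatch")
    else:
        print(f"{path} not found")
```

Reading in chunks keeps memory use flat regardless of archive size, although this particular file is only 4.3 kB.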
