
Semantic Scholar Dataset API Wrapper

A Python wrapper for the Semantic Scholar Dataset API that provides easy access to academic papers, citations, and related data.

Description

This library provides a simple interface to interact with the Semantic Scholar Dataset API, allowing you to:

  • Access various academic datasets (papers, citations, authors, etc.)
  • Download dataset releases
  • Get diffs between releases
  • Manage large dataset downloads efficiently

Installation

pip install semanticscholar-datasetapi

Requirements

  • Python 3.7+
  • requests

Basic Usage

from semanticscholar_datasetapi import SemanticScholarDataset
import os

# Initialize the client with your API key
api_key = os.getenv("SEMANTIC_SCHOLAR_API_KEY")
client = SemanticScholarDataset(api_key=api_key)

# List available datasets
datasets = client.get_available_datasets()
print(datasets)

# Get latest release information
releases = client.get_available_releases()
print(releases)

# Download latest release of a specific dataset
client.download_latest_release(datasetname="papers", save_dir="downloads")

# Get diffs between releases
client.download_diffs(
    start_release_id="2024-12-31",
    end_release_id="latest",
    datasetname="papers",
    save_dir="diffs"
)

Available Datasets

The API provides access to the following datasets:

  • abstracts
  • authors
  • citations
  • embeddings-specter_v1
  • embeddings-specter_v2
  • paper-ids
  • papers
  • publication-venues
  • s2orc
  • tldrs

API Reference

Main Methods

SemanticScholarDataset(api_key: Optional[str] = None)

Initialize the API client with an optional API key.

  • api_key: API key for accessing the Semantic Scholar Dataset API. Required for most operations.

get_available_releases() -> list

Get a list of all available dataset releases.

get_available_datasets() -> list

Get a list of all available datasets.

get_download_urls_from_release(datasetname: Optional[str] = None, release_id: str = "latest") -> Dict[str, Any]

Get download URLs for a specific release of a dataset.

  • datasetname: Name of the dataset to get URLs for
  • release_id: ID of the release (defaults to "latest")

get_download_urls_from_diffs(start_release_id: str, end_release_id: str = "latest", datasetname: Optional[str] = None) -> Dict[str, Any]

Get download URLs for differences between two releases.

  • start_release_id: Starting release ID
  • end_release_id: Ending release ID (defaults to "latest")
  • datasetname: Name of the dataset to get diff URLs for

download_latest_release(datasetname: Optional[str] = None, save_dir: Optional[str] = None, download_range: Optional[range] = None) -> None

Download the latest release of a specific dataset.

  • datasetname: Name of the dataset to download
  • save_dir: Directory to save downloaded files (defaults to current directory)
  • download_range: Optional range of indices to download from the list of files
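
The download_range parameter can be thought of as selecting a subset of the release's files by index. The sketch below illustrates the idea with a stand-in file list; the real file names come from the API, not from this code.

```python
# Stand-in for a release's file list (hypothetical names, for illustration only).
files = [f"part_{i}.json.gz" for i in range(10)]

# Download only files at indices 2, 3, and 4.
download_range = range(2, 5)
selected = [files[i] for i in download_range]
print(selected)  # ['part_2.json.gz', 'part_3.json.gz', 'part_4.json.gz']
```

In practice you would pass the range straight to the client, e.g. client.download_latest_release(datasetname="papers", save_dir="downloads", download_range=range(0, 3)) to fetch only the first three files.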

download_past_release(release_id: str, datasetname: Optional[str] = None, save_dir: Optional[str] = None, download_range: Optional[range] = None) -> None

Download a specific past release of a dataset.

  • release_id: ID of the release to download
  • datasetname: Name of the dataset to download
  • save_dir: Directory to save downloaded files (defaults to current directory)
  • download_range: Optional range of indices to download from the list of files

download_diffs(start_release_id: str, end_release_id: str, datasetname: Optional[str] = None, save_dir: Optional[str] = None) -> None

Download the differences between two releases of a dataset.

  • start_release_id: Starting release ID
  • end_release_id: Ending release ID
  • datasetname: Name of the dataset to download diffs for
  • save_dir: Directory to save downloaded files (defaults to current directory)

Error Handling

The library includes comprehensive error handling for:

  • Invalid dataset names
  • Missing API keys
  • Network errors
  • Invalid release IDs
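
A defensive wrapper around a download call might look like the sketch below. The exact exception types the library raises are not documented here, so ValueError (for invalid names or release IDs) and OSError (for network and filesystem failures) are assumptions.

```python
def safe_download(client, datasetname, save_dir="downloads"):
    """Hypothetical wrapper; the exception types caught are assumptions,
    not the library's documented behavior."""
    try:
        client.download_latest_release(datasetname=datasetname, save_dir=save_dir)
        return "ok"
    except ValueError as exc:   # assumed: invalid dataset name or release ID
        return f"invalid input: {exc}"
    except OSError as exc:      # assumed: network or filesystem error
        return f"I/O error: {exc}"
```

Catching specific exceptions rather than a bare Exception keeps genuine bugs (typos, attribute errors) visible instead of silently swallowing them.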

File Naming

Downloaded files follow these naming patterns:

  • Latest release: {datasetname}_latest_{index}.json.gz
  • Past release: {datasetname}_{release_id}_{index}.json.gz
  • Diffs:
    • Updates: {datasetname}_{from_release}_{to_release}_update_{index}.json.gz
    • Deletes: {datasetname}_{from_release}_{to_release}_delete_{index}.json.gz
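
The naming patterns above can be expressed as small helpers. These are illustrative only (the library names files itself; the release IDs below are made up):

```python
def release_filename(datasetname, index, release_id="latest"):
    # Mirrors the release naming pattern documented above.
    return f"{datasetname}_{release_id}_{index}.json.gz"

def diff_filename(datasetname, from_release, to_release, kind, index):
    # kind is "update" or "delete", per the diff patterns above.
    return f"{datasetname}_{from_release}_{to_release}_{kind}_{index}.json.gz"

print(release_filename("papers", 0))
print(diff_filename("papers", "2024-12-31", "2025-01-07", "update", 3))
```

This predictability makes it easy to glob for a dataset's files after a download, e.g. glob.glob("downloads/papers_latest_*.json.gz").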

Environment Variables

  • SEMANTIC_SCHOLAR_API_KEY: Your API key for the Semantic Scholar Dataset API
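
As shown in the Basic Usage section, the key is read from the environment with os.getenv, so it can be set once in the shell before running your script (replace the placeholder with your actual key):

```shell
# Set the key for the current shell session.
export SEMANTIC_SCHOLAR_API_KEY="your-key-here"
```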

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

  • Semantic Scholar for providing the Dataset API
  • The academic community for maintaining and contributing to the datasets
