A library for fetching and normalizing academic papers from various providers (Elsevier, arXiv, PRISM, etc.)

Project description

1. Project Overview

PapEx is a powerful Python library designed to streamline the process of retrieving and standardizing academic paper metadata from diverse sources. Instead of writing custom logic for each provider (Elsevier, arXiv, IEEE, etc.), PapEx offers a unified, normalized interface for fetching data.

This means you can focus on data analysis, not data cleaning.

Key Features

  • Unified API: Use a single, consistent set of commands to query multiple academic providers.
  • Normalized Output: All fetched paper metadata (titles, authors, abstracts, DOIs, publication dates) is mapped to a standard data structure, eliminating inconsistencies between sources.
  • Multi-Provider Support: Currently supports and normalizes data from:
    • Elsevier
    • arXiv
    • IEEE
    • PRISM
  • Flexible Querying: Search papers by DOI, title, author, or specific metadata fields.
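To make the "one interface, many sources" idea concrete, here is a minimal, self-contained sketch. The names below (`search`, the fake provider classes, the result fields) are illustrative assumptions, not PapEx's documented API; check the package itself for the real entry points.

```python
# Illustrative sketch of a unified query interface over several providers.
# All names here are assumptions for illustration, not PapEx's actual API.
def search(query, providers):
    """Fan a single query out to each provider and collect the hits."""
    results = []
    for provider in providers:
        results.extend(provider.search(query))
    return results

class FakeArxiv:
    def search(self, query):
        # A real provider would call the arXiv API; stubbed here.
        return [{"title": f"{query} survey", "source": "arxiv"}]

class FakeElsevier:
    def search(self, query):
        # A real provider would call the Elsevier API; stubbed here.
        return [{"title": f"{query} review", "source": "elsevier"}]

hits = search("graph neural networks", [FakeArxiv(), FakeElsevier()])
print([h["source"] for h in hits])  # -> ['arxiv', 'elsevier']
```

The point of the pattern: callers never branch on the source, so adding a provider means adding one class, not touching the analysis code.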

🛠️ Installation

Use pip to install the library:

pip install papex

High-level architecture:

  • Modular extractors for different data providers (Google Scholar, Scopus, Elsevier)
  • Shared abstraction layer for paper data
  • Paper extraction by query, retrieving metadata (author, title, etc.); section-level scraping (abstract/summary only, whole document, etc.) is still TODO
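One plausible shape for the shared paper abstraction is a plain dataclass with empty-string defaults for fields a given source cannot supply. The field names below are illustrative guesses, not the package's actual schema.

```python
# Sketch of a shared "Paper" record that every provider maps into.
# Field names are illustrative assumptions; check PapEx for the real schema.
from dataclasses import dataclass
from typing import List

@dataclass
class Paper:
    title: str
    authors: List[str]
    doi: str = ""
    abstract: str = ""
    published: str = ""   # ISO-8601 date string, e.g. "2017-06-12"
    source: str = ""      # which provider the record came from

p = Paper(title="Attention Is All You Need",
          authors=["Vaswani, A."], source="arxiv")
print(p.doi)  # -> empty string: unfilled fields default rather than error
```

Defaulting missing fields to empty strings (rather than omitting them) is what lets downstream code treat records from every source uniformly.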

2. Getting Started

Prerequisites

  • Python 3.10+
  • pip package manager
  • API keys for SerpAPI, Scopus (configured for pybliometrics), Elsevier, and IEEE.

Installation

From PyPI:

pip install papex

For development from source, install the pinned dependencies:

pip install -r review/requirements.txt
pip install -r review/requirements_dev.txt

(You may need to manually install proprietary/specialized libraries referenced in the code, such as serpapi, pybliometrics, or a specific Elsevier client. See code comments for guidance.)

Running Tests

  • Tests live in the tests/ directory; run them with your preferred runner (e.g. pytest tests/).

Important config files:

  • review/requirements.txt — Python dependencies
  • .env / local API key config — not shipped with the repository, but referenced in the code
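Since the key file itself is not shipped, you will need to create one. A possible .env layout is sketched below; the variable names are guesses for illustration, so check the code for the exact names it actually reads.

```shell
# Hypothetical .env layout -- variable names are assumptions, not confirmed.
SERPAPI_API_KEY=your-serpapi-key
SCOPUS_API_KEY=your-scopus-key
ELSEVIER_API_KEY=your-elsevier-key
IEEE_API_KEY=your-ieee-key
```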

3. How PapEx works

PapEx uses two main components to achieve normalization:

  • Provider — An object dedicated to communicating with a single source (e.g., arXivProvider). It handles API-specific request formatting and initial data retrieval.
  • Adapter/Normalizer — An object that takes the raw data from a Provider and transforms it into the standard Paper object, ensuring consistent field names and formats.
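The Provider/Adapter split described above can be sketched as follows. Everything here is illustrative: PapEx's real classes, method names, and raw record shapes may differ, and the provider is stubbed rather than calling the live arXiv API.

```python
# Sketch of the Provider / Adapter split. Names and record shapes are
# illustrative assumptions, not PapEx's actual classes.

class ArxivProvider:
    """Talks to a single source; returns that source's raw record format."""
    def fetch(self, arxiv_id):
        # A real provider would query the arXiv API; stubbed for illustration.
        return {"id": arxiv_id,
                "title": "Attention Is All You Need",
                "authors": [{"name": "Vaswani, A."}],
                "published": "2017-06-12T00:00:00Z"}

class ArxivAdapter:
    """Maps the provider's raw fields onto the standard schema."""
    def normalize(self, raw):
        return {"title": raw["title"],
                "authors": [a["name"] for a in raw["authors"]],
                "doi": "",                           # arXiv records may lack a DOI
                "published": raw["published"][:10]}  # keep only the date part

raw = ArxivProvider().fetch("1706.03762")
paper = ArxivAdapter().normalize(raw)
print(paper["published"])  # -> 2017-06-12
```

Keeping retrieval and normalization in separate objects means a provider's API can change without touching the schema mapping, and vice versa.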

4. Key Concepts

  • Paper extraction: Modular, provider-driven approach to acquiring structured paper metadata
  • Abstraction layer: Interfaces and base classes minimize duplication/spaghetti code
  • LLM-based filtering: Large language model inference can handle ambiguous or subjective filtering. This is not included in the package; for a literature review, I suggest filtering down to the relevant journals before starting paper retrieval, to avoid exhausting your API call quota.
  • Chunked processing: batch scraping is planned (TODO)
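The journal pre-filtering suggested above amounts to applying a venue whitelist before any provider calls are made. A hypothetical sketch (not part of the package):

```python
# Hypothetical pre-filter: keep only candidates from relevant venues so
# provider API quota is spent only on papers worth retrieving.
RELEVANT_JOURNALS = {"Nature Machine Intelligence", "JMLR"}

candidates = [
    {"title": "Paper A", "journal": "JMLR"},
    {"title": "Paper B", "journal": "Unrelated Weekly"},
    {"title": "Paper C", "journal": "Nature Machine Intelligence"},
]

to_fetch = [c for c in candidates if c["journal"] in RELEVANT_JOURNALS]
print(len(to_fetch))  # -> 2
```

Only the two whitelisted papers would then be passed on to the providers for full metadata retrieval.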

Built distribution: papex-0.0.4-py3-none-any.whl (9.1 kB, Python 3).