papex

A library for fetching and normalizing academic papers from various providers (Elsevier, arXiv, PRISM, etc.)

Project description

1. Project Overview

PapEx is a Python library for retrieving and standardizing academic paper metadata from multiple sources. Instead of writing custom logic for each provider (Elsevier, arXiv, IEEE, etc.), you work with a single, normalized interface for fetching data.

This means you can focus on data analysis, not data cleaning.

Key Features

Unified API: Use a single, consistent set of commands to query multiple academic providers.
Normalized Output: All fetched paper metadata (titles, authors, abstracts, DOIs, publication dates) is mapped to a standard data structure, eliminating inconsistencies between sources.
Multi-Provider Support: Currently supports and normalizes data from:
- Elsevier
- arXiv
- IEEE
- PRISM
Flexible Querying: Search papers by DOI, title, author, or specific metadata fields.

🛠️ Installation

Use pip to install the library:

pip install papex

High-level architecture:

Modular extractors for different data providers (Google Scholar, Scopus, Elsevier)
Shared abstraction layer for paper data
Paper extraction by query, retrieving metadata (author, title, etc.). TODO: section-level extraction (by summary, whole document, etc.)

2. Getting Started

Prerequisites

Python 3.10+
pip package manager
API keys for SerpAPI, Scopus (configured for pybliometrics), Elsevier, and IEEE

Installation

From PyPI

pip install papex

When working from a source checkout, install the dependencies from the repository root:

pip install -r review/requirements.txt
pip install -r review/requirements_dev.txt

(You may need to manually install proprietary or specialized libraries referenced in the code, such as serpapi, pybliometrics, or a specific Elsevier client. See the code comments for guidance.)

Running Tests

Tests are located in the `tests/` directory.

Important config files:

review/requirements.txt: Python dependencies
.env or local API key config: not included in the repository, but referenced in the code
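
Because the key configuration is referenced in the code but not shipped, it can help to check which keys are missing before making any request. The sketch below uses only the standard library. The variable names in `REQUIRED` are hypothetical placeholders; check the PapEx source for the names it actually reads. The Scopus key is left out because, as noted above, it is configured through pybliometrics.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(required, env_file=".env", environ=None):
    """Return names that are empty or absent in both the environment and env_file."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(env_file)
    return [
        name for name in required
        if not (environ.get(name) or file_values.get(name))
    ]


# Hypothetical variable names, used here for illustration only.
REQUIRED = ["SERPAPI_API_KEY", "ELSEVIER_API_KEY", "IEEE_API_KEY"]
```

Calling `missing_keys(REQUIRED)` at startup returns the names that still need a value, which gives a clearer failure than an authentication error halfway through a retrieval run.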

3. How PapEx Works

PapEx uses two main components to achieve normalization:

Abstraction	Description
`Provider`	An object dedicated to communicating with a single source (e.g., `arXivProvider`). It handles API-specific request formatting and initial data retrieval.
`Adapter/Normalizer`	An object that takes the raw data from a `Provider` and transforms it into the standard `Paper` object, ensuring consistent field names and formats.

4. Key Concepts

Paper extraction: Modular, provider-driven approach to acquiring structured paper metadata
Abstraction layer: Interfaces and base classes minimize duplication/spaghetti code
LLM-based filtering: Large language model inference can be used to handle ambiguous or subjective filtering. This step is not included in the package. For a literature review, I suggest narrowing the search to the relevant journals before retrieving papers, to avoid exhausting the API call quota.
Chunked processing: batch scraping (TODO)
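
The last two concepts can be sketched outside the package. The example below narrows a journal list before any retrieval and then splits the remaining work into batches. The keyword match is a simple stand-in for whatever relevance judgment you apply (manual or LLM-based), and `filter_journals` and `chunked` are hypothetical helpers, not PapEx functions.

```python
from itertools import islice


def filter_journals(journals, keywords):
    """Keep journals whose name contains any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [j for j in journals if any(k in j.lower() for k in wanted)]


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


journals = [
    "Journal of Machine Learning Research",
    "Nature Geoscience",
    "Neural Networks",
    "Machine Learning",
]
relevant = filter_journals(journals, ["machine learning", "neural"])

for batch in chunked(relevant, 2):
    # Each batch would be handed to the retrieval step; pausing between
    # batches is one way to stay under an API quota.
    print(batch)
```

Filtering first reduces the number of queries issued, and batching makes it easy to checkpoint or pause between groups of requests.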

5. References

pandas Documentation
pybliometrics
SerpAPI (Google Scholar)
Elsevier API Docs: See client library documentation

Release history

This version

0.0.4

Nov 4, 2025

0.0.3

Nov 4, 2025

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

papex-0.0.4-py3-none-any.whl (9.1 kB)

Uploaded Nov 4, 2025 Python 3

File details

Details for the file papex-0.0.4-py3-none-any.whl.

File metadata

Download URL: papex-0.0.4-py3-none-any.whl
Upload date: Nov 4, 2025
Size: 9.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for papex-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7b279d487389f9ae1026dc1fb660dd6a96e201eeed24013d42447db87768c091`
MD5	`d5f887128044eeb91b7cb8889a222729`
BLAKE2b-256	`bff126bc76fa8f351930c1164fca19ac9a428ef80dbb31c8e36a23e663639ac1`
