A library for fetching and normalizing academic papers from various providers (Elsevier, arXiv, PRISM, etc.)
1. Project Overview
PapEx is a powerful Python library designed to streamline the process of retrieving and standardizing academic paper metadata from diverse sources. Instead of writing custom logic for each provider (Elsevier, arXiv, IEEE, etc.), PapEx offers a unified, normalized interface for fetching data.
This means you can focus on data analysis, not data cleaning.
Key Features
- Unified API: Use a single, consistent set of commands to query multiple academic providers.
- Normalized Output: All fetched paper metadata (titles, authors, abstracts, DOIs, publication dates) are mapped to a standard data structure, eliminating inconsistencies between sources.
- Multi-Provider Support: Currently supports and normalizes data from:
- Elsevier
- arXiv
- IEEE
- PRISM
- Flexible Querying: Search papers by DOI, title, author, or specific metadata fields.
🛠️ Installation
Use pip to install the library:
pip install papex
High-level architecture:
- Modular extractors for different data providers (Google Scholar, Scopus, Elsevier)
- Shared abstraction layer for paper data
- Paper extraction by query, retrieving metadata (author, title, etc.). TODO: section-level extraction (by summary, whole document, etc.)
2. Getting Started
Prerequisites
- Python 3.10+
- pip package manager
- API keys for SerpAPI, Scopus (configured for pybliometrics), Elsevier, and IEEE
Installation
From PyPI
pip install papex
pip install -r review/requirements.txt
pip install -r review/requirements_dev.txt
(You may need to manually install proprietary/specialized libraries referenced in the code, such as serpapi, pybliometrics, or a specific Elsevier client. See code comments for guidance.)
Running Tests
- tests/
Important config files:
- review/requirements.txt: Python dependencies
- .env / local API key config: not directly present but referenced in code
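Since the API keys are expected in a local .env file or the process environment, one way to load and validate them is sketched below. The variable names (SERPAPI_KEY, ELSEVIER_API_KEY, IEEE_API_KEY) and the load_api_keys helper are assumptions for illustration; check the code for the names PapEx actually reads. Scopus is left out because pybliometrics manages its own configuration.

```python
import os

# Hypothetical variable names; PapEx may expect different ones.
REQUIRED_KEYS = ("SERPAPI_KEY", "ELSEVIER_API_KEY", "IEEE_API_KEY")


def load_api_keys(env=None):
    """Return the required API keys, failing early if any are missing."""
    # Default to the process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing API keys: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_KEYS}
```

Failing at startup with the full list of missing names is usually easier to debug than a provider error in the middle of a long retrieval run.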
3. How PapEx works
PapEx uses two main components to achieve normalization:
| Abstraction | Description |
|---|---|
| Provider | An object dedicated to communicating with a single source (e.g., arXivProvider). It handles API-specific request formatting and initial data retrieval. |
| Adapter/Normalizer | An object that takes the raw data from a Provider and transforms it into the standard Paper object, ensuring consistent field names and formats. |
4. Key Concepts
- Paper extraction: Modular, provider-driven approach to acquiring structured paper metadata
- Abstraction layer: Interfaces and base classes minimize duplication/spaghetti code
- LLM-based filtering: Large language model inference can handle ambiguous or subjective filtering. It is not included in the package. For literature reviews, I suggest narrowing the search to the relevant journals before retrieving papers, to avoid exhausting the API call quota.
- Chunked processing: TODO Batch scraping
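Filtering by journal before retrieval, combined with simple batching, might look like the sketch below. Neither step is part of the package: filter_by_journal, batched, and the "journal" key are hypothetical, and the candidate records are placeholders.

```python
from collections.abc import Iterable, Iterator


def filter_by_journal(candidates: Iterable[dict], allowed: set[str]) -> list[dict]:
    """Keep only candidates whose journal is on the allow-list (case-insensitive)."""
    allowed_normalized = {name.casefold() for name in allowed}
    return [
        c for c in candidates
        if (c.get("journal") or "").casefold() in allowed_normalized
    ]


def batched(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder candidates; in practice these would come from a cheap listing query.
candidates = [
    {"doi": "10.0000/example.1", "journal": "Journal A"},
    {"doi": "10.0000/example.2", "journal": "Journal B"},
    {"doi": "10.0000/example.3", "journal": "journal a"},
    {"doi": "10.0000/example.4", "journal": "Journal A"},
]

relevant = filter_by_journal(candidates, {"Journal A"})
batches = list(batched(relevant, 2))
```

Each batch can then be sent to a provider in turn, so an exhausted quota interrupts the run at a batch boundary rather than partway through an unstructured loop.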
5. References
- pandas Documentation
- pybliometrics
- SerpAPI (Google Scholar)
- Elsevier API Docs: See client library documentation