"Automate your ArXiv paper search, retrieval, and summarization process."
ArXiv Retriever
Status: Maintenance Mode
Note: This project is currently in maintenance mode. While I am not actively developing new features, I will continue to address critical issues and security vulnerabilities as time permits. Users are welcome to fork the repository if they wish to extend its functionality. Please refer to Maintenance Policy for more information.
Table of Contents
- Introduction
- Features
- Environment Setup
- Installation
- Usage
- LLM Providers
- Contributing
- Maintenance Policy
- License
- Acknowledgements
Introduction
arxiv_retriever is a lightweight command-line tool designed to automate the retrieval, downloading, and
summarization of research papers from ArXiv. The retrieval can be done using specified ArXiv
categories, full or partial titles of papers, or links to the papers. Paper retrieval can be refined by author.
Papers can be summarized using multiple LLM providers — Ollama (local, default), Claude (Anthropic), or Gemini (Google) — directly from the terminal.
NOTE: In my tests, which are not exhaustive, searching for a very long title worked best with a partial title refined by author. That combination returned better results than the full title, with or without author refinement.
This tool is built using Python and leverages the Typer library for the command-line interface, Rich for enhanced terminal output, and the Python ElementTree XML package for parsing XML responses from the arXiv API. It can be useful for researchers, engineers, or students who want to quickly retrieve an ArXiv paper or keep abreast of latest research in their field without leaving their terminal/workstation.
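As a rough illustration of the XML parsing step, the sketch below reads an Atom feed of the shape the arXiv API returns and pulls out a few fields with ElementTree. The sample feed, the `parse_feed` name, and the dictionary keys are assumptions made for this example; they are not taken from the arxiv_retriever source.

```python
import xml.etree.ElementTree as ET

# The arXiv API answers with an Atom feed, so lookups need the Atom namespace.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Placeholder feed with made-up values, shaped like an arXiv API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2024-01-01T00:00:00Z</published>
    <title>A Placeholder
      Title</title>
    <summary>Placeholder abstract.</summary>
    <author><name>Jane Smith</name></author>
    <author><name>John Doe</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
  </entry>
</feed>"""


def parse_feed(xml_text):
    """Return one dict per <entry> in the feed (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        # The PDF link is the <link> element whose title attribute is "pdf".
        pdf_link = next(
            (link.get("href")
             for link in entry.findall("atom:link", ATOM)
             if link.get("title") == "pdf"),
            None,
        )
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM)
        papers.append({
            # Titles may wrap across lines, so collapse the whitespace.
            "title": " ".join(raw_title.split()),
            "authors": [
                name.text
                for name in entry.findall("atom:author/atom:name", ATOM)
            ],
            "published": entry.findtext("atom:published", namespaces=ATOM),
            "pdf_link": pdf_link,
        })
    return papers


papers = parse_feed(SAMPLE_FEED)
```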
Although my current focus while building arxiv_retriever is the computer science archive, it works just as well
with categories from other arXiv archives, e.g., math.CO.
Features
- Fetch the most recent papers from specified ArXiv categories
- Search for papers on ArXiv using full or partial title
- Refine fetch and search by author(s) for more precise results
- Specify logic for combination of multiple authors ('AND' or 'OR') during retrieval
- Download papers after they are retrieved
- Summarize PDF papers using LLM providers (Ollama, Claude, Gemini)
- Batch summarization of multiple papers at once
- Save summaries to JSON files
- View paper details including title, authors, abstract, publication date, and links
- Rich terminal display with styled panels, Markdown rendering, and color-coded output
- Multi-provider LLM support with shorthand syntax (e.g., --model claude)
- Configurable number of results to fetch
- Easy-to-use command-line interface built with Typer
Environment Setup
Environment variables are used to configure LLM providers for the paper summarization feature. Ollama is the default provider and requires no API keys (it runs locally).
Environment Variables
| Variable | Provider | Required | Default |
|---|---|---|---|
| ANTHROPIC_API_KEY | Claude | Yes (for Claude) | (none) |
| GEMINI_API_KEY | Gemini | Yes (for Gemini) | (none) |
| OLLAMA_BASE_URL | Ollama | No | http://localhost:11434 |
| ARXIV_RETRIEVER_DEFAULT_MODEL | All | No | ollama:llama3 |
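To make the table concrete, here is a minimal sketch of how these variables could be resolved, with defaults applied to the optional ones and a check for the key a cloud provider needs. The variable names and default values come from the table; the helper names (`load_settings`, `missing_api_key`) are hypothetical and the real tool may resolve its configuration differently.

```python
import os

# Optional variables and their documented defaults.
DEFAULTS = {
    "OLLAMA_BASE_URL": "http://localhost:11434",
    "ARXIV_RETRIEVER_DEFAULT_MODEL": "ollama:llama3",
}

# Providers that need an API key; Ollama runs locally and needs none.
REQUIRED_KEY = {
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def load_settings(env=None):
    """Return the optional settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    # An empty value is treated the same as an unset variable.
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


def missing_api_key(provider, env=None):
    """Return the name of the unset key the provider needs, or None."""
    env = os.environ if env is None else env
    key = REQUIRED_KEY.get(provider)
    if key is not None and not env.get(key):
        return key
    return None
```

Passing a plain dictionary as `env` keeps the helpers easy to try without touching the real environment, e.g. `missing_api_key("claude", {})`.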
Setting Environment Variables
On Unix-like systems (Linux, macOS)
In your terminal, run:
export ANTHROPIC_API_KEY=<your-anthropic-key>
export GEMINI_API_KEY=<your-gemini-key>
To ensure this works across all shell instances, add the above lines to your shell configuration file
(e.g., ~/.bashrc, ~/.zshrc, or ~/.profile).
On Windows
- Open the Start menu and search for "Environment Variables"
- Click on the "Edit system environment variables" option.
- In the System Properties window, click on the "Environment Variables" button
- Under "User variables", click "New"
- Set the variable name and value for each key.
NOTE: Keep your API keys confidential and do not share them publicly.
Installation
Install from PyPI (Recommended):
pip install --upgrade arxiv-retriever
Install from Source Distribution
If you need a specific version or want to install from a source distribution:
- Download the source distribution (.tar.gz file) from PyPI or the GitHub releases page.
- Install using pip:
pip install arxiv_retriever-x.y.z.tar.gz
Replace x.y.z with the version number.
This method can be useful if you need a specific version or are in an environment without direct access to PyPI.
Install for Development and Testing
To install the latest development version from source:
- Ensure you have uv installed.
- Clone the repository:
git clone https://github.com/MimicTester1307/arxiv_retriever.git
cd arxiv_retriever
- Install the project and its dependencies:
uv sync
- Run tests to ensure everything is set up correctly:
uv run pytest
- Run the CLI:
uv run axiv --help
Usage
After installation, use the package via the axiv command. To view available commands: axiv --help or axiv
Note on Package and Command Names
- Package Name: The package is named arxiv_retriever. This is the name you use when installing via pip or referring to the project.
- Command Name: After installation, you interact with the tool using the axiv command in your terminal.
This distinction allows for a more concise command while maintaining a descriptive package name.
Basic Commands
- fetch: Fetch papers from ArXiv based on categories, refined by options.
- search: Search for papers on ArXiv using title, refined by options.
- download: Download papers from ArXiv using their links (PDF or abstract links).
- summarize: Summarize one or more PDF papers using an LLM provider.
- version: Display version information for arxiv_retriever and core dependencies.
Sample Usage
Fetch
To retrieve the most recent computer science papers by categories, use the fetch command followed by the categories and
options:
axiv fetch [OPTIONS] CATEGORIES...
Search
To search for a paper by title, use the search command followed by the title and options:
axiv search [OPTIONS] TITLE
CLI Options
Typer, like most CLI frameworks, accepts one value per occurrence of an option. To refine a search or fetch command
by several authors, repeat the option once per author: --author <author> --author <author>, not
--author <author> <author>. The short form -a works the same way.
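The repeated-option behaviour can be tried in isolation. The sketch below uses the standard library's argparse as a stand-in for Typer (so it runs without extra dependencies); the `parse_args` wrapper and the `axiv-sketch` program name are made up for this example. Each `-a`/`--author` occurrence appends one value to a list, which is why a second bare value after `--author` would instead be read as another positional category.

```python
import argparse


def parse_args(argv):
    """Parse a fetch-like command line (argparse stand-in, not the real CLI)."""
    parser = argparse.ArgumentParser(prog="axiv-sketch")
    # One or more positional categories, e.g. cs.AI math.CO.
    parser.add_argument("categories", nargs="+")
    # action="append" collects one value per occurrence of the option.
    parser.add_argument("-a", "--author", action="append", default=[])
    return parser.parse_args(argv)


args = parse_args(["cs.AI", "-a", "omar", "--author", "matei"])
print(args.categories, args.author)
```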
Downloading your research papers
There are multiple ways to download your research paper using axiv:
1. Use axiv download [OPTIONS] LINKS... to download the paper directly from the link.
2. Confirm the download when the CLI prompts you after retrieving papers with fetch or search.
With option 1, the file is named using the URL's basename, e.g. 2407.09298v1.pdf.
With option 2, the file is named using the paper title parsed from the XML response.
NOTE: If the file name exists, it is overwritten.
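The link-based naming rule can be sketched as follows. Both helpers are hypothetical: `pdf_url` assumes an abstract link is turned into a PDF link by swapping the `/abs/` path segment for `/pdf/`, and `filename_from_link` takes the URL's basename and appends `.pdf` when it is missing. The actual implementation may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def pdf_url(link):
    """Map an abstract link to its PDF link (assumed rule: /abs/ -> /pdf/)."""
    return link.replace("/abs/", "/pdf/", 1)


def filename_from_link(link):
    """Name the download after the last path segment of the link."""
    name = PurePosixPath(urlparse(link).path).name
    return name if name.endswith(".pdf") else name + ".pdf"


print(pdf_url("https://arxiv.org/abs/2407.20214v1"))
print(filename_from_link("https://arxiv.org/abs/2407.09298v1"))
```

Because the name is derived only from the link, downloading the same paper twice produces the same file name, which is where the overwrite behaviour noted above comes from.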
Examples
Fetch the latest 5 papers in the cs.AI OR cs.GL categories:
axiv fetch cs.AI cs.GL --limit 5
This outputs up to --limit papers sorted by submittedDate in descending order. Any --author options filter the results.
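For readers curious what such a request might look like, here is a sketch that builds an arXiv API URL for the command above. The endpoint and the `search_query`, `sortBy`, `sortOrder`, and `max_results` parameters come from the public arXiv API; the `fetch_url` helper and the choice to join categories with OR are assumptions for illustration, not the tool's confirmed internals.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ARXIV_API = "http://export.arxiv.org/api/query"


def fetch_url(categories, limit):
    """Build a query URL for the newest papers in any of the categories."""
    params = {
        # cat: is the arXiv API field prefix for subject categories.
        "search_query": " OR ".join(f"cat:{c}" for c in categories),
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": limit,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = fetch_url(["cs.AI", "cs.GL"], 5)
query = parse_qs(urlparse(url).query)
print(query["search_query"][0])
```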
Refine fetch using multiple authors:
axiv fetch cs.AI -a omar -a matei
Add logic for creating query when multiple authors are supplied using --author-logic or -l:
axiv fetch cs.AI math.CO -a "John Doe" -a "Jane Smith" --author-logic AND
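One way to picture the author logic is as a boolean clause appended to the base query. The sketch below is an assumption about how such a query could be composed: `au:` is the arXiv API's author field prefix, while the `refine_by_authors` helper and the exact grouping are hypothetical.

```python
def refine_by_authors(query, authors, logic):
    """Combine a base query with author clauses joined by AND or OR."""
    logic = logic.upper()
    if logic not in {"AND", "OR"}:
        raise ValueError("author logic must be 'AND' or 'OR'")
    if not authors:
        return query
    # Quote each name so multi-word authors stay one search term.
    clause = f" {logic} ".join(f'au:"{author}"' for author in authors)
    # The author clause always narrows the base query, hence the outer AND.
    return f"({query}) AND ({clause})"


print(refine_by_authors(
    "cat:cs.AI OR cat:math.CO", ["John Doe", "Jane Smith"], "AND"
))
```

With AND, a paper must list every named author; with OR, any one of them is enough.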
Fetch papers matching the title, "Attention is all you need", refined by author "Ashish":
axiv search "Attention is all you need" --limit 5 --author "Ashish"
Download papers using links:
- download using link to abstract:
axiv download https://arxiv.org/abs/2407.20214v1
- download using link to pdf:
axiv download https://arxiv.org/pdf/2407.20214v1
Summarize
Summarize downloaded PDF papers using an LLM:
# Summarize a single paper (uses Ollama by default — local, no rate limits)
axiv summarize paper.pdf
# Summarize multiple papers
axiv summarize paper1.pdf paper2.pdf
# Summarize all PDFs in a directory
axiv summarize ./arxiv_downloads/
# Use a specific provider
axiv summarize paper.pdf --model claude
axiv summarize paper.pdf --model gemini
# Use a specific model
axiv summarize paper.pdf --model claude:claude-sonnet-4-6
# Save summaries to JSON
axiv summarize ./arxiv_downloads/ --save
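Since summarize accepts both files and directories, the inputs have to be expanded into a flat list of PDFs at some point. A minimal sketch of that expansion is shown below; `collect_pdfs` is a hypothetical helper, and details such as non-recursive directory scanning and silently skipping non-PDF files are assumptions.

```python
from pathlib import Path


def collect_pdfs(paths):
    """Expand files and directories into a flat list of PDF paths."""
    found = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Take the PDFs directly inside the directory, in sorted order.
            found.extend(
                sorted(p for p in path.iterdir() if p.suffix.lower() == ".pdf")
            )
        elif path.suffix.lower() == ".pdf":
            found.append(path)
    return found
```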
LLM Providers
arxiv_retriever supports multiple LLM providers for paper summarization. Ollama is the default — it runs
locally and has no API rate limits or costs.
Provider Format
Use --model provider:model_name or just --model provider (uses the default model for that provider):
| Provider | Default Model | Shorthand | Requires |
|---|---|---|---|
| Ollama | llama3 | --model ollama | Ollama running locally |
| Claude | claude-sonnet-4-6 | --model claude | ANTHROPIC_API_KEY env var |
| Gemini | gemini-3-flash-preview | --model gemini | GEMINI_API_KEY env var |
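The provider:model_name format can be illustrated with a small parser. The default models are copied from the table above; the `parse_model_spec` helper and its error handling are hypothetical. Splitting on the first colon only is a deliberate choice here, so that a model name that itself contains a colon stays intact.

```python
# Default model per provider, as listed in the table.
DEFAULT_MODELS = {
    "ollama": "llama3",
    "claude": "claude-sonnet-4-6",
    "gemini": "gemini-3-flash-preview",
}


def parse_model_spec(spec):
    """Split 'provider' or 'provider:model_name' into (provider, model)."""
    # partition splits on the first colon only.
    provider, _, model = spec.partition(":")
    provider = provider.strip().lower()
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unknown provider: {provider!r}")
    # Shorthand such as 'claude' falls back to the provider's default model.
    return provider, model.strip() or DEFAULT_MODELS[provider]


print(parse_model_spec("claude"))
print(parse_model_spec("gemini:gemini-2.0-flash"))
```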
Examples
# Use default (Ollama)
axiv summarize paper.pdf
# Use Claude with default model
axiv summarize paper.pdf --model claude
# Use Gemini with a specific model
axiv summarize paper.pdf --model gemini:gemini-2.0-flash
# Set a custom default model via environment variable
export ARXIV_RETRIEVER_DEFAULT_MODEL=claude:claude-sonnet-4-6
axiv summarize paper.pdf
Contributing
Contributions are welcome! Please fork the repository and submit a pull request for any features, bug fixes, or enhancements.
Note on Testing
Currently, all 35 tests pass. Refactoring the tests for asynchrony was an interesting challenge. Discussions and contributions regarding the asynchronous implementation are particularly welcome.
uv run pytest
Contact me via email or leave a comment on the Notion project tracker.
Maintenance Policy
This project is currently in maintenance mode. Here is what you can expect:
- Security vulnerabilities and bugs will be addressed as time permits.
- Pull requests for bug fixes will be considered.
- Feature requests are unlikely to be implemented by the maintainer, but forks and extensions are encouraged.
For any questions, concerns, or comments, please open an issue in the GitHub repository.
License
This project is licensed under the MIT license. See the LICENSE file for more details.
Acknowledgements
- Typer for the command-line interface
- Rich for enhanced terminal output (panels, tables, Markdown rendering)
- ElementTree for XML parsing
- arXiv API for providing access to paper metadata via a well-designed API
- Trio and HTTPx for the asynchronous features
- pypdf for PDF text extraction
- Ollama for local LLM inference
- Anthropic and Google GenAI SDKs for cloud LLM providers
- Dead Simple Python for helping me advance my knowledge of Python