A research automation tool for fetching, summarizing, and enhancing arXiv papers.
Project description
arxa
arxa is a Python package that helps you generate comprehensive research reviews from arXiv papers or local PDFs. It features a command‐line interface for searching arXiv and generating summaries, along with a FastAPI server for remote generation. arxa integrates with various LLM providers (OpenAI, Anthropic, Ollama) and includes tools for PDF processing, configuration management, and repository handling.
Installation
Easy Installation (from PyPI)
To install the package easily via pip:
pip install arxa
Installing from Source
If you prefer to install from the source repository:
-
Clone the repository:
git clone https://github.com/binaryninja/arxa.git -
Change into the repository directory:
cd arxa -
Install it in editable mode:
pip install -e .
This way you can make changes to the source code and test them immediately.
Repository Structure and File Overview
arxa/
├── arxa/
│ ├── __init__.py # Package version information.
│ ├── arxiv_utils.py # Functions to search and retrieve arXiv papers.
│ ├── cli.py # Command-line interface for generating reviews or starting the server.
│ ├── config.py # Utilities for loading configuration from a YAML file.
│ ├── llm_backends.py # Backend functions to interface with OpenAI, Anthropic, and Ollama APIs.
│ ├── pdf_utils.py # PDF processing utilities: sanitizing filenames, downloading PDFs, extracting text.
│ ├── prompts.py # Prompt templates used to instruct the language model.
│ ├── repo_utils.py # Functions to extract and clone GitHub repository URLs from generated reviews.
│ ├── research_review.py # Core functionality to generate a research review summary using an LLM.
│ └── server.py # Use this to configure your own private arxa server.
└── __pycache__/ # Contains Python bytecode cache files.
Detailed File & Function Descriptions
arxa/arxiv_utils.py
- Provides functions to interact with the arXiv API:
search_arxiv_by_author(author: str, max_results: int = 10): Searches for papers by author name.search_arxiv_by_keyword(keyword: str, max_results: int = 10): Searches for papers by keyword in the title.search_arxiv_by_id_list(id_list: list): Searches for papers given a list of arXiv IDs (handles batching if necessary).
arxa/cli.py
- The entry point for the command-line interface.
- Parses arguments and provides two main modes:
- Server Mode: Starts the FastAPI server by using the
--serverflag. remember to set your own API key(s) in theconfig.yamlfile. - Review Generation Mode: Accepts an arXiv ID (
-aid) or local PDF file (-pdf) to generate a research review.
- Server Mode: Starts the FastAPI server by using the
- Other options include:
-oor--output: Specify output file for the generated review.-por--provider: Choose the LLM provider (default: "arxa.richards.ai:8000" for community server, or use "anthropic", "openai"").-mor--model: Specify model identifier/version, such as "o3-mini".-gor--github: Enable GitHub cloning if a GitHub URL is detected in the generated review.-cor--config: Path to a YAML configuration file.--quiet: Disable rich output formatting.
Usage examples:
-
Generate a review from an arXiv ID:
arxa -aid 2301.00123v1 -o review.md -
Generate a review from a local PDF:
arxa -pdf /path/to/paper.pdf -
Generate a review using a specific provider and model:
arxa -aid 2301.00123v1 -o review.md -p openai -m o3-mini -
Start a private arxa server:
arxa --server
arxa/config.py
- Contains a helper function:
load_config(config_path: str = "config.yaml"): Loads configuration data from a YAML file. This file can specify defaults such as directories for papers and output.
arxa/llm_backends.py
- Provides functions to interface with various LLM backends:
openai_generate(...): Uses OpenAI’s API (with rate limit handling via the tenacity library) to generate output.anthropic_generate(...): Uses Anthropic’s API to generate completions with built-in retry logic.ollama_generate(...): Interfaces with a local Ollama inference server.truncate_text_to_token_limit(...): Truncates input text so that its token count does not exceed the specified maximum.
- Custom exceptions (
OpenAIAPIError,AnthropicAPIError,OllamaAPIError) handle backend-specific issues.
arxa/pdf_utils.py
- PDF-related utility functions:
sanitize_filename(filename: str): Removes invalid characters from filenames.download_pdf_from_arxiv(paper: arxiv.Result, output_path: str): Downloads a paper’s PDF to the specified path.extract_text_from_pdf(pdf_path: str): Extracts text from a PDF using PyPDF2.
arxa/prompts.py
- Contains prompt templates that guide the language model in generating the research review.
PROMPT_PREFIX: Instructions and initial context.PROMPT_SUFFIX: A markdown template into which paper information is inserted.
The templates instruct the model to generate content sections such as Paper Information, Summary, Methodology, Strengths, Weaknesses, and additional notes.
arxa/repo_utils.py
- Offers functions for handling GitHub repositories mentioned in a review:
extract_github_url(content: str): Uses regex to find a GitHub URL in the generated review.clone_repo(github_url: str, output_dir: str): Uses the GitHub CLI (gh) to fork/clone the repository to a local directory.
arxa/research_review.py
- Implements the generation of a research review summary using text extracted from a PDF and paper metadata.
- Key functions:
truncate_text_to_token_limit(...): Ensures the prompt does not exceed a maximum token limit (using tiktoken).generate_research_review(...): Constructs the full prompt (inserting the truncated PDF text and paper_info into the prompt template) and calls the appropriate LLM backend (OpenAI, Anthropic, or Ollama) to generate the review.
- It then extracts the relevant response section demarcated by
<research_notes>tags.
arxa/server.py
- Implements a FastAPI server to offer a RESTful interface. Key features:
- An endpoint
/generate-reviewthat accepts a JSON payload (with PDF text, paper information, provider, model) and returns the generated review. - A
/healthendpoint for a simple health check. - Custom middleware for logging and rate limiting (tracks request frequency per IP, blacklists abusive clients).
- Overrides provider/model parameters so that all requests are processed using (for example) the OpenAI API with a specific model ("o3-mini").
- An endpoint
Command-Line Options
When running the package via the command line (using the arxa CLI), you have various options:
-
--server Start the FastAPI server instead of processing a PDF or arXiv paper.
-
-aid Specify an arXiv ID (e.g.,
1234.5678) to fetch and generate a review. -
-pdf Provide the path to a local PDF to generate a research review from.
-
-o, --output Set the output file for saving the generated markdown review.
-
-p, --provider Choose the LLM provider. Options are: • arxa.richards.ai:8000 (default remote server) • anthropic • openai • ollama
-
-m, --model Define the model identifier/version (default "o3-mini" when using remote server mode).
-
-g, --github Enable GitHub cloning if a GitHub URL is present in the generated review.
-
-c, --config Path to an alternative YAML configuration file.
-
--quiet Disable enhanced output formatting (useful when rich output is not desired).
Example CLI command:
arxa -aid 2301.00123v1 -o my_review.md -p openai -m o3-mini --github
This command will search for the paper using the provided arXiv ID, download and process the PDF if necessary, generate a research review using OpenAI’s API, write the review to "my_review.md", and clone any detected GitHub repository.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file arxa-0.1.12.tar.gz.
File metadata
- Download URL: arxa-0.1.12.tar.gz
- Upload date:
- Size: 18.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
69350001a01c4dbf62fa7e864f5f76c3aa95a239437f49f9aba37cde28359c1c
|
|
| MD5 |
6e370c455c4e17d44ad25aa0500b3554
|
|
| BLAKE2b-256 |
dc71ab791d124385066b7b09049cf0df4583eaa7eb1d5221ca024858a8c107f8
|
File details
Details for the file arxa-0.1.12-py3-none-any.whl.
File metadata
- Download URL: arxa-0.1.12-py3-none-any.whl
- Upload date:
- Size: 17.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
146bfb2b8bdac2d2c3eea64efaaffc5d0e9ed5c4a882469e66784d36d9ec3130
|
|
| MD5 |
e67b62b1b48d618d96c36c7d0b5d30f1
|
|
| BLAKE2b-256 |
f8634fe613880a6a7bf9dc191807e80a9e7b37eec4f33d8204d2d668711624f1
|