Automate arXiv paper tracking with LLM-powered metadata extraction and Google Sheets sync.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

arXivFlow 🚀

arXivFlow is a Python automation tool for discovering and tracking research papers. It fetches metadata from arXiv, runs AI-driven analysis through Ollama or the Gemini API, and synchronizes the results with Google Sheets and local databases.

✨ Features

- Asynchronous API: Fully rewritten with asyncio for high-performance paper retrieval and PDF processing.
- Automated Retrieval: Fetch the latest papers from specific arXiv categories (e.g., cs.AI, cs.LG, hep-ph) within any date range.
- AI Analysis Options: Uses Ollama models for local/private extraction or Gemini models for cloud-backed extraction of keywords and contact information (emails/affiliations).
- Intelligent PDF Handling: Automatically downloads PDFs and extracts text for deep analysis. Supports custom storage paths and atomic PDF writes.
- Robust arXiv Requests: Built-in compliance with arXiv's API guidelines (3-second request intervals), paged metadata retrieval, 429 cooldown handling, retry backoff, and duplicate-result cleanup.
- Multi-Format Export: Save your research data to CSV, JSON, Excel, or SQLite for flexible offline analysis.
- Google Sheets Sync: Seamlessly push compiled research data to a shared Google Sheet for team collaboration.
- Type-Safe & Modular: Clean, documented Python code with full type hinting and a class-based architecture.
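
To make the "atomic PDF writes" feature concrete, here is a minimal sketch of the write-to-`.part`-then-replace pattern described under Request Stability. It is not arXivFlow's actual code: the function name `write_pdf_atomically` and the `%PDF-` header check are assumptions chosen for illustration.

```python
import os
from pathlib import Path


def write_pdf_atomically(content: bytes, destination: Path) -> bool:
    """Hypothetical helper: write to a .part file, swap it in only if it looks like a PDF."""
    part = destination.with_name(destination.name + ".part")
    part.write_bytes(content)

    # Assumption: "PDF-like" means the file starts with the standard %PDF- header.
    with part.open("rb") as handle:
        looks_like_pdf = handle.read(5) == b"%PDF-"

    if not looks_like_pdf:
        part.unlink()  # e.g. an HTML error page; any existing final file is left untouched
        return False

    os.replace(part, destination)  # atomic when both paths are on the same filesystem
    return True
```

Because the final path is only touched by `os.replace`, a reader never sees a half-written PDF; an interrupted download leaves at most a stray `.part` file.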

🛠️ Prerequisites

Python 3.13+: Ensure you have a modern Python environment.
Choose an AI backend:
- For Ollama, install Ollama and download the required model (e.g., Llama 3.2):
```
ollama pull llama3.2
```
- For Gemini, create a Gemini API key and either pass it as gemini_api_key or set it in the GOOGLE_AI_API environment variable.
Google Cloud Credentials for Google Sheets sync:
- Enable the Google Sheets and Google Drive APIs.
- Create a Service Account and download the JSON key as credentials.json.
- Ensure the service account has 'Editor' permissions on the sheet.

🚀 Installation

From PyPI (Recommended)

```
pip install arxivflow
```

From Source (For Development)

Clone the repository:

```
git clone https://github.com/zjzhao1002/arXivFlow.git
cd arXivFlow
```

Set up virtual environment:

```
python -m venv .
source bin/activate  # On Windows: Scripts\activate
```

Install dependencies:
```
pip install -e .
```

📖 Usage

Quick Start (Async)

```
import asyncio
import datetime
from arxivflow import arXivFlow

async def main():
    # 1. Initialize the flow with Ollama
    flow = arXivFlow(
        categories=["cs.AI", "cs.CV"],
        ollama_model="llama3.2",
        max_results=20,
        start_date=datetime.datetime.now() - datetime.timedelta(days=7),
        request_timeout=60.0
    )

    # 2. Fetch data & Extract info (Keywords/Contacts)
    df = await flow.get_arxiv_data(download_pdfs=True)

    # 3. Save to your preferred formats
    flow.save_to_csv("my_research.csv")
    flow.save_to_sqlite("research.db")

    # 4. Sync with Google Sheets
    flow.save_to_google_sheet(
        sheet_id="YOUR_SHEET_ID",
        credentials_file="credentials.json"
    )

    # 5. Close the client
    await flow.close()

if __name__ == "__main__":
    asyncio.run(main())
```

Gemini Backend

```
import asyncio
import datetime
import os
from arxivflow import arXivFlow

async def main():
    flow = arXivFlow(
        categories=["cs.AI", "cs.CV"],
        gemini_model="gemini-2.5-flash",
        gemini_api_key=os.getenv("GOOGLE_AI_API"),
        max_results=20,
        start_date=datetime.datetime.now() - datetime.timedelta(days=7),
    )

    df = await flow.get_arxiv_data(download_pdfs=True)
    flow.save_to_csv("my_research.csv")
    await flow.close()

if __name__ == "__main__":
    asyncio.run(main())
```

If both ollama_model and gemini_model are provided, Ollama takes precedence. When gemini_model is set, a Gemini API key is required; pass gemini_api_key directly or set the GOOGLE_AI_API environment variable.
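
The precedence rules above can be pictured with a small stand-alone sketch. `resolve_backend` is a hypothetical helper, not part of arXivFlow's API, and the behavior when neither model is given is an assumption, since this README does not describe that case.

```python
import os
from typing import Optional


def resolve_backend(
    ollama_model: Optional[str] = None,
    gemini_model: Optional[str] = None,
    gemini_api_key: Optional[str] = None,
) -> dict:
    """Hypothetical helper mirroring the backend precedence described above."""
    if ollama_model:
        # Ollama takes precedence whenever it is configured.
        return {"backend": "ollama", "model": ollama_model}

    if gemini_model:
        # An explicit argument wins; otherwise fall back to the environment variable.
        api_key = gemini_api_key or os.getenv("GOOGLE_AI_API")
        if not api_key:
            raise ValueError("gemini_model is set but no Gemini API key was found")
        return {"backend": "gemini", "model": gemini_model, "api_key": api_key}

    # Assumption: with no model configured, this sketch simply refuses to guess.
    raise ValueError("set either ollama_model or gemini_model")
```

For example, `resolve_backend(ollama_model="llama3.2", gemini_model="gemini-2.5-flash")` selects the Ollama entry and never looks for a Gemini key.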

🧱 Request Stability

arXiv can occasionally return slow responses, rate limits, or temporary service errors. arXivFlow now makes the request path more stable by:

- Fetching arXiv metadata in smaller pages instead of relying on one large request.
- Fetching metadata for all requested categories before starting PDF downloads, which avoids PDF download bursts interfering with the next category query.
- Serializing arXiv requests and preserving the recommended 3-second interval.
- Retrying transient failures (429, 500, 502, 503, 504, timeouts, and network errors) with exponential backoff and jitter.
- Applying a longer cooldown after 429 rate-limit responses before making the next arXiv request.
- Respecting Retry-After headers when arXiv provides them.
- Using a default 60-second request timeout, configurable with request_timeout.
- Writing PDFs to temporary .part files first, then atomically replacing the final file only after validating PDF-like content.
- Deduplicating merged output by arXiv ID.
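
As a rough illustration of the request pacing and retry items above, the sketch below shows one way to serialize calls with a fixed interval and to compute a backoff delay that honors Retry-After. The names `IntervalLimiter` and `retry_delay`, the 120-second cap, and the 10% jitter are all assumptions made for this example; arXivFlow's internals may differ.

```python
import asyncio
import random
import time
from typing import Optional

# Status codes the list above names as transient.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class IntervalLimiter:
    """Hypothetical helper: one request at a time, at least `interval` seconds apart."""

    def __init__(self, interval: float = 3.0) -> None:
        self._interval = interval
        self._lock = asyncio.Lock()
        self._last: Optional[float] = None  # monotonic time of the previous request

    async def wait(self) -> None:
        async with self._lock:  # serializes concurrent callers
            if self._last is not None:
                remaining = self._interval - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last = time.monotonic()


def retry_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 3.0,
    cap: float = 120.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # A server-provided Retry-After value overrides the local schedule.
        return max(retry_after, 0.0)
    backoff = min(cap, base * (2 ** attempt))
    # Jitter: add up to 10% so parallel clients do not retry in lockstep.
    return backoff + random.uniform(0.0, backoff * 0.1)
```

A caller would `await limiter.wait()` before each arXiv request and, on a status in `RETRYABLE_STATUS`, sleep for `retry_delay(attempt, retry_after)` before trying again.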

For especially large date ranges, prefer smaller max_results values or narrower date windows. arXivFlow will page requests internally, but smaller slices are still easier for arXiv and more reliable in practice.
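
One way to follow this advice is to split a long range into fixed-size windows and process them one at a time. The helper below is a hypothetical sketch (`date_windows` is not part of arXivFlow), and the 7-day default is an arbitrary choice.

```python
import datetime
from typing import Iterator, Tuple

Window = Tuple[datetime.datetime, datetime.datetime]


def date_windows(
    start: datetime.datetime,
    end: datetime.datetime,
    days: int = 7,
) -> Iterator[Window]:
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    if days <= 0:
        raise ValueError("days must be positive")
    step = datetime.timedelta(days=days)
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)  # the last window may be shorter
        yield cursor, window_end
        cursor = window_end
```

Each window's start could then be used as start_date for a separate run. How the end of a window is passed to arXivFlow is not shown in this README, so check the constructor's parameters before relying on this.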

🏗️ Architecture

The project follows a modular structure for easy extension:

- src/arxivflow/arxivflow.py: The main orchestrator class (arXivFlow).
- src/arxivflow/ollama_functions.py: Local LLM interface using the Ollama API.
- src/arxivflow/gemini_functions.py: Gemini API interface for cloud-backed keyword and contact extraction.
- src/arxivflow/arxiv_functions.py: Asynchronous arXiv API interaction layer, including paging, rate limiting, retries, and PDF downloads.
- src/arxivflow/categories.py: arXiv category definitions.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

Release history

- 0.3.0 (May 27, 2026), this version
- 0.2.2 (May 26, 2026)
- 0.2.1 (May 18, 2026)
- 0.2.0 (May 13, 2026)
- 0.1.2 (May 8, 2026)
- 0.1.1 (May 1, 2026)
- 0.1.0 (May 1, 2026)

Download files

Source Distribution
- arxivflow-0.3.0.tar.gz (24.5 kB), uploaded May 27, 2026

Built Distribution
- arxivflow-0.3.0-py3-none-any.whl (18.8 kB), uploaded May 27, 2026, Python 3

File details

Details for the file arxivflow-0.3.0.tar.gz.

File metadata

Download URL: arxivflow-0.3.0.tar.gz
Upload date: May 27, 2026
Size: 24.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivflow-0.3.0.tar.gz:
- SHA256: `5a49fdab966ad2a5b563eac7e6f62c1f3fd7e307d62c02aa30418cc02a5dc9d4`
- MD5: `acfc7613b9cec89f5a50638faa916e86`
- BLAKE2b-256: `56a7c4db1b28d111321af878f7486f3d5438f44480e4672eea59d33cdee0f2ce`

Provenance

The following attestation bundles were made for arxivflow-0.3.0.tar.gz:

Publisher: python-publish.yml on zjzhao1002/arXivFlow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arxivflow-0.3.0.tar.gz
- Subject digest: 5a49fdab966ad2a5b563eac7e6f62c1f3fd7e307d62c02aa30418cc02a5dc9d4
- Sigstore transparency entry: 1643542175
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: zjzhao1002/arXivFlow@a2bdcccdee2193d87b7ca53cf6e612b2ef0d6464
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/zjzhao1002
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@a2bdcccdee2193d87b7ca53cf6e612b2ef0d6464
- Trigger Event: release

File details

Details for the file arxivflow-0.3.0-py3-none-any.whl.

File metadata

Download URL: arxivflow-0.3.0-py3-none-any.whl
Upload date: May 27, 2026
Size: 18.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivflow-0.3.0-py3-none-any.whl:
- SHA256: `2320f829c1053fb06341834b8f703e2e23e4fe23cd8b81a1ecebf3b2e04dd36e`
- MD5: `29773557483ff6bcb31acde9e362ae95`
- BLAKE2b-256: `9d86d0cf63da8aac03a2aec9ec22f71c8a30e06d4560053b1cdedb65912cfbe7`

Provenance

The following attestation bundles were made for arxivflow-0.3.0-py3-none-any.whl:

Publisher: python-publish.yml on zjzhao1002/arXivFlow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arxivflow-0.3.0-py3-none-any.whl
- Subject digest: 2320f829c1053fb06341834b8f703e2e23e4fe23cd8b81a1ecebf3b2e04dd36e
- Sigstore transparency entry: 1643542298
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: zjzhao1002/arXivFlow@a2bdcccdee2193d87b7ca53cf6e612b2ef0d6464
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/zjzhao1002
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@a2bdcccdee2193d87b7ca53cf6e612b2ef0d6464
- Trigger Event: release
