Academic Paper Extraction and Formatting Tool

PaperXtract

For Chinese documentation, please see README_ZH.md.

PaperXtract extracts academic papers from scholarly platforms such as OpenReview and formats them as plain text, making research papers easier to read and organize.

Key Features

  • Paper Extraction: Extract paper information from the OpenReview platform, either by URL or by conference ID
  • Category Filtering: Filter papers by category (e.g., oral, spotlight, poster)
  • Formatted Output: Convert paper information into a readable TXT format
  • Batch Processing: Process papers in batches for increased efficiency
  • Command-line Interface: Provides a CLI that integrates easily into automated workflows

Installation

Via pip

pip install paperxtract

From source

git clone https://github.com/yuxiaoLeeMarks/paperxtract.git
cd paperxtract
pip install -e .

Usage

Command-line Tool

PaperXtract provides a command-line tool named paperxtract with three subcommands: extract, format, and run.

Extracting Papers

# Extract papers from URL
paperxtract extract --url "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science" --output papers.json

# Extract papers from conference ID
paperxtract extract --venue-id "ICML.cc/2024/Workshop/AI4Science" --category oral --output papers.json
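
In the two commands above, the conference ID matches the value of the id query parameter in the OpenReview group URL. The sketch below shows one way to derive the ID from a URL using only the Python standard library. venue_id_from_url is a hypothetical helper written for illustration; it is not part of PaperXtract, and the tool may parse URLs differently.

```python
from urllib.parse import parse_qs, urlparse


def venue_id_from_url(url: str) -> str:
    """Return the `id` query parameter of an OpenReview group URL.

    Hypothetical helper for illustration; not part of PaperXtract.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    if not values:
        raise ValueError(f"no 'id' query parameter in {url!r}")
    return values[0]


url = "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science"
print(venue_id_from_url(url))  # ICML.cc/2024/Workshop/AI4Science
```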

Formatting Papers

# Convert JSON file to TXT format
paperxtract format papers.json --output papers.txt

# Only convert papers of specific categories
paperxtract format papers.json --categories oral spotlight --output oral_spotlight_papers.txt

# List available paper categories
paperxtract format papers.json --list-categories
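
To make the category options concrete, here is a minimal sketch of listing and filtering categories in plain Python. It assumes each paper is a dictionary with a "category" key; that layout is an assumption for illustration, and the actual structure of papers.json may differ. Both function names are hypothetical.

```python
from collections import Counter

# Hypothetical records; the real papers.json layout may differ.
papers = [
    {"title": "Paper A", "category": "oral"},
    {"title": "Paper B", "category": "poster"},
    {"title": "Paper C", "category": "spotlight"},
    {"title": "Paper D", "category": "poster"},
]


def list_categories(records):
    """Count how many records fall into each category."""
    return Counter(r.get("category", "unknown") for r in records)


def filter_by_categories(records, categories):
    """Keep only records whose category is in the requested set."""
    wanted = {c.lower() for c in categories}
    return [r for r in records if r.get("category", "").lower() in wanted]


print(dict(list_categories(papers)))
selected = filter_by_categories(papers, ["oral", "spotlight"])
print([p["title"] for p in selected])  # ['Paper A', 'Paper C']
```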

One-step Operation

# Extract papers from URL and format directly to TXT
paperxtract run --url "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science" --output papers.txt --clean-temp

# Only extract and format papers of specific categories
paperxtract run --url "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science" --output oral_papers.txt --categories oral --clean-temp
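
For automated workflows, the one-step command can be assembled from a script. The sketch below builds one paperxtract run invocation per category, using only the flags shown above. build_run_command is a hypothetical helper, and the output file names are illustrative. The commands are printed rather than executed.

```python
import shlex


def build_run_command(url, output, categories=None, clean_temp=True):
    """Assemble the argument list for `paperxtract run`.

    Hypothetical helper; it only uses flags that appear in this README.
    """
    cmd = ["paperxtract", "run", "--url", url, "--output", output]
    if categories:
        cmd += ["--categories", *categories]
    if clean_temp:
        cmd.append("--clean-temp")
    return cmd


url = "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science"
commands = [
    build_run_command(url, f"{category}_papers.txt", [category])
    for category in ("oral", "spotlight", "poster")
]
for cmd in commands:
    # Print a shell-quoted form; to execute, pass cmd to
    # subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```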

Python API

PaperXtract can also be used as a Python library:

from paperxtract.extractors.openreview import OpenReviewExtractor
from paperxtract.formatters.text_formatter import convert_papers_to_txt

# Extract papers
extractor = OpenReviewExtractor()
papers = extractor.get_papers_from_url("https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science")
extractor.save_to_json(papers, "papers.json")

# Format papers
convert_papers_to_txt("papers.json", "papers.txt", categories="oral")

Project Structure

paperxtract/
├── paperxtract/          # Main package
│   ├── __init__.py       # Package initialization
│   ├── __main__.py       # Entry point
│   ├── cli.py            # Command line interface
│   ├── extractors/       # Extractors subpackage
│   │   ├── __init__.py
│   │   └── openreview.py # OpenReview extractor
│   └── formatters/       # Formatters subpackage
│       ├── __init__.py
│       └── text_formatter.py # Text formatter
├── examples/             # Example code
│   └── extract_and_format.py
├── docs/                 # Documentation
├── tests/                # Tests
├── setup.py              # Installation configuration
├── requirements.txt      # Dependencies
└── README.md             # Project README

Example Output

Example of a formatted TXT file:

ICML.2024 - Accept
   | Total: 45

#1 Efficient Vision-Language Pre-training by Cluster Masking
Authors: Zihao Wei, Zixuan Pan, Andrew Owens
Keywords: Vision-Language, Pre-training, Masking
Abstract: The quest for optimal vision-language pretraining strategies...

#2 MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction
Authors: Mude Hui, Zihao Wei, Hongru Zhu
Keywords: 3D Reconstruction, Diffusion Models, Microscopy
Abstract: Volumetric optical microscopy using non-diffracting beams...
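
The layout above can be approximated with a few lines of Python. This is a sketch of the format only, not PaperXtract's formatter: render_papers is a hypothetical function, and the field names (title, authors, keywords, abstract) are assumptions about the record structure.

```python
def render_papers(title, papers):
    """Render paper records into a numbered plain-text listing.

    Hypothetical sketch; field names are assumed, not taken from PaperXtract.
    """
    lines = [title, f"   | Total: {len(papers)}", ""]
    for index, paper in enumerate(papers, start=1):
        lines.append(f"#{index} {paper['title']}")
        lines.append("Authors: " + ", ".join(paper["authors"]))
        lines.append("Keywords: " + ", ".join(paper["keywords"]))
        lines.append("Abstract: " + paper["abstract"])
        lines.append("")  # blank line between entries
    return "\n".join(lines)


sample = [
    {
        "title": "Efficient Vision-Language Pre-training by Cluster Masking",
        "authors": ["Zihao Wei", "Zixuan Pan", "Andrew Owens"],
        "keywords": ["Vision-Language", "Pre-training", "Masking"],
        "abstract": "The quest for optimal vision-language pretraining strategies...",
    }
]
text = render_papers("ICML.2024 - Accept", sample)
print(text)
```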

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
