Academic Paper Extraction and Formatting Tool
PaperXtract
For Chinese documentation, please see README_ZH.md.
PaperXtract extracts academic papers from scholarly platforms such as OpenReview and formats them as plain text, making research papers easier to read and organize.
Key Features
- Paper Extraction: Extract paper information from the OpenReview platform, either by URL or by conference (venue) ID
- Category Filtering: Filter papers by category (e.g., oral, spotlight, poster)
- Formatted Output: Convert paper information into a readable TXT format
- Batch Processing: Process papers in batches rather than one at a time
- Command-line Interface: Use the CLI to integrate PaperXtract into automated workflows
Installation
Via pip
pip install paperxtract
From source
git clone https://github.com/yuxiaoLeeMarks/paperxtract.git
cd paperxtract
pip install -e .
Usage
Command-line Tool
PaperXtract provides a command-line tool named paperxtract. The examples below cover its extract, format, and run subcommands:
Extracting Papers
# Extract papers from URL
paperxtract extract --url "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science" --output papers.json
# Extract only oral papers by conference (venue) ID
paperxtract extract --venue-id "ICML.cc/2024/Workshop/AI4Science" --category oral --output papers.json
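The two commands above address the same venue: the conference ID is the value of the `id` query parameter in the group URL. How PaperXtract maps one to the other internally is not documented here, so the following is only a minimal sketch using the standard library. The helper name `venue_id_from_url` is hypothetical and not part of the package.

```python
from urllib.parse import parse_qs, urlparse


def venue_id_from_url(url: str) -> str:
    """Return the `id` query parameter of an OpenReview group URL."""
    query = parse_qs(urlparse(url).query)
    ids = query.get("id")
    if not ids:
        raise ValueError(f"no 'id' parameter in URL: {url!r}")
    return ids[0]


if __name__ == "__main__":
    url = "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science"
    # Prints the value that would be passed to --venue-id.
    print(venue_id_from_url(url))
```

A helper like this can be handy when a script receives URLs but wants to call the `--venue-id` form.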
Formatting Papers
# Convert JSON file to TXT format
paperxtract format papers.json --output papers.txt
# Only convert papers of specific categories
paperxtract format papers.json --categories oral spotlight --output oral_spotlight_papers.txt
# List available paper categories
paperxtract format papers.json --list-categories
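To illustrate what listing and filtering by category amount to, here is a small sketch that works on in-memory records. It assumes each paper is a dictionary with a `category` field; the actual schema of papers.json is not shown in this README and may differ. The function names are hypothetical.

```python
from collections import Counter


def list_categories(papers):
    """Count papers per category; papers without one are grouped under 'unknown'."""
    return Counter(p.get("category", "unknown") for p in papers)


def filter_by_categories(papers, categories):
    """Keep only papers whose category is in `categories` (case-insensitive)."""
    wanted = {c.lower() for c in categories}
    return [p for p in papers if p.get("category", "").lower() in wanted]


if __name__ == "__main__":
    # Hypothetical records; the real papers.json layout is an assumption.
    papers = [
        {"title": "Paper A", "category": "oral"},
        {"title": "Paper B", "category": "poster"},
        {"title": "Paper C", "category": "Spotlight"},
    ]
    print(dict(list_categories(papers)))
    print([p["title"] for p in filter_by_categories(papers, ["oral", "spotlight"])])
```

The case-insensitive match is a design choice of this sketch, not a documented behavior of the tool.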
One-step Operation
# Extract papers from URL and format directly to TXT
paperxtract run --url "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science" --output papers.txt --clean-temp
# Only extract and format papers of specific categories
paperxtract run --url "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science" --output oral_papers.txt --categories oral --clean-temp
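For automated workflows, the one-step command can be driven from a script. The sketch below only builds the argument lists, using the flags shown above (`--url`, `--output`, `--categories`, `--clean-temp`), and prints them rather than executing anything. The function `build_run_command` and the job list are hypothetical.

```python
import shlex


def build_run_command(url, output, categories=None, clean_temp=True):
    """Build the argv list for one `paperxtract run` invocation."""
    cmd = ["paperxtract", "run", "--url", url, "--output", output]
    if categories:
        cmd += ["--categories", *categories]
    if clean_temp:
        cmd.append("--clean-temp")
    return cmd


if __name__ == "__main__":
    url = "https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science"
    # Hypothetical batch: (output file, categories to keep).
    jobs = [
        ("oral_papers.txt", ["oral"]),
        ("oral_spotlight_papers.txt", ["oral", "spotlight"]),
    ]
    for output, categories in jobs:
        cmd = build_run_command(url, output, categories=categories)
        # Printed for inspection; to execute, pass `cmd` to
        # subprocess.run(cmd, check=True) with paperxtract installed.
        print(shlex.join(cmd))
```

Passing the command as a list avoids shell quoting problems with the `?` and `=` characters in the URL.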
Python API
PaperXtract can also be used as a Python library:
from paperxtract.extractors.openreview import OpenReviewExtractor
from paperxtract.formatters.text_formatter import convert_papers_to_txt
# Extract papers
extractor = OpenReviewExtractor()
papers = extractor.get_papers_from_url("https://openreview.net/group?id=ICML.cc/2024/Workshop/AI4Science")
extractor.save_to_json(papers, "papers.json")
# Format papers
convert_papers_to_txt("papers.json", "papers.txt", categories="oral")
Project Structure
paperxtract/
├── paperxtract/              # Main package
│   ├── __init__.py           # Package initialization
│   ├── __main__.py           # Entry point
│   ├── cli.py                # Command line interface
│   ├── extractors/           # Extractors subpackage
│   │   ├── __init__.py
│   │   └── openreview.py     # OpenReview extractor
│   └── formatters/           # Formatters subpackage
│       ├── __init__.py
│       └── text_formatter.py # Text formatter
├── examples/                 # Example code
│   └── extract_and_format.py
├── docs/                     # Documentation
├── tests/                    # Tests
├── setup.py                  # Installation configuration
├── requirements.txt          # Dependencies
└── README.md                 # Documentation
Example Output
Example of a formatted TXT file:
ICML.2024 - Accept
| Total: 45
#1 Efficient Vision-Language Pre-training by Cluster Masking
Authors: Zihao Wei, Zixuan Pan, Andrew Owens
Keywords: Vision-Language, Pre-training, Masking
Abstract: The quest for optimal vision-language pretraining strategies...
#2 MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction
Authors: Mude Hui, Zihao Wei, Hongru Zhu
Keywords: 3D Reconstruction, Diffusion Models, Microscopy
Abstract: Volumetric optical microscopy using non-diffracting beams...
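As a rough illustration of how records could be rendered into the layout shown above, here is a minimal sketch. It is not the package's `text_formatter` implementation; the function name `format_papers` and the record fields (`title`, `authors`, `keywords`, `abstract`) are assumptions made for this example.

```python
def format_papers(papers, header):
    """Render paper dicts in a layout similar to the example output above."""
    lines = [header, f"| Total: {len(papers)}"]
    for i, paper in enumerate(papers, start=1):
        lines.append(f"#{i} {paper['title']}")
        lines.append("Authors: " + ", ".join(paper.get("authors", [])))
        lines.append("Keywords: " + ", ".join(paper.get("keywords", [])))
        lines.append("Abstract: " + paper.get("abstract", ""))
    return "\n".join(lines)


if __name__ == "__main__":
    # One record taken from the example output; the field names are assumed.
    papers = [
        {
            "title": "Efficient Vision-Language Pre-training by Cluster Masking",
            "authors": ["Zihao Wei", "Zixuan Pan", "Andrew Owens"],
            "keywords": ["Vision-Language", "Pre-training", "Masking"],
            "abstract": "The quest for optimal vision-language pretraining strategies...",
        }
    ]
    print(format_papers(papers, "ICML.2024 - Accept"))
```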
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
File details
Details for the file paperxtract-0.1.0.tar.gz.
File metadata
- Download URL: paperxtract-0.1.0.tar.gz
- Upload date:
- Size: 11.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 74a9a1598b80e25545b02575f61df31cfecca7da87e1227ad137de67a899708d |
| MD5 | cbe9a2ba897eda28c8aae7492ae1d0a0 |
| BLAKE2b-256 | 6b622af6ac42efcf89e41f36b59adc6fc5746fd097e429be0915a53fa024d179 |
File details
Details for the file paperxtract-0.1.0-py3-none-any.whl.
File metadata
- Download URL: paperxtract-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cdd1592f31977d27d7eb50a43890edfb53e97f60c7799ac039fde5952a7a9887 |
| MD5 | c273d86decd6ddde403558ba41056862 |
| BLAKE2b-256 | 053daf06fa553da33109bf61da9ffda9110a0fe686b356e60bb6928115a2e476 |
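The published digests can be used to check a manually downloaded file. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and it assumes the source distribution has been saved in the current directory under its original file name.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    sdist = Path("paperxtract-0.1.0.tar.gz")  # assumed download location
    expected = "74a9a1598b80e25545b02575f61df31cfecca7da87e1227ad137de67a899708d"
    if sdist.exists():
        print("OK" if verify_download(sdist, expected) else "MISMATCH")
    else:
        print(f"{sdist} not found; download it first")
```

When installing with pip from a requirements file, the `--require-hashes` mode performs an equivalent check automatically.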