A package for creating ML research assistant models through paper dataset creation and model fine-tuning
PaperTuner
PaperTuner is a Python package for creating research assistant models. It processes academic papers into a dataset and fine-tunes language models to give guidance on research methodology and approach.
Features
- Automated extraction of research papers from arXiv
- Section extraction to identify problem statements, methodologies, and results
- Generation of high-quality question-answer pairs for research methodology
- Fine-tuning of language models with GRPO (Group Relative Policy Optimization)
- Integration with Hugging Face for dataset and model sharing
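PaperTuner's own extraction logic is not shown in this README, so the following is only a minimal sketch of how heading-based section extraction can work. It assumes plain-text papers whose section headings sit on their own lines. The `split_sections` function and the keyword table are hypothetical and are not part of the package's API.

```python
import re

# Hypothetical keyword table: canonical section name -> heading prefixes.
SECTION_KEYWORDS = {
    "problem": ("abstract", "introduction", "problem"),
    "methodology": ("method", "approach"),
    "results": ("result", "experiment", "evaluation"),
}

# Matches a heading line such as "3 Methodology" or "2.1. Approach":
# an optional section number followed by letters, spaces, or hyphens only.
HEADING_RE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?([A-Za-z][A-Za-z \-]{2,60})\s*$"
)


def classify_heading(line):
    """Return the canonical section name for a heading line, or None."""
    match = HEADING_RE.match(line)
    if not match:
        return None
    title = match.group(1).lower()
    for name, keywords in SECTION_KEYWORDS.items():
        if any(title.startswith(keyword) for keyword in keywords):
            return name
    return None


def split_sections(text):
    """Group non-empty lines under the most recent recognized heading."""
    sections = {}
    current = None
    for line in text.splitlines():
        name = classify_heading(line)
        if name is not None:
            current = name
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in sections.items()}
```

A heuristic like this misfires on unpunctuated body lines that happen to start with a keyword, so a real pipeline would likely need stricter heading detection or a model-based fallback.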
Installation
pip install papertuner
Basic Usage
As a Command-Line Tool
1. Create a dataset from research papers
# Set up your environment variables
export GEMINI_API_KEY="your-api-key"
export HF_TOKEN="your-huggingface-token" # Optional, for uploading to HF
# Run the dataset creation
papertuner-dataset --max-papers 100
2. Train a model
# Train using the created or an existing dataset
papertuner-train --model "Qwen/Qwen2.5-3B-Instruct" --dataset "densud2/ml_qa_dataset"
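The dataset step starts from arXiv. As a rough illustration of what retrieving paper metadata involves (this is not PaperTuner's internal code), the sketch below builds a query URL for the public arXiv API and parses the Atom feed it returns. The helper names `build_query_url` and `parse_feed` are assumptions made for this example.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by the Atom feed


def build_query_url(search_query, max_results=10, start=0):
    """Build an arXiv API query URL, e.g. for search_query='cat:cs.LG'."""
    params = urllib.parse.urlencode({
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    })
    return f"{ARXIV_API}?{params}"


def parse_feed(xml_text):
    """Extract id, title, and summary from each entry of an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title", default="")
        summary = entry.findtext(f"{ATOM}summary", default="")
        papers.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Collapse the line breaks arXiv inserts into long fields.
            "title": " ".join(title.split()),
            "summary": " ".join(summary.split()),
        })
    return papers
```

To try it against the live service, fetch the URL with `urllib.request.urlopen(build_query_url("cat:cs.LG", max_results=5))` and pass the response body to `parse_feed`.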
As a Python Library
from papertuner import ResearchPaperProcessor, ResearchAssistantTrainer

# Create a dataset
processor = ResearchPaperProcessor(
    api_key="your-api-key",
    hf_repo_id="your-username/dataset-name"
)
papers = processor.process_papers(max_papers=10)

# Train a model
trainer = ResearchAssistantTrainer(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    lora_rank=64,
    output_dir="./model_output"
)
results = trainer.train("your-username/dataset-name")

# Test the model
question = "How would you design a transformer model for time series forecasting?"
response = trainer.run_inference(
    results["model"],
    results["tokenizer"],
    question,
    results["lora_path"]
)
print(response)
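For context on the training step: GRPO samples several completions per prompt and scores each one relative to the others in its group, rather than against a learned value function. The sketch below shows only that group-relative advantage computation in plain Python. The function name and the example rewards are illustrative and are not part of PaperTuner's API. Population standard deviation is used here, and implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    completions that beat the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four completions of a single prompt.
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
```

The `eps` term keeps the division defined when every completion in a group receives the same reward, in which case all advantages are zero.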
Configuration
You can configure the tool using environment variables or when initializing the classes:
- GEMINI_API_KEY: API key for generating QA pairs
- HF_TOKEN: Hugging Face token for uploading datasets and models
- HF_REPO_ID: Hugging Face repository ID for the dataset
- PAPERTUNER_DATA_DIR: Custom directory for storing data (default: ~/.papertuner/data)
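A minimal sketch of how such settings can be resolved, assuming that explicit arguments take precedence over environment variables. That precedence and the `resolve_config` helper are assumptions made for illustration, not PaperTuner's documented behavior. Only the variable names and the default data directory come from this README.

```python
import os
from pathlib import Path

DEFAULT_DATA_DIR = "~/.papertuner/data"  # default stated in this README


def resolve_config(api_key=None, hf_token=None, hf_repo_id=None,
                   data_dir=None, env=None):
    """Merge explicit arguments with environment variables.

    Hypothetical helper: an explicit argument wins, then the environment
    variable, then (for the data directory only) the default.
    """
    env = os.environ if env is None else env
    raw_dir = data_dir or env.get("PAPERTUNER_DATA_DIR") or DEFAULT_DATA_DIR
    return {
        "api_key": api_key or env.get("GEMINI_API_KEY"),
        "hf_token": hf_token or env.get("HF_TOKEN"),
        "hf_repo_id": hf_repo_id or env.get("HF_REPO_ID"),
        "data_dir": Path(raw_dir).expanduser(),
    }
```

Passing a plain dictionary as `env` makes the resolution easy to exercise without touching the real process environment.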
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
File details
Details for the file papertuner-0.0.2.tar.gz.
File metadata
- Download URL: papertuner-0.0.2.tar.gz
- Upload date:
- Size: 20.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 495e86c236dade670dd199954422f16e5cdb229aa75b8eecbbf9584370e906e4 |
| MD5 | 2aa636c8d17a05a63fc29bb2696ec205 |
| BLAKE2b-256 | 5899ca3233c167e26d7f49e1d2542f5769fe66d04af55e81c119b4aa06421469 |
File details
Details for the file papertuner-0.0.2-py3-none-any.whl.
File metadata
- Download URL: papertuner-0.0.2-py3-none-any.whl
- Upload date:
- Size: 20.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0c7cd5f164be09880bf59819672e12307c9b9f0ce338a4e2e0a9cf5581e53b9e |
| MD5 | ca3920f94c846639528e9459a3093470 |
| BLAKE2b-256 | e228aa0bc4d8c60efd094c23956dc51bdbb61f3d139ab17f59e2a6b926a40e4d |