A package for creating ML research assistant models through paper dataset creation and model fine-tuning

Development Status
- 4 - Beta
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

PaperTuner

PaperTuner is a Python package for creating research assistant models. It processes academic papers into a dataset and fine-tunes language models on that dataset so they can give methodology guidance and suggest research approaches.

Features

Automated extraction of research papers from arXiv
Section extraction to identify problem statements, methodologies, and results
Generation of high-quality question-answer pairs for research methodology
Fine-tuning of language models with GRPO (Group Relative Policy Optimization)
Integration with Hugging Face for dataset and model sharing
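
To make the section-extraction feature more concrete, here is a minimal sketch that splits plain paper text on a few common heading names. The heading list, the `split_sections` function, and the regular expression are assumptions made for this illustration; PaperTuner's own extraction logic is not shown in this description and may work quite differently.

```python
import re

# Hypothetical heading names; real papers vary widely.
HEADINGS = ("abstract", "introduction", "methods", "methodology", "results", "conclusion")

# Matches a line holding only an optional section number and a known heading.
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(" + "|".join(HEADINGS) + r")\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text: str) -> dict[str, str]:
    """Map each lower-cased heading name to the text that follows it."""
    matches = list(HEADING_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        start = match.end()
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[start:end].strip()
    return sections


sample = """1. Introduction
We study protein folding.

2. Methods
We train a small transformer.

3. Results
Accuracy improves by a clear margin.
"""

sections = split_sections(sample)
print(sections["methods"])
```

A heading-based split like this is brittle on real PDFs, which is presumably one reason a dedicated extraction step is listed as a feature.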

Installation

pip install papertuner

Basic Usage

As a Command-Line Tool

1. Create a dataset from research papers

# Set up your environment variables
export GEMINI_API_KEY="your-api-key"
export HF_TOKEN="your-huggingface-token"  # Optional, for uploading to HF

# Run the dataset creation
papertuner-dataset --max-papers 100

2. Train a model

# Train using the created or an existing dataset
papertuner-train --model "Qwen/Qwen2.5-3B-Instruct" --dataset "densud2/ml_qa_dataset"

As a Python Library

Here's a complete example of creating a specialized biology research model:

from papertuner import ResearchPaperProcessor, ResearchAssistantTrainer

# 1. Create a dataset from biology papers
processor = ResearchPaperProcessor(
    api_key="your-gemini-api-key",
    hf_repo_id="your-username/bio-research-qa"
)

# Use a biology-focused search query
bio_query = " OR ".join([
    "molecular biology",
    "cell biology",
    "genetics",
    "biochemistry",
    "systems biology",
    "synthetic biology",
    "bioinformatics",
    "genomics",
    "proteomics",
    "metabolomics"
])

# Process papers and create dataset
papers = processor.process_papers(
    max_papers=100,
    search_query=bio_query,
    clear_processed_data=True  # Start fresh
)

# 2. Train a specialized model
trainer = ResearchAssistantTrainer(
    model_name="Qwen/Qwen2.5-3B-Instruct",  # Base model
    lora_rank=64,
    output_dir="./bio_model",
    system_prompt="""You are a biology research assistant. Follow this format:
<think>
Analyze the biological research question step-by-step, considering:
- Relevant biological mechanisms
- Experimental approaches
- Key methodological considerations
- Potential limitations
</think>

Provide a clear, scientifically-grounded answer that explains both the 'how' and 'why'
of the biological approach or method."""
)

# Train the model
results = trainer.train("your-username/bio-research-qa")

# 3. Test the model with biology questions
questions = [
    "How would you design a CRISPR experiment to study gene function in mammalian cells?",
    "What approaches can be used to study protein-protein interactions in vivo?",
    "How would you analyze single-cell RNA sequencing data to identify cell types?"
]

for question in questions:
    response = trainer.run_inference(
        results["model"],
        results["tokenizer"],
        question,
        results["lora_path"]
    )
    print(f"\nQ: {question}")
    print(f"A: {response}\n")
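
If the trained model follows the `<think>` format requested in the system prompt above, you may want to separate the reasoning from the final answer before showing it to users. The sketch below is one way to do that; `split_response` is a hypothetical helper, not part of PaperTuner, and it assumes the response is a plain string with at most one `<think>` block.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(response: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = THINK_RE.search(response)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    # Everything after the closing tag is treated as the final answer.
    answer = response[match.end():].strip()
    return reasoning, answer


example = (
    "<think>\nConsider guide RNA design and controls.\n</think>\n\n"
    "Use a knockout with validated guides."
)
reasoning, answer = split_response(example)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Falling back to the whole response when no tags are found keeps the helper usable even if the model ignores the format.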

Configuration

You can configure the tool through environment variables or by passing the equivalent values when initializing the classes:

GEMINI_API_KEY: API key for generating QA pairs
HF_TOKEN: Hugging Face token for uploading datasets and models
HF_REPO_ID: Hugging Face repository ID for the dataset
PAPERTUNER_DATA_DIR: Custom directory for storing data (default: ~/.papertuner/data)
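
The variables above could be gathered in your own scripts roughly as follows. This is a sketch under stated assumptions: `load_settings` and the dictionary keys are hypothetical names, and only the variable names and the default data directory come from the list above.

```python
import os
from pathlib import Path

# Default taken from the configuration list above.
DEFAULT_DATA_DIR = "~/.papertuner/data"


def load_settings(env=None) -> dict:
    """Collect the documented settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return {
        "gemini_api_key": env.get("GEMINI_API_KEY"),
        "hf_token": env.get("HF_TOKEN"),  # optional, only needed for uploads
        "hf_repo_id": env.get("HF_REPO_ID"),
        "data_dir": Path(env.get("PAPERTUNER_DATA_DIR", DEFAULT_DATA_DIR)).expanduser(),
    }


# Pass a plain dict to try it out without touching the real environment.
settings = load_settings({"GEMINI_API_KEY": "your-api-key"})
print(settings["data_dir"])
```

Accepting an explicit mapping makes the helper easy to test; calling `load_settings()` with no argument reads the real environment.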

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Release history

- 0.2.27 (Mar 28, 2025)
- 0.2.26 (Mar 28, 2025)
- 0.2.25 (Mar 26, 2025)
- 0.2.24 (Mar 26, 2025)
- 0.2.23 (Mar 26, 2025)
- 0.2.22 (Mar 26, 2025)
- 0.2.21 (Mar 25, 2025)
- 0.2.1 (Mar 25, 2025)
- 0.2.0 (Mar 25, 2025)
- 0.1.4 (Mar 24, 2025)
- 0.1.3 (Mar 24, 2025)
- 0.1.2 (Mar 24, 2025), this version
- 0.1.1 (Mar 23, 2025)
- 0.0.8 (Mar 23, 2025)
- 0.0.7 (Mar 23, 2025)
- 0.0.6 (Mar 23, 2025)
- 0.0.4 (Mar 23, 2025)
- 0.0.3 (Mar 23, 2025)
- 0.0.2 (Mar 23, 2025)
- 0.0.1 (Mar 23, 2025)

Download files

Source Distribution

papertuner-0.1.2.tar.gz (21.0 kB)

Uploaded Mar 24, 2025 Source

Built Distribution

papertuner-0.1.2-py3-none-any.whl (21.0 kB)

Uploaded Mar 24, 2025 Python 3

File details

Details for the file papertuner-0.1.2.tar.gz.

File metadata

Download URL: papertuner-0.1.2.tar.gz
Upload date: Mar 24, 2025
Size: 21.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for papertuner-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`59f5afd5300950da6ac43148a04f4d6959898d2163ddad04dbee70d0b3683691`
MD5	`ccb5d41dd4eaa7af139bcff3f8e5352a`
BLAKE2b-256	`42268a9b9d0ceba2b0efa6f7360a0f657d93228ae310278289f27a5a70f5dfbf`

File details

Details for the file papertuner-0.1.2-py3-none-any.whl.

File metadata

Download URL: papertuner-0.1.2-py3-none-any.whl
Upload date: Mar 24, 2025
Size: 21.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for papertuner-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c0e5f4cbba6b27c42a0fa8a4dc49fa9c01f9a14033057ff362b2aaf24d200574`
MD5	`cba557933f6d55a776eec56e408bfdac`
BLAKE2b-256	`a0179bb86199edeed12dd0ea582c592ee757f430241d9dc0e5fa6443724748ab`

papertuner 0.1.2
