
A package for creating ML research assistant models through paper dataset creation and model fine-tuning

Project description

PaperTuner

PaperTuner is a Python package for building research assistant models: it processes academic papers into a question-answer dataset and fine-tunes language models on it so they can provide methodology guidance and suggest research approaches.

Features

  • Automated extraction of research papers from arXiv
  • Section extraction to identify problem statements, methodologies, and results
  • Generation of high-quality question-answer pairs for research methodology
  • Fine-tuning of language models with GRPO (Group Relative Policy Optimization)
  • Integration with Hugging Face for dataset and model sharing

Installation

pip install papertuner

Basic Usage

As a Command-Line Tool

1. Create a dataset from research papers

# Set up your environment variables
export GEMINI_API_KEY="your-api-key"
export HF_TOKEN="your-huggingface-token"  # Optional, for uploading to HF

# Run the dataset creation
papertuner-dataset --max-papers 100

2. Train a model

# Train using the dataset created above, or any existing dataset on the Hugging Face Hub
papertuner-train --model "Qwen/Qwen2.5-3B-Instruct" --dataset "densud2/ml_qa_dataset"
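
If you prefer to stay in Python for this step, the same training run can be started programmatically with the ResearchAssistantTrainer class documented in the next section. This is a minimal sketch that assumes the remaining constructor options (LoRA rank, output directory, system prompt) have sensible defaults:

from papertuner import ResearchAssistantTrainer

# Programmatic equivalent of the papertuner-train command above.
# Assumes default values for lora_rank, output_dir, and system_prompt.
trainer = ResearchAssistantTrainer(model_name="Qwen/Qwen2.5-3B-Instruct")
results = trainer.train("densud2/ml_qa_dataset")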

As a Python Library

Here's a complete example of creating a specialized biology research model:

from papertuner import ResearchPaperProcessor, ResearchAssistantTrainer

# 1. Create a dataset from biology papers
processor = ResearchPaperProcessor(
    api_key="your-gemini-api-key",
    hf_repo_id="your-username/bio-research-qa"
)

# Use a biology-focused search query
bio_query = " OR ".join([
    "molecular biology",
    "cell biology",
    "genetics",
    "biochemistry",
    "systems biology",
    "synthetic biology",
    "bioinformatics",
    "genomics",
    "proteomics",
    "metabolomics"
])

# Process papers and create dataset
papers = processor.process_papers(
    max_papers=100,
    search_query=bio_query,
    clear_processed_data=True  # Start fresh
)

# 2. Train a specialized model
trainer = ResearchAssistantTrainer(
    model_name="Qwen/Qwen2.5-3B-Instruct",  # Base model
    lora_rank=64,
    output_dir="./bio_model",
    system_prompt="""You are a biology research assistant. Follow this format:
<think>
Analyze the biological research question step-by-step, considering:
- Relevant biological mechanisms
- Experimental approaches
- Key methodological considerations
- Potential limitations
</think>

Provide a clear, scientifically-grounded answer that explains both the 'how' and 'why'
of the biological approach or method."""
)

# Train the model
results = trainer.train("your-username/bio-research-qa")

# 3. Test the model with biology questions
questions = [
    "How would you design a CRISPR experiment to study gene function in mammalian cells?",
    "What approaches can be used to study protein-protein interactions in vivo?",
    "How would you analyze single-cell RNA sequencing data to identify cell types?"
]

for question in questions:
    response = trainer.run_inference(
        results["model"],
        results["tokenizer"],
        question,
        results["lora_path"]
    )
    print(f"\nQ: {question}")
    print(f"A: {response}\n")

Configuration

You can configure the tool with environment variables or by passing the corresponding arguments when initializing the classes; a short Python sketch follows the list:

  • GEMINI_API_KEY: API key for generating QA pairs
  • HF_TOKEN: Hugging Face token for uploading datasets and models
  • HF_REPO_ID: Hugging Face repository ID for the dataset
  • PAPERTUNER_DATA_DIR: Custom directory for storing data (default: ~/.papertuner/data)
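
These variables can also be set from Python before the classes are constructed. A minimal sketch with placeholder values, assuming the classes fall back to the environment when no explicit arguments are passed:

import os

# Placeholder values; set these before constructing the processor or trainer.
os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"
os.environ["HF_TOKEN"] = "your-huggingface-token"        # optional, for uploads
os.environ["HF_REPO_ID"] = "your-username/ml-qa-dataset"
os.environ["PAPERTUNER_DATA_DIR"] = "/data/papertuner"   # overrides ~/.papertuner/data

from papertuner import ResearchPaperProcessor

# Assumption: with no explicit api_key / hf_repo_id, the processor reads the
# environment variables set above.
processor = ResearchPaperProcessor()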

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

