Enzyme lineage analysis and sequence extraction package

These details have not been verified by PyPI

Project links

Project description

DEBase

Enzyme lineage analysis and sequence extraction package with advanced parallel processing capabilities.

Installation

pip install debase

For full functionality with chemical SMILES support:

pip install debase[rdkit]

Requirements

Python 3.8 or higher
A Gemini API key (set as environment variable GEMINI_API_KEY)

Recent Updates

Campaign-Aware Extraction: Automatically detects and processes multiple directed evolution campaigns in a single paper
Improved Model Support: Updated to use stable Gemini models for better reliability
Enhanced PDB Integration: Intelligent AI-based matching of PDB structures to enzyme variants
Better Filtering: Automatic removal of non-enzyme entries (buffers, controls, media)
Optimized Performance: Removed unnecessary rate limiting for faster processing
External Sequence Fetching: Automatic retrieval from PDB and UniProt databases when sequences aren't in papers
Improved SI Processing: Structure-aware extraction of supplementary information
Vision Support: Extracts data from figures and tables using multimodal AI capabilities

Quick Start

Basic Usage

# Run the full pipeline (sequential processing)
debase --manuscript manuscript.pdf --si supplementary.pdf --output output.csv

High-Performance Parallel Processing

# Use parallel individual processing for maximum speed + accuracy
debase --manuscript manuscript.pdf --si supplementary.pdf --output output.csv \
  --use-parallel-individual --max-workers 5

# Use batch processing for maximum speed (slight accuracy trade-off)
debase --manuscript manuscript.pdf --si supplementary.pdf --output output.csv \
  --use-optimized-reaction --reaction-batch-size 5

Processing Methods

DEBase offers three processing approaches optimized for different use cases:

1. Parallel Individual Processing (Recommended)

42 individual API calls (21 for reactions + 21 for substrate scope)
5 calls running simultaneously for 4-5x speedup
Maximum accuracy - each enzyme gets dedicated attention
Best for: Production use, important analyses

debase --manuscript paper.pdf --si si.pdf --use-parallel-individual --max-workers 5

2. Batch Processing (Fastest)

~8 total API calls (multiple enzymes per call)
Fastest processing - up to 8x speedup
Good accuracy - slight trade-off for complex chemical names
Best for: Quick analyses, large-scale processing

debase --manuscript paper.pdf --si si.pdf --use-optimized-reaction --reaction-batch-size 5

3. Sequential Processing (Most Accurate)

42 sequential API calls (one at a time)
Highest accuracy but slowest
Best for: Critical analyses, small datasets

debase --manuscript paper.pdf --si si.pdf  # Default method

Performance Comparison

Method	Total Time	API Calls	Accuracy	Best For
Sequential	~45 min	44 calls	Highest	Small datasets
Parallel Individual	~12 min	44 calls	High	Recommended
Batch Processing	~8 min	~8 calls	Good	Speed-critical

Advanced Usage

Skip Steps with Existing Data

# Skip lineage extraction if you already have it
debase --manuscript paper.pdf --si si.pdf --output output.csv \
  --skip-lineage --existing-lineage existing_lineage.csv \
  --use-parallel-individual

Direct Module Usage

# Run only reaction extraction with parallel processing
python -m debase.reaction_info_extractor_parallel \
  --manuscript paper.pdf --si si.pdf --lineage-csv lineage.csv \
  --max-workers 5 --output reactions.csv

# Run only substrate scope extraction with parallel processing  
python -m debase.substrate_scope_extractor_parallel \
  --manuscript paper.pdf --si si.pdf --lineage-csv lineage.csv \
  --max-workers 5 --output substrate_scope.csv

Python API

from debase.wrapper import run_pipeline

# Run full pipeline with parallel processing
run_pipeline(
    manuscript_path="paper.pdf",
    si_path="si.pdf", 
    output="output.csv",
    use_parallel_individual=True,
    max_workers=5
)

# For individual steps
from debase.reaction_info_extractor_parallel import extract_reaction_info_parallel
from debase.enzyme_lineage_extractor import setup_gemini_api

model = setup_gemini_api()
reaction_data = extract_reaction_info_parallel(
    model, manuscript_path, si_path, enzyme_csv_path, max_workers=5
)

Pipeline Architecture

The DEBase pipeline consists of 5 main steps:

Lineage Extraction (Sequential) - Identifies all enzymes and their relationships
- Extracts mutation information and evolutionary paths
- Detects multiple directed evolution campaigns automatically
- Fetches sequences from external databases (PDB, UniProt)
- Filters out non-enzyme entries automatically
Sequence Cleanup (Local) - Generates protein sequences from mutations
- Applies mutations to parent sequences
- Handles complex mutations and domain modifications
- Validates sequence integrity
Reaction Extraction (Parallel/Batch/Sequential) - Extracts reaction conditions and performance data
- Campaign-aware extraction for multi-lineage papers
- Vision-based extraction from figures and tables
- Automatic IUPAC name resolution
Substrate Scope Extraction (Parallel/Sequential) - Finds additional substrates tested
Data Formatting (Local) - Combines all data into final output

Features

Multi-processing modes: Sequential, parallel individual, and batch processing
Campaign detection: Automatically identifies and separates multiple directed evolution campaigns
Intelligent error handling: Automatic retries with exponential backoff
External database integration: Automatic sequence fetching from PDB and UniProt
AI-powered matching: Uses Gemini to intelligently match database entries to enzyme variants
Smart filtering: Automatically excludes non-enzyme entries (buffers, controls, etc.)
Progress tracking: Real-time status updates
Flexible output: CSV format with comprehensive chemical and performance data
Caching: PDF encoding cache for improved performance
Vision capabilities: Extracts data from both text and images in PDFs

Complete Command Reference

Core Arguments

--manuscript PATH           # Required: Path to manuscript PDF
--si PATH                  # Optional: Path to supplementary information PDF
--output PATH              # Output file path (default: manuscript_name_debase.csv)
--queries N                # Number of consensus queries (default: 2)

Performance Options

--use-parallel-individual  # Use parallel processing (recommended)
--max-workers N            # Number of parallel workers (default: 5)
--use-optimized-reaction   # Use batch processing for speed
--reaction-batch-size N    # Enzymes per batch (default: 5)
--no-parallel-queries      # Disable parallel processing

Pipeline Control

--skip-lineage             # Skip lineage extraction step
--skip-sequence            # Skip sequence cleanup step  
--skip-reaction            # Skip reaction extraction step
--skip-substrate-scope     # Skip substrate scope extraction step
--skip-lineage-format      # Skip final formatting step
--skip-validation          # Skip data validation step

Data Management

--existing-lineage PATH    # Use existing lineage data
--existing-sequence PATH   # Use existing sequence data
--existing-reaction PATH   # Use existing reaction data
--keep-intermediates       # Preserve intermediate files

Advanced Options

--model-name NAME          # Gemini model to use
--max-retries N            # Maximum retry attempts (default: 2)
--max-chars N              # Max characters from PDFs (default: 75000)
--debug-dir PATH           # Directory for debug output (prompts, API responses)

Tips for Best Performance

Use parallel individual processing for the best balance of speed and accuracy
Set max-workers to 5 to avoid API rate limits while maximizing throughput
Use batch processing only when speed is critical and some accuracy loss is acceptable
Skip validation (--skip-validation) for faster processing in production
Keep intermediates (--keep-intermediates) for debugging and incremental runs
Check external databases - Many sequences can be automatically fetched from PDB/UniProt
Verify enzyme entries - The system automatically filters out buffers and controls

Troubleshooting

No sequences found

The extractor will automatically search PDB and UniProt databases
Check the logs for which database IDs were found and attempted
Sequences with PDB structures will be fetched with high confidence

Incorrect enzyme extraction

Non-enzyme entries (buffers, controls, media) are automatically filtered
Check the log for entries marked as "Filtering out non-enzyme entry"

PDB matching issues

The system uses AI to match PDB IDs to specific enzyme variants
Increased context extraction ensures better matching accuracy
Check logs for "Gemini PDB matching" entries to see the matching process

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

2.1.0

Jul 28, 2025

2.0.0

Jul 26, 2025

1.0.0

Jul 24, 2025

0.8

Jul 23, 2025

0.7.1

Jul 22, 2025

0.7.0

Jul 22, 2025

0.6.2

Jul 18, 2025

0.6.1

Jul 17, 2025

0.6.0

Jul 16, 2025

0.5.1

Jul 16, 2025

0.5.0

Jul 16, 2025

0.4.5

Jul 15, 2025

0.4.4

Jul 14, 2025

0.4.3

Jul 11, 2025

0.4.2

Jul 11, 2025

0.4.1

Jul 10, 2025

0.4.0

Jul 10, 2025

0.1.19

Jul 8, 2025

0.1.18

Jul 8, 2025

0.1.17

Jul 8, 2025

0.1.16

Jul 4, 2025

0.1.11

Jul 3, 2025

0.1.9

Jul 3, 2025

0.1.8

Jul 3, 2025

0.1.7

Jul 3, 2025

0.1.6

Jul 3, 2025

0.1.5

Jul 3, 2025

0.1.4

Jul 3, 2025

0.1.3

Jul 3, 2025

0.1.2

Jul 3, 2025

0.1.1

Jul 3, 2025

This version

0.1.0

Jul 3, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

debase-0.1.0.tar.gz (101.1 kB view details)

Uploaded Jul 3, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

debase-0.1.0-py3-none-any.whl (98.9 kB view details)

Uploaded Jul 3, 2025 Python 3

File details

Details for the file debase-0.1.0.tar.gz.

File metadata

Download URL: debase-0.1.0.tar.gz
Upload date: Jul 3, 2025
Size: 101.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for debase-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`3983d379b9302fd26bca16371f80dc799cb93ec059e4dd170e8901b43485a424`
MD5	`e0d02ea655b0cf6e8a777a81c38c726d`
BLAKE2b-256	`d9b10e4aff2cc5732975ebc69dee1e87c34c5b9ab9a4f44317c11774e9417637`

See more details on using hashes here.

File details

Details for the file debase-0.1.0-py3-none-any.whl.

File metadata

Download URL: debase-0.1.0-py3-none-any.whl
Upload date: Jul 3, 2025
Size: 98.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for debase-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`087b45c844ea11d4b009a1efd32f5a60d524625ea90ba16ff066821c09698855`
MD5	`1fa90488517d6dca567af82fe11f426a`
BLAKE2b-256	`2b1a5536016455aaa7d948f0e26ad91b77c986a962654f3cd2a8dd46e07c79d2`

See more details on using hashes here.

debase 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

DEBase

Installation

Requirements

Recent Updates

Quick Start

Basic Usage

High-Performance Parallel Processing

Processing Methods

1. Parallel Individual Processing (Recommended)

2. Batch Processing (Fastest)

3. Sequential Processing (Most Accurate)

Performance Comparison

Advanced Usage

Skip Steps with Existing Data

Direct Module Usage

Python API

Pipeline Architecture

Features

Complete Command Reference

Core Arguments

Performance Options

Pipeline Control

Data Management

Advanced Options

Tips for Best Performance

Troubleshooting

No sequences found

Incorrect enzyme extraction

PDB matching issues

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes