DEBase
Enzyme lineage analysis and sequence extraction package with advanced parallel processing capabilities.
Installation
pip install debase
Requirements
- Python 3.8 or higher
- A Gemini API key (set as the environment variable GEMINI_API_KEY)
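A missing key is easier to diagnose before a long run starts than in the middle of one. The following is a minimal sketch of a pre-flight check. The helper name `get_gemini_api_key` is hypothetical and is not part of DEBase; only the variable name GEMINI_API_KEY comes from the requirements above.

```python
import os


def get_gemini_api_key(environ=None):
    """Return the Gemini API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; DEBase may read the key differently.
    """
    env = os.environ if environ is None else environ
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running debase"
        )
    return key
```

Calling `get_gemini_api_key()` at the top of a wrapper script fails fast with a readable message instead of an API error later on.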
Recent Updates
- Campaign-Aware Extraction: Automatically detects and processes multiple directed evolution campaigns in a single paper
- Improved Model Support: Updated to use stable Gemini models for better reliability
- Enhanced PDB Integration: Intelligent AI-based matching of PDB structures to enzyme variants
- Better Filtering: Automatic removal of non-enzyme entries (buffers, controls, media)
- Optimized Performance: Removed unnecessary rate limiting for faster processing
- External Sequence Fetching: Automatic retrieval from PDB and UniProt databases when sequences aren't in papers
- Improved SI Processing: Structure-aware extraction of supplementary information
- Vision Support: Extracts data from figures and tables using multimodal AI capabilities
Quick Start
Basic Usage
# Run the full pipeline (sequential processing)
debase --manuscript manuscript.pdf --si supplementary.pdf --output output.csv
High-Performance Parallel Processing
# Use parallel individual processing for maximum speed + accuracy
debase --manuscript manuscript.pdf --si supplementary.pdf --output output.csv \
--use-parallel-individual --max-workers 5
# Use batch processing for maximum speed (slight accuracy trade-off)
debase --manuscript manuscript.pdf --si supplementary.pdf --output output.csv \
--use-optimized-reaction --reaction-batch-size 5
Processing Methods
DEBase offers three processing approaches optimized for different use cases:
1. Parallel Individual Processing (Recommended)
- 42 individual API calls (21 for reactions + 21 for substrate scope)
- 5 calls running simultaneously for 4-5x speedup
- Maximum accuracy - each enzyme gets dedicated attention
- Best for: Production use, important analyses
debase --manuscript paper.pdf --si si.pdf --use-parallel-individual --max-workers 5
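To illustrate the idea of one call per enzyme with a bounded number of calls in flight, here is a rough sketch using a thread pool. It is not DEBase's implementation: `extract_one` is a hypothetical stand-in for a single model call, and the 21 variant names are made up to match the call counts quoted above.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_one(enzyme_id, kind):
    # Hypothetical placeholder for one API call; returns a dummy record.
    return {"enzyme": enzyme_id, "kind": kind}


def run_parallel_individual(enzyme_ids, max_workers=5):
    # One task per enzyme per extraction type (reaction, substrate scope).
    tasks = [
        (enzyme_id, kind)
        for kind in ("reaction", "substrate_scope")
        for enzyme_id in enzyme_ids
    ]
    # At most max_workers tasks run at once; map() keeps the task order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda task: extract_one(*task), tasks))


results = run_parallel_individual([f"variant_{i}" for i in range(1, 22)])
print(len(results))  # 42
```

Threads suit this pattern because each task spends most of its time waiting on the network rather than computing.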
2. Batch Processing (Fastest)
- ~8 total API calls (multiple enzymes per call)
- Fastest processing - up to 8x speedup
- Good accuracy - slight trade-off for complex chemical names
- Best for: Quick analyses, large-scale processing
debase --manuscript paper.pdf --si si.pdf --use-optimized-reaction --reaction-batch-size 5
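The effect of a batch size on the number of calls can be sketched as simple chunking. This is an illustration only: `make_batches` is a hypothetical helper, and how DEBase actually groups enzymes or arrives at its total of about 8 calls is not specified here.

```python
def make_batches(items, batch_size=5):
    """Split items into consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


enzymes = [f"variant_{i}" for i in range(1, 22)]  # 21 made-up variant names
batches = make_batches(enzymes, batch_size=5)
print(len(batches))      # 5
print(len(batches[-1]))  # 1
```

Each batch would go into a single prompt, which is where the accuracy trade-off comes from: the model has to keep several enzymes and their chemical names apart within one response.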
3. Sequential Processing (Most Accurate)
- 42 sequential API calls (one at a time)
- Highest accuracy but slowest
- Best for: Critical analyses, small datasets
debase --manuscript paper.pdf --si si.pdf # Default method
Advanced Usage
Skip Steps with Existing Data
# Skip lineage extraction if you already have it
debase --manuscript paper.pdf --si si.pdf --output output.csv \
--skip-lineage --existing-lineage existing_lineage.csv \
--use-parallel-individual
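For scripted or incremental runs, the same command can be assembled in Python so that lineage extraction is skipped only when a lineage file is actually present. The sketch below uses only flags documented in this README; the function name `build_debase_command` is hypothetical.

```python
from pathlib import Path


def build_debase_command(manuscript, si, output,
                         existing_lineage=None, max_workers=None):
    """Build an argument list for the debase CLI (illustrative helper)."""
    cmd = [
        "debase",
        "--manuscript", str(manuscript),
        "--si", str(si),
        "--output", str(output),
    ]
    # Reuse lineage data only if the file really exists on disk.
    if existing_lineage is not None and Path(existing_lineage).is_file():
        cmd += ["--skip-lineage", "--existing-lineage", str(existing_lineage)]
    cmd.append("--use-parallel-individual")
    if max_workers is not None:
        cmd += ["--max-workers", str(max_workers)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.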
Direct Module Usage
# Run only reaction extraction with parallel processing
python -m debase.reaction_info_extractor_parallel \
--manuscript paper.pdf --si si.pdf --lineage-csv lineage.csv \
--max-workers 5 --output reactions.csv
# Run only substrate scope extraction with parallel processing
python -m debase.substrate_scope_extractor_parallel \
--manuscript paper.pdf --si si.pdf --lineage-csv lineage.csv \
--max-workers 5 --output substrate_scope.csv
Pipeline Architecture
The DEBase pipeline consists of 5 main steps:
1. Lineage Extraction (Sequential) - Identifies all enzymes and their relationships
   - Extracts mutation information and evolutionary paths
   - Detects multiple directed evolution campaigns automatically
   - Fetches sequences from external databases (PDB, UniProt)
   - Filters out non-enzyme entries automatically
2. Sequence Cleanup (Local) - Generates protein sequences from mutations
   - Applies mutations to parent sequences
   - Handles complex mutations and domain modifications
   - Validates sequence integrity
3. Reaction Extraction (Parallel/Batch/Sequential) - Extracts reaction conditions and performance data
   - Campaign-aware extraction for multi-lineage papers
   - Vision-based extraction from figures and tables
   - Automatic IUPAC name resolution
4. Substrate Scope Extraction (Parallel/Sequential) - Finds additional substrates tested
5. Data Formatting (Local) - Combines all data into final output
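The sequence cleanup step is described as applying mutations to a parent sequence. A minimal sketch of that idea for simple point substitutions follows. It assumes the common notation of wild-type residue, 1-based position, new residue (for example K2R). The function name and the toy sequence are illustrative, and DEBase's own handling of complex mutations and domain modifications is not shown.

```python
import re

# One-letter wild-type residue, 1-based position, one-letter new residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(parent, mutations):
    """Apply point substitutions such as 'K2R' to a parent protein sequence."""
    seq = list(parent)
    for mutation in mutations:
        match = MUTATION_RE.match(mutation)
        if match is None:
            raise ValueError(f"unsupported mutation format: {mutation!r}")
        wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(seq):
            raise ValueError(f"position out of range in {mutation!r}")
        # Integrity check: the stated wild-type residue must match the parent.
        if parent[position - 1] != wild_type:
            raise ValueError(
                f"{mutation!r} expects {wild_type} at position {position}, "
                f"found {parent[position - 1]}"
            )
        seq[position - 1] = new
    return "".join(seq)


print(apply_mutations("MKTAYIAK", ["K2R", "A4G"]))  # MRTGYIAK
```

Checking the wild-type residue against the parent catches numbering offsets, a frequent source of error when positions in a paper and in a database entry do not line up.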
Features
- Multi-processing modes: Sequential, parallel individual, and batch processing
- Campaign detection: Automatically identifies and separates multiple directed evolution campaigns
- Intelligent error handling: Automatic retries with exponential backoff
- External database integration: Automatic sequence fetching from PDB and UniProt
- AI-powered matching: Uses Gemini to intelligently match database entries to enzyme variants
- Smart filtering: Automatically excludes non-enzyme entries (buffers, controls, etc.)
- Vision capabilities: Extracts data from both text and images in PDFs
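Retrying with exponential backoff, as mentioned in the feature list, can be sketched as follows. This is a generic illustration rather than DEBase's code: `call_with_retries` and `base_delay` are hypothetical names, and the default of 2 retries mirrors the documented default of --max-retries.

```python
import time


def call_with_retries(func, max_retries=2, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt and try again.

    Re-raises the last exception once max_retries retries have been used.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
            attempt += 1
```

The `sleep` parameter exists so the wait can be replaced in tests. In real use, catching a narrower exception type than `Exception` would prevent retries on errors that cannot succeed, such as an invalid API key.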
Complete Command Reference
Core Arguments
--manuscript PATH # Required: Path to manuscript PDF
--si PATH # Optional: Path to supplementary information PDF
--output PATH # Output file path (default: manuscript_name_debase.csv)
Performance Options
--use-parallel-individual # Use parallel processing (recommended)
--max-workers N # Number of parallel workers (default: 5)
--use-optimized-reaction # Use batch processing for speed
--reaction-batch-size N # Enzymes per batch (default: 5)
--no-parallel-queries # Disable parallel processing
Pipeline Control
--skip-lineage # Skip lineage extraction step
--skip-sequence # Skip sequence cleanup step
--skip-reaction # Skip reaction extraction step
--skip-substrate-scope # Skip substrate scope extraction step
--skip-lineage-format # Skip final formatting step
--skip-validation # Skip data validation step
Data Management
--existing-lineage PATH # Use existing lineage data
--existing-sequence PATH # Use existing sequence data
--existing-reaction PATH # Use existing reaction data
--keep-intermediates # Preserve intermediate files
Advanced Options
--model-name NAME # Gemini model to use
--max-retries N # Maximum retry attempts (default: 2)
--max-chars N # Max characters from PDFs (default: 75000)
--debug-dir PATH # Directory for debug output (prompts, API responses)
Tips for Best Performance
- Use parallel individual processing for the best balance of speed and accuracy
- Set --max-workers to 5 to avoid API rate limits while maximizing throughput
- Use batch processing only when speed is critical and some accuracy loss is acceptable
- Skip validation (--skip-validation) for faster processing in production
- Keep intermediates (--keep-intermediates) for debugging and incremental runs