pipen: A pipeline framework for Python
Documentation | ChangeLog | Examples | API
Why pipen?
pipen is designed for data scientists, bioinformaticians, and researchers who need to create reproducible, scalable computational pipelines without the complexity of traditional workflow systems.
Target Audience
- Data Scientists: Process large datasets with automatic parallelization and caching
- Bioinformaticians: Build reproducible analysis pipelines for genomics data
- Researchers: Create transparent, reproducible workflows for computational research
- DevOps Engineers: Orchestrate batch jobs across different schedulers (SLURM, SGE, Google Cloud)
Key Benefits
1. Zero Configuration
- Get started immediately with sensible defaults
- Configure only what you need, when you need it
- Profile-based configuration for different environments
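Profile-based configuration can be sketched as a TOML file. The option names below (scheduler, forks, loglevel) are real pipen options (they appear in the run log in the Quickstart), but the file name `pipen.toml` and the exact layout should be treated as illustrative:

```toml
# Illustrative pipen.toml: [default] applies everywhere,
# a named profile overrides it for a specific environment.
[default]
scheduler = "local"
forks = 1
loglevel = "info"

[cluster]
scheduler = "slurm"
forks = 32
```

A named profile is then selected at run time (for example, a `profile="cluster"` argument to the pipeline's run call), so the same pipeline code runs unchanged in different environments.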
2. Reproducibility Built-In
- Automatic job caching based on input/output signatures
- Full audit trail of pipeline runs and parameters
- Dependency tracking ensures processes run in correct order
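The caching above hinges on input/output signatures. As a minimal sketch of the idea (not pipen's actual implementation), a job's signature can be derived from its parameters and the contents of its input files; an unchanged signature means the cached output is still valid:

```python
import hashlib
import json
from pathlib import Path


def job_signature(input_files, params):
    """Build a cache signature from parameters and input file contents.

    Simplified illustration of signature-based caching, not pipen's
    internals: if the signature matches the one recorded by a previous
    run, the job's cached output can be reused instead of rerunning.
    """
    h = hashlib.sha256()
    # Serialize parameters deterministically so the hash is stable
    h.update(json.dumps(params, sort_keys=True).encode())
    # Hash input contents in a fixed order
    for f in sorted(input_files):
        h.update(Path(f).read_bytes())
    return h.hexdigest()
```

Identical inputs and parameters produce the same signature (a cache hit); touching an input file or changing a parameter changes it, which forces a rerun.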
3. Flexible Scheduling
- Run locally for development
- Scale to HPC clusters (SLURM, SGE)
- Deploy to cloud (Google Cloud Batch, SSH)
- Run in containers for reproducibility
4. Developer-Friendly
- Define pipelines as Python classes
- Use familiar Python syntax and tools
- Extensible plugin system for custom functionality
- Rich, informative logging and progress tracking
5. Data Flow Management
- Automatic data passing between pipeline stages
- Support for files, directories, and in-memory data
- Built-in operations for transforming and aggregating data
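pipen ships channel operations such as expand_dir and collapse_files (named in the comparison below). The snippet here only sketches the underlying idea with the standard library; it is not pipen's API, and the row shape `(path, tag)` is a made-up example:

```python
from pathlib import Path


def expand_dir(row):
    """Expand one (directory, tag) row into one row per file inside it,
    mimicking the idea behind an expand_dir-style channel operation."""
    directory, tag = row
    return [(str(p), tag) for p in sorted(Path(directory).iterdir()) if p.is_file()]


def collapse_files(rows):
    """Collapse per-file rows back into a single per-directory row,
    mimicking a collapse_files-style operation."""
    dirs = {str(Path(f).parent) for f, _ in rows}
    assert len(dirs) == 1, "all files must share a parent directory"
    return [(dirs.pop(), rows[0][1])]
```

This pattern lets one stage emit a directory while the next stage fans out over its files (and back), without hand-writing the glue in every pipeline.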
Comparison with Alternatives
| Feature | pipen | Snakemake | Nextflow | Airflow |
|---|---|---|---|---|
| Target Audience | Data Scientists, Bioinformaticians, Researchers, DevOps | Bioinformaticians | Bioinformaticians | Data Engineers |
| Learning Curve | Low | Medium | High | High |
| Python Integration | Native | Limited | Limited | Native |
| Scheduler Support | 6+ (Local, SGE, SLURM, SSH, Container, Gbatch) | Limited | Limited | Plugin-based |
| Caching | Built-in, automatic | Manual | Manual | Plugin-based |
| Cloud Native Support | Yes (Google Cloud Batch) | Partial | Yes | Yes |
| Interactive Debugging | Yes | Limited | No | No |
| Easy to Use | Define pipelines as Python classes, familiar syntax | Workflow DSL, separate config files | Groovy-based DSL | DAG definition in Python, complex UI |
| Zero Configuration | Sensible defaults, configure only what needed | Many configuration options | Heavy configuration required | Complex setup |
| Nice Logging | Rich, informative, color-coded, progress bars | Text-based | Text-based | Basic logging |
| Highly Extensible | Simple plugin system, hook-based | Custom rules/scripts | Custom operators | Custom operators/providers |
| Data Flow Management | Built-in channel operations (expand_dir, collapse_files) | Manual handling | Channel system | XCom system |
| Reproducibility | Built-in caching, full audit trail | Manual | Versioned containers | DAG versioning |
| Flexible Scheduling | Switch schedulers without code changes | Config-based | Config-based | Config-based |
Installation
```shell
pip install -U pipen
```
Quickstart
example.py

```python
from pipen import Proc, Pipen, run


class P1(Proc):
    """Sort input file"""
    input = "infile"
    input_data = ["/tmp/data.txt"]
    output = "outfile:file:intermediate.txt"
    script = "cat {{in.infile}} | sort > {{out.outfile}}"


class P2(Proc):
    """Paste line number"""
    requires = P1
    input = "infile:file"
    output = "outfile:file:result.txt"
    script = "paste <(seq 1 3) {{in.infile}} > {{out.outfile}}"


# class MyPipeline(Pipen):
#     starts = P1

if __name__ == "__main__":
    # MyPipeline().run()
    run("MyPipeline", starts=P1)
```
```shell
> echo -e "3\n2\n1" > /tmp/data.txt
> python example.py
04-17 16:19:35 I core   _____________________________________   __
04-17 16:19:35 I core   ___  __ \___  _/__  __ \__  ____/__  | / /
04-17 16:19:35 I core   __  /_/ /__  / __  /_/ /_  __/  __   |/ /
04-17 16:19:35 I core   _  ____/__/ /  _  ____/_  /___  _  /|  /
04-17 16:19:35 I core   /_/     /___/  /_/     /_____/  /_/ |_/
04-17 16:19:35 I core
04-17 16:19:35 I core   version: 1.1.8
04-17 16:19:35 I core
04-17 16:19:35 I core   ╔═══════════════════════════ MYPIPELINE ════════════════════════════╗
04-17 16:19:35 I core   ║ My pipeline                                                       ║
04-17 16:19:35 I core   ╚═══════════════════════════════════════════════════════════════════╝
04-17 16:19:35 I core   plugins         : verbose v1.1.1
04-17 16:19:35 I core   # procs         : 2
04-17 16:19:35 I core   profile         : default
04-17 16:19:35 I core   outdir          : /path/to/cwd/MyPipeline-output
04-17 16:19:35 I core   cache           : True
04-17 16:19:35 I core   dirsig          : 1
04-17 16:19:35 I core   error_strategy  : ignore
04-17 16:19:35 I core   forks           : 1
04-17 16:19:35 I core   lang            : bash
04-17 16:19:35 I core   loglevel        : info
04-17 16:19:35 I core   num_retries     : 3
04-17 16:19:35 I core   scheduler       : local
04-17 16:19:35 I core   submission_batch: 8
04-17 16:19:35 I core   template        : liquid
04-17 16:19:35 I core   workdir         : /path/to/cwd/.pipen/MyPipeline
04-17 16:19:35 I core   plugin_opts     :
04-17 16:19:35 I core   template_opts   : filters={'realpath': <function realpath at 0x7fc3eba12...
04-17 16:19:35 I core                   : globals={'realpath': <function realpath at 0x7fc3eba12...
04-17 16:19:35 I core Initializing plugins ...
04-17 16:19:36 I core
04-17 16:19:36 I core ╭─────────────────────────────── P1 ────────────────────────────────╮
04-17 16:19:36 I core │ Sort input file                                                   │
04-17 16:19:36 I core ╰───────────────────────────────────────────────────────────────────╯
04-17 16:19:36 I core P1: Workdir: '/path/to/cwd/.pipen/MyPipeline/P1'
04-17 16:19:36 I core P1: <<< [START]
04-17 16:19:36 I core P1: >>> ['P2']
04-17 16:19:36 I verbose P1: in.infile: /tmp/data.txt
04-17 16:19:36 I verbose P1: out.outfile: /path/to/cwd/.pipen/MyPipeline/P1/0/output/intermediate.txt
04-17 16:19:38 I verbose P1: Time elapsed: 00:00:02.051s
04-17 16:19:38 I core
04-17 16:19:38 I core ╭═══════════════════════════════ P2 ════════════════════════════════╮
04-17 16:19:38 I core ║ Paste line number                                                 ║
04-17 16:19:38 I core ╰═══════════════════════════════════════════════════════════════════╯
04-17 16:19:38 I core P2: Workdir: '/path/to/cwd/.pipen/MyPipeline/P2'
04-17 16:19:38 I core P2: <<< ['P1']
04-17 16:19:38 I core P2: >>> [END]
04-17 16:19:38 I verbose P2: in.infile: /path/to/cwd/.pipen/MyPipeline/P1/0/output/intermediate.txt
04-17 16:19:38 I verbose P2: out.outfile: /path/to/cwd/MyPipeline-output/P2/result.txt
04-17 16:19:41 I verbose P2: Time elapsed: 00:00:02.051s
04-17 16:19:41 I core
MYPIPELINE: 100%|██████████████████████████████| 2/2 [00:06<00:00, 0.35 procs/s]
```
```shell
> cat ./MyPipeline-output/P2/result.txt
1 1
2 2
3 3
```
Examples
See more examples at examples/ and a more real-world example at:
https://github.com/pwwang/pipen-report/tree/master/example
Plugin gallery
Plugins make pipen even better.
- pipen-annotate: Use docstrings to annotate pipen processes
- pipen-args: Command-line argument parser for pipen
- pipen-board: Visualize configuration and running of pipen pipelines on the web
- pipen-diagram: Draw pipeline diagrams for pipen
- pipen-dry: Dry runner for pipen pipelines
- pipen-filters: Add a set of useful filters for pipen templates
- pipen-lock: Process lock for pipen to prevent multiple runs at the same time
- pipen-log2file: Save running logs to file for pipen
- pipen-poplog: Populate logs from jobs to the running log of the pipeline
- pipen-report: Generate reports for pipen
- pipen-runinfo: Save running information to file for pipen
- pipen-verbose: Add verbose information to logs for pipen
- pipen-gcs: A plugin for pipen to handle files in Google Cloud Storage
- pipen-deprecated: A pipen plugin to mark processes as deprecated
- pipen-cli-init: A pipen CLI plugin to create a pipen project (pipeline)
- pipen-cli-ref: Make reference documentation for processes
- pipen-cli-require: A pipen CLI plugin to check the requirements of a pipeline
- pipen-cli-run: A pipen CLI plugin to run a process or a pipeline
- pipen-cli-gbatch: A pipen CLI plugin to submit a pipeline to Google Batch Jobs
File details
Details for the file pipen-1.1.8.tar.gz.
File metadata
- Download URL: pipen-1.1.8.tar.gz
- Upload date:
- Size: 52.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.0 CPython/3.12.3 Linux/6.11.0-1018-azure
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ff9a6561acfb5b264a9fe046f160d3b38f1cb1d5584a7069c9fb92146054829 |
| MD5 | 5c4edb4d772044d803f3a62a3951101d |
| BLAKE2b-256 | b0f7777ec0d86d75d23e8c6b40fc8ecf7b23780622cacc54f2739bcde79eec0e |
File details
Details for the file pipen-1.1.8-py3-none-any.whl.
File metadata
- Download URL: pipen-1.1.8-py3-none-any.whl
- Upload date:
- Size: 57.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.0 CPython/3.12.3 Linux/6.11.0-1018-azure
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 8fd7d9ac09ae1e5ee49378c1b7cfb98d77067efe198af9911264025119c33953 |
| MD5 | f95b144486ec9992b9d7713800202387 |
| BLAKE2b-256 | c8cc57e9fe2a6c0e9dd45151cf215f26438d5b4a87d179b7138d8297c4530a05 |