Biofilter: cloud-ready biological knowledge system

Biofilter 4

Biofilter 4 is a persistent, entity-centric biological knowledge platform designed to support gene-centric annotation, filtering, and modeling workflows through a unified and extensible data architecture.

The biofilter3r branch hosts active development of Biofilter 4, a major evolution of the Biofilter framework with a redesigned schema, a modern ETL architecture, and multiple interaction layers.

📚 Documentation:
👉 https://biofilter.readthedocs.io/en/latest/


Quick Start

Install via pip:

pip install biofilter
biofilter --help

Connect to a database (existing instance or local) and run your first report:

export DATABASE_URL="postgresql+psycopg2://user:password@host:5432/biofilter_prod"
biofilter report list
biofilter report run --report-name etl_status --output etl_status.csv
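
Before running reports, it can help to check that DATABASE_URL is well formed. The following is a minimal sketch using only the Python standard library; the helper name describe_database_url is hypothetical and not part of Biofilter. It splits a SQLAlchemy-style URL into its parts and masks the password so the result is safe to log.

```python
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Split a SQLAlchemy-style URL into its parts, masking the password.

    Hypothetical helper for illustration; not part of the Biofilter API.
    """
    parts = urlsplit(url)
    # The scheme has the form "dialect+driver"; the driver is optional.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "user": parts.username,
        "password": "***" if parts.password else None,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
    }


info = describe_database_url(
    "postgresql+psycopg2://user:password@host:5432/biofilter_prod"
)
print(info)
```

In practice the URL would come from os.environ["DATABASE_URL"] rather than a literal string.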

From Python:

from biofilter import Biofilter

bf = Biofilter()
df = bf.report.run("entity_filter", input_data=["BRCA1", "TP53", "APOE"])
df.head()

For Docker, source install, or bootstrapping a local database, see the Getting Started guide.


What is Biofilter 4?

Biofilter 4 provides a persistent, versioned biological knowledge base that replaces traditional file-based annotation workflows with a reusable, query-driven platform.

Instead of repeatedly generating transient annotation files, Biofilter 4 enables users to:

  • ingest curated biological knowledge once,
  • store it in a normalized, entity-based schema,
  • reuse and query that knowledge across analyses, projects, and environments.

Biofilter 4 is designed to support both exploratory research and production-scale workflows.


Core Concepts: Entities, Domains, and Relationships

Biofilter organizes biological knowledge around three core concepts:

  • Entities

    • Canonical biological objects (for example Gene, Variant, Disease, Protein, Pathway).
  • Domains

    • Functional/omics contexts used to structure and interpret entities and their links.
  • Entity Relationships

    • A relational layer that connects entities across domains and supports graph-style traversal while remaining SQL-native.

This design lets users recover cross-omics relationships and reuse them directly in reports for:

  • annotation workflows,
  • filtering and prioritization workflows,
  • relationship-driven analyses that support downstream statistical modeling.
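
To make the idea of graph-style traversal in SQL concrete, here is a minimal sketch using SQLite from the Python standard library. The table names, column names, and placeholder entities (GENE_A, PROTEIN_A, PATHWAY_X) are assumptions chosen for illustration and do not reflect the actual Biofilter schema. A recursive common table expression walks the relationship table outward from one entity.

```python
import sqlite3

# Illustrative schema only; the real Biofilter tables may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,
    name TEXT NOT NULL
);
CREATE TABLE entity_relationship (
    source_id INTEGER NOT NULL REFERENCES entity(id),
    target_id INTEGER NOT NULL REFERENCES entity(id),
    relation  TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", [
    (1, "Gene", "GENE_A"),
    (2, "Protein", "PROTEIN_A"),
    (3, "Pathway", "PATHWAY_X"),
    (4, "Gene", "GENE_B"),
])
conn.executemany("INSERT INTO entity_relationship VALUES (?, ?, ?)", [
    (1, 2, "encodes"),
    (2, 3, "participates_in"),
])


def reachable(conn, start_name, max_depth=3):
    """Return (kind, name, hops) for entities reachable from start_name."""
    return conn.execute("""
        WITH RECURSIVE walk(id, depth) AS (
            SELECT id, 0 FROM entity WHERE name = ?
            UNION
            SELECT r.target_id, w.depth + 1
            FROM walk AS w
            JOIN entity_relationship AS r ON r.source_id = w.id
            WHERE w.depth < ?
        )
        SELECT e.kind, e.name, MIN(w.depth) AS hops
        FROM walk AS w
        JOIN entity AS e ON e.id = w.id
        WHERE w.depth > 0
        GROUP BY e.id
        ORDER BY hops, e.name
    """, (start_name, max_depth)).fetchall()


print(reachable(conn, "GENE_A"))
```

The max_depth bound keeps the traversal finite even if the relationship table contains cycles.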

Key Features

  • Entity-centric data model

    • Canonical entities (Gene, Variant, Disease, Protein, Pathway, etc.)
    • Rich alias and cross-reference support
  • Persistent knowledge layer

    • Versioned ETL packages
    • Full provenance tracking by data source and load
  • Modular ETL architecture

    • Data Transformation Packages (DTPs)
    • Explicit separation of master data and relationships
  • High-performance ingestion

    • Managed indexing strategy
    • Optimized for large-scale sources (e.g. dbSNP, UniProt)
  • Multiple interaction layers

    • Python API
    • ORM-based data access
    • Reusable Reports
    • Command-line interface (CLI)
  • Multi-database support

    • SQLite (local development)
    • PostgreSQL (production and large-scale deployments)
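
One common technique for fast bulk ingestion is to drop secondary indexes before a large load and rebuild them afterwards. Whether Biofilter's managed indexing strategy works this way is an assumption; the sketch below only illustrates the general pattern with SQLite, and the context manager indexes_suspended and the variant table are hypothetical names.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def indexes_suspended(conn, index_ddl):
    """Drop the named indexes, yield for bulk loading, then recreate them.

    index_ddl maps each index name to the CREATE INDEX statement that
    rebuilds it. Hypothetical helper, not part of the Biofilter API.
    """
    for name in index_ddl:
        conn.execute(f'DROP INDEX IF EXISTS "{name}"')
    try:
        yield
    finally:
        # Rebuild every index once, after all rows are in place.
        for ddl in index_ddl.values():
            conn.execute(ddl)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variant (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
index_ddl = {"ix_variant_name": "CREATE INDEX ix_variant_name ON variant (name)"}
conn.execute(index_ddl["ix_variant_name"])

with indexes_suspended(conn, index_ddl):
    conn.executemany(
        "INSERT INTO variant (name) VALUES (?)",
        ((f"variant_{i}",) for i in range(10_000)),
    )
conn.commit()

# Column 1 of PRAGMA index_list is the index name.
indexes = [row[1] for row in conn.execute("PRAGMA index_list('variant')")]
row_count = conn.execute("SELECT COUNT(*) FROM variant").fetchone()[0]
print(indexes, row_count)
```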

Architecture Overview

At a high level, Biofilter 4 consists of:

  • ETL Layer

    • Ingests external biological sources into a normalized schema
    • Tracks execution via ETL Packages
  • Core Schema

    • Entity, Alias, Relationship, and Domain Master tables
    • Designed for extensibility and long-term evolution
  • Data Access Layer

    • ORM-backed, Python-first access to the knowledge base
    • Foundation for reports and advanced analysis
  • Report Layer

    • Curated, reusable biological queries
    • Standardized outputs as pandas DataFrames
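
The way an ETL package can carry provenance through to stored rows can be sketched as follows. This is a simplified illustration with SQLite; the etl_package and entity_alias tables, their columns, and the placeholder source names are assumptions rather than the actual Biofilter schema. Each load is registered once, and every row it writes keeps a reference to that load.

```python
import sqlite3

# Illustrative schema only; the real Biofilter tables may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE etl_package (
    id             INTEGER PRIMARY KEY,
    data_source    TEXT NOT NULL,
    source_version TEXT NOT NULL,
    status         TEXT NOT NULL DEFAULT 'running'
);
CREATE TABLE entity_alias (
    alias          TEXT NOT NULL,
    entity_name    TEXT NOT NULL,
    etl_package_id INTEGER NOT NULL REFERENCES etl_package(id)
);
""")


def run_load(conn, data_source, source_version, rows):
    """Register one load, tag every row with its id, then mark it completed."""
    cur = conn.execute(
        "INSERT INTO etl_package (data_source, source_version) VALUES (?, ?)",
        (data_source, source_version),
    )
    package_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO entity_alias VALUES (?, ?, ?)",
        [(alias, name, package_id) for alias, name in rows],
    )
    conn.execute(
        "UPDATE etl_package SET status = 'completed' WHERE id = ?",
        (package_id,),
    )
    conn.commit()
    return package_id


def provenance(conn, alias):
    """List which sources and versions contributed a given alias."""
    return conn.execute("""
        SELECT a.entity_name, p.data_source, p.source_version
        FROM entity_alias AS a
        JOIN etl_package AS p ON p.id = a.etl_package_id
        WHERE a.alias = ?
        ORDER BY p.id
    """, (alias,)).fetchall()


run_load(conn, "source_one", "2024-01", [("ALIAS_1", "GENE_A")])
run_load(conn, "source_two", "v7", [("ALIAS_1", "GENE_A"), ("ALIAS_2", "GENE_B")])
print(provenance(conn, "ALIAS_1"))
```

Keeping the package id on each row means a single join answers the question of which source and load produced a given record.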

Repository Structure (simplified)

biofilter/
├── alembic/                   # Database migrations
├── api/
│   └── cli/                   # CLI commands and entrypoints
├── core/
│   ├── components/            # db, etl, report, settings components
│   └── settings_manager.py
├── modules/
│   ├── db/                    # ORM models, seeds, schema
│   ├── etl/                   # ETL framework and DTPs
│   ├── io/                    # Input/output utilities
│   └── report/                # Report framework and reports
├── utils/                     # Shared helpers
└── biofilter.py               # Python API facade

docs/
└── source/                    # Sphinx documentation source

notebooks/
└── Templates/                 # Ready-to-use report tutorials

tests/
├── unit/
└── integration/

Documentation

The full User Guide and Developer Guide are hosted on Read the Docs:

📖 https://biofilter.readthedocs.io/en/latest/

The documentation covers:

  • Installation and setup
  • Data sources and ETL design
  • Writing DTPs
  • Managed indexes
  • Entity and alias registration
  • Data access and report internals
  • Writing and extending reports
  • Developer tooling and project structure

Resources

  • 🤖 GPT Assistant - conversational guidance for picking and using reports: Biofilter 4 Assistant
  • 📓 Notebook tutorials - ready-to-run examples for every report: notebooks/Templates/
  • 📋 Report Catalog - full index of available reports with descriptions: Read the Docs

Run with Docker (Container)

Biofilter 4 can run as an application-only container that connects to an external database through DATABASE_URL.

Build from this repository:

docker build -t biofilter:bf4 -f docker/Dockerfile .

Run CLI with external DB:

docker run --rm \
  -e DATABASE_URL="postgresql+psycopg2://user:password@host:5432/biofilter_prod" \
  biofilter:bf4

Run a report and save output to your local machine:

docker run --rm \
  -e DATABASE_URL="postgresql+psycopg2://user:password@host:5432/biofilter_prod" \
  -v "$(pwd)/outputs:/workspace/outputs" \
  biofilter:bf4 \
  biofilter report run \
    --report-name etl_status \
    --output /workspace/outputs/etl_status.csv
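
When the container is driven from a Python script, the same command can be assembled as an argument list. This is a minimal sketch that reuses only the flags and paths shown above; the function name docker_report_command is hypothetical, and the image tag assumes the build step from this section.

```python
import shlex
from pathlib import Path


def docker_report_command(report_name, output_dir, database_url,
                          image="biofilter:bf4"):
    """Build the docker argv for running one report (hypothetical helper)."""
    # Docker bind mounts need an absolute host path.
    host_dir = Path(output_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-e", f"DATABASE_URL={database_url}",
        "-v", f"{host_dir}:/workspace/outputs",
        image,
        "biofilter", "report", "run",
        "--report-name", report_name,
        "--output", f"/workspace/outputs/{report_name}.csv",
    ]


cmd = docker_report_command(
    "etl_status",
    "outputs",
    "postgresql+psycopg2://user:password@host:5432/biofilter_prod",
)
# Print a shell-quoted form for inspection; nothing is executed here.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it; building a list instead of a shell string avoids quoting problems with the database URL.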

Open an interactive shell in the container:

docker run --rm -it \
  -e DATABASE_URL="postgresql+psycopg2://user:password@host:5432/biofilter_prod" \
  -v "$(pwd):/workspace" \
  --entrypoint /bin/bash \
  biofilter:bf4

For full container documentation (publishing, multi-arch, GitHub Actions), see:


Status

  • Current version: 4.1.2
  • Schema: Entity-centric, versioned (4.1.x)
  • ETL: Modular DTP-based ingestion
  • Stability: Actively evolving; APIs and schema may continue to change between minor releases

Contributing

Contributions, feedback, and design discussions are welcome.

When contributing:

  • Follow existing architectural patterns (Entities, DTPs, Reports).
  • Keep provenance and reproducibility as first-class concerns.
  • Prefer ORM-based logic over raw SQL when possible.
  • Document new features in the appropriate section of the docs.

License

MIT License. See LICENSE.


Acknowledgements

Biofilter builds on years of development and scientific usage across multiple generations of the framework. Biofilter 4 represents a continuation of this work, redesigned to support modern data volumes, richer biological relationships, and long-term sustainability.
