Variant literature assessment pipeline with AI extraction

These details have been verified by PyPI

Maintainers

lgruen-vcgs

These details have not been verified by PyPI

Development Status
- 4 - Beta
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
- Python :: 3.13
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

Flowa

Variant literature assessment pipeline with AI extraction.

Flowa's interactive evidence viewer: paper list on the left, aggregated assessment with inline citations in the centre, and the source PDF with bounding-box highlights on the right.

Each citation in the aggregated assessment links back to the exact highlighted quote in the source paper's PDF.

Architecture

Flowa is a single async pipeline that processes genetic variant literature:

query → download → convert → extract → aggregate

Query: Search Mastermind/LitVar for papers, resolve PMIDs to DOIs via PubMed
Download: Fetch PDFs from PMC (main article + supplements)
Convert: PDF → Markdown via anchorite (LLM-based conversion)
Extract: Per-paper evidence extraction via LLM
Aggregate: Cross-paper synthesis via LLM, resolving citation quotes to PDF bounding boxes via anchorite

Papers are processed in parallel. LLM concurrency is controlled via --llm-concurrency.
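The parallelism described above can be pictured as a semaphore-bounded fan-out. The following is a minimal sketch of that pattern, not Flowa's actual implementation: `process_paper` and `run_pipeline` are hypothetical names, and the `asyncio.sleep` call stands in for a real LLM request.

```python
import asyncio


async def process_paper(doi: str, llm_slots: asyncio.Semaphore) -> str:
    """Hypothetical stand-in for the per-paper convert and extract steps."""
    async with llm_slots:  # only `llm_concurrency` papers hold a slot at once
        await asyncio.sleep(0.01)  # placeholder for an LLM call
        return f"extracted:{doi}"


async def run_pipeline(dois: list[str], llm_concurrency: int) -> list[str]:
    """Start every paper at once, but bound the number of concurrent LLM calls."""
    llm_slots = asyncio.Semaphore(llm_concurrency)
    # gather() preserves input order, so results line up with `dois`.
    return await asyncio.gather(*(process_paper(d, llm_slots) for d in dois))
```

Calling `asyncio.run(run_pipeline(dois, 2))` would schedule all papers immediately while letting only two of them sit inside the simulated LLM call at any moment.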

Installation

Install from PyPI, opting into the provider extras you need (one or more of anthropic, bedrock, google, openai):

pip install 'flowapy[bedrock]==0.1.0'
# or
uv pip install 'flowapy[bedrock,anthropic]==0.1.0'

The flowa CLI is exposed as a console script. See Configuration for credentials and storage setup.

Usage

# Full pipeline
flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

# Individual steps (for debugging)
flowa query --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
flowa download --doi '10.1038/s41586-020-2308-7'
flowa convert --doi '10.1038/s41586-020-2308-7'
flowa extract --variant-id VAR123 --doi '10.1038/s41586-020-2308-7'
flowa aggregate --variant-id VAR123

Configuration

Environment Variables

Variable	Description	Example
`FLOWA_STORAGE_BASE`	Storage path for PDFs, extractions, results	`s3://bucket`, `gs://bucket`, `file:///path`
`FLOWA_CONVERT_MODEL`	LLM for PDF→Markdown conversion (anchorite)	`bedrock:au.anthropic.claude-sonnet-4-6`
`FLOWA_EXTRACTION_MODEL`	LLM for extraction and aggregation	`bedrock:au.anthropic.claude-opus-4-6`

LLM Providers

Models use pydantic-ai format. Examples:

AWS Bedrock: bedrock:au.anthropic.claude-sonnet-4-6 (convert), bedrock:au.anthropic.claude-opus-4-6 (extraction)
Google Gemini: google-gla:gemini-3-pro
OpenAI: openai:gpt-5.2
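Each model string above has the shape `provider:model-name`. As a rough illustration of that shape only (pydantic-ai does its own parsing, and `split_model_string` is a hypothetical helper), a model string can be split on the first colon:

```python
from typing import NamedTuple


class ModelRef(NamedTuple):
    provider: str
    model: str


def split_model_string(value: str) -> ModelRef:
    """Split 'provider:model' on the first colon; illustrative helper only."""
    provider, sep, model = value.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {value!r}")
    return ModelRef(provider, model)
```

For example, `split_model_string("google-gla:gemini-3-pro")` yields the provider `google-gla` and the model `gemini-3-pro`, which is a quick way to sanity-check `FLOWA_CONVERT_MODEL` and `FLOWA_EXTRACTION_MODEL` before a long run.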

Provider credentials:

Provider	Required Variables
AWS Bedrock	`AWS_PROFILE` + `AWS_REGION`, or `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` + `AWS_REGION`
Google Gemini	`GOOGLE_API_KEY`
OpenAI	`OPENAI_API_KEY`

Storage Backends

Backend	`FLOWA_STORAGE_BASE`	Additional Variables
AWS S3	`s3://bucket-name`	AWS credentials (see above)
Google Cloud Storage	`gs://bucket-name`	`GOOGLE_APPLICATION_CREDENTIALS` or workload identity
S3-compatible (MinIO)	`s3://bucket-name`	`FSSPEC_S3_ENDPOINT_URL`, `FSSPEC_S3_KEY`, `FSSPEC_S3_SECRET`
Local filesystem	`file:///path`	—
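The backend is selected by the URL scheme of `FLOWA_STORAGE_BASE`. The sketch below maps each scheme in the table to its backend using only the standard library. It is an illustration of the table, not Flowa's own storage code, and `describe_storage_base` is a hypothetical name.

```python
from urllib.parse import urlsplit

# Scheme-to-backend mapping taken from the table above.
BACKENDS = {
    "s3": "AWS S3 or S3-compatible",
    "gs": "Google Cloud Storage",
    "file": "Local filesystem",
}


def describe_storage_base(base: str) -> tuple[str, str]:
    """Return (backend description, bucket or path) for a storage base URL."""
    parts = urlsplit(base)
    if parts.scheme not in BACKENDS:
        raise ValueError(f"unsupported storage scheme in {base!r}")
    # file:// URLs carry the location in the path; bucket URLs carry it in the netloc.
    location = parts.path if parts.scheme == "file" else parts.netloc + parts.path
    return BACKENDS[parts.scheme], location
```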

Prompt Customization

Flowa supports site-specific prompt sets. Each prompt set is a directory under prompts/ containing prompt templates and Pydantic schema modules.

Variable	Description	Default
`FLOWA_PROMPT_SET`	Name of the prompt set directory to use	`generic`

Prompt Set Structure

prompts/{prompt_set}/
├── extraction_prompt.txt      # Prompt template for individual paper extraction
├── extraction_schema.py       # Pydantic model defining ExtractionResult
├── aggregation_prompt.txt     # Prompt template for cross-paper aggregation
└── aggregation_schema.py      # Pydantic model defining AggregationResult

Interface Requirements

Schema modules must define Pydantic models with specific fields that Flowa's validation logic depends on:

extraction_schema.py must define ExtractionResult with:

evidence[].citations[].quote (str) — verbatim quote from the paper

aggregation_schema.py must define AggregationResult with:

results[].citations[].paper_id (str) — paper identifier
results[].citations[].quote (str) — verbatim quote resolved to PDF bounding boxes

All other fields can be customized freely. See prompts/generic/ for the default implementation.
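To make the required field paths concrete, here is a dependency-free sketch that walks plain dictionaries shaped like the two results and reports any required field that is missing or not a string. Real schema modules are Pydantic models, and this is not Flowa's validation logic; both function names are hypothetical.

```python
def missing_extraction_fields(payload: dict) -> list[str]:
    """List paths where evidence[].citations[].quote is absent or not a str."""
    problems = []
    for i, item in enumerate(payload.get("evidence", [])):
        for j, citation in enumerate(item.get("citations", [])):
            if not isinstance(citation.get("quote"), str):
                problems.append(f"evidence[{i}].citations[{j}].quote")
    return problems


def missing_aggregation_fields(payload: dict) -> list[str]:
    """List paths where results[].citations[] lacks a str paper_id or quote."""
    problems = []
    for i, item in enumerate(payload.get("results", [])):
        for j, citation in enumerate(item.get("citations", [])):
            for field in ("paper_id", "quote"):
                if not isinstance(citation.get(field), str):
                    problems.append(f"results[{i}].citations[{j}].{field}")
    return problems
```

An empty list from either function means the payload carries every field named above; extra fields are ignored, matching the rule that everything else is customizable.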

Citation Format

The pipeline uses a unified citation format:

[display text](#cite:paperId "verbatim quote to highlight")

paperId = AuthorYear label (e.g., Smith2024) from paper_id_mapping
The title attribute carries a verbatim quote that scopes the PDF highlight
Display text is free-form

During aggregation, quotes are resolved against each paper's source PDF (via anchorite.PdfIndex) to produce bounding box coordinates. The aggregate output contains pre-resolved bboxes arrays for each citation. Quotes that cannot be resolved get empty bboxes.
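For readers who want to consume this citation format in their own tooling, the following sketch extracts the three parts of each citation link with a regular expression. The pattern and the `find_citations` name are assumptions for illustration; Flowa's own parser may differ, and quotes containing an escaped double quote are not handled here.

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    display: str
    paper_id: str
    quote: str


# [display text](#cite:paperId "verbatim quote to highlight")
CITATION_RE = re.compile(r'\[([^\]]+)\]\(#cite:([^\s")]+)\s+"([^"]*)"\)')


def find_citations(text: str) -> list[Citation]:
    """Return every citation link found in `text`, in order of appearance."""
    return [Citation(*match.groups()) for match in CITATION_RE.finditer(text)]
```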

Storage Layout

papers/{encoded_doi}/
  source.pdf              # Downloaded PDF
  markdown.md             # LLM-generated Markdown
  metadata.json           # PubMed metadata (title, authors, date, etc.)

assessments/{variant_id}/
  workflow.json            # Pipeline run metadata
  variant_details.json     # VariantValidator output
  query.json               # Query results (DOI list)
  aggregation.json         # Aggregated assessment with pre-resolved bboxes
  aggregation_raw.json     # Raw LLM conversation
  extractions/
    {encoded_doi}.json     # Per-paper extraction (quotes + commentary)
    {encoded_doi}_raw.json # Raw LLM conversation
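The layout above can be turned into concrete object paths as sketched below. Note the assumption: this README does not say how `{encoded_doi}` is produced, so the sketch uses URL percent-encoding purely as a placeholder. `encode_doi`, `paper_paths`, and `extraction_path` are hypothetical helpers.

```python
from urllib.parse import quote


def encode_doi(doi: str) -> str:
    # ASSUMPTION: percent-encoding stands in for Flowa's real DOI encoding,
    # which is not specified in this document.
    return quote(doi, safe="")


def paper_paths(base: str, doi: str) -> dict[str, str]:
    """Paths of the per-paper artifacts under papers/{encoded_doi}/."""
    root = f"{base.rstrip('/')}/papers/{encode_doi(doi)}"
    return {
        "pdf": f"{root}/source.pdf",
        "markdown": f"{root}/markdown.md",
        "metadata": f"{root}/metadata.json",
    }


def extraction_path(base: str, variant_id: str, doi: str) -> str:
    """Path of one per-paper extraction under assessments/{variant_id}/."""
    return f"{base.rstrip('/')}/assessments/{variant_id}/extractions/{encode_doi(doi)}.json"
```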

Development

This repo is a polyglot monorepo: a Python pipeline under src/flowa/, TypeScript packages under packages/, and worked examples under examples/. Each piece has its own dependency closure, and each Python project (the library and examples/demo-gateway/) is an independent uv project. Running pytest from the repo root would walk into the sibling project and fail on its venv-specific imports, so always run pytest from the project that owns the tests and scope it to the local tests/ directory:

# Library tests
uv run pytest tests/

# Demo-gateway tests
cd examples/demo-gateway && uv run pytest tests/

The TypeScript packages and examples share one pnpm workspace, so type checking and tests each run as a single recursive invocation:

pnpm -r typecheck
pnpm -r test

Lint and format checks are unified under pre-commit; CI invokes the same hook so local and CI behaviour match:

uv run pre-commit run --all-files

Releasing

Bump [project].version in pyproject.toml, commit, then push a matching tag:

git tag flowapy-v0.1.0
git push origin flowapy-v0.1.0

The tag-driven workflow (.github/workflows/release-flowapy.yaml) builds the package and publishes to PyPI via OIDC trusted publishing. The pypi GitHub environment requires manual approval before the publish step runs.

Deployment

Local Development

export FLOWA_STORAGE_BASE=file:///tmp/flowa
export FLOWA_CONVERT_MODEL=bedrock:au.anthropic.claude-sonnet-4-6
export FLOWA_EXTRACTION_MODEL=bedrock:au.anthropic.claude-opus-4-6
uv run flowa run --variant-id test --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

Docker

docker build --build-arg LLM_EXTRA=bedrock -t flowa .
docker run \
  -e FLOWA_STORAGE_BASE=s3://bucket \
  -e FLOWA_CONVERT_MODEL=bedrock:au.anthropic.claude-sonnet-4-6 \
  -e FLOWA_EXTRACTION_MODEL=bedrock:au.anthropic.claude-opus-4-6 \
  -e AWS_REGION=ap-southeast-2 \
  flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

AWS Batch

Create a job definition with the flowa container image. A typical run processes up to 50 papers with LLM calls for conversion, extraction, and aggregation, so set a generous timeout and retry count (the example below allows two attempts of up to one hour each).

aws batch register-job-definition \
  --job-definition-name flowa-worker \
  --type container \
  --container-properties '{
    "image": "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/flowa:latest",
    "resourceRequirements": [
      {"type": "VCPU", "value": "2"},
      {"type": "MEMORY", "value": "8192"}
    ],
    "environment": [
      {"name": "FLOWA_STORAGE_BASE", "value": "s3://flowa-data"},
      {"name": "FLOWA_CONVERT_MODEL", "value": "bedrock:au.anthropic.claude-sonnet-4-6"},
      {"name": "FLOWA_EXTRACTION_MODEL", "value": "bedrock:au.anthropic.claude-opus-4-6"}
    ]
  }' \
  --retry-strategy '{"attempts": 2}' \
  --timeout '{"attemptDurationSeconds": 3600}'

Submit a job:

aws batch submit-job \
  --job-name "flowa-VAR123" \
  --job-definition flowa-worker \
  --job-queue flowa-queue \
  --container-overrides '{
    "command": ["run", "--variant-id", "VAR123", "--gene", "GAA", "--hgvs-c", "NM_000152.5:c.2238G>C", "--source", "litvar"]
  }'
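When submitting many variants, hand-writing the `--container-overrides` JSON gets error-prone. The sketch below builds the same payload programmatically using only the standard library. `run_command` and `container_overrides` are hypothetical helpers that mirror the flags shown above.

```python
import json


def run_command(variant_id: str, gene: str, hgvs_c: str, source: str) -> list[str]:
    """Arguments passed to the container, matching the `flowa run` flags above."""
    return [
        "run",
        "--variant-id", variant_id,
        "--gene", gene,
        "--hgvs-c", hgvs_c,
        "--source", source,
    ]


def container_overrides(variant_id: str, gene: str, hgvs_c: str, source: str) -> str:
    """JSON string suitable for `aws batch submit-job --container-overrides`."""
    return json.dumps({"command": run_command(variant_id, gene, hgvs_c, source)})
```

Letting `json.dumps` handle quoting avoids shell and JSON escaping mistakes for HGVS strings that contain characters such as `>`.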

Release history

This version

0.1.0

May 22, 2026

Download files

Source Distribution

flowapy-0.1.0.tar.gz (24.4 MB)

Uploaded May 22, 2026 Source

Built Distribution

flowapy-0.1.0-py3-none-any.whl (55.1 kB)

Uploaded May 22, 2026 Python 3

File details

Details for the file flowapy-0.1.0.tar.gz.

File metadata

Download URL: flowapy-0.1.0.tar.gz
Upload date: May 22, 2026
Size: 24.4 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for flowapy-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b70d13100ba2bf434f555bcdc3b50485ee10e14f307362afb789e644800b7fe8`
MD5	`f3451997d40e5e415fe054dd8098b933`
BLAKE2b-256	`2571dd7b0e04c77e9e442120a8cd3eaf2c124103d14f35655631d6af781b7cf8`

Provenance

The following attestation bundles were made for flowapy-0.1.0.tar.gz:

Publisher: release-flowapy.yaml on populationgenomics/flowa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: flowapy-0.1.0.tar.gz
- Subject digest: b70d13100ba2bf434f555bcdc3b50485ee10e14f307362afb789e644800b7fe8
- Sigstore transparency entry: 1600851091
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: populationgenomics/flowa@7cfa954d4968cf2ee436e621999c9097ad0fac24
- Branch / Tag: refs/tags/flowapy-v0.1.0
- Owner: https://github.com/populationgenomics
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-flowapy.yaml@7cfa954d4968cf2ee436e621999c9097ad0fac24
- Trigger Event: push

File details

Details for the file flowapy-0.1.0-py3-none-any.whl.

File metadata

Download URL: flowapy-0.1.0-py3-none-any.whl
Upload date: May 22, 2026
Size: 55.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for flowapy-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`622fa6094f24c9cde4f324f5e9479e96dcc324b846aef792e57a4ada96c76d57`
MD5	`9a85c1ab3d663e3c2affff59f81be980`
BLAKE2b-256	`83bb70b2a108b74c35da390c0b534c1b1fa2b71998d338e1b9ac40ea2585a038`

Provenance

The following attestation bundles were made for flowapy-0.1.0-py3-none-any.whl:

Publisher: release-flowapy.yaml on populationgenomics/flowa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: flowapy-0.1.0-py3-none-any.whl
- Subject digest: 622fa6094f24c9cde4f324f5e9479e96dcc324b846aef792e57a4ada96c76d57
- Sigstore transparency entry: 1600851561
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: populationgenomics/flowa@7cfa954d4968cf2ee436e621999c9097ad0fac24
- Branch / Tag: refs/tags/flowapy-v0.1.0
- Owner: https://github.com/populationgenomics
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-flowapy.yaml@7cfa954d4968cf2ee436e621999c9097ad0fac24
- Trigger Event: push
