Flowa

Variant literature assessment pipeline with AI extraction.

Flowa's interactive evidence viewer: paper list on the left, aggregated assessment with inline citations in the centre, and the source PDF with bounding-box highlights on the right.

Each citation in the aggregated assessment links back to the exact highlighted quote in the source paper's PDF.

Architecture

Flowa is a single async pipeline that processes genetic variant literature:

query → download → convert → extract → aggregate
  • Query: Search Mastermind/LitVar for papers, resolve PMIDs to DOIs via PubMed
  • Download: Fetch PDFs from PMC (main article + supplements)
  • Convert: PDF → Markdown via anchorite (LLM-based conversion)
  • Extract: Per-paper evidence extraction via LLM
  • Aggregate: Cross-paper synthesis via LLM, resolving citation quotes to PDF bounding boxes via anchorite

Papers are processed in parallel. LLM concurrency is controlled via --llm-concurrency.
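
As a rough illustration of how papers can run in parallel while the number of in-flight LLM calls stays capped, the sketch below uses an asyncio.Semaphore. It is not Flowa's actual implementation: fake_llm_call, process_papers, and the example DOIs are hypothetical stand-ins, and the limit plays the role that --llm-concurrency plays in the CLI.

```python
import asyncio


async def fake_llm_call(doi: str, state: dict, limit: asyncio.Semaphore) -> str:
    """Hypothetical stand-in for one LLM request; records peak concurrency."""
    async with limit:  # at most `llm_concurrency` tasks are inside this block
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulate network latency
        state["active"] -= 1
    return f"extracted:{doi}"


async def process_papers(dois: list[str], llm_concurrency: int) -> tuple[list[str], int]:
    """Start one task per paper, all sharing a single concurrency limit."""
    limit = asyncio.Semaphore(llm_concurrency)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fake_llm_call(d, state, limit) for d in dois))
    return list(results), state["peak"]


# Eight hypothetical papers, at most three LLM calls in flight at once.
results, peak = asyncio.run(
    process_papers([f"10.1000/demo{i}" for i in range(8)], llm_concurrency=3)
)
```

All eight tasks start immediately, but only three at a time get past the semaphore, so raising the limit trades provider rate-limit pressure for throughput.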

Installation

Install from PyPI, opting into the provider extras you need (one or more of anthropic, bedrock, google, openai):

pip install 'flowapy[bedrock]==0.1.0'
# or
uv pip install 'flowapy[bedrock,anthropic]==0.1.0'

The flowa CLI is exposed as a console script. See Configuration for credentials and storage setup.

Usage

# Full pipeline
flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

# Individual steps (for debugging)
flowa query --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
flowa download --doi '10.1038/s41586-020-2308-7'
flowa convert --doi '10.1038/s41586-020-2308-7'
flowa extract --variant-id VAR123 --doi '10.1038/s41586-020-2308-7'
flowa aggregate --variant-id VAR123

Configuration

Environment Variables

| Variable | Description | Example |
|---|---|---|
| FLOWA_STORAGE_BASE | Storage path for PDFs, extractions, results | s3://bucket, gs://bucket, file:///path |
| FLOWA_CONVERT_MODEL | LLM for PDF→Markdown conversion (anchorite) | bedrock:au.anthropic.claude-sonnet-4-6 |
| FLOWA_EXTRACTION_MODEL | LLM for extraction and aggregation | bedrock:au.anthropic.claude-opus-4-6 |

LLM Providers

Models use pydantic-ai format. Examples:

  • AWS Bedrock: bedrock:au.anthropic.claude-sonnet-4-6 (convert), bedrock:au.anthropic.claude-opus-4-6 (extraction)
  • Google Gemini: google-gla:gemini-3-pro
  • OpenAI: openai:gpt-5.2

Provider credentials:

| Provider | Required Variables |
|---|---|
| AWS Bedrock | AWS_PROFILE + AWS_REGION, or AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_REGION |
| Google Gemini | GOOGLE_API_KEY |
| OpenAI | OPENAI_API_KEY |
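
To make the provider:model format and the credential requirements concrete, here is a small pre-flight check one might run before starting a pipeline. It is only a sketch: split_model and credentials_ok are hypothetical helpers, not part of Flowa or pydantic-ai, and the mapping simply restates the provider prefixes and variables listed above.

```python
# Credential options per provider prefix, restating the table above.
# Each inner set is one acceptable combination of environment variables.
PROVIDER_CREDENTIALS = {
    "bedrock": [
        {"AWS_PROFILE", "AWS_REGION"},
        {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"},
    ],
    "google-gla": [{"GOOGLE_API_KEY"}],
    "openai": [{"OPENAI_API_KEY"}],
}


def split_model(model: str) -> tuple[str, str]:
    """Split 'provider:model-name' at the first colon."""
    provider, sep, name = model.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"Expected 'provider:model', got {model!r}")
    return provider, name


def credentials_ok(model: str, env: dict[str, str]) -> bool:
    """Return True if any documented credential combination is fully set."""
    provider, _ = split_model(model)
    if provider not in PROVIDER_CREDENTIALS:
        raise KeyError(f"No credential rule recorded for provider {provider!r}")
    return any(
        all(env.get(var) for var in option)
        for option in PROVIDER_CREDENTIALS[provider]
    )
```

Passing os.environ as env checks the current process; passing a plain dict keeps the check easy to test.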

Storage Backends

| Backend | FLOWA_STORAGE_BASE | Additional Variables |
|---|---|---|
| AWS S3 | s3://bucket-name | AWS credentials (see above) |
| Google Cloud Storage | gs://bucket-name | GOOGLE_APPLICATION_CREDENTIALS or workload identity |
| S3-compatible (MinIO) | s3://bucket-name | FSSPEC_S3_ENDPOINT_URL, FSSPEC_S3_KEY, FSSPEC_S3_SECRET |
| Local filesystem | file:///path | |
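
The backend is chosen by the URL scheme of FLOWA_STORAGE_BASE. The sketch below shows one way to classify a value before a run; describe_backend is a hypothetical helper, and treating a set FSSPEC_S3_ENDPOINT_URL as the marker for an S3-compatible store is an assumption drawn from the variables listed above rather than confirmed Flowa behaviour.

```python
from urllib.parse import urlsplit


def describe_backend(storage_base: str, env: dict[str, str]) -> str:
    """Name the storage backend implied by a FLOWA_STORAGE_BASE value."""
    scheme = urlsplit(storage_base).scheme
    if scheme == "s3":
        # ASSUMPTION: a custom endpoint means an S3-compatible store such as MinIO.
        return "S3-compatible" if env.get("FSSPEC_S3_ENDPOINT_URL") else "AWS S3"
    if scheme == "gs":
        return "Google Cloud Storage"
    if scheme == "file":
        return "Local filesystem"
    raise ValueError(f"Unsupported storage scheme: {scheme!r}")
```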

Prompt Customization

Flowa supports site-specific prompt sets. Each prompt set is a directory under prompts/ containing prompt templates and Pydantic schema modules.

| Variable | Description | Default |
|---|---|---|
| FLOWA_PROMPT_SET | Name of the prompt set directory to use | generic |

Prompt Set Structure

prompts/{prompt_set}/
├── extraction_prompt.txt      # Prompt template for individual paper extraction
├── extraction_schema.py       # Pydantic model defining ExtractionResult
├── aggregation_prompt.txt     # Prompt template for cross-paper aggregation
└── aggregation_schema.py      # Pydantic model defining AggregationResult
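
When authoring a new prompt set, it can help to confirm that the directory contains all four files before running the pipeline. The following is a minimal sketch of such a check; prompt_set_dir and missing_prompt_files are hypothetical helpers, and the location of the prompts/ root is left to the caller because it depends on how Flowa is installed.

```python
import os
from pathlib import Path

# The four files every prompt set is expected to contain (see the tree above).
PROMPT_SET_FILES = (
    "extraction_prompt.txt",
    "extraction_schema.py",
    "aggregation_prompt.txt",
    "aggregation_schema.py",
)


def prompt_set_dir(prompts_root: Path, env=None) -> Path:
    """Resolve the active prompt set directory, defaulting to 'generic'."""
    env = os.environ if env is None else env
    return prompts_root / env.get("FLOWA_PROMPT_SET", "generic")


def missing_prompt_files(directory: Path) -> list[str]:
    """List the expected files that are absent from a prompt set directory."""
    return [name for name in PROMPT_SET_FILES if not (directory / name).is_file()]
```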

Interface Requirements

Schema modules must define Pydantic models with specific fields that Flowa's validation logic depends on:

extraction_schema.py must define ExtractionResult with:

  • evidence[].citations[].quote (str) — verbatim quote from the paper

aggregation_schema.py must define AggregationResult with:

  • results[].citations[].paper_id (str) — paper identifier
  • results[].citations[].quote (str) — verbatim quote resolved to PDF bounding boxes

All other fields can be customized freely. See prompts/generic/ for the default implementation.
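
Flowa itself relies on Pydantic models for these schemas. As a dependency-free way to see what the required paths mean, the sketch below walks a JSON-like payload and reports where a path such as evidence[].citations[].quote is missing or is not a string. missing_fields is a hypothetical helper written for illustration, not Flowa's validation logic.

```python
def missing_fields(payload: dict, path: str) -> list[str]:
    """Return locations where `path` is absent or does not hold a str.

    `path` uses the notation from this section: dot-separated keys, with
    a trailing [] on keys whose value is a list of objects.
    """
    problems: list[str] = []

    def walk(node, parts, where):
        if not parts:
            if not isinstance(node, str):
                problems.append(where)
            return
        head, rest = parts[0], parts[1:]
        is_list = head.endswith("[]")
        key = head[:-2] if is_list else head
        here = f"{where}.{key}" if where else key
        if not isinstance(node, dict) or key not in node:
            problems.append(here)
            return
        child = node[key]
        if is_list:
            if not isinstance(child, list):
                problems.append(here)
                return
            for i, item in enumerate(child):
                walk(item, rest, f"{here}[{i}]")
        else:
            walk(child, rest, here)

    walk(payload, path.split("."), "")
    return problems


# Hypothetical extraction payload: the second citation lacks a quote.
example = {"evidence": [{"citations": [{"quote": "abc"}, {"note": "x"}]}]}
report = missing_fields(example, "evidence[].citations[].quote")
```

Here report names the second citation as the one lacking a quote, which is the kind of gap the required fields are meant to rule out.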

Citation Format

The pipeline uses a unified citation format:

[display text](#cite:paperId "verbatim quote to highlight")
  • paperId = AuthorYear label (e.g., Smith2024) from paper_id_mapping
  • The title attribute carries a verbatim quote that scopes the PDF highlight
  • Display text is free-form

During aggregation, quotes are resolved against each paper's source PDF (via anchorite.PdfIndex) to produce bounding box coordinates. The aggregate output contains pre-resolved bboxes arrays for each citation. Quotes that cannot be resolved get empty bboxes.
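
For consumers of the aggregated text, the citation links can be pulled out with a regular expression. This is a minimal sketch that assumes the exact link shape shown above; CITATION_RE, Citation, and find_citations are hypothetical names, and quotes that themselves contain a double-quote character are not handled.

```python
import re
from typing import NamedTuple

# Matches [display text](#cite:paperId "verbatim quote")
CITATION_RE = re.compile(
    r'\[(?P<text>[^\]]+)\]'          # display text in square brackets
    r'\(#cite:(?P<paper_id>[^\s")]+)'  # paper id after the #cite: prefix
    r'\s+"(?P<quote>[^"]*)"\)'       # quoted title attribute
)


class Citation(NamedTuple):
    text: str
    paper_id: str
    quote: str


def find_citations(markdown: str) -> list[Citation]:
    """Extract every citation link from a block of aggregated text."""
    return [
        Citation(m["text"], m["paper_id"], m["quote"])
        for m in CITATION_RE.finditer(markdown)
    ]


# Hypothetical sentence using the citation format.
sample = 'Activity was reduced ([Smith 2024](#cite:Smith2024 "activity was below 5% of normal")).'
citations = find_citations(sample)
```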

Storage Layout

papers/{encoded_doi}/
  source.pdf              # Downloaded PDF
  markdown.md             # LLM-generated Markdown
  metadata.json           # PubMed metadata (title, authors, date, etc.)

assessments/{variant_id}/
  workflow.json            # Pipeline run metadata
  variant_details.json     # VariantValidator output
  query.json               # Query results (DOI list)
  aggregation.json         # Aggregated assessment with pre-resolved bboxes
  aggregation_raw.json     # Raw LLM conversation
  extractions/
    {encoded_doi}.json     # Per-paper extraction (quotes + commentary)
    {encoded_doi}_raw.json # Raw LLM conversation
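
The layout above can be expressed as a pair of path builders. This is a sketch under a loud assumption: the encoding behind {encoded_doi} is not described here, so encode_doi uses percent-encoding purely as a placeholder, and the real keys written by Flowa may look different. All function names are hypothetical.

```python
from urllib.parse import quote


def encode_doi(doi: str) -> str:
    # ASSUMPTION: percent-encoding stands in for Flowa's undocumented scheme.
    return quote(doi, safe="")


def paper_paths(base: str, doi: str) -> dict[str, str]:
    """Build the per-paper object paths under papers/{encoded_doi}/."""
    root = f"{base.rstrip('/')}/papers/{encode_doi(doi)}"
    return {
        "pdf": f"{root}/source.pdf",
        "markdown": f"{root}/markdown.md",
        "metadata": f"{root}/metadata.json",
    }


def extraction_path(base: str, variant_id: str, doi: str) -> str:
    """Build the per-paper extraction path for one variant assessment."""
    return (
        f"{base.rstrip('/')}/assessments/{variant_id}"
        f"/extractions/{encode_doi(doi)}.json"
    )
```

Papers are keyed by DOI alone while extractions are keyed by variant and DOI, which suggests a downloaded and converted paper can be shared across assessments of different variants.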

Development

This repo is a polyglot monorepo: a Python pipeline under src/flowa/, TypeScript packages under packages/, and worked examples under examples/. Each piece has its own dependency closure, and each Python project (the library and examples/demo-gateway/) is an independent uv project. Running pytest from the repo root would walk into the sibling project and fail on its venv-specific imports, so always run pytest from the project that owns the tests and scope it to the local tests/ directory:

# Library tests
uv run pytest tests/

# Demo-gateway tests
cd examples/demo-gateway && uv run pytest tests/

The TypeScript packages and examples share one pnpm workspace, so type checking and tests each run as a single recursive invocation:

pnpm -r typecheck
pnpm -r test

Lint and format checks are unified under pre-commit; CI invokes the same hook so local and CI behaviour match:

uv run pre-commit run --all-files

Releasing

Bump [project].version in pyproject.toml, commit, then push a matching tag:

git tag flowapy-v0.1.0
git push origin flowapy-v0.1.0

The tag-driven workflow (.github/workflows/release-flowapy.yaml) builds the package and publishes to PyPI via OIDC trusted publishing. The pypi GitHub environment requires manual approval before the publish step runs.

Deployment

Local Development

export FLOWA_STORAGE_BASE=file:///tmp/flowa
export FLOWA_CONVERT_MODEL=bedrock:au.anthropic.claude-sonnet-4-6
export FLOWA_EXTRACTION_MODEL=bedrock:au.anthropic.claude-opus-4-6
uv run flowa run --variant-id test --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

Docker

docker build --build-arg LLM_EXTRA=bedrock -t flowa .
docker run \
  -e FLOWA_STORAGE_BASE=s3://bucket \
  -e FLOWA_CONVERT_MODEL=bedrock:au.anthropic.claude-sonnet-4-6 \
  -e FLOWA_EXTRACTION_MODEL=bedrock:au.anthropic.claude-opus-4-6 \
  -e AWS_REGION=ap-southeast-2 \
  flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

AWS Batch

Create a job definition with the flowa container image. A typical run processes up to 50 papers, with LLM calls for conversion, extraction, and aggregation, so allow sufficient time and retries. The definition below permits 2 attempts with a one-hour timeout per attempt.

aws batch register-job-definition \
  --job-definition-name flowa-worker \
  --type container \
  --container-properties '{
    "image": "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/flowa:latest",
    "resourceRequirements": [
      {"type": "VCPU", "value": "2"},
      {"type": "MEMORY", "value": "8192"}
    ],
    "environment": [
      {"name": "FLOWA_STORAGE_BASE", "value": "s3://flowa-data"},
      {"name": "FLOWA_CONVERT_MODEL", "value": "bedrock:au.anthropic.claude-sonnet-4-6"},
      {"name": "FLOWA_EXTRACTION_MODEL", "value": "bedrock:au.anthropic.claude-opus-4-6"}
    ]
  }' \
  --retry-strategy '{"attempts": 2}' \
  --timeout '{"attemptDurationSeconds": 3600}'

Submit a job:

aws batch submit-job \
  --job-name "flowa-VAR123" \
  --job-definition flowa-worker \
  --job-queue flowa-queue \
  --container-overrides '{
    "command": ["run", "--variant-id", "VAR123", "--gene", "GAA", "--hgvs-c", "NM_000152.5:c.2238G>C", "--source", "litvar"]
  }'
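
When jobs are submitted from a script rather than by hand, building the overrides JSON programmatically avoids shell-quoting mistakes around values such as the HGVS string. The sketch below only constructs the JSON for --container-overrides; flowa_run_command and container_overrides are hypothetical helpers, and the argument list mirrors the submit-job example above.

```python
import json


def flowa_run_command(variant_id: str, gene: str, hgvs_c: str, source: str) -> list[str]:
    """Arguments passed to the container, matching the `flowa run` flags."""
    return [
        "run",
        "--variant-id", variant_id,
        "--gene", gene,
        "--hgvs-c", hgvs_c,
        "--source", source,
    ]


def container_overrides(variant_id: str, gene: str, hgvs_c: str, source: str) -> str:
    """Serialize the overrides object expected by `aws batch submit-job`."""
    return json.dumps({"command": flowa_run_command(variant_id, gene, hgvs_c, source)})


overrides = container_overrides("VAR123", "GAA", "NM_000152.5:c.2238G>C", "litvar")
```

The resulting string can be passed as the value of --container-overrides, or the same dictionary can be handed to an AWS SDK call that accepts container overrides.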
