Flowa
Variant literature assessment pipeline with AI extraction.
Each citation in the aggregated assessment links back to the exact highlighted quote in the source paper's PDF.
Architecture
Flowa is a single async pipeline that processes genetic variant literature:
query → download → convert → extract → aggregate
- Query: Search Mastermind/LitVar for papers, resolve PMIDs to DOIs via PubMed
- Download: Fetch PDFs from PMC (main article + supplements)
- Convert: PDF → Markdown via anchorite (LLM-based conversion)
- Extract: Per-paper evidence extraction via LLM
- Aggregate: Cross-paper synthesis via LLM, resolving citation quotes to PDF bounding boxes via anchorite
Papers are processed in parallel. LLM concurrency is controlled via --llm-concurrency.
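The concurrency model described above can be pictured with a minimal asyncio sketch. This is not Flowa's actual code: `fake_llm_call`, `process_paper`, and `run_pipeline` are hypothetical names, and the semaphore is only an assumption about how a limit such as --llm-concurrency might be enforced.

```python
import asyncio


async def fake_llm_call(doi: str) -> str:
    """Stand-in for an LLM request (hypothetical; performs no network access)."""
    await asyncio.sleep(0.01)
    return f"extracted:{doi}"


async def process_paper(doi: str, llm_slots: asyncio.Semaphore) -> str:
    """Process one paper, holding an LLM slot only while the LLM call runs."""
    async with llm_slots:
        return await fake_llm_call(doi)


async def run_pipeline(dois: list[str], llm_concurrency: int) -> list[str]:
    """Start every paper at once, with at most llm_concurrency LLM calls in flight."""
    llm_slots = asyncio.Semaphore(llm_concurrency)
    tasks = [process_paper(doi, llm_slots) for doi in dois]
    # gather returns results in input order, regardless of completion order
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["10.1000/a", "10.1000/b", "10.1000/c"], llm_concurrency=2)))
```

All papers are scheduled immediately, but only the LLM step is throttled, so cheap steps such as downloads would not be held back by the limit.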
Installation
Install from PyPI, opting into the provider extras you need (one or more of anthropic, bedrock, google, openai):
pip install 'flowapy[bedrock]==0.1.0'
# or
uv pip install 'flowapy[bedrock,anthropic]==0.1.0'
The flowa CLI is exposed as a console script. See Configuration for credentials and storage setup.
Usage
# Full pipeline
flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
# Individual steps (for debugging)
flowa query --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
flowa download --doi '10.1038/s41586-020-2308-7'
flowa convert --doi '10.1038/s41586-020-2308-7'
flowa extract --variant-id VAR123 --doi '10.1038/s41586-020-2308-7'
flowa aggregate --variant-id VAR123
Configuration
Environment Variables
| Variable | Description | Example |
|---|---|---|
| FLOWA_STORAGE_BASE | Storage path for PDFs, extractions, results | s3://bucket, gs://bucket, file:///path |
| FLOWA_CONVERT_MODEL | LLM for PDF→Markdown conversion (anchorite) | bedrock:au.anthropic.claude-sonnet-4-6 |
| FLOWA_EXTRACTION_MODEL | LLM for extraction and aggregation | bedrock:au.anthropic.claude-opus-4-6 |
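A small pre-flight check can catch a missing variable before a long run starts. The sketch below is illustrative only: `load_settings` is a hypothetical helper, and it assumes all three variables must be set, which the table does not state explicitly.

```python
import os
from typing import Mapping

# Variable names taken from the table above
REQUIRED_VARS = ("FLOWA_STORAGE_BASE", "FLOWA_CONVERT_MODEL", "FLOWA_EXTRACTION_MODEL")


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the three documented variables, raising if any is unset or empty.

    ASSUMPTION: all three are mandatory; Flowa itself may apply defaults.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing the environment as a mapping keeps the check easy to exercise without touching the real process environment.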
LLM Providers
Models use pydantic-ai format. Examples:
- AWS Bedrock: bedrock:au.anthropic.claude-sonnet-4-6 (convert), bedrock:au.anthropic.claude-opus-4-6 (extraction)
- Google Gemini: google-gla:gemini-3-pro
- OpenAI: openai:gpt-5.2
Provider credentials:
| Provider | Required Variables |
|---|---|
| AWS Bedrock | AWS_PROFILE + AWS_REGION, or AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_REGION |
| Google Gemini | GOOGLE_API_KEY |
| OpenAI | OPENAI_API_KEY |
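To make the relationship between a model string and its credentials concrete, here is a rough sketch that splits the provider prefix from a model string and checks the credentials table above. `split_model_id` and `has_credentials` are hypothetical helpers; pydantic-ai performs its own parsing, and the prefix-to-provider mapping here is inferred only from the examples in this section.

```python
from typing import Mapping

# Alternative credential sets per provider prefix, copied from the table above.
# Providers not listed in that table (e.g. anthropic) are deliberately absent.
CREDENTIAL_SETS = {
    "bedrock": [
        {"AWS_PROFILE", "AWS_REGION"},
        {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"},
    ],
    "google-gla": [{"GOOGLE_API_KEY"}],
    "openai": [{"OPENAI_API_KEY"}],
}


def split_model_id(model_id: str) -> tuple[str, str]:
    """Split 'provider:model-name' at the first colon."""
    provider, sep, name = model_id.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"Expected 'provider:model', got {model_id!r}")
    return provider, name


def has_credentials(model_id: str, env: Mapping[str, str]) -> bool:
    """True if any documented credential set for the provider is fully present."""
    provider, _ = split_model_id(model_id)
    present = {name for name, value in env.items() if value}
    return any(required <= present for required in CREDENTIAL_SETS.get(provider, []))
```

An unknown prefix returns False rather than raising, so the check stays conservative for providers this section does not document.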
Storage Backends
| Backend | FLOWA_STORAGE_BASE | Additional Variables |
|---|---|---|
| AWS S3 | s3://bucket-name | AWS credentials (see above) |
| Google Cloud Storage | gs://bucket-name | GOOGLE_APPLICATION_CREDENTIALS or workload identity |
| S3-compatible (MinIO) | s3://bucket-name | FSSPEC_S3_ENDPOINT_URL, FSSPEC_S3_KEY, FSSPEC_S3_SECRET |
| Local filesystem | file:///path | None |
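The backend is selected by the URL scheme of FLOWA_STORAGE_BASE. As an illustration (not Flowa's implementation), the following sketch classifies a storage base using only the standard library; `describe_storage_base` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Schemes listed in the table above
BACKENDS = {
    "s3": "AWS S3 or S3-compatible",
    "gs": "Google Cloud Storage",
    "file": "Local filesystem",
}


def describe_storage_base(base: str) -> tuple[str, str]:
    """Return (backend description, bucket name or local path) for a storage base URL."""
    parsed = urlparse(base)
    if parsed.scheme not in BACKENDS:
        raise ValueError(f"Unsupported storage scheme in {base!r}")
    # file:///path keeps its location in .path; bucket URLs keep it in .netloc
    location = parsed.path if parsed.scheme == "file" else parsed.netloc
    return BACKENDS[parsed.scheme], location
```

Note that S3 and MinIO share the s3:// scheme, so the scheme alone cannot tell them apart; per the table, the FSSPEC_S3_* variables make that distinction.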
Prompt Customization
Flowa supports site-specific prompt sets. Each prompt set is a directory under prompts/ containing prompt templates and Pydantic schema modules.
| Variable | Description | Default |
|---|---|---|
| FLOWA_PROMPT_SET | Name of the prompt set directory to use | generic |
Prompt Set Structure
prompts/{prompt_set}/
├── extraction_prompt.txt # Prompt template for individual paper extraction
├── extraction_schema.py # Pydantic model defining ExtractionResult
├── aggregation_prompt.txt # Prompt template for cross-paper aggregation
└── aggregation_schema.py # Pydantic model defining AggregationResult
Interface Requirements
Schema modules must define Pydantic models with specific fields that Flowa's validation logic depends on:
extraction_schema.py must define ExtractionResult with:
- evidence[].citations[].quote (str): verbatim quote from the paper

aggregation_schema.py must define AggregationResult with:
- results[].citations[].paper_id (str): paper identifier
- results[].citations[].quote (str): verbatim quote resolved to PDF bounding boxes
All other fields can be customized freely. See prompts/generic/ for the default implementation.
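When writing a custom prompt set, it can help to check sample output against the required field paths. The sketch below works on plain dictionaries (for example, the JSON dump of a model) rather than on Pydantic classes; `missing_quote_paths` is a hypothetical helper, and treating an empty string as missing is an assumption, not documented Flowa behaviour.

```python
def missing_quote_paths(payload: dict, list_key: str) -> list[str]:
    """Return the path of every citation that lacks a non-empty string 'quote'.

    list_key is 'evidence' for extraction output and 'results' for aggregation output.
    """
    problems = []
    for i, item in enumerate(payload.get(list_key, [])):
        for j, citation in enumerate(item.get("citations", [])):
            quote = citation.get("quote")
            # ASSUMPTION: a usable quote is a non-empty string
            if not isinstance(quote, str) or not quote:
                problems.append(f"{list_key}[{i}].citations[{j}].quote")
    return problems
```

For example, `missing_quote_paths(sample, "evidence")` returns an empty list when every citation in a sample extraction carries a quote.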
Citation Format
The pipeline uses a unified citation format:
[display text](#cite:paperId "verbatim quote to highlight")
- paperId = AuthorYear label (e.g., Smith2024) from paper_id_mapping
- The title attribute carries a verbatim quote that scopes the PDF highlight
- Display text is free-form
During aggregation, quotes are resolved against each paper's source PDF (via anchorite.PdfIndex) to produce bounding box coordinates. The aggregate output contains pre-resolved bboxes arrays for each citation. Quotes that cannot be resolved get empty bboxes.
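Downstream consumers may want to pull citations out of the aggregated text. A minimal sketch of parsing the citation format with a regular expression follows; `CITATION_RE` and `parse_citations` are hypothetical, the sample sentence is made up, and quotes containing a double-quote character are not handled.

```python
import re

# [display text](#cite:paperId "verbatim quote")
CITATION_RE = re.compile(
    r'\[(?P<text>[^\]]*)\]'           # display text
    r'\(#cite:(?P<paper_id>[^\s")]+)'  # paper identifier after the #cite: prefix
    r'\s+"(?P<quote>[^"]*)"\)'        # quoted title attribute
)


def parse_citations(markdown: str) -> list[dict[str, str]]:
    """Return one dict (text, paper_id, quote) per citation link found."""
    return [match.groupdict() for match in CITATION_RE.finditer(markdown)]


if __name__ == "__main__":
    sample = 'Activity was reduced [in vitro](#cite:Smith2024 "a made-up quote") in one study.'
    print(parse_citations(sample))
```

Resolving each quote to bounding boxes is a separate step handled by anchorite and is not attempted here.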
Storage Layout
papers/{encoded_doi}/
source.pdf # Downloaded PDF
markdown.md # LLM-generated Markdown
metadata.json # PubMed metadata (title, authors, date, etc.)
assessments/{variant_id}/
workflow.json # Pipeline run metadata
variant_details.json # VariantValidator output
query.json # Query results (DOI list)
aggregation.json # Aggregated assessment with pre-resolved bboxes
aggregation_raw.json # Raw LLM conversation
extractions/
{encoded_doi}.json # Per-paper extraction (quotes + commentary)
{encoded_doi}_raw.json # Raw LLM conversation
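The layout above can be expressed as a pair of path builders. This is a sketch under a loud assumption: the source does not say how {encoded_doi} is derived, so `encode_doi` below uses URL percent-encoding purely for illustration, and all function names are hypothetical.

```python
from urllib.parse import quote


def encode_doi(doi: str) -> str:
    """ASSUMPTION: percent-encode the DOI so '/' cannot split the path.

    Flowa's real encoding scheme is not documented here and may differ.
    """
    return quote(doi, safe="")


def paper_path(doi: str, filename: str) -> str:
    """Path of a per-paper artifact such as source.pdf, markdown.md, or metadata.json."""
    return f"papers/{encode_doi(doi)}/{filename}"


def extraction_path(variant_id: str, doi: str, raw: bool = False) -> str:
    """Path of a per-paper extraction, or of its raw LLM conversation when raw=True."""
    suffix = "_raw" if raw else ""
    return f"assessments/{variant_id}/extractions/{encode_doi(doi)}{suffix}.json"
```

These paths are relative to FLOWA_STORAGE_BASE; papers are keyed by DOI alone, while extractions are nested under the variant they were run for.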
Development
This repo is a polyglot monorepo: a Python pipeline under src/flowa/,
TypeScript packages under packages/, and worked examples under
examples/. Each piece has its own dependency closure, and each Python
project (the library and examples/demo-gateway/) is an independent
uv project. Running pytest from the repo root would walk into the
sibling project and fail on its venv-specific imports — always run
pytest from the project that owns the tests, scoping it to the local
tests/ directory:
# Library tests
uv run pytest tests/
# Demo-gateway tests
cd examples/demo-gateway && uv run pytest tests/
The TypeScript packages and examples share one pnpm workspace, so type checking and tests each run as a single recursive invocation:
pnpm -r typecheck
pnpm -r test
Lint and format checks are unified under pre-commit; CI invokes the same hook so local and CI behaviour match:
uv run pre-commit run --all-files
Releasing
Bump [project].version in pyproject.toml, commit, then push a matching tag:
git tag flowapy-v0.1.0
git push origin flowapy-v0.1.0
The tag-driven workflow (.github/workflows/release-flowapy.yaml) builds the package and publishes to PyPI via OIDC trusted publishing. The pypi GitHub environment requires manual approval before the publish step runs.
Deployment
Local Development
export FLOWA_STORAGE_BASE=file:///tmp/flowa
export FLOWA_CONVERT_MODEL=bedrock:au.anthropic.claude-sonnet-4-6
export FLOWA_EXTRACTION_MODEL=bedrock:au.anthropic.claude-opus-4-6
uv run flowa run --variant-id test --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
Docker
docker build --build-arg LLM_EXTRA=bedrock -t flowa .
docker run \
-e FLOWA_STORAGE_BASE=s3://bucket \
-e FLOWA_CONVERT_MODEL=bedrock:au.anthropic.claude-sonnet-4-6 \
-e FLOWA_EXTRACTION_MODEL=bedrock:au.anthropic.claude-opus-4-6 \
-e AWS_REGION=ap-southeast-2 \
flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
AWS Batch
Create a job definition with the flowa container image. A typical run processes up to 50 papers with LLM calls for conversion, extraction, and aggregation, so allow sufficient time and retries; the example below sets a one-hour timeout per attempt and two attempts.
aws batch register-job-definition \
--job-definition-name flowa-worker \
--type container \
--container-properties '{
"image": "123456789.dkr.ecr.ap-southeast-2.amazonaws.com/flowa:latest",
"resourceRequirements": [
{"type": "VCPU", "value": "2"},
{"type": "MEMORY", "value": "8192"}
],
"environment": [
{"name": "FLOWA_STORAGE_BASE", "value": "s3://flowa-data"},
{"name": "FLOWA_CONVERT_MODEL", "value": "bedrock:au.anthropic.claude-sonnet-4-6"},
{"name": "FLOWA_EXTRACTION_MODEL", "value": "bedrock:au.anthropic.claude-opus-4-6"}
]
}' \
--retry-strategy '{"attempts": 2}' \
--timeout '{"attemptDurationSeconds": 3600}'
Submit a job:
aws batch submit-job \
--job-name "flowa-VAR123" \
--job-definition flowa-worker \
--job-queue flowa-queue \
--container-overrides '{
"command": ["run", "--variant-id", "VAR123", "--gene", "GAA", "--hgvs-c", "NM_000152.5:c.2238G>C", "--source", "litvar"]
}'
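When submitting many variants, generating the --container-overrides JSON programmatically avoids shell-quoting mistakes around characters such as ">" in HGVS strings. The helper below is a hypothetical sketch that only builds the JSON string; it does not call AWS.

```python
import json


def batch_overrides(variant_id: str, gene: str, hgvs_c: str, source: str = "litvar") -> str:
    """Build the JSON for --container-overrides, mirroring the submit-job example above."""
    command = [
        "run",
        "--variant-id", variant_id,
        "--gene", gene,
        "--hgvs-c", hgvs_c,
        "--source", source,
    ]
    return json.dumps({"command": command})


if __name__ == "__main__":
    print(batch_overrides("VAR123", "GAA", "NM_000152.5:c.2238G>C"))
```

The resulting string can be passed as the value of --container-overrides in the submit-job call shown above.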